Balena Device's Supervisor Dies but Device Is Still on the Network

We are seeing some odd behavior from the Balena supervisor on an RPi CM3 system. The system keeps dropping offline; it comes back after a reboot, but eventually drops offline again. This happens even if there is no running container on the system (i.e. the application has been stopped through the dashboard). The device can still be pinged using its external IP address, but it seems like the supervisor has died. Is there any circumstance in which this would happen?

Thanks,

Kevin

That is indeed strange!

Can you help us debug this? It would help if you could:

  • share the device diagnostics
  • get the supervisor logs from the device using journalctl
  • if the device is rebooting - for example if your local network ssh connection keeps disconnecting - can you enable persistent logging and share the boot logs using journalctl again?
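
As a rough sketch of what those log requests look like on the host OS: the snippet below assumes the supervisor runs as the systemd unit `balena-supervisor` (older balenaOS releases name it `resin-supervisor`), and `collect_logs` is just a hypothetical helper name, not a balena tool. The previous-boot log (`-b -1`) is only available when persistent logging is enabled.

```shell
#!/bin/sh
# Sketch only: collect supervisor and boot logs from the balenaOS host.
# Assumption: the supervisor's systemd unit is "balena-supervisor"
# (older releases call it "resin-supervisor"); override via SUPERVISOR_UNIT.
SUPERVISOR_UNIT="${SUPERVISOR_UNIT:-balena-supervisor}"

# collect_logs is a hypothetical helper name, not part of balenaOS.
collect_logs() {
    out_dir="$1"
    mkdir -p "$out_dir"
    # Boots known to the journal; more than one entry needs persistent logging.
    journalctl --list-boots --no-pager > "$out_dir/boots.txt"
    # Supervisor log for the current boot.
    journalctl -u "$SUPERVISOR_UNIT" -b --no-pager > "$out_dir/supervisor.log"
    # Full journal of the previous boot, i.e. the one that dropped offline.
    journalctl -b -1 --no-pager > "$out_dir/previous-boot.log"
}
```

Running `collect_logs /tmp/device-logs` from a host OS shell would leave three files that can be attached to the thread.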

These things will help us figure out what’s going on with the device.

Thanks. This device is currently offline; I will get it restarted and share the information you have requested. Persistent logging is enabled.

Kevin

OK, I finally got some log information about this problem. It looks like NetworkManager is getting into a state where it can’t connect. This is from journalctl -u NetworkManager. It hangs without connecting (for hours), but works again after a reboot. The wifi environment is fine; many other units are connected to it at the same time. Is this some kind of race condition?
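
For anyone following along, a snapshot of NetworkManager's state while the device is stuck could be captured roughly like this. It is a sketch that assumes `nmcli` is available on the host OS; `network_snapshot` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: print NetworkManager's log and current state in one go.
# network_snapshot is a hypothetical helper name, not part of balenaOS.
network_snapshot() {
    echo "== NetworkManager log, current boot =="
    journalctl -u NetworkManager -b --no-pager
    echo "== overall NetworkManager state =="
    nmcli general status
    echo "== per-device state =="
    nmcli device status
}
```

Calling `network_snapshot > /tmp/nm-snapshot.txt` while the connection is hung, and again after a reboot, would give two outputs to compare.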

Hey Kevin. That’s interesting. I think we need to look at the device diagnostics to get a better picture. Can you post the health check info on the device, and then run diagnostics and post the results? (I think we need to consider the possibility of a hardware issue.)

Hi Kevin, just checking in to see if you have managed to run those device diagnostics for us, so we can get a better look at what’s going on here.