Supervisor restarting over and over

Hi all,
one of my Raspberry Pi based devices seems to be rebooting over and over, as shown in the logs:

14.10.19 00:49:58 (+0200) Supervisor starting
14.10.19 00:50:23 (+0200) Killing service 'main sha256:c0806c7044c132051dc4ecd8b9266c9b38c3d0fe7a6bb900c379f43ed2ebb554'
14.10.19 16:56:10 (+0200) Killed service 'main sha256:c0806c7044c132051dc4ecd8b9266c9b38c3d0fe7a6bb900c379f43ed2ebb554'
14.10.19 16:56:10 (+0200) Rebooting
14.10.19 16:56:10 (+0200) Service exited 'main sha256:c0806c7044c132051dc4ecd8b9266c9b38c3d0fe7a6bb900c379f43ed2ebb554'

The device is reported as offline in the dashboard, but it actually goes online for a few seconds. Maybe too few seconds to open a web terminal onto the Host OS:

Connecting to 148f24edb9e59025e95fcc666a937db1...

Host OS is balenaOS 2.32.0+rev1, Supervisor is 9.14.0. The board is a Raspberry Pi 3 B and it connects to the Internet via an MS2131 USB dongle.

Is there some voodoo I can practice to recover my device?

Regards,
Danilo

The device is connecting every minute, but then it reboots.
I think that balenaOS is booting and the ModemManager on it is able to connect to the Internet via the MS2131 3G dongle.

Does this also mean that the Supervisor is getting started?
Do I have any chance to ask the device to stay up with the Supervisor only?

I’ve also tried to move the device to another application, but the result is the same. The attempt was made in order to try to replace the “broken” application container with another one.

Hi @daghemo, is the device running a .prod or a .dev image, and are you on the same network as the device? If you are, you can probably get into the device via SSH before it reboots again and quickly stop the balena-engine.

The supervisor is definitely getting started, because it is the one sending the logs to the dashboard on every boot. So is the current assumption that the application container is broken and is the one rebooting the device constantly?

Hi @shaunmulligan,
the remote device uses a Production image and it connects to the Internet via a MS2131 3G dongle and I have no local access to it via LAN or WiFi.

The application container used to work until yesterday. Don’t know what’s happened. Maybe the filesystem on the microSD got damaged.

Is there something I can set in the dashboard to ask the device not to reboot (or to reboot only after N minutes) as long as the application container is broken? I would then be able to access the host OS via the VPN, am I right?

@daghemo unfortunately there isn’t currently a way to ask the device to start up without running any of the containers. The containers are started by the container engine, which also starts the supervisor (itself a container). This means your containers will always start side by side with the supervisor and do whatever they do before the supervisor has a chance to effect any changes. We might be able to implement a feature in the future to offer this kind of functionality, but it needs some designing.

It also depends on what is causing the device to reboot. If it’s rebooting because of a corrupt SD card or filesystem, then it’s difficult to know if anything can help. The only thing I can think of is to write a script that constantly tries to SSH into the device (using the balena CLI) and then executes systemctl stop balena to stop the engine from starting anything. My guess is that this will be difficult to catch, because the modem takes longer to connect than the containers take to start up :(
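A minimal sketch of such a retry script, in Python. It assumes the balena CLI is installed and logged in on your workstation, and that your CLI version accepts a command piped into balena ssh <uuid> on standard input (recent versions document this; older ones may not). The function names, the timeout, and the retry counts are hypothetical choices for illustration, and the UUID is the one from the first post.

```python
import subprocess
import time

# UUID taken from the first post in this thread; replace with your own device.
DEVICE_UUID = "148f24edb9e59025e95fcc666a937db1"

# Command sent to the host OS shell. Assumption: the CLI forwards stdin
# to the remote shell, so "exit" is needed to close the session afterwards.
REMOTE_COMMAND = "systemctl stop balena; exit;\n"


def try_stop_engine(uuid, timeout_s=20):
    """Make one attempt to stop the engine; return True if the CLI exits with 0."""
    try:
        result = subprocess.run(
            ["balena", "ssh", uuid],
            input=REMOTE_COMMAND,
            text=True,
            capture_output=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Device unreachable, CLI hung, or CLI not installed: count as a failure.
        return False
    return result.returncode == 0


def retry_until_success(attempt, max_attempts=600, delay_s=1.0, sleep=time.sleep):
    """Call attempt() until it returns True.

    Returns the 1-based number of the successful attempt, or None if
    every attempt failed.
    """
    for n in range(1, max_attempts + 1):
        if attempt():
            return n
        sleep(delay_s)
    return None


# Example invocation (commented out so that loading this sketch has no side effects):
# attempts = retry_until_success(lambda: try_stop_engine(DEVICE_UUID))
# print("engine stopped" if attempts else "gave up")
```

A zero exit code only means the SSH session ran and closed cleanly, so it is worth confirming on the device that the engine really stopped. The short delay between attempts is deliberate, since the window between the modem connecting and the next reboot is only a few seconds.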

@shaunmulligan I will try with the balena CLI and systemctl stop balena.

I think that even just allowing a timeout to be set before the device reboots would be great.
This could also let the Supervisor not only send logs to balenaCloud, but also ask it what to do, e.g. restart the application, move to a different application (i.e. replace the faulty container with another one), purge /data, or whatever else.

Is there any way to get a more verbose logging?

@daghemo yes, I think a feature like that would be helpful, and I’ve raised it internally for discussion. However, in many cases a reboot delay won’t help, especially if something is causing processes or the kernel to panic (which often happens with a corrupt filesystem), because in that case the reboot is not really under the control of the supervisor, so it can’t delay it. The delay will only help when a user’s code that uses the reboot API has gone rogue, but it’s still useful.

Unfortunately, more verbose logging from the supervisor is currently only accessible if you can get into the host OS and run journalctl -u resin-supervisor --no-pager. We don’t yet log the whole journal remotely, because many of our users are on cellular connections and are very sensitive to data usage, so we need to implement it in an opt-in way.
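If you do manage to get a session onto the host OS, a sketch like the following could pull just the tail of the supervisor journal to your workstation. It assumes the balena CLI accepts a piped command on balena ssh <uuid> (this depends on the CLI version); the function names and the 200-line default are hypothetical. The -n option of journalctl caps the number of lines returned, which keeps the transfer small on a cellular link.

```python
import shlex
import subprocess


def build_journal_command(unit="resin-supervisor", max_lines=200):
    """Build the journalctl command to run on the host OS.

    -n limits the output to the most recent max_lines entries.
    """
    if max_lines <= 0:
        raise ValueError("max_lines must be positive")
    return "journalctl -u {} --no-pager -n {}".format(shlex.quote(unit), int(max_lines))


def fetch_supervisor_journal(uuid, max_lines=200, timeout_s=60):
    """Run the journalctl command through `balena ssh` and return its output.

    Assumption: the CLI forwards stdin to the host OS shell, so the
    trailing "exit" closes the session once journalctl has finished.
    """
    remote = build_journal_command(max_lines=max_lines) + "; exit;\n"
    result = subprocess.run(
        ["balena", "ssh", uuid],
        input=remote,
        text=True,
        capture_output=True,
        timeout=timeout_s,
    )
    result.check_returncode()  # raises CalledProcessError on a non-zero exit
    return result.stdout
```

For example, fetch_supervisor_journal("148f24edb9e59025e95fcc666a937db1", max_lines=100) would request only the last 100 supervisor log lines from the device in this thread.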

I know. But I believe that adding a (configurable?) delay before rebooting may help in a lot of situations.

I may be back in a few days with some new findings: I’ve asked my friend to ship the device back to me for inspection.
I’m expecting interesting findings here, because the device has started to work again! That is, the application container is communicating with my server again. There was no problem on the server before, and not even an attempt to communicate with it (checked in both logs and tcpdump).

The problem now is that I’ve deleted the device from my dashboard. Is there a way I can re-add the device to my application without reprovisioning it or touching too many things (which would also alter the tests)?

Hi,

It’s interesting that the device is working again and managed to get out of its boot loop.

For ‘undeleting’ a device, it’s a bit manual at the moment. Can you share the UUID of the device? Is it the same as 148f24edb9e59025e95fcc666a937db1, as in the first post in the thread?

Also, when a random reboot loop goes away on its own, whatever was causing the reboots is probably not software. So at this point, I would strongly suspect that something is about to fail.
It is usually the SD card. We recommend the SanDisk Extreme Pro. In some cases it could be the battery/power supply too.

Regards
ZubairLK

Hi @zubairlk,
I’ll be back as soon as I receive the device from my friend.
I’m using a Samsung EVO Plus, but I will consider the SanDisk Extreme Pro or a microSD card based on SLC flash memory.