Supervisor restarting over and over

Hi all,
one of my Raspberry Pi based devices seems to be rebooting over and over, as shown in the logs:

14.10.19 00:49:58 (+0200) Supervisor starting
14.10.19 00:50:23 (+0200) Killing service 'main sha256:c0806c7044c132051dc4ecd8b9266c9b38c3d0fe7a6bb900c379f43ed2ebb554'
14.10.19 16:56:10 (+0200) Killed service 'main sha256:c0806c7044c132051dc4ecd8b9266c9b38c3d0fe7a6bb900c379f43ed2ebb554'
14.10.19 16:56:10 (+0200) Rebooting
14.10.19 16:56:10 (+0200) Service exited 'main sha256:c0806c7044c132051dc4ecd8b9266c9b38c3d0fe7a6bb900c379f43ed2ebb554'

The device is reported as offline in the dashboard, but it actually goes online for a few seconds. Maybe too few seconds to open a web terminal onto the Host OS:

Connecting to 148f24edb9e59025e95fcc666a937db1...

Host OS is balenaOS 2.32.0+rev1, Supervisor is 9.14.0. The board is a Raspberry Pi 3 B and it connects to the Internet via a MS2131 USB dongle.

Is there some voodoo I can practice to recover my device?

Regards,
Danilo

The device is connecting every minute, but then it reboots.
I think that balenaOS is booting and the ModemManager on it is able to connect to the Internet via the MS2131 3G dongle.

Does this also mean that the Supervisor is getting started?
Is there any way to ask the device to stay up with only the Supervisor running?

I’ve also tried moving the device to another application, but the result is the same. The idea was to replace the “broken” application container with a different one.

Hi @daghemo, is the device a .prod or .dev image, and are you on the same network as the device? If you are, then you can probably get into the device via SSH before it reboots again and quickly stop the balena-engine.

The supervisor is definitely getting started, because it is the one sending the logs to the dashboard on every boot. So the current assumption is that the application container is broken and is the one constantly rebooting the device?

Hi @shaunmulligan,
the remote device uses a Production image, it connects to the Internet via a MS2131 3G dongle, and I have no local access to it via LAN or WiFi.

The application container used to work until yesterday. Don’t know what’s happened. Maybe the filesystem on the microSD got damaged.

Is there something I can set in the Dashboard to ask the device not to reboot (or to reboot only after N minutes) while the application container is broken? I would then be able to access the host OS via the VPN, am I right?

@daghemo unfortunately there isn’t currently a way to ask the device to start up without running any of the containers. The containers are started by the container engine, which in turn starts the supervisor (which is also a container), so your containers will always start side by side with the supervisor and do whatever they do before the supervisor has a chance to effect any changes. We might be able to implement a feature in the future to offer this kind of functionality, but it needs some designing.

It also depends on what is causing the device to reboot. If it’s rebooting because of a corrupt SD card or filesystem, then it’s difficult to know if anything can help. The only thing I can think of is to write a script that constantly tries to ssh into the device (using the balena CLI) and then executes systemctl stop balena to stop the engine from starting anything, but my guess is this will be difficult to catch, because the modem takes longer to connect than the containers take to start up :frowning:
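
Something along these lines might do it (a rough sketch only; it assumes the balena CLI is installed and logged in, and <device-uuid> is a placeholder for your device’s UUID):

# Keep trying to SSH into the host OS; on the first attempt that gets
# through, stop the engine so no containers (re)start.
until echo "systemctl stop balena" | balena ssh <device-uuid> --host 2>/dev/null; do
    echo "$(date +%Y%m%d-%H%M%S) still trying..."
    sleep 1
done
echo "engine stopped"

If an attempt lands, the device should stay up with only the host OS services running, which gives you a window to debug.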

@shaunmulligan I will try with the balena CLI and systemctl stop balena.

I think that even just allowing a timeout to be set before rebooting the device would be great.
This could also let the Supervisor not only send logs to balenaCloud, but also ask it what to do, e.g. restart the application, move to a different application (i.e. replace the faulty container with another one), purge /data, or whatever else.

Is there any way to get a more verbose logging?

@daghemo yes, I think a feature like that would be helpful and I’ve raised it internally for discussion. However, in many cases a reboot delay won’t help, especially if it’s something causing processes or the kernel to panic (which often happens with a corrupt FS), because in that case the reboot is not really under the control of the supervisor, so it can’t delay it. The delay will only help in the case where a user’s code using the reboot API has gone rogue, but it’s still useful.

Unfortunately, if you need more verbose logging from the supervisor, it’s currently only accessible if you can get into the OS and run journalctl -u resin-supervisor --no-pager. We don’t yet log the whole journal remotely, because many of our users are running on cellular and are very sensitive to data usage, so we need to implement it in an opt-in way.
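
For example, once you do get a host OS shell (both flags are standard journalctl options):

journalctl -u resin-supervisor --no-pager    # dump everything the journal holds for the supervisor
journalctl -u resin-supervisor --follow      # tail new supervisor log lines as they arrive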

I know. But I believe that adding a (configurable?) delay before rebooting may help in a lot of situations.

I may be back in a few days with some new findings: I’ve asked my friend to ship the device back to me for inspection.
I’m expecting interesting findings here, because the device has started to work again! That is, the application container is communicating with my server again. There was no problem on the server side before; there was not even an attempt to communicate with it (checked both logs and tcpdump).

The problem now is that I’ve deleted the device from my dashboard. Is there a way I can re-add the device to my application without reprovisioning it or touching too many things (which would also alter the tests)?

Hi,

It’s interesting that the device is working again and managed to get out of its boot loop.

For ‘undeleting’ a device, it’s a bit manual at the moment. Can you share the uuid of the device? Is it the same as 148f24edb9e59025e95fcc666a937db1, as in the first post in the thread?

Also, for random reboot loops that go away on their own, whatever was causing the reboot is probably not software. So at this point, I would strongly suspect some piece of hardware is about to fail.
It is usually the SD card. We recommend the SanDisk Extreme Pro. In some cases it could be the battery/power supply too.

Regards
ZubairLK

Hi @zubairlk,
I’ll be back as soon as I receive the device from my friend.
I’m using a Samsung EVO Plus, but I will consider the SanDisk Extreme Pro or a microSD based on SLC flash memory.

Hi @shaunmulligan, hi @zubairlk,
any update from you on this?
Even just on my proposal of “allowing a timeout to be set before rebooting the device”.

I got my hands on the device in the fall of 2019, but… I mixed up the microSD cards! :slight_smile: Then I had no time to investigate further. :frowning:

But now a friend I’ve introduced to balenaOS is suffering the same problem.

Again a Raspberry Pi 3 B+.
This time with balenaOS 2.36.0+rev2 and Supervisor 9.15.0.
Same error pattern here: Supervisor starting… Killing service… Killed service… Rebooting… Service exited… repeat…
So, I know, it could be a broken microSD. But…

Something is working here… so I really would love to be able to SSH into the Host OS remotely and run some magic balena command.

Hi,
Regarding the feature allowing a timeout to be set before rebooting the device, I will check with the team whether there has been an update on it. Reading back on the issue, as @zubairlk mentioned, it could be a broken SD card, so it would be good to check that out. Also, consider updating balenaOS to the latest version.

Thank you @vipulgupta2048! :slight_smile:

Any suggestion on how to update balenaOS remotely in this condition (Supervisor starting… Killing service… Killed service… Rebooting… Service exited… repeat…)?

That is, any tip or trick to prevent the device from rebooting over and over. Maybe by configuring a “magic” RESIN_HOST_CONFIG_* variable, pushing a new application to it, moving the device to another application, or whatever else. Anything we can do remotely, via the VPN or balenaCloud.

Hi there,

While you won’t be able to update the host OS while it is rebooting like this, we have on our roadmap providing a “safe mode” to prevent the engine from starting up in cases like this. The good news is that I do have some concrete steps for you to try:

  1. As noted in https://github.com/balena-io/balena-cli/issues/1482, you can echo a command into balena ssh to run it on the device: https://github.com/balena-io/balena-cli/issues/1482#issuecomment-574385467.
  2. You can combine that with a shell builtin like until to repeatedly try sending the command to stop the engine (systemctl stop balena). Hopefully this will work and prevent the device from rebooting!

Please let us know how you fare with these suggestions. It could be that something else entirely is causing these reboots, in which case this workaround may not help. There are tons of further suggestions available in our device debugging masterclass as well: https://www.balena.io/docs/learn/more/masterclasses/device-debugging/

Thank you @xginn8! :smiley:
It sounds good and I’m going to try it right now! I really hope to help my friend and also find a solution to my old problem.
I will get back to you soon with an update.

Hi @xginn8,
I’m back with a very small update:

# exit=0 ; until [ "$exit" -ge 1 ]; do (echo "systemctl stop balena" | ./balena ssh d9f393d --host 2>/dev/null) && exit=1 || echo "`date +%Y%m%d-%H%M%S` Still trying..." ; done
20200611-132752 Still trying...
20200611-132753 Still trying...
20200611-132755 Still trying...
[… 25 similar lines elided …]
20200611-132836 Still trying...
20200611-132837 Still trying...
error: host error:

That is, all we get is an “error: host error:” right now. :frowning:

Hi @daghemo

Thanks for the update. I have a few questions:

  • What is the status of the device on the dashboard now? Does it stay online, i.e. has the rebooting stopped?
  • Are persistent logs enabled for the device (as described here)?
  • What do you find when running the diagnostics as described here?

Kind regards
Alida

The status of the device in the dashboard is unchanged.
That is, as before, it may be shown as being online for a couple of seconds, then offline again.
Supervisor starting… Killing service… Killed service… Rebooting… Service exited… repeat… you know. :frowning:

Having said that, I can’t run the Device Health Checks, just because the system is offline.
I managed to click it just once, but got:
Error sending checks command: w: Request error: [object Object]

Persistent logging is not enabled.

Hey, just to confirm: this ticket was created for your own device, but now we want to troubleshoot why your friend’s device is exhibiting the same behavior. Is this new device also running a production image? I ask because I’d like you to try to SSH directly to the device, which removes the proxy service from the equation.

The command to SSH to a production or development image is the same (balena ssh <device-uuid>); however, for production, if you did not already add your SSH key to config.json, you’ll have to do that to SSH to it directly. This can be done by pulling the device’s SD card and plugging it into a machine that can modify its contents. Following https://www.balena.io/docs/learn/more/masterclasses/host-os-masterclass/#12-advanced-editing-configjson-by-hand you can edit the SSH keys as described in https://www.balena.io/docs/reference/OS/configuration/#sshkeys. You will find the config.json file in the resin-boot partition of the SD card.
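
For example, with the boot partition mounted, a key can be appended to the os.sshKeys array using jq (just a sketch; /mnt/resin-boot and the key path are placeholders for wherever yours actually live):

PUBKEY=$(cat ~/.ssh/id_rsa.pub)    # your public key; the path is an example
jq --arg key "$PUBKEY" '.os.sshKeys = ((.os.sshKeys // []) + [$key])' /mnt/resin-boot/config.json > /tmp/config.json
mv /tmp/config.json /mnt/resin-boot/config.json

You can of course also edit the file by hand, as described in the masterclass linked above.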

If this allows us to SSH to the device, we’ll have a lot more options to debug this, so I’m keen on us trying everything we can to SSH before moving on to other steps.

Thanks!