Balena "Supervisor" process does not restart after a device is "Shut Down" from BalenaCloud

I had to power down one of my balenaSense devices on my starter plan a while ago. I noticed that there was a “shut down” option in the device UI in BalenaCloud, so I tried that and noted that it powered off successfully.

However, upon attempting to restart the device and begin collecting data again, I noticed that the device shows as “Online” but the applications were not starting and could not be restarted. I ran diagnostics and saw that the Supervisor process was not running, and it’s not clear that there is a way to start that process from the balena CLI, from the device’s terminal (connected to the Host OS), or via a console (keyboard + screen).

The device in question is a Raspberry Pi Zero W. Any thoughts or ideas? Is this a bug or a known issue?

Hello and welcome to the forums!

What version of balenaOS are you running?
Regarding the supervisor: it is very uncommon that it needs to be manually restarted, as it usually recovers automatically. To obtain more information about the failure, please check its log with journalctl -b -a -f -u resin-supervisor from the hostOS.
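
For example, from a hostOS shell (note that on newer balenaOS releases the unit may be named balena-supervisor rather than resin-supervisor):

# Follow the supervisor's journal for the current boot:
journalctl -b -a -f -u resin-supervisor
# Check the unit's current state and most recent exit status:
systemctl status resin-supervisor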

Unfortunately, the only way I found to correct the situation was to completely destroy the application, re-flash the device’s storage, and rebuild/redeploy. If I get time this weekend, I’ll try to replicate the issue.

Thanks for the update. Let us know if you encounter this issue again, along with any steps we could use to reproduce it on our end.
Cheers…

Hi guys,

let me please re-open this thread. I’ve faced exactly the same issue: two balenaSense app devices (Raspberry Pi Zero W + balenaOS 2.48.0+rev1). After a power loss, either device shows only “Supervisor starting” in the logs, and nothing else happens. I reproduced this issue a few times, and the only way to correct it is to re-flash the device.
Please find logfile from the hostOS:
root@b0ecd1d:~# journalctl -b -a -f -u resin-supervisor
-- Logs begin at Fri 2020-01-31 15:19:41 UTC. --
Jul 06 04:49:18 b0ecd1d resin-supervisor[21400]: activating
Jul 06 04:49:18 b0ecd1d systemd[1]: resin-supervisor.service: Control process exited, code=exited, status=3/NOTIMPLEMENTED
Jul 06 04:49:18 b0ecd1d systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
Jul 06 04:49:18 b0ecd1d systemd[1]: Failed to start Balena supervisor.
Jul 06 04:51:21 b0ecd1d systemd[1]: resin-supervisor.service: Start-pre operation timed out. Terminating.
Jul 06 04:51:21 b0ecd1d systemd[1]: resin-supervisor.service: Control process exited, code=killed, status=15/TERM
Jul 06 04:51:22 b0ecd1d resin-supervisor[22302]: deactivating
Jul 06 04:51:22 b0ecd1d systemd[1]: resin-supervisor.service: Control process exited, code=exited, status=3/NOTIMPLEMENTED
Jul 06 04:51:22 b0ecd1d systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
Jul 06 04:51:22 b0ecd1d systemd[1]: Failed to start Balena supervisor.

Any ideas why this might happen? Can you please help me resolve this issue?

Hey @Seva

Regarding “after powerloss”: can you confirm whether you are shutting down the device from the dashboard or simply pulling the power supply?

Thanks

Hey Rahul, I’ve tried both scenarios - shutdown from the dashboard and simply pulling the power supply. Both result in the same behavior described above.

Hi,

You can manually interact with the supervisor from the hostOS with the following commands (for multicontainer applications on devices running OS > 2.9.0):

# Stop the supervisor first so it doesn't restart the engine underneath us:
systemctl stop resin-supervisor
# Stop the container engine:
systemctl stop balena
# Remove all container state (images, containers, layer data):
rm -rf /var/lib/docker/{aufs,overlay,containers,image,tmp}
# Bring the engine back up, then restart the supervisor:
systemctl start balena
systemctl restart resin-supervisor
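
Note that the rm -rf step wipes all pulled images and container state, so the device will re-download everything for the application once the engine and supervisor come back up.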

As my colleague Rahul mentioned previously, restarting the device is usually successful, but this is something you can try directly.

John

Hi John,

thanks for the follow-up! Today, after another power outage, I had the same experience with both Pi0 Ws (balenaOS 2.48.0+rev1, supervisor 10.8.0, multicontainer balena-sense app installed).

I followed your advice and it helped! Thanks a lot! After executing the suggested commands, both devices re-downloaded their Docker images, re-installed their services, and went live with all services working properly.

I have another device, a Pi4 with Host OS 2.51.1+rev1, and it restarted normally after the power loss.

I have never observed a Pi0 W restart successfully after a power loss, and this weird behavior is quite consistent (I have re-flashed my Pi Zero Ws about 5 times).

After fixing the issue with your commands, I rebooted one properly working Pi0 from the dashboard and hit the same bad behavior again. Here is the output:

07.07.20 21:43:31 (-0700) Killing service 'influxdb sha256:744ee41fb7b015a4bf6c466217303c61a24f07c16a280c24fa096fae898ceb7a'
07.07.20 21:43:40 (-0700) Killed service 'influxdb sha256:744ee41fb7b015a4bf6c466217303c61a24f07c16a280c24fa096fae898ceb7a'
07.07.20 21:43:40 (-0700) Service exited 'influxdb sha256:744ee41fb7b015a4bf6c466217303c61a24f07c16a280c24fa096fae898ceb7a'
07.07.20 21:43:41 (-0700) Killing service 'sensor sha256:625dc78417a7843010310dc855ce3cd9731d23cf6f1dc2ae9e8e03e27ab77a56'
07.07.20 21:44:00 (-0700) Killed service 'sensor sha256:625dc78417a7843010310dc855ce3cd9731d23cf6f1dc2ae9e8e03e27ab77a56'
07.07.20 21:44:00 (-0700) Service exited 'sensor sha256:625dc78417a7843010310dc855ce3cd9731d23cf6f1dc2ae9e8e03e27ab77a56'
07.07.20 21:44:00 (-0700) Killing service 'grafana sha256:58bf40673b4b91be0d9b0fc254816a0a5162ea4619b6b561baf5110397a20fd1'
07.07.20 21:44:08 (-0700) Killed service 'grafana sha256:58bf40673b4b91be0d9b0fc254816a0a5162ea4619b6b561baf5110397a20fd1'
07.07.20 21:44:08 (-0700) Service exited 'grafana sha256:58bf40673b4b91be0d9b0fc254816a0a5162ea4619b6b561baf5110397a20fd1'
07.07.20 21:44:08 (-0700) Killing service 'mqtt sha256:a2b55301913b48c01c2420a59fdb3cc0eb6252edda9441c9421dd478f31eb8ea'
07.07.20 21:44:15 (-0700) Killed service 'mqtt sha256:a2b55301913b48c01c2420a59fdb3cc0eb6252edda9441c9421dd478f31eb8ea'
07.07.20 21:44:15 (-0700) Service exited 'mqtt sha256:a2b55301913b48c01c2420a59fdb3cc0eb6252edda9441c9421dd478f31eb8ea'
07.07.20 21:44:16 (-0700) Killing service 'telegraf sha256:876dc7af6bbafe57966938252d2ebb15e9b7692e243cff248f9052c5920203fc'
07.07.20 21:44:27 (-0700) Killed service 'telegraf sha256:876dc7af6bbafe57966938252d2ebb15e9b7692e243cff248f9052c5920203fc'
07.07.20 21:44:28 (-0700) Rebooting
07.07.20 21:44:28 (-0700) Service exited 'telegraf sha256:876dc7af6bbafe57966938252d2ebb15e9b7692e243cff248f9052c5920203fc'
07.07.20 21:50:07 (-0700) Supervisor starting

30 minutes have passed since then with no change. I will restart the supervisor following your working advice.

Any ideas why this behavior might occur? How can I fix it? It’s a bit annoying to restart/reset the supervisor manually and re-download all Docker images after every power loss / reset / shutdown.

thanks,
Seva

Hi!

Did you get that output before or after rebooting? I understand it is from after the reboot, but I want to be sure.

It would help a lot if you could provide the diagnostics output, since there’s much more info there. You can access this feature by navigating to the device summary page and scrolling to the bottom to select “Diagnostics (Experimental)”. There, select “Device diagnostics” and click “Run diagnostics”. It should take a couple of minutes. Once finished, please download the file, remove all sensitive information, and provide the output :). Thanks!

The last line of the previously shared output ((21:50:07) Supervisor starting) is from after the reboot; the others are from before.
Sure, happy to share diagnostics. I just tried to, but unfortunately I can’t attach a txt file to the post, and I can’t paste it as pre-formatted text either, as a forum post is limited to 32k characters and the diagnostics output is about 1970k characters. How can I share the diagnostics output with you, or can you specify which exact parts of the diagnostics you want to look at?

Hi there,

It might be easier to just enable support access for the device and paste the device ID here, and we can look at the diagnostics output directly.

Thanks,
James.

I’m also running a Pi Zero W with the Balena Sense project and experiencing exactly the same problem. If I deploy the new image, everything works fine until the next restart, whereupon the supervisor hangs on starting.

Running
systemctl stop resin-supervisor
systemctl stop balena
rm -rf /var/lib/docker/{aufs,overlay,containers,image,tmp}
systemctl start balena
systemctl restart resin-supervisor

twice seems to resolve the issue, but in effect it just re-downloads everything. Was a solution ever found?

Can you please post the supervisor logs from the device? The best way to do this would be to supply the device diagnostics file from the diagnostics tab on the dashboard. If your application logs contain anything sensitive, it’s better to send a personal message to a team member with a link to this thread, and we can attach it to the ticket.

Hi,

Sorry for the delay. Unfortunately it won’t let me attach the log, as you can’t attach txt files (file type not allowed), and I can’t paste the log as it exceeds the character limit. Any other options?

Hello! You can use a GitHub gist, if that’s OK with you.

Hi there – thanks very kindly for the additional information. We’re examining this right now, and will get back to you in a bit.

All the best,
Hugh

Hi there – thanks again for sending us the diagnostic logs. After reviewing them, it seems that your device is stuck in a restart loop:

  • containers are being restarted at boot time…
  • but this process takes longer than the timeout we have for engine startup…
  • as a result, the engine is restarted…
  • and we’re back at square one (see the sketch after this list for a way to inspect that timeout).
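
If you want to poke at that timeout yourself, here is a rough sketch from the hostOS. This is a diagnostic experiment rather than an officially supported change: it assumes the engine unit is named balena.service (as on the OS versions in this thread), and it uses a runtime systemd drop-in under /run, which does not survive a reboot:

# Show the engine's currently configured start timeout:
systemctl show balena.service -p TimeoutStartUSec
# Temporarily raise the timeout with a runtime drop-in:
mkdir -p /run/systemd/system/balena.service.d
cat > /run/systemd/system/balena.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=15min
EOF
systemctl daemon-reload
systemctl restart balena.service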

My colleague has written up an excellent summary of the situation here, which is well worth reading.

We’re currently tracking this in two separate issues: one for whether we need to adjust the timeout setting, and one for performance problems for balena Sense on the Pi Zero. You can keep an eye on both to see our progress.

One thing you can try is upgrading the OS on your device to the latest version in production (2.54.2+rev1 as I write this). This version includes support for ZRAM, which may help performance on your device. Can you give this a try and let us know how it works for you?
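
Once upgraded, a quick way to sanity-check that ZRAM is in use from the hostOS (assuming it is set up as a swap device, which is the usual arrangement):

# An active /dev/zram* entry here indicates ZRAM swap is enabled:
cat /proc/swaps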

All the best,
Hugh

Thanks for your reply. I updated to 2.54.2+rev1 and the problem looks to be solved. I assume the ZRAM improvements have made the difference in startup times. Thanks for your help.

Will