Rebooting a device using the supervisor API vs DBUS

bartvanarnhem · June 20, 2024, 12:39pm

Hello!

We have implemented a nightly reboot on our fleet of devices, which currently uses the supervisor API method documented in "Interacting with the balena Supervisor" (Balena Documentation).

Occasionally we notice devices only partially shutting down: they close their containers and go offline, but never complete the full reboot cycle and then require a hard reset. The issue seems to be due to the supervisor not being able to fully kill one of our containers. We have gone to great lengths to make sure our app exits cleanly and releases all hardware locks, which has improved things a lot. However, we would like to eliminate the chance of this happening completely.

In our testing we have found that the DBUS method of directly rebooting the host OS (as documented in "Communicate outside the container", Balena Documentation) does not cause this reboot stall. (Or at least, we have not seen it yet.)

My questions: is there anything wrong with not using the supervisor API to reboot? Can the DBUS method be detrimental over time in any way?

Thanks!

Bart

mtoman · August 26, 2024, 1:56pm

Hi,

Basically your understanding is correct - using the supervisor API will try to stop the containers before issuing the reboot; however, if a container refuses to stop gracefully, the supervisor will wait for it forever. The d-bus reboot, on the other hand, makes systemd stop all the system services, including balenaEngine and therefore all the containers. The difference is that this will also forcefully kill any process that refuses to exit within a certain time limit, which for balenaEngine is 90 seconds. But this might mean lost data if, e.g., the container was not shutting down because it was busy writing to disk.

There are a few possible ways you can tackle this:

If you are confident that your containers exit quickly, and that anything hanging on a HW call for over a minute will basically hang forever, you can use the d-bus method more or less safely. Everything should be stopped gracefully and only the last container will be killed.
You can also implement some logic around the problematic container: first try to stop it using the supervisor API, then use either the supervisor or the d-bus API to reboot, depending on whether the container could be stopped.
Theoretically you can also treat these as "reboot tiers": first try the supervisor reboot, and then fall back to the d-bus reboot. However, this might not work if the container asking for the reboot is not the one that stays up last.
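
The second option could be sketched roughly as follows. This is only an illustration, not a tested implementation: it assumes the calling container has the io.balena.features.supervisor-api and io.balena.features.dbus labels set, that dbus-send is installed in the image, and that a successful response from stop-service within the timeout means the service really stopped. The function names and the service name "main" in the usage note are hypothetical.

```python
import json
import os
import subprocess
import urllib.request

# Provided by balena when the supervisor-api feature label is set.
SUPERVISOR = os.environ.get("BALENA_SUPERVISOR_ADDRESS", "")
API_KEY = os.environ.get("BALENA_SUPERVISOR_API_KEY", "")
APP_ID = os.environ.get("BALENA_APP_ID", "")


def supervisor_post(path, body=None, timeout=60):
    """POST to the supervisor API; True only on a 2xx reply within timeout."""
    try:
        req = urllib.request.Request(
            f"{SUPERVISOR}{path}?apikey={API_KEY}",
            data=json.dumps(body or {}).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):  # HTTP errors, timeouts, bad address
        return False


def stop_service(name):
    # Assumption: a successful reply means the container has stopped.
    return supervisor_post(
        f"/v2/applications/{APP_ID}/stop-service", {"serviceName": name}
    )


def supervisor_reboot():
    return supervisor_post("/v1/reboot")


def dbus_reboot():
    # Asks systemd on the host to reboot via the host's system bus socket.
    env = dict(
        os.environ,
        DBUS_SYSTEM_BUS_ADDRESS="unix:path=/host/run/dbus/system_bus_socket",
    )
    cmd = [
        "dbus-send", "--system", "--print-reply",
        "--dest=org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager.Reboot",
    ]
    return subprocess.run(cmd, env=env).returncode == 0


def reboot(service, stop=stop_service, soft=supervisor_reboot, hard=dbus_reboot):
    """Use the supervisor reboot if the service stops cleanly, else d-bus."""
    if stop(service) and soft():
        return "supervisor"
    hard()
    return "dbus"
```

Calling reboot("main") from the nightly job would then pick the path automatically. As noted above, the d-bus path can still lose data if the stuck container is in the middle of a write, so the timeout passed to supervisor_post is worth tuning to your workload.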
