Occasionally we notice devices only partially shutting down: they close their containers and go offline, but never complete the reboot cycle and need a hard reset. The issue seems to be that the supervisor cannot fully kill one of our containers. We have gone to great lengths to make sure our app exits cleanly and releases all hardware locks, which has improved things a lot, but we would like to eliminate the chance of this happening completely.
In our testing we have found that the DBUS method of directly rebooting the host OS (as documented in "Communicate outside the container" in the Balena Documentation) does not cause this reboot stall, or at least we have not seen it yet.
My questions: is there anything wrong with not using the supervisor API to reboot? Can the DBUS method be detrimental over time in any way?
Basically, your understanding is correct: the supervisor API will try to stop the containers before issuing the reboot, but if a container refuses to stop gracefully, the supervisor will wait for it forever. The d-bus reboot, on the other hand, makes systemd stop all the system services, including balenaEngine and therefore all the containers. The difference is that systemd will also forcefully kill any process that refuses to exit within a certain time limit, which for balenaEngine is 90 seconds. This can mean lost data if, for example, the container was not shutting down because it was busy writing to disk.
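For reference, here is a minimal Python sketch of the two reboot paths described above. It assumes the service has the supervisor API and d-bus features enabled (so that `BALENA_SUPERVISOR_ADDRESS`, `BALENA_SUPERVISOR_API_KEY` and the host socket at `/host/run/dbus/system_bus_socket` are available) and that `dbus-send` is installed in the container. The function names are made up for illustration, and the d-bus destination and method should be checked against the documentation page you linked.

```python
import os
import subprocess
import urllib.request

# Assumed host bus socket path, as mounted into the container by the d-bus feature.
HOST_DBUS_ADDRESS = "unix:path=/host/run/dbus/system_bus_socket"


def build_supervisor_reboot_request(address, api_key):
    """Build (but do not send) the supervisor POST /v1/reboot request."""
    url = f"{address}/v1/reboot?apikey={api_key}"
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def build_dbus_reboot_command():
    """Build the dbus-send argv that asks systemd on the host to reboot."""
    return [
        "dbus-send",
        "--system",
        "--print-reply",
        "--dest=org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager.Reboot",
    ]


def reboot_via_supervisor(timeout=30):
    """Ask the supervisor to stop the containers and then reboot."""
    req = build_supervisor_reboot_request(
        os.environ["BALENA_SUPERVISOR_ADDRESS"],
        os.environ["BALENA_SUPERVISOR_API_KEY"],
    )
    # The timeout only bounds the HTTP call, not the supervisor's own wait.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def reboot_via_dbus():
    """Ask systemd directly; it stops services and kills stragglers itself."""
    env = dict(os.environ, DBUS_SYSTEM_BUS_ADDRESS=HOST_DBUS_ADDRESS)
    return subprocess.run(build_dbus_reboot_command(), env=env, check=True)
```

The request and command builders are kept separate from the functions that actually trigger a reboot, so they can be inspected or logged on a development machine without rebooting anything.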
There are a few possible ways you can tackle this:
If you are confident that your containers exit quickly, and that anything hanging on a HW call for over a minute will basically hang forever, you can use the d-bus method more or less safely. Everything should be stopped gracefully and only the last container will be killed.
You can also implement some logic around the problematic container: first try to stop it using the supervisor API, then reboot through either the supervisor or the d-bus API, depending on whether the container could be stopped.
Theoretically you can also use this as “reboot tiers”: first try the supervisor reboot, then fall back to the d-bus reboot. However, this might not work if the container asking for the reboot is not the one that stays up last.
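The "stop first, then pick the reboot path" idea could look roughly like the sketch below. All four callables are hypothetical placeholders that you would supply yourself: `stop_service` might wrap the supervisor's stop-service endpoint with an HTTP timeout, `is_stopped` might read the supervisor's state endpoint, and the two reboot callables would trigger the supervisor and d-bus reboots. The 60 second default is only an assumption picked to stay under the 90 second limit mentioned above.

```python
import time


def stop_then_reboot(stop_service, is_stopped, supervisor_reboot, dbus_reboot,
                     timeout_s=60.0, poll_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Try to stop the problematic container, then choose a reboot path.

    Returns "supervisor" if the container stopped within timeout_s,
    otherwise "dbus".
    """
    try:
        stop_service()
    except Exception:
        # A failed or timed-out stop request is not fatal here: the polling
        # loop below decides whether the container actually went down.
        pass

    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_stopped():
            # The container went down cleanly, so the graceful path is safe.
            supervisor_reboot()
            return "supervisor"
        sleep(poll_s)

    # The container is still up: let systemd force the issue.
    dbus_reboot()
    return "dbus"
```

The same caveat applies as for the reboot tiers: this only helps if the container running this logic is still alive when the decision has to be made.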