I found an article describing how to enable the Raspberry Pi hardware watchdog timer to have devices recover from any potential hardware or kernel lockups:
What is your use case? balenaOS already utilizes the hardware watchdog as part of a chain of health checks to ensure the device is always available and responsive. If you’re looking to restart your application when it becomes unresponsive, you might be better off looking into the “healthcheck” directive of the docker-compose file, which balenaOS also supports: Compose file version 2 reference | Docker Documentation
We are sporadically seeing issues with the camera module that we are using on the Pi, which look similar to the one posted here: Experiencing hung raspberry pi while using camera - Arducam. Sometimes this seems to lead to our devices losing connection.
I am already in contact with your colleagues in another support thread trying to get to the root cause, but I figured it would be good to see if there are any measures we can take to at least make the device recover if it gets into this state.
Good to know that balenaOS already utilizes the hardware watchdog. That answers my question. Thanks!
Hi, Jason, you can find more about docker-compose health checks here, but basically, you run a script within your container, and if the script returns an error or cannot execute, your container is restarted.
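To make the idea of a check script more concrete, here is a minimal Python sketch. It assumes the application touches a heartbeat file on a regular basis; the file path and the 60-second threshold are hypothetical and not something balenaOS or Docker defines. The only contract with the healthcheck mechanism is the exit code: 0 means healthy, 1 means unhealthy.

```python
import os
import time

# Hypothetical values: the application is assumed to touch this file regularly.
HEARTBEAT_FILE = "/tmp/app-heartbeat"
MAX_AGE_SECONDS = 60


def is_healthy(path, max_age, now=None):
    """Return True if the heartbeat file exists and was modified recently."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable heartbeat file counts as unhealthy.
        return False
    return age <= max_age


def main():
    """Return the exit code a healthcheck expects: 0 healthy, 1 unhealthy."""
    return 0 if is_healthy(HEARTBEAT_FILE, MAX_AGE_SECONDS) else 1
```

A script built on this would end with `sys.exit(main())` and be referenced from the service's healthcheck test command.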
I’d like for it to reset my board, not my container.
The point of a hardware watchdog is that if, for any reason, you don’t perform the action telling the hardware watchdog everything is fine on a regular periodic basis, the board resets.
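For readers unfamiliar with the mechanism, the loop below is a rough Python sketch of that periodic action on Linux, where writing to the watchdog device node resets the timer. The `is_system_ok` callback, the interval, and the `max_pets` and `sleep` parameters are hypothetical names added for illustration. On balenaOS the OS already holds the watchdog device, so this is meant to show the principle rather than something to run alongside it.

```python
import time

WATCHDOG_DEVICE = "/dev/watchdog"  # standard Linux watchdog device node


def pet_watchdog(device, is_system_ok, interval_s, max_pets=None, sleep=time.sleep):
    """Write to the watchdog device for as long as is_system_ok() returns True.

    Returns the number of writes made. Once the writes stop, the hardware
    timer is no longer refreshed and the board resets when it expires.
    """
    pets = 0
    while max_pets is None or pets < max_pets:
        if not is_system_ok():
            # Stop refreshing the timer and let the hardware reset the board.
            break
        device.write(b"\0")  # any write refreshes the timer
        device.flush()
        pets += 1
        sleep(interval_s)
    return pets
```

On a system where nothing else owns the device, this could be driven with `open(WATCHDOG_DEVICE, "wb", buffering=0)` and an interval comfortably shorter than the watchdog timeout.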
Why are we resetting containers when the board needs to be reset?
The watchdog is already implemented and managed by balenaOS. It uses the hardware watchdog to reset the board if the kernel is unresponsive for a given amount of time. Docker also has healthchecks that can restart your container if it fails a given test.
So, there are two separate processes ensuring your application/board are up and running.
You could run a container with the balena socket mapped into it, which would then give you a docker API inside the container. You could write a script/app to periodically run a status check (e.g. docker ps) on the running containers and, if your conditions for a system reboot are met, issue a reboot from inside the container.
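A rough Python sketch of such a script might look like the following. It rests on several assumptions: a `docker` CLI is available in the container and pointed at the mapped socket, the services define healthchecks so that `docker ps` reports "(unhealthy)" in its status column, and the reboot is requested through the balena supervisor API using the `BALENA_SUPERVISOR_ADDRESS` and `BALENA_SUPERVISOR_API_KEY` environment variables. The function names and the streak-based reboot condition are hypothetical; substitute your own conditions.

```python
import json
import os
import subprocess
import urllib.request


def read_ps_output():
    """Run `docker ps` and return one JSON object per line (assumes a docker CLI)."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{json .}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def unhealthy_containers(ps_output):
    """Return the names of containers whose status reports a failed healthcheck."""
    names = []
    for line in ps_output.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if "(unhealthy)" in entry.get("Status", ""):
            names.append(entry.get("Names", ""))
    return names


def update_streak(streak, ps_output):
    """Count consecutive polls in which at least one container was unhealthy."""
    return streak + 1 if unhealthy_containers(ps_output) else 0


def request_reboot():
    """Ask the supervisor to reboot the device (assumed endpoint: POST /v1/reboot)."""
    url = "{}/v1/reboot?apikey={}".format(
        os.environ["BALENA_SUPERVISOR_ADDRESS"],
        os.environ["BALENA_SUPERVISOR_API_KEY"],
    )
    req = urllib.request.Request(
        url, data=b"{}", headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

A polling loop would call `update_streak(streak, read_ps_output())` every minute or so and call `request_reboot()` once the streak passes a threshold you choose, so that a single failed check does not reboot the board.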