No more hardware or kernel lock-ups: system watchdog in 1.24.0

We’ve just released resinOS 1.24.0 (for Raspberry Pi and BeagleBone devices at the moment, with more coming step by step), which comes with a system watchdog enabled by default. This change addresses some practical problems that people were having out in the field, as it protects the device from hardware lockups and kernel hangs. It is a first step towards a more comprehensive solution and thus does not cover all the bases (protecting the application or connectivity, for example), but it is a necessary step to try out how watchdogs like this work in practice for resin.io fleets. :dog2:

See more details in our announcement blogpost: Keeping Your System Running with a Host OS Watchdog!

Also, if you have any watchdog or reliability stories, share them with us here! :pencil2:

Hi Gergely,

This post is getting somewhat outdated now. What is the current status of the watchdog in balenaOS 2.x?

Thanks,
Fokko

Hey @fokko, thanks for the follow-up :slight_smile: Indeed, a lot has changed.

We have a lot more healthchecks inside balenaOS (health checks on the supervisor, on the balenaEngine itself, as well as a basic watchdog). We will need to update our documentation on this and expand a lot more.

Also, you can add healthchecks to your application containers as well, which use balenaEngine’s functionality to keep the application healthy.

Is there anything specific you are looking for in the meantime? We will let you know as soon as we have more complete docs. :slight_smile:

Health checks that work together with balenaEngine? That sounds interesting! So, how do we implement a health check on our container?

We are now building a system that is more critical than anything we have made up till now, so a system watchdog was the first thing to check, and we learned it is already implemented :slightly_smiling_face:

The next thing we still have to act on is the network connection. We have seen a lot of internet failure cases where, for some reason, the network is connected and internet should be available, but there is no connection to the outside world: NetworkManager thinks all is good, but neither our own systems nor balenaCloud can be reached.

@fokko We support the docker-compose standard healthchecks, which you can read more about here: https://docs.docker.com/compose/compose-file/compose-file-v2/#healthcheck
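As a rough illustration of what such a healthcheck could run, here is a minimal sketch of a probe script. In the compose file, the `healthcheck` section's `test` key would point at it, alongside `interval`, `timeout` and `retries`. The URL, port and `/health` path are assumptions made up for this example, not something balenaOS provides; the only fixed convention is the exit code, where 0 means healthy and 1 means unhealthy.

```python
import sys
import urllib.error
import urllib.request

# Hypothetical endpoint exposed by your own application; adjust to your service.
HEALTH_URL = "http://localhost:8080/health"


def check(url: str = HEALTH_URL, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status or malformed URL.
        return False


def exit_code(healthy: bool) -> int:
    """Map the probe result to the healthcheck convention: 0 healthy, 1 unhealthy."""
    return 0 if healthy else 1


if __name__ == "__main__":
    sys.exit(exit_code(check()))
```

Keeping the probe this small is deliberate: it runs on every interval, so it should finish well inside the configured timeout.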

If a healthcheck fails, do you restart just the container? Can you enable a system reboot?

Hey Jason,
This is currently not possible. From the point of view of the engine, the health checks are only used to mark the state of the container. The container is restarted when marked unhealthy because the supervisor monitors the containers and takes care of restarting them if they are marked as unhealthy or if they exit.
Theoretically this could be implemented in the supervisor, but it is a tricky one, as the device could get stuck in a reboot loop if the container keeps getting marked as unhealthy. I wonder if there is a simpler solution to the problem you might be facing. Would you mind sharing a bit more about why you would need this?

If I cannot connect to a network with access to the VPN for 30 minutes, I’d like to reboot the system.

Hello, perhaps a very simple privileged container would suffice? It would run a test loop that checks network connectivity and issues a system reboot via the reboot command once a specific network-loss threshold is reached.
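A minimal sketch of that loop, under several assumptions: the probe target (`CHECK_HOST` and `CHECK_PORT`) is a placeholder that you would replace with something only reachable when the VPN is up, the 30-minute limit is taken from the request above, and a plain `reboot` call is assumed to work from inside the privileged container, which depends on your image and setup.

```python
import socket
import subprocess
import time

# Placeholder probe target (assumption); replace with a host behind your VPN.
CHECK_HOST = "1.1.1.1"
CHECK_PORT = 443
CHECK_INTERVAL_S = 60        # how often to probe
OUTAGE_LIMIT_S = 30 * 60     # reboot after 30 minutes of continuous failure


def is_online(host=CHECK_HOST, port=CHECK_PORT, timeout=5.0):
    """Return True if a TCP connection to the probe target succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class OutageTracker:
    """Tracks how long connectivity has been continuously lost."""

    def __init__(self, limit_s):
        self.limit_s = limit_s
        self.down_since = None

    def update(self, online, now):
        """Record one probe result; return True once the outage limit is reached."""
        if online:
            self.down_since = None  # any success resets the outage window
            return False
        if self.down_since is None:
            self.down_since = now
        return now - self.down_since >= self.limit_s


def reboot():
    # Assumes the container is allowed to reboot the host this way.
    subprocess.run(["reboot"], check=False)


def main():
    tracker = OutageTracker(OUTAGE_LIMIT_S)
    while True:
        if tracker.update(is_online(), time.monotonic()):
            reboot()
            return
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    main()
```

Because the outage window only starts counting after the first failed probe following boot, a device that stays offline reboots at most once per limit period rather than looping rapidly.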