No more hardware or kernel lock-ups: system watchdog in 1.24.0


#1

We’ve just released resinOS 1.24.0 (for Raspberry Pi and BeagleBone devices at the moment, with more coming step by step), which comes with a system watchdog enabled by default. This change addresses some practical problems that people were having out in the field, as it protects the device from hardware lockups and kernel hangs. It is a first step towards a more comprehensive solution and thus does not cover all the bases (protecting the application or connectivity, for example), but it is a necessary step to try out how watchdogs like this work in practice for resin.io fleets. :dog2:

See more details in our announcement blogpost: Keeping Your System Running with a Host OS Watchdog!

Also, if you have any watchdog or reliability stories, share them with us here! :pencil2:


#2

Hi Gergely,

Since this post is getting somewhat outdated now, what is the current status of the watchdog in balenaOS 2.x?

Thanks,
Fokko


#5

hey @fokko, thanks for the follow-up :slight_smile: Indeed, a lot has changed.

We have a lot more healthchecks inside balenaOS (health checks on the supervisor, on the balenaEngine itself, as well as a basic watchdog). We will need to update our documentation on this and expand a lot more.

Also, you can add healthchecks to your application containers as well, which use balenaEngine’s functionality to keep the application healthy.

Is there anything specific you are looking for in the meantime? We will let you know as soon as we have more complete docs. :slight_smile:


#8

Health checks that work together with balenaEngine? That sounds interesting! So, how do we implement a health check on our container?

We are now building a system that is more critical than anything we have made so far, so a system watchdog was the first thing to check, and we learned that it is already implemented :slightly_smiling_face:

The next thing we still have to act on is the network connection. We have seen a lot of internet failures where, for some reason, the network is connected and internet should be available, but there is no connection to the outside world. At that point NetworkManager thinks all is good, but none of our systems, nor balenaCloud, is connected.


#9

@fokko We support the docker-compose standard healthchecks, which you can read more about here: https://docs.docker.com/compose/compose-file/compose-file-v2/#healthcheck