No more hardware or kernel lock-ups: system watchdog in 1.24.0

imrehg · December 6, 2016, 3:06pm

We’ve just released resinOS 1.24.0 (for Raspberry Pi and BeagleBone devices at the moment, with more coming step by step), which comes with a system watchdog enabled by default. This change addresses some practical problems that people were having out in the field, as it protects the device from hardware lockups and kernel hangs. It is a first step towards a more comprehensive solution and thus does not cover all the bases (protecting the application or connectivity, for example), but it is a necessary step to try out how watchdogs like this work in practice for resin.io fleets.

See more details in our announcement blogpost: Keeping Your System Running with a Host OS Watchdog!

Also, if you have any watchdog or reliability stories, share them with us here!

fokko · November 29, 2018, 1:04pm

Hi Gergely,

This post is getting somewhat outdated now. What is the current status of the watchdog in balenaOS 2.x?

Thanks,
Fokko

imrehg · November 29, 2018, 4:31pm

Hey @fokko, thanks for the follow-up. Indeed, a lot has changed.

We have a lot more healthchecks inside balenaOS (healthchecks on the supervisor, on the balenaEngine itself, as well as a basic watchdog). We will need to update our documentation on this and expand it a lot more.

You can also add healthchecks to your application containers, which use balenaEngine’s functionality to keep the application healthy.

Is there anything specific you are looking for in the meantime? We will let you know when we have more complete docs.

fokko · November 30, 2018, 9:16am

Health checks that work together with balenaEngine? That sounds interesting! So, how do we implement a health check on our container?

We are now building a system that is more critical than anything we have made up till now, so a system watchdog was the first thing to check, and we learned it is already implemented.

The next thing we still have to act on is the network connection. We have seen a lot of internet failure cases where, for some reason, the network is connected and internet should be available, but there is no connection to the outside world. At that point NetworkManager thinks all is good, but neither our systems nor balenaCloud can be reached.

CameronDiver · December 6, 2018, 5:43pm

@fokko We support the docker-compose standard healthchecks, which you can read more about here: https://docs.docker.com/compose/compose-file/compose-file-v2/#healthcheck
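As a rough illustration of what such a healthcheck could run, here is a minimal Python sketch of a probe. It assumes the application listens on a local TCP port; the port number (8080) and the function names are hypothetical and not taken from this thread. Docker treats exit code 0 from a healthcheck command as healthy and 1 as unhealthy.

```python
import socket


def is_healthy(host: str = "127.0.0.1", port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the application's port succeeds.

    The default port is a placeholder; use whatever port your application
    actually listens on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def exit_code(host: str = "127.0.0.1", port: int = 8080) -> int:
    """Map the probe result to a healthcheck exit code: 0 healthy, 1 unhealthy."""
    return 0 if is_healthy(host, port) else 1
```

A healthcheck `test` command can then call `exit_code()` and exit the process with its result. A plain TCP connect only shows that the port is open, so an application-level request may be a better probe if the service can hang while still accepting connections.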

jkridner · March 24, 2021, 5:50pm

If a healthcheck fails, do you restart just the container? Can you enable a system reboot?

nazrhom · April 2, 2021, 2:29pm

Hey Jason,
this is currently not possible. From the point of view of the engine, the health checks are only used to mark the state of the container. The container is restarted when marked unhealthy because the supervisor monitors the containers and takes care of restarting them if they are marked as unhealthy or if they exit.
Theoretically this could be implemented in the supervisor, but it is a tricky one, as the device could get stuck in a reboot loop if the container keeps getting marked as unhealthy. I wonder if there is a simpler solution to the problem you might be facing. Would you mind sharing a bit more about why you would need this?

jkridner · April 18, 2021, 5:18am

If I cannot connect to a network with access to the VPN for 30 minutes, I’d like to reboot the system.

ab77 · April 30, 2021, 7:52pm

Hello, perhaps a very simple privileged container would suffice, running a test loop that checks for network connectivity and issues a system reboot via the reboot command if a specific network loss threshold is reached?
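A minimal sketch of that test loop in Python might look like the following. The probe target (`example.com:443`), the 60-second interval, and all names are placeholders chosen for illustration; the 30-minute threshold comes from the request above. Whether calling `reboot` from inside a container actually reboots the host depends on how the container is set up, so treat that call as an assumption to verify on your device.

```python
import socket
import subprocess
import time
from typing import Optional

CHECK_HOST = "example.com"   # placeholder: use a host only reachable when the link really works
CHECK_PORT = 443
CHECK_INTERVAL_S = 60        # assumed probe interval
REBOOT_AFTER_S = 30 * 60     # reboot after 30 minutes without connectivity


def can_connect(host: str = CHECK_HOST, port: int = CHECK_PORT, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the probe target succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class OutageTracker:
    """Tracks how long connectivity has been continuously lost."""

    def __init__(self, threshold_s: float) -> None:
        self.threshold_s = threshold_s
        self.down_since: Optional[float] = None

    def update(self, connected: bool, now: float) -> bool:
        """Record one probe result; return True once the outage reaches the threshold."""
        if connected:
            self.down_since = None  # any success resets the outage timer
            return False
        if self.down_since is None:
            self.down_since = now
        return now - self.down_since >= self.threshold_s


def watch() -> None:
    """Probe forever and issue a reboot once the outage threshold is reached."""
    tracker = OutageTracker(REBOOT_AFTER_S)
    while True:
        if tracker.update(can_connect(), time.monotonic()):
            # Assumption: the container has enough privileges for this to reboot the host.
            subprocess.run(["reboot"], check=False)
            return
        time.sleep(CHECK_INTERVAL_S)
```

Calling `watch()` as the container's main process starts the loop. Note that the reboot-loop concern raised earlier applies here too: a device that comes up with no working network would reboot every 30 minutes until connectivity returns.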
