Healthcheck and watchdog

I’ve tried various methods to make my network connections more reliable. I think it has to be expected that routers reboot, routing tables change, DHCP services eventually get saturated, etc. What has proven reliable is manually resetting boards when they don’t show back up on the network.

I tried manually using the watchdog, but it seems you already have some things tied into it and I ended up with a board that continuously reset without ever getting back on the network. I could probably keep going down this route, but I read somewhere you were doing something with Docker HEALTHCHECK.

So, I added

HEALTHCHECK --interval=2m --timeout=20s --retries=15 --start-period=10m CMD "wget --spider http://google.com > /dev/null 2>&1"

to my Dockerfile.

Currently, I have a board that is not on the VPN and not responding on the LAN either. What do I need to do to make sure it reboots when it cannot reach the Internet for 30 minutes straight?

As far as I understand, healthchecks are informational only. Our diagnostics, for example, will flag unhealthy services: Check Descriptions - Balena Documentation. You could add something like docker-autoheal to your compose file to restart unhealthy services, but that doesn’t restart the device or anything else.

The ideal solution is actually understanding why the device loses connectivity. For that you might want to enable persistent logging for the device. With that, you can see the logs from the previous boot which should help debugging.

But to build what you’ve asked, you need a service that periodically checks whether the device is online or not, and then uses the supervisor API to reboot the device when necessary. Take a look at the reboot endpoint.

@Ereski while that is the case for Docker, that hasn’t been my experience with Balena Engine. I may be wrong, and I’m doubting myself now, but I’m pretty sure that when I have had health checks set up on Balena devices and the status didn’t change to healthy, the supervisor restarted my container.

A few examples of health checks I’m currently using for reference:

HEALTHCHECK --start-period=20s --timeout=30s CMD curl --silent --fail http://127.0.0.1:9090/ || exit 1

HEALTHCHECK --timeout=10s CMD curl --silent --fail http://127.0.0.1:8081/fpm-ping

Here is the Docker docs with an example: Dockerfile reference | Docker Documentation

If you use wget instead of curl, make sure it is installed in your container.
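As a rough sketch of that advice, a healthcheck helper can probe with whichever client the image actually contains. The script path, the `check` argument, and the `PROBE_URL` variable are names made up for this example; the URL default is borrowed from the first healthcheck above.

```shell
#!/bin/sh
# Hypothetical healthcheck helper: probe a URL with whichever of curl or
# wget is installed in the image. PROBE_URL is an assumed variable name.
PROBE_URL="${PROBE_URL:-http://127.0.0.1:9090/}"

# Print the name of the first available HTTP client, or fail if none exists.
pick_client() {
    for tool in curl wget; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# probe URL: exit status 0 means healthy, 1 means unhealthy.
probe() {
    case "$(pick_client)" in
        curl) curl --silent --fail --max-time 10 --output /dev/null "$1" ;;
        wget) wget -q -T 10 --spider "$1" ;;
        *)    echo "neither curl nor wget is installed" >&2; return 1 ;;
    esac || return 1
}

# Only probe when called as "probe.sh check", so the functions can be
# sourced and tested without touching the network.
if [ "${1:-}" = "check" ]; then
    probe "$PROBE_URL"
fi
```

Assuming the file is copied to /usr/local/bin/probe.sh in the image, the Dockerfile line would look something like HEALTHCHECK --timeout=30s CMD /usr/local/bin/probe.sh check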

Hey, the Supervisor does not restart a service if it becomes unhealthy. Once the service has been installed (container created) and started (running), the Supervisor does not monitor the service. If it stops, the engine will apply the restart policy, and the same applies if it’s unhealthy.

If the device is losing connectivity I don’t think restarting the whole device is the solution. I would say to enable persistent logging and restart the device to recover it. After that, check the NetworkManager service with journalctl -fn 200 -u NetworkManager.service and look for errors.

ugh. healthcheck was mentioned related to watchdog. How do I use the watchdog???

Would you have a link for that? I am not aware of any balenaOS feature that ties healthchecks with the watchdog, unless you’re referring to this: GitHub - balena-os/healthdog-rs: Helper program that connects external periodic heathchecks with systemd's watchdog support

If you mean the Linux watchdog, that usually benefits from kernel interaction and is probably overly complex for a network connection check. Did you give it some sort of access to the Balena kernel? I would be interested in the setup; I imagine by default it just tries to restart the container?

I’m not sure what your setup is, but it would probably be better to build a check into whatever app or process is running in your container. It could then call a device restart via the Balena Supervisor: Interacting with the balena Supervisor - Balena Documentation.

Here is how these guys do internet checks for Wi-Fi connect via shell that may give some ideas: wifi-connect/start.sh at master · balena-os/wifi-connect · GitHub

The health dog solution above sounds interesting too, I will be checking that out.

Hi, just reading through this thread. To clear up the confusion, I confirm that unhealthy containers will be restarted by balenaEngine.

Also, I strongly suggest you spend some time debugging the network issues before falling back to a device reboot approach. BalenaOS uses NetworkManager’s connectivity checks and it will try to access an external URL every hour. It will then adjust the routing table metrics of the different interfaces to reflect their connectivity status. So one solution could be to add a different network interface, like WiFi, to provide redundancy and let NetworkManager perform the recovery for you.

Anyway, to implement a reboot I would suggest a container service that checks for connectivity and on failure performs a reboot using the supervisor API described in Interacting with the balena Supervisor - Balena Documentation.

It is somewhat exhausting to retrace how looking for watchdog support tied into BalenaOS points to healthchecks and not to this healthdog application. Anyway, this is what I’m looking for and I’ll give it a shot. Enabling this in the system seems unreasonably convoluted and irregular.

I must be in a bad mood. Can we get a bit of consistency in the instructions? I’m relieved I didn’t try to build and install healthdog on my own, because it is already installed!

    Spawning shell...
=============================================================
    Welcome to balenaOS
=============================================================
root@a49547f:~# healthdog
Required option 'healthcheck' missing

Usage: healthdog [-p PID] -c COMMAND [-h]

Options:
    -p, --pid PID       pid to send watchdog events for
    -c, --healthcheck COMMAND
                        Set healthcheck command
    -h, --help          Print this help menu

This continues to be confusing. Apparently systemd has various definitions of the term “watchdog”, or at least it is not used exclusively for the hardware watchdog, which just makes it all meaningless to read.

See systemd-system.conf

So, healthdog on its own may be addressing systemd’s own software watchdog and not actually trigger a hardware reboot (or, somewhat more ideally, shutdown and reboot) if the dog’s not pet.

All these software layers that aren’t documented well between me and the hardware watchdog are driving me up the wall.

Despite whatever advice, I simply want to do a hardware reboot if I haven’t been on a network that can reach the VPN for 30 minutes. I’d appreciate any direct steps on creating that configuration.

Oh, right, read-only file system, so I cannot just modify /etc/systemd/system.conf. Goodnight.

Hallo Jason,

BalenaOS does in fact contain the healthdog binary, and of course systemd has concepts for all this, but BalenaOS has no feature that does what you want, i.e. reboot the whole device on some condition.
We want our devices to be as stable as possible and therefore prefer fixing the underlying issue.

If you’re convinced that rebooting the device is the best option, it should be as simple as adding a second container to your app which has the io.balena.features.supervisor-api: 1 label set, and creating a script that does something like this:

sleep 300 # give device 5min after boot to establish connection and allow killing this container if there's an issue
while true; do
    curl http://google.com || curl -X POST --header "Content-Type:application/json" \
    "$BALENA_SUPERVISOR_ADDRESS/v1/reboot?apikey=$BALENA_SUPERVISOR_API_KEY"
    sleep 60
done

See also Interacting with the balena Supervisor - Balena Documentation
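The loop above reboots on the first failed request. A minimal sketch of the "30 minutes without connectivity" variant asked for earlier in the thread could track when the current outage began and only call the reboot endpoint once that outage has lasted long enough. CHECK_URL, FAIL_WINDOW, CHECK_INTERVAL and BOOT_GRACE are names invented for this example, not balena settings, and the `run` argument is just a convention of this sketch.

```shell
#!/bin/sh
# Sketch: reboot via the supervisor API only after connectivity has been
# down for a continuous window (default 30 minutes).
# All four variables below are hypothetical names chosen for this example.
CHECK_URL="${CHECK_URL:-http://google.com}"
FAIL_WINDOW="${FAIL_WINDOW:-1800}"      # seconds of continuous failure before reboot
CHECK_INTERVAL="${CHECK_INTERVAL:-60}"  # seconds between checks
BOOT_GRACE="${BOOT_GRACE:-300}"         # seconds to wait after the container starts

# Succeeds if CHECK_URL answers within 20 seconds.
is_online() {
    curl --silent --fail --max-time 20 --output /dev/null "$CHECK_URL"
}

# should_reboot FIRST_FAILURE NOW: succeeds once the outage that began at
# FIRST_FAILURE (epoch seconds, 0 means "currently online") has lasted
# at least FAIL_WINDOW seconds.
should_reboot() {
    [ "$1" -gt 0 ] && [ $(( $2 - $1 )) -ge "$FAIL_WINDOW" ]
}

# Same supervisor call as in the script above.
request_reboot() {
    curl -X POST --header "Content-Type:application/json" \
        "$BALENA_SUPERVISOR_ADDRESS/v1/reboot?apikey=$BALENA_SUPERVISOR_API_KEY"
}

main() {
    sleep "$BOOT_GRACE"
    first_failure=0
    while true; do
        now=$(date +%s)
        if is_online; then
            first_failure=0              # any success ends the outage
        elif [ "$first_failure" -eq 0 ]; then
            first_failure=$now           # first failed check of a new outage
        elif should_reboot "$first_failure" "$now"; then
            request_reboot
        fi
        sleep "$CHECK_INTERVAL"
    done
}

# Run the loop only when called as "script.sh run", so the functions can be
# sourced and tested without waiting or rebooting anything.
if [ "${1:-}" = "run" ]; then
    main
fi
```

Pointing CHECK_URL at something only reachable through the VPN would match the "cannot reach the VPN" condition more closely than google.com does, but which endpoint to use is left open here.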

Hope this helps

It very much helps, but documenting usage of the hardware watchdog might be vastly superior.

It sometimes works. Other times it gets: Updates are locked: EEXIST: file already exists, open '/mnt/root/tmp/balena-supervisor/services/1793520/main/updates.lock
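One possible reading of that message is that an update lock is held for the service when the reboot is requested. The supervisor API documentation describes an optional force property in the request body of the reboot endpoint that overrides update locks. A hedged sketch of passing it follows; FORCE_REBOOT is a variable name invented for this example, and whether overriding the lock is safe depends on why the lock is being held.

```shell
#!/bin/sh
# FORCE_REBOOT is a hypothetical variable name used only in this sketch.
# Set it to 1 to ask the supervisor to override update locks.
FORCE_REBOOT="${FORCE_REBOOT:-0}"

# Print the JSON request body for the reboot call.
reboot_body() {
    if [ "$FORCE_REBOOT" = "1" ]; then
        printf '%s' '{"force": true}'
    else
        printf '%s' '{}'
    fi
}

# Same endpoint as the script above, with the body attached.
request_reboot() {
    curl -X POST --header "Content-Type:application/json" \
        --data "$(reboot_body)" \
        "$BALENA_SUPERVISOR_ADDRESS/v1/reboot?apikey=$BALENA_SUPERVISOR_API_KEY"
}
```

Nothing runs when the file is loaded; a monitoring loop would call request_reboot at the point where it currently issues the plain curl request.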

Hi,

Thanks for getting back. If you are looking for healthdog specific documentation, we don’t capture it in its own section, but do have its usage highlighted with supervisor here: Balena Device Debugging Masterclass - Balena Documentation

Also, you can find more information about healthdog here: GitHub - balena-os/healthdog-rs: Helper program that connects external periodic heathchecks with systemd's watchdog support

As for your second comment, did you come across this error with the script that @hades32 mentioned?

Hi Jason, I am one of the balenaOS maintainers and just want to make sure that we understand your request. From what I read, you would like to have an OS feature that will reboot the device if the network is not accessible in a given amount of time.
We don’t currently have such a feature, and our recommendation has been to provide it in an application container that will query a network endpoint and trigger a reboot via the supervisor API.
The rationale for this is that BalenaOS is meant to be a light hypervisor, so functionality that can be implemented in user applications is not moved down to the OS to keep it light, efficient and simple.

There is also some discussion above about watchdogs. The big picture is probably not well documented, as you have found out, but basically we have systemd running a system hardware-based watchdog (so the system will reboot if systemd does not refresh the watchdog), and the unit watchdogs for system services, so systemd will restart them if they become unresponsive. We have extended the systemd software watchdog with healthdog, which allows running custom external periodic healthchecks and connecting them to systemd’s software watchdog support. There is no customization offered for this watchdog functionality.

In summary, does the proposal of running a network monitoring service work for your use case?

Hi Jason,

Did you have a chance to read my teammate Alex’s message? Can you solve this issue by running a network monitoring service as part of your application?

If not, please let us know why not. We appreciate your feedback.

On another note, can you please elaborate on when you saw the following error?

Updates are locked: EEXIST: file already exists, open '/mnt/root/tmp/balena-supervisor/services/1793520/main/updates.lock

It would help us if you could provide more context, and maybe even an example that reproduces it, so that we can provide more insight.

Cheers…