Healthcheck and watchdog

I’ve tried various methods to make my network connections more reliable. I think it has to be expected that routers reboot, routing tables change, DHCP services eventually get saturated, etc. What has proven reliable is manually resetting boards when they don’t show back up on the network.

I tried manually using the watchdog, but it seems you already have some things tied into it and I ended up with a board that continuously reset without ever getting back on the network. I could probably keep going down this route, but I read somewhere you were doing something with Docker HEALTHCHECK.

So, I added

HEALTHCHECK --interval=2m --timeout=20s --retries=15 --start-period=10m CMD "wget --spider > /dev/null 2>&1"

to my Dockerfile.

Currently, I have a board that is not on the VPN and not responding on the LAN either. What do I need to do to make sure it reboots when it cannot reach the Internet for 30 minutes straight?

As far as I understand, healthchecks are informational only. Our diagnostics, for example, will flag unhealthy services: Check Descriptions - Balena Documentation. You could add something like docker-autoheal to your compose file to restart unhealthy services, but that doesn’t restart the device or anything else.

The ideal solution is actually understanding why the device loses connectivity. For that you might want to enable persistent logging for the device. With that, you can see the logs from the previous boot which should help debugging.

But to build what you’ve asked, you need a service that periodically checks whether the device is online or not, and then uses the supervisor API to reboot the device when necessary. Take a look at the reboot endpoint.

@Ereski while that is the case for Docker, that hasn’t been my experience with Balena Engine. I may be wrong, and am doubting myself now, but I’m pretty sure that when I have had health checks set up on Balena devices and the status didn’t change to healthy, the supervisor restarted my container.

A few examples of health checks I’m currently using for reference:

HEALTHCHECK --start-period=20s --timeout=30s CMD curl --silent --fail || exit 1

HEALTHCHECK --timeout=10s CMD curl --silent --fail

Here is the Docker docs with an example: Dockerfile reference | Docker Documentation

If you use wget instead of curl, make sure it is installed in your container.

Hey, the Supervisor does not restart a service if it becomes unhealthy. Once the service has been installed (container created) and started (running), the Supervisor does not monitor the service. If the service stops, the engine will apply the restart policy, and the same applies if it becomes unhealthy.

If the device is losing connectivity, I don’t think restarting the whole device is the solution. I would say enable persistent logging and restart the device to recover it. After that, check the NetworkManager service with journalctl -fn 200 -u NetworkManager.service and look for errors.

ugh. healthcheck was mentioned related to watchdog. How do I use the watchdog???

Would you have a link for that? I am not aware of any balenaOS feature that ties healthchecks with the watchdog, unless you’re referring to this: GitHub - balena-os/healthdog-rs: Helper program that connects external periodic heathchecks with systemd's watchdog support

If you mean the Linux watchdog, that usually needs kernel interaction and is probably overly complex for a network connection check. Did you give your container some sort of access to the Balena kernel? I would be interested in the setup; I imagine by default it just tries to restart the container?

Not sure what your setup is, but it would probably be better to build a check into whatever app or process is running in your container. It could then trigger a device restart via the balena Supervisor API: Interacting with the balena Supervisor - Balena Documentation.

Here is how these guys do internet checks for Wi-Fi connect via shell that may give some ideas: wifi-connect/ at master · balena-os/wifi-connect · GitHub

The health dog solution above sounds interesting too, I will be checking that out.

Hi, just reading through this thread. To clear up the confusion, I confirm that unhealthy containers will be restarted by balenaEngine.

Also, I strongly suggest you spend some time debugging the network issues before falling back to a device reboot approach. BalenaOS uses NetworkManager’s connectivity checks and it will try to access an external URL every hour. It will then adjust the routing table metrics of the different interfaces to reflect their connectivity status. So one solution could be to add a different network interface, like WiFi, to provide redundancy and let NetworkManager perform the recovery for you.

Anyway, to implement a reboot I would suggest a container service that checks for connectivity and on failure performs a reboot using the supervisor API described in Interacting with the balena Supervisor - Balena Documentation.
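
For reference, here is a minimal sketch of what such a service could look like in Python. It is not an official balena example. It assumes the service carries the io.balena.features.supervisor-api label so that BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY are set in the container, and it calls the supervisor’s POST /v1/reboot endpoint. The probe target (1.1.1.1 on port 53), the 2 minute interval, and the function names are my own placeholders; the 30 minute threshold comes from the original question.

```python
import os
import socket
import time
import urllib.request

CHECK_INTERVAL_S = 120       # seconds between checks (assumed value)
MAX_OFFLINE_S = 30 * 60      # reboot after 30 minutes without connectivity


def is_online(host="1.1.1.1", port=53, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    The default target is only a placeholder; use a host you trust.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_reboot(offline_since, now, max_offline_s=MAX_OFFLINE_S):
    """True once the device has been offline for max_offline_s or longer."""
    return offline_since is not None and now - offline_since >= max_offline_s


def reboot_device():
    """Ask the supervisor to reboot the device via POST /v1/reboot."""
    address = os.environ["BALENA_SUPERVISOR_ADDRESS"]
    api_key = os.environ["BALENA_SUPERVISOR_API_KEY"]
    request = urllib.request.Request(
        f"{address}/v1/reboot?apikey={api_key}",
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


def main():
    offline_since = None  # monotonic timestamp of the first failed check
    while True:
        now = time.monotonic()
        if is_online():
            offline_since = None
        elif offline_since is None:
            offline_since = now
        if should_reboot(offline_since, now):
            try:
                reboot_device()
            except OSError as error:
                # Supervisor not reachable right now; retry on the next pass.
                print(f"reboot request failed: {error}")
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    main()
```

The supervisor API is reached locally on the device, so the reboot request does not depend on the Internet connection being up. The timer lives in memory, so if the container itself restarts, the 30 minutes start counting again.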