I’ve tried various methods to make my network connections more reliable. I think it has to be expected that routers reboot, routing tables change, DHCP services eventually get saturated, etc. What has proven reliable is manually resetting boards when they don’t show back up on the network.
I tried manually using the watchdog, but it seems you already have some things tied into it and I ended up with a board that continuously reset without ever getting back on the network. I could probably keep going down this route, but I read somewhere you were doing something with Docker HEALTHCHECK.
Currently, I have a board that is not on the VPN and not responding on the LAN either. What do I need to do to make sure it reboots when it cannot reach the Internet for 30 minutes straight?
As far as I understand, healthchecks are informational only. Our diagnostics, for example, will flag unhealthy services: Check Descriptions - Balena Documentation. You could add something like docker-autoheal to your compose file to restart unhealthy services, but that doesn’t restart the device or anything else.
The ideal solution is actually understanding why the device loses connectivity. For that you might want to enable persistent logging for the device. With that, you can see the logs from the previous boot which should help debugging.
But to build what you’ve asked, you need a service that periodically checks whether the device is online or not, and then uses the supervisor API to reboot the device when necessary. Take a look at the reboot endpoint.
@Ereski while that is the case for Docker, that hasn’t been my experience with Balena Engine. I may be wrong, and am doubting myself now, but I’m pretty sure that when I have had health checks set up on Balena devices and the status doesn’t change to healthy, the supervisor has restarted my container.
A few examples of health checks I’m currently using for reference:
Hey, the Supervisor does not restart a service if it becomes unhealthy. Once the service has been installed (container created) and started (running), the Supervisor does not monitor the service. If it stops, the engine will apply the restart policy, and the same applies if it’s unhealthy.
If the device is losing connectivity I don’t think restarting the whole device is the solution. I would say to enable persistent logging and restart the device to recover it. After that, check the NetworkManager service with journalctl -fn 200 -u NetworkManager.service and look for errors.
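As a rough sketch of that log check once persistent logging is enabled: the function below asks journalctl for NetworkManager messages at error priority from the previous boot. The function name is made up for this example, and it assumes you run it from a shell where journalctl can read the host journal.

```shell
#!/bin/sh
# Hypothetical helper: print NetworkManager messages of priority "err" or
# worse from the previous boot (-b -1). This only returns anything if the
# journal survived the reboot, i.e. persistent logging is enabled.
nm_errors_previous_boot() {
    journalctl --no-pager -b -1 -p err -u NetworkManager.service
}
```

Running journalctl --list-boots first shows which boots the journal still holds.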
If you mean the Linux Watchdog, that usually benefits from kernel interaction and is probably overly complex for a network connection check. Did you provide it some sort of access to the Balena Kernel? Would be interested in the setup, I imagine by default it just tries to restart the container?
Not sure what setup you have, but it would probably be better to build a check into whatever app or process is running in your container. It could then call a device restart via the Balena Supervisor: Interacting with the balena Supervisor - Balena Documentation.
Hi, just reading through this thread. To clear up the confusion, I confirm that unhealthy containers will be restarted by balenaEngine.
Also, I strongly suggest you spend some time debugging the network issues before falling back to a device reboot approach. BalenaOS uses NetworkManager’s connectivity checks and it will try to access an external URL every hour. It will then adjust the routing table metrics of the different interfaces to reflect their connectivity status. So one solution could be to add a different network interface, like WiFi, to provide redundancy and let NetworkManager perform the recovery for you.
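To see what NetworkManager currently makes of the connection, something like the following could be run from a host OS shell. This is only a sketch: it assumes nmcli is available there, and the helper names are hypothetical.

```shell
#!/bin/sh
# Ask NetworkManager to re-run its connectivity check and print the result
# (one of: none, portal, limited, full, unknown).
connectivity_state() {
    nmcli networking connectivity check
}

# Succeeds only when NetworkManager reports full connectivity.
is_fully_connected() {
    [ "$(connectivity_state)" = "full" ]
}
```

Comparing that result with the metrics shown by ip route show default gives an idea of which interface NetworkManager currently prefers.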
It is somewhat exhausting to retrace how a search for watchdog support tied into BalenaOS points to healthchecks and not to this healthdog application. Anyway, this is what I’m looking for and I’ll give it a shot. Enabling this in the system seems unreasonably convoluted and irregular.
I must be in a bad mood. Can we get a bit of consistency in the instructions? I’m relieved I didn’t try to build and install healthdog on my own, because it is already installed!
Spawning shell...
=============================================================
Welcome to balenaOS
=============================================================
root@a49547f:~# healthdog
Required option 'healthcheck' missing
Usage: healthdog [-p PID] -c COMMAND [-h]
Options:
-p, --pid PID pid to send watchdog events for
-c, --healthcheck COMMAND
Set healthcheck command
-h, --help Print this help menu
This continues to be confusing. Apparently systemd has various definitions of the term “watchdog”, or at least it is not used exclusively for the hardware watchdog, which just makes it all meaningless to read.
So, healthdog on its own may be addressing systemd’s own software watchdog and not actually trigger a hardware reboot (or, somewhat more ideally, shutdown and reboot) if the dog’s not pet.
All these software layers that aren’t documented well between me and the hardware watchdog are driving me up the wall.
Despite whatever advice, I simply want to do a hardware reboot if I haven’t been on a network that can reach the VPN for 30 minutes. I’d appreciate any direct steps on creating that configuration.
BalenaOS does in fact contain the healthdog binary, and of course systemd has concepts for all this, but BalenaOS has no feature that does what you want, i.e. reboot the whole device on some condition.
We want our devices to be as stable as possible and therefore prefer fixing the underlying issue.
If you’re convinced that rebooting the device is the best option, it should be as simple as adding a second container to your app which has the io.balena.features.supervisor-api: 1 label set and creating a script that does something like this:
#!/bin/sh
# Give the device 5 minutes after boot to establish a connection; this also
# leaves time to kill this container if there's an issue.
sleep 300
while true; do
    # Probe an external URL quietly; if it fails, ask the supervisor to reboot.
    curl -fsS --max-time 10 -o /dev/null http://google.com ||
        curl -X POST --header "Content-Type:application/json" \
            "$BALENA_SUPERVISOR_ADDRESS/v1/reboot?apikey=$BALENA_SUPERVISOR_API_KEY"
    sleep 60
done
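Note that the loop above reboots on the first failed check. For the "offline for 30 minutes" requirement from earlier in the thread, a variant could remember when the last check succeeded and only reboot once that is too long ago. This is a sketch under assumptions, not a tested recipe: CHECK_URL, OFFLINE_LIMIT, CHECK_INTERVAL and all function names are hypothetical, and it reads /proc/uptime so that clock jumps (e.g. after an NTP sync) don't distort the elapsed time.

```shell
#!/bin/sh
# Hypothetical tunables for this sketch.
CHECK_URL="${CHECK_URL:-http://google.com}"
OFFLINE_LIMIT="${OFFLINE_LIMIT:-1800}"    # seconds offline before rebooting (30 min)
CHECK_INTERVAL="${CHECK_INTERVAL:-60}"    # seconds between checks

# Seconds since boot, taken from the first field of /proc/uptime.
now() {
    cut -d. -f1 /proc/uptime
}

# Succeeds if the endpoint answers within 10 seconds.
is_online() {
    curl -fsS --max-time 10 -o /dev/null "$CHECK_URL"
}

# Ask the supervisor to reboot the device.
request_reboot() {
    curl -fsS -X POST --header "Content-Type:application/json" \
        "$BALENA_SUPERVISOR_ADDRESS/v1/reboot?apikey=$BALENA_SUPERVISOR_API_KEY"
}

# should_reboot LAST_OK NOW: succeeds when the last good check is at least
# OFFLINE_LIMIT seconds old.
should_reboot() {
    [ $(( $2 - $1 )) -ge "$OFFLINE_LIMIT" ]
}

# Main loop: record every successful check, reboot once the gap is too long.
# If the reboot request is refused, it is retried on the next iteration.
watch_network() {
    last_ok=$(now)
    while true; do
        if is_online; then
            last_ok=$(now)
        elif should_reboot "$last_ok" "$(now)"; then
            request_reboot
        fi
        sleep "$CHECK_INTERVAL"
    done
}
```

Calling watch_network at the end of the script, for example as the container's start command, starts the loop.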
This sometimes works. Other times it fails with: Updates are locked: EEXIST: file already exists, open '/mnt/root/tmp/balena-supervisor/services/1793520/main/updates.lock
Thanks for getting back. If you are looking for healthdog specific documentation, we don’t capture it in its own section, but do have its usage highlighted with supervisor here: Balena Device Debugging Masterclass - Balena Documentation
Hi Jason, I am one of the balenaOS maintainers and just want to make sure that we understand your request. From what I read, you would like to have an OS feature that will reboot the device if the network is not accessible in a given amount of time.
We don’t currently have such a feature, and our recommendation has been to provide it in an application container that will query a network endpoint and trigger a reboot via the supervisor API.
The rationale for this is that BalenaOS is meant to be a light hypervisor, so functionality that can be implemented in user applications is not moved down to the OS to keep it light, efficient and simple.
There is also some discussion above about watchdogs. The big picture is probably not well documented, as you have found out, but basically we have systemd running a system hardware-based watchdog (so the system will reboot if systemd does not refresh the watchdog), and the unit watchdogs for system services, so systemd will restart them if they become unresponsive. We have extended the systemd software watchdog with healthdog, which allows running custom external healthchecks periodically and connecting them to systemd’s software watchdog support. There is no customization offered for this watchdog functionality.
In summary, does the proposal of running a network monitoring service work for your use case?
Did you have a chance to read my teammate Alex’s message? Can you solve this issue by running a network monitoring service as part of your application?
If not, please let us know why not. We appreciate your feedback.
On another note, can you please elaborate on when you saw the following error?
Updates are locked: EEXIST: file already exists, open '/mnt/root/tmp/balena-supervisor/services/1793520/main/updates.lock
It’d help us if you could provide more context and maybe even illustrate it with an example to reproduce it, so we can provide more insight.
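If the update lock turns out to be what blocks the reboot, one option to consider: the supervisor's v1 reboot endpoint accepts a force property in the request body that overrides update locks. A minimal sketch, with hypothetical function names. Forcing past a lock defeats whatever the lock was protecting, so it is worth finding out who holds the lock first.

```shell
#!/bin/sh
# Build the JSON body for POST /v1/reboot. Passing "true" asks the
# supervisor to override update locks; anything else sends an empty object.
reboot_body() {
    if [ "${1:-false}" = "true" ]; then
        printf '{"force": true}'
    else
        printf '{}'
    fi
}

# Request a reboot even if an update lock is currently held.
force_reboot() {
    curl -fsS -X POST --header "Content-Type:application/json" \
        --data "$(reboot_body true)" \
        "$BALENA_SUPERVISOR_ADDRESS/v1/reboot?apikey=$BALENA_SUPERVISOR_API_KEY"
}
```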