Hey all,
I am seeing a pretty verbose and mysterious crash of the balenad process. I posted the full traceback here:
I am not sure how to read it, but it was the best lead I could find. I -think- that balenad is hanging on something, and getting killed by the watchdog process.
Just to add to my previous reply, it looks like your Raspberry Pi Zero was under heavy load.
We have a workaround to try to solve this, but remember it’s not a true fix. You might consider increasing the watchdog timeout. In some cases, 12 minutes might not be enough if the device is under significant load, especially on some low-powered devices such as the Pi Zero. In other cases 12 min is way longer than we want to wait to restart the engine when issues are encountered.
Instead, we are approaching this in a couple of ways:
adjust critical hostOS services like the engine to always take priority over non-critical applications when under very heavy load
change the requirements of the healthcheck to complete with fewer resources, but still accurately reflect whether the engine is “healthy”
In the short term, for your application you could set the cpu_shares: value for your high I/O services to 512 or similar in your docker-compose file.
From the Docker documentation for --cpu-shares: "Set this flag to a value greater or less than the default of 1024 to increase or reduce the container’s weight, and give it access to a greater or lesser proportion of the host machine’s CPU cycles. This is only enforced when CPU cycles are constrained. When plenty of CPU cycles are available, all containers use as much CPU as they need. In that way, this is a soft limit. --cpu-shares does not prevent containers from being scheduled in swarm mode. It prioritizes container CPU resources for the available CPU cycles. It does not guarantee or reserve any specific CPU access."
This will reduce the stress on the hostOS when resources are constrained, hopefully with minimal impact to your application performance.
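As a rough sketch of the cpu_shares suggestion, a docker-compose file could look something like the fragment below. The service name and build path are made up for illustration, and 512 is just the example value mentioned above, so adjust both to match your own application.

```yaml
version: "2.1"

services:
  # "data-logger" is a hypothetical name standing in for your high I/O service
  data-logger:
    build: ./data-logger   # hypothetical build context
    # Half the default weight of 1024. This is a soft limit that only
    # matters when CPU cycles are constrained.
    cpu_shares: 512
```

Services without a cpu_shares entry keep the default weight of 1024, so under contention they would be scheduled ahead of this one.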
Additionally, increasing the watchdog interval might help to mitigate this issue until a proper solution is in place. The default is 360 seconds, but this can be changed:
Remount the root filesystem as read-write: mount -o remount,rw /
Run systemctl edit --full balena
Add a WatchdogSec entry to the Service section of the unit file. You may need to create the section if it is not present. For example:
[Service]
WatchdogSec=720
Remount the root filesystem as read-only: mount -o remount,ro /
I am not sure if it is necessary to restart the service.