Mysterious balenad crash

Hey all,
I am seeing a pretty verbose and mysterious crash of the balenad process. I posted the full traceback here:

I am not sure how to read it, but it was the best lead I could find. I -think- that balenad is hanging on something, and getting killed by the watchdog process.

Any leads appreciated.


Hello @mattyg, first of all, apologies for the late reply.

Did you solve this issue? I hope so :slight_smile:

Having said that, do you remember how you arrived at this situation? Did you perform any action on the balenaEngine? Thanks!

Just to complement my previous reply: it looks like your Raspberry Pi Zero came under heavy load.
We do have a workaround, but keep in mind it is not a true fix: you might consider increasing the watchdog timeout. In some cases even 12 minutes might not be enough when the device is under significant load, especially on low-powered devices such as the Pi Zero; in other cases 12 minutes is far longer than we want to wait to restart the engine when issues are encountered.

Instead, we are approaching this in a couple of ways:

  • adjust critical hostOS services like the engine to always take priority over non-critical applications when under very heavy load
  • change the requirements of the healthcheck to complete with fewer resources, but still accurately reflect whether the engine is “healthy”

In the short term, for your application you could set the cpu_shares: value for your high I/O services to 512 or similar in your docker-compose file.

Set this flag to a value greater or less than the default of 1024 to increase or reduce the container’s weight, and give it access to a greater or lesser proportion of the host machine’s CPU cycles. This is only enforced when CPU cycles are constrained. When plenty of CPU cycles are available, all containers use as much CPU as they need. In that way, this is a soft limit. --cpu-shares does not prevent containers from being scheduled in swarm mode. It prioritizes container CPU resources for the available CPU cycles. It does not guarantee or reserve any specific CPU access.
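For example, a minimal sketch of what this could look like in a compose file (the service names are made up for illustration; only the cpu_shares: 512 value comes from the suggestion above):

	services:
	  high-io-worker:          # hypothetical high I/O service, deprioritized
	    build: ./high-io-worker
	    cpu_shares: 512        # half the default weight of 1024
	  main-app:                # hypothetical service left at the default
	    build: ./main-app
	    # no cpu_shares here, so it keeps the default of 1024 and is favoured
	    # over high-io-worker when CPU cycles are constrained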

This will reduce the stress on the hostOS when resources are constrained, hopefully with minimal impact on your application's performance.

Additionally, increasing the watchdog interval might help to mitigate this issue until we have a proper solution in place. The default is 360 seconds, but it can be changed:

  1. Remount the root filesystem as read-write: mount -o remount,rw /
  2. Run systemctl edit --full balena
  3. Add a WatchdogSec entry to the [Service] section of the unit file (create the section if it is not present), like this:
	[Service]
	WatchdogSec=720
  4. Remount the root filesystem as read-only: mount -o remount,ro /
  5. Restart the service so the new timeout takes effect: systemctl restart balena
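For reference, the same change can be made non-interactively with a systemd drop-in file instead of systemctl edit --full. This is only a sketch: the drop-in path and file name are illustrative, while the balena service name and the 720-second value come from the steps above.

	# make the root filesystem writable
	mount -o remount,rw /

	# add a drop-in that overrides only WatchdogSec for the balena service
	mkdir -p /etc/systemd/system/balena.service.d
	cat > /etc/systemd/system/balena.service.d/watchdog.conf <<-'EOF'
	[Service]
	WatchdogSec=720
	EOF

	# reload unit files and restart the engine so the new timeout applies
	systemctl daemon-reload
	systemctl restart balena

	# put the root filesystem back to read-only
	mount -o remount,ro /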

Let us know if that worked :slight_smile:
