Balena engine start failure

Device type: Raspberry Pi (v1 / Zero / Zero W)
OS version: balenaOS 2.46.1+rev1
Supervisor version: 10.6.27

Hi, we have had several instances where the device stops responding. It still shows up as online, but we cannot connect to it, and it seems that at least one of the containers is no longer functioning.

After having the customer power-cycle the device, we looked at the previous boot's journal via journalctl -b -1 and saw:

Apr 02 16:54:55 3bff1dd balenad[1390]: Failed to start containerd: timeout waiting for containerd to start
Apr 02 16:54:56 3bff1dd resin-supervisor[1379]: Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?
Apr 02 16:54:56 3bff1dd systemd[1]: balena.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 16:54:56 3bff1dd systemd[1]: balena.service: Failed with result 'exit-code'.
Apr 02 16:54:57 3bff1dd wpa_supplicant[876]: wlan0: CTRL-EVENT-SUBNET-STATUS-UPDATE status=0
Apr 02 16:54:56 3bff1dd systemd[1]: Failed to start Balena Application Container Engine.
Apr 02 16:54:58 3bff1dd systemd[1]: resin-supervisor.service: Control process exited, code=exited, status=3/NOTIMPLEMENTED
Apr 02 16:54:58 3bff1dd systemd[1]: resin-supervisor.service: Failed with result 'exit-code'.
Apr 02 16:54:58 3bff1dd systemd[1]: Failed to start Balena supervisor.
Apr 02 16:54:59 3bff1dd resin-supervisor[1875]: activating

This repeats. It is rarely viable to ask the customer to restart the device, so we are looking for ideas on how to debug this and make our containers more stable in production.

Thanks.

Hi there, sorry to hear about these troubles. The original Raspberry Pi 1 and the Pi Zero don’t have a lot of compute power because of their older, single-core processor, so that may be the culprit here if you are running multiple containers and those containers are running even moderate workloads.

With that said, could you share the output of ‘journalctl -u balena.service -t balenad’? That will give us a bit more visibility into what is occurring. Thanks!
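
If the journal gets saved to a text file the next time this happens, a small script can pull out just the engine start failures so they are easier to share. This is only a rough sketch: the marker strings are copied from the excerpt in the first post, the function name find_engine_failures is made up for illustration, and it assumes the script runs on a workstation against a saved copy of the journal rather than on the device itself.

```python
# Substrings copied from the journal excerpt earlier in this thread.
# They are an assumption about what a failed engine start looks like.
FAILURE_MARKERS = (
    "Failed to start containerd",
    "Failed to start Balena Application Container Engine",
    "Cannot connect to the balenaEngine daemon",
)


def find_engine_failures(journal_text):
    """Return the journal lines that contain any of the failure markers."""
    return [
        line
        for line in journal_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]
```

For example, with the journal saved as journal.txt (a hypothetical file name), `find_engine_failures(open("journal.txt").read())` returns the matching lines in their original order, timestamps included.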

Thanks for the response. It happens sporadically, and unfortunately we have lost the logs for that instance, so I will capture the output the next time it happens.

One question: when I run “top” to look at resource usage, I see that CPU usage is rarely greater than 40%, and memory usage is about 90% (i.e. 10% free). Is that indicative of resource constraints? I have not seen out-of-memory errors in the logs. Also, is a single-service installation more efficient resource-wise than a multi-service one? If so, we can run some experiments to see whether the problem still occurs with a single container.

Thanks

Hi there,
yes, a multi-service setup requires more resources than a single-service one, since each additional container running concurrently adds its own overhead.
Given the device’s limited computing power, I would suggest trying a single container where possible.
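
On the memory question: the "used" figure in top can include buffers and page cache that the kernel gives back under pressure, so 90% used does not necessarily mean the device is out of memory. The MemAvailable field in /proc/meminfo is the kernel's own estimate of how much memory new workloads can still get. A minimal sketch for reading it, assuming Python is available in one of your containers (the function names here are only illustrative):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_percent(meminfo):
    """Share of total RAM that the kernel estimates is still available."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]


def read_available_percent(path="/proc/meminfo"):
    """Read the given meminfo file and return the available percentage."""
    with open(path) as f:
        return available_percent(parse_meminfo(f.read()))
```

Logging the result of `read_available_percent()` every minute or so would show whether available memory drops sharply in the period before the engine fails to start, which the single top snapshot cannot tell you.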