We recently had to power cycle a device because some of its hardware components had lost connection. When the device came back online, the balena engine was unable to start. Any balena command, such as balena container ls, returned the following error:
Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running
To understand the issue a little better, we looked at the output of systemctl status balena:
balena.service - Balena Application Container Engine
Loaded: loaded (/lib/systemd/system/balena.service; enabled; vendor preset: enabled)
Active: inactive (dead) (Result: timeout) since Thu 2021-03-11 12:03:20 UTC; 3h 59min ago
Docs: https://www.balena.io/docs/getting-started
Process: 1441 ExecStart=/usr/bin/healthdog --healthcheck=/usr/lib/balena/balena-healthcheck /usr/bin/balenad --experimental --log-driver=journald -s overlay2 -H>
Main PID: 1441 (code=exited, status=0/SUCCESS)
Tasks: 11 (limit: 4296)
CGroup: /system.slice/balena.service
└─2039 balena-engine-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/a09e176ffab64b3bd53>
Mar 11 16:02:01 0a084f9 systemd[1]: Dependency failed for Balena Application Container Engine.
Mar 11 16:02:01 0a084f9 systemd[1]: balena.service: Job balena.service/start failed with result 'dependency'.
Given the above, we tried to restart the service using systemctl restart balena. However, the following error was returned:
A dependency job for balena.service failed. See 'journalctl -xe' for details.
The next debugging step was to look at the output of journalctl -xe:
Mar 11 16:10:07 0a084f9 systemd[1]: resin-supervisor.service: Control process exited, code=exited, status=3/NOTIMPLEMENTED
Mar 11 16:10:07 0a084f9 systemd[1]: resin-supervisor.service: Failed with result 'exit-code'.
Mar 11 16:10:07 0a084f9 systemd[1]: Failed to start Balena supervisor.
Mar 11 16:10:12 0a084f9 systemd[1]: balena-engine.socket: Failed to create listening socket (/var/run/balena-engine.sock): Address already in use
Mar 11 16:10:12 0a084f9 systemd[1]: balena-engine.socket: Failed to listen on sockets: Address already in use
Mar 11 16:10:12 0a084f9 systemd[1]: balena-engine.socket: Failed with result 'resources'.
Mar 11 16:10:12 0a084f9 systemd[1]: Failed to listen on Docker Socket for the API.
Mar 11 16:10:12 0a084f9 systemd[1]: Dependency failed for Balena Application Container Engine.
Mar 11 16:10:12 0a084f9 systemd[1]: balena.service: Job balena.service/start failed with result 'dependency'.
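To pull just the failing units out of a journal excerpt like the one above, a small filter can help. This is only a sketch: failed_units is a hypothetical helper name, and the pattern assumes the exact "Failed with result" wording that systemd printed in these logs.

```shell
#!/bin/sh
# Hypothetical helper: read journal lines on stdin and print
# "<unit> <result>" for every line of the form
#   ... systemd[1]: <unit>: Failed with result '<result>'.
failed_units() {
    sed -n "s/.*systemd\[[0-9]*\]: \(.*\): Failed with result '\(.*\)'\.\$/\1 \2/p"
}

# Possible usage on the device (not run here):
#   journalctl -b --no-pager | failed_units
```

Applied to the excerpt above, this would list resin-supervisor.service with result exit-code and balena-engine.socket with result resources, which points at the socket unit as the dependency that balena.service is waiting on.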
We are a little lost as to why the service claims that the address is in use. The only balena socket found using netstat -a | grep balena is the balena host socket.
unix 2 [ ACC ] STREAM LISTENING 18094 /var/run/balena-host.sock
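One thing netstat would not reveal is a leftover filesystem entry at the socket path: binding a Unix socket to a path that already exists fails with "Address already in use" even when no process is listening on it. A minimal sketch for checking what, if anything, currently sits at that path follows. The describe_path helper name is hypothetical; the path is the one from the journalctl output.

```shell
#!/bin/sh
# Hypothetical helper: report what kind of filesystem entry exists at a path.
describe_path() {
    if [ -L "$1" ]; then
        echo "symlink"
    elif [ -S "$1" ]; then
        echo "socket"
    elif [ -d "$1" ]; then
        echo "directory"
    elif [ -e "$1" ]; then
        echo "other"
    else
        echo "missing"
    fi
}

# Path copied from the journalctl output.
describe_path /var/run/balena-engine.sock
```

Any result other than missing means something already occupies the path, which netstat would not show unless a process were actually listening on it.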
What could the issue with the balena engine be? Should I just attempt to reboot the device?
To give a complete picture of the system, here is the output from the failing lines of the device health checks. Additionally, the supervisor is not running, but this makes sense as the balena container engine is not running.
Given that the local disk is reported as having errors, I followed the section on storage media corruption in the Balena Device Debugging Masterclass, and it looks like there might be some issues with our SD card.
/etc/resin-supervisor/supervisor.conf: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
So is the resolution to swap out the SD card? If this were to occur on a customer site, how might we get early warning signs of this failure? Or is the issue with the local disk a red herring, and is there actually a deeper issue with the balena engine?
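On the early-warning question, one rough idea is to run the same kind of checksum comparison periodically and raise an alert when the number of mismatches is non-zero. The sketch below assumes a manifest in md5sum format. The count_checksum_failures name and the manifest path in the comment are hypothetical placeholders, not something balenaOS is known to provide under those names.

```shell
#!/bin/sh
# Hypothetical helper: count files whose checksum no longer matches the manifest.
# $1 is a manifest in the format produced by `md5sum <files>`.
count_checksum_failures() {
    # md5sum -c prints one "<file>: FAILED" line per mismatching file;
    # its summary warnings go to stderr and are discarded here.
    # grep -c exits non-zero when the count is 0, hence the `|| true`.
    md5sum -c "$1" 2>/dev/null | grep -c ': FAILED' || true
}

# Possible usage (placeholder path, not run here):
#   count_checksum_failures /path/to/manifest.md5
```

A count that rises from zero between runs would be the kind of signal we are hoping to catch before the engine fails to start.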