Balena engine not starting due to socket busy error

We recently had to power cycle a device due to some of the hardware components losing connection and when the device came back online the balena engine was not able to startup. When running any balena commands such as balena container ls we would get the following error.

Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running

In order to understand the issue a little better we looked at the systemctl status balena.

balena.service - Balena Application Container Engine
   Loaded: loaded (/lib/systemd/system/balena.service; enabled; vendor preset: enabled)
   Active: inactive (dead) (Result: timeout) since Thu 2021-03-11 12:03:20 UTC; 3h 59min ago
     Docs: https://www.balena.io/docs/getting-started
  Process: 1441 ExecStart=/usr/bin/healthdog --healthcheck=/usr/lib/balena/balena-healthcheck /usr/bin/balenad --experimental --log-driver=journald -s overlay2 -H>
 Main PID: 1441 (code=exited, status=0/SUCCESS)
    Tasks: 11 (limit: 4296)
   CGroup: /system.slice/balena.service
           └─2039 balena-engine-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/a09e176ffab64b3bd53>

Mar 11 16:02:01 0a084f9 systemd[1]: Dependency failed for Balena Application Container Engine.
Mar 11 16:02:01 0a084f9 systemd[1]: balena.service: Job balena.service/start failed with result 'dependency'.

Given the above we tried to restart the service using systemctl restart balena. However the following error was returned

A dependency job for balena.service failed. See 'journalctl -xe' for details.

The next step of debugging was to look at journalctl -xe

Mar 11 16:10:07 0a084f9 systemd[1]: resin-supervisor.service: Control process exited, code=exited, status=3/NOTIMPLEMENTED
Mar 11 16:10:07 0a084f9 systemd[1]: resin-supervisor.service: Failed with result 'exit-code'.
Mar 11 16:10:07 0a084f9 systemd[1]: Failed to start Balena supervisor.
Mar 11 16:10:12 0a084f9 systemd[1]: balena-engine.socket: Failed to create listening socket (/var/run/balena-engine.sock): Address already in use
Mar 11 16:10:12 0a084f9 systemd[1]: balena-engine.socket: Failed to listen on sockets: Address already in use
Mar 11 16:10:12 0a084f9 systemd[1]: balena-engine.socket: Failed with result 'resources'.
Mar 11 16:10:12 0a084f9 systemd[1]: Failed to listen on Docker Socket for the API.
Mar 11 16:10:12 0a084f9 systemd[1]: Dependency failed for Balena Application Container Engine.
Mar 11 16:10:12 0a084f9 systemd[1]: balena.service: Job balena.service/start failed with result 'dependency'.

We are a little lost when it comes to why the service is claiming that the address is in use. The only balena socket found using netstat -a | grep balena is the balena host socket.

unix  2      [ ACC ]     STREAM     LISTENING      18094 /var/run/balena-host.sock

What could the issue with the balena engine be? Should I just attempt to reboot the device?

To get a complete picture of the system here is the output from the failing lines of the device health checks. Additionally the supervisor is not running but this makes sense as the balena container engine is not running.

Given the local disk is reported as having errors I followed the section on storage media corruption in Balena Device Debugging Masterclass and it looks like there might be some issues with our SD card.

/etc/resin-supervisor/supervisor.conf: FAILED
md5sum: WARNING: 1 computed checksum did NOT match

So is the resolve to the above to swap out the SD card? If this were an issue that occurred on a customer site how might we get some early warning signs of this failure? Or perhaps is the issue with the localdisk a red herring and actually there is a deeper issue with the balena engine?

Our bio team need to use the device this morning, so we are going attempt another reboot and if that fails flash a new SD card. I will keep the old one in case it is useful for debugging in the future.

The device is back online and working. In the end I opted for running reboot in the HostOS. When the device came back online the balena engine started with no trouble and the device diagnostics are no longer reporting any issues with the write latency.

Hello, thanks for the detailed report and apologies for the slow reply here. Did you get a chance to run grep -v "/var/cache/ldconfig/aux-cache" /resinos.fingerprint | md5sum --quiet -c - again? If that continues to show issues I would suggest replacing the SD card as that might be the cause of the initial issues with the engine.

@nazrhom I did run again when the device had rebooted and found that the same config file was still corrupted. We will replace the SD card when next possible. If the device fails before we can replace it would be good to know the steps to recover the data if needed. Here is an except from the debugging masterclass.

Should the worst happen and a device is no longer bootable due to filesystem corruption, they still have the option of recovering data from the device. In this case, they’ll need to remove the media (SD card, HDD, etc.) from the device and then follow appropriate instructions. The Jellyfish knowledgebase includes an entry for doing this (search for ‘Recovery data from a dead device’).

Is this guide available somewhere where I might be able to see the steps required?

Hi

JellyFish is our internal thing - so you won’t be able to access it as is. I am pasting the relevant article below for your reference

The following is a brief guide assuming a multicontainer application, and how to get the volume data off the SD card on a Linux machine. This can be applied to any media.

  1. CLONE THE MEDIA (eg SD card)! This ensures we don’t do anything destructive to the card
  2. Insert cloned SD card into Linux machine, in this example I’m using a USB convertor. Once inserted, a quick way to check what the SD card disk is called is to run dmesg:
$ dmesg
[2492110.746490] scsi 6:0:0:0: Direct-Access     Generic- SD/MMC           1.00 PQ: 0 ANSI: 0 CCS
[2492110.747284] sd 6:0:0:0: Attached scsi generic sg4 type 0
[2492111.431717] sd 6:0:0:0: [sdd] 15523840 512-byte logical blocks: (7.95 GB/7.40 GiB)
[2492111.431965] sd 6:0:0:0: [sdd] Write Protect is off
[2492111.431973] sd 6:0:0:0: [sdd] Mode Sense: 03 00 00 00
[2492111.432205] sd 6:0:0:0: [sdd] No Caching mode page found
[2492111.433279] sd 6:0:0:0: [sdd] Assuming drive cache: write through
[2492111.445551]  sdd: sdd1 sdd2 sdd3 sdd4 < sdd5 sdd6 >
[2492111.448176] sd 6:0:0:0: [sdd] Attached SCSI removable disk

As you can see, the SD card is /dev/sdd and it contains 6 partitions.
3. We need to mount the data partition in the image, this requires an appropriate offset into the image and a mount point on the host machine to point the image at. The data partition is currently the last partition on the disk. We get the right partition by running disk on the image file:

$ fdisk -l /dev/sdd
Disk /dev/sdd: 7.4 GiB, 7948206080 bytes, 15523840 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x5bc7238a
Device     Boot   Start      End  Sectors  Size Id Type
/dev/sdd1  *       8192    90111    81920   40M  c W95 FAT32 (LBA)
/dev/sdd2         90112   729087   638976  312M 83 Linux
/dev/sdd3        729088  1368063   638976  312M 83 Linux
/dev/sdd4       1368064 15523839 14155776  6.8G  f W95 Ext'd (LBA)
/dev/sdd5       1376256  1417215    40960   20M 83 Linux
/dev/sdd6       1425408 15523839 14098432  6.7G 83 Linux

Now mount the data partition by using the offset. We want to create a new mount point on the local host to point the image at:

$ mkdir balena-data
$ mount -o loop,offset=$(( 1425408 * 512 )) /dev/sdd balena-data
  1. Now change to the /mnt/balena-data directory. This includes all the writable data from the SD card. We’ll need to navigate into the docker volumes directory to find the relevant volumes attached to services:
$ cd balena-data
$ ls -l
total 15
-rw-r--r--  1 root root     3 Oct 28 14:42 apps.json
drwxr-xr-x 15 root root  1024 Sep 17 14:01 docker
drwx------  2 root root 12288 Aug 13 09:49 lost+found
drwxr-xr-x  3 root root  1024 Oct 28 14:42 resin-data
$ cd docker/volumes
$ ls -l 
total 27
drwxr-xr-x 3 root root  1024 Oct 28 14:42 1513354_datatest
drwxr-xr-x 3 root root  1024 Oct 28 14:42 1513354_twocontainer
-rw-r--r-- 1 root root 32768 Oct 28 14:42 metadata.db

The directory name suffixes correlate to the volumes defined in the docker-compose manifest (if a single service application, there should only be one, with a UUID name). Any data created in the service will exist in these directories (within another _data directory) ready to be copied off:

ls -l 1513354_datatest/_data/
total 1
-rw-r--r-- 1 root root 7 Oct 28 14:47 testfile
  1. Once you’re done copying data, remember to unmount the data partition from your local machine:
$ umount balena-data/
1 Like