Balena Engine won't start after Host OS update

jpj625 · April 25, 2023, 9:21pm

After updating my Host OS to 2.113.18, my Pi4 no longer starts completely. Dashboard reports Online (VPN only) and doesn’t receive updates to metrics, logs, or container status (as expected in this state). I can attempt many of the dashboard tasks, but they fail. I cannot Identify, Shutdown, or upgrade Supervisor version.

Terminal to Host OS works, but commands via balena-engine all fail with:

ERROR: Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?

Attempting systemctl start balena results in “balena.service failed because the control process exited with error code.” Checking the journal shows “2/INVALIDARGUMENT” as the reason:

journalctl -xeu balena.service

systemd[1]: balena.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: balena.service: Failed with result ‘exit-code’.
systemd[1]: Failed to start Balena Application Container Engine.

referring to an earlier error:

balenad[629800]: time=“…” level=error msg=“failed to load container” container=ef…32f46f error=“Invalid container config.json, missing Config property”

My config.json seems normal and hadn’t been touched for months. (Presuming this is the config it’s talking about?)

root:~# cat /mnt/boot/config.json | jq

{    // redacted
  "apiEndpoint": "https://api.balena-cloud.com",
  "appUpdatePollInterval": 900000,
  "applicationId": "0000000",
  "deltaEndpoint": "https://delta.balena-cloud.com",
  "developmentMode": true,
  "deviceApiKey": "...",
  "deviceApiKeys": {
    "api.balena-cloud.com": "..."
  },
  "deviceId": 0000000,
  "deviceType": "raspberrypi4-64",
  "listenPort": "48484",
  "mixpanelToken": "...",
  "persistentLogging": true,
  "registered_at": 0000000000000,
  "registryEndpoint": "registry2.balena-cloud.com",
  "userId": "000000",
  "uuid": "...",
  "vpnEndpoint": "cloudlink.balena-cloud.com",
  "vpnPort": "443"
}

Google doesn’t give results for balena "Invalid container config.json" and I’m not finding the components of this message in source, so I’m at the end of my trail.

This is my home automation server, so I’d prefer not to burn it completely down. What other avenues can I explore?

mpous · April 26, 2023, 8:26am

Hello @jpj625 first of all welcome to the balena community!

This issue might have different causes. It can be related with the engine or the supervisor failing. What is the version of the Supervisor that you are running?

First of all, did you run the Diagnostics on balenaCloud? Could you please share the results here?

Could you please confirm the logs from balena ps? and to check the status run systemctl status balena?

On the other hand, could you please confirm that your device is not running out of storage? Run df -h and check that /mnt/data has enough space. If that’s the case, could you please delete anything to get more memory?

If this is not the case, let’s check the supervisor. Could you please run curl -I localhost:48484/v1/healthy on each container?

Could you please run ls -la /run | grep 'balena-engine' and share the results? If you see balena-engine.sock as a directory, could you please delete the directory and then run systemctl restart balena?

jpj625 · April 26, 2023, 3:51pm

Hi Marc! Thanks for the response.

What is the version of the Supervisor that you are running?

14.9.4. Attempting to update to 14.10.7 fails silently.

did you run the Diagnostics on balenaCloud?

Device health checks and diagnostics had previously silently failed to run, but are now working.
b2acb2363b80fef68716aafe1ffdbb46_checks_2023.04.26_15.20.41+0000.txt (1.3 KB)
b2acb2363b80fef68716aafe1ffdbb46_diagnostics_2023.04.26_15.19.05+0000.txt (594.3 KB)

Could you please confirm the logs from balena ps?

Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?

and to check the status run systemctl status balena?

root@b2acb23:~# systemctl status balena --no-pager -l
● balena.service - Balena Application Container Engine
     Loaded: loaded (/lib/systemd/system/balena.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/balena.service.d
             └─storagemigration.conf
     Active: activating (start) since Wed 2023-04-26 15:25:37 UTC; 13s ago
TriggeredBy: ● balena-engine.socket
       Docs: https://www.balena.io/docs/getting-started
   Main PID: 1462376 (balenad)
      Tasks: 17 (limit: 4196)
     Memory: 66.9M
     CGroup: /system.slice/balena.service
             ├─ 1460515 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1460507
             ├─ 1460737 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1460736
             ├─ 1460968 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1460963
             ├─ 1461203 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1461202
             ├─ 1461422 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1461421
             ├─ 1461655 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1461649
             ├─ 1461889 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1461883
             ├─ 1462132 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1462125
             ├─ 1462376 /usr/bin/balenad --experimental --log-driver=journald --storage-driver=overlay2 -H fd:// -H unix:///var/run/balena.sock -H unix:///var/run/balena-engine.sock --dns=10.114.102.1 --bip=10.114.101.1/24 --fixed-cidr=10.114.101.0/25 --max-download-attempts=10 --exec-opt=native.cgroupdriver=systemd -H tcp://0.0.0.0:2375
             └─ 1462382 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1462376

Apr 26 15:25:37 b2acb23 balenad[1462376]: time="2023-04-26T15:25:37.213281487Z" level=info msg="Starting up"
Apr 26 15:25:37 b2acb23 balenad[1462376]: time="2023-04-26T15:25:37.213430652Z" level=warning msg="Running experimental build"
Apr 26 15:25:37 b2acb23 balenad[1462376]: time="2023-04-26T15:25:37.216060934Z" level=warning msg="Binding to IP address without --tlsverify is insecure and gives root access on this machine to everyone who has access to your network." host="tcp://0.0.0.0:2375"
Apr 26 15:25:37 b2acb23 balenad[1462376]: time="2023-04-26T15:25:37.216207599Z" level=warning msg="Binding to an IP address, even on localhost, can also give access to scripts run in a browser. Be safe out there!" host="tcp://0.0.0.0:2375"
Apr 26 15:25:38 b2acb23 balenad[1462376]: time="2023-04-26T15:25:38.216541119Z" level=warning msg="Binding to an IP address without --tlsverify is deprecated. Startup is intentionally being slowed down to show this message" host="tcp://0.0.0.0:2375"
Apr 26 15:25:38 b2acb23 balenad[1462376]: time="2023-04-26T15:25:38.219238549Z" level=warning msg="Please consider generating tls certificates with client validation to prevent exposing unauthenticated root access to your network" host="tcp://0.0.0.0:2375"
Apr 26 15:25:38 b2acb23 balenad[1462376]: time="2023-04-26T15:25:38.219381139Z" level=warning msg="You can override this by explicitly specifying '--tls=false' or '--tlsverify=false'" host="tcp://0.0.0.0:2375"
Apr 26 15:25:38 b2acb23 balenad[1462376]: time="2023-04-26T15:25:38.219469268Z" level=warning msg="Support for listening on TCP without authentication or explicit intent to run without authentication will be removed in the next release" host="tcp://0.0.0.0:2375"
Apr 26 15:25:44 b2acb23 balenad[1460258]: Parent process exited

df -h and check that /mnt/data has enough space

Filesystem                      Size  Used Avail Use% Mounted on
...
overlay                         300M  253M   27M  91% /
/dev/sda6                       458G   11G  424G   3% /mnt/data
/dev/sda1                        40M  7.9M   32M  20% /mnt/boot
...

run curl -I localhost:48484/v1/healthy on each container

I can’t connect to any container except the Host OS, which reports curl: (7) Failed to connect to localhost port 48484 after 0 ms: Connection refused.
All other attempts to ssh return Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?

run ls -la /run | grep 'balena-engine'

This looks right to me?

root@b2acb23:~# ls -la /run | grep 'balena-engine'
drwx------  4 root   root           100 Apr 26 15:30 balena-engine
-rw-r--r--  1 root   root             7 Apr 26 15:30 balena-engine.pid
srw-rw----  1 root   balena-engine    0 Apr 26 15:30 balena-engine.sock
srw-rw----  1 root   balena-engine    0 Apr 25 06:19 balena-host.sock
lrwxrwxrwx  1 root   root            22 Apr 25 06:19 balena.pid -> /run/balena-engine.pid
srw-rw----  1 root   balena-engine    0 Apr 26 15:30 balena.sock
lrwxrwxrwx  1 root   root            22 Apr 25 06:19 docker.pid -> /run/balena-engine.pid

Thanks again for the support, happy to do the legwork. I’ve turned on support access for b2acb2363b80fef68716aafe1ffdbb46 if you want to probe anything yourself.

myarmolinsky · May 1, 2023, 2:44pm

Hi, what do you get when you run:

systemctl status balena-engine.service
systemctl status balena-engine.socket

If they come out looking as follows (respectively):

Unit balena-engine.service could not be found.

● balena-engine.socket - Docker Socket for the API
   Loaded: loaded (/lib/systemd/system/balena-engine.socket; disabled; vendor preset: enabled)
   Active: inactive (dead)
   Listen: /var/run/balena-engine.sock (Stream)

then perhaps the following solution suggested by another user may work for you?:

disable balena engine (systemctl disable balena-engine.service? not certain this is the command)
remount / as rw
enable balena engine (systemctl enable balena-engine.service? not certain this is the command)

jpj625 · May 1, 2023, 5:23pm

I’m not seeing that particular result.

balena-engine.service appears to be restarting every 15-20s, and I believe the critical line is:

level=error msg=“failed to load container” container=<...> 
    error=“Invalid container config.json, missing Config property”

From the tail of:

$ systemctl status balena-engine.service --no-pager
● balena.service - Balena Application Container Engine
     Loaded: loaded (/lib/systemd/system/balena.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/balena.service.d
             └─storagemigration.conf
     Active: activating (start) since Mon 2023-05-01 17:02:55 UTC; 17s ago
TriggeredBy: ● balena-engine.socket
       Docs: https://www.balena.io/docs/getting-started
   Main PID: 35358 (balenad)
      Tasks: 93 (limit: 4196)
     Memory: 150.3M
     CGroup: /system.slice/balena.service
             ├─ 33285 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 33284
             ├─ 33527 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 33520
             ├─ 33777 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 33776
             ├─ 34062 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 34055
             ├─ 34327 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 34321
             ├─ 34584 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 34578
             ├─ 34840 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 34839
             ├─ 35105 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 35098
             ├─ 35358 /usr/bin/balenad --experimental --log-driver=journald --storage-driver=overlay2 -H fd:// -H unix:///var/run/balena.sock -H unix:///var/run/balena-engine.sock --dns=10.114.102.1 --bip=10.114.101.1/24 --fixed-cidr=10.114.101.0/25 --max-download-attempts=10 --e…
             ├─ 35364 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 35358
             ├─ 35408 balena-engine-containerd --config /var/run/balena-engine/containerd/containerd.toml --log-level info
             ├─ 35527 /usr/bin/balena-engine-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8080 -container-ip 172.17.0.2 -container-port 3218
             ├─ 35536 /usr/bin/balena-engine-proxy -proto tcp -host-ip :: -host-port 8080 -container-ip 172.17.0.2 -container-port 3218
             ├─ 35551 /usr/bin/balena-engine-proxy -proto tcp -host-ip 0.0.0.0 -host-port 1883 -container-ip 172.17.0.3 -container-port 1883
             ├─ 35560 /usr/bin/balena-engine-proxy -proto tcp -host-ip :: -host-port 1883 -container-ip 172.17.0.3 -container-port 1883
             ├─ 35575 /usr/bin/balena-engine-proxy -proto tcp -host-ip 0.0.0.0 -host-port 445 -container-ip 172.17.0.5 -container-port 445
             ├─ 35583 /usr/bin/balena-engine-proxy -proto tcp -host-ip :: -host-port 445 -container-ip 172.17.0.5 -container-port 445
             └─ 35599 /usr/sbin/iptables --wait -t filter -A DOCKER "" "" -i br-0cb27ce6bcad -o br-0cb27ce6bcad -p tcp -d 172.17.0.5 --dport 139 -j ACCEPT
May 01 17:03:11 b2acb23 balenad[35408]: time="2023-05-01T17:03:11.928754910Z" level=info msg=serving... address=/var/run/balena-engine/containerd/balena-engine-containerd.sock.ttrpc
May 01 17:03:11 b2acb23 balenad[35408]: time="2023-05-01T17:03:11.929133739Z" level=info msg=serving... address=/var/run/balena-engine/containerd/balena-engine-containerd.sock
May 01 17:03:11 b2acb23 balenad[35408]: time="2023-05-01T17:03:11.929212238Z" level=info msg="containerd successfully booted in 0.096179s"
May 01 17:03:11 b2acb23 balenad[35358]: time="2023-05-01T17:03:11.955858700Z" level=warning msg="Configured runtime \"runc\" is deprecated and will be removed in the next release"
May 01 17:03:11 b2acb23 balenad[35358]: time="2023-05-01T17:03:11.956527191Z" level=info msg="Storage migration skipped: %!s(<nil>)"
May 01 17:03:12 b2acb23 balenad[35358]: time="2023-05-01T17:03:12.090449453Z" level=warning msg="Your kernel does not support CPU realtime scheduler"
May 01 17:03:12 b2acb23 balenad[35358]: time="2023-05-01T17:03:12.090542933Z" level=warning msg="Your kernel does not support cgroup blkio weight"
May 01 17:03:12 b2acb23 balenad[35358]: time="2023-05-01T17:03:12.090571785Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
May 01 17:03:12 b2acb23 balenad[35358]: time="2023-05-01T17:03:12.091002539Z" level=info msg="Loading containers: start."
May 01 17:03:12 b2acb23 balenad[35358]: time="2023-05-01T17:03:12.096059123Z" level=error msg="failed to load container" container=ef61ba82f3cb42ddbc8e76cf54366b2a7462170b5fe76393d9d3825b5332f46f error="Invalid container config.json, missing Config property"
# (and if I time it right...)
May 01 17:21:01 b2acb23 balenad[49232]: time="2023-05-01T17:21:01.191724758Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
May 01 17:21:01 b2acb23 balenad[49232]: time="2023-05-01T17:21:01.191721221Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=moby
May 01 17:21:01 b2acb23 balenad[49232]: time="2023-05-01T17:21:01.191783794Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd

The socket appears to be fine?

$ systemctl status balena-engine.socket                                                                                                                                                                                                                    
● balena-engine.socket - Docker Socket for the API
     Loaded: loaded (/lib/systemd/system/balena-engine.socket; enabled; vendor preset: disabled)
     Active: active (listening) since Mon 2023-05-01 16:28:54 UTC; 35min ago
      Until: Mon 2023-05-01 16:28:54 UTC; 35min ago
   Triggers: ● balena.service
     Listen: /run/balena-engine.sock (Stream)
      Tasks: 0 (limit: 4196)
     Memory: 4.0K
     CGroup: /system.slice/balena-engine.socket

stephanagantha · May 3, 2023, 9:09pm

Check the logos, the first thing you should do is check the logos for any error message. You can do it by running the following command in terminal and using command : journalctl -u balena

cywang117 · May 6, 2023, 12:13am

Hi @jpj625 ,

The error message may be indicative of an inconsistent internal Engine state, especially if the issue persists on reboot. This issue may occur with large OS jumps.

Since the Engine appears to be erroring on startup, you could check for FS corruption:

journalctl -n 10000 | grep "corrupt\|Data will be lost\|block allocation failed\|invalid magic\|dockerd: error -117\|EXT4-fs error\|error count since last fsck\|failed checksum\|checksum invalid\|Please run e2fsck"

If so, you could attempt to repair the data partition (partition 6 for RPi4) using fsck, although attempt this with extreme care. You could also purge the Engine directory as another last resort, if there is no FS corruption.

Regards,
Christina

jpj625 · May 7, 2023, 8:08pm

Hi Christina,

Thanks for more ideas.

# `journalctl -n 10000 | grep “corrupt.…”

May 06 14:00:03 b2acb23 systemd-journald[1091]: File /var/log/journal/d02468b56b9947a08ced32a162a7323d/system.journal corrupted or uncleanly shut down, renaming and replacing.
May 07 17:01:51 b2acb23 kernel: EXT4-fs (sda5): error count since last fsck: 1

I’m not sure if a single fs issue means anything (early-stage hardware failure, underlying issue, just one file that got out of whack). Doesn’t look like fsck is worried about anything.

root@b2acb23:~# fsck -nfv /dev/sda6

fsck from util-linux 2.37.4
e2fsck 1.46.5 (30-Dec-2021)
Warning!  /dev/sda6 is in use.s
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

      606937 inodes used (1.00%, out of 60959744)
         347 non-contiguous files (0.1%)
         397 non-contiguous directories (0.1%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 595545/137/2
    19223971 blocks used (3.94%, out of 487677953)
           0 bad blocks
           1 large file

      546064 regular files
       47968 directories
         199 character device files
           0 block device files
          26 fifos
        7121 links
       12651 symbolic links (11000 fast symbolic links)
          20 sockets
------------
      614049 files

After wrestling with umount and fsck a bit, but never actually seeing anything found or fixed, something dislodged and the device booted all the way to “online” once, then started throwing out-of-space errors that look like it’s trying to update.

At the start of this, I had tried a new release (but now pinned back to what’s already deployed), and I’ve also tried to update the supervisor a few times; those may be queued? My primary service no longer appears in the service list, which is unexpected. The rest of the services show “downloaded” and aren’t started.

07.05.23 10:40:10 (-0700) Failed to download image 'registry2.balena-cloud.com/v2/b3bb85de12a78fd622d9ed52f05f380a@sha256:f3025800bbbc61c04abb0cd27fbe69a81830b3485b4cbee8acad053928fabd72' due to 'failed to register layer: Error processing tar file(exit status 1): write /usr/share/X11/locale/en_US.UTF-8/Compose: no space left on device'
07.05.23 10:40:43 (-0700) Downloading image 'registry2.balena-cloud.com/v2/b3bb85de12a78fd622d9ed52f05f380a@sha256:f3025800bbbc61c04abb0cd27fbe69a81830b3485b4cbee8acad053928fabd72'
07.05.23 10:40:43 (-0700) Downloading image 'registry2.balena-cloud.com/v2/8f4d0e783f9d6f06e8e2a3445d7c7015@sha256:7e0b1236341f0133c95207ab0c8a784438d599b3994fc4327dddcc11e358be98'
07.05.23 10:43:35 (-0700) Failed to download image 'registry2.balena-cloud.com/v2/8f4d0e783f9d6f06e8e2a3445d7c7015@sha256:7e0b1236341f0133c95207ab0c8a784438d599b3994fc4327dddcc11e358be98' due to 'failed to register layer: Error processing tar file(exit status 1): write /usr/local/share/pico/lang/es-ES_zl0_sg.bin: no space left on device'
07.05.23 10:43:36 (-0700) Failed to download image 'registry2.balena-cloud.com/v2/b3bb85de12a78fd622d9ed52f05f380a@sha256:f3025800bbbc61c04abb0cd27fbe69a81830b3485b4cbee8acad053928fabd72' due to 'failed to register layer: Error processing tar file(exit status 1): write /usr/lib/python3.7/pydoc_data/__pycache__/topics.cpython-37.pyc: no space left on device'

I can only assume it’s complaining about overlay/partition 3?

root@b2acb23:/usr# df -h

Filesystem                      Size  Used Avail Use% Mounted on
devtmpfs                        1.8G     0  1.8G   0% /dev
tmpfs                           1.9G  1.1M  1.9G   1% /run
/dev/sda3                       300M  253M   27M  91% /mnt/sysroot/active
/dev/disk/by-state/resin-state   19M  244K   17M   2% /mnt/state
overlay                         300M  253M   27M  91% /
/dev/sda6                       458G   11G  424G   3% /mnt/data
tmpfs                           1.9G     0  1.9G   0% /dev/shm
tmpfs                           4.0M     0  4.0M   0% /sys/fs/cgroup
tmpfs                           1.9G     0  1.9G   0% /tmp
tmpfs                           1.9G   60K  1.9G   1% /var/volatile
/dev/sda1                        40M  7.9M   32M  20% /mnt/boot

I remain confused, so any help is appreciated.

lmbarros · May 10, 2023, 8:43pm

Hi Jeff! A tricky issue, indeed. Let me share some more info and ideas for troubleshooting. (BTW, I just tried to access your device, but it seems to be offline now.)

The most important clue I see here is that the error message

balenad[629800]: time="..." level=error msg="failed to load container" container=ef...32f46f error="Invalid container config.json, missing Config property"

is really coming from balenaEngine (we can tell this because the Engine daemon, balenad, is mentioned as the source of the message, right at the start). So, it’s not referring to balena’s /mnt/boot/config.json, but instead to a configuration file that every container has.

What is happening here is that the Engine is trying to initialize. As part of the initialization the Engine is trying to bring back to life the containers that were running before it has stopped (it was stopped because of the OS upgrade or some other reason). And when it looks for the container data that is persisted in your device’s storage, it can’t find what it hoped to find.

Now, what can we do about it? I sadly cannot tell from the data we got so far, so here are suggestions for further digging. I’ll also share an arguably simpler (if a bit risky) workaround later on.

So, for troubleshooting:

Open a terminal to the HostOS and cd /mnt/data/docker/containers. This is the place where balenaEngine stores the container data I talked about above.
If you ls here, you shall see some directories whose names match the IDs of the containers you had running in the past.
Each of these directories shall have either a config.json or (more likely) a config.v2.json. As far as I can tell, these are the files the Engine is looking for during initialization. So, let’s try to get some clues on what is wrong.
First, check if either (or both) of files are indeed present in each of the subdirs under /mnt/data/docker/containers.
Second, check if they have any content at all. Perhaps they are empty?
Third, may be worth checking if they are valid JSON files. Try running jq . config.v2.json to see if jq reports any errors.
Fourth, check if they have the Config property the message is complaining about. An easy way to do this is:

root@92a02e7:/mnt/data/docker/containers/17396e45a08b1e976e45bbb492d87cb9cfa2c181c0508f40dff97a5908f23d2f# jq .Config config.v2.json
{
  "Hostname": "17396e45a08b",
  "Domainname": "",
  "User": "",
  [...]
  "StopSignal": "SIGTERM",
  "StopTimeout": 10
}

(Here I tested a device of mine that has the Config property as expected. Perhaps you’ll get only null as the response, because the Config is missing?)

To be honest, I am not super optimistic that these commands will get us enough info to solve the issue. However, as a maintainer of balenaEngine, I’d be very happy if you could try them out and share your findings. This may really help us to get a better understanding of the issue, even we cannot solve it.

OK, I also promised a workaround. Instead of purging the whole Engine data directory, you can try to remove the container data, that is, the /mnt/data/docker/containers/* directories. The Engine should then recreate the necessary container data dirs the next time it tries to start up. I believe you’d lose any data stored locally in the containers (but volumes and the images themselves would be preserved).

I said this operation is a bit a risky because we really shouldn’t be manually changing things in the Engine/Docker data directories. It’s too easy to break things, and I generally advise against it. Anyway, if the alternative is, like you said, “burn it completely down,” you may want to give this a try.

jpj625 · May 10, 2023, 11:57pm

Hi Leandro. Thanks for adding more eyes!

Checked my containers for config.v2.json, which they all had. Most were valid, but a single one was missing the Config section, as well as anything else identifying (all 0s and nulls).
Removed that troubled directory and restarted. The Invalid container config.json error is no longer showing up, but things are still stuck.
Removed all container directories (:/mnt/data/docker/containers# ls -a → . ..) and restarted, but still no progress.

Current state: Dashboard shows “Updating” at 76% for device status, and most containers show “Downloaded” with 2 “Downloading” at 75% and 50%. No new logs are making it to the dashboard; previous logs are Failed to download image ... no space left on device. Only Host OS is available to ssh.

To confirm that the service still won’t start:

root@b2acb23:~# systemctl start balena

Job for balena.service failed because the control process exited with error code.
See "systemctl status balena.service" and "journalctl -xeu balena.service" for details.

And since everyone loves stacktraces…

root@b2acb23:~# journalctl -xeu balena.service

May 10 22:19:13 b2acb23 balenad[32860]: time="2023-05-10T22:19:13.141621481Z" level=info msg="Starting up"
May 10 22:19:13 b2acb23 balenad[32860]: time="2023-05-10T22:19:13.141855330Z" level=warning msg="Running experimental build"
May 10 22:19:13 b2acb23 balenad[32860]: time="2023-05-10T22:19:13.144764963Z" level=warning msg="Binding to IP address without --tlsverify is insecure and gives root access on this machine to everyone who has access to your network." host="tcp://0.0.0.0:2375"
May 10 22:19:13 b2acb23 balenad[32860]: time="2023-05-10T22:19:13.144865832Z" level=warning msg="Binding to an IP address, even on localhost, can also give access to scripts run in a browser. Be safe out there!" host="tcp://0.0.0.0:2375"
May 10 22:19:14 b2acb23 balenad[32860]: time="2023-05-10T22:19:14.145117488Z" level=warning msg="Binding to an IP address without --tlsverify is deprecated. Startup is intentionally being slowed down to show this message" host="tcp://0.0.0.0:2375"
May 10 22:19:14 b2acb23 balenad[32860]: time="2023-05-10T22:19:14.145437781Z" level=warning msg="Please consider generating tls certificates with client validation to prevent exposing unauthenticated root access to your network" host="tcp://0.0.0.0:2375"
May 10 22:19:14 b2acb23 balenad[32860]: time="2023-05-10T22:19:14.145540428Z" level=warning msg="You can override this by explicitly specifying '--tls=false' or '--tlsverify=false'" host="tcp://0.0.0.0:2375"
May 10 22:19:14 b2acb23 balenad[32860]: time="2023-05-10T22:19:14.145628223Z" level=warning msg="Support for listening on TCP without authentication or explicit intent to run without authentication will be removed in the next release" host="tcp://0.0.0.0:2375"
May 10 22:19:29 b2acb23 balenad[32860]: time="2023-05-10T22:19:29.151790705Z" level=info msg="libcontainerd: started new balena-engine-containerd process" pid=32885
May 10 22:19:29 b2acb23 balenad[32860]: time="2023-05-10T22:19:29.383625428Z" level=warning msg="Configured runtime \"runc\" is deprecated and will be removed in the next release"
May 10 22:19:29 b2acb23 balenad[32860]: time="2023-05-10T22:19:29.384681934Z" level=info msg="Storage migration skipped: %!s(<nil>)"
May 10 22:19:29 b2acb23 balenad[32860]: time="2023-05-10T22:19:29.507087257Z" level=warning msg="Your kernel does not support CPU realtime scheduler"
May 10 22:19:29 b2acb23 balenad[32860]: time="2023-05-10T22:19:29.508049838Z" level=warning msg="Your kernel does not support cgroup blkio weight"
May 10 22:19:29 b2acb23 balenad[32860]: time="2023-05-10T22:19:29.508332168Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
May 10 22:19:29 b2acb23 balenad[32860]: time="2023-05-10T22:19:29.509307861Z" level=info msg="Loading containers: start."
May 10 22:19:30 b2acb23 balenad[32860]: time="2023-05-10T22:19:30.927907746Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
May 10 22:19:30 b2acb23 balenad[32860]: time="2023-05-10T22:19:30.927911339Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=moby
May 10 22:19:30 b2acb23 balenad[32860]: time="2023-05-10T22:19:30.928002523Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
May 10 22:19:31 b2acb23 balenad[32860]: panic: page 2 already freed
May 10 22:19:31 b2acb23 balenad[32860]: goroutine 1 [running]:
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/go.etcd.io/bbolt.(*freelist).free(0x40003d2600, 0x5f3a0b, 0x7f643c2000)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/go.etcd.io/bbolt/freelist.go:175 +0x39c
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/go.etcd.io/bbolt.(*node).spill(0x40001f01c0, 0x4000ba8e10, 0x1)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/go.etcd.io/bbolt/node.go:359 +0x348
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/go.etcd.io/bbolt.(*node).spill(0x40001f0150, 0x4000722cf0, 0x4000d1e168)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/go.etcd.io/bbolt/node.go:346 +0x9c
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/go.etcd.io/bbolt.(*Bucket).spill(0x400086ea00, 0x4000722c00, 0x4000d1e358)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/go.etcd.io/bbolt/bucket.go:570 +0x340
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/go.etcd.io/bbolt.(*Bucket).spill(0x40006f0558, 0x42aaf0453, 0x2bad540)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/go.etcd.io/bbolt/bucket.go:537 +0x2b8
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/go.etcd.io/bbolt.(*Tx).Commit(0x40006f0540, 0x0, 0x0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/go.etcd.io/bbolt/tx.go:160 +0xa8
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/go.etcd.io/bbolt.(*DB).Update(0x40003d5e00, 0x4000d1e590, 0x0, 0x0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/go.etcd.io/bbolt/db.go:701 +0xe4
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/docker/libkv/store/boltdb.(*BoltDB).Put(0x40004aa730, 0x4000ae6390, 0x23, 0x2bdc510, 0x0, 0x0, 0x400083da30, 0x0, 0x0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libkv/store/boltdb/boltdb.go:187 +0x194
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/docker/libnetwork/datastore.(*datastore).ensureParent(0x400007f800, 0x4000ae6390, 0x23, 0x7f6457af80, 0x40007223c0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/datastore/datastore.go:507 +0xc4
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/docker/libnetwork/datastore.(*datastore).iterateKVPairsFromStore(0x400007f800, 0x4000ae6390, 0x23, 0x1e8f3b8, 0x40007223c0, 0x4000d1e768, 0x4000061c00, 0x2b01880)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/datastore/datastore.go:539 +0x70
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/docker/libnetwork/datastore.(*datastore).Map(0x400007f800, 0x4000ae6390, 0x23, 0x1e8f3b8, 0x40007223c0, 0x0, 0x0, 0x0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/datastore/datastore.go:578 +0xe0
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).getNetworksFromStore(0x40000db300, 0x134e76c, 0x400007f840, 0x169b657)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:134 +0x480
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).Networks(0x40000db300, 0x0, 0x0, 0x0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:1023 +0x28
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).WalkNetworks(0x40000db300, 0x1dfc4e0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:1034 +0x28
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/docker/libnetwork.New(0x4000804a80, 0x9, 0x10, 0x40006d0720, 0x4000aeac90, 0x4000804a80, 0x9)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:240 +0x508
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/daemon.(*Daemon).initNetworkController(0x400071c1e0, 0x4000122000, 0x4000aeac90, 0x0, 0x0, 0x0, 0x0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/daemon/daemon_unix.go:855 +0x78
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/daemon.(*Daemon).restore(0x400071c1e0, 0x400086e700, 0x40006f00e0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/daemon/daemon.go:491 +0x3d8
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/daemon.NewDaemon(0x1e75a70, 0x400086e700, 0x4000122000, 0x40006d0720, 0x0, 0x0, 0x0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/daemon/daemon.go:1196 +0x2160
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/cmd/dockerd.(*DaemonCli).start(0x400083fc80, 0x40000894a0, 0x0, 0x0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/dockerd/daemon.go:190 +0x588
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/cmd/dockerd.runDaemon(...)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/dockerd/docker_unix.go:13
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/cmd/dockerd.newDaemonCommand.func1(0x4000396840, 0x40000dab00, 0x0, 0x10, 0x0, 0x0)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/dockerd/docker.go:34 +0x78
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).execute(0x4000396840, 0x400000c250, 0x10, 0x11, 0x4000396840, 0x400000c250)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:850 +0x320
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x4000396840, 0x0, 0x0, 0x10)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:958 +0x258
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).Execute(...)
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:895
May 10 22:19:31 b2acb23 balenad[32860]: github.com/docker/docker/cmd/dockerd.Main()
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/dockerd/docker.go:98 +0x188
May 10 22:19:31 b2acb23 balenad[32860]: main.main()
May 10 22:19:31 b2acb23 balenad[32860]:         /work/build/tmp/work/cortexa72-poky-linux/balena/20.10.26+gitaf43ce40a2eb094b68a7d91c5fd9f36280c265a2-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/balena-engine/main.go:30 +0x1b4
May 10 22:19:31 b2acb23 systemd[1]: balena.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 10 22:19:31 b2acb23 systemd[1]: balena.service: Failed with result 'exit-code'.
May 10 22:19:31 b2acb23 systemd[1]: Failed to start Balena Application Container Engine.
May 10 22:19:32 b2acb23 systemd[1]: balena.service: Found left-over process 31022 (exe) in control group while starting unit. Ignoring.
May 10 22:19:32 b2acb23 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
...

It does make me feel better that it’s not a simple fix or user-error (so far!). Thanks again to everyone working on this with me.

Device is on and support access is open.

lmbarros · May 11, 2023, 1:48pm

Hi there,

First off, thanks for taking the time to do the experiments we suggested and getting back with the results!

This new error you saw is also related to balenaEngine’s state stored in disk getting corrupted. Looking at the device, I spotted some other errors originating from unexpected stuff going on the filesystem:

Checking the kernel logs (using dmesg):

[  315.932726] EXT4-fs (sda5): error count since last fsck: 1
[  315.932769] EXT4-fs (sda5): initial error at time 1620926684: ext4_lookup:1703: inode 3429
[  315.932810] EXT4-fs (sda5): last error at time 1620926684: ext4_lookup:1703: inode 3429

And looking at the logs of your latest OS update (these logs are at /mnt/data/balenahup), I see various occurrences of this one:

[INFO] Deploying /mnt/boot/overlays/ssd1351-spi.dtbo...Error: Failed to read target file /mnt/boot/balenaos.fingerprint

Caused by:
    No such file or directory (os error 2)

In our experience, a large number of otherwise unrelated storage issues are a sign of hardware failure. It’s relatively common to see this with SD cards, which are very prone to flash media wear-and-tear, but I believe you are using an HD or SSD here, right? (Anyway, SSDs are also susceptible to wear-and-tear, just not as much. And of course, any storage media can fail one way or another.)

Unfortunately, I am afraid the recommended action for you would be to re-flash your device using a new storage media. If you have important data in the volumes used by your containers, you can probably salvage it: volume data is stored under /mnt/data/docker/volumes.

jpj625 · May 12, 2023, 12:57am

Disappointing to have the answer be “fried storage,” but it is what it is. It’s a WD SSD from 2021, so I ought to be able to get it replaced.

I was able to extract my volumes and get a new device set up. If I don’t encounter any unexpected side-effects, the hard part will have been updating my public device URL.

Thanks again to all of you fine folks, @mpous @myarmolinsky @cywang117 @lmbarros!

myarmolinsky · May 15, 2023, 11:34am

Glad we can help (albeit the disappointing conclusion )
Please let us know if you have any more questions

Topic		Replies	Views
Balena engine not starting due to socket busy error Product support raspberrypi4	5	1162	March 18, 2021
balenaOS 2.98.33 on raspberry pi 3b (32bit) balenaOS	5	435	August 18, 2022
Balena Engine won't start balenaOS	10	843	June 24, 2020
stuck on Update successful, rebooting balenaOS support , raspberrypi3	5	407	June 12, 2021
Upgraded OS and app - diagnostics show errors Product support raspberrypizero , foo	8	1623	May 25, 2020

Balena Engine won't start after Host OS update

Related topics