balena-engine not starting

Hi, I have a problem with one of my client's Balena applications.
"balena push <IP>" failed:

$ balena push 192.168.1.147 --debug
[debug] new argv=[/home/peter/.nvm/versions/node/v12.18.4/bin/node,/home/peter/.nvm/versions/node/v12.18.4/bin/balena,push,192.168.1.147] length=4
[debug] Using . as build source
[debug]   Checking we can access device
[debug]   Sending request to http://192.168.1.147:48484/ping
Could not communicate with local mode device at address 192.168.1.147

Additional information may be available with the `--debug` flag.
For help, visit our support forums: https://forums.balena.io
For bug reports or feature requests, see: https://github.com/balena-io/balena-cli/issues/

Using ssh to log into the device, I ran “systemctl status balena-engine”, which gave this output:

● balena.service - Balena Application Container Engine
   Loaded: loaded (/lib/systemd/system/balena.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/balena.service.d
           └─balena.conf
   Active: activating (start) since Tue 2021-01-26 14:30:27 UTC; 1s ago
     Docs: https://www.balena.io/docs/getting-started
 Main PID: 31222 (balenad)
    Tasks: 1207 (limit: 2062)
   Memory: 708.3M
   CGroup: /system.slice/balena.service
           ├─ 1375 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1372
           ├─ 1378 /bin/sh /usr/lib/balena/balena-healthcheck
           ├─ 1382 balena info
           ├─ 1466 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1465
           ├─ 1467 /bin/sh /usr/lib/balena/balena-healthcheck
           ├─ 1473 balena info
           ├─ 1774 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1773
           ├─ 1775 /bin/sh /usr/lib/balena/balena-healthcheck
           ├─ 1781 balena info
           ├─ 1813 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1812
           ├─ 1814 /bin/sh /usr/lib/balena/balena-healthcheck
           ├─ 1820 balena info
           ├─ 1889 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1888

.... skipping a couple of hundred similar lines ...

           ├─31277 balena info
           ├─31310 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31309
           ├─31346 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31345
           ├─31383 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31382
           ├─31424 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31422
           ├─31461 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31460
           ├─31462 /bin/sh /usr/lib/balena/balena-healthcheck
           ├─31468 balena info
           ├─31499 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31498
           ├─31539 /usr/bin/balenad --experimental --log-driver=journald -s aufs -H fd:// -H unix:///var/run/balena.sock -H unix:///var/run/balena-engine.sock -H tcp://0.0.0.0:2375 --dns 10.114.1>
           ├─31540 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31539
           ├─31541 /bin/sh /usr/lib/balena/balena-healthcheck
           └─31547 balena info

Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.923971552Z" level=warning msg="failed to load plugin io.containerd.internal.v1.opt" error="mkdir /opt: read-only file system"
Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.926699346Z" level=info msg="loading plugin \"io.containerd.grpc.v1.snapshots\"..." type=io.containerd.grpc.v1
Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.927134708Z" level=info msg="loading plugin \"io.containerd.grpc.v1.tasks\"..." type=io.containerd.grpc.v1
Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.933444092Z" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.935423558Z" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.996124188Z" level=info msg=serving... address=/var/run/balena-engine/containerd/balena-engine-containerd-debug.sock
Jan 26 14:31:44 4ddd44d balenad[31539]: time="2021-01-26T14:31:44.003817625Z" level=info msg=serving... address=/var/run/balena-engine/containerd/balena-engine-containerd.sock.ttrpc
Jan 26 14:31:44 4ddd44d balenad[31539]: time="2021-01-26T14:31:44.005079960Z" level=info msg=serving... address=/var/run/balena-engine/containerd/balena-engine-containerd.sock
Jan 26 14:31:44 4ddd44d balenad[31539]: time="2021-01-26T14:31:44.012002830Z" level=info msg="containerd successfully booted in 1.611315s"
Jan 26 14:31:44 4ddd44d balenad[31539]: time="2021-01-26T14:31:44.937936029Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd

The key message seems to be:
… msg="failed to load plugin io.containerd.internal.v1.opt" error="mkdir /opt: read-only file system"

I am totally lost.
Any help would be very much appreciated.

Cheers
Peter

Hi @pit007,

Can you tell us a little more about the project you’re pushing and the kind of device? Are you using balenaCloud or openBalena?

Generally speaking, using balena push <IP> is a local mode command, which pushes your image(s) to the device directly without going through the balena build servers. A more typical deployment command is balena push <app-name>. In order for local push to work, you must enable it by selecting “Enable local mode” from the device menu (near the Reboot and Restart buttons).
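
For example (the fleet name in the second command is just a placeholder):

balena push 192.168.1.147    # local mode: the source is built and run directly on the device at that IP
balena push my-fleet-name    # cloud build: replace my-fleet-name with your application/fleet name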

Can you also confirm that your device is running and the HostOS is accessible from the dashboard? Or are you just using the command line (with openBalena)?

John

Hi jtonello,

thanks for your reply.
The target is a Raspberry Pi Compute Module on a custom I/O board. The compute module is probably a CM3. We are using balenaCloud. We have been developing for several months without any problems, usually using "balena push <IP>".
In recent weeks, my system got slower and less responsive. Using "ps" I saw a large number (> 200) of instances of the balena health checker. Sometimes the machine took hours to boot. Then I tried "balena image prune", which seemed to improve things slightly.
However, last weekend, during yet another deploy cycle, I ran into the problems described above.
The device's HostOS is accessible through the balena dashboard.

Thanks for your help.
Cheers
Peter

Hi Peter, sorry to hear you’re having trouble with your Balena container engine.

Can you please provide the OS version and supervisor version (these can be found in your balena dashboard device UI)?

Also, it might be helpful if you could grant support access for our OS team (can be found in the “Actions” panel).

Hi zwhichcox,

thanks for your reply.
The OS version is balenaOS 2.51.1+rev1, the supervisor version is 11.4.10.
I'm not sure what you need to identify the device; anyway, the BALENA_DEVICE_UUID is 4ddd44df5ce0df554d22c21a884fa9d8.

Thank you very much for your help.
Peter

Hello, can you power on the device so I can take a look?

Hi,

it has already been running.

Cheers
Peter

Hey there @pit007, it looks like the device has exceeded the max inode count on the data partition, which is causing problems with creating containers, starting the supervisor, and a whole host of other things.

Could you try running this command to get an idea of where most of the inodes are used?

du --inodes -xSs /mnt/data/*/*

Once we see where most of the inodes are taken up, we can narrow it down further by appending to that path and running the same command, for example:

# for example, if /mnt/data/docker/volumes had the highest number above
du --inodes -xSs /mnt/data/docker/volumes/*
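
If it helps, piping the same output through sort surfaces the biggest consumers first (just a convenience on top of the command above):

# largest inode consumers at the top
du --inodes -xSs /mnt/data/*/* | sort -rn | head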

With that narrowed down, we may need to start deleting files for the OS to start behaving correctly. If those files were generated by the hostOS and not any of your containers, we’ll have to open an issue to fix it in the OS.

For now, we've opened an issue to start reporting free inodes when running healthchecks.

Hi Kyle,

thanks for your reply.

I issued the command:

du --inodes -xSs /mnt/data/*/*

1 /mnt/data/docker/aufs
2 /mnt/data/docker/builder
4 /mnt/data/docker/buildkit
1 /mnt/data/docker/containerd
1 /mnt/data/docker/containers
1 /mnt/data/docker/image
1 /mnt/data/docker/network
1 /mnt/data/docker/plugins
1 /mnt/data/docker/runtimes
1 /mnt/data/docker/swarm
1 /mnt/data/docker/tmp
1 /mnt/data/docker/trust
2 /mnt/data/docker/volumes
2 /mnt/data/resin-data/resin-supervisor

That’s all there is.

I’ll keep the device powered through the night.
If you would like to look at it again, feel free to do so.
I would like to reflash the device tomorrow (Friday).

Thanks for your help.
Peter

Hey Peter, I'm looking at your device now. The command I provided you earlier wasn't accurate; I fixed it up a bit and found that there are hundreds of containers taking up 10k inodes each on this device.

du --inodes -xSs /mnt/data/docker/containers/*

So the balena image prune command you ran before wouldn't help much, as you said. I would recommend trying balena system prune instead to clean up all the old containers before it gets to this state.
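
For example, on the device's host OS (assuming balenaEngine keeps the usual Docker prune flags, which it normally does):

# remove all stopped containers, unused networks and dangling images (asks for confirmation)
balena system prune
# optionally also remove any image not used by an existing container
balena image prune -a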

That being said, containers shouldn’t be sticking around after each balena push attempt, so there’s something else we may be missing here. I’ll ask internally for suggestions.

Hi Kyle,

yes, I had tried that before, but without success:
Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?

As I said, I’ll leave the machine running, in case you would like to continue investigating.

Cheers
Peter

Hey, we see that there are 187 containers in /mnt/data/docker/containers and some of these containers are using upwards of 9000 inodes. We knew it was coming from that folder because of the following command:

root@4ddd44d:/var/lib/docker# df -i
Filesystem                      Inodes   IUsed  IFree IUse% Mounted on
devtmpfs                        110016     440 109576    1% /dev
tmpfs                           111936      14 111922    1% /tmp
/dev/mmcblk0p2                   39936    6056  33880   16% /mnt/sysroot/active
/dev/disk/by-state/resin-state    5136      70   5066    2% /mnt/state
none                             39936    6056  33880   16% /
tmpfs                           111936       1 111935    1% /dev/shm
tmpfs                           111936     683 111253    1% /run
tmpfs                           111936      14 111922    1% /sys/fs/cgroup
/dev/mmcblk0p1                       0       0      0     - /mnt/boot
tmpfs                           111936      44 111892    1% /var/volatile
/dev/mmcblk0p6                 1819648 1819648      0  100% /mnt/data

Note how /mnt/data has used 100% of the Inodes available.

Now we are trying to figure out how so many containers have been left behind and why they have so many inodes. Does the application you're developing in local mode open files in some kind of loop? Is your application also using Python? I'm asking because the containers all have some Python env variables set, and this will confirm whether it's your container or something else.

How often have you seen this, by the way? Are you developing in the same way on multiple devices, and is this the only device that's experiencing this issue? How long have you been developing with local mode? I wonder if this is something that has built up over time. We could confirm by looking at all the containers in /mnt/data/docker/containers, which have a creation timestamp, if needed. What version of the CLI are you using, and what is your operating system?
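
For example, something like this on the host OS would show when each leftover container was created (a rough sketch, assuming each folder contains the engine's usual config.v2.json metadata file):

# print the Created timestamp from each leftover container's metadata
grep -o '"Created": *"[^"]*"' /mnt/data/docker/containers/*/config.v2.json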

Hi Miguel,

yes, our application is written in Python. However, we are not using any files;
we talk to some special hardware.

Cheers
Peter

Hi Miguel,

personally, I have been experiencing problems for a number of weeks:
the system became slower and slower to boot, and there was a huge number of
healthcheck processes. Sometimes I ran "balena image prune", which did help
slightly. Unfortunately I'm not very familiar with Docker, so I never tried
"balena system prune" until recently.
Some of my colleagues experienced similar problems and had to reflash their
devices. Apparently none of them really investigated the problem.
The balena-cli I'm using is version 12.22.2.

I will now reflash the device so I can continue developing.

Thanks for investigating and your help.

Regards
Peter

Hello @pit007

I've looked a bit at your device.
It is running out of inodes on the data partition.
Most inodes are used by folders in /mnt/data/docker/containers/.
You have 190 folders there, but you should only have 2: one per container.

I don’t know why they weren’t removed.
Most of these leftover folders have a config.v2.json file containing something like:

{
  "StreamConfig": {},
  "State": {
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "RemovalInProgress": false,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 255,
    "Error": "",
    "StartedAt": "2021-01-24T16:41:09.572144356Z",
    "FinishedAt": "2021-01-25T15:21:11.183074408Z",
    "Health": null
  },
  "ID": "2206a5266182734e425992f558e5fafa6369f4deb46855f879d627b0ee8037c4",
  "Created": "2021-01-24T16:41:06.993921427Z",
  "Managed": false,
  "Path": "/bin/sh",
  "Args": [
    "-c",
    "while true; do sleep 3600; done"
  ],
  "Config": {
    "Hostname": "2206a5266182",
    "Domainname": "",
    "User": "",
    "AttachStdin": false,
    "AttachStdout": false,
    "AttachStderr": false,
    "Tty": false,
    "OpenStdin": false,
    "StdinOnce": false,
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "LC_ALL=C.UTF-8",
      "DEBIAN_FRONTEND=noninteractive",
      "UDEV=off",
      "LANG=C.UTF-8",
      "PYTHON_VERSION=3.7.4",
      "PYTHON_PIP_VERSION=19.2.1",
      "SETUPTOOLS_VERSION=41.0.1",
      "PYTHON_DBUS_VERSION=1.2.8",
      "PYTHONPATH=/usr/lib/python3/dist-packages::/usr/src/",
      "LD_LIBRARY_PATH=/usr/local/lib"
    ],
    "Cmd": null,
    "Image": "be4850ca41a6",
    "Volumes": null,
    "WorkingDir": "/usr/src",
    "Entrypoint": [
      "/bin/sh",
      "-c",
      "while true; do sleep 3600; done"
    ],
    "OnBuild": null,
    "Labels": {
      "io.balena.architecture": "armv7hf",
      "io.balena.qemu.version": "4.0.0+balena-arm"
    }
  },
  "Image": "sha256:be4850ca41a6280ef489d85bfe45b5f6a0f1ceacc7f862cbb991a9a4a5d9547a",
  "NetworkSettings": {
    "Bridge": "",
    "SandboxID": "7d296765ad28216769c7c06f453d45e3cda20ca0613850558dcf8ac7cadac9c9",
    "HairpinMode": false,
    "LinkLocalIPv6Address": "",
    "LinkLocalIPv6PrefixLen": 0,
    "Networks": {
      "bridge": {
        "IPAMConfig": null,
        "Links": null,
        "Aliases": null,
        "NetworkID": "9b003273792ecd6a841fa27af8e7fb7413dbb93ab924ac7406a1c72bee8acd12",
        "EndpointID": "867ecae1f82fe41be9488cccab4c0f2ae0c7e29b7a3feed90b5c2e0d2dd92334",
        "Gateway": "10.114.101.1",
        "IPAddress": "10.114.101.3",
        "IPPrefixLen": 24,
        "IPv6Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "MacAddress": "02:42:0a:72:65:03",
        "DriverOpts": null,
        "IPAMOperational": false
      }
    },
    "Service": null,
    "Ports": {},
    "SandboxKey": "/var/run/balena-engine/netns/7d296765ad28",
    "SecondaryIPAddresses": null,
    "SecondaryIPv6Addresses": null,
    "IsAnonymousEndpoint": true,
    "HasSwarmEndpoint": false
  },
  "LogPath": "",
  "Name": "/trusting_mendeleev9",
  "Driver": "aufs",
  "OS": "linux",
  "MountLabel": "",
  "ProcessLabel": "",
  "RestartCount": 0,
  "HasBeenStartedBefore": true,
  "HasBeenManuallyStopped": false,
  "MountPoints": {},
  "SecretReferences": null,
  "ConfigReferences": null,
  "AppArmorProfile": "",
  "HostnamePath": "/var/lib/docker/containers/2206a5266182734e425992f558e5fafa6369f4deb46855f879d627b0ee8037c4/hostname",
  "HostsPath": "/var/lib/docker/containers/2206a5266182734e425992f558e5fafa6369f4deb46855f879d627b0ee8037c4/hosts",
  "ShmPath": "",
  "ResolvConfPath": "/var/lib/docker/containers/2206a5266182734e425992f558e5fafa6369f4deb46855f879d627b0ee8037c4/resolv.conf",
  "SeccompProfile": "",
  "NoNewPrivileges": false
}

These look like containers that ran a balenalib Python base image, according to the env vars and labels: base-images/Dockerfile at master · balena-io-library/base-images · GitHub.

See the Entrypoint: they were just running an endless sleep loop.

I guess we can simply remove all folders here except ab36529295bb08d2939fdbd45d719a66d29919144f9bf0c6733e167e61424dea (your app) and 78f457c98e99fe43323cb8157657041ac726711d4fe774c02aa330e8e7161699 (the supervisor) to free up inodes.

I wonder what created these containers (and folders). Do you have any clue?
You can run ls /mnt/data/docker/containers/ | wc -l to count the number of folders.
Can you check what in your workflow is creating folders here?

@pit007 Do you (or did you) use livepush?

Hi Alexis,

no, I have no idea why these files/folders were not removed.
I don't know enough about how Docker behaves or is supposed to behave.
From now on, I will run "balena image prune" and "balena system prune"
regularly.

Thanks for your investigation and help.
Peter

Hi Alexis,

well, probably yes: I usually forget to add the --nolive option.
I run "balena push <IP>" for every dev cycle.

Cheers
Peter

@pit007
Ok, then it looks like these containers were created by livepush according to their Entrypoint: livepush/container.ts at c182d1191d66e064ce0c1107320573b5edd5f257 · balena-io-modules/livepush · GitHub

The question is why they weren't removed by balenad.

You can delete all the unused container folders with the following steps (see the consolidated sketch after this list):

  • df -i to check the available inodes (look at /mnt/data);
  • balena ps -aq → gives you the IDs of the existing containers (currently ab36529295bb and 78f457c98e99);
  • cd /mnt/data/docker/containers && ls | grep -v ab36529295bb | grep -v 78f457c98e99 | xargs rm -rf (one grep -v per existing container) to delete all folders except the 2 from above;
  • df -i to check the available inodes again.
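
Putting those steps together, a minimal sketch (the two IDs are the ones from this particular device, so verify them with balena ps -aq first; stopping the engine beforehand is an extra precaution on my side, not strictly part of the steps above):

# run on the device's host OS
systemctl stop balena                 # precaution so the engine doesn't race the cleanup
keep1=ab36529295bb                    # application container ID (from balena ps -aq)
keep2=78f457c98e99                    # supervisor container ID
cd /mnt/data/docker/containers
ls | grep -v "$keep1" | grep -v "$keep2" | xargs rm -rf
systemctl start balena
df -i /mnt/data                       # confirm the inodes were freed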