balena-engine not starting

pit007 · January 26, 2021, 2:34pm

Hi, I have a problem with one of my clients Balena applications.
"balena push " failed:

$ balena push 192.168.1.147 --debug
[debug] new argv=[/home/peter/.nvm/versions/node/v12.18.4/bin/node,/home/peter/.nvm/versions/node/v12.18.4/bin/balena,push,192.168.1.147] length=4
[debug] Using . as build source
[Debug]   Checking we can access device
[Debug]   Sending request to http://192.168.1.147:48484/ping
Could not communicate with local mode device at address 192.168.1.147

Additional information may be available with the `--debug` flag.
For help, visit our support forums: https://forums.balena.io
For bug reports or feature requests, see: https://github.com/balena-io/balena-cli/issues/

Using ssh to log into the device, I ran “systemctl status balena-engine”, which gave this output:

● balena.service - Balena Application Container Engine
   Loaded: loaded (/lib/systemd/system/balena.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/balena.service.d
           └─balena.conf
   Active: activating (start) since Tue 2021-01-26 14:30:27 UTC; 1s ago
     Docs: https://www.balena.io/docs/getting-started
 Main PID: 31222 (balenad)
    Tasks: 1207 (limit: 2062)
   Memory: 708.3M
   CGroup: /system.slice/balena.service
           ├─ 1375 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1372
           ├─ 1378 /bin/sh /usr/lib/balena/balena-healthcheck
           ├─ 1382 balena info
           ├─ 1466 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1465
           ├─ 1467 /bin/sh /usr/lib/balena/balena-healthcheck
           ├─ 1473 balena info
           ├─ 1774 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1773
           ├─ 1775 /bin/sh /usr/lib/balena/balena-healthcheck
           ├─ 1781 balena info
           ├─ 1813 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1812
           ├─ 1814 /bin/sh /usr/lib/balena/balena-healthcheck
           ├─ 1820 balena info
           ├─ 1889 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 1888

.... skipping a couple of 100 similar lines ...

           ├─31277 balena info
           ├─31310 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31309
           ├─31346 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31345
           ├─31383 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31382
           ├─31424 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31422
           ├─31461 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31460
           ├─31462 /bin/sh /usr/lib/balena/balena-healthcheck
           ├─31468 balena info
           ├─31499 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31498
           ├─31539 /usr/bin/balenad --experimental --log-driver=journald -s aufs -H fd:// -H unix:///var/run/balena.sock -H unix:///var/run/balena-engine.sock -H tcp://0.0.0.0:2375 --dns 10.114.1>
           ├─31540 /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 31539
           ├─31541 /bin/sh /usr/lib/balena/balena-healthcheck
           └─31547 balena info

Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.923971552Z" level=warning msg="failed to load plugin io.containerd.internal.v1.opt" error="mkdir /opt: read-only file system"
Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.926699346Z" level=info msg="loading plugin \"io.containerd.grpc.v1.snapshots\"..." type=io.containerd.grpc.v1
Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.927134708Z" level=info msg="loading plugin \"io.containerd.grpc.v1.tasks\"..." type=io.containerd.grpc.v1
Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.933444092Z" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.935423558Z" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
Jan 26 14:31:43 4ddd44d balenad[31539]: time="2021-01-26T14:31:43.996124188Z" level=info msg=serving... address=/var/run/balena-engine/containerd/balena-engine-containerd-debug.sock
Jan 26 14:31:44 4ddd44d balenad[31539]: time="2021-01-26T14:31:44.003817625Z" level=info msg=serving... address=/var/run/balena-engine/containerd/balena-engine-containerd.sock.ttrpc
Jan 26 14:31:44 4ddd44d balenad[31539]: time="2021-01-26T14:31:44.005079960Z" level=info msg=serving... address=/var/run/balena-engine/containerd/balena-engine-containerd.sock
Jan 26 14:31:44 4ddd44d balenad[31539]: time="2021-01-26T14:31:44.012002830Z" level=info msg="containerd successfully booted in 1.611315s"
Jan 26 14:31:44 4ddd44d balenad[31539]: time="2021-01-26T14:31:44.937936029Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd

The key message seems to be:
… msg=“failed to load plugin io.containerd.internal.v1.opt” error=“mkdir /opt: read-only file system”

I am totally lost.
Any help would be very much appreciated.

Cheers
Peter

jtonello · January 26, 2021, 7:05pm

Hi @pit007,

Can you tell us a little more about the project you’re pushing and the kind of device? Are you using balenaCloud or openBalena?

Generally speaking, using balena push <IP> is a local mode command, which pushes your image(s) to the device directly without going through the balena build servers. A more typical deployment command is balena push <app-name>. In order for local push to work, you must enable it by selecting “Enable local mode” from the device menu (near the Reboot and Restart buttons).

Can you also confirm that your device is running and the HostOS is accessible from the dashboard? Or are you just using the command line (with openBalena)?

John

pit007 · January 27, 2021, 4:55pm

Hi jtonello,

thanks for your reply.
The target is Raspberry Pi Compute Module on a custom I/O-Board. The compute module is probably CM3. Wie are using balenaCloud. We have been developing for several months without any problems, usually using "balena push ".
In the last weeks, my system got slower and less responsive. Using “ps” I saw a large number (> 200) instances of the balena health checker. Sometimes the machine took hours to boot. Then I tried “balena image -prune” which seemed to improve things slightly.
However, last weekend at yet another deploy cycle I ran into the problems described.
The device’s HostOS is accessible through Balena dashboard.

Thanks for your help.
Cheers
Peter

zwhitchcox · January 28, 2021, 2:31am

Hi Peter, sorry to hear you’re having trouble with your Balena container engine.

Can you please provide the OS version and supervisor version (can be found in your balena dashboard device ui)?

Also, it might be helpful if you could grant support access for our OS team (can be found in the “Actions” panel).

pit007 · January 28, 2021, 10:00am

Hi zwhichcox,

thanks for your reply.
The OS-Version is balenaOS 2.51.1+rev1, supervisor version 11.4.10.
I’m not sure, what you need to identify the device, anyways, the BALENA_DEVICE_UUID is 4ddd44df5ce0df554d22c21a884fa9d8.

Thanky you very much for your help.
Peter

mbalamat · January 28, 2021, 2:14pm

Hello, can you power on the device so I can take a look?

pit007 · January 28, 2021, 3:23pm

HI,

it has been running already.

Cheers
Peter

klutchell · January 28, 2021, 4:30pm

Hey there @pit007, it looks like the device has exceeded the max inode count on the data partition, which is causing issues with creating containers, starting the supervisor, and a whole host of other issues.

Could you try running this command to get an idea of where most of the inodes are used?

du --inodes -xSs /mnt/data/*/*

Once we see where the most inodes are taken up, we can provide more detail by adding to that path and running the same command, for example:

# for example, if /mnt/data/docker/volumes had the highest number above
du --inodes -xSs /mnt/data/docker/volumes/*

With that narrowed down, we may need to start deleting files for the OS to start behaving correctly. If those files were generated by the hostOS and not any of your containers, we’ll have to open an issue to fix it in the OS.

For now, we’ve opened an issue to start reporting free inodes when running healthchecks:

pit007 · January 28, 2021, 7:02pm

Hi Kyle,

thanks for your reply.

I issued the command:

du --inodes -xSs /mnt/data//

1 /mnt/data/docker/aufs
2 /mnt/data/docker/builder
4 /mnt/data/docker/buildkit
1 /mnt/data/docker/containerd
1 /mnt/data/docker/containers
1 /mnt/data/docker/image
1 /mnt/data/docker/network
1 /mnt/data/docker/plugins
1 /mnt/data/docker/runtimes
1 /mnt/data/docker/swarm
1 /mnt/data/docker/tmp
1 /mnt/data/docker/trust
2 /mnt/data/docker/volumes
2 /mnt/data/resin-data/resin-supervisor

That’s all there is.

I’ll keep the device powered through the night.
If you would like to look at it again, feel free to do so.
I would like to reflash the device tomorrow (Friday).

Thanks for your help.
Peter

klutchell · January 28, 2021, 7:30pm

Hey Peter, I’m looking at your device now. The command I provided you earlier wasn’t accurate, I fixed it up a bit and found that there are hundreds of containers taking up 10k inodes each on this device.

du --inodes -xSs /mnt/data/docker/containers/*

So the balena image prune command you ran before wouldn’t help much as you said. I would recommend trying balena system prune instead to clean up all the old containers before it gets to this state.

That being said, containers shouldn’t be sticking around after each balena push attempt, so there’s something else we may be missing here. I’ll ask internally for suggestions.

pit007 · January 28, 2021, 8:22pm

Hi Kyle,

yes, I had tried that before, but without success:
Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?

As I said, I’ll leave the machine running, in case you would like to continue investigating.

Cheers
Peter

20k-ultra · January 28, 2021, 10:27pm

Hey, we see that there are 187 containers in /mnt/data/docker/containers and some of these containers are using upwards of 9000 inodes. We knew it was coming from that folder because of the following command:

root@4ddd44d:/var/lib/docker# df -i
Filesystem                      Inodes   IUsed  IFree IUse% Mounted on
devtmpfs                        110016     440 109576    1% /dev
tmpfs                           111936      14 111922    1% /tmp
/dev/mmcblk0p2                   39936    6056  33880   16% /mnt/sysroot/active
/dev/disk/by-state/resin-state    5136      70   5066    2% /mnt/state
none                             39936    6056  33880   16% /
tmpfs                           111936       1 111935    1% /dev/shm
tmpfs                           111936     683 111253    1% /run
tmpfs                           111936      14 111922    1% /sys/fs/cgroup
/dev/mmcblk0p1                       0       0      0     - /mnt/boot
tmpfs                           111936      44 111892    1% /var/volatile
/dev/mmcblk0p6                 1819648 1819648      0  100% /mnt/data

Note how /mnt/data has used 100% of the Inodes available.

Now we are trying to figure out how so many containers have been left behind and why they have so many inodes. Does the application you’re developing in localmode open files in some kind of loop ? Is your application using Python also ? I’m asking because the containers all have some python env variable set and this will confirm if it’s your container or something else.

20k-ultra · January 28, 2021, 10:32pm

How often have you seen this btw ? Are you developing in the same way on multiple devices and this is the only device that’s experiencing this issue ? How long have you been developing with local mode ? I wonder if this is something that has built up over time. We could confirm by looking at all the containers in /mnt/data/docker/containers which have a creation timestamp if needed. What version of the CLI are you using and what is your operating system ?

pit007 · January 29, 2021, 7:34am

Hi Miguel,

yes our application is written in Python. However we are not using any files.
We talk to some special hardware.

Cheers
Peter

pit007 · January 29, 2021, 7:40am

Hi Miguel,

personally I have been experiencing problems for a number of weeks:
The system became slower and slower to boot. There was a huge number of
healtcheck processes. Sometimes I used “balena image purge” which did help
slightly. Unfortunately I’m not very familiar with docker, so I never tried
“balena system purge” until recently.
Some of my collegues experienced similar problems and had to reflash their
devices. Apparently none of the really investigated the problem.
The balena-cli I’m using is version 12.22.2.

I will now reflash the device in order to continue to keep developing.

Thanks for investigating and your help.

Regards
Peter

zvin · January 29, 2021, 4:07pm

Hello @pit007

I’ve looked a bit at your device.
It is running out of inodes on the data partition.
Most inodes are used by folders in /mnt/data/docker/containers/
You have 190 folders here, you should have 2: one per container.

I don’t know why they weren’t removed.
Most of these leftover folders have a config.v2.json file containing something like:

{
  "StreamConfig": {},
  "State": {
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "RemovalInProgress": false,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 255,
    "Error": "",
    "StartedAt": "2021-01-24T16:41:09.572144356Z",
    "FinishedAt": "2021-01-25T15:21:11.183074408Z",
    "Health": null
  },
  "ID": "2206a5266182734e425992f558e5fafa6369f4deb46855f879d627b0ee8037c4",
  "Created": "2021-01-24T16:41:06.993921427Z",
  "Managed": false,
  "Path": "/bin/sh",
  "Args": [
    "-c",
    "while true; do sleep 3600; done"
  ],
  "Config": {
    "Hostname": "2206a5266182",
    "Domainname": "",
    "User": "",
    "AttachStdin": false,
    "AttachStdout": false,
    "AttachStderr": false,
    "Tty": false,
    "OpenStdin": false,
    "StdinOnce": false,
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "LC_ALL=C.UTF-8",
      "DEBIAN_FRONTEND=noninteractive",
      "UDEV=off",
      "LANG=C.UTF-8",
      "PYTHON_VERSION=3.7.4",
      "PYTHON_PIP_VERSION=19.2.1",
      "SETUPTOOLS_VERSION=41.0.1",
      "PYTHON_DBUS_VERSION=1.2.8",
      "PYTHONPATH=/usr/lib/python3/dist-packages::/usr/src/",
      "LD_LIBRARY_PATH=/usr/local/lib"
    ],
    "Cmd": null,
    "Image": "be4850ca41a6",
    "Volumes": null,
    "WorkingDir": "/usr/src",
    "Entrypoint": [
      "/bin/sh",
      "-c",
      "while true; do sleep 3600; done"
    ],
    "OnBuild": null,
    "Labels": {
      "io.balena.architecture": "armv7hf",
      "io.balena.qemu.version": "4.0.0+balena-arm"
    }
  },
  "Image": "sha256:be4850ca41a6280ef489d85bfe45b5f6a0f1ceacc7f862cbb991a9a4a5d9547a",
  "NetworkSettings": {
    "Bridge": "",
    "SandboxID": "7d296765ad28216769c7c06f453d45e3cda20ca0613850558dcf8ac7cadac9c9",
    "HairpinMode": false,
    "LinkLocalIPv6Address": "",
    "LinkLocalIPv6PrefixLen": 0,
    "Networks": {
      "bridge": {
        "IPAMConfig": null,
        "Links": null,
        "Aliases": null,
        "NetworkID": "9b003273792ecd6a841fa27af8e7fb7413dbb93ab924ac7406a1c72bee8acd12",
        "EndpointID": "867ecae1f82fe41be9488cccab4c0f2ae0c7e29b7a3feed90b5c2e0d2dd92334",
        "Gateway": "10.114.101.1",
        "IPAddress": "10.114.101.3",
        "IPPrefixLen": 24,
        "IPv6Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "MacAddress": "02:42:0a:72:65:03",
        "DriverOpts": null,
        "IPAMOperational": false
      }
    },
    "Service": null,
    "Ports": {},
    "SandboxKey": "/var/run/balena-engine/netns/7d296765ad28",
    "SecondaryIPAddresses": null,
    "SecondaryIPv6Addresses": null,
    "IsAnonymousEndpoint": true,
    "HasSwarmEndpoint": false
  },
  "LogPath": "",
  "Name": "/trusting_mendeleev9",
  "Driver": "aufs",
  "OS": "linux",
  "MountLabel": "",
  "ProcessLabel": "",
  "RestartCount": 0,
  "HasBeenStartedBefore": true,
  "HasBeenManuallyStopped": false,
  "MountPoints": {},
  "SecretReferences": null,
  "ConfigReferences": null,
  "AppArmorProfile": "",
  "HostnamePath": "/var/lib/docker/containers/2206a5266182734e425992f558e5fafa6369f4deb46855f879d627b0ee8037c4/hostname",
  "HostsPath": "/var/lib/docker/containers/2206a5266182734e425992f558e5fafa6369f4deb46855f879d627b0ee8037c4/hosts",
  "ShmPath": "",
  "ResolvConfPath": "/var/lib/docker/containers/2206a5266182734e425992f558e5fafa6369f4deb46855f879d627b0ee8037c4/resolv.conf",
  "SeccompProfile": "",
  "NoNewPrivileges": false
}

These look like containers that ran a balenalib python base image according to the env vars and labels: base-images/Dockerfile at master · balena-io-library/base-images · GitHub .

See the Entrypoint: they were running an empty while loop.

I guess we can simply remove all folders here except ab36529295bb08d2939fdbd45d719a66d29919144f9bf0c6733e167e61424dea 'your app) and 78f457c98e99fe43323cb8157657041ac726711d4fe774c02aa330e8e7161699 (the supervisor) to free up inodes.

I wonder what created these containers (and folders). Do you have any clue ?
You can run ls /mnt/data/docker/containers/ | wc -l to ocunt the number of folders.
Can you check what in your workflow is creating folders here ?

zvin · January 29, 2021, 4:22pm

@pit007 Do you (or did you) use livepush ?

pit007 · January 29, 2021, 4:45pm

Hi Alexis,

no, I have no idea why these files/folders were not removed.
I don’t know enough about how docker behaves or is supposed to behave.
From now on, I will issue a “balena image prune” and “balena system prune”
regularly.

Thanks for your investigation and help.
Peter

pit007 · January 29, 2021, 4:47pm

Hi Alexis,

well, probably yes: I usually forget add the --nolive option.
I usually run "balena push " for every dev cycle.

Cheers
Peter

zvin · January 29, 2021, 5:01pm

@pit007
Ok, then it looks like these containers were created by livepush according to their Entrypoint: livepush/container.ts at c182d1191d66e064ce0c1107320573b5edd5f257 · balena-io-modules/livepush · GitHub

The question is why weren’t they removed by balenad.

You can delete all unused containers folders with the following:

df -ito check the available inodes (look at /mnt/data);
balena ps -aq → gives you the ids of the existing containers (currently ab36529295bb and 78f457c98e99);
ls /mnt/data/docker/containers | grep -v ab36529295bb | grep -v 78f457c98e99 | xargs rm -rf (one grep -v per existing container) to delete all folders except the 2 from above;
df -ito check the available inodes again;

Topic		Replies	Views
How to get rid of Online (VPN only) mode in a beaglebone-black device Product support	14	839	January 15, 2021
Why Balena push created new container for every changes? balenaOS	31	2306	May 31, 2019
Balena Engine won't start after Host OS update Product support raspberrypi4	12	1036	May 15, 2023
Developing on a Balena device Product support	13	1883	August 16, 2019
Balena OS supervisor takes too much time to start balenaOS	8	1226	June 20, 2019

balena-engine not starting

du --inodes -xSs /mnt/data//

Related topics