Container stuck in status downloaded/installed

We have a multi-container setup that is running fine on a lot of devices. On one device one container in the status “Installed” and the containers which depend on that one in “Downloaded” (2 other containers are running fine). I’m not exactly sure when this happened but the device is out in the field connected via LTE. The device is running on version balenaOS 2.45.1+rev4 of the UpBoard Image. A restart of the device does not solve the issue.

When checking the logs, I get this output:

Sep 15 07:34:10 af1a482 12d07c71a571[782]: [info]    Applying target state
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [warn]    Ignoring unsupported or unknown compose fields: links
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [warn]    Ignoring unsupported or unknown compose fields: containerName
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [warn]    Ignoring unsupported or unknown compose fields: containerName
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [warn]    Ignoring unsupported or unknown compose fields: links
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [warn]    Ignoring unsupported or unknown compose fields: containerName, links
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [warn]    Ignoring unsupported or unknown compose fields: containerName
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [event]   Event: Service start {"service":{"appId":1762257,"serviceId":777636,"serviceName":"zookeeper","releaseId":1593546}}
Sep 15 07:34:10 af1a482 balenad[782]: time="2021-09-15T07:34:10.661781481Z" level=warning msg="Failed to allocate and map port 32181-32181: Bind for 0.0.0.0:32181 failed: port is already allocated"
Sep 15 07:34:10 af1a482 balenad[782]: time="2021-09-15T07:34:10.728397992Z" level=error msg="6fcd28581c4039af460a3fc831ae13b5f02e457469f5167cdc9b6c8b61df3336 cleanup: failed to delete container from containerd: no such container"
Sep 15 07:34:10 af1a482 balenad[782]: time="2021-09-15T07:34:10.728500331Z" level=error msg="Handler for POST /containers/6fcd28581c4039af460a3fc831ae13b5f02e457469f5167cdc9b6c8b61df3336/start returned error: driver failed programming external connectivity on endpoint zookeeper_2931171_1593546 (a10cb57b0d7db21c5b1b2a9d2cd22c885458ae5c5ecea3c46cc6a8faf8300b79): Bind for 0.0.0.0:32181 failed: port is already allocated"
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]   Scheduling another update attempt in 900000ms due to failure:  Error: (HTTP code 500) server error - driver failed programming external connectivity on endpoint zookeeper_2931171_1593546 (a10cb57b0d7db21c5b1b2a9d2cd22c885458ae5c5ecea3c46cc6a8faf8300b79): Bind for 0.0.0.0:32181 failed: port is already allocated 
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]         at /usr/src/app/dist/app.js:596:108431
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at /usr/src/app/dist/app.js:596:108394
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at m.buildPayload (/usr/src/app/dist/app.js:596:108404)
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:596:107904)
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at IncomingMessage.emit (events.js:194:15)
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at endReadableNT (_stream_readable.js:1125:12)
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at process._tickCallback (internal/process/next_tick.js:63:19)
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]   Device state apply error Error: (HTTP code 500) server error - driver failed programming external connectivity on endpoint zookeeper_2931171_1593546 (a10cb57b0d7db21c5b1b2a9d2cd22c885458ae5c5ecea3c46cc6a8faf8300b79): Bind for 0.0.0.0:32181 failed: port is already allocated 
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]         at /usr/src/app/dist/app.js:596:108431
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at /usr/src/app/dist/app.js:596:108394
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at m.buildPayload (/usr/src/app/dist/app.js:596:108404)
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:596:107904)
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at IncomingMessage.emit (events.js:194:15)
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at endReadableNT (_stream_readable.js:1125:12)
Sep 15 07:34:10 af1a482 12d07c71a571[782]: [error]       at process._tickCallback (internal/process/next_tick.js:63:19)
Sep 15 07:34:57 af1a482 balenad[782]: time="2021-09-15T07:34:57.626176324Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/303b4949d34d377e38ff949db9b26ca3fa2f62a9b69221a28f641b8742fa22c9/shim.sock debug=false pid=27925
Sep 15 07:34:58 af1a482 balenad[782]: time="2021-09-15T07:34:58.687460213Z" level=info msg="shim reaped" id=303b4949d34d377e38ff949db9b26ca3fa2f62a9b69221a28f641b8742fa22c9
Sep 15 07:34:58 af1a482 balenad[782]: time="2021-09-15T07:34:58.695717331Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 15 07:35:09 af1a482 deeb61aacf0e[782]: waiting
Sep 15 07:37:09 af1a482 deeb61aacf0e[782]: waiting
Sep 15 07:37:59 af1a482 balenad[782]: time="2021-09-15T07:37:59.475858027Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/ae60a16271ee5bfb59d3ed5e7d3cee1cf322c4cb15fda772bb254393f60645e5/shim.sock debug=false pid=28072
Sep 15 07:38:00 af1a482 balenad[782]: time="2021-09-15T07:38:00.534786559Z" level=info msg="shim reaped" id=ae60a16271ee5bfb59d3ed5e7d3cee1cf322c4cb15fda772bb254393f60645e5
Sep 15 07:38:00 af1a482 balenad[782]: time="2021-09-15T07:38:00.543598115Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 15 07:38:08 af1a482 12d07c71a571[782]: [api]     GET /v1/healthy 200 - 2.649 ms
Sep 15 07:39:09 af1a482 deeb61aacf0e[782]: waiting
Sep 15 07:41:01 af1a482 balenad[782]: time="2021-09-15T07:41:01.318956737Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/534a250b8c9f0361f3e72e147a77c26633897ae4b95e6aed6ccbf5dfad023e53/shim.sock debug=false pid=28277
Sep 15 07:41:02 af1a482 balenad[782]: time="2021-09-15T07:41:02.387840307Z" level=info msg="shim reaped" id=534a250b8c9f0361f3e72e147a77c26633897ae4b95e6aed6ccbf5dfad023e53
Sep 15 07:41:02 af1a482 balenad[782]: time="2021-09-15T07:41:02.396218195Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 15 07:41:09 af1a482 deeb61aacf0e[782]: waiting
Sep 15 07:41:36 af1a482 12d07c71a571[782]: [debug]   Attempting container log timestamp flush...
Sep 15 07:41:36 af1a482 12d07c71a571[782]: [debug]   Container log timestamp flush complete

In another thread I found the advice to run netstat -lauten which returns

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp        0      0 127.0.0.2:53            0.0.0.0:*               LISTEN      
tcp        0      0 10.114.102.1:53         0.0.0.0:*               LISTEN      
tcp        0    493 <LTE_IP>:58368    3.227.28.93:443         ESTABLISHED 
tcp        0      0 127.0.0.1:59448         127.0.0.1:48484         TIME_WAIT   
tcp        0      0 :::80                   :::*                    LISTEN      
tcp        0      0 :::8081                 :::*                    LISTEN      
tcp        0      0 :::8082                 :::*                    LISTEN      
tcp        0      0 :::8181                 :::*                    LISTEN      
tcp        0      0 :::32181                :::*                    LISTEN      
tcp        0      0 :::502                  :::*                    LISTEN      
tcp        0      0 :::1880                 :::*                    LISTEN      
tcp        0      0 :::8442                 :::*                    LISTEN      
tcp        0      0 :::8123                 :::*                    LISTEN      
tcp        0      0 :::8000                 :::*                    LISTEN      
tcp        0      0 :::48484                :::*                    LISTEN      
tcp        0      0 :::62662                :::*                    LISTEN      
tcp        0      0 :::22222                :::*                    LISTEN      
tcp        0   2080 ::ffff:10.240.2.193:22222 ::ffff:52.4.252.97:38174 ESTABLISHED 
udp        0      0 127.0.0.1:323           0.0.0.0:*                           
udp        0      0 0.0.0.0:58494           0.0.0.0:*                           
udp        0      0 0.0.0.0:5353            0.0.0.0:*                           
udp        0      0 0.0.0.0:5353            0.0.0.0:*                           
udp        0      0 0.0.0.0:51558           0.0.0.0:*                           
udp        0      0 127.0.0.2:53            0.0.0.0:*                           
udp        0      0 10.114.102.1:53         0.0.0.0:*                           
udp        0      0 ::1:323                 :::*                                
udp        0      0 :::57683                :::*                                
udp        0      0 :::5353                 :::*                                
udp        0      0 :::5353                 :::*                                
udp        0      0 :::60221                :::*     

Further more I check the disc and when comparing with other devices I’d assume /mnt/state shouldn’t be full?!

Filesystem                      Size  Used Avail Use% Mounted on
devtmpfs                        3.9G     0  3.9G   0% /dev
/dev/disk/by-label/resin-rootA  300M  294M     0 100% /mnt/sysroot/active
/dev/disk/by-label/resin-state   19M   18M     0 100% /mnt/state
none                            300M  294M     0 100% /
tmpfs                           3.9G  360K  3.9G   1% /dev/shm
tmpfs                           3.9G   18M  3.9G   1% /run
tmpfs                           3.9G     0  3.9G   0% /sys/fs/cgroup
tmpfs                           3.9G     0  3.9G   0% /tmp
/dev/mmcblk0p1                   40M  650K   40M   2% /mnt/boot
tmpfs                           3.9G  156K  3.9G   1% /var/volatile
/dev/mmcblk0p6                   57G  5.5G   49G  11% /mnt/data

Basically it looks to me like something is already occupying the port 32181 - I don’t know what and why this happens in this case though.

Apparently NetworkManager spammed the /mnt/state partition with temporary timestamp files (seems to be an old issue. I fixed that as well as some temporary aufs files filling up the root partition. I then rebooted the device. The issue however is still persisting.
Maybe something got stuck when the partitions were full. Is there any further way to reset things than a reboot?

Hey @hesch, there’s a file in your /mnt/state partition called remove_me_to_reset that when removed and rebooted will clear the state partition. We are thinking of implementing something similar for the data partition as well, but mostly to test recovery scenarios and not to be used in production.

A workaround one of my colleagues posted may work for you: Port already in use, because proxy keeps binding to the wrong container IP · Issue #272 · balena-os/balena-engine · GitHub

ip link delete supervisor0
balena rm --force $(balena ps -aq)
balena network prune -f
systemctl restart balena
1 Like

Hello @hesch

Please let us know if the suggested workaround successful.

Thanks - Mark

Hey and sorry for the late reply - I already swapped the device in the field before @klutchell 's reply and now finally tested the workaround at home. After that the containers are starting again and the device looks healthy. Thanks a lot, I’ll keep that in mind if it ever happens again.

ps: the remove_me_to_reset is cool for the state partition, would be great to have it for data too since more recovery options are always good.

Hi Henning,

We are happy to hear that the workaround worked for you. If you need any further help please do not hesitate to get in touch with us.

Kind regards,
Genadi