Failed to stop/kill containers

Similar to
https://forums.balena.io/t/what-does-status-unhealthy-mean-and-why-wont-this-container-pause/5279
https://forums.balena.io/t/balena-engine-containerd-connection-issues-physical-power-cycle-required/5438

The following error occurs when trying to reboot, restart, update, stop, or kill containers on a Jetson Nano device:

connection error: desc = "transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused": unknown

The device seems to be completely locked up because of this error.
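
For anyone who wants to check for the same condition, here is a minimal Python sketch that tests whether the socket named in the error accepts connections. The socket path is copied from the error message; the function name is made up for illustration, and the sketch assumes a Python interpreter is available where it runs, which may not be the case on the host OS.

```python
import socket

# Path copied from the error message in this post.
CONTAINERD_SOCK = "/var/run/balena-engine/containerd/balena-engine-containerd.sock"


def socket_accepts_connections(path: str = CONTAINERD_SOCK, timeout: float = 2.0) -> bool:
    """Return True if a Unix stream socket at `path` accepts a connection.

    Hypothetical helper: it only checks that something is listening,
    not that the service behind the socket is healthy.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers "connection refused", a missing socket file,
        # permission errors, and timeouts.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    print("containerd socket reachable:", socket_accepts_connections())
```

A False result here matches the "connection refused" symptom, but it does not say why containerd stopped listening.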

Remote support access is enabled and the device is reachable at:
UUID: 841a3e74d517c409a34176245d0d10e9

Hi,
It looks like containerd on your device stopped working. In newer versions of balenaOS we enhanced our health checks to automatically detect this kind of failure (see https://github.com/balena-os/meta-balena/issues/1391).

I will try restarting the balena engine service on your device, if you don’t mind.

If you will, please. I don’t have physical access to this device.

The balena engine service has been restarted, and the device seems to be functioning normally. From what I can see, it applied the target release successfully.

I would suggest upgrading to the latest available version of balenaOS. As I mentioned, the health checks there are improved, and such failures should be handled automatically. Please note, though, that this device runs a development build of the OS, so you will need to upgrade it manually by reflashing the device - we currently don’t support over-the-air updates for development builds.
If it’s a production device, we strongly recommend switching to a production build.

Also, please note that you have container_name set on one of the services in your docker-compose file. It is not supported by the supervisor running on the device and is simply ignored. It’s OK to leave it there - I just want to make sure you understand it has no effect.

Please let us know if the device looks good now.

Hi,
I have a similar problem.

I am running balenaOS 2.47.0+rev1, which is the latest available for a Raspberry Pi 3B+ and rosetta@home. I have three identical devices, all running the same OS. One of them keeps failing with these messages:

Killing service 'ui sha256:677ac37d9eeb9b74025d272bc8c756e85a0fd543642ff7cb4b86062b2e1589cc'
23.09.20 20:52:35 (+0200) Failed to kill service 'ui sha256:677ac37d9eeb9b74025d272bc8c756e85a0fd543642ff7cb4b86062b2e1589cc' due to '(HTTP code 409) unexpected - You cannot remove a running container 93d21d5e6187458187171b803962ac9b4bdc7bab2d079afb9c1f2cecf5b6c2a8. Stop the container before attempting removal or force remove '
23.09.20 20:52:36 (+0200) Killing service 'ui sha256:677ac37d9eeb9b74025d272bc8c756e85a0fd543642ff7cb4b86062b2e1589cc'
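
When sifting through many such log lines, it can help to pull out the ID of the container that refuses to go away. The following is a rough Python sketch that does this for the 409 message quoted above. The function name and the regular expression are illustrative assumptions based only on the wording of that one message, not on any documented log format.

```python
import re
from typing import Optional

# Pattern based solely on the message text quoted in this post;
# other supervisor or engine versions may word it differently.
_RUNNING_CONTAINER = re.compile(
    r"\(HTTP code 409\).*?You cannot remove a running container ([0-9a-f]{64})"
)


def stuck_container_id(log_line: str) -> Optional[str]:
    """Return the 64-character container ID from a 409 'cannot remove' line, else None."""
    match = _RUNNING_CONTAINER.search(log_line)
    return match.group(1) if match else None
```

Lines that only say "Killing service" contain no container ID, so the function returns None for them.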

Any help would be great!

Thanks!
bob

We have the same issue as you on many hundreds of devices.

Has there been any resolution for this?

Hi @gratefulfrog and @shawaj, unfortunately we have observed this type of error when the container refuses to exit. This can happen for different reasons: for example, the container may be holding on to a kernel resource, or it may be using a hardware device that is malfunctioning. Sometimes the output of dmesg gives some information about possible causes of the malfunction.

Trying to kill the container directly using balena kill <container> will generally fail too, because the container itself is in a state where it cannot be killed. There have been reports that running balena rm --force <container> can work, and that restarting the engine using systemctl restart balena can help too, but this may depend on the root cause.
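
The two workarounds mentioned above could be tried in order with something like the Python sketch below. The commands are copied as written in this post; the function names, the injectable `run` callable, and the choice to stop after the first command that exits with status 0 are assumptions made for illustration. The binary or service name may differ on a given OS release, and neither step is guaranteed to work.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def recovery_commands(container_id: str) -> List[List[str]]:
    """Commands reported in this thread, from least to most disruptive."""
    return [
        ["balena", "rm", "--force", container_id],
        ["systemctl", "restart", "balena"],
    ]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status (127 if the binary is missing)."""
    try:
        return subprocess.run(list(cmd), check=False).returncode
    except FileNotFoundError:
        return 127


def try_recovery(
    container_id: str,
    run: Callable[[Sequence[str]], int] = _run,
) -> Optional[List[str]]:
    """Try each command in turn; return the first that exits 0, or None."""
    for cmd in recovery_commands(container_id):
        if run(cmd) == 0:
            return cmd
    return None
```

Passing a fake `run` makes it possible to dry-run the sequence without touching a device. Note that restarting the engine affects every container on the device, not just the stuck one.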

If you have a device with a supervisor that is showing this message right now, please enable support access and let us know so we can take a look, try to pinpoint the underlying cause and run some tests.

Thank you.