Balena-engine-containerd connection issues: Physical power cycle required?

When I start seeing error messages like those below, is the only remedy a physical power cycle? https://dashboard.balena-cloud.com/devices/5f8c6096c1c3f73c129cfd06bdf81263

Terminal:
connection error: desc = “transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused”:unknown
SSH session disconnected
SSH reconnecting…
Spawning shell…
connection error: desc = “transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused”:unknown

Logs:
25.02.19 06:01:24 (-0800) Killing service ‘barnserv sha256:8e4ae165ae5ddf059ad8291f849c591e2713f1ce6bfd1a140f70310f9ccc605b’
25.02.19 06:01:25 (-0800) Failed to kill service ‘camera sha256:6ebb4910dc1e6d6647f0230be9b29ea6888abeeb1abeade0be8a26e8875f6664’ due to '(HTTP code 500) server error - cannot stop container: e57d6ade032f2c03aefc4527bd4e18b481b3c994624fb872f3b99d98f9f890be: Cannot kill container e57d6ade032f2c03aefc4527bd4e18b481b3c994624fb872f3b99d98f9f890be: connection error: desc = “transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused”: unknown ’
25.02.19 06:01:26 (-0800) Failed to kill service ‘barnserv sha256:8e4ae165ae5ddf059ad8291f849c591e2713f1ce6bfd1a140f70310f9ccc605b’ due to '(HTTP code 500) server error - cannot stop container: 89e657e9eacaa5e71ac68a9a33de916e460e6626eaec565fc46a36bcc699b9a8: Cannot kill container 89e657e9eacaa5e71ac68a9a33de916e460e6626eaec565fc46a36bcc699b9a8: connection error: desc = “transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused”: unknown ’

This happened again, after an update had been downloaded and was ready to install. Device:
https://dashboard.balena-cloud.com/devices/5f8c6096c1c3f73c129cfd06bdf81263/summary

28.02.19 09:47:29 (-0800) Killing service ‘barnserv sha256:012b222206904d18b3d09e548794c240cf363a9570c23f443de0acbcb3b909c7’
28.02.19 09:47:31 (-0800) Failed to kill service ‘barnserv sha256:012b222206904d18b3d09e548794c240cf363a9570c23f443de0acbcb3b909c7’ due to '(HTTP code 500) server error - cannot stop container: 3c196aa7441257665950438ed7045585a86144a14231d69f30aa08cc20a9ee34: Cannot kill container 3c196aa7441257665950438ed7045585a86144a14231d69f30aa08cc20a9ee34: connection error: desc = “transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused”: unknown ’
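For anyone comparing several of these log dumps, here is a minimal sketch that pulls out which services failed to stop and the container each one maps to. It assumes the log text has been copied off the dashboard into a string, and that the lines follow the same shape as the ones pasted above. The function name `failed_kills` and the regex are my own illustration, not part of any balena tooling.

```python
import re

# Matches the "Failed to kill service" lines shown in this thread.
# Accepts both curly and straight quotes, since the forum rewrites them.
FAILED = re.compile(
    r"Failed to kill service [‘'](?P<service>\S+) sha256:(?P<image>[0-9a-f]{64})[’']"
    r".*?Cannot kill container (?P<container>[0-9a-f]{64})"
)


def failed_kills(log_text):
    """Return (service name, short container id) for each failed kill."""
    return [
        (m["service"], m["container"][:12])
        for m in FAILED.finditer(log_text)
    ]
```

Feeding it the two log excerpts above should list `camera` and `barnserv`, each with the first 12 characters of its container ID. That makes it easier to see whether the same container is affected each time.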

I tracked the root issue down to this bug https://github.com/moby/moby/issues/36002 which I was able to fix via a backport of the upstream fix.

But, as I describe here, this leads us to a more severe issue that is harder to work around and has no immediate solution.

I am actually curious why this happens to you on a freshly deployed device, since I suspected it only happens when containerd is OOM-killed.
Can you try running: pkill -HUP balena-engine-daemon and see if that solves it…
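To tell whether the signal actually helped, it may be useful to probe the containerd socket directly before and after sending it. Below is a minimal sketch, assuming a Python 3 interpreter is available somewhere the socket is visible (the balenaOS host itself may not ship one). The socket path is taken from the error messages above; the function name `socket_status` is hypothetical.

```python
import socket

# Path copied from the error messages in this thread.
SOCK = "/var/run/balena-engine/containerd/balena-engine-containerd.sock"


def socket_status(path=SOCK, timeout=2.0):
    """Return 'ok', 'refused', 'missing', or 'error: ...' for a unix socket."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        return "ok"
    except ConnectionRefusedError:
        # Socket file exists but nothing is listening on it.
        return "refused"
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        return f"error: {exc}"
    finally:
        s.close()
```

A result of `refused` corresponds to the "connect: connection refused" error in the logs: the socket file is still there, but nothing is listening on it. If `socket_status()` changes from `refused` to `ok` after the SIGHUP, that would suggest the daemon brought containerd back up.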

The Jetson TX2 device has locked up this way repeatedly. If there is more information that I can provide, please let me know. I can still flash it with a newer balenaOS.

One of the possible sources of this problem is the need to start one of the containers. However it appears that the other container is the one that gets into this state. The only remedy appears to be a physical reset or reboot.

I don’t quite understand the last part.

One of the possible sources of this problem is the need to start one of the containers. However it appears that the other container is the one that gets into this state

You mean you are running a single container already and want to start a second one, which leads to the first one getting into this state?

Did you try sending SIGHUP to the balenaEngine daemon? Just so we can verify it’s not related to PR#149.

Sorry, we have a multi-container device. When the device gets into this state I can’t connect to it using the balena dashboard. I haven’t tried the balena CLI.

My point about “starting a container” should have been “restarting a container”.

Hmm, if it were the issue I suspected, you should still be able to log in to the host OS via the dashboard console.

Is the device still experiencing this? If so, could you refresh support access for me? I would like to take another look…

The device is off now. When I get one into this state again, and can keep it on, I will let you know!

Thanks :+1: