Balena-engine-containerd connection issues: Physical power cycle required?

jason10 · February 25, 2019, 2:03pm

When I start seeing error messages like those below, is the only remedy a physical power cycle? https://dashboard.balena-cloud.com/devices/5f8c6096c1c3f73c129cfd06bdf81263

Terminal:
connection error: desc = “transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused”:unknown
SSH session disconnected
SSH reconnecting…
Spawning shell…
connection error: desc = “transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused”:unknown

Logs:
25.02.19 06:01:24 (-0800) Killing service ‘barnserv sha256:8e4ae165ae5ddf059ad8291f849c591e2713f1ce6bfd1a140f70310f9ccc605b’
25.02.19 06:01:25 (-0800) Failed to kill service ‘camera sha256:6ebb4910dc1e6d6647f0230be9b29ea6888abeeb1abeade0be8a26e8875f6664’ due to '(HTTP code 500) server error - cannot stop container: e57d6ade032f2c03aefc4527bd4e18b481b3c994624fb872f3b99d98f9f890be: Cannot kill container e57d6ade032f2c03aefc4527bd4e18b481b3c994624fb872f3b99d98f9f890be: connection error: desc = “transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused”: unknown ’
25.02.19 06:01:26 (-0800) Failed to kill service ‘barnserv sha256:8e4ae165ae5ddf059ad8291f849c591e2713f1ce6bfd1a140f70310f9ccc605b’ due to '(HTTP code 500) server error - cannot stop container: 89e657e9eacaa5e71ac68a9a33de916e460e6626eaec565fc46a36bcc699b9a8: Cannot kill container 89e657e9eacaa5e71ac68a9a33de916e460e6626eaec565fc46a36bcc699b9a8: connection error: desc = “transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused”: unknown ’

jason10 · February 28, 2019, 5:50pm

This happened again, after an update had been downloaded and was ready to install. Device:
https://dashboard.balena-cloud.com/devices/5f8c6096c1c3f73c129cfd06bdf81263/summary

28.02.19 09:47:29 (-0800) Killing service ‘barnserv sha256:012b222206904d18b3d09e548794c240cf363a9570c23f443de0acbcb3b909c7’
28.02.19 09:47:31 (-0800) Failed to kill service ‘barnserv sha256:012b222206904d18b3d09e548794c240cf363a9570c23f443de0acbcb3b909c7’ due to '(HTTP code 500) server error - cannot stop container: 3c196aa7441257665950438ed7045585a86144a14231d69f30aa08cc20a9ee34: Cannot kill container 3c196aa7441257665950438ed7045585a86144a14231d69f30aa08cc20a9ee34: connection error: desc = “transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused”: unknown ’

robertgzr · March 1, 2019, 1:03pm

I tracked the root issue down to this bug https://github.com/moby/moby/issues/36002 which I was able to fix via a backport of the upstream fix.

But as I describe here, this leads us to a more severe issue that is harder to work around and has no immediate solution.

I am actually curious why this happens to you on a freshly deployed device, since I suspected it would only happen when containerd is OOM-killed.
Can you try running pkill -HUP balena-engine-daemon and see if that solves it?
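
To tell whether the SIGHUP helped, it may be useful to probe the containerd socket directly before and after sending it. The following is only a rough sketch, not part of balenaEngine: it assumes a Python 3 interpreter is available somewhere the socket path is visible (which may not be the case on the host OS), and the helper name socket_accepts_connections is made up for illustration. The socket path is the one from the error messages above.

```python
import socket

# Path copied from the error messages in this thread.
CONTAINERD_SOCK = "/var/run/balena-engine/containerd/balena-engine-containerd.sock"


def socket_accepts_connections(path: str = CONTAINERD_SOCK, timeout: float = 2.0) -> bool:
    """Return True if something is listening on the Unix socket at `path`.

    Hypothetical helper for illustration only. It opens a connection and
    closes it again without sending any data.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        try:
            s.connect(path)
        except OSError:
            # Covers "connection refused" (socket file exists, nobody listening),
            # a missing socket file, permission errors, and timeouts.
            return False
        return True


if __name__ == "__main__":
    print("containerd socket reachable:", socket_accepts_connections())
```

If this prints False before the pkill -HUP and True afterwards, that would suggest the daemon was able to bring containerd back without a reboot.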

jason10 · March 1, 2019, 4:05pm

The Jetson TX2 device has locked up this way repeatedly. If there is more information that I can provide, please let me know. I can still flash it with a newer balenaOS.

One of the possible sources of this problem is the need to start one of the containers. However it appears that the other container is the one that gets into this state. The only remedy appears to be a physical reset or reboot.

robertgzr · March 6, 2019, 4:58pm

I don’t quite understand the last part.

“One of the possible sources of this problem is the need to start one of the containers. However it appears that the other container is the one that gets into this state”

You mean you are running a single container already and want to start a second one, which leads to the first one getting into this state?

Did you try sending SIGHUP to the balenaEngine daemon? Just so we can verify it’s not related to PR#149…

jason10 · March 6, 2019, 5:05pm

Sorry, we have a multi-container device. When the device gets into this state I can’t connect to it using the balena dashboard. I haven’t tried the balena CLI.

My point about “starting a container” should have read “restarting a container”.

robertgzr · March 11, 2019, 12:07pm

Hmm, if it were the issue I suspected, you should still be able to log in to the host OS via the dashboard console.

Is the device still experiencing this? If so, could you refresh support access for me? I would like to take another look…

jason10 · March 11, 2019, 2:16pm

The device is off now. When I get one into this state again, and can keep it on, I will let you know!

robertgzr · March 11, 2019, 2:36pm

Thanks
