Persistent "Failed to download image due to 'connect ECONNREFUSED /var/run/balena-engine.sock'" error

Thanks - but I tried that and it errors purging - see above :frowning:

This is still happening. I have no real idea why it happens or how to fix it. I tried deleting /mnt/data, but to no avail.

ECONNREFUSED /var/run/balena-engine.sock

So I just downloaded a clean app image and wrote it to flash with Etcher. It is failing to download the app container with this message.

I’ve been looking at this and I think I have some ideas. I suspect one kind of failure may be OOM on the Pi0. Another kind seems to be watchdog related.

I’ve changed the service watchdog to 60 minutes instead of 6 minutes, and that seems to fix it.

It looks like the balena daemon is spawning the untar process, which is quite slow on MMC for large images. I suspect that watchdog responses are being blocked during the untar. Maybe this needs to be looked at? Perhaps the responses could be put in a thread while the untar operates.
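To illustrate the suggestion above, here is a minimal Python sketch of keeping watchdog responses flowing from a separate thread while a slow blocking task runs. It is not how balena-engine or its watchdog is implemented; `run_with_heartbeat`, `slow_task`, and `notify` are hypothetical names chosen for this example, and `notify` simply stands in for whatever call tells the watchdog the service is still alive.

```python
import threading
import time


def run_with_heartbeat(slow_task, notify, interval=0.05):
    """Run slow_task() while a background thread calls notify() every `interval` seconds.

    Hypothetical helper: `notify` stands in for a watchdog keep-alive call.
    """
    done = threading.Event()

    def beat():
        # Event.wait() returns False on timeout and True once `done` is set,
        # so the loop keeps notifying until the slow task has finished.
        while not done.wait(interval):
            notify()

    worker = threading.Thread(target=beat, daemon=True)
    worker.start()
    try:
        return slow_task()
    finally:
        # Stop the heartbeat even if slow_task() raised.
        done.set()
        worker.join()


if __name__ == "__main__":
    beats = []
    # time.sleep() stands in for a long, blocking unpack step.
    run_with_heartbeat(lambda: time.sleep(0.3), lambda: beats.append(time.monotonic()))
    print(f"heartbeats sent while the task was running: {len(beats)}")
```

The heartbeat only proves that the process is alive, not that the slow task is making progress, so a real design would still need an upper bound on how long the task may run.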

Thank you @ajlennon for your research! We are going to check internally to understand if this is a potential solution!

Let’s stay connected

Any time bug bounty beers cough cough :wink:

Hello, engine maintainer here.

We’ve been looking into ways to optimize this. I know this is an issue on lower spec devices, most notably the pi zero, where it can actually cause the situation you experienced here.
As you observed there’s some unfortunate behaviour where the engine’s healthcheck itself actually sets off the watchdog on an otherwise functioning device. My idea actually goes towards setting up a better healthcheck that is more delicate than our current one, which tries to run the hello-world container.
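As a rough illustration of what a lighter check could look like, the Python sketch below only asks the daemon to answer on its socket instead of starting a container. It is not the maintainers' planned healthcheck. It assumes that balena-engine, as a Docker-compatible engine, answers `GET /_ping` on its Unix socket the way the Docker Engine API does; the socket path is taken from the error message in this thread, and `engine_ping` is a hypothetical name.

```python
import socket

# Path taken from the error message in this thread.
ENGINE_SOCKET = "/var/run/balena-engine.sock"


def engine_ping(path=ENGINE_SOCKET, timeout=5.0):
    """Return (ok, detail); ok is True only if the daemon answers HTTP 200.

    Assumption: the engine exposes a Docker-compatible /_ping endpoint.
    """
    request = b"GET /_ping HTTP/1.0\r\nHost: localhost\r\n\r\n"
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as conn:
            conn.settimeout(timeout)
            conn.connect(path)
            conn.sendall(request)
            reply = conn.recv(4096)
    except OSError as exc:
        # Covers a refused connection (ECONNREFUSED), a missing socket file,
        # and timeouts, which are all OSError subclasses.
        return False, f"{type(exc).__name__}: {exc}"

    # The first line of an HTTP reply looks like "HTTP/1.1 200 OK".
    status_line = reply.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = status_line.split()
    return len(parts) >= 2 and parts[1] == "200", status_line


if __name__ == "__main__":
    ok, detail = engine_ping()
    print("engine reachable" if ok else "engine not reachable", "-", detail)
```

A refused connection here corresponds to the `ECONNREFUSED` in the thread title: the socket file exists, but nothing is accepting connections on it.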

We ran some tests in the past which adjusted the priorities of the untar operations, which actually already run as a completely separate process. If I remember correctly that didn’t entirely solve the issue.

We have an issue for this if you are interested in following it: "Healthdog checks time out during image pulls" (Issue #196, balena-os/balena-engine on GitHub).

As I said, I’m going to look into coming up with a better healthcheck, since it would most likely improve this situation as well as other issues we’ve had around that.

Could you share the OS version you are on, just for reference?

Sorry, I’ve only just seen this! I was working on balenaOS 2.54.2+rev1.

@robertgzr Any progress on this? I have the issue with a RasPi Zero W and balenaOS 2.83.21+rev1, Supervisor 12.10.3

Someone mentioned that their /mnt/data partition was full. Would using a bigger SD card fix the issue? Or is this only related to the slow processor of the Pi Zero?
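Whether a full data partition is involved can be checked before buying a bigger card. The sketch below is only an illustration: it uses Python's standard `shutil.disk_usage`, takes the `/mnt/data` path from this thread as its default, and `partition_report` is a hypothetical helper name. It has to run somewhere that has Python and can see that mount point; where Python is not available, a standard `df` call on the same path reports the same numbers.

```python
import shutil


def partition_report(path="/mnt/data"):
    """Return total/free bytes and percent used for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return {
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "percent_used": 100.0 * usage.used / usage.total,
    }


if __name__ == "__main__":
    report = partition_report()
    print(f"{report['percent_used']:.1f}% used, "
          f"{report['free_bytes'] / 1e6:.0f} MB free")
```

If the partition turns out to have plenty of free space, the slow-unpack and watchdog explanation discussed earlier in the thread becomes the more likely cause.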

Are there any hacky fixes that work?
