Failed to download image due to 'connect ECONNREFUSED /var/run/balena-engine.sock'

I am unable to download images to my Jetson Nano device due to persistent errors of the form:
Failed to download image (id removed for posting purposes) due to ‘connect ECONNREFUSED /var/run/balena-engine.sock’
It seems to always occur after about 6 minutes of downloading, and has held us up for a couple of days now as we have unsuccessfully tried to troubleshoot this.
I saw another post on the forum that had a similar issue in the past due to a bad SD card, but I’ve tried two different SD cards with no luck. Any ideas what might be causing this? I’ve granted 12 hours of support access to UUID cff6fe10e53dec20e6a7518a9b9a4b23 to help troubleshoot - thanks in advance.

Hi @drcnyc – thanks for your post. Can I ask a couple things:

  • Can you run Diagnostics on your device and see if there are any problems reported?
  • Can you let me know the size of your image, and what kind of network connection you have?

All the best,
Hugh

Thanks for the quick response. Below is the diagnostic output (apologies for the poor formatting). A few of the checks failed, and interestingly one of them mentions SD write latency. Both SD cards we are testing with are class 4, so I’m wondering if this could be the issue. Perhaps with large images the write speed cannot keep up with the download speed and the buffers run out?

Our app has eight containers in total: seven of them average about 300 MB each, while the eighth is about 5 GB, as it has the CUDA and cuDNN modules required to run TensorFlow apps using the GPU. The network connection where the device is currently located is 300 Mbps down / 50 Mbps up.

Diagnostic output:

[check_balenaOS] [ Succeeded] [Supported balenaOS 2.x detected] 

[check_container_engine] [ Failed] [Container engine balena is up, but has 4 unclean restarts and may be crashlooping (most recent start time: Mon 2020-07-06 19:28:17 UTC)] 

[check_localdisk] [ Failed] [Some localdisk issues detected: test_write_latency Slow disk writes detected: mmcblk0: 4206.26ms / write, sample size 57503 mmcblk0p13: 1271.89ms / write, sample size 616 mmcblk0p16: 4274.48ms / write, sample size 56340] 

[check_memory] [ Succeeded] [70% memory available] 

[check_networking] [ Succeeded] [No networking issues detected] 

[check_os_rollback] [ Succeeded] [No OS rollbacks detected] 

[check_service_restarts] [ Failed] [Inspecting service(s) (xorg-chromium_2467372_1448418) has timed out, check data incomplete] 

[check_supervisor] [ Succeeded] [Supervisor is running & healthy] 

[check_temperature] [ Succeeded] [No temperature issues detected] 

[check_timesync] [ Succeeded] [Time is synchronized]
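
For anyone working through similar output: each line above follows a `[check] [status] [message]` pattern, so the failed checks can be pulled out with a few lines of Python. This is only a sketch that assumes that three-bracket layout holds for every line; `parse_diagnostics` and `failed_checks` are hypothetical helper names, not part of any balena tooling.

```python
import re

# Assumed layout, taken from the pasted output:
#   [check_memory] [ Succeeded] [70% memory available]
LINE_RE = re.compile(
    r"^\[(?P<check>[^\]]+)\]\s*\[\s*(?P<status>[^\]]+?)\s*\]\s*\[(?P<detail>.*)\]$"
)


def parse_diagnostics(text):
    """Return a list of (check, status, detail) tuples for lines that match."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            results.append(
                (match.group("check"), match.group("status"), match.group("detail"))
            )
    return results


def failed_checks(text):
    """Return (check, detail) pairs for every check whose status is 'Failed'."""
    return [
        (check, detail)
        for check, status, detail in parse_diagnostics(text)
        if status.lower() == "failed"
    ]
```

Passing the pasted output to `failed_checks` would list `check_container_engine`, `check_localdisk` and `check_service_restarts`, which are the three entries worth reading first.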

Hi @drcnyc – thanks very kindly for the additional info; this definitely helps. Looking at the device, I assume it is the jetson-nano image (the one which is currently downloading over and over again) that is 5GB? Assuming for the moment that’s true, can you try disabling Delta Updates under “Device Configuration”? I’m curious to see if that makes a difference.

As for the SD card – it’s possible that the SD card could be improved (we always recommend SanDisk Extreme Pro cards), but at this point I’m not sure that the warning you see is connected to the supervisor restarts. As a side note, though, I’m curious if you have tried the eMMC module on the Nano? I understand that it offers higher performance, which is why we chose it for storage on the balena Fin.

All the best,
Hugh

I had tried with and without delta updates - same issue. It would abort the downloads with the ECONNREFUSED message after about 6 minutes. I just went and quickly picked up a UHS-I card, and the problem has been solved! Works like a charm.

Since I tried with two separate slower (class 4) cards previously, and received the same error message with each of them, I am assuming the card speed is the issue. And I know the slower cards worked fine (still do) with smaller images.

My guess: slower cards work with smaller images because there is sufficient memory to buffer the writes when the write speed is less than the download speed. But the combination of larger images and slower cards doesn’t work, because the buffer exhausts the device memory before the download is complete. This would also explain the mysterious 6-minute timing of the error: that would be how long it takes for the write buffer to run out.
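
To put rough numbers on that guess, here is a back-of-envelope sketch in Python. It models only the hypothesis (data arrives faster than the card can write it, and the difference piles up in memory); it is not a description of how the kernel or balena-engine actually buffers writes. The function names and every input value are illustrative assumptions: about 2800 MB of buffer (70% of an assumed 4 GB of RAM) and a 4 MB/s sustained write, which is the minimum the Class 4 rating guarantees.

```python
def seconds_until_buffer_full(buffer_mb, download_mb_s, write_mb_s):
    """Seconds until the buffer fills, or None if the card keeps up.

    Simple model: the buffer grows at (download rate - write rate).
    """
    net_fill_mb_s = download_mb_s - write_mb_s
    if net_fill_mb_s <= 0:
        return None  # writes are at least as fast as the download
    return buffer_mb / net_fill_mb_s


def download_rate_for_stall(buffer_mb, write_mb_s, stall_s):
    """Download rate (MB/s) that would fill the buffer in exactly stall_s seconds."""
    return write_mb_s + buffer_mb / stall_s


# Illustrative inputs only (assumptions, not measurements from this device).
BUFFER_MB = 2800      # assumed: ~70% of 4 GB RAM
CLASS4_WRITE = 4.0    # MB/s, Class 4 minimum sustained write
LINK_RATE = 300 / 8   # 300 Mbps line rate expressed in MB/s

print(seconds_until_buffer_full(BUFFER_MB, LINK_RATE, CLASS4_WRITE))
print(download_rate_for_stall(BUFFER_MB, CLASS4_WRITE, 6 * 60))
```

Under these assumptions the model is most useful in reverse: `download_rate_for_stall` gives the sustained download rate that would be consistent with a failure at the 6-minute mark, which could then be compared against the rate actually observed on the device.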

Glad you sorted it. We always recommend using the best-quality SD cards available on the market, both for performance and resilience. What OS version is this? There was a bug in some OS versions where the watchdog would kill balena-engine if it took too long to update, and that consistent 6-minute timeout feels about right.

That would definitely do it too. It is running 2.51.1+rev1 - let me know if that version would have the bug.

I don’t believe that is it. That issue should have been fixed around 2.41.0.