image fails to download 404 no such image

shaunmulligan · March 4, 2020, 9:47am

Hi Frederic, we are still looking into this. Do you only see this on version 2.46.1 of the OS, or do you have other devices connected to the instance?

Langhalsdino · March 4, 2020, 8:07pm

Sadly, I only have 2.46.1 running.

Are there any further debug logs I could provide?

shaunmulligan · March 5, 2020, 11:28am

Hi Frederic, I chatted to the supervisor maintainer, and it seems that the Device state apply error Error: Failed to apply state transition steps. Steps:["fetch"] is caused by a failed image download, but the real error message is being swallowed and not reported to us. We might be able to direct you to install a debug supervisor to capture more logs and better understand the issue, but we will get back to you on that ASAP.

shaunmulligan · March 5, 2020, 11:43am

My colleagues also suggested re-pushing the image, just to rule out any issue with the registry possibly losing it.

shaunmulligan · March 5, 2020, 12:59pm

Okay, apparently a debug supervisor probably won't help, but maybe you could grab logs from the container engine using journalctl -u balena.service -t balenad, as we might see something useful in there to figure out why the image pull is failing.

Langhalsdino · March 5, 2020, 8:57pm

Thank you for pointing me to that system log.
But I am not sure this is actually the reason for the failed image download, since it happens more often than every two hours and therefore might not be related to the [Logs] [3/5/2020, 4:00:20 PM] Failed to download image error.

Mar 03 20:07:25 ShortUUI balenad[1292]: time="2020-03-03T20:07:25.076395426Z" level=warning msg="failed to download layer: \"unexpected EOF\", retrying to read again"

This happens roughly every 20 minutes
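For anyone who wants to analyse these warnings in bulk: the balenad lines carry logfmt-style key=value fields, where a quoted value may contain escaped quotes. The Python sketch below parses one such line, assuming every line follows the format of the example above. The function names are hypothetical and not part of balenaEngine.

```python
import re
from datetime import datetime, timezone

# key=value pairs; a value is either a double-quoted string (with \" escapes) or a bare token
FIELD_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

def parse_balenad_fields(line: str) -> dict:
    """Extract the logfmt-style fields (time, level, msg, ...) from one journal line."""
    fields = {}
    for key, raw in FIELD_RE.findall(line):
        if raw.startswith('"'):
            # drop the surrounding quotes and undo backslash escapes
            raw = re.sub(r'\\(.)', r'\1', raw[1:-1])
        fields[key] = raw
    return fields

def parse_engine_time(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp with nanoseconds, truncated to microseconds."""
    base, _, frac = value.rstrip("Z").partition(".")
    stamp = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    micros = int((frac + "000000")[:6])
    return stamp.replace(microsecond=micros, tzinfo=timezone.utc)

line = r'Mar 03 20:07:25 ShortUUI balenad[1292]: time="2020-03-03T20:07:25.076395426Z" level=warning msg="failed to download layer: \"unexpected EOF\", retrying to read again"'
fields = parse_balenad_fields(line)
print(fields["level"], parse_engine_time(fields["time"]).isoformat())
print(fields["msg"])
```

Piping the journalctl output through parse_balenad_fields and keeping only the entries whose msg starts with "failed to download layer" yields the timestamps to compare against other events.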

Do you know whether this is a server error (in which case, should I upgrade OpenBalena?) or a misconfigured app?

Langhalsdino · March 5, 2020, 9:04pm

I already re-pushed the image, and it did not help.

_Page · March 5, 2020, 10:23pm

When I've seen unexpected EOF returned by docker/balena-engine, it has been due to network errors. Do you perhaps have a load balancer that could be terminating the connection unexpectedly?
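To illustrate what a connection cut mid-transfer looks like from the client side, here is a minimal, self-contained Python sketch. balenaEngine itself is written in Go, where the corresponding error is io.ErrUnexpectedEOF with the message "unexpected EOF"; Python's http.client reports the same situation as IncompleteRead. A throwaway local server promises 1000 bytes but closes after 10, standing in for a load balancer or other middlebox dropping the connection. The path and function names are hypothetical.

```python
import http.client
import socket
import threading

PROMISED, SENT = 1000, 10  # bytes announced in Content-Length vs. bytes actually sent

def serve_truncated_body(server: socket.socket) -> None:
    """Accept one request, promise PROMISED bytes, send only SENT, then close."""
    conn, _ = server.accept()
    with conn:
        request = b""
        while b"\r\n\r\n" not in request:  # read the full request head first
            chunk = conn.recv(4096)
            if not chunk:
                break
            request += chunk
        conn.sendall(
            b"HTTP/1.1 200 OK\r\n"
            + f"Content-Length: {PROMISED}\r\n\r\n".encode()
            + b"x" * SENT
        )
    # leaving the with-block closes the socket early, like a dropped connection

def fetch(host: str, port: int, path: str = "/hypothetical/layer") -> str:
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        body = conn.getresponse().read()
        return f"complete: {len(body)} bytes"
    except http.client.IncompleteRead as exc:
        # exc.partial holds what arrived, exc.expected how many bytes were still missing
        return f"truncated after {len(exc.partial)} of {len(exc.partial) + exc.expected} bytes"
    finally:
        conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
worker = threading.Thread(target=serve_truncated_body, args=(server,))
worker.start()
outcome = fetch("127.0.0.1", server.getsockname()[1])
worker.join()
server.close()
print(outcome)
```

The client only learns that something went wrong when the stream ends before Content-Length is reached, which is why these failures show up as EOF errors rather than as explicit network errors.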

Langhalsdino · March 6, 2020, 10:13am

I checked my GCP firewall, ingress, … settings, and everything seems to be fine. Furthermore, the time between the warning msg="failed to download layer" entries seems to be centred around 1000 seconds (it looks like a Gaussian distribution). Since this is quite a round number, I suspect it is set by some configuration value. I will investigate further.

Do you have any idea where I should look?
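One way to check an observation like "roughly 1000 seconds apart" is to compute the gaps between consecutive warning timestamps and look at their mean and spread; a tight spread around a fixed value suggests a timer somewhere. A minimal Python sketch; the timestamps below are hypothetical placeholders, not data from this thread.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev
from typing import List, Sequence

def warning_intervals(timestamps: Sequence[datetime]) -> List[float]:
    """Seconds between consecutive 'failed to download layer' warnings."""
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds() for earlier, later in zip(ordered, ordered[1:])]

# Hypothetical timestamps for illustration only, not taken from the thread's logs.
start = datetime(2020, 3, 3, 20, 7, 25)
stamps = [start, start + timedelta(seconds=1005), start + timedelta(seconds=2020)]
gaps = warning_intervals(stamps)
print(gaps, mean(gaps), round(stdev(gaps), 1))
```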

Langhalsdino · March 6, 2020, 10:23am

I'm still not sure this is the right place to look, but my current HAProxy config defines:

defaults
  timeout connect 5000
  timeout client 50000
  timeout server 50000
redis:
   timeout 1h
postgres:
   timeout 1h
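HAProxy reads a timeout with no unit as milliseconds, so the defaults above are 5 s (connect) and 50 s (client and server), and the redis/postgres entries are 3600 s. None of these is close to ~1000 s or to 2 hours. The small helper below applies HAProxy's documented time units (us, ms, s, m, h, d); it is a hypothetical sketch, not part of HAProxy.

```python
import re

# HAProxy time units expressed in microseconds; a bare number means milliseconds.
UNIT_US = {"us": 1, "ms": 1_000, "s": 1_000_000, "m": 60_000_000,
           "h": 3_600_000_000, "d": 86_400_000_000}

def haproxy_timeout_seconds(value: str) -> float:
    """Convert an HAProxy timeout value such as '50000' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)(us|ms|s|m|h|d)?", value.strip())
    if match is None:
        raise ValueError(f"not an HAProxy time value: {value!r}")
    number, unit = match.groups()
    return int(number) * UNIT_US[unit or "ms"] / 1_000_000

for name, raw in [("connect", "5000"), ("client", "50000"), ("server", "50000"), ("redis/postgres", "1h")]:
    print(f"timeout {name}: {raw} -> {haproxy_timeout_seconds(raw):g} s")
```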

Langhalsdino · March 6, 2020, 11:13am

I upgraded our OpenBalena deployment to:

OPENBALENA_API_VERSION_TAG=0.49.9
OPENBALENA_REGISTRY_VERSION_TAG=2.13.1
OPENBALENA_VPN_VERSION_TAG=9.10.0
OPENBALENA_DB_VERSION_TAG=3.0.1
OPENBALENA_S3_VERSION_TAG=2.9.0

Maybe this will help.

robertgzr · March 6, 2020, 12:34pm

Maybe you could try putting the image on a registry hosted on GCP and pulling it from there, just so we can definitively rule out network-related issues?

Langhalsdino · March 6, 2020, 1:42pm

I will try it, since upgrading the OpenBalena image versions did not help.

Is there some documentation on how to add a custom private registry to an OpenBalena deployment?

Langhalsdino · March 6, 2020, 2:17pm

Sorry for spamming this thread with a lot of ideas about the issue. I just think that documenting my approach helps others who might have a related issue to figure out a fix.

I just came across a Linux kernel setting, /proc/sys/net/ipv4/tcp_keepalive_time, whose default value is 7200 seconds (2 hours). These two hours correlate with my error messages that appear every 2 hours.

A root user can change it with:

echo {value in seconds} > /proc/sys/net/ipv4/tcp_keepalive_time
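Note that tcp_keepalive_time (like tcp_keepalive_intvl and tcp_keepalive_probes) is only the system-wide default: per tcp(7), it applies to sockets that have SO_KEEPALIVE enabled, and a program can override it per socket with TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT. A minimal, Linux-only Python sketch of both sides; the function names are hypothetical.

```python
import socket

def read_keepalive_sysctl(name: str = "tcp_keepalive_time") -> int:
    """Read a system-wide keepalive default, e.g. the 7200 s tcp_keepalive_time."""
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        return int(f.read().strip())

def enable_keepalive(sock: socket.socket, idle: int, interval: int, probes: int) -> None:
    """Enable keepalive on one socket and override the system-wide timings (Linux only)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # seconds idle before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)      # unanswered probes before drop

probe_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(probe_sock, idle=120, interval=75, probes=9)
idle_now = probe_sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
count_now = probe_sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT)
probe_sock.close()
print(f"per-socket idle time: {idle_now} s, probes: {count_now}")
```

So a change to the sysctl only affects connections that use keepalive without setting their own values.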

robertgzr · March 6, 2020, 2:36pm

It might be worth setting it to a lower value (30 minutes or even lower) for a test run. I suspect you would then see your errors showing up past that mark.

TCP keepalive process waits for two hours (7200 secs) for socket activity before sending the first keepalive probe, and then resends it every 75 seconds. As long as there is TCP/IP socket communication going on and active, no keepalive packets are needed.
TCP keepalive Recommended Settings and Best Practices | Linux Tutorials for Beginners
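To put numbers on the quoted behaviour: with the Linux defaults documented in tcp(7) (tcp_keepalive_time 7200 s, tcp_keepalive_intvl 75 s, tcp_keepalive_probes 9), an idle connection whose peer has silently gone away is declared dead only after roughly time + probes × interval. A quick Python check, assuming those defaults are unchanged on the host:

```python
def seconds_until_dropped(keepalive_time: int = 7200, interval: int = 75, probes: int = 9) -> int:
    """Idle time plus all unanswered probes before the kernel gives up on the connection."""
    return keepalive_time + probes * interval

print(seconds_until_dropped())                    # 7875 s, a little over 2 hours
print(seconds_until_dropped(keepalive_time=120))  # 795 s with the 120 s test value
```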

I suspect nothing is actually ever transferred and you only notice that when the keepalive probe is sent and the connection dropped.

Do keep us in the loop for what you find.

Langhalsdino · March 6, 2020, 2:55pm

I set it to an aggressive 120 seconds, and, as you suspected, the error has not shown up yet. Therefore it is probably occurring well past this mark.

alexgg · March 6, 2020, 5:52pm

Hi Frederic, thanks for the update. Please let us know whether the problem still appears. Did you try placing the image on a registry hosted on GCP and pulling it from there to rule out network issues, as suggested above?

alexgg · March 6, 2020, 6:05pm

Hi Frederic, re-reading the thread, the suggestion was to try using a virtual machine on a cloud other than GCP, run OB (openBalena) there, and then try to pull the large image. This is only to rule out network problems. Pulling from a private registry is not supported as far as I know, but I have reached out to the balenaCloud team to confirm.

alexgg · March 6, 2020, 7:22pm

Hi again, just to confirm that pulling from a private registry is not supported. Finding a way to decouple the problem from the current network to rule out firewall/other issues would still be desirable though.

Langhalsdino · March 6, 2020, 7:24pm

I will try a couple of configurations over the next few days. Is there a recommended platform and OS that I should try first? I could try Azure, AWS, or DigitalOcean, since GCP Compute Engine (Ubuntu, Debian) did not work.
