image fails to download 404 no such image

Hi Frederic, we are still looking into this. Do you only see this on the 2.46.1 version of the OS, or do you have devices on other versions connected to the instance?

Sadly I only have 2.46.1 running :frowning:

Are there any further debug logs I could help you with?

Hi Frederic, I chatted with the supervisor maintainer, and it seems the Device state apply error Error: Failed to apply state transition steps. Steps:["fetch"] is due to a failed image download, but the actual error message is being swallowed and not reported to us. We might be able to direct you to install a debug supervisor to help capture some more logs and better understand the issue, but we will get back to you on that asap.

My colleagues also suggested repushing the image, just to rule out the registry possibly losing it.

Okay, so apparently a debug supervisor probably won't help, but maybe you could grab logs from the container engine using journalctl -u balena.service -t balenad, as we might see something useful in there to figure out why the image pull is failing.

Thank you for pointing to that system log.
But I am not sure if this is actually the reason for the failed image download, since it happens more often than every two hours and therefore might not be related to the [Logs] [3/5/2020, 4:00:20 PM] Failed to download image message.

Mar 03 20:07:25 ShortUUI balenad[1292]: time="2020-03-03T20:07:25.076395426Z" level=warning msg="failed to download layer: \"unexpected EOF\", retrying to read again"

This happens roughly every 20 minutes :confused:

Do you know if this is a server error? Should I upgrade OpenBalena, or is it a misconfigured app?

I repushed the image already and it did not help :frowning:

When I’ve seen unexpected EOF returned by docker/balena-engine, it’s been due to network errors. Do you maybe have a load balancer that could be terminating the connection unexpectedly?

I checked my GCP firewall, ingress, … settings and it all seems to be good. Furthermore, the time difference between the msg="failed to download layer" warnings seems to be distributed around 1000 seconds (looks like a Gaussian plot). Since this is quite a round number, I suspect it is set in some setting :slight_smile: I will investigate it further.

Do you have any clue about a location that I should take a look at?
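
A rough way to check that spacing is to pull the timestamps out of the balenad warnings and look at the gaps between them. The Python sketch below is only an illustration: the function name warning_gaps and the sample lines are made up for the example (the second timestamp is invented), and it assumes the time="..." field always looks like the one in the log line quoted above and that the lines arrive in chronological order.

```python
import re
from datetime import datetime

# Assumption: balenad prints an RFC 3339 UTC timestamp in a time="..." field,
# as in the warning quoted earlier in this thread.
TIME_RE = re.compile(r'time="(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z"')


def warning_gaps(lines, needle="failed to download layer"):
    """Return the gaps in seconds between consecutive matching warnings."""
    stamps = []
    for line in lines:
        if needle not in line:
            continue
        match = TIME_RE.search(line)
        if match:
            # Fractional seconds are dropped; whole seconds are enough here.
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# Made-up sample: the first line mirrors the log above, the second is invented.
sample = [
    'balenad[1292]: time="2020-03-03T20:07:25.076395426Z" level=warning '
    'msg="failed to download layer: \\"unexpected EOF\\", retrying to read again"',
    'balenad[1292]: time="2020-03-03T20:24:05.000000000Z" level=warning '
    'msg="failed to download layer: \\"unexpected EOF\\", retrying to read again"',
]
print(warning_gaps(sample))  # [1000.0]
```

To run it on real logs, the lines from journalctl -u balena.service -t balenad could be passed in instead of sample.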

Still not sure if this is the right place to look into, but my current haproxy config defines:

defaults
  timeout connect 5000
  timeout client 50000
  timeout server 50000
redis:
   timeout 1h
postgres:
   timeout 1h

I upgraded our OpenBalena deployment to:

OPENBALENA_API_VERSION_TAG=0.49.9
OPENBALENA_REGISTRY_VERSION_TAG=2.13.1
OPENBALENA_VPN_VERSION_TAG=9.10.0
OPENBALENA_DB_VERSION_TAG=3.0.1
OPENBALENA_S3_VERSION_TAG=2.9.0

maybe this might help :slight_smile:

Maybe you could try putting the image on a registry hosted on GCP and pulling it from there, just so we can definitively rule out network-related issues?

I will try it, since upgrading the OpenBalena image versions did not help.

Is there some documentation on how to add a custom private registry to an OpenBalena deployment?

Sorry for spamming this thread with a lot of ideas around the issue. I just think that my approach enables others who might have a related issue to figure out a fix.

I just came across a Debian setting in /proc/sys/net/ipv4/tcp_keepalive_time whose default value is 7200 seconds (2 hours). These two hours correlate with my error messages that appear every 2 hours.

A root user can change it with:

echo {value in seconds} > /proc/sys/net/ipv4/tcp_keepalive_time

It may be worth setting it to a lower value (30m or even lower) for a test run. I suspect you would then see your errors showing up past that mark.

TCP keepalive process waits for two hours (7200 secs) for socket activity before sending the first keepalive probe, and then resend it every 75 seconds. As long as there is TCP/IP socket communications going on and active, no keepalive packets are needed.
TCP keepalive Recommended Settings and Best Practices | Linux Tutorials for Beginners
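
To put numbers on that: with these settings, an idle connection whose peer has gone away is only given up on after the idle time plus all unanswered probes. A small Python sketch of that arithmetic follows. The function names are made up for illustration, and the probe count of 9 is the usual Linux default for tcp_keepalive_probes rather than something stated in this thread, so please check the values on your own host.

```python
from pathlib import Path

# Linux exposes the keepalive settings as plain integer files in this directory.
PROC_IPV4 = Path("/proc/sys/net/ipv4")
NAMES = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive(base=PROC_IPV4):
    """Read the three keepalive settings from base (Linux only)."""
    return {name: int((Path(base) / name).read_text().strip()) for name in NAMES}


def seconds_until_drop(settings):
    """Idle time before the first probe plus the time for all unanswered probes."""
    return (settings["tcp_keepalive_time"]
            + settings["tcp_keepalive_intvl"] * settings["tcp_keepalive_probes"])


# Assumed defaults: 7200 s and 75 s are from the quote above, 9 probes is an assumption.
defaults = {
    "tcp_keepalive_time": 7200,
    "tcp_keepalive_intvl": 75,
    "tcp_keepalive_probes": 9,
}
print(seconds_until_drop(defaults))  # 7875
```

On a Linux host, seconds_until_drop(read_keepalive()) should report the same figure for whatever values are currently set.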

I suspect nothing is actually ever transferred and you only notice that when the keepalive probe is sent and the connection dropped.

Do keep us in the loop for what you find.

I set it to an aggressive 120 seconds and, as you suspected, the error has not shown up yet. Therefore it's probably occurring way past this mark.

Hi Frederic, thanks for the update. Please let us know whether the problem is still appearing. Did you try placing the image on a registry hosted on GCP and pulling it from there to rule out network issues, as suggested above?

Hi Frederic, re-reading the thread, the suggestion was to try to use a virtual machine on a cloud different from GCP, run OB in there, and then try to pull the large image. This is only to rule out network problems. Pulling from a private registry is not supported as far as I know, but I have reached out to the balenaCloud team to confirm.

Hi again, just to confirm that pulling from a private registry is not supported. Finding a way to decouple the problem from the current network to rule out firewall/other issues would still be desirable though.

I will try a couple of configurations over the next few days. Is there a recommended platform and OS that I should try first? I could try Azure, AWS or Digital Ocean, since GCP with a Compute Engine instance (Ubuntu, Debian) did not work.