App Doesn't Update After Push

Hi Balena Team! I’m trying to update an app on my remote computers. On two of my three deployed computers the updated app finishes downloading, but on the third the containers never finish downloading their updates. The logs are filled with the following (SHAs and keys removed and replaced with ~):

18.01.19 17:43:04 (-0500) Downloading image 'registry2.balena-cloud.com/v2/~@sha256:~'
18.01.19 17:43:04 (-0500) Downloading image 'registry2.balena-cloud.com/v2/@sha256:~'
18.01.19 17:43:17 (-0500) Failed to download image 'registry2.balena-cloud.com/v2/~@sha256:~' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/: net/http: TLS handshake timeout '
18.01.19 17:43:17 (-0500) Failed to download image 'registry2.balena-cloud.com/v2/~@sha256~:' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/: net/http: TLS handshake timeout '
18.01.19 17:43:17 (-0500) Downloading image 'registry2.balena-cloud.com/v2/~@sha256:~'
18.01.19 17:43:17 (-0500) Downloading image 'registry2.balena-cloud.com/v2/~@sha256:~'
18.01.19 17:43:32 (-0500) Failed to download image 'registry2.balena-cloud.com/v2/~@sha256:~' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/: net/http: TLS handshake timeout '
18.01.19 17:43:32 (-0500) Failed to download image 'registry2.balena-cloud.com/v2/~@sha256:~' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/: net/http: TLS handshake timeout '
18.01.19 17:43:32 (-0500) Downloading image 'registry2.balena-cloud.com/v2/~@sha256:~'
18.01.19 17:43:32 (-0500) Downloading image 'registry2.balena-cloud.com/v2/~@sha256:~'
18.01.19 17:43:47 (-0500) Failed to download image 'registry2.balena-cloud.com/v2/~@sha256:~' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/: net/http: TLS handshake timeout '
18.01.19 17:43:47 (-0500) Failed to download image 'registry2.balena-cloud.com/v2/~@sha256:~' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/: net/http: TLS handshake timeout '
18.01.19 17:43:48 (-0500) Downloading image 'registry2.balena-cloud.com/v2/~@sha256:~'
18.01.19 17:43:48 (-0500) Downloading image 'registry2.balena-cloud.com/v2/~@sha256:~'
18.01.19 17:44:20 (-0500) Failed to download image 'registry2.balena-cloud.com/v2/~@sha256:~' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/v2/~/manifests/sha256:~: Get https://api.balena-cloud.com/auth/v1/token?account=~&scope=repository%3Av2%2F~%3Apull&service=registry2.balena-cloud.com: net/http: TLS handshake timeout '
18.01.19 17:44:20 (-0500) Failed to download image 'registry2.balena-cloud.com/v2/~@sha256:~' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/v2/~/manifests/sha256:~: Get https://api.balena-cloud.com/auth/v1/token?account=~&scope=repository%3Av2%2F~%3Apull&service=registry2.balena-cloud.com: net/http: TLS handshake timeout '
18.01.19 17:44:20 (-0500) Downloading image 'registry2.balena-cloud.com/v2/~@sha256:~'
18.01.19 17:44:20 (-0500) Downloading image 'registry2.balena-cloud.com/v2/~@sha256:~'
18.01.19 17:44:34 (-0500) Failed to download image 'registry2.balena-cloud.com/v2/~@sha256:~' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/: net/http: TLS handshake timeout '
18.01.19 17:44:34 (-0500) Failed to download image 'registry2.balena-cloud.com/v2/~@sha256:~' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/: net/http: TLS handshake timeout '

All three computers are lined up in a row in front of me. They were provisioned identically and use the same hardware and the same Wi-Fi access point.

Hey @jarek! I’ve been experiencing a similar problem (with a different HTTP error code), and have submitted an issue on the team’s GitHub page for the supervisor. You can see it over here.

Have you tried re-imaging the device? Sometimes I’ve solved strange errors with a simple re-provisioning.

Our issue seems to be with the Balena servers not actually having the images to download (404 / 500 errors), but I’ll try switching the update strategy anyway to test on a local machine. I’ll watch that thread too, thanks for the link :slight_smile:

Reprovisioning always works, but I’ve been incredibly anxious about updating anything already deployed in the field, as we have no access to those boxes now and the issue seems to pop up with both big and small Dockerfile changes, with no rhyme or reason. Not planning on any remote updates until this is completely resolved :frowning:

Hi @jarek

can you please tell us device type & OS version? Any chance to enable support access on the device?

Thanks,
Robert

Hi zrzka! The device is an Intel NUC, and the OS version is balenaOS 2.29.0+rev1.
Currently none of the boxes are showing this problem, because we haven’t done an update yet and all the previous boxes were re-flashed. We’ll be testing a new configuration next week and I’ll reach out if it happens again.

Workaround while fix is being developed:

On a host OS terminal, run the commands:
balena ps -- this will list the images in use (IMAGE column)
balena images -- this will list all the downloaded images (IMAGE ID column)
balena rmi ID1 ID2 ID3 ...

where ID1 ID2 ID3 ... are the IDs of the images NOT in use.

I ran "balena rmi 457128a3924b 163baeae9ac0", after which the downloads eventually succeeded.
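
For anyone who wants to script the steps above, here is a rough sketch. The function name `unused_images` and the `/tmp` file names are made up for illustration, and the commands in the comments assume the host's `balena` engine accepts the same `images -q --no-trunc`, `ps -aq`, and `inspect --format` options as Docker. It only prints candidates; nothing is removed until you pass the IDs to `balena rmi` yourself.

```shell
#!/bin/sh
# Hypothetical helper, not part of balenaOS or the supervisor.
# Prints every image ID listed in the first file that is absent from the second.
#   $1: file with all downloaded image IDs, one per line
#   $2: file with image IDs used by existing containers, one per line
unused_images() {
    # -v: invert match, -x: whole-line match, -F: fixed strings,
    # -f: read the patterns (the in-use IDs) from a file
    grep -vxFf "$2" "$1"
}

# Assumed usage on the host OS (Docker-style options, unverified on every OS version):
#   balena images -q --no-trunc | sort -u > /tmp/all_images
#   balena ps -aq | xargs balena inspect --format '{{.Image}}' | sort -u > /tmp/used_images
#   unused_images /tmp/all_images /tmp/used_images
```

Check the printed IDs by hand before running `balena rmi` on them.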

Also, switching the update strategy to https://www.balena.io/docs/learn/deploy/release-strategy/update-strategies/#delete-then-download helps.

Hey @jarek

That workaround seems odd. The images that are currently present should not affect the daemon’s ability to pull further images. Furthermore, these images should be removed by the supervisor (assuming they were added by the supervisor).

The only way I could see this helping is if your images were so big that they were filling internal storage, but in that case I wouldn’t expect to see the error that you did.

Can you provide any further information as to why you think this works for you?

Cheers

Just relaying the information I got from support :slight_smile: we just moved to a paid plan so I was chatting with an engineer in a private chat window, but I feel bad leaving loose ends like this so I wanted to tie it up in case someone else stumbles across this.

Ah, that was actually me you were chatting with; I didn’t notice it was the same username. This workaround is for a bug in the backend that is currently being worked on. The logs you posted above shouldn’t be related (except perhaps that network issues like those could exacerbate the bug we discussed). For the benefit of the thread, I’ll post a description of the problem here:

It basically boils down to some edge cases in our permission system, which is a bit overzealous. It denies state updating from the supervisor, until the supervisor stops trying to report state about images it no longer has access to. One of the supervisor’s healthchecks is that it has been able to successfully update state in the past few minutes (something we also plan to change), so in the case that the images take longer to download than the healthcheck timeout, the supervisor will be restarted and the download will begin again, indefinitely.
In the case of delete-then-download, the supervisor deletes the old images first, which causes state updates to be allowed again, including while the new images download.
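
The restart loop described above can be written down as a toy model. This is only an illustration of the reasoning: the function name, its arguments, and the example numbers are invented here and do not come from the supervisor's code, and it assumes an interrupted download starts over from scratch.

```shell
#!/bin/sh
# Toy model only; none of these names exist in the supervisor.
#   $1: seconds the image download needs
#   $2: healthcheck window in seconds (a state update must succeed within it)
#   $3: "denied" if the backend rejects state updates, "allowed" otherwise
update_outcome() {
    if [ "$3" = "denied" ] && [ "$1" -gt "$2" ]; then
        # The healthcheck fails before the download ends, the supervisor
        # is restarted, and the download begins again.
        echo "restart-loop"
    else
        echo "completes"
    fi
}

# Example with made-up numbers: a 600 s download against a 180 s window.
update_outcome 600 180 denied    # prints "restart-loop"
update_outcome 600 180 allowed   # prints "completes"
```

This also matches small updates getting through while large ones never finish: a download shorter than the window completes even while state updates are being denied.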

We’re aware of this bug, and are working on it currently.

Hi @jarek, we were supposed to launch our system yesterday but had to postpone by 2 days due to the same issue.

I keep getting 404/500 errors whenever the device tries to download an update. Tiny updates manage to download after 4-5 tries, but large updates never download successfully. I also tried the workarounds, but I don’t have any extra images, and the delete-then-download update strategy doesn’t help either.

Is there anything else I can try which worked for you maybe? Thanks!

For me the resolution was granting Balena support access :wink:

Even moving the devices from one application to another didn’t help in my case.