Safe way to update over poor connection

Hello,

We have deployed to the field multiple instances of a 3-service application running on rPi0. In our lab we have had no issues running these apps or updating them. However, in the field, where connectivity is poor, we have encountered balena engine and supervisor failures. Our working hypothesis is that the poor connectivity increases resource usage, which causes services to fail. This was previously discussed in the thread "Balena engine start failure".

Per the recommendations in "Balena engine start failure", we have trimmed our application to be single-service. However, we are now encountering a different problem. When we initiate an update, the system deletes the two services that are no longer present in the new application, tries to update to the new application, and repeatedly fails, presumably because of the poor connection. When I then try to pin back to the previous release, it tries to download the images for the 2 services that it had deleted, and that fails too. This is a known problem; see the thread "How to stop infinite download loop".

The net result is that the system is now stuck in a broken state: it can’t go forward to the single-service app and can’t go back to the previous 3-service app. Device ID is 71b132b65186d1c80107c86a5af6cd35 and support access is granted for reference. I can connect to the Host via web terminal w/o issues.

So I have 2 questions:

  1. Is there a hack that would allow me to stop the repeat downloading so it doesn’t kill the data plan?
  2. Is there a safe way to attempt to update other field devices so that I don’t fall into this predicament if the update fails?

Hey there! Thanks for coming to the forums. Have you seen the section of our documentation about reducing the bandwidth usage: https://www.balena.io/docs/reference/supervisor/bandwidth-reduction/? Do you think that implementing the tips discussed there could get data usage down enough that downloads wouldn’t be a problem?

Thanks. I’m not concerned about data usage from request packets associated with connectivity checks etc. I’m concerned about repeated download attempts of 100MB updates, which, per previous interactions with balena support, seem to be tied to the inability to maintain a connection long enough to download the update.

Unfortunately, coupled with the known issues of treating each service separately rather than part of a holistic application, and the inability to resume download after an interrupted connection, this causes the problems outlined above.

My questions relate to whether it’s possible to work around these issues.

Downloads can be resumed with delta updates: https://www.balena.io/docs/learn/deploy/delta/#delta-behavior

Ah, I missed this nuance of delta updates - it seems that delta updates can handle download interruptions whereas normal updates cannot. So even if the delta update requires the entire image to be downloaded, it’s more reliable over an unstable connection?

Hi there, yes, the default behavior of delta updates is as follows: “Delta updates are resumable, so if the connection drops or otherwise stalls, the update will resume from the last byte received.” That could certainly help in the situation you have described.
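
To illustrate why "resume from the last byte received" matters on a link that keeps dropping, here is a minimal Python sketch. It is a toy model only, not balena's implementation: `FlakySource`, `download_resumable`, and `download_restarting` are hypothetical names, and the "connection" is simulated in memory by serving a limited number of bytes per attempt.

```python
class FlakySource:
    """Hypothetical stand-in for a remote file on a connection that drops.

    Each call to read_from() models one connection: it serves at most
    `drop_after` bytes starting at `offset`, then the link is lost.
    """

    def __init__(self, data: bytes, drop_after: int):
        self.data = data
        self.drop_after = drop_after

    def read_from(self, offset: int) -> bytes:
        # Comparable to requesting a byte range starting at `offset`.
        return self.data[offset:offset + self.drop_after]


def download_resumable(source, total_size, max_attempts=100):
    """Resume each attempt from the last byte received."""
    received = bytearray()
    attempts = 0
    while len(received) < total_size:
        if attempts >= max_attempts:
            raise RuntimeError("gave up after too many attempts")
        attempts += 1
        received.extend(source.read_from(len(received)))
    return bytes(received), attempts


def download_restarting(source, total_size, max_attempts=100):
    """Restart every attempt from byte 0 (no resume support).

    Returns (data or None, attempts used, total bytes transferred).
    """
    transferred = 0
    for attempt in range(1, max_attempts + 1):
        chunk = source.read_from(0)
        transferred += len(chunk)
        if len(chunk) >= total_size:
            return chunk, attempt, transferred
    return None, max_attempts, transferred
```

In this model, if each connection survives for less than the full file, the restarting download never completes and keeps consuming data on every retry, while the resumable one finishes after a handful of attempts and transfers each byte only once.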

Hi,

Asking for ideas to get around an unfortunate scenario that has happened in deployed systems:

  • App on device has 3 containers: Alpha, Beta, Gamma
  • I created a new App with a single container, Gamma, built off a different base image, so the delta is large
  • When updating the app over a slow connection, the supervisor first removed the images for Alpha and Beta and then tried to download the newer image for Gamma. That kept failing. I tried to revert by pinning back to the previous app release, but it then tried to download the old images for Alpha and Beta, which also kept failing, so I was left w/o a working app. This is a known update issue: https://github.com/balena-io/balena-supervisor/issues/1103

I tested a possible hack around this by creating 2 small dummy images for Alpha & Beta, and then started the update. They downloaded fine. In the middle of the Gamma download, I pinned back to the previous release. Once again, the system tried to download the old images for Alpha and Beta. This supports what the issue indicates: as soon as a new container image is downloaded, the old one is removed.
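
The behavior I observed can be summarized with a small toy model in Python. This is only my reading of what happened on the device, not the supervisor's actual code: `apply_release` and the image tags are hypothetical, and the model assumes that services absent from the target release are removed first, while an existing service keeps its old image until the new one has downloaded.

```python
def apply_release(device, target, can_download):
    """Toy model of one update attempt (an assumption, not supervisor code).

    device, target: dict mapping service name -> image tag.
    can_download:   callable(image tag) -> bool, True if the download succeeds.
    Returns the images on the device after the attempt.
    """
    # Step 1: services that are not in the target release are removed up front.
    state = {svc: img for svc, img in device.items() if svc in target}
    # Step 2: each missing or outdated image is fetched; an old image is
    # replaced only once its replacement has downloaded successfully.
    for svc, img in target.items():
        if state.get(svc) != img and can_download(img):
            state[svc] = img
    return state


old = {"alpha": "alpha:v1", "beta": "beta:v1", "gamma": "gamma:v1"}
single = {"gamma": "gamma:v2"}
with_dummies = {"alpha": "alpha:dummy", "beta": "beta:dummy", "gamma": "gamma:v2"}


def poor_link(img):
    # On the poor connection, assume only the tiny dummy images get through.
    return img.endswith(":dummy")


# First scenario: update to the single-service release, then pin back.
after_update = apply_release(old, single, poor_link)
after_pin_back = apply_release(after_update, old, poor_link)

# Second scenario: dummy Alpha/Beta images, then pin back mid-download.
after_dummy_update = apply_release(old, with_dummies, poor_link)
after_dummy_pin_back = apply_release(after_dummy_update, old, poor_link)
```

In both scenarios of this model the device ends up matching neither the old release nor the new one, which is the state my field device is in.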

So I am wondering if anyone can think of another hack around this issue. As it stands, we are hesitant to update any apps in the field for fear of getting stuck in no-man’s-land.

Hi there – thanks for the additional details about the problems you’re encountering.

As you note, the question of application update strategies is currently being tracked at https://github.com/balena-io/balena-supervisor/issues/1103, and being discussed internally. We will update that ticket when we have a solution ready to roll out.

All the best,
Hugh

I guess that means there are no known workarounds or hacks and that I need to wait for the issue to be resolved.