Jetson Nano eMMC fails to update with a small delta

I’m facing a problematic situation which happened in two steps:

The image build didn’t reuse the previous build results, although the Dockerfile hadn’t changed.

Could it be that the “balenalib/jetson-nano-ubuntu:bionic” image was updated recently?

This on its own did not create a problem except some wasted time.

The image Delta Size was reported to be ~86MB:

[Info] Release: 41da95ed404c4842b04a69660c336d77 (id: 162498)
[Info] ┌─────────┬────────────┬────────────┬────────────────────────┐
[Info] │ Service │ Image Size │ Delta Size │ Build Time             │
[Info] ├─────────┼────────────┼────────────┼────────────────────────┤
[Info] │ main    │ 6.08 GB    │ 86.38 MB   │ 12 minutes, 59 seconds │
[Info] └─────────┴────────────┴────────────┴────────────────────────┘
[Info] Build finished in 24 minutes, 21 seconds

But then a device in the affected fleet failed to update despite having almost 3GB of free disk space.

What could cause such a problem?

What can we do to avoid it in the future?

balenaOS 2.107.5
SUPERVISOR VERSION 14.4.2
OS Variant: Development


Hello @sgserg welcome to the balena community!

Could you please check the Logs for the error message from the supervisor?

Sometimes, if the source image is not available (e.g. the image on the device, as you anticipated), the supervisor will pull the full release.

Thank you @mpous glad to be here!

Here is what I see in the supervisor Logs section:

Failed to download image ‘registry2.balena-staging.com/v2/6d0456a13111854f952954e0e6c7d6e5@sha256:718948de5d0358e5ad4ac7e90c0256ce787ed5946b09e7303c135ba2e8ff0019’ due to ‘failed to register layer: Error processing tar file(exit status 1): write /usr/local/cuda-10.2/targets/aarch64-linux/lib/libcusolver.so.10.3.0.300: no space left on device’
Downloading delta for image ‘registry2.balena-staging.com/v2/6d0456a13111854f952954e0e6c7d6e5@sha256:718948de5d0358e5ad4ac7e90c0256ce787ed5946b09e7303c135ba2e8ff0019

I also monitored the disk usage during the update and confirmed that the disk filled up somewhere around 50% of the download. After that, the download restarts.


Hello @sgserg, after speaking with my colleagues, we have a hypothesis about what is happening.

If the base image has changed, the delta will need enough disk space to store all the layers of the base image that have been updated. If this is the case, try to allow delete-then-download update (at the expense of downtime and bandwidth).
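To make the headroom argument concrete, here is a rough arithmetic sketch using the numbers reported earlier in this thread (6.08 GB image, 86.38 MB delta, almost 3 GB free). The function name `fits` and the worst case, in which every layer of the new image has to be stored next to the old one, are assumptions for illustration. This is not a description of how the supervisor or balena-engine actually accounts for space.

```python
GB = 1000 ** 3
MB = 1000 ** 2


def fits(free_bytes: int, required_bytes: int) -> bool:
    """Return True when the required download fits in the free space."""
    return required_bytes <= free_bytes


# Figures taken from the build log and report in this thread.
free_space = 3 * GB                  # "almost 3GB" free, rounded up
reported_delta = int(86.38 * MB)     # Delta Size column
full_image = int(6.08 * GB)          # Image Size column

# Best case: only the reported delta has to be stored.
best_case_ok = fits(free_space, reported_delta)

# Assumed worst case: the base image changed, so (nearly) every layer
# of the new image must be stored while the old image is still present.
worst_case_ok = fits(free_space, full_image)

print(best_case_ok, worst_case_ok)
```

This prints `True False`: the reported delta fits easily, but if most layers have to be stored again, 3 GB of headroom is not enough for a 6 GB image. The real requirement lies somewhere between the two, depending on how many base layers changed.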

You can read more about the balena update strategies in the balena documentation, under “Fleet update strategy”.

Let us know if that solves the problem!

@mpous by base image, do you mean the image we use in the first line of the Dockerfile?

FROM balenalib/jetson-nano-ubuntu:bionic

Yes, freezing the base image to an earlier date and dropping the “failed” release appears to have helped with using the previously built image.

But for some reason, after the build the device has been stuck in “Delta still processing remotely. Will retry…” for more than 24 hours already.

[main]     Successfully built 3b539becd932
[Info]     Uploading images
[Success]  Successfully uploaded images
[Info]     Built on arm02
[Success]  Release successfully created!
[Info]     Release: 6cc679d7d05662afb06fc764e10eb12a (id: 162670)
[Info]     ┌─────────┬────────────┬────────────┐
[Info]     │ Service │ Image Size │ Build Time │
[Info]     ├─────────┼────────────┼────────────┤
[Info]     │ main    │ 6.09 GB    │ 53 seconds │
[Info]     └─────────┴────────────┴────────────┘
[Info]     Build finished in 4 minutes, 2 seconds

@sgserg did the delta finally complete? If not, have you tried the “Delete then download” strategy suggested earlier?

@alanb128 It didn’t. We had to turn it off after a couple of days.
We have another device in similar state, if you’d like to take a look.

Sure, glad to take a look. Are you seeing the same error messages, such as “Delta still processing remotely. Will retry…”? If you can attach or send us the device diagnostics, that may help.

Yes, the same messages.
system.log (6.8 MB)
Please find attached output of journalctl --system


@mpous @alanb128 Sorry to bump this after over a year but we just ran into this same issue, and I was wondering if there is another workaround other than the “delete then download” method which would result in downtime.

We recently pushed an update to our fleet which inadvertently included an updated balenalib/raspberrypi4-64-debian:bookworm image, and now we are having devices run out of storage as they try to apply the delta updates. I’m assuming it is related to this comment above:

If the base image has changed, the delta will need enough disk space to store all the layers of the base image that have been updated. If this is the case, try to allow delete-then-download update (at the expense of downtime and bandwidth).

If we kill and delete the old containers and let them redownload from scratch, the devices download the images fine - which I guess is in essence the same as applying a “delete then download” strategy… but I’m hoping there might be a better way?

And separately, mostly out of curiosity, but also because I’d like to see if I can help - what is the reason that balena-engine is not able to handle deltas appropriately when base images change? It seems to work great otherwise…


@drcnyc could you please confirm if you are running this device on openBalena or balenaCloud?

@mpous we are running openBalena with a delta server that uses balena-engine to create the deltas. While I know that potentially introduces a number of other variables, I believe this is the issue we are seeing.

@drcnyc could you please confirm which delta server you are using? Did you implement one yourselves?

If this is not working, I think the device needs to have enough storage to apply the image… I’m not sure how we can help you here.

@mpous it is our own open source delta server, which is based on ‘balena-engine’ as detailed here and here.

Could I ask it a different way: is the issue noted above still present in balenaCloud (i.e. if base images change, does that require devices to download the new base images plus the changed layers)? And if so, is there some fundamental reason why this can’t be handled differently, and can I help resolve it?

It feels like a fairly significant issue, because people who use delta updates rely on them being small, and I suspect most won’t appreciate that you need to leave 2x the size of your fleet’s images in storage headroom on your device just in case base images change; otherwise the update will get stuck. This matters especially because the base images do change from time to time: the Debian one I noted above was changed just two weeks ago but retained the same tag, so even if you were “pinned” to that tag, the image would have changed and would necessitate the storage headroom.

We ended up referencing the base image by its creation date:

FROM balenalib/jetson-nano-ubuntu:bionic-20221109
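A check along these lines could catch a floating base image tag before a release is pushed. This is a minimal sketch, not a balena tool: the function name `unpinned_base_images`, the regular expressions, and the rule that a tag ending in an eight-digit date counts as pinned are all assumptions modeled on the tag above. Only a digest reference (`@sha256:...`) is content-addressed; a dated tag is pinned by convention.

```python
import re

# Matches "FROM [--flag ...] <image-reference>"; the reference is group 1.
FROM_RE = re.compile(r"^\s*FROM\s+(?:--\S+\s+)*(\S+)", re.IGNORECASE)
# Assumed convention: a tag that ends in -YYYYMMDD is treated as pinned.
DATED_TAG_RE = re.compile(r":[^:@/]*-\d{8}$")


def unpinned_base_images(dockerfile_text: str) -> list[str]:
    """Return base image references that are neither digest- nor date-pinned."""
    flagged = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if not match:
            continue
        ref = match.group(1)
        if "@sha256:" in ref or DATED_TAG_RE.search(ref):
            continue
        flagged.append(ref)
    return flagged


# Hypothetical Dockerfile contents; only the FROM lines come from this thread.
floating = "FROM balenalib/jetson-nano-ubuntu:bionic\nRUN echo hello\n"
pinned = "FROM balenalib/jetson-nano-ubuntu:bionic-20221109\nRUN echo hello\n"

print(unpinned_base_images(floating))
print(unpinned_base_images(pinned))
```

The first call reports the floating `bionic` tag and the second reports nothing. The sketch does not understand multi-stage builds, so `FROM scratch` or a reference to an earlier build stage would also be flagged.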


@sgserg thank you for sharing that - we were planning to reference the hash but this is a better approach.
