What’s the point of restarting a deployment over and over again if it generates an ‘error no space left on device’?
This loop continued until the entire data plan of my GW was consumed, resulting in an offline device…
Hey, that is a fair point, and in fact we already have an issue that we track internally. Unfortunately it hasn’t been a high-priority issue, so it still hasn’t been handled, but I will ping the people working on it to take another look.
@sradevski and @boes_seob I think the one problem specifically with space is that filesystem auditing for layered filesystems is not trivial and has a pretty big performance cost on the system. It’s of course possible to take a naive approach and just check the disk storage with, say, df, and refuse the update if the container image size is larger than the space left. Obviously this won’t work well, because we then don’t benefit from any of Docker’s layer sharing: even if you updated only one file in one of your images, a change of 1 kB or so, the check would block the update, since it would count your old image as, say, 2 GB and the new image to download as another 2 GB. On a 4 GB device like the BeagleBone Black you wouldn’t be allowed to update, even though it would actually be completely fine, because 99% of the space is shared between the two images.
Not sure if what I wrote here makes sense, but it’s not as straightforward a problem as it appears on the surface, unfortunately.
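To illustrate the point above, here is a minimal sketch of the difference between the naive free-space check and a layer-aware one. This is not the supervisor’s actual logic; the inputs (free bytes from df, layer sizes and digests from the registry) are assumed to be available through hypothetical helpers.

```typescript
// Naive check: blocks the update whenever the full image is larger than the
// free space reported by df, even if most layers are already on the device.
function naiveSpaceCheck(freeBytes: number, newImageBytes: number): boolean {
  return newImageBytes <= freeBytes;
}

// Layer-aware check: only the layers that are not already present need to be
// downloaded, so only those count against the free space.
function layerAwareSpaceCheck(
  freeBytes: number,
  newLayerSizes: number[],          // sizes of the layers of the new image
  newLayerDigests: string[],        // digests of those layers, same order
  presentLayerDigests: Set<string>, // digests already stored on the device
): boolean {
  const bytesToFetch = newLayerDigests.reduce(
    (sum, digest, i) =>
      presentLayerDigests.has(digest) ? sum : sum + newLayerSizes[i],
    0,
  );
  return bytesToFetch <= freeBytes;
}
```

In the 1 kB change example, `naiveSpaceCheck` would demand ~2 GB of free space, while `layerAwareSpaceCheck` would only need room for the single changed layer. The catch, as noted above, is that computing shared versus unique layer usage reliably on a layered filesystem is what carries the performance cost.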
@boes_seob what OS version and device type did you experience this on? It would be good to get a sense of what was filling up the disk space. I also believe the supervisor in the latest OS versions has exponential backoff for failed updates, which should make the situation a little bit better. We are definitely working on approaches to reduce these types of failures.
It also occurs in case of other deployment failures.
I reduced the size of the container image, changed RESIN_SUPERVISOR_UPDATE_STRATEGY to ‘delete-then-download’ and set RESIN_SUPERVISOR_DELTA_RETRY_COUNT to 1.
Same result: a looping download/deployment, this time caused by a “Failed to download image X due to ‘rsync exited. code: 11 signal: null’” error, and again the GW went down after all my data plan capacity was wasted.
Agree. But what about just adding a configuration flag for the number of deployment retries, and rolling back to the previous image if that threshold is exceeded? (A rough sketch of what I mean is below.)
This may not be compatible with your break-before-make deployment strategy (i.e. delete-then-download) but could work for all other make-before-break policies.
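A hedged sketch of the suggested behaviour, not an existing supervisor feature: retry a deployment up to a configurable count and fall back to the previous image once the threshold is exceeded. All names here (deployImage, rollbackTo, MAX_DEPLOY_RETRIES) are hypothetical.

```typescript
const MAX_DEPLOY_RETRIES = 3; // would come from a device configuration variable

async function deployWithRollback(
  deployImage: (image: string) => Promise<void>,
  rollbackTo: (image: string) => Promise<void>,
  newImage: string,
  previousImage: string,
): Promise<void> {
  for (let attempt = 1; attempt <= MAX_DEPLOY_RETRIES; attempt++) {
    try {
      await deployImage(newImage);
      return; // success: keep the new image
    } catch (err) {
      console.warn(`Deploy attempt ${attempt} failed:`, err);
    }
  }
  // Threshold exceeded: stop burning data and return to the known-good image.
  // This only works for make-before-break strategies, where the previous
  // image is still present on disk.
  await rollbackTo(previousImage);
}
```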
OS version = Resin OS 2.15.1+rev2
Supervisor version = 7.16.6
Maybe also interesting to know: my initial container image was too big, which I fixed by using multi-stage container builds.
Exponential backoff would indeed be useful. Another interesting feature might be a ‘cancel’ button that forces a rollback when something goes wrong, i.e. a way to manually trigger the final stage of your exponential backoff.
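For illustration only, a rough sketch of exponential backoff between failed update attempts with a manual “cancel” escape hatch as suggested above; this is not how the supervisor actually implements its backoff, and `tryUpdate`/`cancelled` are hypothetical hooks.

```typescript
const BASE_DELAY_MS = 30_000;   // 30 s after the first failure (assumed value)
const MAX_DELAY_MS = 3_600_000; // cap retries at one hour apart (assumed value)

function backoffDelay(failureCount: number): number {
  // 30 s, 60 s, 120 s, ... capped at MAX_DELAY_MS
  return Math.min(BASE_DELAY_MS * 2 ** failureCount, MAX_DELAY_MS);
}

async function updateLoop(
  tryUpdate: () => Promise<boolean>, // resolves true when the update succeeds
  cancelled: () => boolean,          // true once the user hits "cancel"
): Promise<void> {
  for (let failures = 0; ; failures++) {
    if (cancelled()) {
      // Manual cancel: stop retrying (and, in the proposed feature, roll back
      // to the previous release here) instead of consuming more data.
      return;
    }
    if (await tryUpdate()) {
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(failures)));
  }
}
```

The key property for the data-plan problem is that the wait between download attempts grows quickly, so a persistently failing update no longer drains the connection at a constant rate.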
@boes_seob both of those are great feature requests. Would you mind detailing them on https://github.com/balena-io/balena-supervisor/issues so we can see how easy they are to add?
@shaunmulligan OK, will do.