What’s the point of restarting a deployment over and over again if it generates an ‘error no space left on device’?
This loop continued until the entire data plan of my GW was consumed, resulting in an offline device…
Hey, that is a fair point, and in fact we already have an issue that we track internally. Unfortunately it hasn’t been a high-priority issue, so it still hasn’t been handled, but I will ping the people working on it to take another look.
@sradevski and @boes_seob I think the one problem specifically with space is that filesystem auditing for layered filesystems is not trivial and has a pretty big performance cost on the system. It’s of course possible to take a naive approach and just check the disk storage with, say, df, and refuse the update if the container image size is larger than the space left. Obviously this won’t work well, because we then don’t benefit from any of Docker’s layer sharing: even if you updated only one file in one of your images, a change of 1 kB or so, the check would block the update, since it would count your old image as, say, 2 GB and the new image to download as another 2 GB. On a 4 GB device like the BeagleBone Black you wouldn’t be allowed to update, even though it would actually be completely fine, because 99% of the space is shared between the two images.
Not sure if what I wrote here makes sense, but it’s not as straightforward a problem as it appears on the surface, unfortunately.
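To illustrate the point above, here is a minimal sketch of the difference between the naive free-space check and a layer-aware one. This is not the supervisor’s actual logic; the inputs (free bytes from df, layer sizes and digests from the registry) are assumed to be available through hypothetical helpers.

```typescript
// Naive check: blocks the update whenever the full image is larger than the
// free space reported by df, even if most layers are already on the device.
function naiveSpaceCheck(freeBytes: number, newImageBytes: number): boolean {
  return newImageBytes <= freeBytes;
}

// Layer-aware check: only the layers that are not already present need to be
// downloaded, so only those count against the free space.
function layerAwareSpaceCheck(
  freeBytes: number,
  newLayerSizes: number[],          // sizes of the layers of the new image
  newLayerDigests: string[],        // digests of those layers, same order
  presentLayerDigests: Set<string>, // digests already stored on the device
): boolean {
  const bytesToFetch = newLayerDigests.reduce(
    (sum, digest, i) =>
      presentLayerDigests.has(digest) ? sum : sum + newLayerSizes[i],
    0,
  );
  return bytesToFetch <= freeBytes;
}
```

In the 1 kB change example, `naiveSpaceCheck` would demand ~2 GB of free space, while `layerAwareSpaceCheck` would only need room for the single changed layer. The catch, as noted above, is that computing shared versus unique layer usage reliably on a layered filesystem is what carries the performance cost.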
@boes_seob what OS version and device type did you experience this on? It would be good to get a sense of what was filling up the disk space. I also believe the supervisor in the latest OS versions has exponential backoff for failed updates, which should make the situation a little bit better. We are definitely working on approaches to reduce these types of failures.
It also occurs in case of other deployment failures.
I reduced the size of the container image, changed RESIN_SUPERVISOR_UPDATE_STRATEGY to ‘delete-then-download’ and set RESIN_SUPERVISOR_DELTA_RETRY_COUNT to 1.
Same result: a looping download/deployment, this time caused by a “Failed to download image X due to ‘rsync exited. code: 11 signal: null’” error, and again the GW went down after all my data plan capacity was wasted.
Agree. But what about just adding a configuration flag for the number of deployment retries, and rolling back to the previous image if that threshold is exceeded? (A rough sketch of what I mean is below.)
This may not be compatible with your break-before-make deployment strategy (i.e. delete-then-download) but could work for all other make-before-break policies.
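A hedged sketch of the suggested behaviour, not an existing supervisor feature: retry a deployment up to a configurable count and fall back to the previous image once the threshold is exceeded. All names here (deployImage, rollbackTo, MAX_DEPLOY_RETRIES) are hypothetical.

```typescript
const MAX_DEPLOY_RETRIES = 3; // would come from a device configuration variable

async function deployWithRollback(
  deployImage: (image: string) => Promise<void>,
  rollbackTo: (image: string) => Promise<void>,
  newImage: string,
  previousImage: string,
): Promise<void> {
  for (let attempt = 1; attempt <= MAX_DEPLOY_RETRIES; attempt++) {
    try {
      await deployImage(newImage);
      return; // success: keep the new image
    } catch (err) {
      console.warn(`Deploy attempt ${attempt} failed:`, err);
    }
  }
  // Threshold exceeded: stop burning data and return to the known-good image.
  // This only works for make-before-break strategies, where the previous
  // image is still present on disk.
  await rollbackTo(previousImage);
}
```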
OS version = Resin OS 2.15.1+rev2
Supervisor version = 7.16.6
Maybe also interesting to know: my initial container image was too big, which I fixed by using multi-stage container builds.
Exponential backoff would indeed be useful. Another interesting feature might be a ‘cancel’ button that forces a rollback when something goes wrong, i.e. a way to manually trigger the final stage of your exponential backoff.
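For illustration only, a rough sketch of exponential backoff between failed update attempts with a manual “cancel” escape hatch as suggested above; this is not how the supervisor actually implements its backoff, and `tryUpdate`/`cancelled` are hypothetical hooks.

```typescript
const BASE_DELAY_MS = 30_000;   // 30 s after the first failure (assumed value)
const MAX_DELAY_MS = 3_600_000; // cap retries at one hour apart (assumed value)

function backoffDelay(failureCount: number): number {
  // 30 s, 60 s, 120 s, ... capped at MAX_DELAY_MS
  return Math.min(BASE_DELAY_MS * 2 ** failureCount, MAX_DELAY_MS);
}

async function updateLoop(
  tryUpdate: () => Promise<boolean>, // resolves true when the update succeeds
  cancelled: () => boolean,          // true once the user hits "cancel"
): Promise<void> {
  for (let failures = 0; ; failures++) {
    if (cancelled()) {
      // Manual cancel: stop retrying (and, in the proposed feature, roll back
      // to the previous release here) instead of consuming more data.
      return;
    }
    if (await tryUpdate()) {
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(failures)));
  }
}
```

The key property for the data-plan problem is that the wait between download attempts grows quickly, so a persistently failing update no longer drains the connection at a constant rate.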
@boes_seob both of those are great feature requests. Would you mind detailing them on https://github.com/balena-io/balena-supervisor/issues so we can see how easy they are to add?
@shaunmulligan OK, will do.