I am trying to update older RPi devices (Zero W and B+ v1.2) with a new build, and the update has sort of ground to a halt and won't complete. So I forcefully removed all the Docker containers and images and shut down the device. Then I pinned it to a new build that should be much lighter weight. But when the device comes back up, it still tries to fully deploy the OLD build instead of the one it's supposed to deploy. I assume it would deploy the new build once it has finished the old one, but since the old build keeps crashing the device, this is not a workable situation.
Is the device available online by chance, for us to take a look? If so, please send us the UUID of the device. If not, have you got another device on the same network that we can use to hop into this device? That could help us check the logs and investigate some more.
Well, I got the devices to drudge through it, and they are now running the latest builds I need on them. So I don't really have any way to troubleshoot at the moment, because everything is working now.
I also discovered there is an issue on the RPi B+ v1.2, which I will post separately about.
But this post was also meant to be informative to the balenaOS and/or balenaCloud teams. The devices can be made to push through the problem and get onto the next build; it just takes many hours, and in my case, two days. I was hoping to spur some interest from the teams in looking into why the device must first update itself to its prior build before it even tries to update to the one it's pinned to. That is not really a diagnostic problem, but rather a process problem. When the device boots up, the first thing it should do is check balenaCloud for the latest target state, and fall back to its existing config only if nothing new comes from the cloud, rather than what it appears to do now, which is to fully re-establish the prior state and only then look to the cloud for updates.
Hi Mark – thanks for getting back to us. We would definitely like to understand what's going on here; even if your device has made it through this particular problem, it may still be helpful to examine its logs. Alternatively, if we were able to duplicate this problem ourselves, it would help us investigate what you saw. Are you able to share the device UUID, or to confirm a set of steps that reproduce the problem? Either of those would help us enormously.
So, I think I may have figured out a big issue that was contributing to this problem. I have a mix of older and newer devices, and I had stripped down the application for deployment on the older devices. But I was pushing both that stripped-down build and the regular build to the same application in balenaCloud, and pinning devices to builds. So I basically had two different applications living in the same application.
I would always run the bigger build and then the smaller build, then pin devices. The two builds share some of the containers, but the smaller build omits the containers the older devices don't need or can't run.
What I think might have been happening is that the images/deltas were getting confused, because the bigger build has eight containers and the smaller build only three. Every update resulted in two builds, a big one and a small one. So I'm thinking that alternating between big and small builds in the build history was making delta calculations a problem for the build system. And I suspect that was causing the issue I was most recently seeing, where the smaller devices would throw 503 and 404 errors during the delta download stage of deployment.
This is all theory, but it might make sense to those who know the build system.
I have moved the older devices to a different balenaCloud application, and now push only small builds to that application and big builds to the original one. Things seem to be a lot happier this way. I even got my most troublesome RPi B+ 1.2 working.
Hey Mark, glad to hear things are working better for you now. I think splitting the app into two smaller ones is the way to go; it should make handling updates easier and will also make the creation of deltas more efficient. Build-time deltas are generated when you push a new release, and they are the diff from the application's current release to the one being pushed. This means that your old flow would trigger some useless deltas to be generated (big build to small build), which in turn might have caused other slowdowns. However, the behaviour you describe is still troublesome, as the device should not have to go through the intermediate release, from my understanding. We are currently looking into this and discussing it internally, and we will get back to you as soon as we figure out what the issue is here.
Hi there, I am not sure whether you've seen the docs on delta behaviour, but what you describe follows how on-demand deltas work. A delta is automatically generated at build time only between the v.Current and v.Next releases. If a device is requesting to go to any other release, a delta is requested on demand between the device's current release and its target release (which may be an older release or a release from a completely different app). These on-demand deltas can take some time to generate, depending on the binary differences between the images. In the device logs, you will see exponential-backoff retry behaviour from the supervisor, where an HTTP error is actually expected/normal behaviour (it just means the delta isn't ready yet).
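To illustrate the pattern, here is a rough sketch of that retry behaviour (this is not the actual supervisor code, and DELTA_URL is just a placeholder for the delta endpoint being polled):

```bash
# Illustrative sketch only: keep asking for the delta and back off
# exponentially while the server returns an HTTP error, which simply
# means the delta has not finished generating yet.
delay=10
until curl -fsS "$DELTA_URL" -o /dev/null; do
    echo "Delta not ready yet, retrying in ${delay}s"
    sleep "$delay"
    delay=$(( delay * 2 ))                # exponential backoff
    [ "$delay" -gt 300 ] && delay=300     # cap the wait at 5 minutes
done
echo "Delta is ready to download"
```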
When troubleshooting bad deploys, I personally find it more convenient to move the device to an empty app (one with no releases) first and let the supervisor establish the empty target state and remove all the containers. Then, on the host OS, check for and remove any problematic downloaded image deltas using the balena rmi ... flow if required. That way, when you move the device to the target app, it can begin from a known good state and avoid generating deltas that would be bigger than a new image.
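For reference, the host OS side of that looks something like this (a rough sketch; the image ID is a placeholder, and the commands use balenaEngine's Docker-compatible CLI on the device):

```bash
# On the host OS (dashboard host terminal or balena ssh <uuid>):
balena ps -a                  # confirm the containers are gone
balena images                 # list downloaded images and their IDs
balena rmi <image-id>         # remove a specific problematic image
balena rmi -f <image-id>      # force-remove it if it is still referenced
```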
That's a good idea - moving the device to a new, empty application. I hadn't thought of that. What I have been doing to try and clear a device is to stop all containers, remove all containers, remove all images, and reboot the device, all as one pasted command in an SSH terminal (roughly the sketch below). Moving to an empty application instead gives a little more room for host diagnostics and verification that things are actually ready for a reboot.
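The one-liner I have been pasting looks roughly like this (a sketch of my own flow using balenaEngine's Docker-compatible CLI on the host OS, not an officially recommended procedure):

```bash
# Stop and remove every container, force-remove all images, then reboot.
balena stop $(balena ps -aq); \
balena rm $(balena ps -aq); \
balena rmi -f $(balena images -q); \
reboot
```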
Very good. This approach also avoids potentially unnecessary delta generation in edge cases such as this, where the computational overhead of generating a delta may be greater than just letting the device download an overall smaller release image.