Let’s Hack Balena! Moving devices from one application to another without re-downloading the entire application image

Hey guys,

Per usual, I have some regrets about the way my fleet is deployed. My customer success contact advised me to put all my deployed devices in one big happy application…

I didn’t. I have around 10-20 applications, with 3-5 getting added every week. As you can imagine, pushing a fleet-wide update takes a minute… or 40.

However, while I am not that bandwidth-limited, my devices don’t like staying online for long periods of time, and a sizable percentage would never finish re-downloading the entire application image if I moved them.

The reason I designed them like this was so that location-specific environment variables could be used easily. Now that I have learned how to use tags from within my application, I will use tags instead. Is this how people do things? Are environment variable sub-groups in the pipeline? How do other people solve my problem?

Are there any weird things that happen when you have 200+ devices in one application? I have been told that isn’t even close to what balena has been tested on, but I can’t help but notice that larger applications seem to get slower at updating device status. Will this effect increase?

Thanks for your help.
My best idea would be if it were somehow possible to temporarily create a multi-application release; then the devices wouldn’t need to update when moved. But that wouldn’t get around the /data purge, which I also want to avoid.

Am I doomed to running balena push x1, x2, x3, … in a script?
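
For what it’s worth, the script I have in mind is nothing clever, just a loop over application names, roughly like the sketch below (the application names are placeholders for my real fleets, and it assumes the balena CLI is installed and logged in):

```python
#!/usr/bin/env python3
# Rough sketch of the "push to every application in a loop" approach.
# The application names are placeholders for my real fleets.
import subprocess

APPLICATIONS = ["site-x1", "site-x2", "site-x3"]  # ...and so on

for app in APPLICATIONS:
    print(f"Pushing to {app}...")
    # Assumes the balena CLI is installed and logged in, and that the
    # current directory contains the project being pushed.
    subprocess.run(["balena", "push", app], check=True)
```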

Hi, there is no reason that larger fleets will take longer to update, as each device checks its target state and downloads updates independently.
From my understanding, you currently have many applications and you push the same code to each of them, but you would like to move all the devices to a single application. Is this correct?
If so, could you be specific about the issues you are facing when trying to move devices into the same application?

Hey @lucianbuzzo,

The main issue I would like to focus on is moving the devices to one application. I haven’t run into any difficulties yet because I haven’t dared try. Dumping /data and re-downloading a 600 MB image would probably send 10-20 percent of my devices into the abyss of offline mode. They just don’t have the signal strength to stay connected long enough to download an image of that size, and there doesn’t seem to be support for resuming incomplete downloads from where you left off. I can’t afford that, so I am stuck with my system of many applications for now.

The solution for this would be a one-time soft-move, meaning the devices are moved to a new application but keep their release and /data in place. Making this possible would require assistance from balena, and it might not be possible at all.

Here is what I propose:
There is a script or series of actions that takes place during a move.
On the balena cloud back-end, the device is “moved” in the sense that it is re-tagged for the new application. The new application is updated to include that device, and a command is queued to tell that device it has been moved. When that device comes back online (or immediately, if it is online during the move), the supervisor is told to do some or all of the following:

  • Update the config.json (maybe?)
  • Purge the /data dir with balena engine
  • Purge the old release
  • Possibly more

Now, I can’t have the last three things happen, so the method I would propose is to manually modify the config.json of each device, updating applicationName, applicationId, and whatever else is needed.
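
Concretely, I’m picturing something like the small sketch below run on each device. The config.json path and the idea that these two fields are sufficient are assumptions on my part; whether anything else would need to change is exactly what I’m unsure about:

```python
# Sketch only: re-point a device's config.json at a new application.
# The path and the assumption that these two fields are enough are mine.
import json

CONFIG_PATH = "/mnt/boot/config.json"   # typical location on balenaOS (assumption)
NEW_APP_NAME = "one-big-happy-app"      # placeholder
NEW_APP_ID = 123456                     # placeholder

with open(CONFIG_PATH) as f:
    config = json.load(f)

config["applicationName"] = NEW_APP_NAME
config["applicationId"] = NEW_APP_ID

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)
```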

Then the part I would need balena’s help with:
I would release the code to the final application, and the same code to the donor application. Balena would then take the release on the final application and make it identical to the one on the donor application. This is where things might fall apart; it just might not be possible. Or worse, it might break the cache system and cause a re-download of all the layers, which would make this a fruitless exercise.

I would initiate the move with all my devices powered off. The devices would be moved, and the queued command to update the devices in the field would have to be deleted. The devices would then get powered back on, reach out like normal, and magically be in the new application with no new release download needed.

What I am asking here is: is this possible? Because of the time involved, this might be more of a custom engineering job than a support issue, but before we go there I would like to establish whether it is even possible.

Also,

By update, I mean refreshing the dashboard view of the fleet. For example, if I power cycle the devices, the more devices there are in that application, the longer it takes for the dashboard to show that the devices have been online for X minutes rather than several hours.

It seems that to update your view of the device status you have to be on that tab, and the more devices you have, the longer each refresh takes to load.

But that is not a problem just an observation.

Thanks

Hey @tacLog

OK, there’s a lot to unpack here, so I’m going to try and clarify. Firstly, you said that the applications are (will be?) running the same code? If this is the case, when moving the devices between these applications docker layer caching should come into play, as the image layers are still on your local device (depending on your update strategy).

Volumes are always purged when moving between applications. We’re talking internally about enabling ways of avoiding this for situations like the one you mention, but it isn’t quite there yet.

Now, I can’t have the last three things happen, so the method I would propose is to manually modify the config.json of each device, updating applicationName, applicationId, and whatever else is needed.

This won’t work, as the cloud needs to be informed of the device move. It needs this so that the device can be authenticated with the cloud and allowed to download images, etc. If you changed the values, your device would continue to get a target state from the cloud for the current release, because the cloud doesn’t know any better.

In terms of saving the volume data, perhaps the best way currently is to back it up somewhere, but I understand that’s not really a nice method. There might be a way of tricking the supervisor into believing the data is what it needs by changing the volume name and perhaps the labels - but this would need to be tested.
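
Completely untested, but the sort of trick I mean would look roughly like the sketch below, run against the engine on the device. The “<appId>_<volume name>” naming scheme and the io.balena.supervised label used here are assumptions and would have to be verified on a real device:

```python
# Completely untested sketch: copy an app's data volume into a new volume named
# for the target application, in the hope the supervisor adopts it as its own.
# The naming scheme and the label are assumptions that would need checking.
import subprocess

ENGINE = "balena-engine"          # the engine binary on the balenaOS host
OLD_VOLUME = "111111_resin-data"  # placeholder: <old appId>_<volume name>
NEW_VOLUME = "222222_resin-data"  # placeholder: <new appId>_<volume name>

# Create the new volume with the label the supervisor (presumably) expects.
subprocess.run(
    [ENGINE, "volume", "create", "--label", "io.balena.supervised=true", NEW_VOLUME],
    check=True,
)

# Copy the contents across using a throwaway container
# (any small image with a shell will do).
subprocess.run(
    [
        ENGINE, "run", "--rm",
        "-v", f"{OLD_VOLUME}:/from",
        "-v", f"{NEW_VOLUME}:/to",
        "alpine", "sh", "-c", "cp -a /from/. /to/",
    ],
    check=True,
)
```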

Another thing is that you cannot share images between applications, even if those images are built from the exact same source, because of the authentication the registry and API provide.

Given the above, I’m not sure on the best way to proceed. We’re more than happy to try and work with you to find a solution, but currently there’s no way out of the box to do what you want.

Hey Cameron,

Thanks for the insights!

Firstly, correct. The situation we are going for here is around 10-20 different applications getting merged into one. They all run the same code with slightly different configurations. It was a dreadful mistake to set things up this way…

Secondly, the possibility of some of the layers being cached is great news! I will run some tests to determine which layers are cached properly. If the updates turn out to be small enough, then we might not have a problem at all. I was under the impression that the images were purged along with /data, because of the sequence of status messages the supervisor logs during a move.
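
The test I have in mind is just comparing the layer digests of the image a device already has with the image the new application would hand it, roughly like the sketch below (the image references are placeholders, and the inspect field is as reported by the engine):

```python
# Quick check: do two images built from the same source for two different
# applications actually share layers? The image references are placeholders.
import json
import subprocess

ENGINE = "balena-engine"  # on the device; plain "docker" on a laptop


def layer_digests(image: str) -> set:
    """Return the set of filesystem layer digests reported for an image."""
    out = subprocess.run(
        [ENGINE, "image", "inspect", "--format", "{{json .RootFS.Layers}}", image],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(json.loads(out))


old_layers = layer_digests("old-app-image:latest")  # placeholder reference
new_layers = layer_digests("new-app-image:latest")  # placeholder reference

shared = old_layers & new_layers
print(f"{len(shared)} of {len(new_layers)} layers are identical between the two images")
```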

As for my quoted segment: as I understand things, the cloud needs to know which devices are in which application so it can keep track of status, tags, env vars, and releases.

However, if the device is offline when I move it in the interface, then when the device comes back online it asks the cloud what’s up, and the cloud has to respond with: hey device X, you’re part of this application now, purge your data, change your config.json, and update your release with the release server. What I am not certain of is whether this is an explicit message, or whether this series of actions is just caused by a general check-in of some kind, like:
Device: Hey API, I am UUID, what env vars do I need, and what application am I part of?
API: (Not knowing or caring that this device was just moved) Hey UUID, you are part of this application, tracking this release, these are your env vars, and these are your supervisor config settings.
Device: Looks like this is different from what I have; time to do the update steps.
Or more likely something entirely different is happening.

Regardless, I was talking about tricking the device into thinking it was already part of the new application, to avoid the /data purge. However, that wouldn’t be simple and might be risky, so let’s not go there unless we have to.

Now, this is all kind of pointless if the required downloads won’t be that large. I can handle /data being deleted; it is annoying, but I have a system in place where that data should all be sent to one of our servers anyway. It would just be a matter of verifying that system.

Basically, if you’re right about the cache still working, we are good and don’t need to worry about the above process.

Thanks and have a great weekend!

-Thomas

This is pretty much exactly what happens. The supervisor can be thought of as stateless, in that it just looks at what is going on on the device, compares this to what the API sends, and continues like that, constantly iterating until it notices something change on either the cloud or device side.
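
In very rough pseudo-code the loop is conceptually just the following; none of these names reflect the supervisor’s real internals, and the “states” here stand in for the much richer target/current state the real supervisor compares:

```python
# Very simplified picture of the supervisor's reconciliation loop.
# None of these names are the supervisor's real internals.
import time


def fetch_target_state():
    # In reality: polled from the balena API for this device.
    return {"app": "my-app", "release": "release-2"}


def inspect_current_state():
    # In reality: derived from what the engine is actually running on the device.
    return {"app": "my-app", "release": "release-1"}


def apply_target(target):
    # In reality: a planned sequence of downloads, stops, starts, purges, etc.
    print(f"reconciling towards {target}")


def reconcile_forever(poll_interval=60):
    while True:
        target = fetch_target_state()
        current = inspect_current_state()
        if current != target:
            apply_target(target)
        time.sleep(poll_interval)
```
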

I’m pretty sure that docker should use the existing layers to cache the images, although there’s a chance the supervisor will remove this images first. Let me know what your tests say (I’ll not be able to test this until early next week probably) but if that’s not the case then I can put you together a custom build of the supervisor which does not remove the images prior to moving application, but this is a bit more time consuming so we should try other options first.