Unexpected Fleet restart when adding variable

So there is this ‘feature’ that restarts containers when a device variable is changed, without any warning. Can these restarts be disabled (or at least a warning be added)?

Two workflows when you have a (large) balena fleet:

Rebooting a device:
A user chooses to click ‘reboot device’.
The user's intention is very clear; the expected behavior is a device that reboots.
Yet a warning appears and the number of affected devices has to be manually entered. (Double safe: not just a simple click, but a verification of the user's intention.)
Then the device reboots.

Adding a device variable:
For a new feature the user wants to add a new device variable to the fleet.
POOF, the whole fleet reboots.

Do you notice that the main intention of the user is not to restart devices, yet no warning is given and a whole fleet is restarted…

Hello,
Thank you for bringing this to attention. It does seem like a warning would be helpful when adding/changing variables to inform the user. I would like to clarify that adding/changing a variable restarts the services related to the variable. For example, if a variable is created for all services, all services will restart. However, if you change a variable that has been created for only 1 service, only that 1 service should restart.
See the docs on variables here for more info: Variables - Balena Documentation

I have created a GitHub issue to track the improvement of this feature, although the issue is in a private repository so I am unable to share it. We will, however, let you know when the ticket is resolved and the feature is improved.

Thank you for the quick reply and for adding it to your issue list.

Can I please stress that a setting to disable this behavior is much more welcome than adding an extra warning layer?

When adding new features that require such a variable, it’s very common to set a proper default for the whole fleet.

But adding such a variable to the fleet requires a restart (at some point) of every robot in the fleet, 97% of which are not interested in that variable at all, as they haven’t even updated to the latest (beta?) software version yet.

I understand, and you have a fair point: a warning informs the user of what appears to be an inconvenience, but it does not relieve them of it. I will definitely raise the idea of disabling service restarts upon variable change for discussion so that we can consider the best way to proceed.

Hello,

My colleagues and I have discussed your feedback and we would like to get more information on your use case, as well as to check whether a feature we currently have would address the friction you are experiencing.

The feature we have available is called Update Locks (see: Update locks - Balena Documentation). With update locks, you can tell the supervisor not to restart the services on your devices (even when updating variables) until you tell it to do so by overriding the update lock. Please see the aforementioned docs on how to create an update lock. Can you please confirm whether this is a fitting solution for your use case?
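As a rough illustration of what taking an update lock from inside a service could look like, here is a minimal Python sketch. It assumes the lock is an exclusively created file and that it lives at `/tmp/balena/updates.lock`; both points are from memory of the Update locks docs and should be verified there. The `update_lock` helper is a hypothetical name, not part of any balena library.

```python
import os
from contextlib import contextmanager

# Assumed default lock location; verify against the Update locks docs.
DEFAULT_LOCK_PATH = "/tmp/balena/updates.lock"


@contextmanager
def update_lock(path=DEFAULT_LOCK_PATH):
    """Hold the lock file for the duration of the with-block (hypothetical helper)."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # O_EXCL makes creation fail with FileExistsError if the lock is already held.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    try:
        yield path
    finally:
        # Releasing the lock allows pending restarts or updates to proceed.
        os.remove(path)


# Usage (not executed here):
#     with update_lock():
#         run_robot_job()  # hypothetical critical section
```

While the block is held, the lock file exists; once it exits, the file is removed again, even if the critical section raised an exception.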

Regarding your use case, as mentioned, we would like to learn more about it. You mentioned that 97% of the robots in your fleet are not interested at all in some variables because they have not updated to the latest software version yet. When would be ideal for the restarts to take place? Do your devices have maintenance periods? Is there some time during the day (e.g. midnight) when it is okay for the devices to restart? What are the consequences of the restarts that you are trying to avoid? As much information as you are comfortable providing would be useful for us so that we can consider your use case and try to come up with the best possible solution for all users.

I’m working together with Timple, so I can comment on this as well:

We have a fleet of robots in the field, working at customer sites. We can’t have robots decide to restart services on their own, since that will also disable the hardware, and an operator on site is needed to confirm startup again. This also partly answers the follow-up question: whenever an operator/customer is near the machine, it could be allowed to update variables or software releases.

We tried the lockfile feature and this works as expected, also blocking unwanted restarts if fleet wide variables are updated. Thanks for that suggestion!

With the API we can also check whether a new release was downloaded and override the lock to force an update, so we’re almost there.
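For anyone wanting to do the same, the check-then-force flow could look roughly like the Python sketch below. It is only a sketch under assumptions: it assumes the local Supervisor API is used, that `GET /v1/device` returns an `update_downloaded` field, that `POST /v1/update` accepts `{"force": true}`, and that `BALENA_SUPERVISOR_ADDRESS` and `BALENA_SUPERVISOR_API_KEY` are exposed to the container. Please check all of these against the Supervisor API docs. The function names and the `operator_present` flag are hypothetical.

```python
import json
import os
import urllib.request


def should_force_update(device_state, operator_present):
    """Force only when a release is fully downloaded and someone is on site."""
    return bool(device_state.get("update_downloaded")) and operator_present


def supervisor_request(path, method="GET", body=None):
    """Build (but do not send) a request to the assumed local Supervisor API."""
    address = os.environ["BALENA_SUPERVISOR_ADDRESS"]
    api_key = os.environ["BALENA_SUPERVISOR_API_KEY"]
    data = json.dumps(body).encode() if body is not None else None
    return urllib.request.Request(
        f"{address}{path}?apikey={api_key}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def force_update_if_ready(operator_present):
    """Query the device state and force the update past the lock if allowed."""
    with urllib.request.urlopen(supervisor_request("/v1/device")) as resp:
        state = json.load(resp)
    if should_force_update(state, operator_present):
        request = supervisor_request("/v1/update", "POST", {"force": True})
        urllib.request.urlopen(request).close()
        return True
    return False
```

Keeping the decision in `should_force_update` separate from the HTTP calls makes it easy to unit test the on-site rule without a device.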
What we were wondering though, is there also an option to block services from downloading a new release? Right now updates are downloaded, but only applied when the lockfile is released. This is mostly fine, but since the robots are using cellular data, it might be tricky to always download updates in the field. Ideally we can tell the supervisor when WiFi is near and only allow downloads then, is that an option?

Thanks again for your help!