Lock single container & have supervisor request releasing lock

We are working on changing the way we build and deploy to our fleets to something like this:

  • We slice our service into multiple containers, which we define as separate blocks.
    Example: Telemetry@1.25.3
  • The code for all these blocks lives in a monorepo, which uses a dependency graph to rebuild and version only the blocks that are impacted by the code changes in each PR
  • A release for a fleet is made up exclusively of blocks

The goal for this kind of architecture is to allow high velocity updates to critical devices in the field without risking customer uptime (if we fail, the lights go out on our customer site).
So the idea is that we have containers that are low risk to update (think telemetry) and others that are more risky (think high frequency controllers for inverters). By splitting these up we can minimise the number of updates that touch the critical containers while still deploying new features quickly in other parts of the product.

Now for the feature request. At this moment a container can take a (f)lock to prevent the container from being stopped/restarted by ENV variable changes, new releases and OS updates. What we find a pity is that taking a lock in a single container does not still allow updates to the other containers.
To us this would be a very logical addition to the above concept.
For example, container ‘A’ takes a (f)lock → ‘A’ cannot be stopped, so anything goes, except stopping this container:

  • ENV variable updates that are only linked to container ‘B’ and ‘C’ are fine
  • ENV variable updates that are linked to ‘A’ alone or ‘A’ and ‘B’ cannot be executed
  • installing release updates in which container A has a zero delta can be done immediately
  • manually restarting or stopping containers ‘B’ and/or ‘C’ is possible
  • shutting off or rebooting the device will not be possible because of the flock
  • an OS update cannot be executed because of the flock
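The rules in the list above can be sketched as a small decision function. This is only an illustration of the requested behaviour, not anything the supervisor does today; the `Action` type, the action names and `is_allowed` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical action kinds that affect the whole device rather than
# individual containers.
DEVICE_WIDE = {"reboot", "shutdown", "os_update"}


@dataclass(frozen=True)
class Action:
    kind: str                          # e.g. "env_update", "release_update", "restart"
    restarts: frozenset = frozenset()  # containers the action would stop or restart


def is_allowed(action: Action, locked: frozenset) -> bool:
    """Return True if the action can run without stopping a locked container."""
    if action.kind in DEVICE_WIDE:
        # Reboot, shutdown and OS update stop everything, so any lock blocks them.
        return not locked
    # Otherwise the action is fine as long as it touches no locked container.
    return not (action.restarts & locked)


locked = frozenset({"A"})
print(is_allowed(Action("env_update", frozenset({"B", "C"})), locked))  # True
print(is_allowed(Action("env_update", frozenset({"A", "B"})), locked))  # False
print(is_allowed(Action("os_update"), locked))                          # False
```

A release in which container 'A' has a zero delta is simply an action whose `restarts` set does not contain 'A', so it passes the same check.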

To make this truly complete, the container holding a (f)lock would need a way of knowing that the supervisor is waiting for it to release the lock, and why.
For example, when Balena rolls out scheduled OS updates in the future, you treat that differently to a normal release update. I’m assuming an OS update won’t execute the handover procedures that containers may have, so the containers’ shutdown and startup procedures will be executed instead.
I’m thinking about these situations:

  • Container has flock, supervisor schedules an OS update. It then requests the container to release the lock and gives a reason why (‘OSupdate’). The container now knows a handover won’t be available and waits to release the lock until an approved maintenance window. Once the lock is released, the container triggers the update via local supervisor API and once completed the new container will retake the flock.
  • Container has flock, supervisor schedules a release update with delta >0. It requests the container to release the lock and gives a reason why (‘update_restart’). The container knows it will be restarted. It releases the lock and performs handover with newly released container which then retakes the lock.
  • Container has flock, supervisor schedules a release update with delta 0. It requests the container to release the lock and gives a reason why (‘update_no_restart’). The container knows it won’t be restarted because the delta for the container is zero. It releases the lock and retakes it after update is completed (only necessary if the other part of this feature request where we can lock a single container is not implemented).
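The three situations above could look roughly like this from the container's side. It is a sketch under the assumption that the supervisor exposes the reason somehow; no such field exists today, and the `Reason` enum, the step names and `plan_for` are made up for illustration.

```python
from enum import Enum


class Reason(Enum):
    # Reason strings as proposed in this post; not part of any existing API.
    OS_UPDATE = "OSupdate"
    UPDATE_RESTART = "update_restart"
    UPDATE_NO_RESTART = "update_no_restart"


def plan_for(reason: Reason, in_maintenance_window: bool) -> list:
    """Return the ordered steps the lock-holding container would take."""
    if reason is Reason.OS_UPDATE:
        if not in_maintenance_window:
            # No handover is possible, so wait for an approved window.
            return ["keep_lock"]
        return ["release_lock", "trigger_update", "retake_lock_in_new_container"]
    if reason is Reason.UPDATE_RESTART:
        # The container will be replaced: hand state over to its successor.
        return ["release_lock", "handover", "retake_lock_in_new_container"]
    # UPDATE_NO_RESTART: this container keeps running through the update.
    return ["release_lock", "wait_for_update", "retake_lock"]


print(plan_for(Reason("OSupdate"), in_maintenance_window=False))  # ['keep_lock']
```

The point of returning a plan rather than acting directly is that the decision stays entirely in application code; the supervisor would only publish the reason.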

To me these are natural extensions to the locking feature. They make it possible to lock only the processes that truly need protecting, so that only those block updates while the others can be freely updated without issue.
If the supervisor can request removal of the lock for specific reasons, you can start to build a customer experience on top of it: we could notify our users and let them schedule a moment to allow an OS update, or follow fixed maintenance windows.

I look forward to your thoughts/feedback on this and hope to see this capability added to the Balena platform in the future!

Hi @spacetesla,

Thank you for your message and the description of your use case.

I think what you are describing in your use case is better supported by our planned multi-apps feature, where you could group critical and not critical functionality into separate apps, with separate release and update workflows. Unfortunately we don’t have an ETA on this feature as there are many things that need to happen for that.

Partial update of apps is not something that we are planning to support, among other things because it makes deployment health harder to track. Currently a release is either installed or not. If it’s not installed or only partially installed, we know there is a deployment problem and the fleet operator can be notified about this via the device update status. If partial updates were valid, it would be very hard to know whether a partial update is due to a deployment issue without some specialized interface.

This also applies to locking. While locks are currently created per service, we consider locking status an app-level property, meaning the app is either locked or not. This is related to what I mention above about partial deployments. A partial app is considered non-functional, so the act of taking the lock by the supervisor means announcing its intention to temporarily take down the app.

There are also other practical reasons for this per-app locking point of view. With newer compose versions (which we are working to support), service dependencies may specify that a dependency should be restarted if its dependent is restarted. This means managing per-service locks would also imply taking into account these dependencies and would complicate the logic and make it easy to introduce bugs.

Finally, regarding the supervisor communicating its intention to take the lock, along with a reason: I understand the use case behind this idea, however the lock as a synchronization mechanism has been designed to be atomic. Any other type of mechanism is susceptible to TOCTOU (time-of-check to time-of-use) type bugs, which we have observed in the past.

For the use cases you describe, we have seen other customers use locks in combination with the /v1/update endpoint and the force property, to hold the locks until manual request, or until some event happens, checked via the updateStatus field, for instance.
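As a rough sketch of that pattern: the application keeps holding its lock and, once an operator approves, asks the supervisor to update with force set. The request shape below (POST to /v1/update with a JSON body of {"force": true} and the API key as an `apikey` query parameter) is my reading of the supervisor API docs and should be checked against the version you run; `update_if_approved` and its `approved` flag are hypothetical application code, and the updateStatus check is left out.

```python
import json
import urllib.request


def build_forced_update_request(address, api_key):
    """Build (but do not send) a POST to /v1/update with force set."""
    body = json.dumps({"force": True}).encode("utf-8")
    return urllib.request.Request(
        f"{address}/v1/update?apikey={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def update_if_approved(approved, address, api_key, send=urllib.request.urlopen):
    """Keep holding the lock unless approved; then force the update.

    `send` is injectable so the logic can be exercised without a supervisor.
    Returns True if a request was sent.
    """
    if not approved:
        return False
    send(build_forced_update_request(address, api_key))
    return True
```

Inside a container the address and key would typically come from the BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY environment variables, assuming the service is configured to receive them.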

Hope this helps, please let me know if I missed anything or if you’d like some more feedback on a specific point or some use case you are trying to solve for.

Hi, thanks for providing some more information.
I agree that the future apps functionality is probably what I am looking for, however the ‘at some point in the future’ timeline means I need to find other ways to do it for the foreseeable future.

The main point of friction I feel right now is that for Balena an app is a collection of containers that is either deployed or not, whereas for us a single container is an app that is either deployed or not (or locked or …).
So I am hunting for ways that I can get close to what I ideally want to build, while staying within the confines of ‘the Balena way’.

I understand how the dependencies (nice feature!) with newer compose versions might introduce some weird behaviour, although I do think this can be handled on our application side. In any case, I wasn’t really pushing for partial releases. The image tags will still show that all images of that release are deployed; many of the containers just have the same image version as before, so they don’t need a restart. I don’t understand how locking those (if they don’t need to be restarted anyway) would somehow lead to a partial release?

To be clear, even though I don’t see the immediate issues, I can live with the limitation that I cannot update individual containers. It does however make the following more valuable.

I want to start by saying that I would indeed not break up the atomicity of the taking of the lock by the supervisor. For me it is more that my application should rarely need to take the lock, however it does need control over the kind of action that will happen once the supervisor has the lock.
I keep coming back to the difference between a normal ‘app update’ and an ‘os update’.
For the first type we can write code to have a graceful update where state is transferred from one container to the next. With an OS update (and reboots) this is not possible. To me it is still valuable to know what will happen once the lock is released (or the supervisor has taken the lock). I fail to see how having an ‘intention field’ on the supervisor API, which communicates this intention, makes the taking of the lock no longer atomic. The application code will check the ‘intention field’ and decide to either release the lock or not. Once it releases the lock, the supervisor is free to work like it does now (and if it doesn’t release the lock there is also zero change to how the locking mechanism works)?