We are working on changing the way we build and deploy to our fleets to something like this:
- We slice our service into multiple containers, which we define as separate blocks, e.g. Telemetry@1.25.3
- The code for all these blocks lives in a monorepo, which uses a dependency graph to rebuild and version only the blocks impacted by the code changes in each PR
- A release for a fleet is made up exclusively of blocks
The goal of this architecture is to allow high-velocity updates to critical devices in the field without risking customer uptime (if we fail, the lights go out at our customer's site).
So the idea is that we have containers that are low-risk to update (think telemetry) and others that are riskier (think high-frequency controllers for inverters). By splitting these up, we can minimise the number of updates that touch the critical containers while still deploying new features quickly in other parts of the product.
Now for the feature request. At the moment a container can take a (f)lock to prevent itself from being stopped/restarted by ENV variable changes, new releases and OS updates. What we find a pity is that we cannot take a lock in a single container while still allowing updates to the other containers.
To us this would be a very logical addition to the above concept.
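For context, taking the lock today looks roughly like this from inside a container: an exclusive flock on the Supervisor's lock file. The path below is the one the balena update-locks docs describe; double-check it against your Supervisor version, and note the helper names here are just illustrative:

```python
import fcntl
import os

# Path the balena Supervisor checks for an update lock (per the update-locks
# docs at time of writing; older Supervisors used /tmp/resin/resin-updates.lock).
LOCK_PATH = "/tmp/balena/updates.lock"

def take_update_lock(path=LOCK_PATH):
    """Take an exclusive flock so the Supervisor won't stop/restart us."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    # Non-blocking: raises OSError if another process already holds the lock.
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return fd  # keep this fd open; closing it releases the lock

def release_update_lock(fd):
    """Release the flock so pending updates can proceed."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

The lock is per-file descriptor, so the container simply holds the fd open for as long as it must not be disturbed.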
For example, container ‘A’ takes a (f)lock → ‘A’ cannot be stopped, but anything that doesn’t stop this container goes:
- ENV variable updates that are only linked to container ‘B’ and ‘C’ are fine
- ENV variable updates that are linked to ‘A’ alone or ‘A’ and ‘B’ cannot be executed
- installing release updates in which container ‘A’ has a zero delta can be done immediately
- manually restarting or stopping containers ‘B’ and/or ‘C’ is possible
- shutting off or rebooting the device will not be possible because of the flock
- an OS update cannot be executed because of the flock
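To make the semantics above concrete, here is a rough sketch of the decision logic as I picture it. This is not Supervisor code; the function and field names are made up for illustration:

```python
# Hypothetical sketch of the proposed per-container lock semantics.
# `action` describes what the Supervisor wants to do; `locked` is the set
# of container names currently holding a (f)lock.

def blocked_containers(action, locked):
    """Return the set of containers whose locks prevent `action`.

    action: {"kind": str, "targets": set} where `targets` holds the names of
      the containers the action would stop/restart (a zero-delta container
      in a release update is not a target, since it won't be touched).
    """
    if action["kind"] in ("reboot", "os_update"):
        # Device-wide actions: any held lock blocks them.
        return set(locked)
    # ENV changes, release updates, manual restarts: only a lock held by
    # an affected container blocks the action.
    return action["targets"] & locked
```

With ‘A’ locked, an ENV update targeting only ‘B’ and ‘C’ returns an empty set (it may proceed), while one targeting ‘A’ and ‘B’ returns {‘A’} (it must wait), which matches the bullets above.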
To make this truly complete, the container holding a (f)lock should have a way of knowing that the supervisor is waiting for it to release the lock, and why.
For example, when Balena rolls out scheduled OS updates in the future, you would treat that differently from a normal release update. I’m assuming an OS update won’t execute the handover procedures that containers may have, resulting in shutdown and startup procedures being executed instead.
I’m thinking about these situations:
- Container holds the flock; the supervisor schedules an OS update. It requests that the container release the lock and gives the reason (‘OSupdate’). The container now knows a handover won’t be available and waits to release the lock until an approved maintenance window. Once the lock is released, the container triggers the update via the local supervisor API, and once it completes, the new container retakes the flock.
- Container holds the flock; the supervisor schedules a release update with delta > 0. It requests that the container release the lock and gives the reason (‘update_restart’). The container knows it will be restarted. It releases the lock and performs handover with the newly released container, which then retakes the lock.
- Container holds the flock; the supervisor schedules a release update with delta 0. It requests that the container release the lock and gives the reason (‘update_no_restart’). The container knows it won’t be restarted because its delta is zero. It releases the lock and retakes it after the update completes (only necessary if the other part of this feature request, locking a single container, is not implemented).
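The container side of those three scenarios could look something like this. The Supervisor has no "release the lock, because X" signal today, so assume some hypothetical channel delivers the reason string; the callback names are made up:

```python
# Hypothetical container-side handler for the three scenarios above.
# `release_lock` / `retake_lock` are callbacks the container provides;
# `in_maintenance_window` is whatever policy the fleet owner defines.

def on_release_requested(reason, in_maintenance_window, release_lock, retake_lock):
    """React to the Supervisor asking us to drop the update lock."""
    if reason == "OSupdate":
        # No handover will happen, so only comply inside a maintenance window.
        if not in_maintenance_window:
            return "waiting_for_window"
        release_lock()
        return "released_for_os_update"   # the new container retakes the lock on boot
    if reason == "update_restart":
        release_lock()                    # hand over to the newly released container,
        return "released_for_handover"    # which then retakes the lock itself
    if reason == "update_no_restart":
        release_lock()                    # our delta is zero: we keep running,
        retake_lock()                     # so we can retake the lock ourselves
        return "released_and_retaken"
    return "ignored_unknown_reason"
```

The point is that the reason string alone is enough for the container to pick the right strategy, without the Supervisor needing to know anything about the container's internals.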
To me these are natural extensions to the locking feature: they allow us to selectively lock the smallest possible set of processes, so that only those that truly need protection block updates, while the others can be updated freely.
If the supervisor can request removal of the lock for specific reasons, we can start to build customer experience on top of it: for example, notifications that let our users schedule a moment to allow an OS update, or fixed maintenance windows.
I look forward to your thoughts/feedback on this and hope to see this capability added to the Balena platform in the future!