Stale containers blocking updates

We just had an issue with one of our devices where a container wasn’t restarted on the new version after a fleet update, leaving the device effectively offline as it was unable to communicate with our backend.

It appears the issue was that at some point in the past we’d started a container manually via a host OS SSH session, along the lines of balena run -it $IMAGE_ID bash. Fast forward a few weeks: a new release was pushed to its fleet and the device attempted to update. The other containers updated, but this one got stuck in a state of Downloaded -> Downloaded.
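For context, the session that left the stray container behind was roughly along these lines (a reconstruction rather than an exact log; on balenaOS the host’s balenaEngine is invoked as balena, and $IMAGE_ID is just a placeholder for one of the fleet’s image IDs):

```bash
# Start a throwaway container from one of the fleet's images for some
# exploratory work. The supervisor knows nothing about this container.
balena run -it $IMAGE_ID bash

# ...do some exploration, then exit the shell...

# Without --rm the exited container is left behind, and it still references
# the old image, so that image can't be removed when the next release rolls out.
```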

The old container showed up when running balena ps -a, after which it could be deleted with balena rm $CONTAINER_ID. Restarting the supervisor then forced it to reattempt the update, which unstuck everything and allowed the new container to be started.
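In case it helps anyone else hitting this, the recovery from a host OS terminal was roughly the following (the supervisor service name varies between balenaOS versions, so check it with systemctl first):

```bash
# List all containers, including stopped ones, to find the stray container
balena ps -a

# Remove the stale container ($CONTAINER_ID is the ID from the listing above)
balena rm $CONTAINER_ID

# Restart the supervisor so it retries the update. The unit is
# balena-supervisor.service on recent balenaOS releases
# (resin-supervisor.service on older ones) - confirm with:
#   systemctl list-units | grep supervisor
systemctl restart balena-supervisor
```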

I’m not sure how many people this has impacted in practice, but it would be great for the dashboard to give some sort of feedback when an update is blocked because the old image still has containers depending on it.

Hey @JonWood, thanks for posting here.

My immediate reaction is that starting containers manually via a terminal to the host OS is an advanced use case, and doing so indirectly communicates that you know what you’re doing and that we shouldn’t interfere with it. In this case I’m more wondering what failure made that workaround necessary in the first place?

The thing is, if the “normal” method of creating and pushing a new release is used, this issue wouldn’t arise, so we might fall into the trap of trying to engineer for a situation we can’t predict or control. On the other hand, we don’t want to prevent you from taking the action that you did if it gets you out of a jam; we’d rather focus our efforts on avoiding getting you stuck in a jam in the first place!

Absolutely agreed that we’re doing things that aren’t necessarily supported, and we wouldn’t typically be starting containers interactively from the host OS. Our devices get deployed to customer sites where some exploration can be needed to work out how to hook them into the systems on site, and I believe that’s what the container was being used for. Since then we’ve implemented a dedicated container for this sort of work which we can SSH into, rather than relying on ad-hoc containers started on the device.
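As a rough sketch of what that looks like for us now (assuming the dedicated service is named debug in the fleet’s docker-compose.yml; the balena CLI can SSH straight into a named service on a device):

```bash
# From a workstation with the balena CLI, open a shell inside the dedicated
# debug service instead of starting ad-hoc containers on the host OS.
# <device-uuid> is the device's UUID; "debug" is our (hypothetical) service name.
balena ssh <device-uuid> debug
```

This keeps all containers on the device under the supervisor’s control, so nothing is left behind to block image removal at the next update.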

I think you’re right that engineering an automatic fix for this specific issue would probably be the wrong direction, since if there’s an unsupervised container sitting around there’s no way of knowing whether it’s safe to delete. It would be great to see a bit more in the way of alerting when a device hasn’t successfully restarted all its containers after an update, though.