Our three fleets each have 2 containers. On about 5 devices in one fleet and 2 in another, one of those two containers shut down cleanly (based on our service logs) and then just vanished. It was gone from the cloud inventory for those devices, balena ps -a on the device didn't show it even as exited, and balena images no longer listed the image, but the volumes remained (thankfully!). There was no trace of the missing container ever having existed via balena CLI commands (while SSH'd into the devices) or in any config or OS-level logs we could find on the box. Rebooting, restarting the balena services, etc. made no difference.
Rolling out a new release of our compose file with new container versions brought the missing container back.
If this happens again, what can we do to troubleshoot further? What commands can we run, or what logs can we capture?
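For reference, this is roughly the snapshot we're thinking of grabbing over SSH next time, before pushing the fix release. It's only a sketch: the unit names balena.service and balena-supervisor.service, and the supervisor database path under /mnt/data/resin-data/balena-supervisor/, are assumptions on my part for these OS/supervisor versions, so corrections are welcome.

```bash
#!/bin/bash
# Rough evidence-capture sketch to run on an affected device before deploying
# a new release. Unit names and paths below are assumptions, not confirmed.
set -u
OUT="/mnt/data/vanished-container-$(hostname)-$(date +%Y%m%dT%H%M%S)"
mkdir -p "$OUT"

# Engine-level view: containers (including exited), images, volumes, disk usage
balena ps -a      > "$OUT/ps-a.txt" 2>&1
balena images     > "$OUT/images.txt" 2>&1
balena volume ls  > "$OUT/volumes.txt" 2>&1
balena system df  > "$OUT/system-df.txt" 2>&1
balena info       > "$OUT/info.txt" 2>&1
# Engine event stream for the last day (container kill/die/destroy events, if any)
balena events --since 24h --until "$(date +%s)" > "$OUT/events-24h.txt" 2>&1

# Journals for the engine, supervisor and kernel (unit names may differ by OS version)
journalctl -a -u balena.service --no-pager            > "$OUT/journal-engine.txt" 2>&1
journalctl -a -u balena-supervisor.service --no-pager > "$OUT/journal-supervisor.txt" 2>&1
journalctl -k --no-pager                              > "$OUT/journal-kernel.txt" 2>&1
dmesg                                                 > "$OUT/dmesg.txt" 2>&1

# Supervisor's record of target vs. current state (path is an assumption)
cp /mnt/data/resin-data/balena-supervisor/database.sqlite "$OUT/" 2>/dev/null

echo "Captured to $OUT"
```

Does that look like a reasonable starting point, or are there better places to look?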
There was no common thread: different OS versions (5.24-prod, 6.0.13+rev1, 6.3.12+rev4) and different supervisor versions (16.1.0, 16.4.6, 17.0.1). No host OS or supervisor updates were performed.
Perhaps the host's OOM killer kicked in and that somehow caused the engine (moby/balenaEngine) to remove the container entirely instead of just restarting it?
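If it was the OOM killer, I'd expect some trace in the kernel logs, so next time something like the checks below is what we'd run first (assuming the journal still covers the window when the container disappeared, which it may not without persistent logging):

```bash
# Look for OOM-killer activity in the kernel ring buffer and the journal
dmesg | grep -iE 'out of memory|oom|killed process'
journalctl -k --no-pager | grep -iE 'out of memory|oom|killed process'

# If persistent logging is enabled, also check earlier boots
journalctl --list-boots
journalctl -k -b -1 --no-pager | grep -i oom
```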
We really have no idea; by the time we got onto these devices to investigate, there was no trace of the missing container left, which is why I'm asking for tips on additional commands to run or logs/places to look if we see the issue again.
Our fleets on balenaCloud haven't seen this happen (yet). The affected fleets are on our openBalena side, so I can't grant support access… but even there, switching to a newer release of our app restored the missing container, so it seems there's nothing left to investigate.
So I'm still just looking for advice on what we should capture for self-investigation the next time it happens, to try to sort it out. Obviously when it happens we're working against the clock to get things fixed as quickly as possible, so capturing everything we can and then deploying a new release is about the best we can do.
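One thing we're considering in the meantime is enabling persistent logging on these devices so the journal survives reboots and we don't lose the window around the event. As I understand it that's the persistentLogging flag in config.json on the boot partition (the path and the use of jq below are assumptions about what's available on the host OS):

```bash
# Enable persistent journald logs; config.json is normally mounted at /mnt/boot
# on the running device. jq may not be present on every host OS image, so this
# is only a sketch of the edit; a reboot is needed for it to take effect.
jq '.persistentLogging = true' /mnt/boot/config.json > /tmp/config.json \
  && mv /tmp/config.json /mnt/boot/config.json
```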