One container vanished on multiple devices?

Our three fleets each run 2 containers. On about 5 devices in one fleet and 2 in another, one of those two containers shut down cleanly (based on our service logs) and then simply vanished. It was gone from the cloud inventory for those devices, balena ps -a didn't even show it as exited, and balena images ls no longer listed the image, but the volumes remained (thankfully!). There was no trace of the missing container having ever existed via balena CLI commands (while SSH'd into the devices) or in any config or OS-level logs we could find on the box. Rebooting, restarting the balena services, etc. made no difference.

Rolling out a new release of our compose file with new container versions brought it back.

If this happens again, what can we do to troubleshoot further? What commands can we run, or what logs should we capture?

Hello @scscsc, thanks for sharing.

Did you perform any action before experiencing this issue? Any software or host OS update?

What device type do you use? What balenaOS and supervisor versions are you using?

Could you please share anything that would help us try to reproduce it?

There was no common thread: the affected devices were on different OS versions (5.24-prod, 6.0.13+rev1, 6.3.12+rev4) and different supervisor versions (16.1.0, 16.4.6, 17.0.1). No host OS or supervisor updates had been performed.

Perhaps the host's OOM killer fired, and that somehow caused the engine (moby) to remove the container entirely instead of just restarting it?
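
If that's what happened, the kernel log should show it. Here's a quick check I'm planning to run from the host OS next time; this is just a sketch assuming standard journalctl/dmesg on balenaOS and that the engine logs to balena.service (the service/container name is a placeholder):

```
# Look for OOM-killer activity in the kernel log
dmesg | grep -iE 'oom|out of memory'
journalctl -k --no-pager | grep -iE 'oom|killed process'

# Check whether the engine logged anything about the missing container
# (replace <container-name> with the vanished service's name)
journalctl -u balena.service --no-pager | grep -i '<container-name>'
```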

We really have no idea; by the time we got into these devices to investigate, there was no trace of the missing container, which is why I'm asking for tips on additional commands to run or logs/places to check if we see the issue again.

@scscsc could you please grant support access to the device(s) and share the full UUID(s) (via DM if you prefer)?

We would like to explore this further!

Our fleets on balenaCloud haven't seen this happen (yet). The fleets where this happened are on our openBalena side, so I can't grant support access… but even there, switching to a newer version of our app restored the missing container, so there doesn't seem to be anything left to investigate on those devices.

Still just looking for advice on what to capture for self-investigation the next time it happens. Obviously, when it happens we're working against the clock to get the device fixed as quickly as possible, so capturing everything we can and then deploying a new release is about the best we can do.

@scscsc did you capture any supervisor errors in the logs? Maybe check that next time!
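
For example, something along these lines from the host OS should surface them, assuming a recent balenaOS where the supervisor runs under the balena-supervisor unit (older releases used resin-supervisor):

```
# Last 500 lines of the supervisor and engine journals, all severities
journalctl -u balena-supervisor.service --no-pager -n 500
journalctl -u balena.service --no-pager -n 500
```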

Let us know if this happens again!

There didn't appear to be any errors in the supervisor or host OS logs. We'll capture them all next time, regardless of level (info/warn/error).
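
For the record, this is roughly the capture I plan to run from the host OS before redeploying a fix, so we have a snapshot to dig through afterwards. It's only a sketch: the unit names (balena.service, balena-supervisor.service) and the /mnt/data output path are assumptions for a recent balenaOS.

```
#!/bin/sh
# Snapshot engine/supervisor state and journals before redeploying.
# Assumes recent balenaOS: engine unit balena.service, supervisor unit
# balena-supervisor.service, writable data partition at /mnt/data.
set -eu

out="/mnt/data/vanished-container-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$out"

# Engine view of containers, images, volumes and the last 24h of events
balena ps -a    > "$out/ps.txt"
balena images   > "$out/images.txt"
balena volume ls > "$out/volumes.txt"
balena events --since 24h --until "$(date +%s)" > "$out/events.txt" || true

# Engine, supervisor and kernel journals, all severities
journalctl -u balena.service --no-pager            > "$out/engine-journal.txt"
journalctl -u balena-supervisor.service --no-pager > "$out/supervisor-journal.txt"
journalctl -k --no-pager                           > "$out/kernel-journal.txt"

tar -czf "$out.tar.gz" -C "$(dirname "$out")" "$(basename "$out")"
echo "Capture written to $out.tar.gz"
```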