Balena thinks service is downloaded but it will not start

I’ve seen this on several devices recently. We push a release, and the balenaCloud dashboard indicates “downloaded” for all the services, but some of them (always the same group) will not start.

According to the supervisor log, when I attempt to start the service the API call returns a 404. Sure enough, in /mnt/data/docker/containers there are no directories corresponding to the containers that won’t start.
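For anyone hitting something similar, here is a sketch of the checks I ran on the device. This assumes a recent balenaOS where the supervisor’s systemd unit is named `balena-supervisor` (older releases used `resin-supervisor`) and the engine binary is `balena-engine`; adjust the names for your version.

```shell
# Tail the supervisor's journal to see errors around the failed start
journalctl -u balena-supervisor --no-pager -n 200

# List the containers the engine actually knows about
balena-engine ps -a

# Compare against the on-disk container state directories
ls /mnt/data/docker/containers
```

If a service the dashboard shows as “downloaded” has no matching entry in either `balena-engine ps -a` or the containers directory, the supervisor’s view of the device has drifted from the engine’s.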

Rebooting the device often clears the problem, but I recently had a device where I had to push another release and then reboot before the containers directory was populated.

I guess my question is, can anyone comment on how my devices are getting into a state where Balena thinks the services are downloaded but they don’t appear to be?

The group of services that have the problem are all quite large, but the devices are on a reliable Ethernet connection, there are no obvious (at least to me) networking errors, and there are no space issues on the /mnt/data partition.

These devices are Generic_x86_64 or Intel_NUC, running balenaOS 2.68.1 with Supervisor 12.3.0 or 12.3.5, though there is also one device on 2.4.6 / 10.6.27 (we have some of those in the field).


Hey, I understand the confusion with this. The Supervisor is a complex application that generally works well, but there are some pain points we need to address to avoid this class of issue. Your specific state was most likely caused by the on-device database file getting out of sync with what is actually running in the engine. We plan on improving this by removing that database file, which should help significantly, but there’s no clear deadline for that, as we try to focus on patching such issues and adding feature requests from users.

In the more immediate future, I am working on a PR that will add more debug logging, so that when we look at the Supervisor’s journal logs on the device we can see why it isn’t starting containers.

Whenever you encounter such an instance, don’t hesitate to reach out; I look at these personally and can usually resolve them pretty fast. These situations are rare, but because the OS and Supervisor can be updated independently, we have seen issues tend to appear after big jumps in those versions. For example, see Properly handle legacy volumes · Issue #1604 · balena-io/balena-supervisor · GitHub.

Hi Miguel, thanks for reaching out. I will have new releases for my test fleet early next week, and I’ll leave one of the devices in this state (assuming it really does take a reboot to resolve, and not just magic) so that hopefully you can have a look at it while it’s failing to start the containers.

The issue you referenced definitely sounds like a possibility, and we did make a big supervisor jump, so hopefully we are onto something here.

Awesome. I would really appreciate some time to debug a device if you encounter the issue again.

Just a heads up: I reached out directly with the UUID and support access for a device currently in that state.


This was solved with great assistance from balena. In effect, we had a dependency chain of starting containers that was broken because one of them exited too quickly. It wasn’t obvious to us that this was happening, so thanks for the help!
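For reference, a minimal hypothetical compose sketch of the failure mode we hit. The service names here are made up, not our actual stack: a dependent service waits on a one-shot service via `depends_on`, and if the one-shot exits immediately the chain never brings the dependent up.

```yaml
version: '2.1'
services:
  init-task:
    build: ./init
    # Hypothetical one-shot service: it exits as soon as its work
    # is done, which can break the start-up chain for dependents.
  app:
    build: ./app
    depends_on:
      - init-task   # never started if init-task exits too quickly
```

Nothing in the dashboard pointed at this directly; it only became visible once we looked at which container in the chain was exiting first.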

I’m glad you’re sorted Dave! Thanks for letting us know.