I’m trying to reply to your request to share remote access to my devices, but unfortunately I can’t post in the same thread because “new users are limited to 3 replies in the topic”.
Please find the IDs for both Pi0 W devices showing the weird behavior I’m trying to fix, where the supervisor can’t start services. The HostOS is accessible and the status for both is Online (VPN Only). One Pi0 got into this state after a reboot via the dashboard, the other after a power loss.
d457ca0e7beb7f086753f3dbc6ed5eea - this one uses production HostOS build 2.48.0+rev1, supervisor 10.8.0
f963939a5b0b6a0522f45308b007f154 - this one uses development HostOS build 2.48.0+rev1, supervisor 10.8.0
Hey, thanks for allowing support access. I can see in the logs I pulled from the diagnostics page that balenaEngine is being terminated for some reason:
Jul 10 20:24:46 systemd[1]: balena.service: Found left-over process 16543 (balena-engine-r) in control group while starting unit. Ignoring.
Jul 10 20:24:46 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 10 20:24:46 systemd[1]: balena.service: Found left-over process 16585 (balena-engine-c) in control group while starting unit. Ignoring.
Jul 10 20:24:46 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
....
Jul 10 20:26:16 systemd[1]: balena.service: Start operation timed out. Terminating.
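For reference, those lines come from the balena.service journal on the host OS. If you’d like to follow along yourself, you can open a host OS terminal from the dashboard and tail the same unit; the commands below are standard systemd/journalctl, nothing balena-specific:

```
# Follow the engine unit's journal live
journalctl -u balena.service -f

# Or dump everything from the current boot
journalctl -u balena.service -b --no-pager
```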
I’ve asked for someone on the OS team to take a look.
Could you confirm that the following steps reproduce the error, so we can easily recreate it on our side (I’ve added a rough CLI sketch of the same steps below the list):
flash the latest OS available for the RPi Zero (if the latest works, try balenaOS 2.48.0+rev1)
deploy the balenaSense project
shut down the device via the dashboard OR by pulling power (doesn’t matter which)
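Purely as a sketch of those steps from the CLI side (the repository URL and app name here are my assumptions, not anything confirmed in this thread):

```
# 1. Flash balenaOS 2.48.0+rev1 for the Pi Zero with balenaEtcher / the dashboard image

# 2. Deploy the balenaSense project to a Pi Zero app (app name is hypothetical)
git clone https://github.com/balenalabs/balena-sense   # assumed project location
cd balena-sense
balena push my-pi-zero-app

# 3. Once all services are running, shut the device down from the dashboard
#    or simply pull the power
```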
Hi @Seva, is it fine for us to enable persistent logging on this device (6c86d890a6e037f2b0d18749529b3c41) and then reboot?
We would like to observe balenaEngine’s behaviour in the logs across reboots. We have seen engine crashes in the recent past caused by resource limitations, and we suspect this device has the same problem.
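For anyone else following the thread: persistent logging is normally toggled from the device (or fleet) configuration page in the dashboard, and as far as I know it ends up as a flag in the device’s config.json on the boot partition. A minimal way to check it from the host OS, assuming the usual /mnt/boot mount point:

```
# Check whether persistent logging is already enabled on the device
grep persistentLogging /mnt/boot/config.json

# "persistentLogging": true keeps journald logs across reboots, so the
# engine's pre-reboot logs are still there after the device comes back up
```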
On a related note: what SD card are you using? An SD card with a relatively slow write speed might also cause an issue.
Sure, feel free to enable persistent logging and then reboot the device as many times as you need.
We’re using SanDisk Ultra SD cards. Here is a short description from Amazon: SanDisk 16GB Ultra MicroSDHC UHS-I Memory Card with Adapter - 98MB/s, C10, U1, Full HD, A1, Micro SD Card - SDSQUAR-016G-GN6MA, Red.
Unfortunately, you’ve hit an issue where the container engine does not start within the allocated timeframe and gets into a restart loop.
Let me explain this a bit.
The container engine is started as a systemd service, and its configuration has a startup timeout (90 seconds at the moment). On boot, the engine only completes its startup procedure once it has loaded all previously created containers, so this step takes roughly O(N) time, where N is the number of containers. Loading a container also involves disk writes, because the engine needs to serialize the container’s new state.
For many devices, the time spent per container is really small, and 90 seconds is more than enough. However, on devices like the Raspberry Pi Zero it can become a problem because the startup procedure takes longer, and the more containers you have, the higher the chance that you will hit the timeout.
balenaSense has 5 containers, and when the supervisor rebooted the device to finish applying a new device configuration, the engine got into a restart loop because it could not complete its startup within the allocated 90 seconds.
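You can read the current limit straight from systemd on the host OS; on the affected OS version it should report the 90-second default mentioned above (standard systemctl, nothing balena-specific):

```
# Show the start timeout systemd enforces for the engine unit
systemctl show balena.service --property=TimeoutStartSec
# expected output on this OS version: something like TimeoutStartSec=1min 30s
```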
We are actively discussing the setup in a related balenaOS GitHub issue:
This thread will be updated once we have a resolution.
For now, I have updated the service config on your device manually to mitigate the problem. However, keep in mind that this change will not survive a host OS update.
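I won’t paste the exact edit from your device here, but generically this kind of mitigation is a raised start timeout on the engine unit. A rough sketch, assuming a writable systemd drop-in directory (not guaranteed on every balenaOS release) and an illustrative 600-second value:

```
# Sketch only: raise balena.service's start timeout via a systemd drop-in
mkdir -p /etc/systemd/system/balena.service.d
cat > /etc/systemd/system/balena.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=600
EOF
systemctl daemon-reload
systemctl restart balena.service
```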
We’ve disabled the engine startup timeout in balenaOS v2.98.4+, and that version is already available as a host OS update on select device types. For the RPi Zero, this OS version will be out soon, so keep an eye out and let us know whether it alleviates the issue described in this ticket.
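Once the new host OS reaches your devices, a quick way to confirm the behaviour should be to check the same systemd property as before; with the startup timeout disabled it should report infinity instead of 90 seconds:

```
systemctl show balena.service --property=TimeoutStartSec
# expected after upgrading to 2.98.4+: TimeoutStartSec=infinity
```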