Balena “Supervisor” process does not restart after reboot / shutdown / power loss

Hi Balena folks,

I’m trying to reply to your request to share remote access to my devices, but unfortunately I can’t post in the same thread because “new users are limited to 3 replies in the topic”.

This continues the original thread about the Balena “Supervisor” process not restarting after a device is “Shut Down” from BalenaCloud.

Below are the IDs of both Pi0 W devices showing the behavior I’m trying to fix, where the supervisor can’t start services. HostOS is accessible and the status for both is Online (VPN Only). One Pi0 got into this mode after a reboot via the dashboard, the other after a power loss.

d457ca0e7beb7f086753f3dbc6ed5eea - this one uses production HostOS build 2.48.0+rev1, supervisor 10.8.0

f963939a5b0b6a0522f45308b007f154 - this one uses development HostOS build 2.48.0+rev1, supervisor 10.8.0

Remote access is granted for 1 week.

Hey, thanks for allowing support access. In the logs I pulled from the diagnostics page, I can see that balenaEngine is being terminated for some reason:

Jul 10 20:24:46 systemd[1]: balena.service: Found left-over process 16543 (balena-engine-r) in control group while starting unit. Ignoring.
Jul 10 20:24:46 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 10 20:24:46 systemd[1]: balena.service: Found left-over process 16585 (balena-engine-c) in control group while starting unit. Ignoring.
Jul 10 20:24:46 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
....
Jul 10 20:26:16 systemd[1]: balena.service: Start operation timed out. Terminating.
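
For anyone who wants to check their own device for the same pattern, here is a minimal sketch that scans journal lines for the two messages in the excerpt above. The function name `summarize` and the returned dictionary layout are made up for illustration; the matched strings are copied from the log lines in this post.

```python
import re

# Both patterns are taken from the log excerpt in this thread.
LEFTOVER = re.compile(
    r"balena\.service: Found left-over process (\d+) \(([^)]+)\)"
)
TIMEOUT = "balena.service: Start operation timed out"


def summarize(lines):
    """Count start timeouts and collect left-over processes.

    `summarize` is a hypothetical helper, not part of any balena tool.
    """
    leftovers = []
    timeouts = 0
    for line in lines:
        match = LEFTOVER.search(line)
        if match:
            # Store (pid, truncated process name) as reported by systemd.
            leftovers.append((int(match.group(1)), match.group(2)))
        if TIMEOUT in line:
            timeouts += 1
    return {"leftovers": leftovers, "timeouts": timeouts}
```

The input could be the lines of a saved diagnostics log, or the output of `journalctl -u balena.service --no-pager` on the host OS. Several timeouts spaced at a regular interval would suggest the restart loop rather than a one-off failure.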

I’ve asked for someone on the OS team to take a look.

Could you confirm that the following steps produce the error, so we can easily recreate it:

  1. flash the latest OS available to the RPi Zero (if the latest works, try balenaOS 2.48.0+rev1)
  2. deploy the balenaSense project
  3. shut down the device from the dashboard OR cut power (doesn’t matter)
  4. turn the device back on
  5. see that balenaEngine is not running

Hi guys,

I did a clean install and I can confirm that after a reboot / power loss balenaSense is not running on the RPi Zero W.

What I did step-by-step:

  1. removed the devices from the balena-sense app in my account
  2. removed the balena-sense app from my account
  3. deployed a new balenaSense project using the Deploy with Balena button on the balenaSense GitHub page
  4. downloaded balenaOS 2.48.0+rev1 and flashed it using balenaEtcher
  5. after the device was provisioned and went live, I took it down with both a reboot and a power loss
  6. after turning the device back on, balenaEngine wasn’t running

The device is now online in VPN-only mode. Its UUID is 6c86d890a6e037f2b0d18749529b3c41
Remote access is granted for 1 week. Can you please have a look?

Hi @Seva, is it fine for us to enable persistent logging on this device (6c86d890a6e037f2b0d18749529b3c41) and then reboot?

We would like to observe the balenaEngine logs across reboots. We have recently seen engine crashes caused by resource limitations, and we suspect this device has the same problem.

On a related note: which SD card do you use?
An SD card with a relatively slow write speed might also cause this issue.

Sure, feel free to enable persistent logging and then reboot the device as many times as you need.

I’m using SanDisk Ultra SD cards. Here is the short description from Amazon: SanDisk 16GB Ultra MicroSDHC UHS-I Memory Card with Adapter - 98MB/s, C10, U1, Full HD, A1, Micro SD Card - SDSQUAR-016G-GN6MA, Red.

Hopefully it’s good enough for Pi0 :slight_smile:

Please let me know if you need anything else.

Hi @Seva,

Unfortunately, you have hit an issue where the container engine does not start within the allotted time and gets into a restart loop.
Let me explain.

The container engine is started as a system service, and its configuration has a startup timeout (currently 90 seconds). On boot, the engine completes its startup procedure only once it has loaded all previously started containers, so this operation takes O(N) time, where N is the number of containers. Loading a container also involves disk writes, because the engine needs to serialize the new container state.
For many devices, the time spent per container is very small and 90 seconds is more than enough. On devices like the Raspberry Pi Zero, however, startup takes longer, and the more containers you have, the higher the chance of hitting the timeout.

balenaSense has 5 containers, and when the supervisor rebooted the device to finish applying a new device configuration, the engine got into the restart loop because it could not complete its startup within the allotted 90 seconds.
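
To make the arithmetic concrete, here is a rough sketch of the linear startup model described above. The 90-second timeout and the count of 5 containers come from this thread; the per-container times and the function names are hypothetical, chosen only for illustration, and are not measurements from a Pi Zero.

```python
TIMEOUT_SECONDS = 90.0  # engine startup timeout mentioned in this thread


def estimated_startup(n_containers, seconds_per_container, fixed_overhead=0.0):
    """Linear O(N) model: a fixed cost plus a per-container cost.

    All timing inputs are assumptions supplied by the caller.
    """
    return fixed_overhead + n_containers * seconds_per_container


def hits_timeout(n_containers, seconds_per_container,
                 fixed_overhead=0.0, timeout=TIMEOUT_SECONDS):
    """Return True if the modelled startup time exceeds the timeout."""
    total = estimated_startup(n_containers, seconds_per_container, fixed_overhead)
    return total > timeout


# Hypothetical per-container costs for a 5-container app such as balenaSense:
fast_device = hits_timeout(5, 2.0)    # 10 s in total, well under the limit
slow_device = hits_timeout(5, 20.0)   # 100 s in total, over the limit
```

With these made-up numbers the same application starts comfortably on the fast device and exceeds the timeout on the slow one, which is the situation a slow CPU or slow SD card writes can produce.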

We are actively discussing the setup in a related balenaOS github issue:


And this thread will be updated once we have a resolution.

For now, I have manually updated the service config on your device to mitigate the problem. Keep in mind, however, that this change will not survive an OS update.

Hi,

We’ve disabled Engine startup timeouts in v2.98.4+, and that version is available as a host OS upgrade on select device types. For the RPi Zero, this OS version will be out soon, so keep an eye out and let us know whether it alleviates the issue described in this ticket.

Thanks,
Christina
