I’m trying to reply to your request to share remote access to my devices, but unfortunately I can’t post in the same thread because “new users are limited to 3 replies in the topic”.
Please find the IDs for both Pi0 W devices showing the weird behavior I’m trying to fix, where the supervisor can’t start services. The HostOS is accessible and the status for both is Online (VPN Only). One Pi0 got into this state after a reboot via the dashboard, the other after a power loss.
d457ca0e7beb7f086753f3dbc6ed5eea - this one uses production HostOS build 2.48.0+rev1, supervisor 10.8.0
f963939a5b0b6a0522f45308b007f154 - this one uses development HostOS build 2.48.0+rev1, supervisor 10.8.0
Hey, thanks for allowing support access. I can see in the logs I pulled from the diagnostics page that balenaEngine is being terminated for some reason:
Jul 10 20:24:46 systemd[1]: balena.service: Found left-over process 16543 (balena-engine-r) in control group while starting unit. Ignoring.
Jul 10 20:24:46 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 10 20:24:46 systemd[1]: balena.service: Found left-over process 16585 (balena-engine-c) in control group while starting unit. Ignoring.
Jul 10 20:24:46 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
....
Jul 10 20:26:16 systemd[1]: balena.service: Start operation timed out. Terminating.
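For reference, those lines come from the balena.service journal on the host OS. If you’d like to follow along yourself, you can open a host OS terminal from the dashboard and tail the same unit; the commands below are standard systemd/journalctl, nothing balena-specific:

```
# Follow the engine unit's journal live
journalctl -u balena.service -f

# Or dump everything from the current boot
journalctl -u balena.service -b --no-pager
```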
I’ve asked for someone on the OS team to take a look.
Could you confirm that the following steps reproduce the error, so we can easily recreate it on our side (I’ve added a rough CLI sketch of the same steps below the list):
flash the latest OS available for the RPi Zero (if the latest works, try balenaOS 2.48.0+rev1)
deploy the balenaSense project
shut down the device via the dashboard OR by pulling power (doesn’t matter which)
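Purely as a sketch of those steps from the CLI side (the repository URL and app name here are my assumptions, not anything confirmed in this thread):

```
# 1. Flash balenaOS 2.48.0+rev1 for the Pi Zero with balenaEtcher / the dashboard image

# 2. Deploy the balenaSense project to a Pi Zero app (app name is hypothetical)
git clone https://github.com/balenalabs/balena-sense   # assumed project location
cd balena-sense
balena push my-pi-zero-app

# 3. Once all services are running, shut the device down from the dashboard
#    or simply pull the power
```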
Hi @Seva, is it fine for us to enable persistent logging on this device (6c86d890a6e037f2b0d18749529b3c41) and then reboot?
We would like to observe balenaEngine’s behaviour in the logs across reboots. We have seen engine crashes in the recent past caused by resource limitations, and we suspect this device has the same problem.
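For anyone else following the thread: persistent logging is normally toggled from the device (or fleet) configuration page in the dashboard, and as far as I know it ends up as a flag in the device’s config.json on the boot partition. A minimal way to check it from the host OS, assuming the usual /mnt/boot mount point:

```
# Check whether persistent logging is already enabled on the device
grep persistentLogging /mnt/boot/config.json

# "persistentLogging": true keeps journald logs across reboots, so the
# engine's pre-reboot logs are still there after the device comes back up
```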
On a related note: what SD card are you using? An SD card with a relatively slow write speed might also cause an issue.
Sure, feel free to enable persistent logging and then reboot the device as many times as you need.
We’re using SanDisk Ultra SD cards. Here is a short description from Amazon: SanDisk 16GB Ultra MicroSDHC UHS-I Memory Card with Adapter - 98MB/s, C10, U1, Full HD, A1, Micro SD Card - SDSQUAR-016G-GN6MA, Red.
Unfortunately, you’ve hit an issue where the container engine does not start within the allocated timeframe and gets into a restart loop.
Let me explain this a bit.
The container engine is started as a systemd service, and its configuration has a startup timeout (90 seconds at the moment). On boot, the engine only completes its startup procedure once it has loaded all previously created containers, so this step takes roughly O(N) time, where N is the number of containers. Loading a container also involves disk writes, because the engine needs to serialize the container’s new state.
For many devices, the time spent per container is really small, and 90 seconds is more than enough. However, on devices like the Raspberry Pi Zero it can become a problem because the startup procedure takes longer, and the more containers you have, the higher the chance that you will hit the timeout.
balenaSense has 5 containers, and when the supervisor rebooted the device to finish applying a new device configuration, the engine got into a restart loop because it could not complete its startup within the allocated 90 seconds.
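You can read the current limit straight from systemd on the host OS; on the affected OS version it should report the 90-second default mentioned above (standard systemctl, nothing balena-specific):

```
# Show the start timeout systemd enforces for the engine unit
systemctl show balena.service --property=TimeoutStartSec
# expected output on this OS version: something like TimeoutStartSec=1min 30s
```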
We are actively discussing the setup in a related balenaOS GitHub issue:
This thread will be updated once we have a resolution.
For now, I have updated the service config on your device manually to mitigate the problem. However, keep in mind that this change will not survive a host OS update.
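I won’t paste the exact edit from your device here, but generically this kind of mitigation is a raised start timeout on the engine unit. A rough sketch, assuming a writable systemd drop-in directory (not guaranteed on every balenaOS release) and an illustrative 600-second value:

```
# Sketch only: raise balena.service's start timeout via a systemd drop-in
mkdir -p /etc/systemd/system/balena.service.d
cat > /etc/systemd/system/balena.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=600
EOF
systemctl daemon-reload
systemctl restart balena.service
```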
We’ve disabled the engine startup timeout in balenaOS v2.98.4+, and that version is already available as a host OS update on select device types. For the RPi Zero, this OS version will be out soon, so keep an eye out and let us know whether it alleviates the issue described in this ticket.
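Once the new host OS reaches your devices, a quick way to confirm the behaviour should be to check the same systemd property as before; with the startup timeout disabled it should report infinity instead of 90 seconds:

```
systemctl show balena.service --property=TimeoutStartSec
# expected after upgrading to 2.98.4+: TimeoutStartSec=infinity
```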