Continuous Container Cycling

SMoore · October 24, 2022, 7:08pm

We’ve been tracking down random, sporadic NUC hangs for quite a while.
There seems to be a memory-related issue. We’ve put tight memory management on our service containers. Finally we installed fluent-bit to help us get more info out of the system.

Now I am seeing in the fluent-bit logs that balena_healthcheck and balena_supervisor go through start-shutdown cycles continuously. Is this normal?

Device: Intel NUC
BalenaOS: 2.98.33
Supervisor: 14.0.13

Attached, please find the recent healthcheck, diagnostics, and fluent-bit log.

traceFluentBitLog.pdf (36.9 KB)
23217009bdacc06c967b41cf0ab1c97e_diagnostics_2022.10.24_15.29.59+0000.pdf (555.2 KB)
23217009bdacc06c967b41cf0ab1c97e_checks_2022.10.24_18.15.44+0000.pdf (16.9 KB)

SMoore · October 31, 2022, 4:05pm

Update: the start-shutdown cycles repeat at these intervals:
balena_healthcheck: every 3 minutes (create, attach, start, die, destroy)
balena_supervisor: every 5 minutes (exec_create, exec_die)
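
For anyone who wants to measure these intervals from their own event log, here is a minimal Python sketch. It assumes each log record has already been parsed into a dict with `time`, `name`, and `action` keys; those field names are hypothetical and may not match the actual fluent-bit output. The sample timestamps are synthetic, chosen only to mirror the 3-minute and 5-minute figures above, and are not taken from the attached logs.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def cycle_intervals(events, marker_actions=("start", "exec_create")):
    """Return {container name: [seconds between successive cycle-opening events]}.

    An event counts as the start of a cycle when its action is in
    marker_actions. The record layout (time/name/action) is an assumption.
    """
    opened = defaultdict(list)
    for ev in events:
        if ev["action"] in marker_actions:
            opened[ev["name"]].append(ev["time"])

    intervals = {}
    for name, times in opened.items():
        times.sort()
        # Pair each timestamp with the next one and take the difference.
        intervals[name] = [
            (later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])
        ]
    return intervals


# Synthetic sample data (not from the real device logs).
t0 = datetime(2022, 10, 31, 12, 0, 0)
sample = [
    {"time": t0 + timedelta(minutes=3 * i), "name": "balena_healthcheck", "action": "start"}
    for i in range(3)
] + [
    {"time": t0 + timedelta(minutes=5 * i), "name": "balena_supervisor", "action": "exec_create"}
    for i in range(3)
]

result = cycle_intervals(sample)
print(result)
```

A steady list of intervals points to something running on a timer; irregular intervals would point more towards crashes or restarts.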

mpous · November 1, 2022, 11:05am

Hello @SMoore, sorry to hear that you are seeing this behavior!

Some questions:

Is this the only device from your fleet with this issue?
Did you do anything to the device before it got into this state?

Could you please grant support access to this device and share the device ID via DM?

Thanks

SMoore · November 1, 2022, 11:39pm

Hi, Marc,
Thanks for getting back to me.

I sent you a DM with access info.
Sandy

mpous · November 2, 2022, 9:07am

Thanks! Let us check your device.

mpous · November 2, 2022, 9:19am

Hello Sandy, I see some weird LogBackend errors. I pinged the supervisor team internally to see how we can help you further.
On the other hand, is the device performing well for you now?

rcooke-warwick · November 2, 2022, 12:30pm

Hi there, I also took a look at your device, and at the diagnostics logs that you shared. From both the logs today, and the diagnostics, I don’t see any supervisor container restarts. If you were to take measurements again today, I don’t think that we will see all of these start->shutdown cycles.

As to why they were happening when these fluent-bit logs were initially taken, I’m not sure - something must have been happening on the device at the time. I find the logs a little hard to parse - could you tell us roughly how frequent these cycles are, or how much time passes between them?

SMoore · November 2, 2022, 3:06pm

Hi, Marc,
Thanks for checking into this. At this moment, the device is behaving well. My expectation is that in a week or two I may see a system crash.

I’m curious about the LogBackend errors. Can you tell me more about this?
Sandy

SMoore · November 2, 2022, 6:56pm

Hi, Ryan,
Thanks for updating me. Here’s some info to answer your questions, plus some clarification:

These cycles happen continuously. They are still happening. I’ve attached a parsed log from this morning.
If you look at column L (name) in this morning’s log, you will see “balena_supervisor” highlighted in yellow.
The balena_supervisor container id stays the same across multiple cycles.
The balena-healthcheck container id changes with every cycle.
The timing for the cycles has been:
balena-healthcheck: every 3 min
balena_supervisor: every 5 min
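
The container id observation above can be checked mechanically. The sketch below is only an illustration: it assumes each parsed log row is a dict with `name` and `container_id` keys (hypothetical field names), and the ids in the sample are made up. It labels a container name as "new container each cycle" when more than one id was seen for it, and "same container reused" otherwise. A single id paired with only exec_* events would be consistent with commands being run inside a container that keeps running, but that reading is an assumption and is not confirmed in this thread.

```python
from collections import defaultdict


def classify_by_container_id(events):
    """Label each container name by whether its id changes between events."""
    ids_by_name = defaultdict(set)
    for ev in events:
        ids_by_name[ev["name"]].add(ev["container_id"])

    # Note: a name seen only once is also reported as "same container reused".
    return {
        name: "new container each cycle" if len(ids) > 1 else "same container reused"
        for name, ids in ids_by_name.items()
    }


# Made-up ids for illustration only (not from the real device logs).
sample_events = [
    {"name": "balena-healthcheck", "container_id": "hc-0001"},
    {"name": "balena-healthcheck", "container_id": "hc-0002"},
    {"name": "balena-healthcheck", "container_id": "hc-0003"},
    {"name": "balena_supervisor", "container_id": "sv-0001"},
    {"name": "balena_supervisor", "container_id": "sv-0001"},
    {"name": "balena_supervisor", "container_id": "sv-0001"},
]

labels = classify_by_container_id(sample_events)
print(labels)
```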

If this is part of the normal function for balenaOS, let me know.

Thanks,
Sandy

Balena Trouble Ticket - Fluent-bit Data - Google Sheets.pdf (78.4 KB)
