Continuous Container Cycling

We’ve been tracking down random, sporadic NUC hangs for quite a while.
There seems to be a memory-related issue. We’ve put tight memory limits on our service containers. Finally, we installed fluent-bit to help us get more info out of the system.

Now I am seeing in the fluent-bit logs that balena_healthcheck and balena_supervisor go through start-shutdown cycles continuously. Is this normal?

Device: Intel NUC
BalenaOS: 2.98.33
Supervisor: 14.0.13

Attached, please find recent

  1. healthcheck
  2. diagnostics
  3. fluent-bit log

traceFluentBitLog.pdf (36.9 KB)
23217009bdacc06c967b41cf0ab1c97e_diagnostics_2022.10.24_15.29.59+0000.pdf (555.2 KB)
23217009bdacc06c967b41cf0ab1c97e_checks_2022.10.24_18.15.44+0000.pdf (16.9 KB)

Update: the start-shutdown cycles for
balena_healthcheck = 3 minutes (create, attach, start, die, destroy)
balena_supervisor = 5 minutes (exec_create, exec_die)

Hello @SMoore, sorry to hear that you are running into this issue!

Some questions:

  • Is this the only device from your fleet with this issue?
  • Did you do anything to the device before it got into this state?

Could you please grant support access to this device and send me the device ID by DM?

Thanks

Hi, Marc,
Thanks for getting back to me.

I sent you a DM with access info.
Sandy

Thanks! Let us check your device.

Hello Sandy, I see some weird LogBackend errors. I have pinged the supervisor team internally to see how we can help you further.
In the meantime, is the device performing well for you now?

Hi there, I also took a look at your device, and at the diagnostics logs that you shared. From both the logs today, and the diagnostics, I don’t see any supervisor container restarts. If you were to take measurements again today, I don’t think that we will see all of these start->shutdown cycles.

As to why they were happening when these fluent-bit logs were initially taken, I’m not sure. Something must have been happening on the device at the time. I find the logs a little hard to parse. Could you tell us roughly how frequent these cycles are, i.e. how much time passes between them?


Hi, Marc,
Thanks for checking into this. At this moment, the device is behaving well. My expectation is that in a week or two I may see a system crash.

I’m curious about the LogBackend errors. Can you tell me more about this?
Sandy


Hi, Ryan,
Thanks for updating me. Here’s some info to answer your questions and clarify a few points:

  1. These cycles happen continuously. They are still happening. I’ve attached a parsed log from this morning.
  2. If you look in this morning’s log I sent you, column L (name), you will see “balena_supervisor” highlighted in yellow
  3. balena_supervisor container id stays the same across multiple cycles
  4. balena-healthcheck container id changes with every cycle.
  5. The timing for the cycles has been:
    balena-healthcheck: every 3 min
    balena_supervisor: every 5 min
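
For anyone wanting to reproduce numbers like these from their own event log, here is a minimal sketch of one way to do it. It is not the method used for the attached spreadsheet. The record layout (the `time`, `name`, `action`, `id` keys), the sample values, and the `summarize_cycles` helper are all hypothetical and would need adapting to whatever fields your fluent-bit output actually contains.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical records: field names and values are illustrative only,
# not copied from the attached fluent-bit log.
SAMPLE_EVENTS = [
    {"time": "2022-10-24T15:00:00", "name": "balena_healthcheck", "action": "start", "id": "aaa1"},
    {"time": "2022-10-24T15:03:00", "name": "balena_healthcheck", "action": "start", "id": "bbb2"},
    {"time": "2022-10-24T15:06:00", "name": "balena_healthcheck", "action": "start", "id": "ccc3"},
    {"time": "2022-10-24T15:00:00", "name": "balena_supervisor", "action": "exec_create", "id": "sup0"},
    {"time": "2022-10-24T15:05:00", "name": "balena_supervisor", "action": "exec_create", "id": "sup0"},
    {"time": "2022-10-24T15:10:00", "name": "balena_supervisor", "action": "exec_create", "id": "sup0"},
]

# Actions treated as the beginning of one cycle (an assumption).
CYCLE_START = {"start", "exec_create"}


def summarize_cycles(events):
    """Return {name: (median_minutes_between_cycles, id_is_stable)}."""
    times = defaultdict(list)
    ids = defaultdict(set)
    for ev in events:
        if ev["action"] not in CYCLE_START:
            continue
        times[ev["name"]].append(datetime.fromisoformat(ev["time"]))
        ids[ev["name"]].add(ev["id"])

    summary = {}
    for name, stamps in times.items():
        stamps.sort()
        # Gap in minutes between each pair of consecutive cycle starts.
        gaps = [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]
        summary[name] = (median(gaps) if gaps else None, len(ids[name]) == 1)
    return summary


if __name__ == "__main__":
    for name, (gap, stable) in summarize_cycles(SAMPLE_EVENTS).items():
        print(f"{name}: ~{gap} min between cycles, same container id: {stable}")
```

The second value in each result is the part worth watching: an id that stays the same across cycles means the events come from one long-running container, while an id that changes every cycle means a new container is created each time.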

If this is part of the normal function for balenaOS, let me know.

Thanks,
Sandy

Balena Trouble Ticket - Fluent-bit Data - Google Sheets.pdf (78.4 KB)
