Balena supervisor restarting due to failing healthcheck

We are repeatedly running into issues where the Balena Supervisor service is restarted due to a failing health check. It looks like the service is going above its memory limit. Here is the supervisor memory usage from a day when it restarted around 4am:

We use the supervisor API to stream logs to a device service, which sends them to our logging service. This is a critical function of our device, as we need to ensure we have a record of all logs. When the supervisor restarts we lose about 30 seconds of logging.

Do you have any suspicions about what might be the root cause of the supervisor going above its memory limit? Ideally we could find a way to reduce the failure rate of the health check.

One suspicion is that the supervisor memory usage might be tied to streaming the logs for all containers. What would be the advised approach to resolving this? Does the balena team recommend using the supervisor API for logs, or should we use the io.balena.features.journal-logs feature to access the logs directly from our container?

Hello @henry-mytos, could you please confirm the balenaOS and supervisor versions that you use? And what device type?

Thanks!

Hello @mpous, we are using:

Please let me know if you need any more details.

Thanks

Hey @henry-mytos, we have fixed a few memory leaks in the supervisor since that release, so it would probably be worth updating to the latest supervisor to retest. Can you run that for a while and see if you can still reproduce?

The supervisor container doesn’t actually have a healthcheck that includes memory usage. How did you come to the conclusion that this was related to a failed healthcheck?

Definitely mounting the journal logs via io.balena.features.journal-logs is going to be less work for the supervisor, but either way I would not expect anything via the API to cause a crash. What local API commands are you running currently to get the logs, and how often are you calling that endpoint?


Hi @klutchell, thanks for your response.

I am happy to update a number of our internal development devices to the latest Balena Supervisor. We generally update balenaOS and the supervisor some time after they are released, to ensure bugs have been caught.

The conclusion that we are facing issues with the supervisor healthcheck comes from the supervisor logs. Just prior to exiting with code 143 (SIGTERM), the supervisor reports a healthcheck failure:

Jan 1 15:09:15 square-meadow balena-supervisor debug [debug]   Attempting container log timestamp flush...
Jan 1 15:09:15 square-meadow balena-supervisor debug [debug]   Container log timestamp flush complete
Jan 1 15:10:26 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:11:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000099, skew: 0.100
Jan 1 15:12:47 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 200 - 3.119 ms
Jan 1 15:13:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000101, skew: 0.100
Jan 1 15:15:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000103, skew: 0.100
Jan 1 15:17:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000104, skew: 0.100
Jan 1 15:17:47 square-meadow balena-supervisor info [info]    Healthcheck failure - memory usage above threshold after 219h 8m 34s
Jan 1 15:17:47 square-meadow balena-supervisor error [error]   Healthcheck failed
Jan 1 15:17:47 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 500 - 5.530 ms
Jan 1 15:19:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000106, skew: 0.100
Jan 1 15:19:15 square-meadow balena-supervisor debug [debug]   Attempting container log timestamp flush...
Jan 1 15:19:15 square-meadow balena-supervisor debug [debug]   Container log timestamp flush complete
Jan 1 15:19:40 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:21:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000108, skew: 0.100
Jan 1 15:22:48 square-meadow balena-supervisor info [info]    Healthcheck failure - memory usage above threshold after 219h 13m 35s
Jan 1 15:22:48 square-meadow balena-supervisor error [error]   Healthcheck failed
Jan 1 15:22:48 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 500 - 5.316 ms
Jan 1 15:23:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000110, skew: 0.100
Jan 1 15:24:40 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:25:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000111, skew: 0.100
Jan 1 15:27:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000113, skew: 0.100

We are using the /v2/journal-logs endpoint with the following request data:

{
    "follow": true,
    "all": true,
    "format": "json"
}

This results in the following supervisor log line:

Jan 1 15:28:30 square-meadow balena-supervisor debug [debug]   Spawning journalctl -a --follow -o json
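
For reference, our streaming client looks roughly like the sketch below. This is a simplified version, not our exact code: it assumes Node 18+ (built-in fetch) and that the supervisor API is exposed to the container via the io.balena.features.supervisor-api label, which injects the BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY environment variables; forwardLine is just a placeholder for whatever ships each entry to our logging service.

// Simplified sketch of our /v2/journal-logs streaming call (not exact code).
// Assumes Node 18+ (built-in fetch) and the io.balena.features.supervisor-api
// label, which provides BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY.
// forwardLine() is a placeholder for shipping each entry to the logging service.

const address = process.env.BALENA_SUPERVISOR_ADDRESS; // e.g. http://127.0.0.1:48484
const apiKey = process.env.BALENA_SUPERVISOR_API_KEY;

async function streamJournalLogs(forwardLine: (line: string) => void): Promise<void> {
    const res = await fetch(`${address}/v2/journal-logs?apikey=${apiKey}`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ follow: true, all: true, format: 'json' }),
    });
    if (!res.ok || res.body === null) {
        throw new Error(`journal-logs request failed with status ${res.status}`);
    }

    // With format "json" the response is a long-lived stream of
    // newline-delimited JSON entries (journalctl -o json output).
    const decoder = new TextDecoder();
    let buffer = '';
    for await (const chunk of res.body) {
        buffer += decoder.decode(chunk as Uint8Array, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? '';
        for (const line of lines) {
            if (line.trim() !== '') {
                forwardLine(line);
            }
        }
    }
}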

I would be keen to understand the root cause, but if that is not possible we could move to accessing the journal logs directly.
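
If we do go that route, I imagine it would look roughly like what the supervisor itself does, e.g. something along the lines of the sketch below. This assumes our service has the io.balena.features.journal-logs label (so the host journal is bind-mounted into the container) and that journalctl is installed in our image; forwardEntry is again a placeholder for our shipping logic.

// Rough sketch only: assumes the io.balena.features.journal-logs label
// bind-mounts the host journal into the container and journalctl is available
// in the image. Mirrors the command the supervisor spawns for /v2/journal-logs.
import { spawn } from 'child_process';
import * as readline from 'readline';

function followJournal(forwardEntry: (entry: Record<string, unknown>) => void): void {
    // Same flags the supervisor uses: all entries, follow, JSON output.
    const journalctl = spawn('journalctl', ['-a', '--follow', '-o', 'json']);

    // journalctl -o json emits one JSON object per line.
    const rl = readline.createInterface({ input: journalctl.stdout });
    rl.on('line', (line) => {
        try {
            forwardEntry(JSON.parse(line));
        } catch {
            // Skip lines that are not valid JSON.
        }
    });

    journalctl.stderr.on('data', (data) => console.error(String(data)));
    journalctl.on('exit', (code) => console.error(`journalctl exited with code ${code}`));
}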

Ah yes, good catch! The supervisor has an in-process healthcheck that does indeed look for memory leaks. I was expecting a container engine healthcheck and looked in the wrong place.

I would definitely recommend updating to the latest supervisor on some of your test devices and seeing whether the memory usage remains stable and the restarts stop. Let us know how it goes!

Hi @klutchell,

We are running:

  • balenaOS 6.5.24+rev5
  • supervisor 17.0.2
  • Intel NUC
  • 8GB RAM

and we are encountering similar issues. Here is a snapshot of our logging: the balena supervisor fails the healthcheck a few times, then restarts.

[balena_supe][INFO] 2025-05-19 22:54:37,742: [info]    Reported current state to the cloud
[balena-supe][INFO] 2025-05-19 22:54:37,743: [info]    Reported current state to the cloud
[balena_supe][INFO] 2025-05-19 22:55:42,978: [info]    Healthcheck failure - memory usage above threshold after 16h 45m 28s
[balena_supe][ERROR] 2025-05-19 22:55:42,982: [error]   Healthcheck failed
[balena-supe][INFO] 2025-05-19 22:55:42,982: [info]    Healthcheck failure - memory usage above threshold after 16h 45m 28s
[balena-supe][ERROR] 2025-05-19 22:55:42,982: [error]   Healthcheck failed
[balena-supe][INFO] 2025-05-19 22:55:42,982: [api]     GET /v1/healthy 500 - 1.224 ms
[balena_supe][INFO] 2025-05-19 22:55:42,983: [api]     GET /v1/healthy 500 - 1.224 ms
[healthdog][INFO] 2025-05-19 22:56:33,936: try: 1, refid: C17B26AC, correction: 0.000785253, skew: 0.333
[healthdog][INFO] 2025-05-19 22:58:33,947: try: 1, refid: C17B26AC, correction: 0.000783230, skew: 0.333
[balena_supe][INFO] 2025-05-19 22:59:38,002: [info]    Reported current state to the cloud
[balena-supe][INFO] 2025-05-19 22:59:38,003: [info]    Reported current state to the cloud
[fake-hwcloc][INFO] 2025-05-19 23:00:04,043: [fake-hwclock] Saving system time to /etc/fake-hwclock/fake-hwclock.data.
[fake-hwcloc][INFO] 2025-05-19 23:00:04,049: Saving system time to /etc/fake-hwclock/fake-hwclock.data.
[healthdog][INFO] 2025-05-19 23:00:33,965: try: 1, refid: C17B26AC, correction: 0.000781208, skew: 0.333
[balena_supe][INFO] 2025-05-19 23:00:43,124: [info]    Healthcheck failure - memory usage above threshold after 16h 50m 28s
[balena-supe][INFO] 2025-05-19 23:00:43,128: [info]    Healthcheck failure - memory usage above threshold after 16h 50m 28s
[balena-supe][ERROR] 2025-05-19 23:00:43,128: [error]   Healthcheck failed
[balena-supe][INFO] 2025-05-19 23:00:43,128: [api]     GET /v1/healthy 500 - 0.964 ms
[balena_supe][ERROR] 2025-05-19 23:00:43,129: [error]   Healthcheck failed
[balena_supe][INFO] 2025-05-19 23:00:43,129: [api]     GET /v1/healthy 500 - 0.964 ms
[healthdog][INFO] 2025-05-19 23:02:33,978: try: 1, refid: C17B26AC, correction: 0.000779185, skew: 0.333
[healthdog][INFO] 2025-05-19 23:04:33,992: try: 1, refid: C17B26AC, correction: 0.000777163, skew: 0.333
[balena_supe][INFO] 2025-05-19 23:04:48,324: [info]    Reported current state to the cloud
[balena-supe][INFO] 2025-05-19 23:04:48,325: [info]    Reported current state to the cloud
[balena_supe][INFO] 2025-05-19 23:05:43,256: [info]    Healthcheck failure - memory usage above threshold after 16h 55m 28s
[balena-supe][INFO] 2025-05-19 23:05:43,261: [info]    Healthcheck failure - memory usage above threshold after 16h 55m 28s
[balena-supe][ERROR] 2025-05-19 23:05:43,261: [error]   Healthcheck failed
[balena-supe][INFO] 2025-05-19 23:05:43,261: [api]     GET /v1/healthy 500 - 1.320 ms
[balena_supe][ERROR] 2025-05-19 23:05:43,261: [error]   Healthcheck failed
[balena_supe][INFO] 2025-05-19 23:05:43,261: [api]     GET /v1/healthy 500 - 1.320 ms
[balenad][INFO] 2025-05-19 23:05:43,266: time="2025-05-19T23:05:43.266361049Z" level=info msg="Unhealthy container 8ccecaa06b30a36b0e5bd2129984f5c9cf4891eeed63739b66dea71cc08cb2c9: restarting..."
[balena_supe][INFO] 2025-05-19 23:05:43,444: [info]    Received SIGTERM. Exiting.
[balena-supe][INFO] 2025-05-19 23:05:43,444: [info]    Received SIGTERM. Exiting.

What would you recommend?
Thanks!

Hello @Jeanine, first of all welcome to the balena community.

Did you perform any action on the device before experiencing this issue?

Is this the only device in your fleet with this issue, or do you have more?

Thanks!

Hey,

We recently updated 2 devices in our test fleet from balenaOS 2.95.12+rev1 / supervisor 14.12.1 to this version, so this is the first time and the first device on which this has occurred, just after upgrading. On the same device, we do see the health check failures, but no restart of the supervisor (yet). We are keeping an eye on it though, because we will not upgrade in production with this risk.
Nothing notable happens before the crash; the last action in the logs is a switch of wifi access point, 3 hours earlier.
Is there anything extra I can provide to help solve this?

** Correction: both updated devices show this behavior. One of the devices restarted the supervisor 4 times in 24 hours.