Supervisor restarts and reverts services after seemingly unreasonable healthcheck failure

igork · September 25, 2024, 9:24am

I’ve noticed this multiple times on 2 completely different devices. Both devices were in local mode. Everything seems to be working fine, with no excessive memory usage, and then suddenly the supervisor restarts and reverts the services to some older version.
The supervisor logs have healthcheck failures just before the restart:

2024-09-25T08:50:10.969233000Z [info]    Healthcheck failure - memory usage above threshold after 9h 35m 37s
2024-09-25T08:50:10.978851000Z [error]   Healthcheck failed
2024-09-25T08:50:10.984553000Z [api]     GET /v1/healthy 500 - 13.699 ms
2024-09-25T08:54:24.021085000Z [info]    Reported current state to the cloud
2024-09-25T08:55:11.276431000Z [info]    Healthcheck failure - memory usage above threshold after 9h 40m 37s
2024-09-25T08:55:11.290136000Z [error]   Healthcheck failed
2024-09-25T08:55:11.293805000Z [api]     GET /v1/healthy 500 - 16.594 ms
2024-09-25T08:59:25.590129000Z [info]    Reported current state to the cloud
2024-09-25T09:00:11.626193000Z [info]    Healthcheck failure - memory usage above threshold after 9h 45m 37s
2024-09-25T09:00:11.631596000Z [error]   Healthcheck failed
2024-09-25T09:00:11.635314000Z [api]     GET /v1/healthy 500 - 7.959 ms
2024-09-25T09:00:11.906955000Z [info]    Received SIGTERM. Exiting.

However, in this last instance I was monitoring the memory consumption by our services and it was minimal. The dashboard shows ~1.5GB/7.7GB for the whole device.

One of the devices has these parameters:

Supervisor version: 16.7.0
Host OS version: balenaOS 5.3.25+rev6
Type: Generic x86_64 (legacy MBR)

The other one:

Supervisor version: 16.7.0
Host OS version: balenaOS 5.3.27+rev1
Type: Raspberry Pi 4 (using 64bit OS)

What is this ‘threshold’ that the logs mention? Can I change it?
Any suggestions on how I can troubleshoot this issue?

mpous · September 25, 2024, 10:09am

Hello @igork first of all welcome to the balena community!

The Healthcheck failure messages are warnings. The supervisor won’t be restarted by the engine until 3 consecutive healthcheck failures are reported within an interval.
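To illustrate the "3 consecutive failures" rule, here is a minimal sketch of how such a retry counter could behave. The function name `shouldRestart` and the `retries` parameter are hypothetical and not taken from the engine or the supervisor; only the count of 3 comes from the reply above.

```typescript
// Hypothetical sketch: decide whether a run of healthcheck results
// would trigger a restart. `true` means the check passed.
// The default of 3 retries is the figure quoted in this thread.
function shouldRestart(results: boolean[], retries: number = 3): boolean {
  let consecutiveFailures = 0;
  for (const healthy of results) {
    // Any successful check resets the failure streak.
    consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
    if (consecutiveFailures >= retries) {
      return true;
    }
  }
  return false;
}

// Three failures in a row, as in the logs at 08:50, 08:55 and 09:00.
const restartExpected = shouldRestart([false, false, false]);
// A passing check in between breaks the streak.
const restartNotExpected = shouldRestart([false, true, false, false]);
```

This matches the pattern in the posted logs, where the SIGTERM arrives right after the third consecutive failure.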

Thank you for letting us know about the logs, we understand how that can be disruptive. The logs are related to how the supervisor backfills container logs when starting, but the backfilling can sometimes overlap with the previously reported logs.

We are researching this issue and we will keep you posted!

mpous · September 25, 2024, 2:27pm

After speaking with the Supervisor team, I can confirm the threshold still exists in version 16.7.0. It is a threshold on the supervisor’s own memory usage: the healthcheck fails if the supervisor’s memory grows by more than 15 MB over its memory usage at start.
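As a rough sketch of the check described above (not the supervisor's actual code), the comparison is against the supervisor process's own baseline, not the device's total memory. The names `MEMORY_THRESHOLD_BYTES` and `isMemoryHealthy` are hypothetical, and reading "15mb" as 15 × 1024 × 1024 bytes is an assumption.

```typescript
// Hypothetical sketch of a "growth since start" memory check.
// Assumption: "15mb" means 15 * 1024 * 1024 bytes.
const MEMORY_THRESHOLD_BYTES = 15 * 1024 * 1024;

// Returns true while the process has grown by no more than the
// threshold relative to the usage recorded at start.
function isMemoryHealthy(
  initialBytes: number,
  currentBytes: number,
  thresholdBytes: number = MEMORY_THRESHOLD_BYTES,
): boolean {
  return currentBytes - initialBytes <= thresholdBytes;
}

const MB = 1024 * 1024;
// Grew by 10 MB since start: still within the threshold.
const withinThreshold = isMemoryHealthy(100 * MB, 110 * MB);
// Grew by 16 MB since start: the healthcheck would fail.
const aboveThreshold = isMemoryHealthy(100 * MB, 116 * MB);
```

Under this reading, a device showing ~1.5GB/7.7GB overall can still fail the check, because only the supervisor's growth relative to its own starting point is measured.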

Unfortunately, @igork, since you are using local mode you are hitting this issue: “Supervisor reverts to the first locally pushed version when restarted in local mode” (Issue #2367, balena-os/balena-supervisor, GitHub).

We are working on this. See balena-supervisor/src/memory.ts at commit 6455dbd0a181e24cd9f35e447054a5feb34b80c9 (balena-os/balena-supervisor, GitHub).

igork · September 26, 2024, 6:23am

Hi Marc, thank you for the reply. Looks like the threshold issue is reproducible only in local mode (at least from my experience).
Do you know if there’s a specific version of the supervisor where this issue was introduced, so I can maybe use an older version for the time being?

mpous · September 26, 2024, 10:13am

@igork we are working on this, but remember that this only happens in Local Mode!

We will keep you posted!
