Supervisor restarts and reverts services after seemingly unreasonable healthcheck failure

I’ve noticed this multiple times on 2 completely different devices. Both devices were in local mode. Everything seems to be working fine, with no excessive memory usage, and then the supervisor suddenly restarts and reverts the services to an older version.
The supervisor logs have healthcheck failures just before the restart:

2024-09-25T08:50:10.969233000Z [info]    Healthcheck failure - memory usage above threshold after 9h 35m 37s
2024-09-25T08:50:10.978851000Z [error]   Healthcheck failed
2024-09-25T08:50:10.984553000Z [api]     GET /v1/healthy 500 - 13.699 ms
2024-09-25T08:54:24.021085000Z [info]    Reported current state to the cloud
2024-09-25T08:55:11.276431000Z [info]    Healthcheck failure - memory usage above threshold after 9h 40m 37s
2024-09-25T08:55:11.290136000Z [error]   Healthcheck failed
2024-09-25T08:55:11.293805000Z [api]     GET /v1/healthy 500 - 16.594 ms
2024-09-25T08:59:25.590129000Z [info]    Reported current state to the cloud
2024-09-25T09:00:11.626193000Z [info]    Healthcheck failure - memory usage above threshold after 9h 45m 37s
2024-09-25T09:00:11.631596000Z [error]   Healthcheck failed
2024-09-25T09:00:11.635314000Z [api]     GET /v1/healthy 500 - 7.959 ms
2024-09-25T09:00:11.906955000Z [info]    Received SIGTERM. Exiting.

However, in this last instance I was monitoring the memory consumption of our services and it was minimal. The dashboard shows ~1.5GB/7.7GB for the whole device.

One of the devices has these parameters:

  • Supervisor version: 16.7.0
  • Host OS version: balenaOS 5.3.25+rev6
  • Type: Generic x86_64 (legacy MBR)

The other one:

  • Supervisor version: 16.7.0
  • Host OS version: balenaOS 5.3.27+rev1
  • Type: Raspberry Pi 4 (using 64bit OS)

What is this ‘threshold’ that the logs mention? Can I change it?
Any suggestions on how I can troubleshoot this issue?

Hello @igork, first of all, welcome to the balena community!

The Healthcheck failure messages are warnings. The engine won’t restart the supervisor until 3 consecutive healthcheck failures are reported within an interval.
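To make the three-consecutive-failures rule concrete, here is a rough sketch that walks the log lines quoted above and reports the probe that completes the third failure in a row. It is not the engine's actual logic: `restart_point` and `max_failures` are made-up names, treating a `200` response as a success that resets the count is an assumption, and the "within an interval" part is not modelled.

```python
import re

# Matches the supervisor API log lines quoted in this thread, e.g.
# "2024-09-25T08:50:10.984553000Z [api]     GET /v1/healthy 500 - 13.699 ms"
HEALTHY_LINE = re.compile(r"^(\S+)\s+\[api\]\s+GET /v1/healthy (\d{3})\b")


def restart_point(log_lines, max_failures=3):
    """Return the timestamp of the probe that completes `max_failures`
    consecutive non-200 responses, or None if that never happens.

    Hypothetical helper: assumes a 200 response resets the counter.
    """
    consecutive = 0
    for line in log_lines:
        match = HEALTHY_LINE.match(line)
        if match is None:
            continue  # not a healthcheck probe line
        timestamp, status = match.group(1), int(match.group(2))
        consecutive = consecutive + 1 if status != 200 else 0
        if consecutive >= max_failures:
            return timestamp
    return None


# The [api] lines from the original post.
LOGS = """\
2024-09-25T08:50:10.984553000Z [api]     GET /v1/healthy 500 - 13.699 ms
2024-09-25T08:54:24.021085000Z [info]    Reported current state to the cloud
2024-09-25T08:55:11.293805000Z [api]     GET /v1/healthy 500 - 16.594 ms
2024-09-25T08:59:25.590129000Z [info]    Reported current state to the cloud
2024-09-25T09:00:11.635314000Z [api]     GET /v1/healthy 500 - 7.959 ms
2024-09-25T09:00:11.906955000Z [info]    Received SIGTERM. Exiting.
"""

print(restart_point(LOGS.splitlines()))
```

On the posted logs this points at the 09:00:11 probe, which lines up with the `Received SIGTERM` entry a fraction of a second later.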

Thank you for letting us know about the logs; we understand how that can be disruptive. The logs are related to how the supervisor backfills container logs when starting, but the backfilling can sometimes overlap with the previously reported logs.

We are researching this issue and we will keep you posted!

After speaking with the Supervisor team: the threshold still exists in version 16.7.0. It is a threshold on the supervisor’s own memory usage, and the healthcheck fails if that usage grows more than 15 MB above the usage at start.
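As a rough illustration of why the device dashboard can look fine while this check still fails, the rule described above can be sketched as below. This is not the supervisor's code: `memory_healthy` and the constants are hypothetical names, and both the 1024 * 1024 definition of a megabyte and the inclusive comparison are assumptions.

```python
MB = 1024 * 1024            # assumption: binary megabytes
THRESHOLD_BYTES = 15 * MB   # the 15 MB figure quoted in this thread


def memory_healthy(initial_bytes, current_bytes, threshold_bytes=THRESHOLD_BYTES):
    """Hypothetical check: healthy while the supervisor's own memory growth
    since startup stays at or below the threshold."""
    return current_bytes - initial_bytes <= threshold_bytes


# Illustrative numbers only: a supervisor that started at 60 MB.
print(memory_healthy(60 * MB, 70 * MB))  # growth of 10 MB
print(memory_healthy(60 * MB, 80 * MB))  # growth of 20 MB
```

The comparison only looks at the supervisor process relative to its own starting point, so total device usage such as 1.5GB/7.7GB does not enter into it.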

Unfortunately, since you are using local mode, you are hitting this issue, @igork: “Supervisor reverts to the first locally pushed version when restarted in local mode” (Issue #2367, balena-os/balena-supervisor on GitHub).

We are working on this. The relevant code is balena-supervisor/src/memory.ts at 6455dbd0a181e24cd9f35e447054a5feb34b80c9 (balena-os/balena-supervisor on GitHub).

Hi Marc, thank you for the reply. Looks like the threshold issue is reproducible only in local mode (at least from my experience).
Do you know if there’s a specific version of the supervisor where this issue was introduced, so I can maybe use an older version for the time being?

@igork we are working on this, but remember that this only happens in Local Mode!

We will keep you posted!