I’ve noticed this multiple times on two completely different devices, both in local mode. Everything seems to be working fine, with no excessive memory usage, and then the supervisor suddenly restarts and reverts the services to an older version.
The supervisor logs have healthcheck failures just before the restart:
2024-09-25T08:50:10.969233000Z [info] Healthcheck failure - memory usage above threshold after 9h 35m 37s
2024-09-25T08:50:10.978851000Z [error] Healthcheck failed
2024-09-25T08:50:10.984553000Z [api] GET /v1/healthy 500 - 13.699 ms
2024-09-25T08:54:24.021085000Z [info] Reported current state to the cloud
2024-09-25T08:55:11.276431000Z [info] Healthcheck failure - memory usage above threshold after 9h 40m 37s
2024-09-25T08:55:11.290136000Z [error] Healthcheck failed
2024-09-25T08:55:11.293805000Z [api] GET /v1/healthy 500 - 16.594 ms
2024-09-25T08:59:25.590129000Z [info] Reported current state to the cloud
2024-09-25T09:00:11.626193000Z [info] Healthcheck failure - memory usage above threshold after 9h 45m 37s
2024-09-25T09:00:11.631596000Z [error] Healthcheck failed
2024-09-25T09:00:11.635314000Z [api] GET /v1/healthy 500 - 7.959 ms
2024-09-25T09:00:11.906955000Z [info] Received SIGTERM. Exiting.
However, in this last instance I was monitoring the memory consumption of our services, and it was minimal. The dashboard shows ~1.5 GB of 7.7 GB in use for the whole device.
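In case it helps anyone compare against their own logs, here is a rough Python sketch that pulls the healthcheck failure lines out of a supervisor log dump and prints the gap between them. The regex is only my guess at the line format, based on the excerpt above, and the names `parse_failures` and `failure_intervals` are made up for illustration; none of this comes from the supervisor itself.

```python
import re
from datetime import datetime

# Pattern inferred from the log excerpt in this post (an assumption,
# not an official description of the supervisor's log format).
FAILURE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z \[info\] "
    r"Healthcheck failure - (?P<reason>.+?) after (?P<uptime>.+)$"
)

# The three failure lines from the excerpt, plus one unrelated line.
SAMPLE = [
    "2024-09-25T08:50:10.969233000Z [info] Healthcheck failure - memory usage above threshold after 9h 35m 37s",
    "2024-09-25T08:54:24.021085000Z [info] Reported current state to the cloud",
    "2024-09-25T08:55:11.276431000Z [info] Healthcheck failure - memory usage above threshold after 9h 40m 37s",
    "2024-09-25T09:00:11.626193000Z [info] Healthcheck failure - memory usage above threshold after 9h 45m 37s",
]


def parse_failures(lines):
    """Return (timestamp, reason, uptime) for each healthcheck failure line."""
    failures = []
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match:
            # Timestamps are truncated to whole seconds for simplicity.
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
            failures.append((ts, match.group("reason"), match.group("uptime")))
    return failures


def failure_intervals(failures):
    """Return the seconds elapsed between consecutive failures."""
    return [
        (later[0] - earlier[0]).total_seconds()
        for earlier, later in zip(failures, failures[1:])
    ]


if __name__ == "__main__":
    found = parse_failures(SAMPLE)
    for ts, reason, uptime in found:
        print(ts.isoformat(), "|", reason, "| supervisor uptime:", uptime)
    print("seconds between failures:", failure_intervals(found))
```

On the excerpt above, the failures come out roughly five minutes apart, and the restart (SIGTERM) follows immediately after the third consecutive one.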
One of the devices has these parameters:
- Supervisor version: 16.7.0
- Host OS version: balenaOS 5.3.25+rev6
- Type: Generic x86_64 (legacy MBR)
The other one:
- Supervisor version: 16.7.0
- Host OS version: balenaOS 5.3.27+rev1
- Type: Raspberry Pi 4 (using 64bit OS)
What is this ‘threshold’ that the logs mention? Can I change it?
Any suggestions on how I can troubleshoot this issue?