Balena supervisor restarting due to failing healthcheck

We are continually getting issues where the Balena Supervisor service is restarted due to a failing health check. It looks like the service is going above its memory limit. Here is the supervisor memory usage from a day when it restarted around 4am:

We use the supervisor API to stream logs to a device service which sends them to our logging service. This is a critical function of our device to ensure we have a record of all logs. When the supervisor reboots we lose about 30 seconds of logging.

Do you have any suspicions of what might be the root cause of the supervisor going above its memory limit? Ideally we could find a way to reduce the failure rate of the health check.

One suspicion is that the supervisor memory usage might be tied to streaming the logs for all containers. What would be the advised approach for resolving this? Do the Balena team recommend use of the supervisor API for logs, or should we use the io.balena.features.journal-logs feature to directly access the logs from our container?

Hello @henry-mytos could you please confirm the balenaOS and supervisor versions that you use? What device type?

Thanks!

Hello @mpous, we are using:

Please let me know if you need any more details.

Thanks

Hey @henry-mytos, we have fixed a few memory leaks in the supervisor since that release and it would probably be worth updating to the latest supervisor to retest. Can you run that for a while and see if you can still reproduce?

The supervisor container doesn’t actually have a healthcheck that includes memory usage. How did you come to the conclusion that this was related to a failed healthcheck?

Definitely mounting the journal logs via io.balena.features.journal-logs is going to be less work for the supervisor, but either way I would not expect anything via the API to cause a crash. What local API commands are you running currently to get the logs, and how often are you calling that endpoint?

Hi @klutchell, thanks for your response.

I am happy to update a number of our internal development devices to the latest Balena Supervisor. We generally update the Balena OS and supervisor some time after they are released to ensure bugs are caught.

The assumption that we are facing issues with the supervisor healthcheck is from the supervisor logs. Just prior to exiting with 143 the supervisor reports a healthcheck failure.

Jan 1 15:09:15 square-meadow balena-supervisor debug [debug]   Attempting container log timestamp flush...
Jan 1 15:09:15 square-meadow balena-supervisor debug [debug]   Container log timestamp flush complete
Jan 1 15:10:26 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:11:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000099, skew: 0.100
Jan 1 15:12:47 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 200 - 3.119 ms
Jan 1 15:13:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000101, skew: 0.100
Jan 1 15:15:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000103, skew: 0.100
Jan 1 15:17:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000104, skew: 0.100
Jan 1 15:17:47 square-meadow balena-supervisor info [info]    Healthcheck failure - memory usage above threshold after 219h 8m 34s
Jan 1 15:17:47 square-meadow balena-supervisor error [error]   Healthcheck failed
Jan 1 15:17:47 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 500 - 5.530 ms
Jan 1 15:19:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000106, skew: 0.100
Jan 1 15:19:15 square-meadow balena-supervisor debug [debug]   Attempting container log timestamp flush...
Jan 1 15:19:15 square-meadow balena-supervisor debug [debug]   Container log timestamp flush complete
Jan 1 15:19:40 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:21:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000108, skew: 0.100
Jan 1 15:22:48 square-meadow balena-supervisor info [info]    Healthcheck failure - memory usage above threshold after 219h 13m 35s
Jan 1 15:22:48 square-meadow balena-supervisor error [error]   Healthcheck failed
Jan 1 15:22:48 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 500 - 5.316 ms
Jan 1 15:23:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000110, skew: 0.100
Jan 1 15:24:40 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:25:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000111, skew: 0.100
Jan 1 15:27:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000113, skew: 0.100

We are using the /v2/journal-logs endpoint with request data:

{
    follow: true,
    all: true,
    format: "json"
}

Which results in the following supervisor log:

Jan 1 15:28:30 square-meadow balena-supervisor debug [debug]   Spawning journalctl -a --follow -o json
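
For reference, the request above could be issued from a service container roughly as follows. This is a minimal sketch, not the code running on the device: it assumes the service has the io.balena.features.supervisor-api label so that BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY are set, and the helper names (build_journal_request, stream_forever, handle_entry) and the retry delay are illustrative. The reconnect loop is there because the stream drops whenever the supervisor restarts.

```python
import http.client
import json
import os
import time
import urllib.request


def build_journal_request(address: str, api_key: str) -> urllib.request.Request:
    """Build the POST /v2/journal-logs request with the body quoted above."""
    body = json.dumps({"follow": True, "all": True, "format": "json"}).encode()
    return urllib.request.Request(
        f"{address}/v2/journal-logs",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def stream_forever(handle_entry, retry_delay_s: float = 2.0) -> None:
    """Stream journal entries, reconnecting whenever the supervisor goes away."""
    address = os.environ["BALENA_SUPERVISOR_ADDRESS"]
    api_key = os.environ["BALENA_SUPERVISOR_API_KEY"]
    while True:
        try:
            request = build_journal_request(address, api_key)
            with urllib.request.urlopen(request) as response:
                for raw in response:  # one JSON object per line with format=json
                    if raw.strip():
                        handle_entry(json.loads(raw))
        except (OSError, http.client.HTTPException, json.JSONDecodeError):
            pass  # supervisor restarting, or the stream was cut mid-line
        time.sleep(retry_delay_s)  # hypothetical delay; tune for the device
```

Calling stream_forever(print) would print each entry as a dict. Reconnecting quickly narrows the logging gap after a supervisor restart, but it does not by itself guarantee that entries emitted while the supervisor was down are recovered.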

I would be keen to understand root cause, but if that is not possible we could move to direct access of journal logs.
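
Direct access could look something like the sketch below. It assumes the service carries the io.balena.features.journal-logs label (which, as far as I understand it, bind-mounts the host journal directories read-only into the container) and that journalctl is installed in the container image. The journalctl flags are the same ones the supervisor logged above; the function names are illustrative.

```python
import json
import subprocess
from typing import Iterator, Optional


def parse_entry(line: str) -> Optional[dict]:
    """Parse one line of `journalctl -o json` output; None if blank or malformed."""
    line = line.strip()
    if not line:
        return None
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    return entry if isinstance(entry, dict) else None


def follow_journal() -> Iterator[dict]:
    """Yield journal entries as dicts without going through the supervisor."""
    proc = subprocess.Popen(
        ["journalctl", "-a", "--follow", "-o", "json"],
        stdout=subprocess.PIPE,
        text=True,
    )
    assert proc.stdout is not None
    try:
        for line in proc.stdout:
            entry = parse_entry(line)
            if entry is not None:
                yield entry
    finally:
        proc.terminate()  # stop journalctl when the consumer stops iterating
```

With this approach the log stream no longer depends on the supervisor process, so a supervisor restart should not interrupt it.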

Ah yes, good catch! The supervisor has an in-process healthcheck that does indeed look for memory leaks. I was expecting a container engine healthcheck and looked in the wrong place.

I would definitely recommend updating to the latest supervisor on some of your test devices and see if the memory usage remains stable and avoids restarts. Let us know how it goes!

Hi @klutchell,

We are running:

  • balenaOS 6.5.24+rev5
  • supervisor 17.0.2
  • intel NUC
  • 8GB RAM

and we are encountering similar issues. Here is a snapshot of our logging. The balena supervisor fails the healthchecks a few times, then restarts.

[balena_supe][INFO] 2025-05-19 22:54:37,742: [info]    Reported current state to the cloud
[balena-supe][INFO] 2025-05-19 22:54:37,743: [info]    Reported current state to the cloud
[balena_supe][INFO] 2025-05-19 22:55:42,978: [info]    Healthcheck failure - memory usage above threshold after 16h 45m 28s
[balena_supe][ERROR] 2025-05-19 22:55:42,982: [error]   Healthcheck failed
[balena-supe][INFO] 2025-05-19 22:55:42,982: [info]    Healthcheck failure - memory usage above threshold after 16h 45m 28s
[balena-supe][ERROR] 2025-05-19 22:55:42,982: [error]   Healthcheck failed
[balena-supe][INFO] 2025-05-19 22:55:42,982: [api]     GET /v1/healthy 500 - 1.224 ms
[balena_supe][INFO] 2025-05-19 22:55:42,983: [api]     GET /v1/healthy 500 - 1.224 ms
[healthdog][INFO] 2025-05-19 22:56:33,936: try: 1, refid: C17B26AC, correction: 0.000785253, skew: 0.333
[healthdog][INFO] 2025-05-19 22:58:33,947: try: 1, refid: C17B26AC, correction: 0.000783230, skew: 0.333
[balena_supe][INFO] 2025-05-19 22:59:38,002: [info]    Reported current state to the cloud
[balena-supe][INFO] 2025-05-19 22:59:38,003: [info]    Reported current state to the cloud
[fake-hwcloc][INFO] 2025-05-19 23:00:04,043: [fake-hwclock] Saving system time to /etc/fake-hwclock/fake-hwclock.data.
[fake-hwcloc][INFO] 2025-05-19 23:00:04,049: Saving system time to /etc/fake-hwclock/fake-hwclock.data.
[healthdog][INFO] 2025-05-19 23:00:33,965: try: 1, refid: C17B26AC, correction: 0.000781208, skew: 0.333
[balena_supe][INFO] 2025-05-19 23:00:43,124: [info]    Healthcheck failure - memory usage above threshold after 16h 50m 28s
[balena-supe][INFO] 2025-05-19 23:00:43,128: [info]    Healthcheck failure - memory usage above threshold after 16h 50m 28s
[balena-supe][ERROR] 2025-05-19 23:00:43,128: [error]   Healthcheck failed
[balena-supe][INFO] 2025-05-19 23:00:43,128: [api]     GET /v1/healthy 500 - 0.964 ms
[balena_supe][ERROR] 2025-05-19 23:00:43,129: [error]   Healthcheck failed
[balena_supe][INFO] 2025-05-19 23:00:43,129: [api]     GET /v1/healthy 500 - 0.964 ms
[healthdog][INFO] 2025-05-19 23:02:33,978: try: 1, refid: C17B26AC, correction: 0.000779185, skew: 0.333
[healthdog][INFO] 2025-05-19 23:04:33,992: try: 1, refid: C17B26AC, correction: 0.000777163, skew: 0.333
[balena_supe][INFO] 2025-05-19 23:04:48,324: [info]    Reported current state to the cloud
[balena-supe][INFO] 2025-05-19 23:04:48,325: [info]    Reported current state to the cloud
[balena_supe][INFO] 2025-05-19 23:05:43,256: [info]    Healthcheck failure - memory usage above threshold after 16h 55m 28s
[balena-supe][INFO] 2025-05-19 23:05:43,261: [info]    Healthcheck failure - memory usage above threshold after 16h 55m 28s
[balena-supe][ERROR] 2025-05-19 23:05:43,261: [error]   Healthcheck failed
[balena-supe][INFO] 2025-05-19 23:05:43,261: [api]     GET /v1/healthy 500 - 1.320 ms
[balena_supe][ERROR] 2025-05-19 23:05:43,261: [error]   Healthcheck failed
[balena_supe][INFO] 2025-05-19 23:05:43,261: [api]     GET /v1/healthy 500 - 1.320 ms
[balenad][INFO] 2025-05-19 23:05:43,266: time="2025-05-19T23:05:43.266361049Z" level=info msg="Unhealthy container 8ccecaa06b30a36b0e5bd2129984f5c9cf4891eeed63739b66dea71cc08cb2c9: restarting..."
[balena_supe][INFO] 2025-05-19 23:05:43,444: [info]    Received SIGTERM. Exiting.
[balena-supe][INFO] 2025-05-19 23:05:43,444: [info]    Received SIGTERM. Exiting.

What would you recommend?
Thanks!

Hello @Jeanine first of all welcome to the balena community.

Did you perform any action to the device before experiencing this issue?

Is this the only device in your fleet with this issue? or you have more?

Thanks!

Hey,

We recently updated 2 devices in our test fleet from balenaOS 2.95.12+rev1/supervisor 14.12.1 to this version. So this is the first time/first device this occurs, just after upgrading. On the same device, we do see the health check failures but no restart of the supervisor (yet). We are keeping an eye on it though, because we will not upgrade in production with this risk.
Nothing notable happens before the crash. The last action in the logs is a switch of wifi access point, 3 hours earlier.
Is there something extra I can provide to help solve this?

** correction: Both updated devices show this behavior. One of the devices restarted the supervisor 4 times in 24 hours

Hi,

We are experiencing a similar issue across several production units.
It’s kind of trickier to diagnose or get information, since the units are remote and have connectivity issues that might accelerate this issue specifically.

The unit ends up freezing and it seems like only a reboot brings it back.
Some log sampling from some units:


UNIT1 — Supervisor v17.0.2 — Memory threshold at 115h

2026-02-11 10:15:02 UTC a62888e114fc: INFO    Request for measurement 881cc142… succeeded with value: 611  service=modbusrtu
2026-02-11 10:15:02 UTC a62888e114fc: INFO    Request for measurement b3970b73… succeeded with value: 374803  service=modbusrtu
2026-02-11 10:15:02 UTC a62888e114fc: INFO    Request for measurement b22f96ae… succeeded with value: 3588  service=modbusrtu
2026-02-11 10:15:02 UTC a62888e114fc: INFO    Request for measurement ef375482… succeeded with value: 990  service=modbusrtu
2026-02-11 10:15:46 UTC a62888e114fc: INFO    27 acquisitions was successfully sent to harvest  service=harvester
2026-02-11 10:15:46 UTC a62888e114fc: INFO    No acquisitions found in the database  service=harvester
2026-02-11 10:16:00 UTC a62888e114fc: INFO    Waiting for 1m0s seconds until next batch of metering  service=metering
2026-02-11 10:16:08 UTC balena-supervisor[3032]: [info]    Healthcheck failure - memory usage above threshold after 115h 8m 47s
2026-02-11 10:16:08 UTC balena-supervisor[3032]: [error]   Healthcheck failed
2026-02-11 10:16:08 UTC balena-supervisor[3032]: [api]     GET /v1/healthy 500 - 19.890 ms
2026-02-11 10:16:38 UTC healthdog[23818]: try: 1, refid: A29FC801, correction: 0.000044431, skew: 0.023
2026-02-11 10:16:46 UTC a62888e114fc: INFO    No acquisitions found in the database  service=harvester
2026-02-11 10:17:00 UTC a62888e114fc: INFO    Waiting for 1m0s seconds until next batch of metering  service=metering
Note: healthcheck returned 200 on next check (10:21), but the unit became completely unresponsive ~4 hours later with no further logs.




UNIT2 — Supervisor v16.4.6 — Memory threshold at 15h (with restart)

2026-03-31 04:15:40 UTC 39616d17dc16: INFO    No acquisitions found in the database  service=harvester
2026-03-31 04:15:40 UTC 39616d17dc16: DEBUG   Sleeping for 60 seconds  service=harvester
2026-03-31 04:15:47 UTC balena-supervisor[29489]: [info]    Reported current state to the cloud
2026-03-31 04:16:00 UTC 39616d17dc16: INFO    All acquisitions done  service=currents
2026-03-31 04:16:00 UTC 39616d17dc16: INFO    Waiting for 1m0s seconds until next batch of metering  service=metering
2026-03-31 04:16:10 UTC balena-supervisor[29489]: [info]    Healthcheck failure - memory usage above threshold after 15h 6m 45s
2026-03-31 04:16:10 UTC balena-supervisor[29489]: [error]   Healthcheck failed
2026-03-31 04:16:10 UTC balena-supervisor[29489]: [api]     GET /v1/healthy 500 - 10.399 ms
2026-03-31 04:16:10 UTC balenad[3747]: “Unhealthy container 7b9e0590908d…: restarting…”
2026-03-31 04:16:10 UTC balena-supervisor[29489]: [info]    Received SIGTERM. Exiting.
2026-03-31 04:16:11 UTC balenad[3766]: “shim disconnected” id=7b9e0590908d…
Note: balenad detects the unhealthy container and kills+restarts the supervisor. This happens every ~15 hours on this unit.
UNIT2 — Supervisor v16.4.6 — Memory threshold at 10h (during connectivity outage)

2026-03-17 00:11:16 UTC 39616d17dc16: ERROR   error sending to harvest: Post “https://harvest:***@harvest.wisemetering.com/data”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)  service=harvester
2026-03-17 00:11:16 UTC 39616d17dc16: ERROR   Failed to POST data to harvest  service=harvester
2026-03-17 00:11:41 UTC balena-supervisor[29620]: [info]    Healthcheck failure - memory usage above threshold after 10h 31m 13s
2026-03-17 00:11:41 UTC balena-supervisor[29620]: [error]   Healthcheck failed
2026-03-17 00:11:41 UTC balena-supervisor[29620]: [api]     GET /v1/healthy 500 - 12.283 ms
Note: this occurrence happened during an active connectivity outage (harvest POST timing out). The leak accelerated: the threshold was hit at 10h vs the usual 15h when connectivity is healthy.




UNIT3 — Supervisor v17.0.2 — Memory threshold at 141h, escalating

2026-02-24 09:26:55 UTC balena-supervisor[3014]: [info]    Healthcheck failure - memory usage above threshold after 141h 18m 6s
2026-02-25 03:54:55 UTC balena-supervisor[3014]: [info]    Healthcheck failure - memory usage above threshold after 159h 46m 7s
2026-02-25 12:11:17 UTC balena-supervisor[3014]: [info]    Healthcheck failure - memory usage above threshold after 168h 2m 28s
2026-02-25 12:51:23 UTC balena-supervisor[3014]: [info]    Healthcheck failure - memory usage above threshold after 168h 42m 35s
2026-02-25 13:06:26 UTC balena-supervisor[3014]: [info]    Healthcheck failure - memory usage above threshold after 168h 57m 37s
Note: same supervisor PID (3014) across all entries — memory keeps growing, not recovering between healthchecks.
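
To compare units, the uptime in each "Healthcheck failure" line can be pulled out and converted to seconds. The following is a small sketch for doing that over exported logs; the function name is illustrative, and treating the hour and minute fields as optional is an assumption, since every sample in this thread includes both.

```python
import re
from typing import Optional

# Matches the tail of lines such as:
#   Healthcheck failure - memory usage above threshold after 141h 18m 6s
_UPTIME_RE = re.compile(
    r"memory usage above threshold after (?:(\d+)h )?(?:(\d+)m )?(\d+)s"
)


def failure_uptime_seconds(line: str) -> Optional[int]:
    """Return the supervisor uptime in seconds from a failure line, else None."""
    match = _UPTIME_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    sample = (
        "2026-02-24 09:26:55 UTC balena-supervisor[3014]: [info]    "
        "Healthcheck failure - memory usage above threshold after 141h 18m 6s"
    )
    print(failure_uptime_seconds(sample))
```

Collecting these values per device makes it easier to see whether the threshold is reached after a consistent uptime (as on UNIT2) or at irregular points (as on UNIT3).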