Balena supervisor restarting due to failing healthcheck

We are continually getting issues where the Balena Supervisor service is restarted due to a failing health check. It looks like the service is going above its memory limit. Here is the supervisor memory usage from a day when it restarted around 4am:

We use the supervisor API to stream logs to a device service, which sends them to our logging service. This is a critical function of our device, as it ensures we have a record of all logs. When the supervisor restarts we lose about 30 seconds of logging.

Do you have any suspicions about what might be the root cause of the supervisor going above its memory limit? Ideally we could find a way to reduce the failure rate of the health check.

One suspicion is that the supervisor memory usage might be tied to streaming the logs for all containers. What would be the advised approach for resolving this? Does the Balena team recommend using the supervisor API for logs, or should we use the io.balena.features.journal-logs feature to access the logs directly from our container?

Hello @henry-mytos, could you please confirm the balenaOS and supervisor versions that you are using? What device type?

Thanks!

Hello @mpous, we are using:

Please let me know if you need any more details.

Thanks

Hey @henry-mytos , we have fixed a few memory leaks in the supervisor since that release and it would probably be worth updating to the latest supervisor to retest. Can you run that for a while and see if you can still reproduce?

The supervisor container doesn’t actually have a healthcheck that includes memory usage. How did you come to the conclusion that this was related to a failed healthcheck?

Definitely mounting the journal logs via io.balena.features.journal-logs is going to be less work for the supervisor, but either way I would not expect anything via the API to cause a crash. What local API commands are you running currently to get the logs, and how often are you calling that endpoint?


Hi @klutchell, thanks for your response.

I am happy to update a number of our internal development devices to the latest Balena Supervisor. We generally update balenaOS and the supervisor some time after they are released, to ensure bugs have been caught.

The assumption that we are facing issues with the supervisor healthcheck comes from the supervisor logs. Just prior to exiting with code 143, the supervisor reports a healthcheck failure.

Jan 1 15:09:15 square-meadow balena-supervisor debug [debug]   Attempting container log timestamp flush...
Jan 1 15:09:15 square-meadow balena-supervisor debug [debug]   Container log timestamp flush complete
Jan 1 15:10:26 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:11:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000099, skew: 0.100
Jan 1 15:12:47 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 200 - 3.119 ms
Jan 1 15:13:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000101, skew: 0.100
Jan 1 15:15:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000103, skew: 0.100
Jan 1 15:17:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000104, skew: 0.100
Jan 1 15:17:47 square-meadow balena-supervisor info [info]    Healthcheck failure - memory usage above threshold after 219h 8m 34s
Jan 1 15:17:47 square-meadow balena-supervisor error [error]   Healthcheck failed
Jan 1 15:17:47 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 500 - 5.530 ms
Jan 1 15:19:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000106, skew: 0.100
Jan 1 15:19:15 square-meadow balena-supervisor debug [debug]   Attempting container log timestamp flush...
Jan 1 15:19:15 square-meadow balena-supervisor debug [debug]   Container log timestamp flush complete
Jan 1 15:19:40 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:21:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000108, skew: 0.100
Jan 1 15:22:48 square-meadow balena-supervisor info [info]    Healthcheck failure - memory usage above threshold after 219h 13m 35s
Jan 1 15:22:48 square-meadow balena-supervisor error [error]   Healthcheck failed
Jan 1 15:22:48 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 500 - 5.316 ms
Jan 1 15:23:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000110, skew: 0.100
Jan 1 15:24:40 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:25:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000111, skew: 0.100
Jan 1 15:27:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000113, skew: 0.100

We are using the /v2/journal-logs endpoint with request data:

{
    "follow": true,
    "all": true,
    "format": "json"
}

This results in the following supervisor log:

Jan 1 15:28:30 square-meadow balena-supervisor debug [debug]   Spawning journalctl -a --follow -o json
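
Roughly, the streaming call from our device service looks like the following. This is a simplified TypeScript sketch rather than our exact code; it assumes Node 18+ (global fetch) and the io.balena.features.supervisor-api label, which injects BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY:

// Simplified sketch of streaming /v2/journal-logs from the supervisor API.
const address = process.env.BALENA_SUPERVISOR_ADDRESS;
const apiKey = process.env.BALENA_SUPERVISOR_API_KEY;

async function streamJournalLogs(): Promise<void> {
  const res = await fetch(`${address}/v2/journal-logs?apikey=${apiKey}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ follow: true, all: true, format: 'json' }),
  });
  if (!res.ok || res.body === null) {
    throw new Error(`journal-logs request failed with status ${res.status}`);
  }
  // With follow: true the response never ends; each chunk holds one or more
  // newline-delimited JSON journal entries.
  const decoder = new TextDecoder();
  for await (const chunk of res.body) {
    // In the real service these lines are parsed and forwarded to our
    // logging backend; here we just print them.
    process.stdout.write(decoder.decode(chunk, { stream: true }));
  }
}

streamJournalLogs().catch((err) => console.error(err));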

I would be keen to understand the root cause, but if that is not possible we could move to direct access of the journal logs.
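
If we did switch, my understanding is that the io.balena.features.journal-logs label bind mounts the host journal into our container, so we could spawn journalctl ourselves, much like the supervisor does. A rough sketch, assuming journalctl is installed in our container image:

// Rough sketch: reading the host journal directly with the
// io.balena.features.journal-logs label enabled.
import { spawn } from 'child_process';
import { createInterface } from 'readline';

// Mirror the command the supervisor spawns: journalctl -a --follow -o json
const journal = spawn('journalctl', ['-a', '--follow', '-o', 'json']);

const lines = createInterface({ input: journal.stdout });
lines.on('line', (line) => {
  // Each line is a single JSON journal entry; parse and forward it to the
  // logging service (forwarding omitted here).
  const entry = JSON.parse(line);
  console.log(entry.MESSAGE);
});

journal.on('exit', (code) => {
  console.error(`journalctl exited with code ${code}`);
});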

Ah yes, good catch! The supervisor has an in-process healthcheck that does indeed look for memory leaks. I was expecting a container engine healthcheck and looked in the wrong place.
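
To illustrate the general pattern (a simplified sketch, not the supervisor's actual implementation): the process tracks its own memory usage and reports unhealthy from its health endpoint once usage goes above a threshold, at which point the watchdog restarts the service.

// Illustration only: an in-process memory healthcheck behind a /v1/healthy
// endpoint. The threshold value here is hypothetical.
import { createServer } from 'http';

const MEMORY_THRESHOLD_BYTES = 200 * 1024 * 1024;

function isHealthy(): boolean {
  // Compare the process's own resident set size against the threshold.
  return process.memoryUsage().rss < MEMORY_THRESHOLD_BYTES;
}

createServer((req, res) => {
  if (req.url === '/v1/healthy') {
    // A 500 here is what shows up as "GET /v1/healthy 500" in the logs above.
    res.writeHead(isHealthy() ? 200 : 500);
    res.end();
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(48484); // 48484 is the supervisor's default API port.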

I would definitely recommend updating to the latest supervisor on some of your test devices and seeing whether the memory usage remains stable and the restarts stop. Let us know how it goes!