Balena supervisor restarting due to failing healthcheck

We are continually getting issues where the Balena Supervisor service is restarted due to a failing health check. It looks like the service is going above its memory limit. Here is the supervisor memory usage from a day when it restarted around 4am:

We use the supervisor API to stream logs to a device service, which sends them to our logging service. This is a critical function of our device, as it ensures we have a record of all logs. When the supervisor restarts we lose about 30 seconds of logging.

Do you have any suspicions about what might be the root cause of the supervisor going above its memory limit? Ideally we could find a way to reduce the failure rate of the health check.

One suspicion is that the supervisor memory usage might be tied to streaming the logs for all containers. What would be the advised approach for resolving this? Does the Balena team recommend using the supervisor API for logs, or should we use the io.balena.features.journal-logs feature to access the logs directly from our container?

Hello @henry-mytos, could you please confirm the balenaOS and supervisor versions that you are using? What device type?

Thanks!

Hello @mpous, we are using:

Please let me know if you need any more details.

Thanks

Hey @henry-mytos , we have fixed a few memory leaks in the supervisor since that release and it would probably be worth updating to the latest supervisor to retest. Can you run that for a while and see if you can still reproduce?

The supervisor container doesn’t actually have a healthcheck that includes memory usage. How did you come to the conclusion that this was related to a failed healthcheck?

Definitely mounting the journal logs via io.balena.features.journal-logs is going to be less work for the supervisor, but either way I would not expect anything via the API to cause a crash. What local API commands are you running currently to get the logs, and how often are you calling that endpoint?


Hi @klutchell, thanks for your response.

I am happy to update a number of our internal development devices to the latest Balena Supervisor. We generally update balenaOS and the supervisor some time after they are released, to ensure bugs have been caught.

The assumption that we are facing issues with the supervisor healthcheck comes from the supervisor logs. Just prior to exiting with code 143, the supervisor reports a healthcheck failure.

Jan 1 15:09:15 square-meadow balena-supervisor debug [debug]   Attempting container log timestamp flush...
Jan 1 15:09:15 square-meadow balena-supervisor debug [debug]   Container log timestamp flush complete
Jan 1 15:10:26 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:11:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000099, skew: 0.100
Jan 1 15:12:47 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 200 - 3.119 ms
Jan 1 15:13:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000101, skew: 0.100
Jan 1 15:15:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000103, skew: 0.100
Jan 1 15:17:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000104, skew: 0.100
Jan 1 15:17:47 square-meadow balena-supervisor info [info]    Healthcheck failure - memory usage above threshold after 219h 8m 34s
Jan 1 15:17:47 square-meadow balena-supervisor error [error]   Healthcheck failed
Jan 1 15:17:47 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 500 - 5.530 ms
Jan 1 15:19:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000106, skew: 0.100
Jan 1 15:19:15 square-meadow balena-supervisor debug [debug]   Attempting container log timestamp flush...
Jan 1 15:19:15 square-meadow balena-supervisor debug [debug]   Container log timestamp flush complete
Jan 1 15:19:40 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:21:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000108, skew: 0.100
Jan 1 15:22:48 square-meadow balena-supervisor info [info]    Healthcheck failure - memory usage above threshold after 219h 13m 35s
Jan 1 15:22:48 square-meadow balena-supervisor error [error]   Healthcheck failed
Jan 1 15:22:48 square-meadow balena-supervisor INFO [api]     GET /v1/healthy 500 - 5.316 ms
Jan 1 15:23:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000110, skew: 0.100
Jan 1 15:24:40 square-meadow balena-supervisor info [info]    Reported current state to the cloud
Jan 1 15:25:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000111, skew: 0.100
Jan 1 15:27:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000113, skew: 0.100

We are using the /v2/journal-logs endpoint with request data:

{
    "follow": true,
    "all": true,
    "format": "json"
}

This results in the following supervisor log:

Jan 1 15:28:30 square-meadow balena-supervisor debug [debug]   Spawning journalctl -a --follow -o json
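
Roughly, the streaming call from our device service looks like the following. This is a simplified TypeScript sketch rather than our exact code; it assumes Node 18+ (global fetch) and the io.balena.features.supervisor-api label, which injects BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY:

// Simplified sketch of streaming /v2/journal-logs from the supervisor API.
const address = process.env.BALENA_SUPERVISOR_ADDRESS;
const apiKey = process.env.BALENA_SUPERVISOR_API_KEY;

async function streamJournalLogs(): Promise<void> {
  const res = await fetch(`${address}/v2/journal-logs?apikey=${apiKey}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ follow: true, all: true, format: 'json' }),
  });
  if (!res.ok || res.body === null) {
    throw new Error(`journal-logs request failed with status ${res.status}`);
  }
  // With follow: true the response never ends; each chunk holds one or more
  // newline-delimited JSON journal entries.
  const decoder = new TextDecoder();
  for await (const chunk of res.body) {
    // In the real service these lines are parsed and forwarded to our
    // logging backend; here we just print them.
    process.stdout.write(decoder.decode(chunk, { stream: true }));
  }
}

streamJournalLogs().catch((err) => console.error(err));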

I would be keen to understand the root cause, but if that is not possible we could move to direct access of the journal logs.
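
If we did switch, my understanding is that the io.balena.features.journal-logs label bind mounts the host journal into our container, so we could spawn journalctl ourselves, much like the supervisor does. A rough sketch, assuming journalctl is installed in our container image:

// Rough sketch: reading the host journal directly with the
// io.balena.features.journal-logs label enabled.
import { spawn } from 'child_process';
import { createInterface } from 'readline';

// Mirror the command the supervisor spawns: journalctl -a --follow -o json
const journal = spawn('journalctl', ['-a', '--follow', '-o', 'json']);

const lines = createInterface({ input: journal.stdout });
lines.on('line', (line) => {
  // Each line is a single JSON journal entry; parse and forward it to the
  // logging service (forwarding omitted here).
  const entry = JSON.parse(line);
  console.log(entry.MESSAGE);
});

journal.on('exit', (code) => {
  console.error(`journalctl exited with code ${code}`);
});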

Ah yes, good catch! The supervisor has an in-process healthcheck that does indeed look for memory leaks. I was expecting a container engine healthcheck and looked in the wrong place.
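
To illustrate the general pattern (a simplified sketch, not the supervisor's actual implementation): the process tracks its own memory usage and reports unhealthy from its health endpoint once usage goes above a threshold, at which point the watchdog restarts the service.

// Illustration only: an in-process memory healthcheck behind a /v1/healthy
// endpoint. The threshold value here is hypothetical.
import { createServer } from 'http';

const MEMORY_THRESHOLD_BYTES = 200 * 1024 * 1024;

function isHealthy(): boolean {
  // Compare the process's own resident set size against the threshold.
  return process.memoryUsage().rss < MEMORY_THRESHOLD_BYTES;
}

createServer((req, res) => {
  if (req.url === '/v1/healthy') {
    // A 500 here is what shows up as "GET /v1/healthy 500" in the logs above.
    res.writeHead(isHealthy() ? 200 : 500);
    res.end();
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(48484); // 48484 is the supervisor's default API port.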

I would definitely recommend updating to the latest supervisor on some of your test devices and seeing whether the memory usage remains stable and the restarts stop. Let us know how it goes!