Hi @klutchell, thanks for your response.
I am happy to update a number of our internal development devices to the latest Balena Supervisor. We generally update Balena OS and the Supervisor some time after they are released, so that any bugs have a chance to be caught first.
Our assumption that we are facing issues with the supervisor healthcheck comes from the supervisor logs: just prior to exiting with code 143, the supervisor reports a healthcheck failure:
```
Jan 1 15:09:15 square-meadow balena-supervisor debug [debug] Attempting container log timestamp flush...
Jan 1 15:09:15 square-meadow balena-supervisor debug [debug] Container log timestamp flush complete
Jan 1 15:10:26 square-meadow balena-supervisor info [info] Reported current state to the cloud
Jan 1 15:11:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000099, skew: 0.100
Jan 1 15:12:47 square-meadow balena-supervisor INFO [api] GET /v1/healthy 200 - 3.119 ms
Jan 1 15:13:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000101, skew: 0.100
Jan 1 15:15:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000103, skew: 0.100
Jan 1 15:17:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000104, skew: 0.100
Jan 1 15:17:47 square-meadow balena-supervisor info [info] Healthcheck failure - memory usage above threshold after 219h 8m 34s
Jan 1 15:17:47 square-meadow balena-supervisor error [error] Healthcheck failed
Jan 1 15:17:47 square-meadow balena-supervisor INFO [api] GET /v1/healthy 500 - 5.530 ms
Jan 1 15:19:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000106, skew: 0.100
Jan 1 15:19:15 square-meadow balena-supervisor debug [debug] Attempting container log timestamp flush...
Jan 1 15:19:15 square-meadow balena-supervisor debug [debug] Container log timestamp flush complete
Jan 1 15:19:40 square-meadow balena-supervisor info [info] Reported current state to the cloud
Jan 1 15:21:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000108, skew: 0.100
Jan 1 15:22:48 square-meadow balena-supervisor info [info] Healthcheck failure - memory usage above threshold after 219h 13m 35s
Jan 1 15:22:48 square-meadow balena-supervisor error [error] Healthcheck failed
Jan 1 15:22:48 square-meadow balena-supervisor INFO [api] GET /v1/healthy 500 - 5.316 ms
Jan 1 15:23:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000110, skew: 0.100
Jan 1 15:24:40 square-meadow balena-supervisor info [info] Reported current state to the cloud
Jan 1 15:25:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000111, skew: 0.100
Jan 1 15:27:05 square-meadow healthdog INFO try: 1, refid: C1399032, correction: 0.000000113, skew: 0.100
```
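For anyone who wants to reproduce the probe: here is a minimal sketch of how we poll the same endpoint healthdog hits, assuming the supervisor API is on its default local address `127.0.0.1:48484` and that `/v1/healthy` is reachable without an API key on this OS version (both assumptions, please correct me if wrong):

```ts
// Sketch: probe the supervisor healthcheck the same way healthdog does.
// ASSUMPTION: the supervisor API listens on 127.0.0.1:48484 (the default)
// and /v1/healthy needs no API key on this balenaOS version.
const SUPERVISOR_ADDRESS =
  process.env.BALENA_SUPERVISOR_ADDRESS ?? "http://127.0.0.1:48484";

async function checkSupervisorHealth(): Promise<void> {
  const res = await fetch(`${SUPERVISOR_ADDRESS}/v1/healthy`);
  // In our logs the supervisor answers 200 while healthy and 500 once an
  // internal check (here, the memory-usage threshold) starts failing.
  console.log(`GET /v1/healthy -> ${res.status} (${res.ok ? "healthy" : "unhealthy"})`);
}

checkSupervisorHealth().catch((err) => {
  console.error("Could not reach the supervisor API:", err);
  process.exitCode = 1;
});
```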
We are using the /v2/journal-logs endpoint with the following request data:

```json
{
  "follow": true,
  "all": true,
  "format": "json"
}
```
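For completeness, this is roughly how we issue that request from one of our services. It is a sketch, not our exact code; it assumes `BALENA_SUPERVISOR_ADDRESS` and `BALENA_SUPERVISOR_API_KEY` are present in the container environment (as injected when the service has the supervisor API feature label):

```ts
// Sketch: stream journal logs from the supervisor's /v2/journal-logs endpoint.
// ASSUMPTION: BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY are set
// in the container environment.
const address = process.env.BALENA_SUPERVISOR_ADDRESS!;
const apiKey = process.env.BALENA_SUPERVISOR_API_KEY!;

async function streamJournalLogs(): Promise<void> {
  const res = await fetch(`${address}/v2/journal-logs?apikey=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ follow: true, all: true, format: "json" }),
  });
  if (!res.ok || res.body === null) {
    throw new Error(`journal-logs request failed: ${res.status}`);
  }
  // With follow: true the response never completes; each chunk carries one
  // or more newline-delimited JSON journal entries.
  for await (const chunk of res.body) {
    process.stdout.write(chunk);
  }
}

streamJournalLogs().catch((err) => console.error(err));
```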
This results in the following supervisor log line:

```
Jan 1 15:28:30 square-meadow balena-supervisor debug [debug] Spawning journalctl -a --follow -o json
```
I would be keen to understand the root cause, but if that is not possible we could move to accessing the journal logs directly rather than going through the supervisor.
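That fallback would look something like the sketch below: spawning journalctl with the same flags the supervisor log line above shows it using. This assumes the container can see the host journal and has a journalctl binary available (I believe the journal-logs feature label provides this, but treat that as an assumption):

```ts
// Sketch of the fallback: run journalctl directly with the same flags the
// supervisor spawns it with, instead of calling /v2/journal-logs.
// ASSUMPTION: the host journal is visible to the container and a journalctl
// binary is present.
import { spawn } from "node:child_process";

function followJournal(): void {
  const journalctl = spawn("journalctl", ["-a", "--follow", "-o", "json"], {
    stdio: ["ignore", "pipe", "inherit"],
  });
  // Same shape of output as the supervisor endpoint: newline-delimited JSON.
  journalctl.stdout.on("data", (chunk: Buffer) => process.stdout.write(chunk));
  journalctl.on("exit", (code) => {
    console.error(`journalctl exited with code ${code}`);
  });
}

followJournal();
```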