Supervisor keeps restarting during image download

Hi,

The supervisor keeps restarting while the image is downloading.
OS: Balena OS 2.43.0, supervisor 10.2.2, RPI 3b

I created a brand-new app and flashed a fresh SD card with Etcher. I pushed the image and it built successfully. However, the image download restarts at random intervals because the supervisor container is being restarted. Logs follow.

Please advise. Could it be a bad connection? (I tried at home and at work, with the same result.) Could it be a bad SD card? (I tried two of them, a Kingston and an A-Data, where speed shouldn’t be a problem.)

I’m really stuck here. I don’t know why the balena engine thinks it’s necessary to restart the supervisor.

Thanks,

Jiri

Log 1

Oct 29 15:01:49 03a14ed resin-supervisor[3433]: [event] Event: Docker image download {"image":{"name":"registry2.balena-cloud.com/v2/cc890989d5a02f97b8a3114f5c128862@sha256:851f1ec0d4ba1964b2b2fe692e9aac309ce062eb2167eb7c95bdcaa9dfce1455","appId":1535704,"serviceId":349980,"serviceName":"main","imageId":1634391,"releaseId":1126541,"dependent":0,"dockerImageId":null}}
Oct 29 15:05:46 03a14ed resin-supervisor[3433]: [api] GET /v1/healthy 200 - 53.443 ms
Oct 29 15:10:54 03a14ed resin-supervisor[3433]: [debug] Attempting container log timestamp flush…
Oct 29 15:10:54 03a14ed resin-supervisor[3433]: [debug] Container log timestamp flush complete
Oct 29 15:11:02 03a14ed resin-supervisor[3433]: [api] GET /v1/healthy 200 - 13.342 ms
Oct 29 15:12:29 03a14ed resin-supervisor[3433]: time="2019-10-29T15:12:29.787033389Z" level=error msg="error waiting for container: unexpected EOF"
Oct 29 15:12:29 03a14ed systemd[1]: resin-supervisor.service: Main process exited, code=exited, status=125/n/a
Oct 29 15:12:29 03a14ed systemd[1]: resin-supervisor.service: Failed with result 'exit-code'.
Oct 29 15:12:39 03a14ed systemd[1]: resin-supervisor.service: Service RestartSec=10s expired, scheduling restart.
Oct 29 15:12:39 03a14ed systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 2.
Oct 29 15:12:40 03a14ed systemd[1]: Stopped Resin supervisor.
Oct 29 15:13:49 03a14ed systemd[1]: Starting Resin supervisor…
Oct 29 15:13:50 03a14ed resin-supervisor[4504]: resin_supervisor
Oct 29 15:13:50 03a14ed resin-supervisor[4512]: active
Oct 29 15:13:50 03a14ed systemd[1]: Started Resin supervisor.

Log 2

Oct 28 16:47:25 eef28c0 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
Oct 28 16:47:25 eef28c0 systemd[1]: resin-supervisor.service: Killing process 31996 (start-resin-sup) with signal SIGABRT.
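
The two logs above show two different failure signatures: an engine-side "unexpected EOF" followed by exit status 125 (Log 1), and a systemd watchdog timeout (Log 2). As a rough aid for telling them apart in a longer journal, here is a minimal Python sketch. The regular expressions are taken from the log lines in this post; the cause labels and the `classify` helper are made up for illustration and are not part of any balena tool.

```python
import re

# Patterns copied from the log excerpts in this thread; the labels are illustrative only.
PATTERNS = [
    ("watchdog_timeout",
     re.compile(r"resin-supervisor\.service: Watchdog timeout \(limit (?P<limit>[^)]+)\)")),
    ("engine_eof",
     re.compile(r"error waiting for container: unexpected EOF")),
    ("exit_code",
     re.compile(r"resin-supervisor\.service: Main process exited, code=exited, status=(?P<status>\d+)")),
]


def classify(lines):
    """Return (cause, detail) tuples for every line that matches a known pattern."""
    found = []
    for line in lines:
        for cause, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                # Use the first named group (if any) as the detail, e.g. "3min" or "125".
                detail = next(iter(match.groupdict().values()), "")
                found.append((cause, detail))
                break
    return found


if __name__ == "__main__":
    import sys

    # Reads journal text from stdin and prints one line per detected restart cause.
    for cause, detail in classify(sys.stdin):
        print(cause, detail)
```

Assuming the script is saved as `classify.py` (a hypothetical name), it could be fed with something like `journalctl -u resin-supervisor --no-pager | python3 classify.py` on a machine that has both the journal and Python available.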

Hello, could you please enable support access for this device and provide us with its UUID?

Thank you zvin. I tried once again with the Sandisk Extreme Pro card and it finally went through. :slight_smile:

Anyway, it would be great to list these cards as a mandatory requirement for Balena. That would save many headaches, as it’s not obvious from the journal logs what’s going on.

Another enhancement would be to allow custom timeouts for all the watchdogs in Balena OS that kill services.
My otherwise healthy supervisor container was killed after 3 minutes of unresponsiveness caused by the image download.

Thanks,

Jiri

Hi @filemon,

We don’t mandate these cards, although we do recommend them, as customers tend to be able to source their own high-end cards. However, we do always recommend using a model that has had appropriate levels of manufacturer testing to ensure low wear levels.

Custom healthdog timeouts are an interesting topic. We have discussed this internally before, and I’ll raise it with the team again. Having said that, from your logs it looks like the Supervisor was actually unable to use a container it thought it should be able to use, which doesn’t sound like a timeout but rather an issue with container startup/shutdown. Is there any more information you could give us on exactly what the device was doing at the time?

Best regards,

Heds

Hello,

Just for a bit more info, we have an open issue for configurable healthdog/watchdog timeouts here: https://github.com/balena-os/meta-balena/issues/1724
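
For readers unfamiliar with how a health-gated watchdog behaves, the general pattern can be sketched in Python as below. This is only an illustration of the systemd notify protocol (a `WATCHDOG=1` datagram sent to the socket named in `NOTIFY_SOCKET`), not healthdog's actual implementation. The `is_healthy` callback is a placeholder; wiring it to the supervisor's `/v1/healthy` endpoint seen in the logs would be an assumption.

```python
import os
import socket
import time


def notify(message, address=None):
    """Send one sd_notify-style datagram; return False if no notify socket is set."""
    address = address or os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True


def watchdog_loop(is_healthy, interval_s, rounds, address=None):
    """Ping the watchdog once per interval, but only while is_healthy() is True.

    Returns the number of pings actually sent. If the health check keeps
    failing (or blocks) for longer than the unit's watchdog limit, systemd
    stops receiving pings and kills the service, as in Log 2.
    """
    pings = 0
    for _ in range(rounds):
        if is_healthy() and notify("WATCHDOG=1", address):
            pings += 1
        time.sleep(interval_s)
    return pings
```

The point of the sketch is that a health check which stalls on slow I/O looks the same to systemd as a dead process, which is why a configurable limit matters on slow storage.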

Best regards,

Heds

Thank you hedss. I overcame the watchdog timeout (Log 2 from my original report) by making the balena OS filesystem writable and manually setting the watchdog timeout in resin-supervisor.service. However, other watchdogs jumped on my back immediately after. :slight_smile:

So configurable timeouts would be very welcome, to allow poor owners of slow SD cards to enjoy your awesome platform. Even better, they could be tuned dynamically, driven by the results of your balena healthcheck utility.
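
For anyone wanting to reproduce the manual override described above, one possible shape is a systemd drop-in rather than an edit to the unit file itself. The Python sketch below only renders and writes such a drop-in. `WatchdogSec=` is the standard systemd setting behind the "Watchdog timeout (limit 3min)" message; the drop-in directory, the `watchdog.conf` file name, and the helper names are assumptions, and none of this has been tested on balenaOS.

```python
from pathlib import Path

# Assumed location: systemd reads per-unit drop-ins from "<unit>.d" directories.
DROPIN_DIR = Path("/etc/systemd/system/resin-supervisor.service.d")


def render_dropin(watchdog_sec):
    """Return drop-in text that overrides the unit's watchdog limit (in seconds)."""
    if watchdog_sec <= 0:
        raise ValueError("watchdog_sec must be positive")
    return f"[Service]\nWatchdogSec={watchdog_sec}\n"


def write_dropin(watchdog_sec, directory=DROPIN_DIR, name="watchdog.conf"):
    """Write the drop-in file and return its path.

    The root filesystem must already be writable, and systemd must be told
    to reload its configuration afterwards for the change to take effect.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / name
    target.write_text(render_dropin(watchdog_sec))
    return target
```

After writing the file, `systemctl daemon-reload` followed by a restart of the service would normally be needed. As noted above, this only addresses one watchdog; others may still fire.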

Hi,

Totally understood, though I think we should also be able to mitigate this with some dynamic I/O testing, and I’m going to raise this as an issue too. We’ll obviously let you know and update this thread when anything moves on these issues!
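
To make the idea of dynamic I/O testing a bit more concrete, here is a rough Python sketch: measure synced sequential write throughput, then scale a timeout from it. This is not how balena does or will do it. The 180-second base comes from the 3-minute limit in Log 2, while the 10 MB/s reference speed, the 1800-second cap, and both function names are arbitrary assumptions.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, total_mb=8, chunk_kb=256):
    """Write total_mb of random data into directory, fsync it, and return MB/s."""
    chunk = os.urandom(chunk_kb * 1024)
    chunks = (total_mb * 1024) // chunk_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(chunks):
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out to the storage device
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.remove(path)
    written_mb = chunks * chunk_kb / 1024
    return written_mb / elapsed


def suggested_timeout_s(throughput_mb_s, base_timeout_s=180,
                        reference_mb_s=10.0, max_timeout_s=1800):
    """Scale the base timeout up in proportion to how slow the storage is."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    factor = max(1.0, reference_mb_s / throughput_mb_s)
    return min(max_timeout_s, round(base_timeout_s * factor))
```

A card measured at half the reference speed would get twice the base timeout, and anything at or above the reference keeps the default. A real test would also need to account for random writes and for load from the image download itself, which a single sequential write does not capture.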

Best regards,

Heds