Supervisor keeps restarting during image download

filemon · October 29, 2019, 3:25pm

Hi,

I have a problem with the supervisor restarting regularly during image download.
OS: Balena OS 2.43.0, supervisor 10.2.2, RPI 3b

I created a brand new app and flashed a fresh SD card with Etcher. I pushed and successfully built the image. However, the image download restarts at random intervals because the supervisor container is being restarted. Logs follow.

Please advise: could it be a bad connection? (I tried at home and at work, with the same result.) Could it be a bad SD card? (I tried two of them, a Kingston and an A-Data, where speed shouldn’t be a problem.)

I’m really stuck here. I don’t know why the balena engine thinks it’s necessary to restart the supervisor.

Thanks,

Jiri

Log 1

Oct 29 15:01:49 03a14ed resin-supervisor[3433]: [event] Event: Docker image download {"image":{"name":"registry2.balena-cloud.com/v2/cc890989d5a02f97b8a3114f5c128862@sha256:851f1ec0d4ba1964b2b2fe692e9aac309ce062eb2167eb7c95bdcaa9dfce1455","appId":1535704,"serviceId":349980,"serviceName":"main","imageId":1634391,"releaseId":1126541,"dependent":0,"dockerImageId":null}}
Oct 29 15:05:46 03a14ed resin-supervisor[3433]: [api] GET /v1/healthy 200 - 53.443 ms
Oct 29 15:10:54 03a14ed resin-supervisor[3433]: [debug] Attempting container log timestamp flush…
Oct 29 15:10:54 03a14ed resin-supervisor[3433]: [debug] Container log timestamp flush complete
Oct 29 15:11:02 03a14ed resin-supervisor[3433]: [api] GET /v1/healthy 200 - 13.342 ms
Oct 29 15:12:29 03a14ed resin-supervisor[3433]: time="2019-10-29T15:12:29.787033389Z" level=error msg="error waiting for container: unexpected EOF"
Oct 29 15:12:29 03a14ed systemd[1]: resin-supervisor.service: Main process exited, code=exited, status=125/n/a
Oct 29 15:12:29 03a14ed systemd[1]: resin-supervisor.service: Failed with result 'exit-code'.
Oct 29 15:12:39 03a14ed systemd[1]: resin-supervisor.service: Service RestartSec=10s expired, scheduling restart.
Oct 29 15:12:39 03a14ed systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 2.
Oct 29 15:12:40 03a14ed systemd[1]: Stopped Resin supervisor.
Oct 29 15:13:49 03a14ed systemd[1]: Starting Resin supervisor…
Oct 29 15:13:50 03a14ed resin-supervisor[4504]: resin_supervisor
Oct 29 15:13:50 03a14ed resin-supervisor[4512]: active
Oct 29 15:13:50 03a14ed systemd[1]: Started Resin supervisor.

Log 2

Oct 28 16:47:25 eef28c0 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
Oct 28 16:47:25 eef28c0 systemd[1]: resin-supervisor.service: Killing process 31996 (start-resin-sup) with signal SIGABRT.

zvin · October 29, 2019, 4:08pm

Hello, could you please enable support access for this device and provide us with its UUID?

filemon · October 30, 2019, 8:36am

Thank you zvin. I tried once again with a Sandisk Extreme Pro card and it finally went through.

Anyway, it would be great to list these cards as a mandatory requirement for Balena. It would save many headaches, because it’s not at all obvious from the journal logs what is going on.

Another enhancement would be to allow custom timeouts for all the killer watchdogs present in Balena OS.
My otherwise healthy supervisor container got killed after 3 minutes of unresponsiveness caused by the image download.

Thanks,

Jiri

hedss · October 30, 2019, 12:46pm

Hi @filemon,

We don’t mandate these cards, although we do recommend them, as customers tend to be able to source their own high-end cards. However, we do always recommend using a model that has had appropriate levels of manufacturer testing to ensure low wear levels.

Custom healthdog timeouts are an interesting topic; we have discussed this internally before, and I’ll raise it with the team again. Having said that, from your logs it looks like the Supervisor was actually unable to use a container it thought it should be able to use, which doesn’t sound like a timeout but like an issue with container startup/shutdown. Is there any more information you could give us on exactly what the device was doing at the time?
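
The two logs in the original report show different failure signatures: Log 1 ends with the main process exiting (status=125), while Log 2 is a systemd watchdog timeout. As a rough aid for telling them apart in a longer journal dump, here is a minimal Python sketch. The function names (classify, summarize) and the cause labels are made up for illustration, and the regular expressions only cover the two systemd message shapes quoted in this thread, so other failure modes will be ignored.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Patterns modelled only on the systemd lines quoted in this thread.
WATCHDOG_RE = re.compile(r"(\S+\.service): Watchdog timeout \(limit ([^)]+)\)")
EXIT_RE = re.compile(r"(\S+\.service): Main process exited, code=(\w+), status=(\S+)")


def classify(line: str) -> Optional[dict]:
    """Return unit, cause and detail for a failure line, or None otherwise."""
    match = WATCHDOG_RE.search(line)
    if match:
        return {
            "unit": match.group(1),
            "cause": "watchdog-timeout",  # hypothetical label
            "detail": match.group(2),     # the configured limit, e.g. "3min"
        }
    match = EXIT_RE.search(line)
    if match:
        return {
            "unit": match.group(1),
            "cause": "process-exit",      # hypothetical label
            "detail": "code={} status={}".format(match.group(2), match.group(3)),
        }
    return None


def summarize(lines: Iterable[str]) -> Counter:
    """Count how often each (unit, cause) pair appears in the given lines."""
    counts: Counter = Counter()
    for line in lines:
        event = classify(line)
        if event is not None:
            counts[(event["unit"], event["cause"])] += 1
    return counts
```

It could be fed the saved output of journalctl for the supervisor unit, one line at a time; lines that match neither pattern are simply skipped.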

Best regards,

Heds

hedss · October 30, 2019, 1:45pm

Hello,

Just for a bit more info, we have an open issue for configurable healthdog/watchdog timeouts here: https://github.com/balena-os/meta-balena/issues/1724

Best regards,

Heds

filemon · October 30, 2019, 2:45pm

Thank you hedss. I overcame the watchdog timeout (Log 2 from my original report) by making the balena OS filesystem writable and manually setting the watchdog timeout in resin-supervisor.service. However, other watchdogs jumped on my back immediately afterwards.

So configurable timeouts would be very welcome, to let poor owners of slow SD cards enjoy your awesome platform. Even better, the timeouts could be tuned dynamically, driven by the results of your balena healthcheck utility.
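
For anyone trying the same workaround, here is a minimal Python sketch of what such an override could look like when written as a systemd drop-in instead of editing the unit file directly. This is an assumption, not what was done above: WatchdogSec is the standard systemd directive behind the "Watchdog timeout (limit 3min)" message, but the drop-in file name, the helper names and the 600 second value are purely illustrative, and this thread does not confirm that a drop-in persists on balenaOS. It also only touches the systemd unit watchdog, not the other watchdogs mentioned.

```python
def render_watchdog_dropin(timeout_seconds: int) -> str:
    """Render the text of a systemd drop-in that overrides WatchdogSec.

    A bare number in WatchdogSec is interpreted by systemd as seconds.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return "[Service]\nWatchdogSec={}\n".format(timeout_seconds)


def dropin_path(unit: str, name: str = "10-watchdog.conf") -> str:
    """Return the conventional drop-in location for a unit.

    The file name is hypothetical; systemd reads any *.conf file
    in the <unit>.d directory.
    """
    return "/etc/systemd/system/{}.d/{}".format(unit, name)


if __name__ == "__main__":
    # Print the target path and contents instead of writing them,
    # since the root filesystem is normally read-only.
    print(dropin_path("resin-supervisor.service"))
    print(render_watchdog_dropin(600), end="")
```

After placing such a file, systemd would need a daemon-reload and a restart of the unit before the new limit takes effect.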

hedss · October 30, 2019, 2:59pm

Hi,

Totally understood, though I think we should also be able to mitigate this with some dynamic I/O testing, and I’m going to raise this as an issue too. We’ll obviously let you know and update this thread when anything moves on these issues!

Best regards,

Heds
