Slow download speed causes restart of download

Hi

I have an RPi in Brazil on a super slow connection. It starts downloading some docker images but never manages to complete them, because the supervisor keeps getting restarted.

Is there a way to download the images manually, or prevent the supervisor from doing its healthchecks for some time?

Hi, thanks for contacting support.
We have an open request to allow for configurable healthcheck timeouts: https://github.com/balena-os/meta-balena/issues/1724.
Similar issues were reported in the forums thread “Supervisor keeps restarting during image download”. You could try a similar workaround, that is, remounting the root filesystem read-write, modifying the WatchdogSec value in resin-supervisor.service, and restarting the service.
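Roughly, that workaround looks like the following sketch (the unit name and file location are assumptions and vary between balenaOS versions, so locate the real ones first with systemctl status):

mount -o remount,rw /    # root is read-only by default
# Raise the systemd watchdog timeout in the supervisor unit file
# (assumed path; check with: systemctl status resin-supervisor):
sed -i 's/^WatchdogSec=.*/WatchdogSec=3600/' /lib/systemd/system/resin-supervisor.service
systemctl daemon-reload
systemctl restart resin-supervisor.service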
Hope it helps.

Thanks, I was looking at balena-engine and its timeout, but noticed root was RO.

mount -o remount,rw /      # root is mounted read-only by default
vi balena-engine.service   # changed WatchdogSec=4000
systemctl daemon-reload    # pick up the edited unit file
# docker had completely filled up with failed downloads…
rm -rf /mnt/data/docker    # wipe the partial layers to free space
reboot

I would say it’s not so much a healthcheck timeout increase that’s needed, but a way to keep the healthcheck passing while downloads are in progress?


Hi @axlrod, you are correct, you have to remount / as RW in order to edit the service files. Can you clarify your second question?

Well, I’m saying that docker image downloads that are clearly progressing shouldn’t cause a watchdog trigger on the balena service. I don’t know how it’s checked, but I assume that is not how it was intended.
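If it works like a standard systemd watchdog, the unit has to send periodic WATCHDOG=1 pings, and systemd kills and restarts it when none arrive within WatchdogSec, no matter how much useful work (like an image pull) is going on. A toy sketch of that mechanism, where do_health_check is a hypothetical placeholder (a shell script would also need NotifyAccess=all to notify from a subprocess):

#!/bin/sh
# Minimal Type=notify service loop for a unit with WatchdogSec set:
systemd-notify --ready
while true; do
    # Pet the watchdog only while healthy; ideally “healthy” would also
    # cover “a download is still making progress”.
    do_health_check && systemd-notify WATCHDOG=1
    sleep 30
done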

Hey.

“downloading some docker images”

Can you confirm, as it’s not obvious from the thread, that you’re referencing your service images from balenaCloud here?

Correct

hey @axlrod,
yes, the supervisor or balena service shouldn’t restart if a download is in progress. Could you provide logs for your device, or enable support access and provide the device URL?

[removed uuid]

You have a week of access.

Feel free to test things / redeploy the service; bear in mind the watchdog is set to 4000 sec on the balena service at the moment.
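You can double-check the value that’s actually in effect with systemctl (the unit may be named balena.service or balena-engine.service depending on the OS version):

# WatchdogUSec reports the active watchdog timeout for the unit:
systemctl show balena.service -p WatchdogUSec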

I was checking the device, ran the diagnostics, and it seems that the device is struggling with slow disk writes. I am not sure if that is related to the original issue, but it might point to a corrupted or low-quality SD card, so you might want to have a look at it. Regarding the original issue, what I see in the logs is HTTP 503 status from the API, which might be the reason why the supervisor restarted. Have you witnessed this behavior multiple times, or just this one time?
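As an aside, a quick way to sanity-check raw write throughput on the card is something like this (the test path is just an example; remove the file afterwards):

# Write 64 MB with direct I/O so the page cache doesn’t mask slow media:
dd if=/dev/zero of=/mnt/data/ddtest bs=1M count=64 oflag=direct
rm /mnt/data/ddtest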

If the watchdog is at the default timeout and I push a new version, the device starts downloading the images and fails at around 60% because the supervisor restarts… it then starts again from 0%, and it keeps doing that until /mnt/data is full. I’m guessing it doesn’t clean up properly.
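Checking and reclaiming that space by hand should look something like this (balena-engine mirrors the docker CLI, so these are the docker equivalents):

# How full is the data partition, and what is the engine holding on to?
df -h /mnt/data
balena-engine system df
# Remove dangling layers left over from failed pulls:
balena-engine image prune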

Hi,

Looks like all services are working right now. Looking at the logs, I can see that the kernel killed a couple of its threads because they were hanging. Both threads were in code paths that, judging from the stack trace, were doing operations on the SD card. This, together with the diagnostics @sradevski referenced above, indicates that the SD card is getting overloaded.

This could very well be the root cause of the supervisor restarting: slow SD card IO while downloading an image (which is IO-intensive) triggering the kernel’s hung task checker. The supervisor would then be killed and restarted, bypassing the watchdog. However, the supervisor hasn’t restarted since the last reboot and all services are already updated, so there are no logs to confirm this hypothesis.

It would be best to replace the SD card (from our experience, the SanDisk Extreme Pro works very well). If that is not possible, you could try increasing the kernel’s hung task timeout; see http://beautifulwork.org/hung_task_timeout_secs-hung-task-timeout/ for an example of how to do that.
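On the device that would look roughly like this (600 is just an example value, and the setting resets on reboot unless persisted):

# Report hung tasks after 600 s instead of the default 120 s; 0 disables the check:
sysctl -w kernel.hung_task_timeout_secs=600
# Verify the current value:
cat /proc/sys/kernel/hung_task_timeout_secs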

Please update us if this solves the issue for you.

Oh, my issue was solved by remounting / and changing the timeout. The SD card is already a high-quality, high-speed card; I think it’s more related to the download speed on that line…

Good to know that it worked out for you. Please let us know if you need further support.