Failed to download image when disk usage is very high

I have a Pi4 failing to download images, which looks to be due to the available disk space. Currently the diagnostics report that the disk space is 1%. However, I am not sure what would have caused such high disk usage.

I think that the topic Persistent "Failed to download image due to 'connect ECONNREFUSED /var/run/balena-engine.sock'" error is related. The logs are displaying a similar issue and the disk space usage could be due to an issue with the SD card.

When looking into the disk space usage there seems to be a huge number of folders (472 to be precise) in /var/lib/balena/overlay2 which I have tried deleting, in order to clear up space. However, the command often fails during the deletion.

Should I just write off the SD card and start fresh? Is this a common failure mode if there are issues with the SD card? Could there be a different reason for data usage? I have been sure to remove all the files in the /data folder.
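
For anyone wanting to see where the space under a directory such as /var/lib/balena/overlay2 is going without deleting anything, here is a minimal read-only sketch. It assumes Python is available wherever it is run, which may not be the case on the host OS itself. The function names are made up for illustration, and the sizes are apparent file sizes, so treat the numbers as a rough guide rather than exact block usage.

```python
import os


def dir_size_bytes(path):
    """Sum the apparent sizes of all files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it and carry on.
                continue
    return total


def largest_subdirs(parent, top=10):
    """Return (name, size_in_bytes) for the biggest immediate subdirectories."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.name, dir_size_bytes(entry.path)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    target = "/var/lib/balena/overlay2"  # path mentioned in the post above
    if os.path.isdir(target):
        for name, size in largest_subdirs(target):
            print(f"{size / 1e6:10.1f} MB  {name}")
```

Nothing here modifies the disk; it only walks and reads metadata.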

Hi, can you enable support access and share the UUID of the device please? We’d like to take a look at the device if possible.

Thanks for your quick response, the device UUID is 030a41d6e205dfeed48d0131cfd769cc and I have enabled support access for a week.

This morning I ran another batch of health checks. It would seem that the continual attempts to delete folders in overlay2 have reduced disk usage; however, there are now issues with write speed (among other things).

Hello Henry,
I checked your device. Regarding disk space usage, is there a reason why you deleted files from /var/lib/balena/overlay2? Also, to get better metrics overall, I would recommend upgrading your balenaOS to the latest version, 2.56.0, for which we have shipped the device metrics feature. You can read about it here: https://www.balena.io/blog/introducing-lightweight-high-level-device-metrics-balenacloud/

I wasn’t able to debug the disk usage properly since the device was updating at the time. Judging by the health checks, since the supervisor is not running, the update won’t go through even if it downloads (it has been stuck at 1%). Have you run any commands to update the supervisor manually? Seeing the extent of problems and actions already taken, I would recommend starting fresh with the latest balenaOS version and the same SD card. From there we can start to narrow down the problems. Thanks!

I deleted these files as my assumption was that images were continually getting downloaded, then when the downloads were failing not all the data was cleared. It could well have been a different reason that the downloads were failing.

I attempted to remotely update balenaOS. As we run our devices remotely, it is interesting to see which failure modes we cannot recover from without physically accessing the device. I assume that this will not work, in which case we will fall back to a complete update of the SD card.

I have not run any supervisor commands manually. What commands are available to invoke updates or attempt to recover when updates are failing?

It looks like the update worked, so it would be great to see if it is possible to recover the device remotely. In previous posts I have seen people suggesting removal of most of the folders in /var/lib/balena, although I know this is a fairly destructive approach.

Here are the current device metrics (very nice feature by the way).

Hi Henry,
Thanks for providing the information. I wouldn’t recommend deleting any files from balenaOS without specific instructions and reasons. It could lead to the device behaving erratically. I am glad to see the device is looking good (Running healthcheck) and updated remotely without the need for an SD card reflash. One of the many features balena provides.

Previously, I saw the supervisor was not running with the reason being a version mismatch. I assumed you tried to update it manually. Another thing we don’t really recommend. Glad you liked the device metrics feature. Let us know whether your original issue regarding disk usage got resolved. Thanks.

The device is still failing to download images with the same error connect ECONNREFUSED /var/run/balena-engine.sock. Any ideas how best to approach debugging? At this point I would sometimes consider performing a reboot from the HostOS. Again this is likely not a recommended approach.

06.10.20 12:19:38 (+0100) Failed to download image 'registry2.balena-cloud.com/v2/f125701d15b0ab5699b8a9cb8171eaf7@sha256:e493ba911cc7bcd570f5719c8bc90fe1dbd1c4de1657e4da7f698f666376ca4d' due to 'connect ECONNREFUSED /var/run/balena-engine.sock'
06.10.20 12:25:31 (+0100) 🔥 Firewall mode not applied due to error
06.10.20 12:25:31 (+0100) Supervisor starting
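
As a debugging aid for the error above: ECONNREFUSED on a Unix socket generally means the socket file exists but nothing is accepting connections on it, whereas a missing file gives a different error. The sketch below distinguishes those cases. It is only an illustration under assumptions: the function name and return strings are made up here, and it assumes Python is available in whatever environment can see the socket path.

```python
import errno
import socket


def probe_unix_socket(path, timeout=2.0):
    """Try to connect to a Unix stream socket and describe the outcome."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "listening"           # something accepted the connection
    except FileNotFoundError:
        return "missing"             # no socket file at this path
    except ConnectionRefusedError:
        return "refused"             # file exists, but nothing is listening
    except PermissionError:
        return "permission denied"
    except OSError as exc:
        # Anything else (timeouts included) is reported by errno name if known.
        return "error: " + errno.errorcode.get(exc.errno, str(exc))
    finally:
        sock.close()


if __name__ == "__main__":
    # Socket path taken from the log line above.
    print(probe_unix_socket("/var/run/balena-engine.sock"))
```

A result of "refused" would match the log message and point at the engine process not running, rather than at the network or the registry.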

The outputs from a recent health check are not particularly positive:

Hello @hpgmiskin

I’ve tried making this device download the latest release for some time but it looks like the SD card is in a very bad condition.

It’s not corrupting any data but it has a very low write speed (I’ve measured between 32kB/s and 280kB/s) and most things just hang.
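
For reference, a rough way to get a comparable write-speed figure is to time a synced write of a known size. This is a minimal sketch, not the method used for the measurements above: the function name, file name, and default sizes are assumptions, and it needs free space and adds a little wear, so use a small size on a card that is already struggling.

```python
import os
import time


def measure_write_speed(directory, total_bytes=4 * 1024 * 1024, chunk_bytes=64 * 1024):
    """Write total_bytes to a scratch file, fsync it, and return bytes per second."""
    path = os.path.join(directory, "write_speed_test.tmp")  # hypothetical scratch file
    chunk = b"\0" * chunk_bytes
    written = 0
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while written < total_bytes:
            # os.write may write fewer bytes than requested, so count what it returns.
            written += os.write(fd, chunk[: total_bytes - written])
        os.fsync(fd)  # push the data to the storage device, not just the page cache
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.remove(path)  # always clean up the scratch file
    return written / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    speed = measure_write_speed(".", total_bytes=1024 * 1024)
    print(f"{speed / 1000:.0f} kB/s")
```

At the speeds quoted above, even the 4 MB default could take a couple of minutes, so a hang is not necessarily a crash.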

This log message from dmesg seems to go in that direction too:

[  725.983630] INFO: task kworker/0:0:1173 blocked for more than 120 seconds.
[  725.983651]       Not tainted 5.4.58 #1
[  725.983664] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  725.983678] kworker/0:0     D    0  1173      2 0x00000028
[  725.983717] Workqueue: events_freezable mmc_rescan
[  725.983733] Call trace:
[  725.983753]  __switch_to+0xdc/0x248
[  725.983772]  __schedule+0x2dc/0x710
[  725.983788]  schedule+0x40/0xd8
[  725.983801]  __mmc_claim_host+0xb8/0x200
[  725.983815]  mmc_get_card+0x38/0x48
[  725.983832]  mmc_sd_detect+0x24/0x90
[  725.983845]  mmc_rescan+0xc0/0x368
[  725.983862]  process_one_work+0x1ec/0x4a0
[  725.983877]  worker_thread+0x48/0x490
[  725.983891]  kthread+0x124/0x128
[  725.983905]  ret_from_fork+0x10/0x18
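
If you want to check other devices for the same symptom, hung-task reports like the one above can be picked out of saved dmesg output with a couple of regular expressions. This is only a sketch: the function and field names are hypothetical, the patterns are based solely on the excerpt above, and it simply attaches the first Workqueue line that follows each report.

```python
import re

# Matches e.g. "[  725.983630] INFO: task kworker/0:0:1173 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds\."
)
# Matches e.g. "[  725.983717] Workqueue: events_freezable mmc_rescan"
WORKQUEUE = re.compile(r"^\[\s*\d+\.\d+\]\s+Workqueue: (?P<queue>\S+) (?P<func>\S+)")


def find_hung_tasks(dmesg_text):
    """Return one dict per 'blocked for more than N seconds' report in dmesg_text."""
    reports = []
    for line in dmesg_text.splitlines():
        hung = HUNG_TASK.match(line)
        if hung:
            reports.append({
                "time": float(hung["time"]),
                "task": hung["name"],
                "pid": int(hung["pid"]),
                "blocked_seconds": int(hung["seconds"]),
                "work_function": None,
            })
            continue
        work = WORKQUEUE.match(line)
        if work and reports and reports[-1]["work_function"] is None:
            # Attach the work function (mmc_rescan in the excerpt) to the latest report.
            reports[-1]["work_function"] = work["func"]
    return reports
```

A report whose work function is mmc_rescan, as here, involves the kernel's MMC/SD card handling, which is consistent with a storage problem.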

I would recommend trying to flash another SD card; SanDisk Extreme PRO ones are quite reliable.

Good to know that SanDisk Extreme PRO cards are the way to go. I have ordered a couple to start using in our devices.

I am a little concerned that this issue might have been brought about by the device crash-looping and continually downloading services. We have a number of devices which have been running a good deal longer than this one with no issues.

Hi there – glad to hear that you’re going with the SanDisk Extreme PRO; in our experience they’ve been quite solid.

If the device was doing a lot of writes, whether from the image download or anything else, it certainly wouldn’t have helped anything. Lowering the number of writes a device does will always extend the life of your media. However, at this point it seems likely that the original SD card would have been encountering problems first, which were then reflected in the download problems.

Thanks, and please let us know if there’s anything else we can do to help here.

All the best,
Hugh