stuck in deployment loop

hi guys,

i have 10 pi4 devices. we pushed an update last night and 1 of our devices is stuck in an app deployment loop.

so far i checked the sd card space, rebooted the pi, and tried stopping and restarting the container manually, but it stays stuck.

i even tried upgrading the hostOS to the latest version (was on 2.47.0 and now i have 2.47.1), but still nothing.

the other machines are fine.
https://dashboard.balena-cloud.com/devices/8c86b19de34a413f6483a06aa02205a5/summary

Can you please help me figure this out?

Hi. Could you enable support access for that device so we can take a look? And what exactly do you mean by stuck?

support access enabled.

i mean the device downloads the image, then by the time it reaches 99% it starts downloading the new release all over again. i think i saw it do that 6-7 times today.

balenaEngine is getting killed by the watchdog. When that happens during an application update, it usually means that the SD card is not handling the increased load well. We recommend SanDisk Extreme Pro cards in these cases, as in our experience they work well. Do you know if all Pis are using the same SD card model?

yeah all the pis are using the same sd card model. could that mean this card is about to fail, or could it be something else? all pis use the same card, the same pi model and the same application, so the load should be pretty much identical on all of them.

Hi,

There is something strange with your device. Checking it now, I see there’s no space left:

root@8c86b19:/mnt/data# df -h
Filesystem                         Size  Used Avail Use% Mounted on
devtmpfs                           1.8G     0  1.8G   0% /dev
/dev/disk/by-partuuid/a1fb009e-02  300M  295M     0 100% /mnt/sysroot/active
/dev/disk/by-label/resin-state      19M  226K   17M   2% /mnt/state
overlay                            300M  295M     0 100% /
tmpfs                              1.9G     0  1.9G   0% /dev/shm
tmpfs                              1.9G   37M  1.9G   2% /run
tmpfs                              1.9G     0  1.9G   0% /sys/fs/cgroup
tmpfs                              1.9G     0  1.9G   0% /tmp
tmpfs                              1.9G  276K  1.9G   1% /var/volatile
/dev/mmcblk0p1                      40M  8.2M   32M  21% /mnt/boot
/dev/mmcblk0p6                      29G   29G     0 100% /mnt/data

and it looks like balenaEngine is taking up all the disk space:

root@8c86b19:/mnt/data# du -h -d1
12K     ./lost+found
26K     ./resinhup
34M     ./root-overlay
66K     ./resin-data
29G     ./docker
29G     .
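
For anyone who wants to script this kind of check, here is a minimal sketch that scans pasted `df -h` output for filesystems at or above a usage threshold. The function name `full_filesystems` and the 95% default are assumptions made for illustration, and the sample is a shortened copy of the output above.

```python
def full_filesystems(df_output, threshold=95):
    """Return (mount_point, use_percent) pairs at or above threshold.

    Hypothetical helper: expects the column layout printed by `df -h`.
    """
    results = []
    for line in df_output.strip().splitlines():
        fields = line.split()
        # Skip the header, shell prompts, and anything without a numeric Use% column.
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        percent = fields[4][:-1]
        if not percent.isdigit():
            continue
        if int(percent) >= threshold:
            results.append((fields[5], int(percent)))
    return results


sample = """Filesystem                         Size  Used Avail Use% Mounted on
/dev/disk/by-label/resin-state      19M  226K   17M   2% /mnt/state
overlay                            300M  295M     0 100% /
/dev/mmcblk0p1                      40M  8.2M   32M  21% /mnt/boot
/dev/mmcblk0p6                      29G   29G     0 100% /mnt/data"""

for mount, use in full_filesystems(sample):
    print(f"{mount}: {use}% used")
```

The sketch only reports which mounts are full. Deciding which of them matter (here, `/mnt/data`) is still up to whoever reads the result.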

We can help you clean up and try to restart the engine, so please let us know if we have your permission to do so.

yeah definitely looks like something filled up /mnt/data. our app hasn’t been running so it’s not that. go ahead please.

Hello, the device seems to have recovered and is downloading the application now. As Trong mentioned, there seemed to be a lot of unused docker layers. I tried some cleanup to specifically remove the dangling ones, but balenaEngine still could not start on the device.
I ended up removing the supervisor and application images, which finally allowed the balena service to start and begin downloading the new release correctly. Since the device had to re-download the supervisor, I also took the liberty of updating it from 10.6.27 to 10.8.0. The latest version has a lot of crucial improvements in stability and error reporting, and we take any chance we get to update devices to it. Let me know if that is OK with you.

that is very okay, thanks. should i update the supervisor on all other devices or just leave it as is?

I don’t think it worked. it’s still stuck downloading. it was at 7% a minute ago and now it’s at 2%.

Hi there, we’ve taken a look at the device and it appears to be struggling with disk I/O. Running the device diagnostics shows rather high disk write latency (about 4 s to write 1 MB):

Slow disk writes detected: mmcblk0: 4101.79ms / write, sample size 980673
mmcblk0p1: 3390.78ms / write, sample size 9
mmcblk0p2: 1483.47ms / write, sample size 34
mmcblk0p3: 1696.64ms / write, sample size 176
mmcblk0p5: 1677.99ms / write, sample size 10134
mmcblk0p6: 4127.64ms / write, sample size 970320
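
If you want to compare these numbers before and after swapping the card, a minimal sketch like the one below pulls them out of the pasted text. The function name `slow_devices` and both thresholds (`max_ms`, `min_samples`) are arbitrary assumptions for illustration, not values used by the diagnostics themselves.

```python
import re

# Matches entries such as "mmcblk0p6: 4127.64ms / write, sample size 970320".
ENTRY = re.compile(r"(\w+): ([\d.]+)ms / write, sample size (\d+)")


def slow_devices(report, max_ms=1000.0, min_samples=100):
    """Return (device, ms_per_write, samples) tuples, slowest first.

    Hypothetical helper: entries with few samples are dropped as too noisy,
    and only devices slower than max_ms per write are kept.
    """
    entries = [
        (device, float(ms), int(samples))
        for device, ms, samples in ENTRY.findall(report)
    ]
    slow = [e for e in entries if e[1] > max_ms and e[2] >= min_samples]
    return sorted(slow, key=lambda e: e[1], reverse=True)


report = """Slow disk writes detected: mmcblk0: 4101.79ms / write, sample size 980673
mmcblk0p1: 3390.78ms / write, sample size 9
mmcblk0p2: 1483.47ms / write, sample size 34
mmcblk0p3: 1696.64ms / write, sample size 176
mmcblk0p5: 1677.99ms / write, sample size 10134
mmcblk0p6: 4127.64ms / write, sample size 970320"""

for device, ms, samples in slow_devices(report):
    print(f"{device}: {ms:.0f} ms per write over {samples} samples")
```

Sorting by latency puts `mmcblk0p6`, the partition mounted at `/mnt/data`, at the top, which is also where nearly all of the sampled writes happened.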

The best course of action at this point would be to replace the SD card in this device and run the health check again to verify disk write latency is reasonable.

For reference, we’ve had very few problems with SanDisk Extreme PRO cards.
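
As a rough independent sanity check, the sketch below times a single synced 1 MiB write. To be clear, this is not the balena health check, just an illustration of the idea, and the function name `time_synced_write` is an assumption. It also needs a Python interpreter, so it would have to run somewhere one is available, for example inside an application container with a volume on the card.

```python
import os
import tempfile
import time


def time_synced_write(directory, size_bytes=1024 * 1024):
    """Write size_bytes to a temporary file in directory, force it to disk,
    and return the elapsed time in seconds."""
    payload = os.urandom(size_bytes)
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        start = time.perf_counter()
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the storage device
        return time.perf_counter() - start


# The system temp directory is used here only so the example runs anywhere;
# point it at a directory that lives on the SD card to measure the card itself.
elapsed = time_synced_write(tempfile.gettempdir())
print(f"1 MiB synced write took {elapsed * 1000:.1f} ms")
```

A single write is a noisy measurement, so repeating it a few times and looking at the spread gives a more useful picture.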

got it. will do that. thanks!

just wanted to give an update. we didn’t do anything to the machine and it just recovered on its own. we didn’t swap the sd card. it’s weird. do you guys have any idea what could have caused it to recover?

The update might have proceeded in just the right way for your SD card to handle it without triggering any other problems. On the next update, keep an eye on this device to see whether the issue reappears.