I have just moved a device from one Balena app to another and it is struggling to download the service images. The images start downloading and some of them reach as far as 80% but then one after another the images fail to download.
While writing this forum post I managed to solve the problem, although I am unsure whether that was due to my actions or a change on the container registry side. I no longer need help with this issue, but I thought it would be useful to share what occurred. Perhaps a member of the balena team could comment on whether any of the analysis or resolution steps would be helpful when facing a similar issue? In all likelihood the actual resolution was simply rebooting the device.
Here is an extract from the logs produced when the downloads fail. I have done my best to preserve the information in the logs below while removing anything unique to the device. If more logs would be useful, I am happy to share them.
17.12.19 11:00:18 (+0000) Downloading image '2f1897@sha256'
17.12.19 11:00:19 (+0000) Downloading image 'f58e28@sha256'
17.12.19 11:00:19 (+0000) Failed to download image 'f58e28@sha256' due to '(HTTP code 500) server error - Get https://: dial: lookup registry2.balena-cloud.com on 127.0.0.2:53: read udp 127.0.0.1:33869->127.0.0.2:53: read: connection refused '
17.12.19 11:00:19 (+0000) Failed to download image '2f1897@sha256' due to '(HTTP code 404) no such image - no such image: 2f1897@sha256: No such image: 2f1897@sha256 '
17.12.19 11:00:19 (+0000) Downloading image 'f58e28@sha256'
17.12.19 11:00:20 (+0000) Downloading image '2f1897@sha256'
17.12.19 11:00:20 (+0000) Failed to download image '2f1897@sha256' due to '(HTTP code 500) server error - Get https://v2/2f1/manifests/sha256: Get https://api.balena-cloud.com/auth/: dial tcp: lookup api.balena-cloud.com on 127.0.0.2:53: no such host '
17.12.19 11:00:21 (+0000) Failed to download image 'f58e28@sha256' due to '(HTTP code 404) no such image - no such image: f58e28@sha256: No such image: f58e28@sha256 '
17.12.19 11:00:22 (+0000) Downloading image 'f58e28@sha256'
17.12.19 11:00:22 (+0000) Downloading image '2f1897@sha256'
17.12.19 11:00:22 (+0000) Failed to download image '2f1897@sha256' due to '(HTTP code 500) server error - Get https://: dial 52.72.159.244:443: connect: network is unreachable '
17.12.19 11:00:23 (+0000) Downloading image '2f1897@sha256'
17.12.19 11:00:23 (+0000) Failed to download image '2f1897@sha256' due to '(HTTP code 500) server error - Get https://: dial: lookup registry2.balena-cloud.com on 127.0.0.2:53: read udp 127.0.0.1:33629->127.0.0.2:53: read: connection refused '
17.12.19 11:00:23 (+0000) Downloading image '2f1897@sha256'
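For anyone triaging a similar log, here is a rough Python sketch that groups the failure lines by apparent cause. The category names and regular expressions are my own assumptions based on the error strings in this extract, not anything defined by balena, so treat it as a starting point only.

```python
import re
from collections import Counter

# Hypothetical categories; the patterns are guesses derived from the
# error strings seen in the log extract, not an official error taxonomy.
CATEGORIES = [
    ("dns", re.compile(r"lookup \S+ on \S+:53")),
    ("network_unreachable", re.compile(r"network is unreachable")),
    ("image_not_found", re.compile(r"HTTP code 404")),
]


def classify(line):
    """Return a category name for a failure line, or None for other lines."""
    if "Failed to download image" not in line:
        return None
    for name, pattern in CATEGORIES:
        if pattern.search(line):
            return name
    return "other"


def summarise(lines):
    """Count failure lines per category."""
    return Counter(c for c in map(classify, lines) if c is not None)


# A few lines copied from the extract above.
SAMPLE = [
    "17.12.19 11:00:18 (+0000) Downloading image '2f1897@sha256'",
    "17.12.19 11:00:19 (+0000) Failed to download image 'f58e28@sha256' due to '(HTTP code 500) server error - Get https://: dial: lookup registry2.balena-cloud.com on 127.0.0.2:53: read udp 127.0.0.1:33869->127.0.0.2:53: read: connection refused '",
    "17.12.19 11:00:19 (+0000) Failed to download image '2f1897@sha256' due to '(HTTP code 404) no such image - no such image: 2f1897@sha256: No such image: 2f1897@sha256 '",
    "17.12.19 11:00:20 (+0000) Failed to download image '2f1897@sha256' due to '(HTTP code 500) server error - Get https://v2/2f1/manifests/sha256: Get https://api.balena-cloud.com/auth/: dial tcp: lookup api.balena-cloud.com on 127.0.0.2:53: no such host '",
    "17.12.19 11:00:22 (+0000) Failed to download image '2f1897@sha256' due to '(HTTP code 500) server error - Get https://: dial 52.72.159.244:443: connect: network is unreachable '",
]

counts = summarise(SAMPLE)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

On these sample lines, the failures that mention a lookup against the local resolver on port 53 fall into the `dns` bucket, which is what made me look at connectivity next.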
Having looked at other posts and responses, I ran an internet connectivity test using `ping -c 120 8.8.8.8` to check whether this was a local network issue. Overall packet loss was 3%, so I do not think the network is to blame for this issue.
--- 8.8.8.8 ping statistics ---
120 packets transmitted, 116 packets received, 3% packet loss
round-trip min/avg/max = 1.860/4.185/22.879 ms
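As a small sanity check on that figure, the loss percentage can be recomputed from the summary line. This is only a sketch: the `packet_loss` helper is a hypothetical name of mine, and the regular expression assumes the summary format shown above.

```python
import re


def packet_loss(summary):
    """Recompute packet loss (in percent) from a ping summary line."""
    match = re.search(
        r"(\d+) packets transmitted, (\d+) (?:packets )?received", summary
    )
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


# Summary line copied from the ping output above.
SUMMARY = "120 packets transmitted, 116 packets received, 3% packet loss"
loss = packet_loss(SUMMARY)
print(f"{loss:.2f}% packet loss")
```

Four lost packets out of 120 works out to roughly 3.3%, which ping reports as 3%.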
In addition, I have run the device health checks and diagnostics. The only concern is the output from `check_write_latency`, which reports slow disk writes, although I think this could simply be due to the amount of data being continually written to the SD card as the images are repeatedly downloaded.
Name | Success | State |
---|---|---|
check_write_latency | Failed | Slow disk writes detected: mmcblk0: 2054.46ms / write, sample size 33089 mmcblk0p6: 2065.03ms / write, sample size 32916 |
There is some concerning output from `systemctl status balena --no-pager`, which suggests that data is being written outside of the image. I am unsure how to resolve this, but I started investigating which actions could be run on the host OS to clean up old images.
Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.736821347Z" level=warning msg="found leaked image layer sha256:a48e90 platform linux"
Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.736886555Z" level=warning msg="found leaked image layer sha256:beb774 platform linux"
Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.736960408Z" level=warning msg="found leaked image layer sha256:810278 platform linux"
Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.737133219Z" level=warning msg="found leaked image layer sha256:0473b2 platform linux"
Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.737211447Z" level=warning msg="found leaked image layer sha256:044f8c platform linux"
Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.737276498Z" level=warning msg="found leaked image layer sha256:1c579e platform linux"
Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.737327123Z" level=warning msg="found leaked image layer sha256:627bf5 platform linux"
Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.737397278Z" level=warning msg="found leaked image layer sha256:6365e0 platform linux"
Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.737455715Z" level=warning msg="found leaked image layer sha256:b7fbf1 platform linux"
Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.737504464Z" level=warning msg="found leaked image layer sha256:ea3a56 platform linux"
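To get a feel for how many distinct layers were being reported, the warnings can be reduced to a set of layer IDs. This is a minimal sketch that assumes the message format shown above; `leaked_layers` is a hypothetical helper of mine, not part of any balena tooling.

```python
import re

# Pattern assumed from the warning text in the status output above.
LEAKED = re.compile(r"found leaked image layer (sha256:[0-9a-f]+)")


def leaked_layers(lines):
    """Return the set of distinct layer IDs named in leaked-layer warnings."""
    found = set()
    for line in lines:
        match = LEAKED.search(line)
        if match:
            found.add(match.group(1))
    return found


# Three lines copied from the output above (IDs are already shortened there).
SAMPLE = [
    'Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.736821347Z" level=warning msg="found leaked image layer sha256:a48e90 platform linux"',
    'Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.736886555Z" level=warning msg="found leaked image layer sha256:beb774 platform linux"',
    'Dec 17 11:06:45 af1f06f balenad[879]: time="2019-12-17T11:06:45.736960408Z" level=warning msg="found leaked image layer sha256:810278 platform linux"',
]

layers = leaked_layers(SAMPLE)
print(f"{len(layers)} distinct leaked layers")
```

Comparing this count before and after a cleanup or reboot would show whether the warnings are shrinking or growing.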
I ran `balena image prune` on the host OS and it deleted enough dangling images to reclaim 3.2GB of space. Could this space just be the images that were partially downloaded? In that case it would be a symptom of the problem rather than the cause.
As a final step I rebooted the device and at the moment the services are downloading without error.