Upgraded OS and app - diagnostics show errors

Hi,

I recently updated one of our field devices to the latest OS version, 2.48, and then updated our application. Everything seemed to go fine, but when I ran the diagnostics I got two errors:

Container engine balena is up, but has 1 unclean restarts and may be crashlooping (most recent start time: Tue 2020-05-19 18:35:27 UTC)
Saving images timed out, check data incomplete: 0ccc272d385f cc42f1e9307e

I believe the second always happens when the engine has issues.

Looking into the error logs:

root@ea6ccbb:~# journalctl  --unit balena.service --no-pager | grep error
May 19 15:11:45 ea6ccbb balenad[882]: time="2020-05-19T15:11:45.241374328Z" level=warning msg="failed to load plugin io.containerd.internal.v1.opt" error="mkdir /opt: read-only file system"
May 19 18:33:13 ea6ccbb balenad[882]: time="2020-05-19T18:33:13.014384074Z" level=error msg="a36a9850d27b41886df6afae9afa4a2f6b78fc6479699e3530f06436921ab2ff cleanup: failed to delete container from containerd: no such container"
May 19 18:33:13 ea6ccbb balenad[882]: time="2020-05-19T18:33:13.215879164Z" level=error msg="Handler for POST /v1.39/containers/a36a9850d27b41886df6afae9afa4a2f6b78fc6479699e3530f06436921ab2ff/start returned error: OCI runtime start failed: container process is already dead: unknown"
May 19 18:35:26 ea6ccbb balenad[882]: error   0x0
May 19 18:35:32 ea6ccbb balenad[6914]: time="2020-05-19T18:35:32.670617638Z" level=warning msg="failed to load plugin io.containerd.internal.v1.opt" error="mkdir /opt: read-only file system"
May 19 18:35:38 ea6ccbb 3064e329d44e[6914]: [event]   Event: Image download error {"error":{"message":"connect ECONNREFUSED /var/run/balena-engine.sock","stack":"Error: connect ECONNREFUSED /var/run/balena-engine.sock\n    at PipeConnectWrap.afterConnect [as oncomplete] (net.js:1097:14)"},"image":{"name":"registry2.balena-cloud.com/v2/199149ff1febbecdbd25aa4fa3430f87@sha256:a918255d4d17079316e9cbc45af1441f0fbcf51618d2fea246701749aab42cc9","appId":1273333,"serviceId":239090,"serviceName":"solmon","imageId":2171655,"releaseId":1342867,"dependent":0,"dockerImageId":null}}
May 19 18:35:38 ea6ccbb 3064e329d44e[6914]: [event]   Event: Device state report failure {"error":{"message":"connect ECONNREFUSED /var/run/balena-engine.sock","stack":"Error: connect ECONNREFUSED /var/run/balena-engine.sock\n    at PipeConnectWrap.afterConnect [as oncomplete] (net.js:1097:14)"}}
May 19 18:35:38 ea6ccbb 3064e329d44e[6914]: [error]   Scheduling another update attempt in 1000ms due to failure:  Error: connect ECONNREFUSED /var/run/balena-engine.sock
May 19 18:35:38 ea6ccbb 3064e329d44e[6914]: [error]         at PipeConnectWrap.afterConnect [as oncomplete] (net.js:1097:14)
May 19 18:35:38 ea6ccbb 3064e329d44e[6914]: [error]   Device state apply error Error: connect ECONNREFUSED /var/run/balena-engine.sock

… repeated many times

and then later

May 19 18:36:26 ea6ccbb balenad[6914]: time="2020-05-19T18:36:26.030744760Z" level=warning msg="error locating sandbox id 5f7760cf3cc8d95da79f8bcbc0cee5e149cd2e2044740656eae172d2b3bdc3a7: sandbox 5f7760cf3cc8d95da79f8bcbc0cee5e149cd2e2044740656eae172d2b3bdc3a7 not found"
May 19 18:36:30 ea6ccbb balenad[6914]: time="2020-05-19T18:36:30.890081824Z" level=warning msg="unmount task rootfs" error="no such file or directory" id=3064e329d44ea8164f764b54f5bdc13e77ab7001a30357c9b862dece6e708f32 path=/var/run/balena-engine/containerd/daemon/io.containerd.runtime.v1.linux/moby/3064e329d44ea8164f764b54f5bdc13e77ab7001a30357c9b862dece6e708f32
May 19 18:36:30 ea6ccbb balenad[6914]: time="2020-05-19T18:36:30.909994734Z" level=error msg="delete bundle" error="Failed to remove both bundle and workdir locations: rename /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/3064e329d44ea8164f764b54f5bdc13e77ab7001a30357c9b862dece6e708f32 /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/.3064e329d44ea8164f764b54f5bdc13e77ab7001a30357c9b862dece6e708f32: no such file or directory: rename /var/run/balena-engine/containerd/daemon/io.containerd.runtime.v1.linux/moby/3064e329d44ea8164f764b54f5bdc13e77ab7001a30357c9b862dece6e708f32 /var/run/balena-engine/containerd/daemon/io.containerd.runtime.v1.linux/moby/.3064e329d44ea8164f764b54f5bdc13e77ab7001a30357c9b862dece6e708f32: no such file or directory"
May 19 20:16:02 ea6ccbb balenad[6914]: time="2020-05-19T20:16:02.929304875Z" level=warning msg="error locating sandbox id e1a2203150018d902ad1824548959a6b1831d824049dba82854168eeddf60e59: sandbox e1a2203150018d902ad1824548959a6b1831d824049dba82854168eeddf60e59 not found"
May 19 20:16:06 ea6ccbb balenad[6914]: time="2020-05-19T20:16:06.660753031Z" level=warning msg="unmount task rootfs" error="no such file or directory" id=f8b46322ee3f003d9391429cceeec4d8482442a7c34b1eab9c10e629c91188a7 path=/var/run/balena-engine/containerd/daemon/io.containerd.runtime.v1.linux/moby/f8b46322ee3f003d9391429cceeec4d8482442a7c34b1eab9c10e629c91188a7
May 19 20:16:06 ea6ccbb balenad[6914]: time="2020-05-19T20:16:06.681130939Z" level=error msg="delete bundle" error="Failed to remove both bundle and workdir locations: rename /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/f8b46322ee3f003d9391429cceeec4d8482442a7c34b1eab9c10e629c91188a7 /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/.f8b46322ee3f003d9391429cceeec4d8482442a7c34b1eab9c10e629c91188a7: no such file or directory: rename /var/run/balena-engine/containerd/daemon/io.containerd.runtime.v1.linux/moby/f8b46322ee3f003d9391429cceeec4d8482442a7c34b1eab9c10e629c91188a7 /var/run/balena-engine/containerd/daemon/io.containerd.runtime.v1.linux/moby/.f8b46322ee3f003d9391429cceeec4d8482442a7c34b1eab9c10e629c91188a7: no such file or directory"
May 19 20:32:01 ea6ccbb balenad[6914]: time="2020-05-19T20:32:01.953071497Z" level=error msg="Can't add file /var/lib/docker/tmp/docker-export-610895789/1b4817bdbc01b72a913735288434a993f35c8cbc206a5632bdb20466cbb003fd/layer.tar to tar: io: read/write on closed pipe"
May 19 20:32:01 ea6ccbb balenad[6914]: time="2020-05-19T20:32:01.959058461Z" level=error msg="Can't close tar writer: io: read/write on closed pipe"
May 19 20:36:22 ea6ccbb balenad[6914]: time="2020-05-19T20:36:22.048232467Z" level=error msg="Can't add file /var/lib/docker/tmp/docker-export-473862662/1fb4db48c9d6acea64de14855f1cda9f0fef512bec80d785044f41e6e827017e/layer.tar to tar: io: read/write on closed pipe"
May 19 20:36:22 ea6ccbb balenad[6914]: time="2020-05-19T20:36:22.053704435Z" level=error msg="Can't close tar writer: io: read/write on closed pipe"

As far as I can tell, everything is working OK, and balena ps shows that the containers are running. My guess is that there was an issue with the engine but it somehow recovered?
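
In case it helps, here is roughly what I checked (a sketch; I'm assuming balena.service is the engine's systemd unit, as on balenaOS, and that NRestarts is the standard systemd restart counter):

```shell
# Sketch: confirm the engine is up and see how many times systemd
# restarted it. Unit name balena.service assumed (balenaOS).
systemctl is-active balena.service
systemctl show balena.service --property=NRestarts
balena ps   # containers should all show an "Up ..." status
```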

I’m interested in understanding these errors - what do they mean and are they indicative of a real problem?

  • the container that is missing prior to deletion
  • the issue connecting to the balena-engine socket
  • the various files missing in the second set of errors
  • the errors writing to one of the docker layers because of a closed pipe

I see container engine issues quite often on our deployed devices, but despite the excellent masterclass documentation I don’t have enough of a mental model of the system to interpret them.

@0xff, are there errors in the output of journalctl -au resin-supervisor --no-pager?

As for the errors that you quoted, did they only happen shortly after the host OS or application update, or are they recurring? If they appeared only after the update, some of the errors could be a result of balenaEngine being killed by the watchdog because it was too slow to respond, and it might have been slow because of a large app update (container download and replacement).
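
If you want to check for watchdog kills yourself, a grep over the engine's journal should surface them. This is only a sketch: the exact log wording varies between systemd versions, so the pattern is intentionally broad.

```shell
# Sketch: look for watchdog-related terminations of the engine unit.
# Message wording differs across systemd versions.
journalctl --unit balena.service --no-pager \
  | grep -iE 'watchdog|sigabrt|timed out'
```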

I will also ask the OS team for more specific answers to your questions.

There were errors at the beginning of the logs; after that it was clean:

=============================================================
root@ea6ccbb:~# journalctl -au resin-supervisor --no-pager
-- Logs begin at Tue 2020-05-19 15:11:20 UTC, end at Tue 2020-05-19 23:57:22 UTC. --
May 19 17:36:22 ea6ccbb resin-supervisor[1618]: Error response from daemon: No such container: resin_supervisor
May 19 17:36:23 ea6ccbb resin-supervisor[1641]: active
May 19 17:36:46 ea6ccbb resin-supervisor[1642]: Error: No such object: resin_supervisor
May 19 17:36:48 ea6ccbb resin-supervisor[1642]: Error: No such container: resin_supervisor
May 19 17:37:25 ea6ccbb resin-supervisor[1642]: Starting system message bus: dbus.
May 19 17:37:25 ea6ccbb resin-supervisor[1642]:  * Starting Avahi mDNS/DNS-SD Daemon: avahi-daemon
May 19 17:37:25 ea6ccbb resin-supervisor[1642]:    ...done.
May 19 17:38:50 ea6ccbb resin-supervisor[1642]: [info]    Supervisor v10.8.0 starting up...

As for the errors that you quoted, did they only happen shortly after the host OS or application update, or are they recurring? If they appeared only after the update, some of the errors could be a result of balenaEngine being killed by the watchdog because it was too slow to respond, and it might have been slow because of a large app update (container download and replacement).

I think it was only after startup. However, the diagnostic checks continue to show the same errors.

Hi there, what is the current situation of your application? Is it running or not?
Could you please copy and paste the errors from the Device Health Checks page you are referring to?
If you click on the errors you get redirected to documentation that can help you understand the issue.
It would also be helpful if you downloaded the Device Diagnostics from the Diagnostics page and attached them here.
It is possible that the engine crashed, but we need to understand the current situation.
Thanks

Hi,

Yes, as far as I can tell, the situation is the same as I described in the post: everything seems fine, but the device health checks still show the same two errors I posted above.

I’m sorry I don’t understand what you’re asking beyond the two errors that I posted above. Here they are again:

Container engine balena is up, but has 1 unclean restarts and may be crashlooping (most recent start time: Tue 2020-05-19 18:35:27 UTC)
Saving images timed out, check data incomplete: 0ccc272d385f cc42f1e9307e

The first is for check_container_engine and the second is for check_image_corruption.

Yes, the documentation is generally excellent, especially the masterclasses. In this case, the documentation says:

check_container_engine

Summary

This check confirms the container engine is up and healthy. Additionally, this check confirms that there have been no unclean restarts of the engine. These restarts could be caused by crashlooping. The container engine is an integral part of the balenaCloud pipeline.

Triage

It is best to let balena’s support team take a look before restarting the container engine. At the very least, take a diagnostics snapshot before restarting anything.

ea6ccbb98c7894c164141ae7c64c9b7f_2020.05.20_16.10.31 0000.log (639.8 KB)

Thank you for looking at this!

Hi, reading through this thread, I understand that even though the device is working normally you seek clarification about the two error messages shown in the device checks.

The second message, check_image_corruption, means that a check on specific images was interrupted before completion. As there are no messages indicating actual corruption, there is no need to worry. We are in the process of collecting data to tune this timeout, as it is affected by the storage media and image size; see https://github.com/balena-io/device-diagnostics/issues/203. It would help if you could provide data from your device, per the instructions in that issue, so we can take it into account.
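
For context, the check does roughly the equivalent of exporting each image and reading the archive end to end, which is why slow storage or large images can push it past the timeout. A rough hand-run equivalent (a sketch, not the exact script the diagnostics use; <image-id> stands for one of the IDs from the error message):

```shell
# Sketch: export an image and read the resulting tar stream end to
# end. A clean exit suggests the image layers are readable; this is
# not the exact diagnostics script.
balena save <image-id> | tar -tf - > /dev/null && echo "archive readable"
```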

The first message, check_container_engine, indicates that the balena engine crashed but evidently recovered. To help us understand why, please send us the report generated after running the checks in Diagnostics > Device Diagnostics. The download button appears to the right of the “Run Diagnostics” button once they finish running. It would also help if you could recall how the error came about. You mention you performed a hostOS update and then an application update:

  • Which hostOS version were you updating from?
  • Did you perform the application update right after or did you reboot the device in the middle?
  • Do you also see balena engine errors after rebooting the device? In other words, how reproducible is this problem?

Thanks,
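
Also, should a restart of the engine become necessary later (after the diagnostics snapshot is captured), it can be done through the engine's systemd unit; a sketch, assuming balenaOS's balena.service:

```shell
# Sketch: restart the container engine and confirm a clean start in
# the journal. Unit name balena.service assumed (balenaOS).
systemctl restart balena.service
journalctl --unit balena.service -b --no-pager | tail -n 20
```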

Thanks for the quick response!

The report is attached to my previous post.

2.32 => 2.48

I did not reboot. Doesn’t a Host OS update cause a reboot itself? After I noticed that the Host OS update worked (it took several hours), I updated the app image.

I didn’t reboot in case you guys wanted to look at it per the documentation suggestion. I have granted support access. ea6ccbb98c7894c164141ae7c64c9b7f

Hi, thanks for your answers and apologies for not realizing the report had already been attached. I will forward all the details to the engine maintainer and we will let you know what the next steps are.

Hi 0xff – thanks again for letting us know about this issue. We’ve passed your logs to our internal engineers, and at this point we do not require further access to the device. Thanks very kindly for keeping it open for us, and please let us know if this happens again.

All the best,
Hugh