on-device node core-dumps

I noticed today that on one of our sites the single service had the status "exited". I was able to easily restart it from the dashboard, but journalctl showed that it was constantly exiting. Further investigation showed that node was corrupted and would core dump.

What could be the reason for this? Presumably, SD card corruption? Any other possible reason?

I know that ideally we would send a new SD card, but that is not feasible right now, so are there any other options, such as fsck?

ID 71b132b65186d1c80107c86a5af6cd35

I have granted access.

Hi there, thanks for reaching out. I am curious, did you make any changes to the device (pushing a new release, etc.) before it exited? Also, what base image are you using? I ran diagnostics on the device and I see that saving the image to disk timed out, which usually suggests slow disk (SD card) write speeds. There weren't any disk failures detected, but that doesn't guarantee that there is no SD card corruption. What SD card are you using, by the way?

Hi, thanks for looking into it. I hadn't pushed a new image in some time - it was working fine and then suddenly it wasn't. I then pushed a bunch of images as I was trying to figure out why the app was continuously failing.

The base image is balenalib/%%BALENA_MACHINE_NAME%%-alpine-python:3.7-run with node added on via apk.

The SD card is a SanDisk Ultra. Now, we do write quite a bit of logs to the persistent /data volume, and I know that's generally frowned upon as causing wear and tear on the SD card. However, I don't understand why writing to a particular location would cause corruption in an image that is already on the disk.

Hi there, could you please take a look at the Device Health Checks? Go to the 'Device Health Checks' tab on the 'Diagnostics' page, click the 'Run checks' button, and send us the errors. You should be able to see some indicators there about what is wrong with the device or its services, and clicking on an error redirects you to our documentation.
Please also use 'Run diagnostics' on the 'Device Diagnostics' tab and attach the logs here. That will help us investigate further.

Georgia

@georgiats Here are the device health checks and diagnostics:

71b132b65186d1c80107c86a5af6cd35_2020.05.27_00.01.53+0000.log (533.7 KB)

In particular - there are many occurrences of

May 26 23:57:39 balenad[878]: Error: Could not create NMClient object: Timeout was reached.

which we see quite often when things go sour, as well as:

May 26 19:25:30 balenad[878]: time="2020-05-26T19:25:30.927311191Z" level=warning msg="unknown container" container=21d1e5fd54262ddd4353350df749e81179a3473abe8b5d4c587fa7574c1c845f module=libcontainerd namespace=plugins.moby

Thanks

Hi,

Does the device have a static IP or DHCP? The diagnostics show some network issues, which could prevent the device from communicating properly (and getting your latest pushes). Have you tried pinning to your last working release to see if it’s still failing? Based on what you’ve reported, it sounds like you still have connectivity…just trying to rule some things out.

John

@jtonello It has a dynamic IP from DHCP. This version runs in lots of other places, and was running here for quite some time until node suddenly got corrupted.

It could be worth trying a few things. Firstly, we could just try to recreate the container. I don’t imagine that would help if the image has been corrupted, but it’s certainly worth a try. Secondly, we can delete the image and have the supervisor redownload it. This should definitely help, again assuming it’s corruption.

What would worry me about these methods is that if it is corruption, it will likely happen again, as these tend not to be isolated incidents. Fsck can certainly be used as well, although its success rate is typically low; a rough sketch of what that might look like is included below for reference.
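This is a hypothetical sketch only, not a tested procedure: it assumes the data partition is labelled resin-data and mounted at /mnt/data, that e2fsck is available in the host OS, and that the engine and supervisor can be stopped so the partition can be unmounted. Please verify these details on your device before running anything.

# Hypothetical fsck sketch - verify service names, partition label and tool availability first
systemctl stop resin-supervisor balena          # stop the supervisor and the engine
umount /mnt/data                                # unmount the data partition
e2fsck -f /dev/disk/by-label/resin-data         # check and repair the filesystem
mount /dev/disk/by-label/resin-data /mnt/data   # remount the data partition
systemctl start balena resin-supervisor         # bring the engine and supervisor back up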

To remove the container, ssh into the host OS, run balena ps to find the specific container ID for the service, and then run balena rm -f <id>. The supervisor should start it back up fairly quickly and you can check the output. (If the supervisor doesn't immediately start it back up, you can force a re-application by restarting the supervisor with systemctl restart resin-supervisor.) A sketch of those steps is below. See if this fixes the issue first, otherwise:
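Roughly, that might look like the following. This is only a sketch: "my_service" is a placeholder for your service's container name as shown by balena ps, and the filter/format flags are standard balena-engine (Docker-compatible) options.

# On the host OS; "my_service" is a placeholder for your service's container name from balena ps
CONTAINER_ID=$(balena ps --filter "name=my_service" --format '{{.ID}}')
balena rm -f "$CONTAINER_ID"          # force-remove the service container
# Only if the supervisor doesn't recreate it shortly:
systemctl restart resin-supervisor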

To delete the image, run balena inspect <container-id> | jq '.[].Image' on the container ID found with the method above and note the resulting ID (we'll call it image-id). Then run balena rm -f <container-id>, followed by balena rmi -f <image-id>, and then restart the supervisor again. A sketch of those steps is below as well.
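For example, something along these lines (again just a sketch; substitute the real container ID reported by balena ps for <container-id>):

# On the host OS
IMAGE_ID=$(balena inspect <container-id> | jq -r '.[].Image')   # image backing the container
balena rm -f <container-id>                                     # remove the service container
balena rmi -f "$IMAGE_ID"                                       # remove the (possibly corrupted) image
systemctl restart resin-supervisor                              # supervisor re-downloads the image and recreates the service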

This should give us a decent insight into whether this is corruption or not.