RPi 3 B+ multicontainer loses connection completely after some days

Hi,
I have a few RPi 3B+ devices deployed with multicontainer applications. Three of them have lost connectivity after a couple of weeks. I have direct access to one of them. The RPi boots properly, and the Ethernet network and cabling look fine, but the device is NOT getting an IP address. I took out the SD card and checked the resin-boot/system-connections folder, which contains the 2 “ignore” files. As far as I understand, there is no need for a resin-ethernet file, since NetworkManager should handle Ethernet automatically.
I can confirm that the problem is not caused by anything other than balena/the RPi (not the router, internet, cabling …), since I have the problem in different locations where everything was working fine before.
I was wondering whether a container growing too large could cause this. I’m not sure, but since I have a Home Assistant container running, maybe it is logging too much. Still, it seems weird that this would leave Balena/Resin unable to get an IP…
Since I have info in the containers that I don’t want to lose (I have not yet implemented a transfer of config files…), I don’t want to re-flash a new image and deploy it again (also, I don’t have access to some of the remote devices).
Any suggestions? I’m running out of ideas. I thought about testing with a resin-ethernet file in the resin-boot/system-connections folder, but it is read-only (and that wouldn’t solve my remote problems).
Thanks

UPDATE: After a few hours of investigating, I have the feeling that the SD cards are getting corrupted. They became read-only, with no way to reformat them (I tried many tools). They are supposedly good ones (Samsung EVO Plus 32GB), but at the moment this is the only explanation I have. The Raspberry Pis are all running on AC power (not USB). Maybe Home Assistant recording data continuously is too much for them… I don’t know.

Hi,

What OS version are you using?
It is true that continuously logging to the SD card itself can lead to corruption.

Since I have info in the containers that I don’t want to lose

I was a bit confused by this. Are you using the data partition to store your files? If so, you should be able to mount it on your computer and retrieve the files from there.
Let me also share with you the documentation page about this

To detect SD card corruption, you can SSH into the hostOS from the device summary page, run dmesg, and search for ext4 or similar errors.
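As a rough illustration of that dmesg search, here is a minimal Python sketch that filters saved kernel log output for lines that often accompany a failing card. The pattern list is only a heuristic assumption, not an official or exhaustive set, and the function and file names are hypothetical. The hostOS may not ship Python, so this assumes the dmesg output was saved to a file and is scanned somewhere Python is available.

```python
import re
import sys

# Heuristic patterns (assumptions, not an exhaustive list) that often
# show up in Linux kernel logs when an SD card is failing.
SUSPECT_PATTERNS = [
    re.compile(r"ext4-fs (error|warning)", re.IGNORECASE),
    re.compile(r"i/o error", re.IGNORECASE),
    re.compile(r"remounting filesystem read-only", re.IGNORECASE),
    re.compile(r"mmc\d+:.*error", re.IGNORECASE),
]


def find_storage_errors(log_text):
    """Return the log lines that match any suspect pattern, in order."""
    return [
        line
        for line in log_text.splitlines()
        if any(p.search(line) for p in SUSPECT_PATTERNS)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python3 scan_dmesg.py saved_dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        hits = find_storage_errors(fh.read())
    for line in hits:
        print(line)
    print(f"{len(hits)} suspicious line(s) found")
```

A non-empty result is only a hint that the card deserves a closer look; an empty result does not prove the card is healthy.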

Where? In the containers I am using Debian (one with Home Assistant, and another with an MQTT to read data from a device).
I have, indeed, recovered the data files from one SD card. The problem is that I don’t have access to the other SD cards (remote locations). Since the connectivity is completely down, I can’t access the devices by any means (SSH or otherwise).
Obviously, I can’t connect to the hostOS either. But I’ll take note of remotely checking the status of the SD cards on the other devices I currently have running (so I can monitor them in case things start to fail…).
I don’t know if there is a way to make the resin-boot and host OS partitions resilient to SD corruption in other “parts” of the SD card (where the containers or data are stored). It’s a pity that corruption in those partitions of the SD card may prevent the whole of balena from booting…

Hi @Kloonich,

If you have another device available locally on the same subnet as the downed devices, you may be able to ssh there and then ssh to the device that is “down” (there is always a chance it is disconnected from balena, but still alive).
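Before attempting that hop, it can help to check from the local device whether the “down” device still answers on its SSH port at all. The following is a minimal Python sketch under some assumptions: the default port 22222 is the one balenaOS conventionally uses for hostOS SSH (adjust it if your setup differs), the addresses in the usage comment are hypothetical, and the function names are made up for illustration. It only tests TCP reachability and does not log in.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def reachable_devices(hosts, port=22222, timeout=3.0):
    """Return the subset of hosts that accept TCP connections on port."""
    return [h for h in hosts if port_is_open(h, port, timeout)]


# Example with hypothetical addresses on the local subnet:
# print(reachable_devices(["192.168.1.50", "192.168.1.51"]))
```

If a device shows up as reachable here, it is probably still alive and just disconnected from balena. If it never obtained an IP address, as in the case described above, it will not show up at all.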

While writing is expensive in terms of SD health, if you are publishing the data via MQTT you should be protected against losing data if a device fails in some way.

It is on our roadmap to provide better monitoring and metrics for when SD cards start to fail, though in my experience these corruptions are more like a waterfall than a gentle degradation.

We have optimized the OS in various ways to reduce the stress on the SD card, but unfortunately there is only so much we can do in the face of failing hardware.

I hope this information helps!