Corruption of resin-state partition?

First things first: my use case is definitely non-standard, so I’m open to the possibility that I’m doing something wrong, and I understand that any support will be limited at best.

I’ve got a bunch of devices running a custom build of ResinOS on Raspberry Pi Compute Module 3s. I recently found that two separate devices stopped working due to a corrupted state partition. I suspect, but have not confirmed, that this corruption is being caused by power failure during a write operation. The affected devices were not corrupted at the same time or in the same circumstances, so some environmental causes can probably be ruled out.

The corruption left the filesystem on the partition unmountable. This failure to mount breaks many important things, such as the docker and dropbear ssh services.

Some other info: I mentioned that I’m using a custom build. It is based on Resin v2.3.0. The main difference that I believe may be relevant to this problem is that I’m using a 4.9.24 version of the kernel, as opposed to the 4.4 that resin uses by default.

My questions would be:

Has anyone else seen similar corruption of the state partition?

Is this to be expected if the devices are regularly shut down due to power loss, or should they be more resilient?

Could the corruption be caused by something other than unexpected shutdowns?

Are there any ways I could try to protect against this corruption (besides the obvious of shutting down properly, which I will try to do, but I still need to account for potential power loss)?

Thanks for your help!

Hey Sean,

This is a tricky one to answer. Is the main reason you are running a custom build to get the newer kernel? I ask because I see resin-raspberrypi will soon ship with 4.9.29, so you may be able to avoid the custom build.

We do see a fair amount of corruption, but I believe it is often attributed to SD cards, so I’d assume you’d be better off. A definite contributor is brownout, so the best thing you can do is get a good power supply.

I also know that we try to limit the writes we do. Is it possible that your custom build is performing a lot more write operations, for instance by logging to disk somewhere?
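
One rough way to check write volume, as a sketch only: on Linux, /proc/diskstats reports cumulative sectors written per block device, so comparing two snapshots shows how much was written in between. The function names below are made up for illustration. The field positions follow the kernel's diskstats layout, where the tenth field is sectors written, counted in 512-byte units.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


def parse_sectors_written(diskstats_text):
    """Map device name -> cumulative sectors written, from /proc/diskstats text."""
    written = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        # fields[2] is the device name, fields[9] is sectors written
        written[fields[2]] = int(fields[9])
    return written


def bytes_written_between(before_text, after_text):
    """Bytes written per device between two snapshots of /proc/diskstats."""
    before = parse_sectors_written(before_text)
    after = parse_sectors_written(after_text)
    return {
        dev: (after[dev] - before[dev]) * SECTOR_BYTES
        for dev in after
        if dev in before
    }


def read_diskstats(path="/proc/diskstats"):
    """Return the current contents of /proc/diskstats (Linux only)."""
    with open(path) as f:
        return f.read()
```

Calling read_diskstats() once, waiting an hour or so, calling it again, and passing both results to bytes_written_between() would give a per-device figure to compare between your custom build and a stock image.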

Hey Craig,

The kernel version is a big reason we’re using a custom build, but not the only one. I’m also including some out-of-tree kernel modules that our hardware needs, using htpdate/fake-hwclock instead of systemd-timesyncd, tweaking NetworkManager’s configuration a bit, adding an authorized ssh key, and making some other miscellaneous changes.

I’m pretty sure I’m not doing any more writing than the base image, though. I don’t have persistent logging enabled, and the only regular write I’m doing is fake-hwclock updating a file’s timestamp with the current time every 15 minutes.
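
For reference, here is a minimal sketch of the usual power-loss-tolerant pattern for a small periodic write like that one: write to a temporary file, fsync it, then rename it over the target. This is not what fake-hwclock itself does. The function name is hypothetical, and the pattern only protects the contents of that one file, not the filesystem as a whole.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Replace path with text so readers see either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file data to the storage device
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temp file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is durable (POSIX only)
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If power is lost partway through, the target file should hold either the previous contents or the new ones, and at worst a stray temporary file is left behind.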