Monitor SD card wear and tear

We have issues with our current solution wearing out SD cards, resulting in unstable devices after a period of operation. One initiative has been to rewrite parts of our solution to rely more on RAM storage than on disk storage.

Now, my question is: how would you monitor whether this rewrite has indeed had the intended effect (less SD card writing)? Put another way: how would you monitor how hard the current solution is on the SD card?

We have the Prometheus node-exporter installed on all devices and have tried to analyse some of its data, but I’d be curious about the community’s take, and about more detailed guidance on what to look for in the Prometheus data.
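For context, the node-exporter's diskstats collector already exposes cumulative per-device write counters, so the write rate can be graphed directly. Assuming the SD card shows up as `mmcblk0` (adjust the device label for your board), queries along these lines seem like a starting point:

```promql
# Bytes written per second to the SD card, averaged over 5 minutes
rate(node_disk_written_bytes_total{device="mmcblk0"}[5m])

# Completed write operations per second (many small writes can wear
# flash faster than a few large ones)
rate(node_disk_writes_completed_total{device="mmcblk0"}[5m])
```

Comparing these series before and after deploying the rewrite would show whether the write volume actually dropped.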

Thank you.

Hello, I would try using iotop to gather metrics on disk usage. In general, if my use case demanded write-heavy disk operations, I think I would have gone down the path of using a device with eMMC storage rather than SD cards. Let’s see what the community’s take is on this.


+1 for iotop. If I may add to the question: it would be very useful to classify, even broadly, what “heavy” writes means exactly. I think defining quantitatively, if only roughly, the meaning of low/medium/high write loads could really help developers understand the hardware requirements for their projects. Thank you!


There isn’t an exact definition, but in my mind, if I have a device with sensors that I am gathering data from and saving to disk at high frequency (multiple times per second), then I am constantly writing to disk, so that is a write-heavy use case. I can’t categorize use cases exactly, because every one is unique and has different requirements. But the approach here is correct: we are trying to figure out how often writes happen and which process causes the disk I/O.
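As a rough illustration of why “multiple times per second” counts as heavy (the numbers below are mine, not measurements from this thread):

```shell
# Back-of-envelope: a 512-byte sensor sample written 10 times per
# second, continuously, adds up over a day.
bytes_per_sample=512
samples_per_sec=10
per_day=$((bytes_per_sample * samples_per_sec * 60 * 60 * 24))
echo "$per_day bytes/day (~$((per_day / 1024 / 1024)) MiB/day)"
```

The real flash wear is worse than the byte count suggests, because each small write typically forces the card to rewrite a whole erase block (write amplification).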


Thank you for the input. Will go back to iotop. What I liked about the Prometheus node-exporter approach is that it allows for simpler collection of metrics for subsequent analysis. With iotop, I would think, I’d need to eyeball a box during operation. So I’m still curious if anyone has hints on how to monitor SD card writes with Prometheus.
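For unattended collection, iotop itself can log non-interactively (`iotop -bot -qqq -d 30` in batch mode, appending to a file), and even without iotop, `/proc/diskstats` gives per-device counters that can be sampled and analysed later. A sketch, assuming the SD card appears as `mmcblk0` (substitute your block device):

```shell
# Log cumulative sectors written for the SD card, so write activity can
# be analysed after the fact instead of watched live. In /proc/diskstats,
# field 3 is the device name and field 10 is sectors written (x512 for
# bytes); "mmcblk0" is an assumption about the device name.
DEV=${DEV:-mmcblk0}
sectors_written() {
  awk -v dev="$DEV" '$3 == dev { print $10 }' "${1:-/proc/diskstats}"
}
# One sample; run from a loop or cron for a time series:
echo "$(date -u +%FT%TZ) $(sectors_written)"
```

Diffing successive samples gives the write rate without needing an interactive tool.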

However, a few follow-ups on the iotop approach:

  1. in a multi-container setup, would I then need to run iotop in each container? Or how would you approach this?
  2. in a multi-container setup, I have trouble running iotop. It fails with “OSError: Netlink error: No such file or directory (2)” (iotop v0.6), even in privileged containers. Thoughts?

Thank you again.

Hi, iotop reads its stats from the kernel, so you can run it from just a single container in your multi-container application.

Depending on your device type, you might need to build your kernel with some extra configuration options in order to use iotop. We have an open issue to add these by default in a future OS release (https://github.com/balena-os/meta-balena/issues/2066).
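The “Netlink error” above usually means the kernel lacks taskstats support, which iotop depends on. The options below are the usual taskstats-related ones; whether each is enabled can be checked against the running kernel when it exposes its config:

```shell
# Check the running kernel for the config options iotop typically needs.
# /proc/config.gz is only present when the kernel was built with
# CONFIG_IKCONFIG_PROC; otherwise check the config file used to build it.
for opt in CONFIG_TASKSTATS CONFIG_TASK_DELAY_ACCT \
           CONFIG_TASK_XACCT CONFIG_TASK_IO_ACCOUNTING; do
  zcat /proc/config.gz 2>/dev/null | grep -w "$opt" || echo "$opt: not found"
done
```

Each option should show up as `=y`; a `# ... is not set` line (or “not found”) points at the missing piece.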

Another possibility you might want to consider is the BCC tools. We are also working on a solution that will make it easier to use these with the default balenaOS (https://github.com/balena-os/meta-balena/issues/2065). At the moment you would also need to rebuild the OS and the kernel to add the support that the BCC tools require.


Hi, I have done some measurements on a system using the bcc tools, which you can use to compare against your own results. This was on 2.60.1+rev5, with no application container other than the one used to run the bcc tools themselves.
The following are the file accesses:

root@f2d5749d8657:/# /usr/share/bcc/tools/filetop -C 1800
Tracing... Output every 1800 secs. Hit Ctrl-C to end

17:08:08 loadavg: 0.22 0.06 0.02 7/230 586

TID    COMM             READS  WRITES R_Kb    W_Kb    T FILE
9545   biosnoop         6206   0      11534   0       R kallsyms
1470   balenad          26     80     104     320     R local-kv.db
2107   balena-engine-c  36     0      396     0       R config.json
2638   balena-engine-c  32     0      352     0       R config.json
1733   balenad          19     65     76      260     R local-kv.db
5095   balena-engine-c  30     0      330     0       R config.json
1483   balenad          20     60     80      240     R local-kv.db
1669   balenad          18     60     72      240     R local-kv.db
2071   balena-engine-c  28     0      308     0       R config.json
11301  balena-engine-r  76     0      295     0       R cgroup
9768   balena-engine-r  76     0      295     0       R cgroup
10652  balena-engine-r  76     0      295     0       R cgroup
10259  balena-engine-r  76     0      295     0       R cgroup
10408  balena-engine-r  76     0      295     0       R cgroup
10017  balena-engine-r  76     0      295     0       R cgroup
11067  balena-engine-r  76     0      295     0       R cgroup
11464  balena-engine-r  76     0      295     0       R cgroup
9612   balena-engine-r  76     0      295     0       R cgroup
1667   balenad          18     50     72      200     R local-kv.db
11014  cat              2      0      256     0       R timestamp

Apart from the above, there will also be writes to the journal log. I ran another trace in parallel that reflects those too.
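If the journal turns out to dominate the write volume, journald can be told to keep its logs in RAM only. A config sketch (note that volatile storage loses logs across reboots, and the drop-in path/filename is an assumption for illustration):

```
# /etc/systemd/journald.conf.d/volatile.conf
[Journal]
Storage=volatile
```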

A file has been uploaded using Jellyfish: https://jel.ly.fish/9b185cc3-2245-4bd8-97ff-67430fd5d4f0