We have issues with our current solution wearing out SD cards, resulting in unstable devices after a period of operation. One initiative taken has been to re-write parts of our solution to rely more on RAM storage than disk storage.
Now, my question is: how would you monitor whether this re-write has had the intended effect (fewer SD card writes)? Put another way: how would you measure how hard the current solution is on the SD card?
We have Prometheus node-exporter installed on all devices and are trying to analyse some of its data, but I’d be curious about the community’s take, and possibly more detailed guidance on what to look for in the Prometheus data.
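For what it's worth, node-exporter's diskstats collector exposes a counter named node_disk_written_bytes_total, which is derived from /proc/diskstats, so a query along the lines of rate(node_disk_written_bytes_total{device="mmcblk0"}[5m]) should show the write rate before and after the re-write. The device name mmcblk0 is an assumption here; check which block device your SD card shows up as. The Python sketch below does the same arithmetic by hand on two /proc/diskstats samples. The sample lines and the function names are made up for illustration.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def written_bytes(diskstats_text, device):
    """Return total bytes written to `device`, given the text of /proc/diskstats."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Columns: major, minor, name, reads completed, reads merged,
        # sectors read, ms reading, writes completed, writes merged,
        # sectors written, ...
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9]) * SECTOR_BYTES
    raise KeyError(f"device {device!r} not found")


def write_rate(before_text, after_text, device, seconds):
    """Average bytes per second written between two samples."""
    delta = written_bytes(after_text, device) - written_bytes(before_text, device)
    return delta / seconds


# Hypothetical samples taken 10 seconds apart (values invented for the example).
BEFORE = " 179       0 mmcblk0 100 0 2000 50 10 0 80 5 0 60 60"
AFTER = " 179       0 mmcblk0 100 0 2000 50 30 0 180 9 0 64 64"

rate = write_rate(BEFORE, AFTER, "mmcblk0", 10.0)
print(f"{rate:.0f} B/s")  # (180 - 80) sectors * 512 bytes / 10 s
```

On a real device you would read /proc/diskstats twice with a sleep in between instead of using the hard-coded samples.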
Hello, I would try using iotop to gather metrics on disk usage. In general, if my use case demanded write-heavy disk operations, I think I would have chosen a device that uses eMMC storage rather than an SD card. Let’s see what the community’s take is on this.
+1 for iotop. If I may add to the question, it would be very useful to classify, even broadly, what “heavy” writes means. Defining low, medium, and high write loads quantitatively, even if only roughly, could really help developers understand the hardware requirements for their projects. Thank you!
There isn’t an exact definition, but in my mind, if I have a device with sensors that I am gathering data from and saving it to disk at high frequency (multiple times per second), then I am constantly writing to disk, so this is a write-heavy use case. I can’t categorize use cases precisely, because each one is unique and has different requirements. But our approach here is correct: we are trying to figure out how often disk I/O happens and which process causes it.
Thank you for the input. Will go back to iotop. What I liked about the Prometheus node-exporter approach is that it allows for simpler collection of metrics for subsequent analysis. With iotop, I would think, I’d need to eyeball a box during operation. So I’m still curious if anyone has hints on how to monitor SD card writes with Prometheus.
However, a few follow-ups on the iotop approach:
In a multi-container setup, would I then need to run iotop in each container? Or how would you approach this?
In a multi-container setup, I have issues running iotop. It seems to fail with “OSError: Netlink error: No such file or directory (2)” (iotop v0.6), even in privileged containers. Thoughts?
Hi, iotop reads its stats from the kernel, so you can run it from a single container in your multi-container application.
Depending on your device type, you might need to enable some extra kernel configuration options in order to use iotop. We have an open issue to add these by default to a future OS release (https://github.com/balena-os/meta-balena/issues/2066).
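As an aside on where these numbers come from: as far as I know, iotop gets its per-task figures through the kernel's taskstats netlink interface, and the kernel also exposes cumulative per-process counters in /proc/<pid>/io when task I/O accounting is compiled in. The sketch below is a rough Python illustration of reading those counters directly. The function names are hypothetical, and it only sees processes that are visible in the PID namespace it runs in, so from inside a container that may not be every process on the device.

```python
import os


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def top_writers(proc_root="/proc", limit=10):
    """Return (write_bytes, pid, comm) tuples, largest writer first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                io_counters = parse_proc_io(f.read())
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited, no permission, or no I/O accounting
        rows.append((io_counters.get("write_bytes", 0), int(entry), comm))
    return sorted(rows, reverse=True)[:limit]


# Invented sample of what /proc/<pid>/io looks like.
SAMPLE = """rchar: 4096
wchar: 8192
syscr: 10
syscw: 20
read_bytes: 0
write_bytes: 12288
cancelled_write_bytes: 0
"""
print(parse_proc_io(SAMPLE)["write_bytes"])
```

Calling top_writers() periodically and diffing the results would give a crude, scriptable stand-in for watching iotop by eye. Note that write_bytes counts bytes the process caused to be sent to the storage layer, which is closer to SD card wear than wchar.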
Another possibility you might want to consider is using BCC tools. We are also working on a solution that will make it easier to use these with the default BalenaOS (https://github.com/balena-os/meta-balena/issues/2065). At the moment you will also need to rebuild the OS and the kernel to add the support that BCC tools require.
Hi, I have done some measurements on a system using BCC tools, which you can compare with your own results. This was on 2.60.1+rev5, with no application container other than the one used to run the BCC tools themselves.
The following are file accesses:
root@f2d5749d8657:/# /usr/share/bcc/tools/filetop -C 1800
Tracing... Output every 1800 secs. Hit Ctrl-C to end
17:08:08 loadavg: 0.22 0.06 0.02 7/230 586
TID     COMM             READS  WRITES R_Kb    W_Kb    T FILE
9545    biosnoop         6206   0      11534   0       R kallsyms
1470    balenad          26     80     104     320     R local-kv.db
2107    balena-engine-c  36     0      396     0       R config.json
2638    balena-engine-c  32     0      352     0       R config.json
1733    balenad          19     65     76      260     R local-kv.db
5095    balena-engine-c  30     0      330     0       R config.json
1483    balenad          20     60     80      240     R local-kv.db
1669    balenad          18     60     72      240     R local-kv.db
2071    balena-engine-c  28     0      308     0       R config.json
11301   balena-engine-r  76     0      295     0       R cgroup
9768    balena-engine-r  76     0      295     0       R cgroup
10652   balena-engine-r  76     0      295     0       R cgroup
10259   balena-engine-r  76     0      295     0       R cgroup
10408   balena-engine-r  76     0      295     0       R cgroup
10017   balena-engine-r  76     0      295     0       R cgroup
11067   balena-engine-r  76     0      295     0       R cgroup
11464   balena-engine-r  76     0      295     0       R cgroup
9612    balena-engine-r  76     0      295     0       R cgroup
1667    balenad          18     50     72      200     R local-kv.db
11014   cat              2      0      256     0       R timestamp
Apart from the above there will also be writes to the journal log. I have run another trace in parallel that reflects those too.
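To compare a trace like the one above with your own results, it can help to add up the W_Kb column per process name. Below is a minimal Python sketch of that. The function name is made up, the parsing assumes the eight-column layout shown in the listing, and the embedded sample contains only a few of the rows from the trace.

```python
from collections import defaultdict


def written_kb_by_comm(filetop_text):
    """Sum the W_Kb column of filetop output, grouped by the COMM column."""
    totals = defaultdict(int)
    for line in filetop_text.splitlines():
        # Columns: TID COMM READS WRITES R_Kb W_Kb T FILE
        fields = line.split(None, 7)
        if len(fields) != 8 or not fields[0].isdigit():
            continue  # skip the banner, loadavg, and header lines
        totals[fields[1]] += int(fields[5])
    return dict(totals)


# A few rows copied from the 1800-second trace in this thread.
SAMPLE = """\
TID COMM READS WRITES R_Kb W_Kb T FILE
9545 biosnoop 6206 0 11534 0 R kallsyms
1470 balenad 26 80 104 320 R local-kv.db
1733 balenad 19 65 76 260 R local-kv.db
1483 balenad 20 60 80 240 R local-kv.db
1669 balenad 18 60 72 240 R local-kv.db
1667 balenad 18 50 72 200 R local-kv.db
2107 balena-engine-c 36 0 396 0 R config.json
"""

totals = written_kb_by_comm(SAMPLE)
print(totals["balenad"])  # 320 + 260 + 240 + 240 + 200 = 1260
```

Over the 1800-second window that works out to roughly 0.7 KB/s written to local-kv.db by the balenad threads in this listing, not counting the journal writes.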
I/O accounting is now enabled in the kernel configuration from release v2.64.2 onwards, so you should be able to use iotop as well as the BCC tools, as previously mentioned by Alex.