data filesystem parameters

Hi,

I’m looking for a way to influence filesystem mount options or parameters for balena devices.
There are a few things I’m missing that concern resilience or reliability.

Currently, on an x86 test node, I can see the following, using resin-data as an example.

mount options:

root@f355edf:~# grep /resin-data /var/lib/docker /proc/mounts
/proc/mounts:/dev/sda6 /resin-data ext4 rw,relatime 0 0

filesystem parameters (tune2fs -l):

root@f355edf:~# tune2fs -l /dev/sda6 | grep -e "Filesystem features" -e "Errors behavior"
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Errors behavior:          Continue

So what?

Thinking of embedded devices, or anything that’s supposed to keep running for a while, I’m missing a few things here.

I would like to:

  • mount with discard on systems where I know the SD card/eMMC controller supports it, or where we boot off an M.2 SSD. We pick industrial disks, but I’d rather not undermine that choice by treating them improperly. I don’t want a TRIM cron job either, as that causes heavy IO bursts
  • use errors=remount-ro or errors=panic for devices with IO failures. The current continue means they will shred themselves. We’ve had exactly that happen with a balena node; it only got worse with each restart. In the end it wasn’t possible to repair it remotely because it had corrupted itself so badly, and that made diagnosis take longer.
  • use block_validity so the filesystem sanity-checks metadata block mappings at runtime and catches corruption as early as possible
  • also handle journal settings; there are options that control how journal write failures are handled, and whether the journal is checksummed (e.g. journal_checksum)…
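To make the errors= part concrete, this is what I’d run against a throwaway ext4 image (a sketch, assuming e2fsprogs is installed; on a real device this would target the actual data partition, e.g. /dev/sda6, or ideally be baked into the image at build time):

```shell
# Create a throwaway ext4 image; no root needed since we never mount it.
truncate -s 64M test.img
mkfs.ext4 -q test.img

# Default error behaviour: continue past IO errors.
tune2fs -l test.img | grep "Errors behavior"

# Switch to remount-ro (or panic) so an IO error stops further damage
# instead of letting the filesystem keep writing over itself.
tune2fs -e remount-ro test.img
tune2fs -l test.img | grep "Errors behavior"
```

The same behaviour can alternatively be set per-mount with the errors= mount option; tune2fs just makes it the persistent default stored in the superblock.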

But why?

It’s been a while, and maybe some of the above has finally become more of a default. In any case, I would like to make things more robust, and especially be able to influence those little things that make crash behaviour more consistent. For example, errors=remount-ro behaves much like errors=continue in normal operation, but will not end up corrupting even more or losing state. On the systems I’ve worked with I usually went for errors=panic, but my point is rather that the best parameters depend on your specific fleet; in any case, you would benefit from being able to set them.

I saw Making sure filesystems are mounted with sync, which makes me assume those people are in a similar boat. If you read that thread in full, it says that writes are “always consistent”, yet also that the system runs on the default commit interval and there are no further measures. As I see it, with errors=continue that is not going to work out magically. It will be OK in most cases anyway, since the system might just as well crash in the course of the whole event, e.g. when a power instability causes both the disk error and a crash.
Still, there’s a lot of ground between OK and consistent.

But how?

So, question: how do I go about properly setting those parameters for a fleet in a way that is in concert with Balena’s tooling?
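For the record, the workaround I’m currently considering, in case there is no supported way: a privileged sidecar service that remounts the data partition with the options I want at startup. Everything here is hypothetical (service name, mount point, and whether the data partition is even reachable at that path inside a container), and I’m not sure the kernel accepts errors= on a remount, in which case tune2fs -e on the device would be needed instead:

```yaml
# docker-compose.yml sketch (hypothetical service and mount point):
# a privileged container that remounts the data partition on startup.
services:
  fs-tuner:
    image: alpine:3.19
    privileged: true
    restart: always
    command: >
      sh -c "mount -o remount,discard,errors=remount-ro,block_validity /mnt/data
             && sleep infinity"
```

I’d much prefer a first-class knob in the device/fleet configuration over something like this.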