This is all really interesting, thanks for sharing. A really big takeaway for me is your comment that the root cause of issues is often SD card failure, and that it is hard to diagnose. This is so true. I have seen it in the past, and I think it deserves its own bullet point as a key consideration. Many hours get lost diagnosing weird issues that are hard to attribute to the SD card without swapping it out first. It would be a major consideration for me in support.
I wouldn't want to have to advise businesses on the best way to cost their hardware (SD card vs no card); it's such a difficult task. If you are at the scale of 15,000 units, I would probably keep a box of pre-flashed SD cards and then, if in doubt, just swap it out. It's harder to tell someone to do that if they only have one device, though; they don't want to pay an extra 5 USD just on a hunch.
I have had similar issues with power supplies. If you get one without the right spec, it can cause issues that will occupy you for days (network limitations, CPU throttling, the list goes on), and the device doesn't always give sufficient indication of the cause. For that reason I tend to suggest people buy official ones (such as for the RPi) for a little bit more, just to avoid those undiagnosable problems ("I picked up a USB cable from the market for 1 USD and now my device does x or y").
Another consideration is the data stored on the device. For many use cases, swapping an SD card for another one can work if there isn't any important data stored locally. You just provision the device with the stock software and off it goes. But if there is local data stored in volumes on the device that can't be replaced, such as data collected offline from sensors or users into a database, then an SD card may be a no for me. The risk of losing irreplaceable data is too high, and a device doing that sort of thing also writes to the card much more often (the writes being the bit that damages cards, not the reads).
I don't envy your role having to advise on these things. There really are too many dimensions to be able to produce a useful guide (although I would love to be proved wrong), especially on the cost per unit debate. I guess if I were advising someone, I would make sure they had fully understood the risks of SD cards, so that I would be less liable.
Very interesting to hear about the attempts to provide some metrics. I imagine it would be very possible. I suspect it would be something similar to what Apple did for their battery monitor on iPhones: it measures how many cycles the battery has gone through (writes in the case of an SD card) and then gives an indication of health based on those cycles/writes. Like Apple's battery monitor, though, it would be very unpredictable. For some users it could still help: you could establish a policy to replace cards after every x writes, which would reduce failures and allow for much better long-term costing of servicing.
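To make the "replace after every x writes" idea a bit more concrete, here is a minimal sketch of how I imagine it could look on a Linux device. It relies on `/sys/block/<dev>/stat`, where the seventh field is the number of 512-byte sectors written since boot. The device name `mmcblk0`, the sample line and the 1 TB budget are all assumptions for illustration, not real figures for any card, and because the counter resets on every reboot you would need to persist and sum it yourself.

```python
SECTOR_BYTES = 512  # /sys/block/<dev>/stat counts in 512-byte sectors


def read_stat(device="mmcblk0"):
    """Read the raw stat line for a block device (device name is an assumption)."""
    with open(f"/sys/block/{device}/stat") as f:
        return f.read()


def bytes_written(stat_line):
    """Bytes written since boot, taken from the 7th field (sectors written)."""
    fields = stat_line.split()
    return int(fields[6]) * SECTOR_BYTES


def should_replace(lifetime_bytes, budget_bytes):
    """Hypothetical policy: flag the card once its write budget is used up."""
    return lifetime_bytes >= budget_bytes


# Made-up sample line so the sketch runs on any machine.
sample = "4520 1200 310000 2100 9800 4300 2048000 15000 0 9000 17100"
written = bytes_written(sample)
budget = 10**12  # hypothetical 1 TB budget, not a manufacturer figure
print(written, should_replace(written, budget))
```

On a real device you would call `bytes_written(read_stat())` periodically and add the result to a total stored somewhere that survives reboots. It only counts writes the OS issued, not what the card's controller does internally, so it is an indication at best.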
I suspect something to accompany that would be stress testing a bunch of SD cards: writing to them over and over to get an average of how much data tends to be written before they fail. It is much harder than it sounds, of course. As you mention, it's not always clear when a card has failed, and environmental factors and plain good or bad luck with certain cards may play a part. Some indication is probably better than none though, I guess, as long as it's heavily caveated.
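As a rough idea of what such a stress test could look like, here is a small sketch that writes a random block to a file, reads it back, and counts how many bytes were written before a mismatch or I/O error. The function name, block size and cycle limit are my own choices for illustration. One big caveat: the read-back can be served from the OS page cache rather than the card, so a serious test would need to bypass or drop the cache, and it would still only catch failures that show up as bad reads or errors.

```python
import os
import tempfile


def stress_write(path, block_size=1024 * 1024, max_cycles=10):
    """Write and verify random blocks; return bytes written before the first failure."""
    total = 0
    for _ in range(max_cycles):
        data = os.urandom(block_size)
        try:
            with open(path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # ask the OS to push the data to the device
            with open(path, "rb") as f:
                ok = f.read() == data  # may come from the page cache, see caveat
        except OSError:
            ok = False
        if not ok:
            break
        total += block_size
    return total


if __name__ == "__main__":
    # Demo against a temporary directory; point `path` at the card's mount to test it.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "stress.bin")
        print(stress_write(target, block_size=64 * 1024, max_cycles=3))
```

Run across a batch of cards with a very large `max_cycles`, the returned totals would give the kind of average I mean, with all the caveats above.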
I think another area that might be worth exploring (which I haven't yet) is external storage for running Balena. I have come across a bunch of cases where people boot off a USB pen to run Raspberry Pi OS. It provides far quicker read/write speeds than an SD card, is less prone to corruption, is cheap hardware, is easy to swap out (you could leave the hardware in place and just swap the USB pen), is easy to flash, is familiar to non-techy users, makes it easy to expose the port on cases without worrying about tampering, and could offer TBs of data storage. I could probably think of more. Being able to provision a USB pen directly from my computer, rather than hunting around for the card reader that never seems to be where I put it, sounds appealing too.
A quick Google search and I'm sure you will find examples, but I think the idea in a Balena context would be an extremely light Balena image on the SD card that does nothing but tell the device where to boot from. This card would have very little data on it and never be written to, so its life span would improve considerably. On at least some hardware you can instead write the instruction to boot from USB to the hardware itself, and then not need the SD card in the device at all. The main Balena OS would then live on the USB pen, which the device would boot and run from.
Come to think of it, this would also put those cheaper boards that contain a tiny eMMC (like 2 MB) to good use. Maybe those could be flashed with the simple instruction to boot from the USB pen, keeping the cost of hardware down (much cheaper than a 32 GB eMMC).
I'm convincing myself of this more and more as I write. Wow, what a boon this would be for development! I would have an SD card in my devices that boots off a USB pen, and then only ever have to change the software on the pen. No more slow flashes of SD cards, finding card readers, or slow reads and writes when doing complex build processes. That would transform my whole workflow! Plus I wouldn't run the risk of burning through and corrupting cards in development. This may have been what I would have considered for your 15,000 unit customer (without knowing more details) as a middle ground (or eMMC as additional hardware, such as RasPiKey: Plug And Play EMMC Module For Raspberry Pi | The Pi Hut, although with that many units I would take some convincing to pay out).
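For the "write the instruction to the hardware" option, one example I am aware of is the Raspberry Pi 4 family, where the bootloader EEPROM has a `BOOT_ORDER` setting: a hex value read one digit at a time from right to left, each digit naming a boot mode to try. The sketch below just decodes such a value so the order is easier to see. The mode table is a subset of what the Raspberry Pi bootloader documentation lists, and whether a given Balena device type can actually boot this way is an assumption on my part, not something I have tested.

```python
# Subset of boot-mode codes from the Raspberry Pi bootloader docs (Pi 4 family).
BOOT_MODES = {
    0x1: "SD card",
    0x2: "network",
    0x4: "USB mass storage",
    0x6: "NVMe",
    0xE: "stop",
    0xF: "restart",
}


def decode_boot_order(value):
    """Return the boot modes in the order tried (lowest hex digit first)."""
    order = []
    while value:
        digit = value & 0xF
        order.append(BOOT_MODES.get(digit, f"unknown (0x{digit:x})"))
        value >>= 4
    return order


print(decode_boot_order(0xF41))  # SD card first, then USB, then start over
print(decode_boot_order(0xF14))  # USB first, then SD card, then start over
```

So a value like `0xf14` would express the "try the USB pen first, fall back to the SD card" behaviour described above.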
Making sure I circle all the way back to the original topic of this thread: it seems to me there could be an equal number of use cases for SD cards and fixed hardware among hobbyists, small businesses and corporates, all dependent on their needs. I would encourage supporting both.
Would I be correct in thinking it was an SD card based OS that went to space? Not a bad advert for SD cards: Beyond the cloud: Docker containers in space