Hey @tacLog, good questions, let’s see!
Testing, testing, and more testing. Try everything out on a lab device before rolling out to a fleet, and keep that lab setup as close to the real deployment as possible.
For code development, pin versions of libraries, the OS, and base images as much as possible, so things don’t change unexpectedly on a repush (when new versions are silently pulled in).
These are the top generic ones; I guess there are others too, I’d just need to think. You can also check the Going to production entry in our documentation, which should give some other suggestions.
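To make the pinning point a bit more concrete, here is a minimal sketch of a pre-push check, assuming a Python project with a pip-style requirements file. The function names and the `example/base` image name are made up for illustration, and this is not a balena tool, just one way you could catch loose versions before they surprise you.

```python
import re

# A requirement counts as pinned only when it uses an exact "==" version.
PINNED = re.compile(r"^[A-Za-z0-9._\-\[\],]+==[A-Za-z0-9._+!]+$")


def unpinned_requirements(text):
    """Return the requirement lines that do not pin an exact version."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # Ignore environment markers (after ";") and spaces for the check.
        spec = line.split(";", 1)[0].replace(" ", "")
        if not PINNED.match(spec):
            loose.append(line)
    return loose


def base_image_is_pinned(image):
    """True if the image reference has a digest or a tag other than 'latest'."""
    if "@" in image:  # a digest such as name@sha256:... is an exact pin
        return True
    last = image.rsplit("/", 1)[-1]  # avoid mistaking a registry port for a tag
    if ":" not in last:
        return False  # no tag means the registry default, usually 'latest'
    tag = last.split(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    sample = "requests==2.31.0\nflask>=2.0\nnumpy\n"
    print(unpinned_requirements(sample))
    print(base_image_is_pinned("example/base:latest"))
```

Running something like this before a push would flag `flask>=2.0` and `numpy` as candidates for silent upgrades, and the same goes for a base image that only says `latest`.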
Define dead-on-arrival? Devices that never deploy properly? If so, then very few, and usually due to local network limitations (i.e. devices cannot reach the cloud to finish provisioning). Once the network issue is taken out, I don’t really see dead-on-arrivals in this sense. The losses over time are very small as well. Some of the losses are due to OS updates, but we are adding more and more protection against those (using health checks and rollbacks, which we have already seen save devices in practice). Many other “losses” are temporary, in the sense that a physical power cycle recovers a bunch of cases. One of my tasks is looking into these losses in general and eliminating them over time. The third kind of loss is due to storage corruption. This very rarely results in a lost device, more often in a misbehaving device (one that can still remain accessible, and that we can frequently recover). Better SD cards or eMMC storage are a good protection against this (hence our use of eMMC on the balenaFin, learning from support). In general, device loss is an extraordinary event and doesn’t happen with any appreciable frequency.
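On the health checks and rollbacks mentioned above, the general pattern can be sketched in a few lines. This is only an illustration of the idea, not how balenaOS implements it: `apply_release` and `is_healthy` are hypothetical callbacks you would supply, and the number of checks is an arbitrary assumption.

```python
def update_with_rollback(current, candidate, apply_release, is_healthy, checks=3):
    """Apply `candidate`; keep it only if a health check passes, else roll back.

    `apply_release` and `is_healthy` are hypothetical callbacks supplied by
    the caller; nothing here reflects a real balena API.
    """
    apply_release(candidate)
    for _ in range(checks):
        if is_healthy():
            return candidate  # the new release is confirmed working
    # No check passed: go back to the release that was known to work.
    apply_release(current)
    return current


if __name__ == "__main__":
    history = []
    kept = update_with_rollback("v1", "v2", history.append, lambda: False)
    print(kept, history)
```

The important part is that the old release stays available until the new one has proven itself, so a bad update ends in a rollback instead of a lost device.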
I worked in academia (physicist by trade), then in other parts of the tech industry (marketing, then community, both involving creating a wide variety of projects), and was a sysadmin for ages. The FleetOps idea came out of a gap we had seen in our support. I got to do it mostly by amassing experience with the platform, with the OS, and with how they fit together, plus a willingness to stay with problems longer than most people, to get to the bottom of them. At least I think this was how it happened.
The support model was developed from a strong vision of how to make a better product. Support questions show us much better what the practical use cases are, and give us much more varied experience than our team could get on its own (even though we dogfood a lot - using balena to develop balena, among other things, so we can be our own customers too). There’s an incredible amount of learning in there. In a way, fixing the support issue of a single person is a side-effect of digging in deep and trying to fix the issues for everyone (and fixing the right issues). Scalability is something we are thinking about a lot, and working on. So far it seems to be scalable (we can handle a much higher number of devices and customers without a proportional increase in support load). Another sign is that the support questions we get are, over time, more and more niche, more subtle, more unusual, which suggests that the common issues (the ones that would increase the support load in general) are decently taken care of. But there’s always more to do.
Happy making!