I'm a balena Fleet Operations engineer, AMAA

Hi everyone,

I’m a Fleet Operations (FleetOps for short) engineer at balena. FleetOps usually works on tricky support cases and on issues that affect many devices, potentially even our whole fleet in some cases. We also build custom tools when required (when the product features haven’t yet caught up with requirements that pop up often in support).

Thought it would be interesting to drop in and see if you are interested in anything FleetOps does. Ask me almost anything!

Hi Gergely, I’ll take you up on your offer :slight_smile:

The concept of apps in balena makes sense, as the build environment is determined by which app is being pushed to. I’m curious to know if there are plans to allow a single app (i.e. our ‘production fleet’) to include devices on different platforms. Right now we have to ensure our app is deployed to our Raspberry Pi app devices, our RevPi devices, and our new x86 devices separately.

Also, and this is probably a larger concern - I would say 85% of our devices are offline, some for many months, yet we have to pay the same amount each month whether they are online or offline. Even if we decommission them, the only way not to pay for them is to delete them from the system, which means they turn into bricks if I’m not mistaken. How do you recommend companies manage costs based on what’s actually being used in the field rather than just provisioned to be used in the future?

Thanks for all your help in the past!

-James


Hey @jgentes, this goes a bit beyond FleetOps and is more on the product side :slight_smile:, but I will give it a shot anyway!

a) Mixed device type applications are already in the platform to some extent: devices with the same architecture can be mixed, so Raspberry Pi and RevPi can be in the same app. We are planning to do mixed architecture apps in the future as well, but there’s a lot of foundation / UX work that is needed to make it easy to use and work well too. I would also turn the question around (our customer success team might have asked you this already; if not, bug them ;) ): how do you end up with the same code base on different architectures? What’s your story? It would help us make the best tools that actual projects need. (Yeah, this feeds into FleetOps a bit: keeping an eye out for the needs of fleets, just like the use cases you mentioned.)

b) For the offline devices, have you heard about inactive devices? If you expect a device to be offline for a while, you can deactivate it for a one-time fee, and then you don’t pay for it until it comes back online again. Would this help your use case? I think that might be a good start for cost management, but I wonder if our customer success team has other ideas too, so I will ping them about this as well :slight_smile:

And cheers, will be there for you in the future too!

Thanks for the reply - we have been upgrading our hardware over time, so we’ve ended up with different platforms that we support.

As for the inactive devices, I’ll give that a shot!


Hey @imrehg,

I will try to ask some hard ones.

Let’s get some war stories that you can share?

Also, what are the top best practices you wish all clients would follow but few do? Including myself, most likely. :slight_smile:

Also, more seriously, what percentage of deployments do you see completely dead on arrival? What percentage get lost over time?

What did you do before this that led you to this career?

How did balena develop its support model, and is it sustainable as you scale?

Thanks for the many issues you have personally helped me and my team with.

-Thomas

Hey @tacLog, good questions, let’s see!

Testing, testing, and more testing. Try everything out on a lab device before rolling out to a fleet, and try to keep it as close to how the real setup looks as possible.

For code development, pin versions of libraries, OS versions, and base images as much as possible, so things don’t change unexpectedly on a repush (when new versions are silently pulled in).
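To make the pinning advice a bit more concrete, here is a minimal sketch of a check you could run before a push. It assumes a Python project with a pip-style requirements file; the function name `find_unpinned` and the rule "only an exact `==` version counts as pinned" are illustrative assumptions, not part of any balena tooling.

```python
import re

# Assumption for this sketch: a requirement is "pinned" only when it uses an
# exact "==" version with no wildcard, e.g. "requests==2.31.0".
PINNED = re.compile(r"^[A-Za-z0-9._\[\]-]+\s*==\s*[^\s*]+$")


def find_unpinned(requirements_text):
    """Return the requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank lines and comment-only lines
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    example = """
    requests==2.31.0
    flask>=2.0
    numpy
    paho-mqtt==1.6.1  # pinned
    """
    for entry in find_unpinned(example):
        print("not pinned:", entry)
```

The same idea applies to base images and OS versions: anything that can resolve to a different version on the next build is worth flagging before it reaches the fleet.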

These are the top generic ones. I guess there would be others too, I just need to think. You can also check the Going to production entry in our documentation, which should give some other suggestions too. :slight_smile:

Define dead-on-arrival? Devices that never deploy properly? If so, then the rate is very tiny, and usually due to local network limitations (i.e. devices cannot reach the cloud to finish provisioning). Once the network issue is taken out, I don’t really see dead-on-arrivals in this sense.

The losses over time are very small as well. Some of them are due to OS updates, but we are adding more and more protection against those (by using health checks and rollbacks, which we have already seen save devices in practice). Many other “losses” are temporary, in the sense that a physical power cycle recovers a bunch of cases. One of my tasks is looking into these losses in general and eliminating them over time. The third kind of loss is due to storage corruption. This very rarely results in a lost device, more often in a misbehaving one (which can still remain accessible, and frequently we can recover it). Better SD cards or eMMC storage are good protection against this (hence our use of eMMC on the balenaFin, learning from support :wink: ).

In general, device loss is an extraordinary event and doesn’t happen with any appreciable frequency.
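To illustrate the general idea behind the health checks and rollbacks mentioned above, here is a small generic sketch. It is not how balenaOS implements it; the function `update_with_rollback` and its callbacks are hypothetical names, made up just to show the pattern of "apply, verify, and revert if the device does not come back healthy".

```python
def update_with_rollback(apply_update, health_check, rollback, attempts=3):
    """Apply an update, then keep it only if a health check passes.

    All three callbacks are supplied by the caller (hypothetical here):
    apply_update() installs the new version, health_check() returns True
    when the device looks healthy, rollback() restores the old version.
    Returns True if the update was kept, False if it was rolled back.
    """
    apply_update()
    for _ in range(attempts):
        if health_check():
            return True  # update is healthy, keep it
    # No health check succeeded: go back to the known-good version.
    rollback()
    return False


if __name__ == "__main__":
    state = {"version": "old"}

    def apply_update():
        state["version"] = "new"

    def rollback():
        state["version"] = "old"

    # Simulate an update that never becomes healthy.
    kept = update_with_rollback(apply_update, lambda: False, rollback)
    print("update kept:", kept, "| running version:", state["version"])
```

A real implementation would wait between checks and would need to survive a reboot or power loss in the middle of the update, which this sketch leaves out.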

I was working in academia (physicist by trade), then in other parts of the tech industry (marketing, then community, both involving creating a wide variety of projects), and was a sysadmin for ages. The FleetOps idea came out of a gap we saw in our support. Then I got to do it mostly by amassing experience with the platform, with the OS, and how it all fits together, plus a willingness to stay with problems longer than most people, to get to the bottom of them. At least I think this was how it happened :slight_smile:

The support model was developed out of a strong vision of how to make a better product. Support questions show us much better what the practical use cases are, and give us much more varied experience than what our team can gather alone (even though we dogfood a lot - using balena to develop balena, among other things, so we can be our own customers too). There’s an incredible amount of learning in there. In a way, fixing the support issue of a single person is a side effect of digging in deep and trying to fix the issues for everyone (and fixing the right issues). Scalability is something that we are thinking about a lot, and working on. So far it seems to be scalable (we can handle a much higher number of devices and customers without a matching increase in support load). Another sign is that the support questions we get are becoming more and more niche, more subtle, and more unusual over time, suggesting that the common issues (which would increase the support load in general) are decently taken care of. But there’s always more to do. :bar_chart:

Happy making!