Infrastructure and scalability

Hi all,

I’d like to know more about the infrastructure and scalability of openBalena: how it works, plus some practical matters.

First of all, I know openBalena consists of the following things (and please, correct me if I’m wrong):

  • haproxy (functions as a forwarder afaik)
  • VPN (openVPN based)
  • API
  • Registry (For provisioning devices?)
  • S3 (For storing container images/releases?)
  • Database
  • Redis
  • cert-provider (For Let’s Encrypt certificates)

Now, for future reference, I’d like to know some more about the following things:

  • Infrastructure
    I’d like to set up an infrastructure around my openBalena instance. It currently runs on a single server with 2 vCPUs, 4GB RAM and an 80GB SSD. It only has a few devices, so probably no issues there. But in the near future, a couple of hundred devices may be registered and connected to this instance. What’s the best way to set up my server infrastructure? Do you have any estimates of what hardware to use? Can I put some kind of load balancer in front? And instead of the S3 container storing its data on my 80GB SSD, is it supported to use another volume for this?

  • Scaling
    Because more and more devices will be connected, scaling is a big concern. The VPN is probably one of the components that needs scaling the most. In the open-balena vpn repository, there’s a short explanation saying that you can scale it. But is it as simple as setting BALENA_VPN_GATEWAY=<openbalena-server> and off it goes? And how about the other containers, how can I scale those? For example, running multiple APIs for load balancing, or multiple Redis/DB instances for redundancy?

  • Server hardware
    And last but not least, are there hardware requirements that you have to take into account? For example, the VPN server needs 8GB RAM, while the API needs lots of CPU. Of course it’s very hard to say “you’ll need this much RAM and this much CPU for XXX devices”, but an estimate would be more than welcome!

I can’t imagine that I’m the only one with these questions, because openBalena is an awesome project and I’m sure many others use it. So feel free to add some more questions (and answers of course!), because it’s not a topic only created for me, but for many others that are using or are going to use openBalena!

Thanks in advance!

Hi there,

Just touching base to let you know that we’re working on getting an answer for you.

Thanks,
James.

Hey, I can give you a few pointers but I first have to warn you that there are caveats about deploying openBalena for critical production use, with the most important one being that there’s no ability to remotely update the host OS of devices provisioned to an instance.

What hardware to use for X number of devices is a really hard question to answer even in broad strokes, because it depends on what you’ll be doing. Merely maintaining 200 online devices shouldn’t be too hard on the base specs you mention; the baseline usage metrics of the instance will in fact probably be on the lower side. However, you’ll also need headroom to handle spikes of activity, which grow with the size of a fleet (i.e. the number of devices associated with a single application). That is where the question of what you’ll be doing comes in.

To give you a sense of which component is stressed under which conditions, I’ll briefly go through the cycle of deploying a new release to devices. You’ll obviously have to first push the built images, which means uploading to S3 via the Registry, so there’s some I/O there. As soon as the release is complete, the API will reach out through the VPN to every online device belonging to the application the release is for and notify it, so there’s some more I/O there. In response, each device requests more details about its target state from the API (from the “state endpoint”), which involves a few relatively big queries to the database, so there’s yet more I/O and a potentially noticeable spike in CPU from the database.

Each device then starts downloading the images for the new release, which initially involves queries to the Registry and API for authorization, meaning database activity. The download of each image then starts, involving I/O on the Registry and S3. While downloading, the device keeps updating the API every second or so about progress (the progress bar you see on the balenaCloud Dashboard during an update). There is a lot of activity on multiple components at this stage, especially on the database, due to the number of writes that happen. As each device completes the download and applies the update, traffic dies down.

This is all on top of any “ambient” activity that sets the baseline I mentioned above – devices polling the “state endpoint”, sending logs (these end up in Redis via the API), maintaining the VPN connection, and maybe a couple more things I’m forgetting now.

From the above, it’s evident there’s a number of dimensions this traffic depends on: the number of devices, the average size of the images, the average number of services of each release, the rate of new releases, the poll interval of the devices, etc.
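To make those dimensions a bit more concrete, here is a minimal back-of-envelope sketch in Python. It is not taken from openBalena itself; the `Fleet` class, the function names, the 600-second poll interval and the 300 MB image size are all assumptions picked purely for illustration. Only the "progress update every second or so" figure comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    """Hypothetical description of one application's fleet (illustrative only)."""
    devices: int                       # online devices in the application
    poll_interval_s: float             # assumed state-endpoint poll interval
    image_size_mb: float               # assumed total image size of one release
    progress_interval_s: float = 1.0   # "every second or so" during a download


def baseline_polls_per_s(fleet: Fleet) -> float:
    """Average state-endpoint requests per second when nothing is being deployed."""
    return fleet.devices / fleet.poll_interval_s


def rollout_progress_updates_per_s(fleet: Fleet) -> float:
    """Worst case: every device downloads at once and reports progress."""
    return fleet.devices / fleet.progress_interval_s


def rollout_download_gb(fleet: Fleet) -> float:
    """Total data served by the Registry/S3 if every device pulls the full release."""
    return fleet.devices * fleet.image_size_mb / 1024


# Example numbers are assumptions, not measurements.
fleet = Fleet(devices=200, poll_interval_s=600, image_size_mb=300)
print(f"baseline polls/s:         {baseline_polls_per_s(fleet):.2f}")
print(f"rollout progress writes/s: {rollout_progress_updates_per_s(fleet):.0f}")
print(f"rollout download total GB: {rollout_download_gb(fleet):.1f}")
```

The point is only the shape of the numbers: the baseline is a fraction of a request per second, while a release to the whole fleet can briefly mean hundreds of API writes per second. Deltas between releases and staggered downloads would lower the real figures.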

A couple of rough guidelines: the components you’ll primarily concern yourself with are the VPN and the API, and both can easily be scaled horizontally if needed. The more RAM and CPU you can get your hands on, the better you’ll be able to handle spikes in traffic. It’s best to dedicate hardware resources to each component. It’s also worth considering “outsourcing” as many of the surrounding components as possible, for example by using a cloud provider’s solution for the database, Redis and S3, so that you’ve got fewer things to concern yourself with. All of these are trivially configured to point to external services instead of the bundled containers.

Hopefully this helps. The base specs you mention should be able to handle the baseline load from 200 devices and should definitely get you started. In any case, as the load increases, you’ll see what’s needed and provision new resources accordingly. And keep us updated!

Thanks for the quick and very detailed response about openBalena.

You’re right about your first point: openBalena doesn’t support host OS updates. I’ve mentioned this quite a few times on the forum and have even been in contact by email with some of you. This is obviously the main issue when using openBalena, but we can’t use balenaCloud for our project because of the monthly fee per device. So it’s great that there’s an alternative, and I’d love to dig deep into openBalena and learn all the ins and outs.

We don’t intend to monitor all devices all day via the VPN or to follow all logs, but when a device misbehaves, we’d like to log in to it and see where it went wrong. Hearing that these specs will probably be fine for the first 200 devices is good to know. If they had only been fine for the first 10, we would have had to come up with another solution :)

Is there any guidance on how to set these things up horizontally? As far as I know, the openBalena project uses multiple Docker containers which wait for each other and are linked to each other in order for it to work. Running the API on server A, the VPN on server B, and the DB etc. on other servers (so scaling horizontally) is obviously more complex. And is it best to run all those components in Docker containers, or is it better to start a standalone Redis instance and Postgres instance and let the VPN etc. connect to those?

I’ll take a look at the containers, because they’re open source (which is great!), and see how far I can get. But all input from the maintainers of the project is welcome! :)

There are two ways to scale the API and VPN horizontally. One is to configure them to automatically fork themselves internally (up to NUM_CPUs each) and use up all available CPUs of the hardware they’re running on. The other is obviously to deploy multiple instances (i.e. containers) of each. Both ways depend heavily on the available hardware. With 2 CPUs on the base hardware and all services running on it as you describe, you’ll potentially get worse performance if you try to scale either way, due to the overhead and the fact that they will all be fighting for the same fixed set of resources.
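As a rough illustration of why forking does not help on a small box, here is a minimal Python sketch of a rule of thumb. The rule itself (roughly one busy process per CPU) is an assumption for illustration, not an official sizing formula, and `workers_per_service` and the process counts are hypothetical names and numbers.

```python
def workers_per_service(cpus: int, fixed_processes: int, scalable_services: int) -> int:
    """Split the CPUs left over after single-process services among the scalable ones.

    Assumed rule of thumb: about one busy process per CPU. Each scalable
    service always keeps at least one worker, even when nothing is spare.
    """
    spare = max(cpus - fixed_processes, 0)
    return max(spare // scalable_services, 1)


# Assumed layout: database, Redis, Registry, S3 and haproxy as five single
# processes, with the API and VPN as the two services that could fork.
print(workers_per_service(cpus=2, fixed_processes=5, scalable_services=2))
print(workers_per_service(cpus=16, fixed_processes=5, scalable_services=2))
```

On the 2-CPU server everything is already oversubscribed, so extra workers only add contention. On a much larger machine there is spare capacity that forking could actually use.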

Our recommendation would be to keep things simple: use the services as they come out of the box (which means one process per container per service) and upgrade CPU and RAM as you observe issues. This should get you quite far before you hit real performance issues. At that point you could consider deploying each service onto its own hardware, and even then we strongly suggest you start with the database, S3 and Redis. The key is to have good visibility into the server in order to identify problems early, and to be able to perform upgrades easily.

Thanks for your response!
I’ll keep you informed once we set up a production server infrastructure, including the problems we run into and the successes we have :)