balenaCloud outages: their causes and what we’re doing to resolve them

Please use the thread below to discuss the related blog post:


I’ve never fully grokked why there’s a VPN connection… but if it’s necessary, are the issues related to which protocol you use? WireGuard is available in more recent Linux kernels, if that helps… but I’m sure the devs are aware.
Is there a way to get off your servers and just SSH into my balenaSound machine from wherever, or update with SFTP? Or have a direct login from a machine on the same network? (It takes away the cloud aspect, but…)

Hello, the performance issues we’ve described in the blog post are not directly related to our VPN subsystem. We use a “VPN” to enable features such as the web terminal and SSH. While we currently use OpenVPN to provide this functionality, technically any secure transport supported by the kernel and/or user space could be used, even an SSH tunnel. We don’t currently see an immediate technical or business requirement to move off OpenVPN, but if such a need arises in the future, we will certainly consider it.

Please take a look at openBalena for an open source version of balenaCloud, which can be run privately/internally. You could also continue to use balenaCloud, but inject your SSH key(s) into config.json to enable OpenSSH access on devices without having to go via the CLI/WebUI methods.
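For anyone following along, here’s a minimal sketch of that config.json approach, assuming a balenaOS release that reads custom public keys from the os.sshKeys field (worth double-checking against the balenaOS docs for your OS version). Merge the os block into your existing config.json rather than replacing the file:

```json
{
  "os": {
    "sshKeys": [
      "ssh-ed25519 AAAAC3Nz...your-public-key-here... you@your-laptop"
    ]
  }
}
```

With a key in place you should be able to SSH straight to the host OS from a machine on the same network, with no cloud hop involved. If memory serves, the host sshd in balenaOS listens on port 22222 rather than 22, so something like `ssh root@<device-ip> -p 22222` should get you in.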

Hi, we are really interested in any updates on these database issues and the proposed solutions. Is there anything further to report at this stage?

We have experienced large-scale outages at an unacceptable level across our deployed devices due to these issues, and we are in the process of building our own replacement infrastructure so that we can stop using Balena if necessary. Our preference would be to continue with Balena; however, we can’t commit to doing so without knowing that we won’t see similar outages in the future.

So, any info you have on the progress of your proposed database changes would be much appreciated. Thank you.

Hi @martink,

Apologies for the delay in responding! There are quite a few things we’re working on, but I’ll do my best to summarize:

  1. We are making some longer-term fixes to our Supervisor and the way it uses our API, removing unnecessary calls, which should reduce load across our entire customer base.

  2. We did make the move over to Aurora earlier this month, but there are still some improvements we intend to make around replica lag that will help over the long term.

  3. I have a very long list of specific improvements we are making across our API, Registry, Deltas, Proxy, and Supervisor that is probably more detail than most people want, but I counted 28 items and would be happy to share them with you by email if you are genuinely interested! :grimacing:

Overall, we’ve had a much more stable experience since the Aurora migration, and the team is confident that our customers’ experience should now be smoother than ever.

However, if that hasn’t been your experience over the past few weeks, please do let us know. We’re keeping a close eye on things over the next few months, and that includes listening to customer feedback to make sure we hear about any experiences that don’t match the expectations we have based on the improvements we’ve put in place. So thank you for reaching out, and please do keep in touch with us about your experience going forward!


Thanks very much for the update; that’s really helpful.
I’m guessing it will take some time for the Supervisor changes to make their way out to user devices?

Thanks for the offer of getting into all the details, but I think our team doesn’t need to know about that stuff at this stage :slight_smile:

It sounds like the change to Aurora made a significant improvement? I haven’t had a major outage reported on our fleet since late last year, so anecdotally at least things are looking better. Long may that continue!
