What are we working on? The Balena Notes

Thanks Carlo!

Since joining Balena at the beginning of this year, I’ve been pushing forward a CI system that will make our lives a lot easier! We have hundreds of repositories with many cross-dependencies (because we reuse as much as possible to do a lot with a small team). We build server apps, CLIs, desktop apps and operating systems, and a lot of that for multiple architectures/machines!

Our so-called “Transformers” build architecture is based on the idea of having many small building blocks, each with opinions on how to build a specific thing into one or many other things. The big difference from GitHub Actions and similar systems is that every “thing” has a type and well-defined metadata, so that reuse becomes a breeze and, in fact, automatic!
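To make the "typed things" idea concrete, here is a minimal sketch in Python. It is not the actual Transformers implementation: the `Artifact` and `Transformer` classes, the type names (`source-node`, `docker-image`, `published-image`) and the `run_pipeline` helper are all made up for illustration. The point is only that once every artifact carries a type, matching building blocks can be picked automatically.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Artifact:
    """A 'thing' in the pipeline: it always carries a type and metadata."""
    type: str  # hypothetical type names, e.g. "source-node"
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Transformer:
    """A small building block that turns one input type into other things."""
    name: str
    input_type: str
    run: Callable[[Artifact], List[Artifact]]


def run_pipeline(start: Artifact, transformers: List[Transformer]) -> List[Artifact]:
    """Apply every transformer whose input type matches, then do the same
    for the outputs. Assumes the type graph has no cycles."""
    produced: List[Artifact] = []
    queue = [start]
    while queue:
        current = queue.pop(0)
        for transformer in transformers:
            if transformer.input_type == current.type:
                outputs = transformer.run(current)
                produced.extend(outputs)
                queue.extend(outputs)
    return produced


def build_image(src: Artifact) -> List[Artifact]:
    # One source artifact fans out into one image per architecture.
    return [Artifact("docker-image", src.name, {"arch": arch})
            for arch in ("amd64", "arm64")]


def publish_image(img: Artifact) -> List[Artifact]:
    tagged = f"{img.name}:{img.metadata['arch']}"
    return [Artifact("published-image", tagged, dict(img.metadata))]


TRANSFORMERS = [
    Transformer("node-to-image", "source-node", build_image),
    Transformer("image-publisher", "docker-image", publish_image),
]

results = run_pipeline(Artifact("source-node", "my-cli"), TRANSFORMERS)
for artifact in results:
    print(artifact.type, artifact.name)
```

Nothing in `run_pipeline` knows about Node or Docker. Adding a new transformer for an existing type is enough for it to be picked up.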

Besides that, I’m currently trying to get my Pine64 to run BalenaOS by contributing there. Yes, doing work all across the company - if you want to! - is definitely a thing here :slight_smile:

And as he’s helping me with the latter, I nominate next @klutchell !

6 Likes

Thanks for the nomination @Hades32 !

As part of the OS team I’ve recently been focused on a new suite of automated tests to reduce the friction of releasing new versions to production. This involved reviving an old framework (not alone – with a small team) and rewriting a lot of tests that didn’t align with the current state of the OS.

It’s no surprise that some of the improvements to testing have also resulted in improvements in the OS itself!

On top of that I’ve taken over maintenance of our balena-preload module as used by the balena CLI. We have some cool ideas planned for other ways we could use this module in production but first it requires some TLC and I’ve been doing my best in that regard.

The rest of my time recently has been split between docs, labs, and devops, helping out wherever my experience takes me. Today I’m trying to get proper docker buildx support into balena-ci and I’m learning about a ton of backend components I’ve never seen before!

I’ve been here ~10 months and before I started I told myself that I would get balenaOS running on my RockPro64 and I haven’t even cloned a repo yet. Kudos to @Hades32 for going as far as he has with Pine64!

I’d love to hear what my partner in docs, testbot, and general shenanigans @vipulgupta2048 has been up to!

7 Likes

Thanks for the ping @klutchell but before that thanks for being the awesome dude you are and bringing your best each and every day. Actually, that’s true for all my balenistas, I love y’all.

As part of the testbot team (R&D), my primary focus is automated hardware testing.

  • The hardware aspect of it is the testbot.
  • The software aspect is Leviathan, the distributed testing framework we are developing/reviving.
  • The research aspect is finding alternative solutions to hardware & software problems, test optimization, or even debugging tricky issues with several variables at play.

I work with the hardware and on the software, with the whole team pitching in with their respective expertise. I spend my time improving Leviathan’s UX and overall architecture, maintaining testbotSDK, and developing new features and new test suites. I am also one of the remote caretakers of our testbot rig, making sure it’s in top shape with regular updates, attention to logs, and in-depth debugging.

My secondary focus is docs. With a product surface as big as balena’s, docs are a challenge of their own. The team is collectively taking it head-on, especially with the release party we organized recently. I work on new docs initiatives or major changes that need to be planned & executed.

The rest of my time is split between projects, support, self-learning, and running memeservice at balena. Balena is actually my first job after college and what an adventure it has been, through and through. And, to continue telling you about the adventure that is balena, I’d like to nominate my fellow testbot caretaker and recent balena Hack Week winner @rcooke-warwick :trophy:

4 Likes

Thanks Vipul :slight_smile:

My mission at the moment is to make automated testing on real hardware a reality.

  • Working on the hardware (testbot) that we use to remotely control, provision and power devices under test (DUTs)
  • Creating automated tests to replace the manual ones being done at the moment, as well as adding new test cases
  • Working on the testing framework to remotely test things like OS releases on real hardware
  • Making use of these testing tools for other applications, such as hardware QA

Over the past 6 months, the focus has been on stabilising our testing stack and replacing the need for manual testing of OS releases. We’ve managed to create an automated test suite that is essentially functionally equivalent to the set of manual tests, and the current challenge is scaling this up so it gets run for every device type! (Currently it’s only running on a rig of ~10 devices from the Rpi family.)

Another area I’ve worked on is the hardware development process, which aims to reduce friction in developing hardware and shorten iteration times.

I’ll nominate @anujdeshpande next!

5 Likes

Thanks @rcooke-warwick !

I am working on hardware related things at balena, with one side project that I sometimes get back to :slight_smile:

  • I am working on the software that will go on the Fin 2 as part of its onboard microcontroller’s firmware w/ AlexB.
  • In the last couple of weeks, I have been focused on improving our hardware process. Hardware development has traditionally been slow (compared to software). But there are some obvious improvements that we can make to fix that. We are essentially building a CI that is suitable for hardware projects like Fin, Etcher Pro, their respective accessories, and more! So this meta project aims to take away some of the work that hardware designers do but don’t need to. Ryan, Nico, Konstantinos and AlexB are working with me on this :slight_smile:
  • I started working on a small project w/ @andrewnhem to enable SSDs on the Raspberry Pi 4 running balenaOS. This would mean we have better storage than SD cards! Great for projects like NextCloud. If anyone’s interested in helping out with that, let us know!

There’s loads to be done to build the next generation of hardware, as well as improving the process of developing hardware itself!

I nominate @phil-d-wilson next!

5 Likes

“Thank you” @anujdeshpande ? :slight_smile:

I’ve got my fingers in lots of pies at balena including improvements to exposing device ports to the internet, mDNS to discover and use balena devices, and local device management for open fleets. But the main focus of my work is currently balenaBlocks and helping to lead the balena labs.

Blocks are a form of app enablement, the idea being to give developers (of all abilities) little parcels of functionality that they can drop into their IoT/edge application. My current work here is paving the way for everyone to build and contribute blocks, to build the ecosystem. To do this I’m writing blog posts and documentation, speaking with other companies who want to make blocks for their products (watch this space!) and pushing improvements to how we expose blocks on hub and allow users to deploy them.

The labs is balena’s team of pet users. We do what our users do: make projects, use the product, find areas to improve and hack on stuff. Lately we’ve started to induct all new balena joiners into the labs for a residency, where they make a project and tell everyone about it. The residency allows people to come into balena and work on their own idea, with zero pressure, whilst they get to know the product and the team. It’s a great way to absorb the unique culture of balena and learn how we work. I’m super excited about the projects the first batch of labs residents are currently working on, and can’t wait to see their project blog posts!!!

I nominate @20k-ultra to share next - he’s always got interesting things on the go!

:heart: :balena:

3 Likes

Ah, what a cool thread! Thanks for the ping @phil-d-wilson.

Well, I focus primarily on the Supervisor, which is the application on the device that manages services and configurations (this trivializes the complexities of our target state funnel algorithm). This project is really intricate and has been working since the beginning, from when Balena was just deploying single containers, to multi-container applications now, and multi-application support in the future. These paradigms have different requirements for the features we offer, so being reliable is our #1 priority. The team has done a lot of work improving test coverage and, even better, we’ve made some tooling to sanely mock the Docker engine in our tests using a lib we’ve built called mockerode. I’ll let @pipex talk about that more when he gets pinged. We are also really excited about improving the Supervisor’s E2E testing with testbot!

So, aside from continuing to polish the Supervisor, we have a spec in the works for a fallback state. This allows the Supervisor to know when it’s time to give up applying the target state and fall back to a reliable state known as the Goldilock State. It’s very important that we don’t break the universal law that the Supervisor listens to what the target state wants, so this spec is pretty different from what we’ve normally worked on. It aims to stop a device from trying to run a state that the Supervisor knows has been failing for hours.
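As a rough illustration of the decision involved, here is a minimal Python sketch. The spec is still in the works, so everything here is an assumption: the `ApplyTracker` class, the `choose_state` helper and the three-hour threshold are invented for the example and say nothing about how the Supervisor will actually do it.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold; the real spec may use a different rule entirely.
GIVE_UP_AFTER_SECONDS = 3 * 60 * 60


@dataclass
class ApplyTracker:
    """Remembers since when the target state has been failing to apply."""
    first_failure_at: Optional[float] = None

    def record(self, succeeded: bool, now: float) -> None:
        if succeeded:
            self.first_failure_at = None  # a success clears the failure streak
        elif self.first_failure_at is None:
            self.first_failure_at = now   # start of a new failure streak

    def should_fall_back(self, now: float) -> bool:
        return (self.first_failure_at is not None
                and now - self.first_failure_at >= GIVE_UP_AFTER_SECONDS)


def choose_state(target: str, fallback: str, tracker: ApplyTracker, now: float) -> str:
    """Keep following the target state unless it has been failing for too long."""
    return fallback if tracker.should_fall_back(now) else target


tracker = ApplyTracker()
tracker.record(succeeded=False, now=0)
print(choose_state("target", "fallback", tracker, now=60))
print(choose_state("target", "fallback", tracker, now=GIVE_UP_AFTER_SECONDS))
```

The target state stays the default. The fallback only wins after a continuous streak of failures, and a single successful apply resets the streak.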

In the spirit of reliability: to be able to build good testing, you need to automate some stuff. A bunch of people are working on a big change to our release model which allows first-class versioning and marking a release as draft or final. With this new model we can then make pipelines that push a draft release in a pull request and then mark it as final on merge. I’m in charge of the GitHub action which will offer this workflow in a very easy-to-use package. In fact, we’ll be using this ourselves with the Supervisor! Imagine having designated test devices in a fleet that will run the draft releases while the rest of your production fleet only tracks final versions. It will all happen automatically via your git workflow.
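Here is a small Python sketch of that draft/final idea, just to show the selection logic. It does not use the balena API or the GitHub action; the `Release` class, `finalize` and `release_for_device` are hypothetical names, and the version numbers are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Release:
    version: str
    is_final: bool = False  # releases start out as drafts


def finalize(releases: List[Release], version: str) -> None:
    """What a merge would trigger: mark the matching draft as final."""
    for release in releases:
        if release.version == version:
            release.is_final = True


def release_for_device(releases: List[Release], tracks_drafts: bool) -> Optional[Release]:
    """Newest release a device should run; `releases` is ordered oldest to newest."""
    for release in reversed(releases):
        if release.is_final or tracks_drafts:
            return release
    return None


releases = [Release("1.0.0", is_final=True)]
releases.append(Release("1.1.0"))  # pushed from an open pull request as a draft

print(release_for_device(releases, tracks_drafts=True).version)   # test device
print(release_for_device(releases, tracks_drafts=False).version)  # production device
```

Calling `finalize(releases, "1.1.0")` afterwards is the "merge" step: from then on production devices would pick up 1.1.0 as well.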

Oh, I also answer lots of security reports! We do a great job of answering all reports and always get back to you. If you think you’ve found something, reach out to support@balena.io and check out our security page: Security - Balena Documentation.

Like most people who work at Balena, we have a lot of projects going on, since we are encouraged to help wherever we find that we can provide value to the team. The things listed here are just the main ones I focus on. There’s a lot of stuff behind the scenes I’d like to get to eventually, like an interactive flow chart for debugging devices. It’s intended for our support agents, but let’s make it open source and allow anyone to use it!

I’m going to ping @codewithcheese because I always learn a lot when talking with Tom and he’s always working on some super meta project.

5 Likes

Thanks @20k-ultra, you’ve done amazing work with the supervisor!

I am currently focusing on building a next-generation DevOps tool, to automate the complete pipeline from source code change all the way through to product creation and deployment.

In addition to balenaCloud, there is openBalena, and we also offer a version of balenaCloud for on-prem customers. Each of these products is composed of different combinations of the same components, and they are all deployed in different ways. Keeping all the products and components in sync currently requires many actions by the team and can take time, and each step can introduce bugs.

To be able to provide the most stable products possible, we are working on automating the complete process. So when a balenista or an outside contributor pushes new source code, that code can be tested and incorporated into all the relevant products, and those products can be tested and automatically released and deployed.

A number of concepts have been developed to automate this. First is contracts. Contracts define a component and its dependencies and help our systems reason about how components can be composed together. Second is transformers: this is our CI pipeline on steroids. It can automate all the steps by reacting to changes in contracts. Finally there is Katapult, which I am working on. It can combine low-order contracts, such as a service, into higher-order contracts, such as a product. The higher-order contracts are then used to build product releases, development environments, and deployment configuration.
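To give a feel for what "composing contracts" can mean, here is a toy Python sketch. It is not Katapult and not the real contract format: the `Contract` class, its fields and all the slugs are invented for the example. It only shows how declared dependencies let a tool work out which components make up a product and in which order they can be brought up.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List


@dataclass
class Contract:
    """Hypothetical, heavily simplified contract: a component plus its dependencies."""
    slug: str
    type: str  # e.g. "service" (low-order) or "product" (higher-order)
    requires: List[str] = field(default_factory=list)


def resolve(product_slug: str, contracts: Dict[str, Contract]) -> List[str]:
    """Return every slug the product needs, dependencies first."""
    graph: Dict[str, set] = {}
    pending = [product_slug]
    while pending:
        slug = pending.pop()
        if slug in graph:
            continue
        deps = contracts[slug].requires
        graph[slug] = set(deps)
        pending.extend(deps)
    return list(TopologicalSorter(graph).static_order())


CONTRACTS = {c.slug: c for c in [
    Contract("db", "service"),
    Contract("api", "service", requires=["db"]),
    Contract("ui", "service", requires=["api"]),
    Contract("unrelated-cli", "service"),
    Contract("my-product", "product", requires=["api", "ui"]),
]}

order = resolve("my-product", CONTRACTS)
print(order)
```

The product contract ends up last because everything else is a (transitive) dependency of it, and `unrelated-cli` is left out because nothing in the product requires it.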

Note: there is an open source Katapult repo. However, it is a basic prototype which doesn’t represent the concepts in development now.

Next I nominate… @Ereski, I’m curious what magical solutions you’re working on next!

3 Likes

Thank you @codewithcheese! Unfortunately, nothing magical at the moment :joy:

I am currently dirty with virtual grease, working at the heart of our internal management system, Jellyfish. Recently we deployed a partial redesign to our database schema that yielded considerable improvement to queries we routinely run in Jellyfish. Better UX and less $$$ required by infrastructure are always good :smiley:

I am deep in our CI/CD pipeline. In balena, and especially with Jellyfish, every PR that is merged creates a new version that is then, ultimately, deployed in production. There are several steps in this pipeline, and a single change may take a few hours to be fully tested and make it to production. There has been a lot of work on improving that time, but it is still nowhere near comfortable. Trying to deploy urgent fixes, in particular, can be troublesome. The goal of Jellyfish itself is to reduce friction for everyone in balena, so it is a twist of irony that there’s quite a bit of friction in working with Jellyfish itself.

Apart from that, it’s mostly grunt (but necessary) work at the moment. Might do magic later :stuck_out_tongue:

I nominate @lucianbuzzo. I’m sure you have a lot to say after more than two years since you last wrote here.

2 Likes

Thanks for the nomination @Ereski !
It’s pretty wild reading my last post in this thread from Feb '19 - we actually released the schema form component in rendition, which you can see here.

As has already been mentioned above, the cycle time on Jellyfish has been painfully slow, and improving the situation has been my number 1 priority over the last month or so. The cost of slow development and CI/CD times compounds: if a test takes a long time, you can end up tackling some other minor task, or get distracted from your main task, and this context switching eats into your productivity. Additionally, if you have flaky tests, the CI/CD run might fail without you realising it because you’ve already been distracted by a minor task, so you don’t immediately re-attempt the CI run. Before you know it, you’ve spent 4 hours effectively waiting for loading screens!
For any project to be successful, I think it is essential that the software remains malleable (easy to change). At the very least trivial changes must be trivial to implement. Once it becomes difficult to change software, it begins to atrophy, refactors become enormously difficult tasks and everyone hates working on the project. The end result is often not catastrophic, but insidious. Because small changes and fixes are difficult and time-consuming, they aren’t done (engineers look for a bigger “bang for their buck”) resulting in the system dying “a death of a thousand cuts”, as all the small problems go unfixed and the software becomes unusable.

So with all this in mind, we committed ourselves to bring down the complexity and cycle time attached to making code changes, and we’ve been able to dramatically improve the reliability of our CI runs as well as more than tripling the speed of them! We were also able to identify many parts of the system that could be refactored to reduce the amount of code required and the number of module interdependencies needed, which is next on the roadmap and something we’ve already started work on.

Outside of working on Jellyfish, I’ve also been spending a lot of time designing and codifying our hiring process. Like everything at balena we really tried to approach hiring from first principles and to build it as a product, and I think we’ve done a lot to not only help us hire better candidates but to also provide them with a much better experience, whether they join the company or not!

Many software companies have hiring processes that optimise for hiring people who have large egos and/or are good at competitive programming, rather than people who would be great team members.
Rather than trying to emulate how Google or Facebook hire, we took a more focused approach, applying first principles thinking to understand our hiring process better. I really wanted to create a process for hiring engineers that allowed candidates to showcase their logical reasoning, problem-solving, critical thinking and communication skills, whilst being open-ended enough to not constrain people to predetermined answers. I think there are still a lot of unexplored areas, but what we’ve achieved so far has been very promising :blush:

@fisehara What have you been up to?

By the way, we’re hiring!

2 Likes

Thanks for the nomination @lucianbuzzo !

I don’t have any track record in the forums yet, as I joined Balena’s backend team in May and am still onboarding into the backend code bases, especially the Balena API. Our Balena API mainly builds on top of the open source projects open-balena-api and pinejs. The latter is the foundation and provides a business data model generator, a just-in-time SQL query generator/executor and an OData API to query the data model. Right now, the elegance of letting pinejs provide resource endpoints at runtime without implementing them manually comes with a higher complexity in the backend implementation, and that pretty much outlines my daily challenge.

Besides working on and onboarding into the Balena API codebase, I’ve started figuring out how to auto-generate the Balena API documentation. This includes some intermediate steps, namely a balena API model to OData specification generator, an OData specification to OpenAPI specification converter and, finally, a rendering step from the OpenAPI specification to a hosted documentation page. I’m evaluating the OASIS OData-to-OpenAPI converter and OpenAPI rendering tools like Redocly.

For the Balena build process, I’ve pushed forward a feature for our Balena builder to support env_file tags in the docker-compose file. The foundation is laid and now I’m working on integrating the feature into our cloud builder codebase. Finalising this feature includes integrating it into the Balena CLI.

During our recent hack week I worked on utilising Weaviate to add a semantic search engine as an augmentation for Jellyfish. This tool transfers Jellyfish entities and indexes them in a vector space database. On this database, similarity queries and distance metrics can be evaluated by just measuring the Euclidean distance between entries. Maybe it will help us understand our knowledge base better and support empirical decisions.
Hopefully I’ll keep working on this and ship it as a real augmentation that everyone in our team can use to explore our knowledge base, Jellyfish.
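For anyone wondering what "measuring the Euclidean distance between entries" looks like, here is a tiny plain-Python sketch. It does not use Weaviate or its client at all, and the entry names and three-dimensional vectors are made up; real embeddings have hundreds of dimensions and come from a model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query: Sequence[float],
            index: Dict[str, Sequence[float]],
            k: int = 2) -> List[Tuple[str, float]]:
    """Return the k entries closest to the query, nearest first."""
    scored = [(name, euclidean(query, vector)) for name, vector in index.items()]
    return sorted(scored, key=lambda item: item[1])[:k]


# Made-up entries and vectors, purely for illustration.
INDEX = {
    "ticket: device offline": (0.9, 0.1, 0.0),
    "ticket: wifi drops": (0.8, 0.2, 0.1),
    "doc: billing FAQ": (0.0, 0.1, 0.9),
}

for name, distance in nearest((1.0, 0.0, 0.0), INDEX):
    print(f"{distance:.3f}  {name}")
```

Entries about similar topics end up close together in the vector space, so the two connectivity tickets rank ahead of the billing document for this query.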

Last but not least, I’m starting some real hardware hacking with a Raspberry Pi, the Raspberry Pi Sense HAT and a stereo microphone HAT. As I’m a hobby saxophonist who hasn’t played for a decade now, I need to train my embouchure. So I came up with the idea that the Raspberry Pi will show a note to play on the Sense HAT and measure the deviation of the played tone via the microphones. Maybe this evolves into a real play-along trainer, let’s see how things go.
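The "deviation" part can be expressed in cents (hundredths of a semitone) once a frequency has been detected. The Python sketch below assumes equal temperament with A4 = 440 Hz and leaves out the hard part, which is pitch detection from the microphone signal. The function names are made up for the example.

```python
import math

A4_HZ = 440.0  # assumed reference tuning


def note_frequency(midi_note: int) -> float:
    """Equal-temperament frequency of a MIDI note number (69 is A4)."""
    return A4_HZ * 2 ** ((midi_note - 69) / 12)


def cents_off(played_hz: float, target_hz: float) -> float:
    """Deviation of the played tone from the target in cents.
    Positive means sharp, negative means flat; 100 cents is one semitone."""
    return 1200 * math.log2(played_hz / target_hz)


target = note_frequency(69)  # the note shown on the display, here A4
print(round(cents_off(435.0, target), 1))  # a slightly flat A4
```

A display could then show the sign and size of `cents_off` as a simple "too low / too high" bar.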

I’m calling out for @markcorbinuk to give us insight into the depths of hardware bringups - How is RISC-V doing? :slight_smile:

1 Like

Oh well, unfortunately Mark has moved on to new adventures, so I’ll try to sum up what he has been up to as part of the balenaOS team.

During the last few months Mark has been working on improving the OS time synchronization, fixing issues that we have identified through our support loop. Things like implementing a fake hardware clock so time is maintained across reboots on devices without a real RTC, implementing an early https time sync service so that system services can start with a good enough time even though NTP sources have not yet synched, ordering time dependent services to start after the time has been synched, and fixing several other smaller issues with the use of Chrony, the NTP client used in balenaOS.
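The fake hardware clock idea can be sketched in a few lines of Python. This is the general pattern only, not the balenaOS implementation: the file location and the `save_time` / `restore_time` helpers are assumptions, and the sketch returns the time to use instead of actually setting the system clock.

```python
import tempfile
from pathlib import Path


def save_time(path: Path, now: float) -> None:
    """Persist the current time (seconds since the epoch), e.g. periodically
    and at shutdown."""
    path.write_text(f"{now}\n")


def restore_time(path: Path, boot_clock: float) -> float:
    """Pick the time to start from after boot: never go backwards past the
    last saved timestamp, but keep the boot clock if it is already ahead."""
    try:
        saved = float(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return boot_clock  # nothing usable saved yet
    return max(saved, boot_clock)


with tempfile.TemporaryDirectory() as tmp:
    stamp = Path(tmp) / "fake-hwclock"  # hypothetical location
    first_boot = restore_time(stamp, boot_clock=0.0)    # nothing saved yet
    save_time(stamp, now=1_700_000_000.0)               # before shutdown
    after_reboot = restore_time(stamp, boot_clock=0.0)  # clock reset by the reboot

print(first_boot, after_reboot)
```

The restored time is still behind real time by however long the device was powered off, which is why a later HTTPS or NTP sync is needed on top.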

An important project that Mark has also been maintaining is our brownfield migrator tool, which allows devices in the field running a variety of operating systems to be migrated remotely to balenaOS.

And finally, Mark has been adding support for the Jetson TX2 device type to our automated test framework. OS testing is an important focus of the OS team right now, as there is a lot of overhead in the manual testing and release process that we are using at the moment. The near-term roadmap includes moving to continuous automated releases of OS updates for all device types and completely automated testing.

Thank you Mark for your contributions to making balenaOS a better operating system.

And hijacking the thread, I might as well fill you in on the stuff I am working on.

A big part of my time has gone into a review of the OS architecture with the aim of simplifying as much as possible the addition of new device types. The new balenaOS architecture is multi-container based, and defines OS blocks that can be developed and maintained independently. This new design will allow us to support new device types by just providing the BSP-specific container block while reusing the rest of the OS.

Moving along that roadmap, I have been working along with @pipex from the supervisor team to support overlay container extensions, the first type of OS blocks that are managed by the supervisor. For this, a new v3 target state is being introduced that moves the product closer to the multi-app utopia, where the system is running several apps (OS, supervisor and user application), while managing them all in the same way not only in the supervisor but also in the cloud API.

Part of the work above has seen me merge the development and production OS variants into a single image that is now runtime configurable into development or production mode, and I am currently working on revamping the OS release versioning so it uses the same mechanism as the supervisor and user apps. Release versioning requires the introduction of an OS contract that will also allow OS blocks to be identified as compatible.

Proper OS release versioning is also required in order to deploy draft OS releases to the cloud, which will allow us to move to a continuous release process in which every PR is deployed as a draft release to the cloud, and automatically marked as final by the automated test framework once validation passes. This will avoid the current delay between a feature being available in meta-balena and released for a specific device type.

For the last few weeks I have also been working alongside @mtoman to finalize the secure boot and full disk encryption work for x86_64 devices that Michal has been working on for the last few months. This is in its final stretch and we hope to have something released very soon.

There are always other minor tasks that need attention, but the above sums up the core of my current work and focus.

And next, I’d like to call out to @lmbarros to explain some of the great work he has been doing on improving the reliability of engine pulls and the healthcheck mechanisms!

3 Likes

Hello @alexgg, and thanks for the callout!

Indeed, a good deal of my work at balena has been around the theme of reliability of balenaEngine in a broad sense. Some months ago I delivered my first sizable contribution to balena: a rework of balenaEngine’s image download mechanism. With this change, the Engine keeps trying to resume downloads after network failures for a longer time and in a smarter fashion. Failed downloads were a large source of friction for customers using cellular modems or other kinds of unstable networking on their devices.
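For readers who want a feel for "keep resuming, but in a smarter fashion", here is a generic resume-with-backoff sketch in Python. It is not balenaEngine's actual logic; `download_with_resume`, `NetworkError`, the retry limits and the fake flaky server are all invented for the illustration.

```python
import time
from typing import Callable


class NetworkError(Exception):
    """Stand-in for whatever error a dropped connection raises."""


def download_with_resume(fetch_chunk: Callable[[int], bytes],
                         total_size: int,
                         max_retries: int = 8,
                         base_delay: float = 1.0,
                         max_delay: float = 60.0,
                         sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Download `total_size` bytes, resuming from the current offset after
    each failure instead of starting over."""
    data = bytearray()
    failures = 0
    while len(data) < total_size:
        try:
            chunk = fetch_chunk(len(data))  # ask for bytes from the current offset
            if not chunk:
                raise NetworkError("empty read")
        except NetworkError:
            failures += 1
            if failures > max_retries:
                raise
            # Exponential backoff, capped so we never wait longer than max_delay.
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
            continue
        failures = 0  # progress was made, so reset the backoff
        data.extend(chunk)
    return bytes(data)


# A fake server that drops every second request.
PAYLOAD = b"0123456789"
calls = {"n": 0}


def flaky_fetch(offset: int) -> bytes:
    calls["n"] += 1
    if calls["n"] % 2 == 0:
        raise NetworkError("link dropped")
    return PAYLOAD[offset:offset + 4]


delays = []  # record the waits instead of actually sleeping
result = download_with_resume(flaky_fetch, len(PAYLOAD), sleep=delays.append)
print(result, delays)
```

Two choices matter most on flaky links: the retry counter resets whenever some bytes arrive, so a slow but progressing download is never abandoned, and already-downloaded bytes are kept across failures.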

By the way, this is a nice illustration of one of the key reasons balenaEngine exists: Docker was pretty much designed for the magic world of data-centers, where bandwidths are unbounded and the network is always up. IoT and edge devices often run in far less welcoming conditions. (I hope nobody takes my description of a data-center literally :sweat_smile:)

Right now I am working to improve the reliability of balenaEngine on devices under extreme loads. Under these circumstances the Engine may become unresponsive for a while, and indeed we have seen a number of cases in which it was killed by systemd’s watchdog. While the workaround is obvious (just increase the timeout), the real solution is to make sure balenaEngine (and balenaOS in general) always remains responsive. This has been a big challenge for me, involving long hours tormenting poor Raspberry Pi Zeros and some deep dives into topics like Linux process and I/O scheduling, cgroups, and systemd.

A secondary current focus of mine is improving image downloads on balenaEngine even further. I am still investigating and doing some proof-of-concept implementations to check what we can do – very early work, but I believe we’ll improve the download reliability a bit more, and hopefully make pulls run faster.

And that’s where I am spending most of my balena time these days! :slightly_smiling_face:

Who’s next? Since I got hooked on the excellent balenaPodcast I got more and more curious about the work of @andrewnhem. So, Andrew, what have you been working on?

5 Likes

What’s up, everyone? I’m Andrew Nhem, Product Builder and Content Strategy Lead. I guess I’d sum up my focus at balena the same way I describe my “hard problem” to solve at balena:

Removing the friction from how balenistas communicate information about products, features, and other helpful things to one another and to the masses.

I help my teammates plan, build, and manage their informative and educational content. This could be publishing and maintaining content on our blog and site, contributing with product releases, the content that supports it, balenaLabs project content, videos, podcasts, and more.

I’m also steadily hacking away at balena’s content operations model, aka contentops. This includes creating protocol and interfacing with tools to make content creation and management smoother across all our active channels.

I’m looking forward to open-sourcing all of these channels so that we can involve more of our amazing community. Huge props to the folks out there who are already working with me to iron things out.

Shameless plug: Please check out and contribute ideas to this edge developer guide that I’d like to make with the community.

Also, feel free to DM me with any questions, feedback, or observations. It’s important to me so that I can make sure what we’re communicating at balena is as useful and helpful as possible.

And with that, I pass the baton to my local homie, @nucleardreamer !

3 Likes