Canary deployment on Balena using GitLab CI/CD

Hi, I’ve made a small project to show how to achieve canary deployment / incremental roll-out of Balena releases using GitLab CI/CD. The project can be found here. I’ve included the project’s README as a reference:

Quick overview

We’ll add canary deployment to your existing Balena-based application using a few steps:

  1. Define the canary stages in your Balena application.
  2. Add required files to your repository.
  3. Configure the GitLab CI/CD pipeline.
  4. Create a new release that uses canary deployment.

Goal

Releasing new features to remote IoT devices carries risk. Canary deployment mitigates it: instead of updating the entire fleet at once, updates are first released to a small subset of the fleet, say, 10% of the devices. After a delay, the update is released to an increasing number of devices until all devices have been updated. This strategy gives a higher chance of catching issues before they reach the whole fleet. The deployment process is usually automated using a technique called Continuous Deployment (CD).

Balena is a tool for IoT developers that takes care of (cross-)building your Docker-based application and robustly deploying it to devices in the field. Balena devices are grouped by applications. Out-of-the-box, there is no support for canary deployment. However, by combining the Balena CLI and Python SDK, it is quite easy to extend functionality for this use case.

This demo project shows how to combine GitLab CI/CD and Balena to set up canary deployment for your application. The implementation is easy to port to other CI/CD providers using the provided tips.

Add canary deployment to your existing application

Note: it’s assumed your project is hosted on gitlab.com and your Balena application is available from the root of your project (i.e., you can run balena push from there).

Define canary stages

You can define the canary stages by adding a tag to each device in your Balena application. To assign a particular device to stage N of your deployment (N = 1, 2, …), set the following tag:

rollout-stage: N

In this example, it is assumed that there are 3 stages, although it is easy to modify the code for fewer or more stages. A possible distribution for deployments with 3 stages is:

  1. 10% of your devices: rollout-stage: 1.
  2. 30% of your devices: rollout-stage: 2.
  3. 60% of your devices: rollout-stage: 3.
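A split like the one above can be computed before tagging the devices. The following is a minimal sketch, not part of the project: the function name assign_stages and the seeded shuffle are assumptions made for illustration, and it only decides which stage each device belongs to. Setting the actual rollout-stage tag is still done in the Balena dashboard or through the SDK.

```python
import random


def assign_stages(device_ids, fractions=(0.1, 0.3, 0.6), seed=0):
    """Return {device_id: stage}, with stages numbered from 1.

    `fractions` is the share of the fleet per stage and should sum to 1.
    """
    ids = list(device_ids)
    # Shuffle with a fixed seed so the assignment is random but reproducible.
    random.Random(seed).shuffle(ids)

    stages = {}
    start = 0
    cumulative = 0.0
    for stage, fraction in enumerate(fractions, start=1):
        cumulative += fraction
        # The last stage takes every remaining device, so rounding never drops one.
        if stage == len(fractions):
            end = len(ids)
        else:
            end = round(cumulative * len(ids))
        for device_id in ids[start:end]:
            stages[device_id] = stage
        start = end
    return stages
```

With 20 devices and the default fractions, this puts 2 devices in stage 1, 6 in stage 2 and 12 in stage 3.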

Finally, ensure the release policy of your application is set to “Pinned to…” instead of “Tracking latest”. Otherwise, devices that have not been pinned yet would update immediately during your first canary deployment.

Add required files to your application repo

Add the following files from this project to the root of your repo:

  • .gitlab-ci.yml: defines build and deploy jobs, as well as the delays between them.
  • deploy.py: script to pin a set of tagged devices to a new release.
  • Dockerfile.ci: defines Docker image for the CI/CD runner. Contains Balena CLI and Python SDK.
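To give an idea of what a script like deploy.py has to do, here is a rough sketch of the core step: pick the devices tagged for one stage and pin each of them to a release. This is not the project's actual code. The device data shape (a list of dicts with "uuid" and "tags") and the pin_device callable are assumptions for illustration; in a real script they would be backed by calls to the Balena Python SDK.

```python
def devices_in_stage(devices, stage, tag_key="rollout-stage"):
    """Return the UUIDs of devices whose rollout-stage tag equals `stage`.

    Tag values are compared as strings; devices without the tag are skipped.
    """
    return [
        device["uuid"]
        for device in devices
        if device.get("tags", {}).get(tag_key) == str(stage)
    ]


def deploy_stage(devices, stage, release_id, pin_device):
    """Pin every device in `stage` to `release_id`, one after the other.

    `pin_device(uuid, release_id)` is a hypothetical callable that performs
    the actual pinning.
    """
    uuids = devices_in_stage(devices, stage)
    for uuid in uuids:
        pin_device(uuid, release_id)
    return uuids
```

Passing pin_device in as a parameter keeps the selection logic testable without a Balena account.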

Setting up GitLab CI/CD pipeline

In the GitLab project, go to “Settings” -> “CI/CD” -> “Variables” and add two environment variables:

  1. BALENA_API_KEY: a valid API key for your Balena account.
  2. BALENA_APP_NAME: the name of the Balena application.

Then, go to “Packages & Registries” -> “Container Registry” and note the tag, denoted by $GITLAB_CONTAINER_TAG here. Example: registry.gitlab.com/pascal.hwky/balena-canary. Build and upload the Docker image defining the runtime of the CI/CD runner to this tag by using the following commands:

docker build -t $GITLAB_CONTAINER_TAG -f Dockerfile.ci .
docker push $GITLAB_CONTAINER_TAG

Finally, open .gitlab-ci.yml with a text editor and update the values denoted by TODO. The default values are used to demonstrate canary deployment with the simple application defined in this project (a single Dockerfile with an infinite sleep command).

Pushing a new release using canary deployment

Instead of manually running balena push, create a release with canary deployment by simply committing to your Git repository. The GitLab CI/CD pipeline will, by default, run automatically on all branches. To restrict deployments to specific branches (commonly master), add the only keyword to .gitlab-ci.yml (see docs).

Customizing the deployment process

As described in the section below, this project contains a minimal implementation. It is likely that you need to extend the functionality by modifying deploy.py or .gitlab-ci.yml.

Integrating with other CI/CD providers

Although this example runs on GitLab CI/CD, the core of the implementation depends on the Balena CLI and the Python script deploy.py. Therefore, it should be straightforward to port this project to other CI/CD providers. Some things to consider:

  • The commands that should be run are defined for GitLab CI/CD in .gitlab-ci.yml. Other CI/CD providers will have different formats. Ensure that there is a way to delay execution of deploy stages and pass the release ID between stages.
  • The runtime environment needs the Balena CLI for balena push and the Balena Python SDK for running deploy.py. In GitLab CI/CD, this is ensured by using the Docker executor with an image built from Dockerfile.ci.
  • The Balena API key and application name need to be exposed to the runtime. In GitLab CI/CD, this is handled by adding runner variables.
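Two of the points above, reading the credentials from the environment and passing the release ID between stages, can be handled in a provider-independent way. The sketch below is one possible approach under stated assumptions: the helper names and the release_id.txt file are hypothetical and not taken from the project, and it assumes your CI/CD provider can carry a file from one stage to the next (for example as a build artifact).

```python
import os
from pathlib import Path


def load_settings(env=None):
    """Read the Balena API key and application name from the environment."""
    env = os.environ if env is None else env
    required = ("BALENA_API_KEY", "BALENA_APP_NAME")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing CI/CD variables: " + ", ".join(missing))
    return env["BALENA_API_KEY"], env["BALENA_APP_NAME"]


def save_release_id(release_id, path="release_id.txt"):
    """Write the release ID to a file that later stages can pick up."""
    Path(path).write_text(str(release_id))


def load_release_id(path="release_id.txt"):
    """Read back the release ID written by the build stage."""
    return Path(path).read_text().strip()
```

Failing early on a missing variable gives a clearer error than a rejected API call halfway through a deployment.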

Limitations

Since this is a demo project and not intended for production usage, there are a few limitations:

  • If a release takes longer to build than the time between the build stage and the first deploy stage, the entire deployment will fail. This is because deploy.py asserts that the release status is “success”. You should include additional checks and wait loops to allow for longer builds.
  • There is no check to see if all devices in the specified application are updated after the last stage. Two situations in which this might not be the case are when a device is added during the deployment, or when devices are not properly tagged with rollout-stage or the associated stage is higher than the number of stages defined in .gitlab-ci.yml.
  • The calls to the Balena Python SDK in deploy.py are a bit slow. Since each device is processed sequentially, the deployment can take a long time for stages with a large number of devices. Consider adding concurrency to speed it up.
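For the last limitation, a thread pool is one straightforward way to add concurrency, since the time is spent waiting on network calls. The following is a sketch only: pin_device is a hypothetical callable standing in for whatever SDK call pins a single device, and whether one SDK client can safely be shared between threads is an assumption you should verify first.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def pin_concurrently(uuids, release_id, pin_device, max_workers=8):
    """Pin all devices to `release_id` using a pool of worker threads.

    Returns {uuid: exception} for devices that could not be pinned, so one
    failing device does not abort the rest of the stage.
    """
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its device so failures can be reported.
        futures = {
            pool.submit(pin_device, uuid, release_id): uuid for uuid in uuids
        }
        for future in as_completed(futures):
            error = future.exception()
            if error is not None:
                failures[futures[future]] = error
    return failures
```

An empty return value means every device in the stage was pinned. Keep max_workers modest to avoid running into API rate limits.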

@pdboef this looks amazing, I am definitely going to give this a spin. Great work and thanks for sharing it here!


Hey @pdboef, this is excellent work, going to send you an email - be on the lookout!

I’ve pushed a few updates to the project:

  • The deploy.py script will no longer fail if it is called while the release is still building. Instead, it will sleep for a while and try again.
  • After a release has been rolled out to the last stage, the application release policy will be pinned to the new release. In this way, devices added during deployment - or devices without the rollout-stage tag - will get the release as well.
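For anyone adapting this to their own script, the sleep-and-retry behaviour from the first point can look roughly like the sketch below. It is not the code from the project: get_status is a hypothetical callable that returns the current release status as a string, and the timeout and interval defaults are arbitrary. Only the “success” status comes from the README; every other status is simply treated as "not ready yet" until the timeout expires.

```python
import time


def wait_for_release(get_status, timeout=1800.0, interval=30.0, sleep=time.sleep):
    """Poll `get_status()` until it returns "success" or `timeout` seconds pass.

    `sleep` can be replaced in tests to avoid real waiting.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "success":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"release status is still {status!r} after {timeout} seconds"
            )
        # The build is not done yet: wait before asking again.
        sleep(interval)
```

A deploy stage would call this once before pinning any devices, so a slow build delays the stage instead of failing it.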

:tada: Nice one!