Speeding up compilation with multiple Balena boards

Speeding up compilation time

xkcd: Compiling

Hello friends,

One of the most annoying things in software development is compile time. Even though computers keep getting faster, we still write large pieces of software that take a long time to compile.

I now use a decent ThinkPad T495 with a Ryzen 7 PRO 3700U (4 cores, 8 threads) and 32 GB of RAM (yep, I don’t care about tabs in my browser ;-P). But I still get long compile times on the open source project I work on. Here is how I set up my computer and some Balena boards to compile faster.

Standard setup

I work a lot on ArduPilot, an autopilot for micro vehicles. It is written mainly in C++, so it has to be compiled for its target boards. I compile quite often and use the built-in simulator on Balena boards for testing. That lets me keep coding while the boards run the simulation.

On ArduPilot, we use Waf (https://waf.io/) as the build system. It is similar to CMake but written in Python. I won’t discuss the pros and cons of this build system here, but it is quite efficient and is embedded in the project’s Git repository for reproducible builds. With the default installation, a full build with waf takes 6m7s for the 7 default vehicle targets:

  • bin/antennatracker
  • bin/arducopter
  • bin/arducopter-heli
  • bin/arduplane
  • bin/ardusub
  • bin/blimp
  • bin/ardurover

Unlike make, which needs the number of jobs passed explicitly, waf already uses as many jobs as your machine supports, so we don’t need the -j command line parameter. -j stands for --jobs: the number of compilation jobs to run in parallel. The more jobs you run, the more CPU cores you use for compilation. Generally, you scale the number of jobs with the number of threads your computer supports. My laptop has a 4-core CPU with 8 threads, so I can run 8 compilation jobs in parallel!

So around 6 min for a build isn’t that long, but we can do better. Fortunately, like make and other build systems, waf is smart enough not to do a full rebuild each time we make a change. If you change nothing, or only a few files, only what is needed gets rebuilt, thanks to the built-in cache. But this generally won’t help when we switch Git branch or target architecture or, obviously, run waf clean, which removes the cache.
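As a small illustration of the job count discussed above, the sketch below only reads the number of hardware threads with nproc. The waf calls are left as comments because they assume you are inside an ArduPilot checkout; that they behave the same is my understanding of waf’s default, not something measured here.

```shell
# Count the hardware threads this machine exposes (8 on the laptop described above).
threads=$(nproc)
echo "Hardware threads available: ${threads}"

# With make you would pass the job count yourself, for example:
#   make -j "${threads}"
# With waf the flag is optional; these two calls should behave the same
# (assumption: you are inside an ArduPilot checkout):
#   ./waf
#   ./waf -j "${threads}"
```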

Best standard setup

If you have followed the ArduPilot installation instructions correctly, you should see that when you run waf configure to configure your build target, the output looks like:


Checking for 'g++' (C++ compiler) : /usr/lib/ccache/g++
Checking for 'gcc' (C compiler)   : /usr/lib/ccache/gcc

instead of


Checking for 'g++' (C++ compiler) : /usr/bin/g++
Checking for 'gcc' (C compiler)   : /usr/bin/gcc

What does it mean? In the second case, waf detects GCC and G++ directly as the C and C++ compilers, which is the plain default setup. In the first case, the one we want, waf detects ccache as the compiler. Ccache is a compiler cache: it stores the objects from previous compilations so they can be reused!
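To check which of the two cases applies before running waf configure, you can look at what gcc and g++ resolve to on your PATH. This is a minimal sketch that assumes the Debian-style layout, where /usr/lib/ccache/gcc is a symlink to the ccache binary; is_ccache_wrapper is a hypothetical helper name.

```shell
# Return 0 if the named compiler on PATH resolves to the ccache binary,
# 1 if it resolves to something else, 2 if it is not installed.
is_ccache_wrapper() {
    wrapper_path=$(command -v "$1") || return 2
    # Assumption: Debian-like layout, the wrapper is a symlink to ccache.
    [ "$(basename "$(readlink -f "$wrapper_path")")" = "ccache" ]
}

for compiler in gcc g++; do
    if is_ccache_wrapper "$compiler"; then
        echo "$compiler -> ccache"
    else
        echo "$compiler -> not wrapped by ccache (or not installed)"
    fi
done
```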

On Debian-like systems, you can install ccache easily with sudo apt install ccache. Then

You can use ccache -s to get a summary of your cache usage. In my case:


cache directory                     /home/khancyr/.ccache
primary config                      /home/khancyr/.ccache/ccache.conf
secondary config      (readonly)    /etc/ccache.conf
stats updated                       Thu Jul 22 18:14:12 2021
cache hit (direct)                  424942
cache hit (preprocessed)            79330
cache miss                          296550
cache hit rate                      62.97 %
called for link                     34323
called for preprocessing            6250
compiler produced stdout            1
compiler produced empty output      2
compile failed                      1336
preprocessor error                  955
cache file missing                  61
bad compiler arguments              177
unsupported source language         7
autoconf compile/link               2410
unsupported compiler option         6
unsupported code directive          1
no input file                       1037
cleanups performed                  567
files in cache                      2708
cache size                          49.5 MB
max cache size                      5.0 GB

We can see that ccache has served 62.97% of compilations from the cache instead of recompiling, which is a nice way to speed up your builds!

After a small change to an ArduPilot file, building with waf and ccache this time, I get a build time of 10s. Almost everything is in the cache, so I don’t have to recompile everything!

Sadly, that won’t work in all cases, and plenty of times a full, long build is still needed.

So, how can we speed up compilation?
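The hit rate that ccache prints can be recomputed from the counters in the summary: hits (direct plus preprocessed) divided by hits plus misses. A small sketch using the numbers above; hit_rate is a hypothetical helper name, not a ccache command.

```shell
# hit_rate DIRECT PREPROCESSED MISS -> percentage of lookups served from cache
hit_rate() {
    LC_ALL=C awk -v d="$1" -v p="$2" -v m="$3" \
        'BEGIN { printf "%.2f %%\n", 100 * (d + p) / (d + p + m) }'
}

# Counters taken from the ccache -s output shown above.
hit_rate 424942 79330 296550   # prints "62.97 %"
```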

Using another computer to speed up compilation

This is where Balena boards can be useful! Using Docker, we get a reproducible environment, and then we can use a utility called distcc that distributes the compilation tasks across multiple computers! So instead of using only the CPU of your computer, you also use the CPUs of the other computers on the distcc network. As this relies on TCP/IP connections to distribute the tasks, it is better to have a good network and to prefer a wired connection over Wi-Fi.

That’s what I am going to do with 3 boards on Balena:

  • 1 RPI 3
  • 1 Odroid C1+
  • 1 Odroid XU4

Those aren’t faster at compilation than my computer, but they each have 4 CPU cores, so in total I should be able to compile with 8 + 3*4 = 20 CPUs!

Unfortunately, there is an important limitation: distcc needs the same compiler on every computer in the network. That means my computer and all the boards should have the same version of GCC. In this project, I have simplified things by building for the ARM architecture, which is the processor architecture of the boards I use. My computer is an x86_64 machine. While cross-compiling for ARM from x86_64 is easy, compiling for x86_64 from ARM boards isn’t straightforward.

Setup on Balena Side

On Ubuntu the installation is simple:

sudo apt install distcc gcc g++

Obviously, you need a compiler to make it work.

So that is what I put in my Dockerfile. I create an image with the same Ubuntu version as my computer, install a compiler and distcc, and done!

Well, not exactly! Remember, we need the same compiler on both sides! On my computer, I will cross-compile, since I am building for an ARM architecture, so I will use the arm-linux-gnueabihf-gcc compiler. So I need this one on the Balena side too.


Wait a minute. Balena boards are ARM boards, so the default compiler is already arm-linux-gnueabihf-gcc, isn’t it?

Yes, but no. Let me explain.

Yes, the default gcc on the RPI and Odroid is arm-linux-gnueabihf-gcc. But that isn’t necessarily true for all RPIs. The RPI4 will most likely have aarch64-linux-gnu-gcc, as it defaults to ARM64. But that isn’t the main issue! As distcc expects the same compiler on both sides, you have no guarantee that gcc is exactly the same as the toolchain you want. So let’s just install arm-linux-gnueabihf-gcc on the Balena boards. In my case that will just be a link to the default gcc, but it will match my computer’s compiler name and version perfectly!

Here is the resulting Dockerfile.
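One simple way to confirm that both sides match is to print the compiler’s version and target on the laptop and inside each container, then compare the lines by eye. The sketch below relies only on GCC’s -dumpversion and -dumpmachine options; compiler_id is a hypothetical helper name.

```shell
# Print "name version target" for a compiler, or fail if it is missing.
compiler_id() {
    command -v "$1" >/dev/null 2>&1 || { echo "$1: not installed" >&2; return 1; }
    echo "$1 $("$1" -dumpversion) $("$1" -dumpmachine)"
}

# Run this on every machine; the printed lines should be identical.
compiler_id arm-linux-gnueabihf-gcc || true
compiler_id arm-linux-gnueabihf-g++ || true
```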


FROM ubuntu:21.04

# curl is only needed by the HEALTHCHECK at the end of this file
RUN apt-get update && apt-get install --no-install-recommends -y \
    build-essential \
    gcc \
    g++ \
    gcc-arm-linux-gnueabihf \
    g++-arm-linux-gnueabihf \
    distcc \
    curl \
    && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

# This is the operations port
EXPOSE 3632
# This is the statistics port
EXPOSE 3633

# Create a default distcc user, as the distcc daemon doesn't like running as root
RUN groupadd -r distcc && useradd --no-log-init -r -g distcc distcc
# Use the distcc user
USER distcc

# Launch distccd as a daemon in the foreground, gather statistics, listen on all
# interfaces, allow any computer to connect, set the log level to info, send the
# log to the terminal so Balena can show it, and explicitly pass a log file in /tmp
ENTRYPOINT /usr/bin/distccd --no-detach --daemon --stats --user distcc --listen 0.0.0.0 --allow 0.0.0.0/0 --log-level info --log-stderr --log-file /tmp/distccd.log

# We check the health of the container by checking if the statistics
# are served. (See
# https://docs.docker.com/engine/reference/builder/#healthcheck)
HEALTHCHECK --interval=5m --timeout=3s \
    CMD curl -f http://0.0.0.0:3633/ || exit 1
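From the laptop, the same statistics port can be used to check that a board is answering before starting a build. This is a rough sketch, not part of the original project: distccd_alive is a hypothetical helper, it assumes curl is installed on the laptop, and the address is a placeholder for one of your boards.

```shell
# Return 0 if the distccd statistics page answers on the given host.
# 3633 is the statistics port exposed by the container.
distccd_alive() {
    curl -fs --max-time 3 "http://$1:3633/" >/dev/null 2>&1
}

# Placeholder address: replace it with the IP of one of your boards.
if distccd_alive 192.168.1.42; then
    echo "distccd is up"
else
    echo "distccd is not reachable (or curl is missing)"
fi
```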

Of course, there isn’t any login or security here: it accepts every compilation request, so do not put your boards on the Internet without protection. I invite you to read the distccd documentation if you want to do so. IMHO, the simplest way to access the boards remotely would be to use a VPN.

Now you have distcc on your Balena boards waiting for compilation orders. You need to show waf how to use it.

I have created a new file called distcc_config with this content:


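# Let ccache hand cache misses over to distcc
export CCACHE_PREFIX="distcc"
# Compilers that waf will pick up
export CC="ccache arm-linux-gnueabihf-gcc"
export CXX="ccache arm-linux-gnueabihf-g++"
# host/jobs[,options] for each machine taking part in the build
export DISTCC_HOSTS='localhost/8 192.168.1.42/4,lzo 192.168.1.25/4,lzo 192.168.1.27/4,lzo'
export DISTCC_JOBS=$(distcc -j)

echo "Building with $DISTCC_JOBS parallel jobs on following servers:"
for server in $(distcc --show-hosts); do
    # Keep only the host part, dropping the port and options
    server=$(echo "$server" | sed 's/:.*//')
    printf '\t%s\n' "$server"
done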

Here is what it does:

  • CCACHE_PREFIX lets us use distcc in combination with ccache on our computer. It would be a shame not to have it.

  • export CC and export CXX explicitly set the compilers that waf will use, and therefore the ones distcc runs.

  • DISTCC_HOSTS, unfortunately, needs to be set manually. It tells distcc which computers to use and how many jobs each can handle. In my case, localhost/8 is my laptop: 8 jobs. 192.168.1.42/4,lzo is my Odroid XU4: 4 jobs, with lzo to compress the files being sent.
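To make the host/jobs syntax concrete, the sketch below adds up the job slots declared in a DISTCC_HOSTS string. It is only an illustration: total_jobs is a hypothetical helper, distcc -j remains the authoritative answer, and the fallback of 4 for entries without an explicit limit is an assumption based on my reading of the distcc manual.

```shell
# Sum the /N job limits found in a DISTCC_HOSTS-style string.
total_jobs() {
    total=0
    for entry in $1; do
        case "$entry" in
            */*)
                limit=${entry#*/}      # drop "host/"
                limit=${limit%%,*}     # drop ",lzo" and other options
                ;;
            *)
                limit=4                # assumption: default when no limit is given
                ;;
        esac
        total=$((total + limit))
    done
    echo "$total"
}

total_jobs 'localhost/8 192.168.1.42/4,lzo 192.168.1.25/4,lzo 192.168.1.27/4,lzo'   # prints 20
```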

Now source the distcc_config file and invoke waf -j $(distcc -j) to ask waf to compile with distcc’s maximum number of jobs, in my case 20 jobs.

You said we don’t need to pass the -j parameter to waf! Yes, but that doesn’t work with distcc, as waf only counts the processors on the local computer.

You can watch what happens with distccmon-gnome; here is an example of a compilation on my computer.

The result is a full compilation in 4m45s, with the drawback of using my network a lot, but as nobody watches 4K video on it, that isn’t an issue for me!

In my configuration, the Balena boards were all connected to the Wi-Fi router, but my computer was in another room, connected over Wi-Fi. That definitely slows down the workload distribution.

Still, we can see that even with some slow boards we can get some gain.
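Put together, the steps look roughly like the sketch below. pick_jobs is a hypothetical helper that falls back to the local thread count when distcc is not usable, and the final waf call is only echoed because it assumes you are inside an ArduPilot checkout that has already been configured.

```shell
# Pick the job count for waf: distcc's total when distcc is usable,
# otherwise the number of local hardware threads.
pick_jobs() {
    if command -v distcc >/dev/null 2>&1 \
        && n=$(distcc -j 2>/dev/null) && [ -n "$n" ]; then
        echo "$n"
    else
        nproc
    fi
}

# Load the settings described above, if the file is in the current directory.
if [ -f ./distcc_config ]; then
    . ./distcc_config
fi

echo "Would run: ./waf -j $(pick_jobs)"
```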

Limits

Limits to distcc usage:

  • You need the same version of the compiler on each computer.

  • It will use your network a lot to transfer files to compile and compilation results.

  • Using your dusty RPIs won’t help much compared to a decent CPU. With 12 extra cores, I only gained 1m15, so about a 20% time reduction. Using my desktop computer, which has an old Intel i5, I got around the same time reduction. So more processors do not guarantee the fastest compilation.

  • Those test results are from the average of 3 runs.

Future works

  • Try it on a Balena Yocto build. I didn’t have time to do it, as I need my computer to work!
  • Demonstrate how to create a new compiler to compile every target from every board!

Project repo

WOW Pierre. I need to take some time to thoroughly review this. Incredible stuff here though.