TL;DR: Is there an upper limit on the number of containers we can run in a multi-container application on the balenaFin?
We’re building a multi-container application on the balenaFin with 6 containers: 4 are Flask servers handling different modules of the application, while 2 run `while True` loops processing incoming data from an I2C bus.
After the first build completes in local mode, livepush becomes quasi-responsive and often drops the SSH connection. Live push then stops working entirely, we cannot connect to the device through the CLI, and we have to restart the device and rebuild the application. CPU usage runs at about 170% across all containers, out of the roughly 400% available given the 4 cores of the compute module.
We’re tracking down a few hypotheses, but one primary concern is that we cannot rebuild the Docker image while all the containers are running in the background; that is when we typically lose the connection.
Each of our containers uses approximately 10–15 MB of memory, nothing crazy there, so we are led to believe that CPU usage or the number of running containers is the culprit. As we investigate, we would love to know if there is an upper limit on the number of containers and/or CPU utilization beyond which Docker builds and/or livepush become unstable.
Hi there – I’m not certain if we have a ceiling for the number of containers, but you definitely shouldn’t be hitting any such limit with 6 containers.
CPU could be a limit; one way to check this would be to comment out all but one container in your docker-compose.yml, push that, then uncomment the other containers one at a time to see where you run into problems (a minimal sketch of what I mean is below). Can you give that a try and let us know what you see?
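For illustration only, something along these lines — the service names and paths here are placeholders, not your actual ones:

```yaml
version: '2.1'
services:
  flask-module-1:            # hypothetical name; start with a single service enabled
    build: ./flask-module-1
  # flask-module-2:          # re-enable these one at a time between pushes
  #   build: ./flask-module-2
  # i2c-poller:
  #   build: ./i2c-poller
  #   devices:
  #     - "/dev/i2c-1:/dev/i2c-1"
```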
I’m also curious which Raspberry Pi module you’re using in your Fin.
All the best,
Hugh
Hi Hugh,
Thanks for the response. While experimenting with enabling/disabling containers, we’ve found that we can either run the 2 CPU-intensive containers OR the 4 individual low-CPU containers, but not both. It seemed like it could be either CPU or the number of containers, but your answer is reassuring that it is more likely a CPU issue.
We’re running on the balenaFin with the CM3+L. The CPU-intensive containers use `while True` loops to monitor I2C traffic and update device state. The Python scripts themselves are not significantly demanding, but the loops seem to take up all the available CPU capacity (at least on that core), with the I2C bus running at 400 kHz.
I’m curious if you have any insight into how CPU load impacts Supervisor, Docker build, and Live Push reliability. Should we be actively load balancing across cores using cpuset etc. via docker-compose.yml? That’s something we’re thinking of trying next; we’re just not sure how to give balenaOS more room to breathe. Should we be trying to keep individual core utilization under some threshold like 50%? Is there a way to dedicate the Supervisor to its own core? Am I even correct in assuming we can utilize all 4 cores?
We’re working on a robotics application, so I anticipated diving into the deeper realms of balenaOS and the Fin, but I assumed Docker would be doing core load balancing in the background.
I also agree with Hugh that there are no container limits to follow, only those imposed by what the hardware resources can support. If the device is running low on CPU, I imagine device API call response times will suffer, possibly even time out, but more significantly, local mode builds will be slowed. Live Push is intelligent enough to execute only the commands needed to apply your newest change, and they run inside the container. This means no image is being built, so it shouldn’t need a lot of CPU; however, if your change involves downloading and compiling a new dependency, then that could be affected.
It definitely sounds like you want to get the most out of the hardware you have, which is awesome. I think we should try experimenting with the various CFS scheduler options from https://docs.docker.com/config/containers/resource_constraints/#configure-the-default-cfs-scheduler. We could pin the high-CPU containers to their own cores via `cpuset-cpus` and limit their usage via `cpus`. This could keep other processes from becoming blocked; a rough sketch of what that might look like in a compose file is below.
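As a sketch only — the service names are placeholders, and the exact key names that apply depend on the compose file version your setup supports (these are the v2-style keys corresponding to the `--cpuset-cpus` and CFS quota options):

```yaml
version: '2.1'
services:
  i2c-poller:                # hypothetical high-CPU service
    build: ./i2c-poller
    cpuset: "3"              # pin this service to core 3 only
    cpu_quota: 80000         # with the default 100ms CFS period, cap at ~80% of one core
  flask-api:                 # hypothetical low-CPU service
    build: ./flask-api
    cpuset: "0,1,2"          # keep it off the pinned core
```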
However, I believe this only works if live push, local mode builds, the OS, and the Supervisor don’t run on the core we designate as the high-load core. Try seeing where these services are running and whether we can achieve the above core separation. I’ll look into this myself and see if it is a good idea / possible!
Thanks for the response. We’re definitely thinking along the same lines and are exploring the CFS options to move containers around and redistribute the load. We’re running into two questions, which you allude to: 1) can we define which core balena services run on so they don’t conflict with high-CPU containers, and 2) is there a way to identify which cores containers are running on in general?
I reached out to the OS team for their input on this strategy and I’ll let you know what I find. Currently, I don’t know of a way to set Docker configs for the Supervisor container, which would allow us to make sure it runs on a specific core and never gets blocked. Same goes for the OS. I’ve set a reminder for myself to get back to you within 24 hours.
Hey Alexander, I have confirmed there is no way to set CPU affinity for the OS, which would allow us to make the OS run on a specific core. It seems the best approach is simply managing how much CPU the containers use, since they are blocking other processes. Another user recently reported a very similar scenario, so you might be able to use some of the tricks they have tried: Balena-engine crashing during livepush
Great information, thank you very much. Core affinity would be an interesting feature for high-CPU applications like ours. Ebradbury is on our team and was looking at this from another angle, and yes, it seems like we’ve resolved the issue by reducing CPU usage in our loops and reducing logging output.
Awesome. This is a really cool application you are working on, and I’ll continue to think about and advocate for more tooling that would help people manage their high-workload containers. Thanks for sharing your experience, because I’m sure someone else will stumble upon this.
Hi @alexanderkjones,
I was wondering what the realistic update rate of your sensors is. On Augie, my hexapod project, I get updates from the IMU at 20 Hz, but I’ve had it as high as 100 Hz without issue, because even at 100 Hz the loop still spends most of its time sleeping. I have a few suggestions:
- Figure out what refresh frequency you actually need from your sensors and add corresponding sleeps to your loops (see the sketch after this list). Even a teeny tiny sleep will give your machine much more breathing room.
- Double-check the code that’s polling the sensors for you. It’s possible it’s retrieving more registers than it actually needs, which will significantly increase the number of transactions over the I2C bus.
- You said you have two separate processes in tight loops polling these sensors? Only one process can access the bus at a time, so there will always be one process in `IOWAIT` while the other is blocking, although this is likely on a per-transaction basis (see point 2). I would suggest combining both into a single process so that you, rather than the OS, decide in what order transactions happen on the I2C bus.
- Logging is expensive. Kill it as much as possible.
- If all else fails, the Fin coprocessor (a Silicon Labs BGM111) is also connected to the I2C bus, so you could move your I2C code there and just stream the data in via the UART (I do this with Augie using a Teensy). You’ll get a much better timing guarantee, and reading a stream from the UART is much less work for your Python code. See the balena-fin-coprocessor-firmata project for an example of how to build and deploy a coprocessor firmware.
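As a rough illustration of the first and third points, here’s a minimal sketch of a single combined polling loop that sleeps away the rest of each 20 Hz cycle. The sensor addresses, registers, and read functions below are purely illustrative placeholders, and it assumes the `smbus2` package:

```python
import time
from smbus2 import SMBus  # assumes smbus2; swap in whatever I2C library you actually use

POLL_HZ = 20                 # the rate you actually need, not "as fast as possible"
PERIOD = 1.0 / POLL_HZ

def read_imu(bus):
    # Placeholder: read only the registers you actually need (hypothetical address/register)
    return bus.read_i2c_block_data(0x68, 0x3B, 6)

def read_encoder(bus):
    # Placeholder second sensor, polled from the same loop instead of a second process
    return bus.read_i2c_block_data(0x40, 0x00, 2)

with SMBus(1) as bus:
    while True:
        start = time.monotonic()
        imu = read_imu(bus)
        enc = read_encoder(bus)
        # ... update state / hand off to other containers here ...
        # Sleep away whatever is left of the period so the core isn't pegged.
        time.sleep(max(0.0, PERIOD - (time.monotonic() - start)))
```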
Hope that helps!
James.
Oh, and one more thing: make sure you’re setting the I2C bus speed to the maximum of 400 kHz (provided all your devices support it) by setting `BALENA_HOST_CONFIG_i2c_baudrate=400000` in the application configuration variables. The default is 100 kHz, so you’ll be able to access the devices much faster.
@jimsynz thanks for the tips. For anyone following this thread in the future: we’re already using 400 kHz on the I2C bus, and more and more it’s looking like logging is the bottleneck when running more containers. We’re polling the I2C bus and attached sensors at 20 Hz.
Are there performance resources we could look at describing the effects of logging, @jimsynz, or is your comment anecdotal?
OK, we’ve found the issue, which raises more questions! It appears that HTTP requests between containers have a significant impact on CPU usage. We have 1 container running a loop requesting I2C data on a 400 kHz bus, and that works fine. That container then sends an HTTP request to a second container (a simple Flask server) to update a state object with 4 variables.
We are seeing 30% CPU usage for 15 HTTP requests per second between the two containers. When we reduce that to 1 request per second, CPU usage drops to 3%.
This seems to be the cause of our system instabilities, and it surprised us. Also, HTTP requests between containers take approximately 3 ms each. We had previously tested with MQTT and found the same result of roughly 3 ms per transaction on the balenaFin.
Are these performance results normal? We did not expect messaging between containers to be so expensive.
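For anyone wanting to reproduce this kind of measurement, the timing loop behind numbers like these would look roughly as follows; the service hostname, endpoint, and payload are illustrative, not our exact code:

```python
import time
import requests  # assumes the requests package is installed in the polling container

URL = "http://state-server/update"  # hypothetical compose service name + endpoint
payload = {"x": 0.0, "y": 0.0, "z": 0.0, "theta": 0.0}  # a small 4-field state object

N = 100
start = time.monotonic()
for _ in range(N):
    requests.post(URL, json=payload, timeout=1.0)
elapsed = time.monotonic() - start
print(f"{N} requests in {elapsed:.2f}s -> {1000 * elapsed / N:.1f} ms per request")
```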
Hi Alexander, I can’t say I have thoroughly tested inter-container communication performance to provide any highly valuable context, but if you are seeing roughly 3 ms with HTTP as well as with MQTT, then that seems to be the bottleneck. No matter the protocol, it’s 3 ms, so I’d suspect that’s the best the Raspberry Pi Compute Module’s processor can do.
It might also be worthwhile to test the same containers / code on another device type if you happen to have anything else? A Jetson Nano, Beaglebone, or other similar board might make for an interesting comparison.
@dtischler if I get the chance I will test on another device. I’m curious, though, specifically about the CPU usage, i.e. that HTTP requests on the balenaFin would take so much CPU, since multi-container is a preferred modality and, I’m assuming, HTTP is the usual way to talk between services.
I don’t know that we would state there is a “preferred” methodology for the balenaFin, as this is more of a software architecture issue. Your intended goals and outcomes drive the design of the software stack. I think that’s why I am curious about trying a different device for comparison purposes. I suspect it’s not a balenaFin issue per se, but more a case of “this is what to expect from a low-power IoT device based on Arm cores” versus a desktop-class processor. Hope that makes sense!
@dtischler et al., thanks for the responses. After more testing, I’ve found that the issue may be our use of the Alpine base image (balenalib/raspberrypi3-alpine-python) vs. balenalib/fincm3-python. Live push performance seems to improve when using the fincm3 image for the containers, but I will continue to verify. I was able to run 10 Flask containers, each making 3 HTTP requests per second to the same client container, at 60% CPU usage, with logging, without issue, which was a big improvement.
Most importantly, live push was much more responsive, even while running all these requests and logging. This leads me to think our use of the Alpine base image may have been causing some of the live push build fragility. I tried to rebuild with Alpine and the build would not complete. Shooting in the dark here; these are really just the results of my heuristic tests.
Also, I believe we’re running into the limitations of HTTP requests. They are taking 30 ms, not 3 ms, which from my research is actually typical. That’s not the most performant option for inter-container communication in a robotic system, so we’re researching a move to Redis, which, according to the official benchmarks, should be closer to 3 ms and would give a lot more room in a 20 Hz cycle.
In general the balenaFin and balenaOS are a pleasure to work with; we’re really just figuring out the best setup for this application. Any insight on the Alpine vs. fincm3 topic would be helpful, and we will post final results when we find a stable setup for our control system.
Interesting. My best guess would be that Alpine uses musl libc while the standard base images, which are based on Debian, use glibc. Although I find it strange that you would get a noticeable performance boost just from switching between the two… Are you installing any dependencies in the containers that could affect your app?
OK, so we’ve confirmed the base image in our Dockerfiles was indeed the issue. Our livepush and general system instability was caused by using `FROM balenalib/raspberrypi3-alpine-python:3.8-latest`. When we replaced the base image for all 6 of our containers with the image used in the balena multi-container example, `FROM balenalib/fincm3-python:3-stretch-run`, all became well. Very snappy.
Curious about anyone’s thoughts. Not sure if we were using Docker image tags incorrectly, which may have caused the issue? Regardless, it’s a breath of fresh air to have made progress on this.
Also! For our application we can confirm <1 ms response time for requests using Redis, as opposed to >30 ms using HTTP. We’ll be switching to Redis for sending data between containers from now on. This is a game changer for us and will make working within our 50 ms hardware loop cycle much more friendly.
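For anyone curious, the pattern we’re moving to is roughly the following; the Redis service name, key, and fields below are illustrative rather than our exact code, and it assumes a `redis` service in docker-compose.yml plus the redis-py package in the app containers:

```python
import redis  # assumes the redis-py package

# Connect to the Redis container by its compose service name (illustrative).
r = redis.Redis(host="redis", port=6379, decode_responses=True)

# Producer side (I2C polling container): write the latest state as a hash.
r.hset("robot:state", mapping={"x": 1.0, "y": 2.0, "z": 0.5, "theta": 0.1})

# Consumer side (another container): read the latest state back.
state = r.hgetall("robot:state")
print(state)  # {'x': '1.0', 'y': '2.0', 'z': '0.5', 'theta': '0.1'}
```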