An average user CPU usage of 35% is reported on my Raspberry Pi which cannot directly be attributed to any container

Hi,

In the chart below you can see the CPU usage of my Raspberry Pi device running balenaOS.
It reports an almost constant “user CPU” percentage of 35%.

But as you can see in the chart below, this usage doesn’t correspond to the total CPU usage of all the containers (which is less than 5%).

Below you can see the actual process that is responsible for this high user CPU percentage:

Mem: 949180K used, 49936K free, 12116K shrd, 60436K buff, 261860K cached
CPU:  32% usr   6% sys   0% nic  60% idle   0% io   0% irq   0% sirq
Load average: 1.40 1.54 1.72 4/498 14147
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
  718     1 root     S    1182m 121%  30% /usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena --delta-storage-driver=aufs --log-driver=journald -s aufs --data-root=/mnt/sysroot/inactive/balena -H unix://
  612     1 root     S     9044   1%   1% @sbin/plymouthd --tty=tty1 --mode=boot --pid-file=/run/plymouth/pid --attach-to-session --kernel-command-line=plymouth.ignore-serial-consoles splash
29703 29686 root     S     918m  94%   1% balena-containerd --config /var/run/balena/containerd/containerd.toml

So why is this process using so much CPU, and can this be fixed?

FYI, I think this issue started when I enabled the monitoring of the containers in my telegraf container (see the [[inputs.docker]] section of my telegraf.conf).

I am kind of thinking out loud here… First, it would be useful to confirm whether the issue is related to the telegraf container, as you’ve hinted. If you pause or stop the telegraf container and run the top command again on a terminal (I guess the per-process CPU usage you shared was produced by the top command), does the CPU usage of balenad drop considerably?
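A minimal sketch of that check, run on the Host OS terminal (the container name is an assumption; check balena ps for the real one):

  balena ps                # find the name of your telegraf container
  balena stop telegraf     # 'telegraf' is a placeholder name; adjust as needed
  top                      # does balenad's %CPU drop noticeably?
  balena start telegraf    # start it again afterwards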

I can imagine that if telegraf was asking balenad for a lot of system metrics quite often, then the CPU burden of gathering the data could fall heavily on balenad. If that were the case, then some tweaks in telegraf.conf might help. Just to see if it makes a difference, perhaps you could try changing interval = "10s" to interval = "60s" under the [agent] section of telegraf.conf. Maybe also try changing quiet = false to quiet = true, which could reduce expensive I/O.
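For reference, a sketch of what those tweaks might look like in telegraf.conf (the endpoint value is an assumption based on the -H flag in your top output; your telegraf container may mount the engine socket at a different path):

  [agent]
    interval = "60s"   # was "10s": poll less often
    quiet = true       # was false: log error messages only

  [[inputs.docker]]
    # assumption: the socket path implied by your balenad -H flag
    endpoint = "unix:///var/run/balena.sock"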

Let us know what you find, and we’ll go from there!

Kind regards,
Paulo


Thanks for the feedback.
I have stopped the telegraf container and this didn’t make a difference:
the top command still showed a utilization of 30% for the balenad process (see screenshot below).

Note that I have also stopped all 8 containers, and the balenad process still reports 30%.

Oh, all containers stopped and balenad still uses 30% CPU? This should definitely not happen. For reference, I’m running 3 containers from this multicontainer-getting-started app on a Raspberry Pi 3, and balenad uses 0% CPU even while the containers are running:

Mem: 363432K used, 635676K free, 5168K shrd, 12632K buff, 135716K cached
CPU:   0% usr   0% sys   0% nic  99% idle   0% io   0% irq   0% sirq
Load average: 0.11 0.14 0.12 2/274 2108
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
  602     1 root     S     9044   1%   1% @sbin/plymouthd --tty=tty1 --mode=boot --pid-file=/run/plymouth/pid --attach-to-session --kernel-com
  787   734 root     S     894m  91%   0% balena-engine-containerd --config /var/run/balena-engine/containerd/containerd.toml
  734     1 root     S     966m  99%   0% /usr/bin/balenad --experimental --log-driver=journald -s aufs -H fd:// -H unix:///var/run/balena.soc
  789   733 root     S     886m  90%   0% balena-engine-containerd --config /var/run/balena-host/containerd/containerd.toml
  733     1 root     S     885m  90%   0% /usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena --delta-storage-driver=aufs --log-driv
...

You’re also running on a Raspberry Pi 3, right? Something I noticed in the output of your top command is that the RAM is almost fully used, whereas my output above shows that my Pi is using only a third of its 1GB. Could your 8 containers be using too much memory? Just a thought.

I suggest that you try some of the following commands on the Host OS terminal (a combined sketch follows the list). Run top after each of them to check how it has affected CPU usage:

  • balena ps - lists the running containers (including the balena supervisor).
  • balena stats - prints CPU, memory, network and disk usage for each container.
  • systemctl stop resin-supervisor - stops the balena supervisor app, which runs in its own container. The supervisor is responsible for automatically starting and stopping your app containers as controlled through the web dashboard. The reason for manually stopping it is to prevent it from automatically restarting the app containers that the following commands will stop.
  • balena stop $(balena ps -aq) - stops all containers, after which balena ps should show an empty list. (Run systemctl stop resin-supervisor before running this.) Check what top says after running this command.
  • systemctl stop balena - stops the balena engine itself, the daemon that executes balena commands like balena ps. After running this command, even balena ps will fail to run. Run top afterwards. Surely the CPU usage will have dropped!
  • systemctl start balena - starts the balena daemon again. Run top after running this command.
  • systemctl start resin-supervisor - starts the balena supervisor again. After a few seconds, the supervisor will automatically start your app containers as controlled by the web dashboard. Run balena ps, balena stats and top after running this command.
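Putting those together, the whole sequence might look roughly like this (a sketch; check top at each step):

  balena ps                        # what is currently running?
  balena stats                     # live per-container usage (Ctrl-C to exit)
  systemctl stop resin-supervisor  # keep the supervisor from restarting containers
  balena stop $(balena ps -aq)     # stop every app container
  top                              # is balenad still busy with nothing running?
  systemctl stop balena            # stop the engine itself; balena ps won't work now
  top                              # CPU usage with no engine at all
  systemctl start balena           # bring the engine back
  systemctl start resin-supervisor # the supervisor restarts your app containers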

If your app can partially function with only a subset of those 8 containers, try starting only that minimal subset (see the sketch below) to see whether memory and CPU usage are reduced. The CPU usage may have nothing to do with memory usage, but the findings may help with the investigation.
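For example, with made-up container names standing in for your real ones:

  balena start my-core-service     # hypothetical name: start only what you need
  balena start my-database         # hypothetical name
  balena stats                     # per-container CPU/memory (Ctrl-C to exit)
  top                              # the overall picture from the Host OS side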


Hi,

I am indeed running on a Raspberry Pi 3.

I did some more deployments recently (I also increased the GPU memory and even added an additional container), and with one of those deployments the problem all of a sudden disappeared. Below you can see the top output I ran this morning.

The Grafana chart below shows the system metrics of my Raspberry Pi for the last week. It clearly shows that the CPU was high during the period from 11/24 until 11/29.

Looking at the chart, I think that the problem got fixed when I increased RESIN_HOST_CONFIG_gpu_mem to 160, which I did while experimenting with a media player container (Kodi).

Hi pdcastro,

This question was interesting to me since, with a similar setup of only 3 containers, top shows a memory usage of almost 900 MB on my RPi.

I have 3 containers:
  • Debian Stretch with a small Go application.
  • Node-RED in Alpine.
  • InfluxDB in Alpine.

However, your output showed only about 363 MB used.
Were you, maybe, running in production mode?

Also, if I run balena stats, I see 4 containers that use no more than 220 MB in total. That would mean that 600 MB or so are being used by the entire balena solution.

What do you think or what am I missing?

Update:
I increased gpu_mem to 64 as indicated before, and total used memory went down to 500 MB or so.

Hi, memory usage can be quite volatile and hard to figure out. How did you measure the 900 MB memory usage?

Hi

I used top in the Host OS terminal, as can be seen here.

By changing gpu_mem to 32 or 64, I get the following:

Of course, at this point it is worth wondering what this number actually means.

Hey @mvargasevans, VSZ is the Virtual Memory Size.
It’s how much memory a process has available for its execution, including memory that is allocated but not used, and memory from shared libraries. Does that make sense?
For the RES memory info (how much physical memory a process is actually consuming), you can check the output of: cat /proc/PID/status
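For example (a sketch using PID 718 from the earlier top output; substitute the PID you are interested in):

  grep -E 'VmSize|VmRSS' /proc/718/status
  # VmSize is the virtual size that top reports as VSZ;
  # VmRSS is the resident set, i.e. the physical RAM actually in use.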

Hey, just wanted to add something on this: increasing the amount of gpu_mem available is generally going to improve memory usage, and this is especially true if the application is performing a graphically intensive task. Looking at the two screenshots you provided, the memory usage of the balena daemon and the supervisor looks comparable (as expected, since these shouldn’t benefit much from the gpu_mem). I suspect that the memory reduction you are reporting comes from something else.

I would suggest following the steps that pdcastro provided above to gather some information. It would be especially interesting to compare the memory usage after stopping all containers (balena stop $(balena ps -aq)) with the usage after stopping the balena engine itself (systemctl stop balena).
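Something along these lines would capture the comparison (a sketch, run on the Host OS terminal):

  free -m                          # baseline with everything running
  systemctl stop resin-supervisor  # so the supervisor doesn't restart containers
  balena stop $(balena ps -aq)     # stop all app containers
  free -m                          # memory with containers stopped
  systemctl stop balena            # stop the engine itself
  free -m                          # memory with the engine stopped
  systemctl start balena           # restore everything afterwards
  systemctl start resin-supervisor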

Hi both,

Great replies!

I first did cat /proc/meminfo with everything running:

top with all containers running:

top after balena stop $(balena ps -aq), when everything had exited:

Finally, I manually stopped all my 3 containers and ran free:

Coming back to cat /proc/meminfo with my 3 containers stopped:

I find the MemAvailable value in meminfo the critical one.
There’s a deep dive on what it means here:

MemAvailable: An estimate of how much memory is available for starting new
applications, without swapping. Calculated from MemFree,
SReclaimable, the size of the file LRU lists, and the low
watermarks in each zone.
The estimate takes into account that the system needs some
page cache to function well, and that not all reclaimable
slab will be reclaimable, due to items being in use. The
impact of those factors will vary from system to system.

If I read all these correctly, I have tons of memory available and the Balena system is consuming around 200 MB.
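In other words, the comparison boils down to something like this (a rough sketch; the “really used” memory is approximately MemTotal minus MemAvailable):

  grep -E 'MemTotal|MemAvailable' /proc/meminfo
  # Note the MemTotal - MemAvailable difference with the containers running,
  # with them stopped, and after systemctl stop balena; the drop at the last
  # step approximates the engine's own footprint.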

What do you think?

Hey @mvargasevans, thanks for the detailed report :slight_smile: I don’t think we have a fixed reference number for how much memory should be used by balenaEngine when no containers are running, but I recall seeing similar results the last time we looked into this (hopefully I am not misremembering, as at the moment I couldn’t find the previous thread where we investigated it). I have pinged the balenaEngine team to have a look at this conversation, as they might be able to share some more information on the matter.

Well, it looks like I was mistaken; the number should probably be lower than that. This is a screenshot I managed to find from a previous investigation on the matter; keep in mind this was for an older version of the OS, and some things may have changed in the meantime. The first vertical line marks the point where the supervisor was stopped, while the second one marks where balena was stopped.


As you can see, the expected memory footprint should be even lower. We also have an open issue in which we are tracking some of this information: https://github.com/balena-os/meta-balena/issues/1390

Hi
Thanks for the info.
At least the ballpark figures show that memory consumption is not as high as I first thought.
Super nice!