An average user CPU usage of 35% is reported on my Raspberry Pi which cannot be directly attributed to any container

raspberrypi3

#1

Hi,

In the chart below you can see the CPU usage of my Raspberry Pi device running balenaOS. It reports an almost constant “user CPU” percentage of 35%.

However, as you can see in the chart below, this usage doesn’t correspond to the total CPU usage of all the containers (which is less than 5%).

Below you can see the actual process that is responsible for this high user CPU percentage:

Mem: 949180K used, 49936K free, 12116K shrd, 60436K buff, 261860K cached
CPU:  32% usr   6% sys   0% nic  60% idle   0% io   0% irq   0% sirq
Load average: 1.40 1.54 1.72 4/498 14147
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
  718     1 root     S    1182m 121%  30% /usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena --delta-storage-driver=aufs --log-driver=journald -s aufs --data-root=/mnt/sysroot/inactive/balena -H unix://
  612     1 root     S     9044   1%   1% @sbin/plymouthd --tty=tty1 --mode=boot --pid-file=/run/plymouth/pid --attach-to-session --kernel-command-line=plymouth.ignore-serial-consoles splash
29703 29686 root     S     918m  94%   1% balena-containerd --config /var/run/balena/containerd/containerd.toml
...

So why is this process using so much CPU, and can this be fixed?

FYI, I think this issue started when I activated container monitoring in my telegraf container (see the [[inputs.docker]] section of my telegraf.conf).
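
For context, a minimal [[inputs.docker]] section looks something like this (the endpoint shown is just an example of pointing the plugin at the balena engine socket mounted into the container; the other values are the plugin defaults, not necessarily my exact config):

  [[inputs.docker]]
    ## Engine endpoint, e.g. the balena socket mounted into the telegraf container
    endpoint = "unix:///var/run/balena.sock"
    ## Collect metrics for all containers (empty list = all)
    container_names = []
    ## Timeout for the list, info and stats calls
    timeout = "5s"
    ## Report per-device blkio and network stats for each container
    perdevice = true
    ## Don't report aggregated totals per container
    total = false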


#2

I am kind of thinking out loud here… First it would be useful to confirm whether the issue is related to the telegraf container, as you’ve hinted. If you pause or stop the telegraf container and run the top command again on a terminal (I guess the per-process CPU usage you shared was produced by the top command), then does the CPU usage of balenad drop considerably?
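
For example, something along these lines from the host OS terminal (the container name below is only a placeholder; use whatever balena ps reports for your telegraf service):

  # find the name/ID of the telegraf container
  balena ps
  # stop it (replace 'telegraf' with the actual container name or ID)
  balena stop telegraf
  # check per-process CPU usage again
  top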

I can imagine that if telegraf was asking balenad for a lot of system metrics quite often, then the CPU burden of gathering the data could fall heavily on balenad. If this was the case, then some tweaks in telegraf.conf might help. Just to see if it makes a difference, perhaps you could try changing interval = "10s" to interval = "60s" under the [agent] section of telegraf.conf. Maybe also try changing quiet = false to quiet = true, which could reduce expensive I/O.
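
For illustration, the [agent] section would then look something like this (only the two settings mentioned above are changed; everything else stays at your current values):

  [agent]
    ## Collect inputs every 60s instead of every 10s
    interval = "60s"
    ## Log only error messages
    quiet = true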

Let us know what you find, and we’ll go from there!

Kind regards,
Paulo


#3

Thanks for the feedback.
I have stopped the telegraf container and this didn’t make a difference:
the top command still showed a utilization of 30% for the balenad process (see screenshot below).

Note that I have also stopped all 8 containers and the balenad process is still reporting 30%.


#4

Oh, all containers stopped and balenad still uses 30% CPU? This should definitely not happen. For reference, I’m running 3 containers from this multicontainer-getting-started app on a Raspberry Pi 3, and balenad uses 0% CPU even while the containers are running:

Mem: 363432K used, 635676K free, 5168K shrd, 12632K buff, 135716K cached
CPU:   0% usr   0% sys   0% nic  99% idle   0% io   0% irq   0% sirq
Load average: 0.11 0.14 0.12 2/274 2108
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
  602     1 root     S     9044   1%   1% @sbin/plymouthd --tty=tty1 --mode=boot --pid-file=/run/plymouth/pid --attach-to-session --kernel-com
  787   734 root     S     894m  91%   0% balena-engine-containerd --config /var/run/balena-engine/containerd/containerd.toml
  734     1 root     S     966m  99%   0% /usr/bin/balenad --experimental --log-driver=journald -s aufs -H fd:// -H unix:///var/run/balena.soc
  789   733 root     S     886m  90%   0% balena-engine-containerd --config /var/run/balena-host/containerd/containerd.toml
  733     1 root     S     885m  90%   0% /usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena --delta-storage-driver=aufs --log-driv
...

You’re also running on a Raspberry Pi 3, right? Something I noticed in the output of your top command is that the RAM is fully utilised, whereas my output above shows that my Pi is using only a third of its 1GB. Could your 8 containers be using too much memory? Just a thought.

I suggest you try playing with some of the following commands on the Host OS terminal. Run top after each of them to check how they’ve affected CPU usage (there’s a combined sketch of the whole sequence after this list):

  • balena ps - lists the running containers (including the balena supervisor).
  • balena stats - prints CPU, memory, network and disk usage for each container.
  • systemctl stop resin-supervisor - this command stops the balena supervisor app, which runs in its own container. The supervisor is responsible for automatically starting and stopping your app containers as controlled through the web dashboard. The reason for manually stopping it is to prevent it from automatically restarting the app containers that the following commands will stop.
  • balena stop $(balena ps -aq) - stops all containers, after which balena ps should show an empty list. (Run systemctl stop resin-supervisor before running this.) Check what top says after running this command.
  • systemctl stop balena - this stops the balena engine itself, the daemon that executes the balena commands like balena ps. After running this command, even balena ps will fail to run. Run top after running this command. Surely the CPU usage will have dropped!
  • systemctl start balena - starts the balena daemon again. Run top after running this command.
  • systemctl start resin-supervisor - this starts the balena supervisor again. After a few seconds, the supervisor will automatically start your app containers as controlled by the web dashboard. Run balena ps, balena stats and top after running this command.
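
Putting those steps together, the whole cycle looks roughly like this (the commands are exactly the ones listed above; the comments just note what to check at each point):

  # inspect what is running and how much CPU/memory each container uses
  balena ps
  balena stats                       # streams continuously; press Ctrl+C to stop

  # stop the supervisor first so it doesn't restart the app containers
  systemctl stop resin-supervisor
  # stop all app containers; balena ps should now show an empty list
  balena stop $(balena ps -aq)
  top                                # has balenad's CPU usage dropped?

  # stop the balena engine itself; balena ps will no longer work
  systemctl stop balena
  top                                # CPU usage should certainly be low now

  # bring everything back up
  systemctl start balena
  systemctl start resin-supervisor
  balena ps                          # after a few seconds the app containers reappear
  top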

If your app can partially function with only a subset of those 8 containers, try starting only that minimal subset to see if memory usage and CPU are reduced. The CPU usage may have nothing to do with memory usage, but the findings may help with the investigation.


#5

Hi,

I am indeed running on a Raspberry Pi 3.

I did some more deployments recently (I also increased the GPU memory and even added an additional container), and with one of those deployments the problem all of a sudden disappeared. Below you can see the output of top that I ran this morning.

The Grafana chart below shows the system metrics of my Raspberry Pi for the last week. It clearly shows that the CPU was high during the period from 11/24 until 11/29.

Looking at the chart, I think the problem got fixed when I increased RESIN_HOST_CONFIG_gpu_mem to 160 while I was experimenting with a media player container (Kodi).