Running top from the HostOS container, I found:
Mem: 824480K used, 6756072K free, 21536K shrd, 49780K buff, 152232K cached
CPU: 32% usr 2% sys 0% nic 65% idle 0% io 0% irq 0% sirq
Load average: 2.82 2.79 2.77 4/282 29634
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
734 1 root S 1663m 22% 32% /usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena
28160 28139 root S 298m 4% 2% ./camera_controller_resin http://10.0.0.110:8080
11805 1 root S 1713m 23% 0% /usr/bin/balenad --experimental --log-driver=journald -s aufs
12334 12254 root S 1205m 16% 0% node /usr/src/app/dist/app.js
Is the 32% of my CPU going to balenad expected, or is it a result of my CPU being otherwise lightly loaded?
https://dashboard.balena-cloud.com/devices/e21504138d6433262e2bde08a61d49ff/summary
Here is a similar device:
Mem: 5356596K used, 2094940K free, 22076K shrd, 7984K buff, 113072K cached
CPU: 28% usr 1% sys 0% nic 69% idle 0% io 0% irq 0% sirq
Load average: 2.15 2.29 2.34 4/270 8645
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
712 1 root S 1813m 25% 30% /usr/bin/balenad --experimental --log-driver=journald -s aufs
5890 1556 root S 5034m 69% 0% ./camera_controller_resin http://192.168.0.149:8080
1 0 root S 114m 2% 0% {systemd} /sbin/init
713 1 root S 1432m 20% 0% /usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena
https://dashboard.balena-cloud.com/devices/8057fa160e7781f273c6913344eb70e6/summary
Support access has been granted on both.
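In case it helps, this is roughly how I have been poking at it so far. A sketch only; I am assuming the engine CLI is available in the host OS shell as balena-engine (on older balenaOS releases the binary may be called balena):

# Per-container CPU and memory, to see whether the load is in the engine itself or in one of the containers
balena-engine stats --no-stream

# Running containers, with names and status
balena-engine ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'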
Hi Jason,
I am taking a look…
There is clearly something out of the ordinary going on:
May 16 22:44:41 e215041 kernel: camera_controll[31768]: segfault at 7efcab0b693e ip 000055edbbb40d77 sp 00007fff80c6e5c0 error 4 in camera_controller_resin[55edbbaf4000+112000]
May 16 22:44:41 e215041 kernel[636]: camera_controll[31768]: segfault at 7efcab0b693e ip 000055edbbb40d77 sp 00007fff80c6e5c0 error 4 in camera_controller_resin[55edbbaf4000+112000]
May 16 22:44:41 e215041 balenad[11805]: /usr/src/app/runCamera.bash: line 14: 3232 Segmentation fault (core dumped) ./camera_controller_resin "${SERVER_URL}"
May 16 22:44:41 e215041 balenad[11805]: ls: cannot access '/tmp/*.bin': No such file or directory
And the supervisor is being restarted every 2 minutes:
May 16 22:45:31 e215041 systemd[1]: Starting Resin supervisor...
May 16 22:47:01 e215041 systemd[1]: resin-supervisor.service: Start-pre operation timed out. Terminating.
May 16 22:47:01 e215041 resin-supervisor[1182]: active
May 16 22:47:01 e215041 systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
May 16 22:47:01 e215041 systemd[1]: Failed to start Resin supervisor.
May 16 22:47:11 e215041 systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
May 16 22:47:11 e215041 systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 16994.
May 16 22:47:11 e215041 systemd[1]: Stopped Resin supervisor.
May 16 22:47:11 e215041 systemd[1]: Starting Resin supervisor...
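(For reference, this is roughly how I am checking the restart loop from the host OS shell; nothing balena-specific beyond the unit name, and the restart counter is the same one visible in the journal lines above:)

# Current state of the supervisor unit
systemctl status resin-supervisor.service --no-pager

# How often the watchdog has tried to restart it
journalctl -u resin-supervisor.service --no-pager | grep "restart counter" | tail -n 5

# Recent supervisor journal output, to see why the start-pre step times out
journalctl -u resin-supervisor.service -n 100 --no-pager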
Still looking into why the supervisor is restarting…
I would like to restart balenad, and maybe even the device, which would interrupt your services for a short while. Is that acceptable?
Looks like a reboot fixed the issue for e215041. Unfortunately, I could not identify the problem that caused it.
I will take a look at the other device now…
I have rebooted the second device too. Again there are some worrying entries in the logs, such as an out-of-memory situation, but still no solid hint as to the cause.
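(For reference, this is the kind of check I used to spot the out-of-memory events; standard journalctl and systemctl only, nothing device-specific:)

# Kernel OOM killer activity since the last boot
journalctl -k -b --no-pager | grep -iE "out of memory|oom"

# Units systemd currently considers failed
systemctl --failed --no-pager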
There appears to be a related issue, https://github.com/balena-os/balena-engine/issues/142, which will be fixed in balenaOS 2.34.0, so it might be a good idea to update.
Both devices seem to be OK after the update, with normal load and no supervisor restarts…
Thanks. Both devices had an uptime of nearly 90 days. So who knows!
@jason10, although the “ultimate root cause” of the issue isn’t clear, we understand a bit about the mechanism that led to the higher CPU usage. At some point, the balena supervisor failed a health check performed by the OS watchdog. It isn’t clear why the check failed, though the system running out of memory (OOM events) is not an unusual cause of a health check failure. Following that event, the watchdog should have cleanly restarted the supervisor. Instead, likely because of the bug linked by samothx, the supervisor failed to restart and the OS entered a loop of failed restart attempts, which led to the higher CPU usage. Even on devices subject to that bug, this failure does not happen often; most of the time the watchdog succeeds in restarting the supervisor. In this instance, unfortunately, the only solution (as far as we were aware) was to reboot the device.
This explanation probably doesn’t help you much, but I thought I would share our understanding of what was going on. The bug mentioned is fixed in balenaEngine 18.09 (which catches up with Docker 18.09), which will ship with balenaOS 2.34.0 (to be available in production shortly).
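If you would like to confirm what a device is running after the update, both the OS release and the engine version are visible from the host OS shell. This is just a sketch (the engine binary may be named balena rather than balena-engine on older releases):

# balenaOS release the device is on
cat /etc/os-release

# Engine version (should report 18.09.x once the device is on balenaOS 2.34.0)
balena-engine version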