Why is balenad using 32% of my CPU?

Running top from the host OS terminal, I found:

Mem: 824480K used, 6756072K free, 21536K shrd, 49780K buff, 152232K cached
CPU:  32% usr   2% sys   0% nic  65% idle   0% io   0% irq   0% sirq
Load average: 2.82 2.79 2.77 4/282 29634
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
  734     1 root     S    1663m  22%  32% /usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena
28160 28139 root     S     298m   4%   2% ./camera_controller_resin http://10.0.0.110:8080
11805     1 root     S    1713m  23%   0% /usr/bin/balenad --experimental --log-driver=journald -s aufs
12334 12254 root     S    1205m  16%   0% node /usr/src/app/dist/app.js

Is the 32% of my CPU going to balenad expected, or does it just stand out because the CPU is otherwise lightly loaded?
https://dashboard.balena-cloud.com/devices/e21504138d6433262e2bde08a61d49ff/summary

Here is a similar device:

Mem: 5356596K used, 2094940K free, 22076K shrd, 7984K buff, 113072K cached
CPU:  28% usr   1% sys   0% nic  69% idle   0% io   0% irq   0% sirq
Load average: 2.15 2.29 2.34 4/270 8645
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
  712     1 root     S    1813m  25%  30% /usr/bin/balenad --experimental --log-driver=journald -s aufs
 5890  1556 root     S    5034m  69%   0% ./camera_controller_resin http://192.168.0.149:8080
    1     0 root     S     114m   2%   0% {systemd} /sbin/init
  713     1 root     S    1432m  20%   0% /usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena

https://dashboard.balena-cloud.com/devices/8057fa160e7781f273c6913344eb70e6/summary

Support access has been granted on both.

Hi Jason,
I am taking a look…

There is clearly something out of the ordinary going on:

May 16 22:44:41 e215041 kernel: camera_controll[31768]: segfault at 7efcab0b693e ip 000055edbbb40d77 sp 00007fff80c6e5c0 error 4 in camera_controller_resin[55edbbaf4000+112000]
May 16 22:44:41 e215041 kernel[636]: camera_controll[31768]: segfault at 7efcab0b693e ip 000055edbbb40d77 sp 00007fff80c6e5c0 error 4 in camera_controller_resin[55edbbaf4000+112000]
May 16 22:44:41 e215041 balenad[11805]: /usr/src/app/runCamera.bash: line 14:  3232 Segmentation fault      (core dumped) ./camera_controller_resin "${SERVER_URL}"
May 16 22:44:41 e215041 balenad[11805]: ls: cannot access '/tmp/*.bin': No such file or directory
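For reference, these entries come straight from the host OS journal. A rough way to pull them out yourself (assuming a host OS shell with journalctl available, as on balenaOS) is:

# kernel-level messages, which is where the segfault reports show up
journalctl -k --no-pager | grep -i segfault

# everything logged under the balenad process name
journalctl _COMM=balenad --no-pager | tail -n 100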

And the supervisor is being restarted every 2 minutes:

May 16 22:45:31 e215041 systemd[1]: Starting Resin supervisor...
May 16 22:47:01 e215041 systemd[1]: resin-supervisor.service: Start-pre operation timed out. Terminating.
May 16 22:47:01 e215041 resin-supervisor[1182]: active
May 16 22:47:01 e215041 systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
May 16 22:47:01 e215041 systemd[1]: Failed to start Resin supervisor.
May 16 22:47:11 e215041 systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
May 16 22:47:11 e215041 systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 16994.
May 16 22:47:11 e215041 systemd[1]: Stopped Resin supervisor.
May 16 22:47:11 e215041 systemd[1]: Starting Resin supervisor...
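To watch this yourself, something along these lines works from the host OS (resin-supervisor.service is the unit named in the logs above; the NRestarts property only exists on newer systemd versions, so treat that last command as optional):

# current state and most recent log lines of the supervisor unit
systemctl status resin-supervisor.service

# follow the unit's journal live to see the restart loop as it happens
journalctl -u resin-supervisor.service -f

# restart counter kept by systemd
systemctl show resin-supervisor.service -p NRestarts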

Still looking to see why the supervisor is restarting …

I would like to restart balenad, and maybe even reboot the device, which would interrupt your services for a short while. Is that acceptable?

Yes, that would be fine!
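For the record, the restart is done from the host OS and would look roughly like this (the engine unit name is assumed here, so it is worth confirming it before running anything):

# confirm the actual unit name for the engine
systemctl list-units | grep -i balena

# restart the engine (assuming the unit is called balena.service)
systemctl restart balena.service

# or, if that is not enough, reboot the whole device
reboot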

Looks like a reboot fixed the issue for e215041. Unfortunately, I could not determine what caused it.
I will take a look at the other device now…

I have rebooted the second device too. Again there are some unpleasant entries in the logs, such as an out-of-memory situation, but still no solid hint as to the cause.
There appears to be a related issue, https://github.com/balena-os/balena-engine/issues/142, which will be fixed in balenaOS 2.34.0, so it might be a good idea to update.
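Two quick checks that can be run from the host OS in a situation like this (standard tools, nothing balena-specific):

# look for OOM killer activity in the kernel log
journalctl -k --no-pager | grep -iE 'out of memory|oom'

# confirm which balenaOS version the device is currently running
cat /etc/os-release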
Both devices seem to be OK after the update, with normal load and no supervisor restarts…

Thanks. Both devices had an uptime of nearly 90 days. So who knows!

@jason10, although the “ultimate root cause” of the issue isn’t clear, we understand a bit about the mechanism that led to the higher CPU usage. At some point, the balena supervisor failed a health check performed by the OS watchdog. It’s not clear why it failed the health check – though the system running out of memory (OOM events) is not an unusual cause for a health check failure. Following that event, what should have happened was for the supervisor to be cleanly restarted by the watchdog. But instead, likely because of the bug linked by samothx, the supervisor failed to restart and the OS entered a loop of failed supervisor restart attempts, which is what drove the higher CPU usage. Even on devices subject to that bug, this failure does not happen often – most of the time, the watchdog succeeds in restarting the supervisor. In this instance, unfortunately the only solution (as far as we were aware) was to reboot the device.
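For anyone curious what that restart loop looks like from the systemd side, the unit’s restart and watchdog settings can be inspected from the host OS with standard systemd commands (nothing here is balena-specific, and the exact values will depend on the OS version):

# print the unit file; look for ExecStartPre=, Restart= and any watchdog/healthcheck settings
systemctl cat resin-supervisor.service

# the same settings as systemd sees them, including the per-unit watchdog timeout
systemctl show resin-supervisor.service -p WatchdogUSec -p Restart -p RestartSec -p TimeoutStartSec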

This explanation probably doesn’t help you much, but I thought I would share our understanding of what was going on. The bug mentioned is fixed in balenaEngine 18.09 (which catches up with Docker 18.09), which will ship with balenaOS 2.34.0 (to be available in production shortly).
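Once the 2.34.0 update lands, the new engine is easy to confirm from the host OS. The engine CLI is assumed to be balena-engine here (on some host OS versions it is exposed simply as balena):

# after the update this should report the 18.09-based engine mentioned above
balena-engine version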