Running top from the HostOS container, I found:
Mem: 824480K used, 6756072K free, 21536K shrd, 49780K buff, 152232K cached
CPU: 32% usr 2% sys 0% nic 65% idle 0% io 0% irq 0% sirq
Load average: 2.82 2.79 2.77 4/282 29634
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
734 1 root S 1663m 22% 32% /usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena
28160 28139 root S 298m 4% 2% ./camera_controller_resin http://10.0.0.110:8080
11805 1 root S 1713m 23% 0% /usr/bin/balenad --experimental --log-driver=journald -s aufs
12334 12254 root S 1205m 16% 0% node /usr/src/app/dist/app.js
Is the 32% of my CPU going to balenad expected, or is it a result of my CPU being otherwise lightly loaded?
https://dashboard.balena-cloud.com/devices/e21504138d6433262e2bde08a61d49ff/summary
Here is a similar device:
Mem: 5356596K used, 2094940K free, 22076K shrd, 7984K buff, 113072K cached
CPU: 28% usr 1% sys 0% nic 69% idle 0% io 0% irq 0% sirq
Load average: 2.15 2.29 2.34 4/270 8645
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
712 1 root S 1813m 25% 30% /usr/bin/balenad --experimental --log-driver=journald -s aufs
5890 1556 root S 5034m 69% 0% ./camera_controller_resin http://192.168.0.149:8080
1 0 root S 114m 2% 0% {systemd} /sbin/init
713 1 root S 1432m 20% 0% /usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena
https://dashboard.balena-cloud.com/devices/8057fa160e7781f273c6913344eb70e6/summary
Support access has been granted on both.
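In case it helps, this is roughly how I have been poking at it so far. A sketch only; I am assuming the engine CLI is available in the host OS shell as balena-engine (on older balenaOS releases the binary may be called balena):

# Per-container CPU and memory, to see whether the load is in the engine itself or in one of the containers
balena-engine stats --no-stream

# Running containers, with names and status
balena-engine ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'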
Hi Jason,
I am taking a look…
There is clearly something out of the ordinary going on:
May 16 22:44:41 e215041 kernel: camera_controll[31768]: segfault at 7efcab0b693e ip 000055edbbb40d77 sp 00007fff80c6e5c0 error 4 in camera_controller_resin[55edbbaf4000+112000]
May 16 22:44:41 e215041 kernel[636]: camera_controll[31768]: segfault at 7efcab0b693e ip 000055edbbb40d77 sp 00007fff80c6e5c0 error 4 in camera_controller_resin[55edbbaf4000+112000]
May 16 22:44:41 e215041 balenad[11805]: /usr/src/app/runCamera.bash: line 14: 3232 Segmentation fault (core dumped) ./camera_controller_resin "${SERVER_URL}"
May 16 22:44:41 e215041 balenad[11805]: ls: cannot access '/tmp/*.bin': No such file or directory
And the supervisor is being restarted every 2 minutes:
May 16 22:45:31 e215041 systemd[1]: Starting Resin supervisor...
May 16 22:47:01 e215041 systemd[1]: resin-supervisor.service: Start-pre operation timed out. Terminating.
May 16 22:47:01 e215041 resin-supervisor[1182]: active
May 16 22:47:01 e215041 systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
May 16 22:47:01 e215041 systemd[1]: Failed to start Resin supervisor.
May 16 22:47:11 e215041 systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
May 16 22:47:11 e215041 systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 16994.
May 16 22:47:11 e215041 systemd[1]: Stopped Resin supervisor.
May 16 22:47:11 e215041 systemd[1]: Starting Resin supervisor...
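(For reference, this is roughly how I am checking the restart loop from the host OS shell; nothing balena-specific beyond the unit name, and the restart counter is the same one visible in the journal lines above:)

# Current state of the supervisor unit
systemctl status resin-supervisor.service --no-pager

# How often the watchdog has tried to restart it
journalctl -u resin-supervisor.service --no-pager | grep "restart counter" | tail -n 5

# Recent supervisor journal output, to see why the start-pre step times out
journalctl -u resin-supervisor.service -n 100 --no-pager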
Still looking into why the supervisor is restarting…
I would like to restart balenad, and maybe even the device, which would interrupt your services for a short while. Is that acceptable?
Looks like a reboot fixed the issue for e215041. Unfortunately, I could not identify the problem that caused it.
I will take a look at the other device now…
I have rebooted the second device too. Again there are some worrying entries in the logs, such as an out-of-memory situation, but still no solid hint as to the cause.
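(For reference, this is the kind of check I used to spot the out-of-memory events; standard journalctl and systemctl only, nothing device-specific:)

# Kernel OOM killer activity since the last boot
journalctl -k -b --no-pager | grep -iE "out of memory|oom"

# Units systemd currently considers failed
systemctl --failed --no-pager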
There appears to be a related issue, https://github.com/balena-os/balena-engine/issues/142, which will be fixed in balenaOS 2.34.0, so it might be a good idea to update.
Both devices seem to be OK after the update, with normal load and no supervisor restarts…
Thanks. Both devices had an uptime of nearly 90 days. So who knows!
@jason10, although the “ultimate root cause” of the issue isn’t clear, we understand a bit about the mechanism that led to the higher CPU usage. At some point, the balena supervisor failed a health check performed by the OS watchdog. It isn’t clear why the check failed, though the system running out of memory (OOM events) is not an unusual cause of a health check failure. Following that event, the watchdog should have cleanly restarted the supervisor. Instead, likely because of the bug linked by samothx, the supervisor failed to restart and the OS entered a loop of failed restart attempts, which led to the higher CPU usage. Even on devices subject to that bug, this failure does not happen often; most of the time the watchdog succeeds in restarting the supervisor. In this instance, unfortunately, the only solution (as far as we were aware) was to reboot the device.
This explanation probably doesn’t help you much, but I thought I would share our understanding of what was going on. The bug mentioned is fixed in balenaEngine 18.09 (which catches up with Docker 18.09), which will ship with balenaOS 2.34.0 (to be available in production shortly).
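If you would like to confirm what a device is running after the update, both the OS release and the engine version are visible from the host OS shell. This is just a sketch (the engine binary may be named balena rather than balena-engine on older releases):

# balenaOS release the device is on
cat /etc/os-release

# Engine version (should report 18.09.x once the device is on balenaOS 2.34.0)
balena-engine version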