High CPU and Memory usage on resin-supervisor.

Hi all,

The resin supervisor on our application is acting weird: its CPU and memory usage skyrocket, causing high latency and excessive CPU usage on our device.

Details:

  • device: jetson-nano
  • host os: balenaOS 2.69.1+rev1
  • supervisor version: 12.3.0
  • containers running: 8 (not including resin-supervisor)

Metrics:

  • CPU and memory usage:

CONTAINER ID   NAME               CPU %    MEM USAGE / LIMIT     MEM %    NET I/O    BLOCK I/O        PIDS
282bfb67b176   resin_supervisor   75.01%   1.597GiB / 3.869GiB   41.28%   0B / 0B    441MB / 42.6MB   12

  • logs from resin-supervisor

[api] GET /v1/healthy 200 - 4.863 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete

[event] Event: Device state report failure {"error":{"message":""}}
[api] GET /v1/healthy 200 - 338.178 ms

[event] Event: Device state report failure {"error":{"message":""}}
[api] GET /v1/healthy 200 - 2.870 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[api] GET /v1/healthy 200 - 2.885 ms
[api] GET /v1/healthy 200 - 3.086 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[api] GET /v1/healthy 200 - 3.158 ms
[error] Error from the API: 503
[error] Non-200 response from the API! Status code: 503 - message: Error
[error] at /usr/src/app/dist/app.js:22:554480
[error] at runMicrotasks ()
[error] at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error] at async /usr/src/app/dist/app.js:22:553788
[error] at async /usr/src/app/dist/app.js:22:555310
[api] GET /v1/healthy 200 - 341.618 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[api] GET /v1/healthy 200 - 8.252 ms
[api] GET /v1/healthy 200 - 362.491 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[api] GET /v1/healthy 200 - 3.540 ms
[api] GET /v1/healthy 200 - 3.129 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[error] Error from the API: 503
[error] Non-200 response from the API! Status code: 503 - message: Error
[error] at /usr/src/app/dist/app.js:22:554480
[error] at runMicrotasks ()
[error] at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error] at async /usr/src/app/dist/app.js:22:553788
[error] at async /usr/src/app/dist/app.js:22:555310
[api] GET /v1/healthy 200 - 20.501 ms
[api] GET /v1/healthy 200 - 3.139 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[error] Error from the API: 503
[error] Non-200 response from the API! Status code: 503 - message: Error
[error] at /usr/src/app/dist/app.js:22:554480
[error] at runMicrotasks ()
[error] at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error] at async /usr/src/app/dist/app.js:22:553788
[error] at async /usr/src/app/dist/app.js:22:555310
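
For reference, the stats snapshot and the supervisor logs above were collected from the host OS roughly like this (a sketch; it assumes an SSH session into the host OS, where the engine CLI is balena):

# One-shot resource usage of the supervisor container
balena stats --no-stream resin_supervisor

# Tail the supervisor container's logs
balena logs -f --tail 100 resin_supervisor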

Any help on this would be appreciated, thanks.

Hello, are you aware of any event that triggers the behavior you described, such as pushing a new release or updating the OS/Supervisor? Please run the Device Health Checks on the diagnostics page of the dashboard and let us know if anything is failing.

Also seeing high CPU usage/time from resin_supervisor, most notably from the command node /usr/src/app/dist/app.js.

  • RPI3
  • balenaOS 2.77.0+rev1
  • Supervisor 12.3.0.

The Device Health Checks are all green.

Hi there - you can use htop to determine exactly where this CPU usage is coming from. Could you try balena run --rm -it --privileged --pid=host wrboyce/utils:htop from the host OS of the device, and then sort by CPU / memory usage?
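
In other words, something along these lines (a sketch; the wrboyce/utils:htop image is simply a convenient way to get htop onto the host OS, and --pid=host lets it see the host's processes):

# Run htop against the host's PID namespace, from the host OS shell
balena run --rm -it --privileged --pid=host wrboyce/utils:htop
# Inside htop: press P to sort by CPU%, M to sort by memory, or F6 to pick a sort column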

We recently fixed a Supervisor issue that was causing random CPU usage spikes during system information reporting. The fix is included in Supervisor v12.8.3, so please try updating to that version (or later).
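
Once updated, one way to confirm which supervisor version is actually running is to check the supervisor container's image tag from the host OS (a sketch; the exact image name varies by device type):

# The tag on the supervisor image corresponds to the supervisor version
balena ps --filter name=resin_supervisor --format "{{.Image}}"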

Thanks, we will give that a shot!

@alanb128 @rcooke-warwick

Sorry for the late reply as I was busy with other stuff.

Here’s the top output on the jetson-nano host:

Mem: 3719844K used, 337008K free, 16276K shrd, 102104K buff, 490724K cached
CPU: 58% usr 8% sys 0% nic 27% idle 0% io 4% irq 2% sirq
Load average: 16.82 11.13 12.50 7/746 80565
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
4324 1 root S 3037m 76% 18% /usr/bin/balenad --experimental --log-driver=journald -s overlay2 -H fd:// -H unix:///var/run/balena.sock -H unix:///var/run/balena-engine.sock -H tcp://0.0.0.0:2375 --d

Edit:
The high CPU load and load average spikes are causing latency issues on our "real-time" system: we are seeing latency spikes in our time-critical application.

This is running supervisor 12.3.5 and I haven't yet upgraded it to v12.8.3 as suggested by @markcorbinuk.

Depending on whether I can catch the CPU spike in time, it's either balenad or node /usr/src/app/dist/app.js that causes the high CPU usage.
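
To make the spikes easier to catch, I'm planning to log periodic snapshots from the host OS, roughly like this (a sketch using BusyBox top in batch mode; the output path is just an example):

# Append a process snapshot every 5 seconds so spikes can be inspected after the fact
while true; do
    date >> /tmp/cpu-snapshots.log
    top -b -n 1 | head -n 20 >> /tmp/cpu-snapshots.log
    sleep 5
done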

I'll run my dev device with v12.8.3 and see if this issue persists.

Hi Nicholas, have you had a chance to upgrade the supervisor to v12.8.3 and check whether the issue is gone? As Mark mentioned, the problem should be resolved in v12.8.3, but it would be nice to have confirmation that everything works as intended.