High CPU and Memory usage on resin-supervisor.

Hi all,

The resin supervisor on our application is acting weird: its CPU and memory usage skyrocket, causing high latency and excessive CPU usage on our device.

Details:

  • device: jetson-nano
  • host os: balenaOS 2.69.1+rev1
  • supervisor version: 12.3.0
  • containers running: 8 (not including resin-supervisor)

Metrics:

  • CPU and memory usage:

CONTAINER ID   NAME               CPU %    MEM USAGE / LIMIT     MEM %    NET I/O    BLOCK I/O        PIDS
282bfb67b176   resin_supervisor   75.01%   1.597GiB / 3.869GiB   41.28%   0B / 0B    441MB / 42.6MB   12

  • logs from resin-supervisor

[api] GET /v1/healthy 200 - 4.863 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete

[event] Event: Device state report failure {"error":{"message":""}}
[api] GET /v1/healthy 200 - 338.178 ms

[event] Event: Device state report failure {"error":{"message":""}}
[api] GET /v1/healthy 200 - 2.870 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[api] GET /v1/healthy 200 - 2.885 ms
[api] GET /v1/healthy 200 - 3.086 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[api] GET /v1/healthy 200 - 3.158 ms
[error] Error from the API: 503
[error] Non-200 response from the API! Status code: 503 - message: Error
[error] at /usr/src/app/dist/app.js:22:554480
[error] at runMicrotasks ()
[error] at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error] at async /usr/src/app/dist/app.js:22:553788
[error] at async /usr/src/app/dist/app.js:22:555310
[api] GET /v1/healthy 200 - 341.618 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[api] GET /v1/healthy 200 - 8.252 ms
[api] GET /v1/healthy 200 - 362.491 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[api] GET /v1/healthy 200 - 3.540 ms
[api] GET /v1/healthy 200 - 3.129 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[error] Error from the API: 503
[error] Non-200 response from the API! Status code: 503 - message: Error
[error] at /usr/src/app/dist/app.js:22:554480
[error] at runMicrotasks ()
[error] at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error] at async /usr/src/app/dist/app.js:22:553788
[error] at async /usr/src/app/dist/app.js:22:555310
[api] GET /v1/healthy 200 - 20.501 ms
[api] GET /v1/healthy 200 - 3.139 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
[error] Error from the API: 503
[error] Non-200 response from the API! Status code: 503 - message: Error
[error] at /usr/src/app/dist/app.js:22:554480
[error] at runMicrotasks ()
[error] at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error] at async /usr/src/app/dist/app.js:22:553788
[error] at async /usr/src/app/dist/app.js:22:555310
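
For reference, the stats snapshot and the supervisor logs above were collected from the host OS roughly like this (a sketch; it assumes an SSH session into the host OS, where the engine CLI is balena):

# One-shot resource usage of the supervisor container
balena stats --no-stream resin_supervisor

# Tail the supervisor container's logs
balena logs -f --tail 100 resin_supervisor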

Any help on this would be appreciated, thanks.

Hello, are you aware of any event that triggers the behavior you described, such as pushing a new release or updating the OS/Supervisor? Please run the Device Health Checks on the diagnostics page of the dashboard and let us know if anything is failing.

Also seeing high CPU usage/time from resin_supervisor, most notably from the command node /usr/src/app/dist/app.js.

  • RPI3
  • balenaOS 2.77.0+rev1
  • Supervisor 12.3.0.

The Device Health Checks are all green.

Hi there - you can use htop to determine exactly where this CPU usage is coming from. Could you try balena run --rm -it --privileged --pid=host wrboyce/utils:htop from the host OS of the device, and then sort by CPU / memory usage?
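
In other words, something along these lines (a sketch; the wrboyce/utils:htop image is simply a convenient way to get htop onto the host OS, and --pid=host lets it see the host's processes):

# Run htop against the host's PID namespace, from the host OS shell
balena run --rm -it --privileged --pid=host wrboyce/utils:htop
# Inside htop: press P to sort by CPU%, M to sort by memory, or F6 to pick a sort column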

We recently fixed a Supervisor issue that was causing random CPU usage spikes during system information reporting. The fix is included in Supervisor v12.8.3, so please try updating to that version (or later).
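
Once updated, one way to confirm which supervisor version is actually running is to check the supervisor container's image tag from the host OS (a sketch; the exact image name varies by device type):

# The tag on the supervisor image corresponds to the supervisor version
balena ps --filter name=resin_supervisor --format "{{.Image}}"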

Thanks, we will give that a shot!

@alanb128 @rcooke-warwick

Sorry for the late reply as I was busy with other stuff.

Here’s the top output on the jetson-nano host:

Mem: 3719844K used, 337008K free, 16276K shrd, 102104K buff, 490724K cached
CPU: 58% usr 8% sys 0% nic 27% idle 0% io 4% irq 2% sirq
Load average: 16.82 11.13 12.50 7/746 80565
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
4324 1 root S 3037m 76% 18% /usr/bin/balenad --experimental --log-driver=journald -s overlay2 -H fd:// -H unix:///var/run/balena.sock -H unix:///var/run/balena-engine.sock -H tcp://0.0.0.0:2375 --d

Edit:
The high CPU load and load average spikes are causing latency issues on our "real-time" system: we are seeing latency spikes in our time-critical application.

This is running supervisor 12.3.5 and I haven't yet upgraded it to v12.8.3 as suggested by @markcorbinuk.

Depending on whether I can catch the CPU spike in time, it's either balenad or node /usr/src/app/dist/app.js that causes the high CPU usage.
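
To make the spikes easier to catch, I'm planning to log periodic snapshots from the host OS, roughly like this (a sketch using BusyBox top in batch mode; the output path is just an example):

# Append a process snapshot every 5 seconds so spikes can be inspected after the fact
while true; do
    date >> /tmp/cpu-snapshots.log
    top -b -n 1 | head -n 20 >> /tmp/cpu-snapshots.log
    sleep 5
done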

I'll run my dev device with v12.8.3 and see if this issue persists.

Hi Nicholas, have you had a chance to upgrade the supervisor to v12.8.3 and check whether the issue is gone? As Mark mentioned, the problem should be resolved in v12.8.3, but it would be nice to have confirmation that everything works as intended.