Same code running 2x slower in Balena on Jetson Xavier GPU

Hey guys!
I’m using YOLOv4 optimized with TensorRT via the excellent tkDNN library on an NVIDIA Jetson Xavier AGX. I have two setups: in the first, the Jetson was flashed with JetPack 4.3, which runs Ubuntu 18.04; in the second, I containerized everything and run it in Balena.

With raw Ubuntu (first setup), I can reproduce what is reported in the tkDNN repo: YOLOv4 with FP16 and batch_size=1 averages 41 FPS, or about 24 milliseconds per image. However, running the same code on Balena, I only get 21 FPS, or about 45 milliseconds per image. Both setups have the GPU at the same frequency.

The question is: why is this happening? What am I missing?
The steps to reproduce can be found in this repo: https://github.com/charlielito/test

Hi,

I assume you are running the tests on the DevKit, with no particular carrier board.
Does running the jetson_clocks.sh script in the container or host OS make any difference?
And does tegrastats show different values between the two setups?

Also, the currently available image 2.51 for the AGX is based on L4T 32.4.2, and the JetPack 4.3 Ubuntu is probably running 32.4.3. It might also have slightly different package versions than the ones installed in the container test app.

Yes, the tests are running on the DevKit. Running jetson_clocks made no difference; same results. Moreover, I just noticed that the Jetson with Balena has 32 GB of RAM and the other has 16 GB, but I don’t think that could negatively affect performance.

These are the results of tegrastats:

For balena Xavier

RAM 27794/31924MB (lfb 608x4MB) CPU [0%@1190,0%@1190,0%@1190,0%@1190,11%@1190,37%@1190,11%@1190,41%@1190] EMC_FREQ 40%@1331 GR3D_FREQ 99%@675 APE 150 MTS fg 0% bg 7% AO@29C GPU@31C iwlwifi@33C Tdiode@33.75C PMIC@100C AUX@28.5C CPU@30C thermal@29.55C Tboard@30C GPU 4797/513 CPU 774/725 SOC 2167/1137 CV 0/0 VDDRQ 1702/467 SYS5V 2688/1869

For raw ubuntu

RAM 13934/15823MB (lfb 124x4MB) SWAP 26/7911MB (cached 1MB) CPU [9%@2265,7%@2265,13%@2265,16%@2265,7%@2265,8%@2265,19%@2265,18%@2265] EMC_FREQ 0% GR3D_FREQ 98% AO@48C GPU@54.5C iwlwifi@47C Tdiode@52C PMIC@100C AUX@47C CPU@50.5C thermal@50C Tboard@47C GPU 20037/2670 CPU 2602/1987 SOC 4743/2236 CV 0/0 VDDRQ 2753/559 SYS5V 3588/2305

I don’t know if there is any substantial difference there. Also, the fact that the L4T version can differ slightly (just in the last number) doesn’t seem like a reason for such a difference in performance.
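For what it’s worth, one difference that is visible in the two tegrastats lines above is the per-core CPU clock: tegrastats prints each core as load%@MHz, and the Balena line shows every core at 1190 MHz while the raw Ubuntu line shows 2265 MHz (the Balena line also reports GR3D_FREQ at 675 MHz). A minimal Python sketch to pull those numbers out; the helper `cpu_mhz` is just illustrative, and the strings are trimmed copies of the lines above:

```python
import re

def cpu_mhz(tegrastats_line):
    """Extract per-core CPU clock frequencies (MHz) from a tegrastats line.

    tegrastats reports the CPUs as e.g. CPU [0%@1190,...,41%@1190],
    i.e. load%@frequency for each core.
    """
    cpu_block = re.search(r"CPU \[([^\]]+)\]", tegrastats_line).group(1)
    return [int(core.split("@")[1]) for core in cpu_block.split(",")]

balena = "CPU [0%@1190,0%@1190,0%@1190,0%@1190,11%@1190,37%@1190,11%@1190,41%@1190]"
ubuntu = "CPU [9%@2265,7%@2265,13%@2265,16%@2265,7%@2265,8%@2265,19%@2265,18%@2265]"

print(cpu_mhz(balena))  # all eight cores at 1190 MHz
print(cpu_mhz(ubuntu))  # all eight cores at 2265 MHz
```

If the two machines really were in the same power/clock state, those per-core frequencies would match.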

I wanted to test a TensorFlow model to see if the same problem arises. Indeed, running a model with the same code shows the same behavior as the original issue: in Balena we get 20 FPS and without Balena 41 FPS. I updated the repository to test this, since it is easier to reproduce: no compilation or TensorRT required, just raw TensorFlow.

EDIT: Just tested the code in Balena on the 16 GB version and, as expected, got the same results.

Hi Charlie, those results are a bit difficult to interpret without formatting, or some headers indicating what the values represent. Mostly I am curious about these results, though, and what they actually represent. (Note: I don’t have this hardware, so I couldn’t try to replicate it myself.)

GPU 20037/2670 (taken from Ubuntu results)

-versus-

GPU 4797/513 (taken from balena results)

What are those values?

Here you can find the meaning of the numbers https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-3231/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/AppendixTegraStats.html

Nonetheless, I don’t see any section referring to GPU X/Y. Maybe @acostach knows how to interpret those numbers?

Hi,

No, I didn’t find what GPU X/Y means either. But I ran nvpmodel to set the power model to MAXN, along with jetson_clocks, both in the container, and now I see the CPU and GPU clocks increase to values closer to the ones you saw on Ubuntu:

root@8267885:/usr/src/app# nvpmodel  -f /etc/nvpmodel/nvpmodel_t194.conf -m 0
root@8267885:/usr/src/app# nvpmodel -q -f /etc/nvpmodel/nvpmodel_t194.conf 
NV Fan Mode:quiet
NV Power Mode: MAXN
0
root@8267885:/usr/src/app# jetson_clocks

tegrastats:

RAM 1204/15697MB (lfb 3556x4MB) SWAP 0/1000MB (cached 0MB) CPU [0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,3%@2265,0%@2265,0%@2265] EMC_FREQ 0%@2133 GR3D_FREQ 0%@1377 APE 150 MTS fg 0% bg 0% AO@25C GPU@26C Tdiode@29C PMIC@100C AUX@24C CPU@26C thermal@25.2C Tboard@27C GPU 1056/1056 CPU 603/603 SOC 2112/2112 CV 0/0 VDDRQ 150/150 SYS5V 2049/2049
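The line above reports GR3D_FREQ 0%@1377 and EMC_FREQ 0%@2133, i.e. the GPU and memory controller at their maximum clocks after MAXN + jetson_clocks. A small sketch in the same spirit as before (the `clock_mhz` helper is illustrative, and the string is a trimmed copy of the tegrastats line above):

```python
import re

line = ("RAM 1204/15697MB (lfb 3556x4MB) SWAP 0/1000MB (cached 0MB) "
        "EMC_FREQ 0%@2133 GR3D_FREQ 0%@1377 APE 150")

def clock_mhz(tegrastats_line, field):
    """Pull the MHz value from a 'FIELD load%@freq' tegrastats entry."""
    return int(re.search(rf"{field} \d+%@(\d+)", tegrastats_line).group(1))

print(clock_mhz(line, "GR3D_FREQ"))  # 1377 -> GPU at max clock
print(clock_mhz(line, "EMC_FREQ"))   # 2133 -> memory controller at max clock
```

Compare that with the earlier Balena line, which reported GR3D_FREQ 99%@675, i.e. the GPU running at roughly half clock under load.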

Hello, please could you try Alexandru’s suggestion of changing the power model settings, and see if this remedies the situation? Thank you.

Indeed, this solves the issue! I naively thought that jetson_clocks also set the power mode to MaxQ :frowning:.
Now I’m only seeing roughly 1 millisecond more in Balena, which I would consider normal.

Thanks!