Hi Balena Community & Devs,
I’m trying to run a number of TensorRT neural networks that can utilize the Deep Learning Accelerator (DLA) on a Xavier AGX. When running the networks, they appear to rely solely on the AGX’s GPU and do not seem to access or use the DLA cores at all. My first observation was that GPU usage reported by tegrastats showed no difference with DLA enabled or disabled. Checking for DLA usage via cat on the DLA sysfs power nodes confirmed the assumption: both cores consistently reported “suspended” while running the network configured for either core.
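The checks were along these lines (the usual NVDLA sysfs nodes on a Xavier running L4T; the exact platform addresses here are illustrative and may differ between BSP releases):

cat /sys/devices/platform/host1x/15880000.nvdla0/power/runtime_status
cat /sys/devices/platform/host1x/158c0000.nvdla1/power/runtime_status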
As a sanity check, I used TensorRT’s trtexec tool for quick benchmarking of neural networks, and it complained that no DLA devices were available when run on the balena AGX, even though the chip has 2 DLA cores. As another sanity check, I ran the exact same networks with the exact same setup on a non-balena AGX, where they worked.
There seem to be little to no resources regarding DLA support for the Xavier AGX on the balena forums, so I was hoping to get some insight from the balena dev team on the possibility of using the DLA on balenaOS.
Machine: Nvidia Xavier AGX
BalenaOS Version: balenaOS 2.82.11+rev2
Supervisor Version: 12.9.3
Glad to provide any further information as needed.
Hello, I haven’t tried accessing DLA cores on an Nvidia device using balena before, but I’ll ask some of my colleagues if they have any insight. In the meantime, could you post the Dockerfile you used for your DLA test (even a minimal version)? It sounds like you have successfully accessed the GPU using balena, though?
Dockerfile.template.txt (1.8 KB)
Hi @alanb128,
I’ve attempted to set up a simple, minimal Dockerfile, building off the Xavier Dockerfile example in the OpenDataCam repo on GitHub. I updated the Dockerfile to install CUDA 10.2 and libcudnn8 as well as TensorRT. Once deployed, an example command that can be run is:
/usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/googlenet/googlenet.prototxt --output=prob --useDLACore=0 --allowGPUFallback
which fails with the following assertion error:
“…/rtExt/dla/eglUtils.cpp (99) - Assertion Error in operator(): 0 ((eglCreateStreamKHR) != nullptr)”
Running the exact same trtexec command without the useDLACore and allowGPUFallback flags does not raise any issues:
/usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/googlenet/googlenet.prototxt --output=prob
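For context, the Dockerfile boils down to something like this; a simplified sketch, with the base image and package names assumed from standard balenalib and NVIDIA L4T apt conventions (the full file is attached above):

FROM balenalib/jetson-xavier-ubuntu:bionic

# Tools needed to fetch the NVIDIA apt repository key
RUN apt-get update && apt-get install -y --no-install-recommends curl gnupg

# NVIDIA apt repository carrying the CUDA/cuDNN/TensorRT userspace packages
RUN echo "deb https://repo.download.nvidia.com/jetson/common r32.5 main" >> /etc/apt/sources.list \
    && curl -fsSL https://repo.download.nvidia.com/jetson/jetson-ota-public.asc | apt-key add -

# CUDA 10.2, cuDNN 8 and TensorRT (expected to pull in the samples under /usr/src/tensorrt)
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-toolkit-10-2 \
        libcudnn8 \
        tensorrt \
    && rm -rf /var/lib/apt/lists/*

# Keep the container alive so trtexec can be run interactively
CMD ["sleep", "infinity"]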
Hello @mohie34, one of my colleagues has suggested that DLA support may be linked to the versions of the packages installed in the container, so the first step is probably to ensure the t194 r32.5 repository is used in sources.list, like this: contracts/distro-config.tpl at b154f659a7299091a1958c8d8a576f5e5227b073 · balena-io/contracts · GitHub. Let’s try that first; if need be, we will then get someone on our end with access to a Xavier to troubleshoot further. Just to confirm: the first command executed successfully on a non-balena AGX?
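Concretely, the entry should look something like this (based on the linked template; the release tag must match the host BSP):

deb https://repo.download.nvidia.com/jetson/t194 r32.5 main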
Hi @alanb128, that’s correct: the first command executed without issues on a non-balena AGX. The failed assertion error still pops up after adding the t194 r32.5 entry to sources.list.
Hi @mohie34, we have created an issue for this item on GitHub: Investigate how DLA cores can enter active state on Xavier AGX · Issue #234 · balena-os/balena-jetson · GitHub and we’ll attempt to reproduce it to troubleshoot further. You can follow along there, and a notification will also appear in this thread once it has been closed.
Hi @mohie34, can you please try running:
cp /usr/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0 /usr/lib/aarch64-linux-gnu/libEGL_mesa.so.0
I have tested this with the latest Xavier AGX image, v2.88.4, which is based on L4T 32.6.1, using:
/usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --useDLACore=0 --allowGPUFallback
and saw no crash, but the cp above might still be needed for the previous image, which is based on L4T 32.5.1.
...
[02/17/2022-12:00:39] [I]
[02/17/2022-12:00:39] [I] === Performance summary ===
[02/17/2022-12:00:39] [I] Throughput: 579.664 qps
[02/17/2022-12:00:39] [I] Latency: min = 1.50879 ms, max = 2.42726 ms, mean = 1.70438 ms, median = 1.66333 ms, percentile(99%) = 2.33064 ms
[02/17/2022-12:00:39] [I] End-to-End Host Latency: min = 1.52075 ms, max = 2.45248 ms, mean = 1.7191 ms, median = 1.67694 ms, percentile(99%) = 2.3517 ms
[02/17/2022-12:00:39] [I] Enqueue Time: min = 1.28638 ms, max = 2.34479 ms, mean = 1.57797 ms, median = 1.53613 ms, percentile(99%) = 2.25676 ms
[02/17/2022-12:00:39] [I] H2D Latency: min = 0.00610352 ms, max = 0.177734 ms, mean = 0.013794 ms, median = 0.00616455 ms, percentile(99%) = 0.110107 ms
[02/17/2022-12:00:39] [I] GPU Compute Time: min = 1.49927 ms, max = 2.35214 ms, mean = 1.68536 ms, median = 1.65173 ms, percentile(99%) = 2.26614 ms
[02/17/2022-12:00:39] [I] D2H Latency: min = 0.00317383 ms, max = 0.0574646 ms, mean = 0.00523052 ms, median = 0.00341797 ms, percentile(99%) = 0.0341187 ms
[02/17/2022-12:00:39] [I] Total Host Walltime: 3.00174 s
[02/17/2022-12:00:39] [I] Total GPU Compute Time: 2.93253 s
[02/17/2022-12:00:39] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[02/17/2022-12:00:39] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[02/17/2022-12:00:39] [I] Explanations of the performance metrics are printed in the verbose logs.
[02/17/2022-12:00:39] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --useDLACore=0 --allowGPUFallback
[02/17/2022-12:00:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 911, GPU 13346 (MiB)
The full test output along with the Dockerfile for L4T 32.6.1 is available in the GitHub issue: Investigate how DLA cores can enter active state on Xavier AGX · Issue #234 · balena-os/balena-jetson · GitHub
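If you need to stay on the L4T 32.5.1-based release and the cp above helps, it can also be baked into the container so it survives rebuilds; an untested sketch for the Dockerfile:

# Untested workaround sketch: expose the Tegra EGL implementation under the
# Mesa loader name so EGL symbols such as eglCreateStreamKHR resolve for DLA
RUN cp /usr/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0 \
       /usr/lib/aarch64-linux-gnu/libEGL_mesa.so.0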