I want to share a Docker image that I'm using to run YOLOv8n with TensorRT on my Jetson Nano. I'm sharing it here to save the effort for anyone who is planning to do the same.
This post follows up on the one I made before, which was limited to running the .pt model instead of the .engine one.
These are the details of the Jetson nano it has been tested on:
HOST OS VERSION: balenaOS 4.0.9+rev2
SUPERVISOR VERSION: 16.1.0
I made a build (3.82 GB) with the Docker image here under the name mwlvdev/jetson-nano-ubuntu:bionic-torch1.10-cp38-cuda10.2-TRT
@MWLC thanks for sharing! I had a quick look at your BYODR project which looks very cool. Are you running these robots with balena? We’d love to hear more about your project and your mission with MWLC so feel free to share.
Thanks for sharing. In order to build the .engine, do you have any dependencies on shared object libraries located under /usr/lib/aarch64-linux-gnu/tegra, e.g. libnvdla_compiler.so?
If so, how do you make sure these are available when running your container?
I exported the .engine directly from the .pt using model.export() (Export - Ultralytics YOLOv8 Docs), which is offered as part of the Ultralytics library in Python. Some answers mentioned that I should export to .onnx first and then to .engine, but I didn't need to with the default yolov8n.pt.
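For reference, this is roughly what that export looks like (a minimal sketch using the default yolov8n.pt; the test image URL is just an example):

```python
from ultralytics import YOLO

# Load the stock YOLOv8n weights and export a TensorRT engine.
# The export should run on the Jetson itself so the engine is built
# for its GPU and TensorRT version.
model = YOLO("yolov8n.pt")
model.export(format="engine")  # writes yolov8n.engine next to the .pt file

# The engine can then be loaded and used like any other model:
trt_model = YOLO("yolov8n.engine")
results = trt_model("https://ultralytics.com/images/bus.jpg")
```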
Thank you for your quick response. It makes sense, and I appreciate you sharing the resource links.
Have you encountered any performance issues running models with TensorRT on BalenaOS compared to the default OS on the Jetson? In my tests, using the same containerized applications on both an off-the-shelf Jetson and one flashed with BalenaOS, I noticed higher CPU and GPU usage on the BalenaOS device.
If this hasn’t been your experience, I may need to investigate further to understand the cause of this behavior.
I didn’t try to run it with the normal Jetson L4T. I would assume the higher CPU usage is because of the balena_supervisor that is running in the background all the time.
There are more approaches you can take if you want to run the model faster, such as lowering the input resolution, lowering the model precision (e.g. FP16 or INT8 instead of FP32), or running the Nano in performance mode; more can be found here.
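If it helps, both the input resolution and the precision can be set on the same export call. A quick sketch (the 416 input size is just an example value):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Export a TensorRT engine at FP16 precision with a smaller input size.
# half=True enables FP16; imgsz sets the input resolution the engine is
# built for (416 here instead of the default 640, as an example).
model.export(format="engine", half=True, imgsz=416)
```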
Feel free to ask here if you have more questions😃.
I will investigate this issue further. Currently, we are using fp16 and cannot reduce it further due to the size of the objects we need to detect and the accuracy required. We are updating our fleet to Orin devices, which will allow us to simplify our Docker images to the one you suggested, even though they are quite similar.
The GPU performance is still good and almost matches what we got with the OS that came with the device. My main concern is the unexplained increase in CPU usage. We had our app running smoothly on the original OS, but now we’re seeing CPU usage consistently at 20-30% higher with BalenaOS. I’ll keep digging to figure out what’s causing this.
We haven’t thoroughly investigated the problem yet. We noticed that BalenaCloud is reporting very high CPU usage (e.g., as seen in the attached image). This is also reflected in our Grafana dashboard, so we’ve just assumed this to be accurate for now.
Our services are a bit specialized and rely on different hardware, so it’s tough to reproduce the issue without the same setup. I’ll try to narrow down the key services to make it easier to share with you.
Do you expect the performance of multithreaded applications within containers to be comparable when running on the original OS versus running on BalenaOS?
Thanks for the link to the blog post! I guess we might diverge a bit from the original issue in this post; if so, we could continue the discussion elsewhere. Anyway, thanks a lot for your help so far.
It looks like we're using a pretty similar Dockerfile to the one you pointed out here. For development purposes, we start from nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel and install the BSP for L4T version r35.4.1, as you can see in the attached Dockerfile.
From what I can tell, BalenaOS comes with l4t-r35.4, which matches up with the BSP we’re installing in the Docker image. The output from uname -a confirms this.
The metrics reported in BalenaCloud seem to be accurate when we check top. When we use the resulting Docker image on BalenaOS, we see a higher system load and CPU usage compared to running it directly on the OS shipped with the Jetson.
For example, when we run docker run --runtime nvidia --rm --device /dev/bus/usb private-docker-image:v0.2.0 directly on the Jetson OS, our app uses around 70% CPU on average.
But when we use the same image on BalenaOS (with the BSP bundled in as per the instructions), the CPU usage shoots up to over 90%, and the system load is much higher.
Here’s the Dockerfile we’re using, and we push it with balena push:
FROM private-docker-image:v0.2.0
# Don't prompt with any configuration questions
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
lbzip2 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Set paths
ENV CUDA_HOME=/usr/local/cuda-11.4
ENV UDEV=1
# Download and install BSP binaries for L4T
RUN \
cd /tmp/ && wget https://developer.nvidia.com/downloads/embedded/l4t/r35_release_v4.1/release/jetson_linux_r35.4.1_aarch64.tbz2 && \
tar xf jetson_linux_r35.4.1_aarch64.tbz2 && \
cd Linux_for_Tegra && \
sed -i 's/config.tbz2\"/config.tbz2\" --exclude=etc\/hosts --exclude=etc\/hostname/g' apply_binaries.sh && \
sed -i 's/install --owner=root --group=root \"${QEMU_BIN}\" \"${L4T_ROOTFS_DIR}\/usr\/bin\/\"/#install --owner=root --group=root \"${QEMU_BIN}\" \"${L4T_ROOTFS_DIR}\/usr\/bin\/\"/g' nv_tegra/nv-apply-debs.sh && \
sed -i 's/chroot . \// /g' nv_tegra/nv-apply-debs.sh && \
./apply_binaries.sh -r / --target-overlay && cd .. && \
rm -rf Linux_for_Tegra && \
echo "/usr/lib/aarch64-linux-gnu/tegra" > /etc/ld.so.conf.d/nvidia-tegra.conf && ldconfig
CMD ["./build/run"]
Have you seen this behaviour before when comparing BalenaOS to the OS that the Jetson is shipped with?