I want to share a Docker image that I'm using to run YOLOv8n with TensorRT on my Jetson Nano. I'm sharing it here to save the effort for anyone who is planning to do the same.
This post follows up on the one I made before, which was limited to running the .pt model instead of the .engine one.
These are the details of the Jetson nano it has been tested on:
HOST OS VERSION: balenaOS 4.0.9+rev2
SUPERVISOR VERSION: 16.1.0
I made a build (3.82 GB) with the Docker image here under the name mwlvdev/jetson-nano-ubuntu:bionic-torch1.10-cp38-cuda10.2-TRT
@MWLC thanks for sharing! I had a quick look at your BYODR project which looks very cool. Are you running these robots with balena? We’d love to hear more about your project and your mission with MWLC so feel free to share.
Thanks for sharing. In order to build the .engine, do you have any dependencies on shared object libraries located under /usr/lib/aarch64-linux-gnu/tegra, e.g. libnvdla_compiler.so?
If so, how do you make sure these are available when running your container?
I exported the .engine directly from the .pt using model.export() (Export - Ultralytics YOLOv8 Docs), which is offered as part of the Ultralytics library in Python. Some answers mentioned that I should export to .onnx first and then to .engine, but I didn't need to with the default yolov8n.pt.
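For reference, this is roughly what that export looks like (a minimal sketch using the default yolov8n.pt; the test image URL is just an example):

```python
from ultralytics import YOLO

# Load the stock YOLOv8n weights and export a TensorRT engine.
# The export should run on the Jetson itself so the engine is built
# for its GPU and TensorRT version.
model = YOLO("yolov8n.pt")
model.export(format="engine")  # writes yolov8n.engine next to the .pt file

# The engine can then be loaded and used like any other model:
trt_model = YOLO("yolov8n.engine")
results = trt_model("https://ultralytics.com/images/bus.jpg")
```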
Thank you for your quick response. It makes sense, and I appreciate you sharing the resource links.
Have you encountered any performance issues running models with TensorRT on BalenaOS compared to the default OS on the Jetson? In my tests, using the same containerized applications on both an off-the-shelf Jetson and one flashed with BalenaOS, I noticed higher CPU and GPU usage on the BalenaOS device.
If this hasn’t been your experience, I may need to investigate further to understand the cause of this behavior.
I didn’t try to run it with the normal Jetson L4T. I would assume the higher CPU usage is because of the balena_supervisor that is running in the background all the time.
There are more approaches you can take if you want to run the model faster, such as lowering the input resolution, lowering the model precision (e.g. FP16 or INT8 instead of FP32), or running the Nano in performance mode; more can be found here.
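If it helps, both the input resolution and the precision can be set on the same export call. A quick sketch (the 416 input size is just an example value):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Export a TensorRT engine at FP16 precision with a smaller input size.
# half=True enables FP16; imgsz sets the input resolution the engine is
# built for (416 here instead of the default 640, as an example).
model.export(format="engine", half=True, imgsz=416)
```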
Feel free to ask here if you have more questions😃.
I will investigate this issue further. Currently, we are using fp16 and cannot reduce it further due to the size of the objects we need to detect and the accuracy required. We are updating our fleet to Orin devices, which will allow us to simplify our Docker images to the one you suggested, even though they are quite similar.
The GPU performance is still good and almost matches what we got with the OS that came with the device. My main concern is the unexplained increase in CPU usage. We had our app running smoothly on the original OS, but now we’re seeing CPU usage consistently at 20-30% higher with BalenaOS. I’ll keep digging to figure out what’s causing this.
We haven’t thoroughly investigated the problem yet. We noticed that BalenaCloud is reporting very high CPU usage (e.g., as seen in the attached image). This is also reflected in our Grafana dashboard, so we’ve just assumed this to be accurate for now.
Our services are a bit specialized and rely on different hardware, so it’s tough to reproduce the issue without the same setup. I’ll try to narrow down the key services to make it easier to share with you.
Do you expect the performance of multithreaded applications within containers to be comparable when running on the original OS versus running on BalenaOS?
Thanks for the link to the blog post! I guess we might diverge a bit from the original issue in this post; if so, we could continue the discussion elsewhere. Anyway, thanks a lot for your help so far.
It looks like we're using a pretty similar Dockerfile to the one you pointed out here. For development purposes, we start from nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel and install the BSP for L4T version r35.4.1, as you can see in the attached Dockerfile.
From what I can tell, BalenaOS comes with l4t-r35.4, which matches up with the BSP we’re installing in the Docker image. The output from uname -a confirms this.
The metrics reported in BalenaCloud seem to be accurate when we check top. When we use the resulting Docker image on BalenaOS, we see a higher system load and CPU usage compared to running it directly on the OS shipped with the Jetson.
For example, when we run docker run --runtime nvidia --rm --device /dev/bus/usb private-docker-image:v0.2.0 directly on the Jetson OS, our app uses around 70% CPU on average.
But when we use the same image on BalenaOS (with the BSP bundled in as per the instructions), the CPU usage shoots up to over 90%, and the system load is much higher.
Here’s the Dockerfile we’re using, and we push it with balena push:
FROM private-docker-image:v0.2.0
# Don't prompt with any configuration questions
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
lbzip2 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Set paths
ENV CUDA_HOME=/usr/local/cuda-11.4
ENV UDEV=1
# Download and install BSP binaries for L4T
RUN \
cd /tmp/ && wget https://developer.nvidia.com/downloads/embedded/l4t/r35_release_v4.1/release/jetson_linux_r35.4.1_aarch64.tbz2 && \
tar xf jetson_linux_r35.4.1_aarch64.tbz2 && \
cd Linux_for_Tegra && \
sed -i 's/config.tbz2\"/config.tbz2\" --exclude=etc\/hosts --exclude=etc\/hostname/g' apply_binaries.sh && \
sed -i 's/install --owner=root --group=root \"${QEMU_BIN}\" \"${L4T_ROOTFS_DIR}\/usr\/bin\/\"/#install --owner=root --group=root \"${QEMU_BIN}\" \"${L4T_ROOTFS_DIR}\/usr\/bin\/\"/g' nv_tegra/nv-apply-debs.sh && \
sed -i 's/chroot . \// /g' nv_tegra/nv-apply-debs.sh && \
./apply_binaries.sh -r / --target-overlay && cd .. && \
rm -rf Linux_for_Tegra && \
echo "/usr/lib/aarch64-linux-gnu/tegra" > /etc/ld.so.conf.d/nvidia-tegra.conf && ldconfig
CMD ["./build/run"]
Have you seen this behaviour before when comparing BalenaOS to the OS that the Jetson is shipped with?