Jetson Nano GPU: CUDA driver does not match CUDA version

Hi guys,

We’re trying to set up a Jetson Nano-based project to perform edge machine-learning inference. We have specific requirements for our CUDA versions:

  • We need CUDA 10.2, TensorRT 7.1.3 and a compatible ONNX runtime version (1.6.0)

This is roughly the setup that comes on a Jetson Nano installed with NVIDIA’s OS image for JetPack 4.5.1, with L4T 32.5.

We’re trying to replicate this environment in Balena, but we are unable to make it work properly. We get the following error:

RuntimeError: /src/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:123 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /src/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:117 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=127 ; hostname=216db3c ; expr=cudaSetDevice(device_id_); 

This is odd because, when we compare the built image with the working Jetson OS-based install, they seem to correspond 1:1. We are aware that for correct L4T support we need the correct base OS version of Balena. We are running Balena OS 2.82.11 rev2, which reveals that it is indeed L4T 32.5:

$ uname -a
Linux 216db3c 4.9.201-l4t-r32.5 #1 SMP PREEMPT Thu May 6 13:07:24 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
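For comparing the host against the userspace packages, the L4T release can be pulled out of the kernel release string. A minimal sketch, assuming the kernel release carries the `-l4t-r<major>.<minor>` suffix seen in the output above; the function name is hypothetical:

```shell
# Hypothetical helper: extract the L4T release (e.g. "32.5") from a kernel
# release string such as the one printed by `uname -r`.
# Prints nothing if the string has no "-l4t-r" suffix.
l4t_release_from_kernel() {
    printf '%s\n' "$1" | sed -n 's/.*-l4t-r\([0-9][0-9.]*\).*/\1/p'
}

# On the device this would be called as:
#   l4t_release_from_kernel "$(uname -r)"
```

The printed value is what the `r32.5` entries in the apt sources and any driver files installed in the container should line up with.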

Below is the current iteration of our Dockerfile:

FROM balenalib/jetson-nano-ubuntu-python:3.8-bionic-build

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update

# ================
# Core deps. install
# ================

RUN \
    echo "deb https://repo.download.nvidia.com/jetson/common r32.5 main" >> /etc/apt/sources.list && \
    echo "deb https://repo.download.nvidia.com/jetson/t210 r32.5 main" >> /etc/apt/sources.list && \
    apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc && \
    mkdir -p /opt/nvidia/l4t-packages/ && \
    touch /opt/nvidia/l4t-packages/.nv-l4t-disable-boot-fw-update-in-preinstall && \
    apt-get update && \
    apt-get install -y --no-install-recommends nvidia-l4t-core

RUN \
    apt-get install --no-install-recommends -y \
    apt-utils \
    nvidia-l4t-firmware \
    nvidia-l4t-cuda \
    nvidia-l4t-tools \
    nvidia-l4t-gstreamer \
    nvidia-l4t-jetson-io \
    nvidia-l4t-configs \
    nvidia-l4t-oem-config \
    nvidia-jetpack \
    tensorrt \
    wget \
    tar \
    lbzip2 \
    python3-wheel \
    build-essential \
    ffmpeg \
    libsm6 \
    libxext6 \
    software-properties-common \
    libopenblas-dev \
    git

WORKDIR src

# Setup python3.8 and pip3.8
RUN python3 -m venv /opt/venv
RUN /opt/venv/bin/python3 -m pip install --upgrade pip

ENV PATH="/opt/venv/bin:$PATH"

RUN pip install build Cython scikit-build cppy setuptools-rust urllib3 numpy

# ================
# Code deps. install
# ================

# Install Cmake
COPY cmake23_ins.sh cmake23_ins.sh
RUN ./cmake23_ins.sh
ENV PATH="/src/cmake/bin:$PATH"

# Install ONNX runtime for Jetson Nano.
# ONNXRT V1.6.0, TRT V7.1, CUDA V10.2
COPY onnxrt.sh onnxrt.sh
RUN ./onnxrt.sh
RUN pip install "/src/onnxruntime/build/Linux/Release/dist/onnxruntime_gpu_tensorrt-1.6.0-cp38-cp38-linux_aarch64.whl"


# ================
# Pipeline install
# ================
WORKDIR app

# Install the requirements before the application as they are
# unlikely to change very often
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

# Install the pipeline itself
# ---------------------------
# Because of the docker pathing, if we do not remove the *.egg-info, python
# thinks the package is installed but it's not really installed
COPY . .
RUN python -m build && rm -rf *.egg-info && python -m pip install dist/*.whl


# ================
# Validate
# ================
RUN our-app --info # ensure it works

# ================
# ENV Configuration
# ================
ENV UDEV=on

# Entrypoint, run with the provided config + input
CMD ["our-app", "/data/config.json", "/data/input.json"]

It refers to two external scripts (cmake23_ins.sh and onnxrt.sh). The former installs cmake for our platform and the latter builds the specific onnxruntime that we need bundled.

Our docker compose does nothing special:

version: '2'
volumes:
        resin-data:
services:
        appliance:
                build: .
                volumes:
                        - 'resin-data:/data'
                privileged: true
                restart: always
                network_mode: "host"
# We have tried with and without these options. Adding the GPU features
# makes the container not even start.
#                labels:
#                        io.balena.features.kernel-modules: '1'
#                        io.balena.features.gpu: '1'

The docker image builds successfully, all the dependencies are resolved correctly. The error, as listed above, is a runtime error.

The tegrastats command seems to return correct values from the container, so I suspect it does see the GPU.

We’re looking for a fix for the GPU driver apparently being the wrong version, but I am unsure where to start, since we are installing the complete L4T package.

Thanks in advance!

Hi @samuel_yvon,

Do you unpack the L4T 32.5 BSP archive in a script that has not been posted? This step seems to be missing in your Dockerfile.

This type of error can occur when the drivers from the BSP archive are not present or don’t match the kernel’s L4T version, so if unpacking the BSP archive in the Dockerfile as listed in this example does not solve the error, please share the rest of the scripts as well as the exact device type that you are using and we will try to replicate the error on our side.
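The unpacking step described above might look roughly like the sketch below, written as a helper script in the same style as cmake23_ins.sh and onnxrt.sh. The function name, its arguments, and the choice to pass in an already-downloaded archive are assumptions for illustration. The `-r` and `--target-overlay` options are how apply_binaries.sh from the 32.x BSP is commonly invoked for this purpose, but check the script's usage output for your release before relying on them.

```shell
# Hypothetical helper (name and arguments are illustrative, not from the thread).
#   $1: path to an already-downloaded L4T BSP archive matching the host OS release
#   $2: root filesystem to apply the driver binaries to (defaults to /)
apply_l4t_bsp() {
    archive="$1"
    rootfs="${2:-/}"
    workdir="$(mktemp -d)" || return 1

    # The BSP archive is expected to unpack into a Linux_for_Tegra directory.
    tar -xf "$archive" -C "$workdir" || { rm -rf "$workdir"; return 1; }
    if [ ! -f "$workdir/Linux_for_Tegra/apply_binaries.sh" ]; then
        echo "apply_binaries.sh not found in $archive" >&2
        rm -rf "$workdir"
        return 1
    fi

    # Install the driver binaries into the target root filesystem.
    # Assumption: -r and --target-overlay are accepted by this BSP release.
    (cd "$workdir/Linux_for_Tegra" && ./apply_binaries.sh -r "$rootfs" --target-overlay)
    status=$?

    # Remove the unpacked tree so it does not bloat the image layer.
    rm -rf "$workdir"
    return "$status"
}
```

A script containing this function would call `apply_l4t_bsp <path to BSP archive> /` and be copied in and run from the Dockerfile before the steps that depend on CUDA. The archive must be the L4T release that matches the host OS (32.5 in this thread).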

Knowing the exact Nano device type is useful because we have 2.82.11+rev2 images for multiple device types that use slightly different hardware. The Nano SD, Nano eMMC, Nano 2GB as well as the JN30B-Nano all have a release at this version.

Hi!

Thanks for the reply. I was actually missing that step; I will rebuild with it and update back (pretty rookie mistake :sweat_smile: )

As for the device type, it’s a Nano 2GB DevKit for now :slight_smile:


Thank you very much for your help, this was indeed the issue.
