Hi guys,
We’re trying to set up a Jetson Nano based project to perform edge machine-learning inference. We have specific requirements for our CUDA stack:
- We need CUDA 10.2, TensorRT 7.1.3 and a compatible ONNX Runtime version (1.6.0)
This is roughly the setup that ships on a Jetson Nano installed with NVIDIA’s OS for JetPack 4.5.1, with L4T 32.5.
We’re trying to replicate this environment in Balena, but we are unable to make it work properly. We get the following error:
RuntimeError: /src/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:123 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /src/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:117 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=127 ; hostname=216db3c ; expr=cudaSetDevice(device_id_);
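If it matters, the failure happens as soon as the CUDA execution provider initializes; creating a bare InferenceSession should already be enough to reproduce it (the model path below is just an example, not our real config):
$ python3 -c "import onnxruntime as ort; ort.InferenceSession('/data/model.onnx')"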
This is odd because, when inspecting the built image against the working Jetson OS based install, they seem to correspond 1:1. We are aware that for correct L4T support we need the matching balenaOS version. We are running balenaOS 2.82.11+rev2, which is indeed based on L4T 32.5:
$ uname -a
Linux 216db3c 4.9.201-l4t-r32.5 #1 SMP PREEMPT Thu May 6 13:07:24 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
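(In case it is useful for cross-checking: the BSP version inside the container can also be confirmed with something like the commands below; both the package and the release file come from nvidia-l4t-core.)
$ dpkg-query --show nvidia-l4t-core
$ cat /etc/nv_tegra_release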
Below is the current iteration of our Dockerfile:
FROM balenalib/jetson-nano-ubuntu-python:3.8-bionic-build
ARG DEBIAN_FRONTEND=noninteractive
# ================
# Core deps. install
# ================
RUN \
echo "deb https://repo.download.nvidia.com/jetson/common r32.5 main" >> /etc/apt/sources.list && \
echo "deb https://repo.download.nvidia.com/jetson/t210 r32.5 main" >> /etc/apt/sources.list && \
apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc && \
mkdir -p /opt/nvidia/l4t-packages/ && \
touch /opt/nvidia/l4t-packages/.nv-l4t-disable-boot-fw-update-in-preinstall && \
apt-get update && \
apt-get install -y --no-install-recommends nvidia-l4t-core
RUN \
apt-get install --no-install-recommends -y \
apt-utils \
nvidia-l4t-firmware \
nvidia-l4t-cuda \
nvidia-l4t-tools \
nvidia-l4t-gstreamer \
nvidia-l4t-jetson-io \
nvidia-l4t-configs \
nvidia-l4t-oem-config \
nvidia-jetpack \
tensorrt \
wget \
tar \
lbzip2 \
python3-wheel \
build-essential \
ffmpeg \
libsm6 \
libxext6 \
software-properties-common \
libopenblas-dev \
git
WORKDIR /src
# Set up a Python 3.8 virtual environment with an up-to-date pip
RUN python3 -m venv /opt/venv
RUN /opt/venv/bin/python3 -m pip install --upgrade pip
ENV PATH="/opt/venv/bin:$PATH"
RUN pip install build Cython scikit-build cppy setuptools-rust urllib3 numpy
# ================
# Code deps. install
# ================
# Install Cmake
COPY cmake23_ins.sh cmake23_ins.sh
RUN ./cmake23_ins.sh
ENV PATH="/src/cmake/bin:$PATH"
# Install ONNX Runtime for Jetson Nano.
# ONNXRT v1.6.0, TRT v7.1, CUDA v10.2
COPY onnxrt.sh onnxrt.sh
RUN ./onnxrt.sh
RUN pip install "/src/onnxruntime/build/Linux/Release/dist/onnxruntime_gpu_tensorrt-1.6.0-cp38-cp38-linux_aarch64.whl"
# ================
# Pipeline install
# ================
WORKDIR /app
# Install the requirements before the application as they are
# unlikely to change very often
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
# Install the pipeline itself
# ---------------------------
# Because of the docker pathing, if we do not remove the *.egg-info, python
# thinks the package is installed but it's not really installed
COPY . .
RUN python -m build && rm -rf *.egg-info && python -m pip install dist/*.whl
# ================
# Validate
# ================
RUN our-app --info # ensure it works
# ================
# ENV Configuration
# ================
ENV UDEV=on
# Entrypoint, run with the provided config + input
CMD ["our-app", "/data/config.json", "/data/input.json"]
It refers to two external scripts (cmake23_ins.sh and onnxrt.sh). The former installs CMake for our platform and the latter builds the specific ONNX Runtime wheel that we bundle.
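For reference, here is roughly what the two scripts boil down to. This is a simplified sketch: the exact CMake version, URLs and clone location are pinned in the real scripts, so treat the values below as placeholders.
#!/usr/bin/env bash
set -e
# cmake23_ins.sh (sketch): unpack a prebuilt aarch64 CMake into /src/cmake,
# which the Dockerfile then puts on PATH. The version here is a placeholder.
CMAKE_VERSION=3.23.1
wget "https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-aarch64.tar.gz"
mkdir -p /src/cmake
tar -xzf "cmake-${CMAKE_VERSION}-linux-aarch64.tar.gz" -C /src/cmake --strip-components=1

# onnxrt.sh (sketch): build ONNX Runtime v1.6.0 with the TensorRT execution
# provider, producing the wheel that the Dockerfile installs afterwards.
git clone --recursive --branch v1.6.0 https://github.com/microsoft/onnxruntime /src/onnxruntime
cd /src/onnxruntime
./build.sh --config Release --update --build --parallel --build_wheel \
    --use_tensorrt --cuda_home /usr/local/cuda \
    --cudnn_home /usr/lib/aarch64-linux-gnu \
    --tensorrt_home /usr/lib/aarch64-linux-gnu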
Our docker-compose.yml does nothing special:
version: '2'

volumes:
  resin-data:

services:
  appliance:
    build: .
    volumes:
      - 'resin-data:/data'
    privileged: true
    restart: always
    network_mode: "host"
    # We have tried with and without these options. Adding the GPU features
    # makes the container not even start.
    # labels:
    #   io.balena.features.kernel-modules: '1'
    #   io.balena.features.gpu: '1'
The Docker image builds successfully and all dependencies resolve correctly; the error listed above only appears at runtime.
The tegrastats command seems to return correct values from inside the container, so I suspect it does see the GPU.
We're looking for a solution to the GPU driver effectively being the wrong version, but I am unsure where to start, since we are installing the complete L4T package set.
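In case it helps with diagnosing: since CUDA error 35 means the runtime cannot find a sufficiently new driver, checking which libcuda the container actually resolves might narrow it down. The paths below are the standard L4T locations, so treat them as assumptions:
$ ldconfig -p | grep libcuda
$ ls -l /usr/lib/aarch64-linux-gnu/tegra/libcuda.so*
$ echo $LD_LIBRARY_PATH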
Thanks in advance!