Jetson: Support Nvidia Docker Images

@alanb128

Here is detailed issue we would like to solve.

We were able to install the docker & Nvidia container toolkit inside a service that is running on Jetson Device.

Here is our Docker file:

FROM balenalib/jetson-tx2-ubuntu:bionic
ENV DEBIAN_FRONTEND=noninteractive
# Install CUDA and some utilities
RUN apt-get update && apt-get install -y lbzip2 xorg cuda-toolkit-10-2 wget tar python3 libegl1 python3-gi

# Download and install BSP binaries for L4T 32.4.4
RUN apt-get update && apt-get install -y lbzip2 python3 libegl1 && \
    wget https://developer.nvidia.com/embedded/L4T/r32_Release_v4.4/r32_Release_v4.4-GMC3/T210/Tegra210_Linux_R32.4.4_aarch64.tbz2 && \
    tar xf Tegra210_Linux_R32.4.4_aarch64.tbz2 && \
    cd Linux_for_Tegra && \
    sed -i 's/config.tbz2\"/config.tbz2\" --exclude=etc\/hosts --exclude=etc\/hostname/g' apply_binaries.sh && \
    sed -i 's/install --owner=root --group=root \"${QEMU_BIN}\" \"${L4T_ROOTFS_DIR}\/usr\/bin\/\"/#install --owner=root --group=root \"${QEMU_BIN}\" \"${L4T_ROOTFS_DIR}\/usr\/bin\/\"/g' nv_tegra/nv-apply-debs.sh && \
    sed -i 's/LC_ALL=C chroot . mount -t proc none \/proc/ /g' nv_tegra/nv-apply-debs.sh && \
    sed -i 's/umount ${L4T_ROOTFS_DIR}\/proc/ /g' nv_tegra/nv-apply-debs.sh && \
    sed -i 's/chroot . \//  /g' nv_tegra/nv-apply-debs.sh && \
    ./apply_binaries.sh -r / --target-overlay && cd .. \
    rm -rf Tegra210_Linux_R32.4.4_aarch64.tbz2 && \
    rm -rf Linux_for_Tegra && \
    echo "/usr/lib/aarch64-linux-gnu/tegra" > /etc/ld.so.conf.d/nvidia-tegra.conf && ldconfig

# Install CuDNN
RUN apt-get install -y libcudnn8 nvidia-cudnn8

RUN apt-get update && apt-get install -y apt-utils dialog aufs-tools gcc libc-dev iptables conntrack unzip

# Install docker.
RUN apt-get install -y apt-transport-https ca-certificates curl gnupg lsb-release && \
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg && \
    echo "deb [arch=arm64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
        $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null && \
    apt-get update && apt-get install -y docker-ce

# Install Nvidia Container Toolkit
RUN distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
RUN apt-get update && apt-get install -y nvidia-docker2
# Enable udevd so that plugged dynamic hardware devices show up in our container.
ENV UDEV=1
ENV INITSYSTEM on

And here is our docker-compose.yml file to support docker inside our service(named sm):

version: '2.1'

volumes:
    docker-data:
services:
  sm:
    build: ./sm
    restart: always
    privileged: true
    network_mode: host
    environment:
      - DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket
    volumes:
      - 'docker-data:/var/lib/docker'
    labels:
      io.balena.features.supervisor-api: '1'
      io.balena.features.balena-api: '1'
      io.balena.features.kernel-modules: '1'
      io.balena.features.firmware: '1'
      io.balena.features.dbus: '1'
    cap_add:
      - SYS_RAWIO
    ports:
      - "80:80"
    devices:
      - "/dev:/dev"
  • We can run any docker container inside this sm service.
  • Everything is installed on sm service correctly as shown below:
root@balena:/usr/app# head -n 1 /etc/nv_tegra_release
# R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t210ref, EABI: aarch64, DATE: Fri Oct 16 19:44:43 UTC 2020
root@balena:/usr/app# tegrastats 
RAM 2032/7846MB (lfb 16x4MB) SWAP 1/3923MB (cached 0MB) CPU [6%@498,off,off,4%@498,7%@499,3%@499] EMC_FREQ 0%@204 GR3D_FREQ 0%@114 APE 150 PLL@32C MCPU@32C PMIC@100C Tboard@28C GPU@31.5C BCPU@32C thermal@32.1C Tdiode@29C VDD_SYS_GPU 152/152 VDD_SYS_SOC 381/381 VDD_4V0_WIFI 19/19 VDD_IN 1600/1600 VDD_SYS_CPU 152/152 VDD_SYS_DDR 133/133
RAM 2032/7846MB (lfb 16x4MB) SWAP 1/3923MB (cached 0MB) CPU [1%@499,off,off,0%@500,4%@499,0%@499] EMC_FREQ 0%@204 GR3D_FREQ 0%@114 APE 150 PLL@32C MCPU@32C PMIC@100C Tboard@28C GPU@31C BCPU@32C thermal@32.1C Tdiode@29C VDD_SYS_GPU 152/152 VDD_SYS_SOC 381/381 VDD_4V0_WIFI 19/19 VDD_IN 1562/1581 VDD_SYS_CPU 152/152 VDD_SYS_DDR 114/123
root@balena:/usr/app# docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 14
 Server Version: 20.10.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.9.140-l4t-r32.4
 Operating System: Ubuntu 18.04.5 LTS (containerized)
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 7.662GiB
 Name: balena
 ID: C726:TCOB:HCPX:323M:WKGK:5EBR:RVPQ:XDKV:3VMX:IYXI:C4R4:4O25
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

But we cannot use --runtime nvidia parameter to launch the nvidia base images:

root@balena:/usr/app# docker run --runtime nvidia --network host -it nvcr.io/nvidia/l4t-tensorflow:r32.4.4-tf2.3-py3
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.
ERRO[0000] error waiting for container: context canceled 

Any idea to support GPU-based docker containers?

Cheers,
Shane.

But you can run other, non-nvidia containers in that manner? If you try that same container without the --runtime-nvidia does it actually launch the container? (Granted, I understand it might not be accelerated, but just trying to determine if any container can launch)

@dtischler

Yes, all other containers are running without any issue.