NVIDIA driver doesn't work with GTX 970 and GTX 1060 on genericx86-64 image

Here is my Dockerfile.template, which installs the NVIDIA driver inside a container:

FROM balenalib/%%BALENA_MACHINE_NAME%%-ubuntu:latest

ARG RESINOS_VERSION=2.58.6%2Brev1.prod
ARG YOCTO_VERSION=5.2.10
ARG NVIDIA_DRIVER_VERSION=465.31

ENV YOCTO_KERNEL=${YOCTO_VERSION}-yocto-standard
ENV NVIDIA_DRIVER_RUN=NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run
ENV DEBIAN_FRONTEND=noninteractive

# Install Nvidia Driver
RUN apt-get update && apt-get install -y wget gcc build-essential apt-utils dialog aufs-tools libc-dev iptables conntrack unzip libglu1-mesa-dev
RUN wget -nv https://files.balena-staging.com/images/%%BALENA_MACHINE_NAME%%/${RESINOS_VERSION}/kernel_modules_headers.tar.gz && \
    tar -xzvf kernel_modules_headers.tar.gz && \
    mkdir -p /lib/modules/${YOCTO_KERNEL} && \
    cp -r kernel_modules_headers /lib/modules/${YOCTO_KERNEL}/build && \
    ln -s /lib64/ld-linux-x86-64.so.2 /lib/ld-linux-x86-64.so.2 && \
    wget -nv http://us.download.nvidia.com/XFree86/Linux-x86_64/${NVIDIA_DRIVER_VERSION}/${NVIDIA_DRIVER_RUN} && \
    chmod +x ./${NVIDIA_DRIVER_RUN} && \
    mkdir -p /nvidia/driver && \
    ./${NVIDIA_DRIVER_RUN} \
        --kernel-install-path=/nvidia/driver \
        --ui=none \
        --no-drm \
        --no-x-check \
        --install-compat32-libs \
        --no-nouveau-check \
        --no-nvidia-modprobe \
        --no-rpms \
        --no-backup \
        --no-check-for-alternate-installs \
        --no-libglx-indirect \
        --no-install-libglvnd \
        --x-prefix=/tmp/null \
        --x-module-path=/tmp/null \
        --x-library-path=/tmp/null \
        --x-sysconfig-path=/tmp/null \
        --kernel-name=${YOCTO_KERNEL} && \
    rm -rf /tmp/* ${NVIDIA_DRIVER_RUN} kernel_modules_headers.tar.gz
CMD ["bash", "/usr/app/start.sh"]

The content of start.sh:

insmod /nvidia/driver/nvidia.ko || true
insmod /nvidia/driver/nvidia-modeset.ko || true
insmod /nvidia/driver/nvidia-uvm.ko || true

I am using the genericx86-64 2.58.6 development image and building locally with the sudo balena push <IP> --noparent-check command.

This works well on my first PC, which has a GTX 1660 installed:

root@balena:/usr/app# nvidia-smi
Thu Jun  3 07:58:27 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 25%   43C    P0    20W / 120W |      0MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@balena:/usr/app# 

But sadly this doesn’t work on my other PCs, which have a GTX 970 or a GTX 1060 installed:

root@balena:/usr/app# lshw -C display
  *-display                 
       description: VGA compatible controller
       product: GM204 [GeForce GTX 970]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nouveau latency=0
       resources: irq:164 memory:a2000000-a2ffffff memory:90000000-9fffffff memory:a0000000-a1ffffff ioport:3000(size=128) memory:c0000-dffff
root@balena:/usr/app# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
root@balena:/usr/app# lshw -C display
  *-display                 
       description: VGA compatible controller
       product: GP106 [GeForce GTX 1060 6GB]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nouveau latency=0
       resources: irq:138 memory:de000000-deffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:e000(size=128) memory:c0000-dffff
root@balena:/usr/app# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

IMPORTANT NOTE: The Intel NUC images work well on ALL of these PCs; only the genericx86-64 image has this issue.

Any idea?
Cheers!

Another note: the latest genericx86-64 image from the staging server (2.73.1) doesn’t work on my first PC either. I think something is wrong with these generic images?

Hi Shane,

It looks like you might be having trouble loading the nvidia module because nouveau is already loaded. Unfortunately, blacklisting modules isn’t supported on balenaOS, but you should be able to unload nouveau before loading the nvidia module.

See this thread for information on how to go about that.
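
For illustration, here is a minimal sketch of how start.sh could unload nouveau before inserting the NVIDIA modules. It is not taken from the linked thread. The function names module_loaded and load_nvidia_modules and the MODULES_FILE and DRIVER_DIR variables are hypothetical; the module paths are the ones from the Dockerfile above. It assumes nothing else is holding nouveau (for example a framebuffer console), in which case rmmod would fail and further steps would be needed.

```shell
#!/bin/bash
# Sketch only: free the GPU from nouveau, then load the NVIDIA modules.

# File listing the loaded modules; /proc/modules is the kernel's own list.
# (MODULES_FILE is a hypothetical override, handy for dry runs.)
MODULES_FILE="${MODULES_FILE:-/proc/modules}"
# Directory the Dockerfile passed to --kernel-install-path.
DRIVER_DIR="${DRIVER_DIR:-/nvidia/driver}"

# Succeeds if the named module is listed in MODULES_FILE.
module_loaded() {
    grep -q "^$1 " "$MODULES_FILE"
}

# Unload nouveau if it is present, then insert the NVIDIA modules
# in the same order as the original start.sh.
load_nvidia_modules() {
    if module_loaded nouveau; then
        # rmmod fails if nouveau is still in use; report it instead of hiding it.
        rmmod nouveau || echo "could not unload nouveau" >&2
    fi
    for mod in nvidia nvidia-modeset nvidia-uvm; do
        insmod "$DRIVER_DIR/$mod.ko" || true
    done
}
```

Calling load_nvidia_modules from start.sh in place of the three insmod lines would apply this. The container needs to be privileged enough to load and unload kernel modules, which the original start.sh already relies on.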

Let me know if this helps.

Yeah, that works like a charm! Thanks a lot! :+1: