Nvidia hardware on generic-amd64

Hello,

We just found out about BalenaOS and want to test it on our client devices (we are coming from NixOS). Our hardware is:
CPU : Intel Core i7-14700K
GPU : NVIDIA GeForce RTX 3080 Ti

I saw this first blog post about working with hardware drivers, but it seems to be out of date since BalenaOS 3.0. I then found this second blog post about building out-of-tree Linux kernel modules and tried to make it work with the alexgg/nvidia branch.

The only change I made in the docker-compose.yml file was setting OS_VERSION to the current one (5.3.27+rev1). After a minor issue (a missing bc package), I ran into a more complex one:

[Build]   ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
[Build]   Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
[Build]   
[Build]   [load] 
[Build]   ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

From there, we tried a few things:

  • changing the NVIDIA driver version
  • connecting via SSH to run rmmod nvidiafb && rmmod nouveau, and only then performing the balena push
  • checking GCC versions (gcc --version in the Dockerfile and cat /proc/version on the OS both report 11.4.0)
  • disabling secure boot
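
For reference, the GCC check above can be sketched as a small script. This is only an illustration: gcc_version_from and same_gcc are hypothetical helper names (not part of balena or the NVIDIA installer), and the sed pattern assumes the kernel banner mentions the compiler as a lowercase "gcc" followed by its version, as it does on our device.

```shell
#!/bin/sh
# Hypothetical helper: extract the GCC version from a /proc/version-style string.
# Assumes the banner contains a lowercase "gcc" followed by the version number.
gcc_version_from() {
  printf '%s\n' "$1" | sed -n 's/.*gcc[^0-9]*\([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical helper: succeed (exit status 0) only when the compiler named in
# the kernel banner ($1) matches the build container's GCC version ($2).
same_gcc() {
  [ "$(gcc_version_from "$1")" = "$2" ]
}
```

To use it, copy the output of cat /proc/version from the device and pass it as the first argument, with the output of gcc -dumpfullversion from the build container as the second. The two commands run in different environments, so the values have to be collected separately.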

Adding --skip-module-load to ./${nvidia_installer} --silent --kernel-source-path "${headers_dir}" lets the installer run to completion, but the load service then ends up in this error loop:

[Logs]    [2024-07-09T15:45:28.722Z] Restarting service 'load sha256:c98fdcc73543b58a2e05fce6efa9dcfe435af71cf32b66c9b77807ca0d388fa8'
[Logs]    [2024-07-09T15:45:28.688Z] [load] OS Version is 5.3.27+rev1
[Logs]    [2024-07-09T15:45:28.688Z] [load] Loading module from /opt/lib/modules/5.3.27+rev1/nvidia-drm.ko
[Logs]    [2024-07-09T15:45:28.705Z] [load] insmod: can't insert '/opt/lib/modules/5.3.27+rev1/nvidia-drm.ko': unknown symbol in module, or unknown parameter
[Logs]    [2024-07-09T15:45:28.707Z] [load] Loading module from /opt/lib/modules/5.3.27+rev1/nvidia-modeset.ko
[Logs]    [2024-07-09T15:45:28.729Z] [load] insmod: can't insert '/opt/lib/modules/5.3.27+rev1/nvidia-modeset.ko': unknown symbol in module, or unknown parameter
[Logs]    [2024-07-09T15:45:28.733Z] [load] Loading module from /opt/lib/modules/5.3.27+rev1/nvidia-peermem.ko
[Logs]    [2024-07-09T15:45:28.773Z] [load] insmod: can't insert '/opt/lib/modules/5.3.27+rev1/nvidia-peermem.ko': Invalid argument
[Logs]    [2024-07-09T15:45:28.777Z] [load] Loading module from /opt/lib/modules/5.3.27+rev1/nvidia-uvm.ko
[Logs]    [2024-07-09T15:45:28.800Z] [load] insmod: can't insert '/opt/lib/modules/5.3.27+rev1/nvidia-uvm.ko': unknown symbol in module, or unknown parameter
[Logs]    [2024-07-09T15:45:28.804Z] [load] Loading module from /opt/lib/modules/5.3.27+rev1/nvidia.ko
[Logs]    [2024-07-09T15:45:29.047Z] Service exited 'load sha256:c98fdcc73543b58a2e05fce6efa9dcfe435af71cf32b66c9b77807ca0d388fa8'
[Logs]    [2024-07-09T15:45:28.962Z] [load] insmod: can't insert '/opt/lib/modules/5.3.27+rev1/nvidia.ko': No such device
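
One thing the log suggests is that the modules are inserted in alphabetical order, so nvidia-drm, nvidia-modeset and nvidia-uvm are tried before nvidia.ko, which they depend on. That would account for the "unknown symbol" lines, but not for the final "No such device" from nvidia.ko itself. Below is a minimal sketch of a dependency-aware order, assuming the load script can be changed. ordered_nvidia_modules is a hypothetical helper name, not something from the alexgg/nvidia branch.

```shell
#!/bin/sh
# Hypothetical helper: print the NVIDIA modules found in directory $1,
# base module first, so that dependents are inserted after nvidia.ko.
ordered_nvidia_modules() {
  dir=$1
  for name in nvidia nvidia-modeset nvidia-drm nvidia-uvm nvidia-peermem; do
    # Skip modules that were not built for this OS version.
    if [ -f "$dir/$name.ko" ]; then
      printf '%s\n' "$dir/$name.ko"
    fi
  done
}
```

The load script could then loop over the output of ordered_nvidia_modules "/opt/lib/modules/5.3.27+rev1" and call insmod on each path, stopping at the first failure so the real error from nvidia.ko is not buried under follow-on ones.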

Could you kindly help us troubleshoot the issue?

Thanks in advance!

Hello @till, first of all, welcome to the balena community.

Could you please share a docker-compose file that we can use to try to reproduce your issue?

Thanks!