I am using the generic x86_64 OS type. The host OS ships the nouveau drivers, but my container needs the nvidia drivers, and this is causing a few problems. On some GPUs, such as the GeForce GTX 16 series, the nouveau drivers are not loaded when the kernel boots, and my container runs with the nvidia drivers installed. But on some desktops, such as the HP Omen with a GeForce RTX 2070, the nouveau drivers are loaded at boot. So, in order for my container to work on HP Omens, I remounted the host OS with read and write permissions and blacklisted the nouveau drivers there so that they do not load at boot time. I did this because blacklisting from my services did not work, however many blacklist conf files I tried to write. Now the nvidia drivers are getting loaded and I am able to run my application. But I have another problem. The nvidia-uvm, nvidia and nvidia-dkms modules are all getting loaded, but nvidia-drm is not. When I try to insert that module using insmod, it reports an unknown symbol. I want to know whether what I did broke something and, if the approach is correct, whether there is a cleaner way to handle it. Thanks!
I understand you are using the Generic x86_64 device type and have included the proprietary nvidia drivers in your application container. As you say, the Generic x86_64 device type provides a wide selection of modules, and these are available to your multicontainer application when you use the io.balena.features.kernel-modules label in your docker-compose file (see https://www.balena.io/docs/learn/develop/multicontainer/#labels).
Once all the modules are accessible from the application containers, there is no need to modify the hostOS in any way. Doing so is not recommended, as there is no way to replicate those changes automatically across a fleet of devices, and the changes will be lost when you perform a hostOS update of the device.
You should be able to manage the modules from your application container as you would from a normal Linux system. For example, if you want to blacklist a module, you should add it to the modprobe.d/blacklist.conf file.
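As a minimal sketch of that, assuming the usual modprobe.d syntax: the helper below appends a blacklist entry for nouveau. The function name write_nouveau_blacklist and its optional directory argument are hypothetical, added only so it can be tried against a scratch directory first, and the modeset=0 line follows the commonly used nouveau blacklist recipe. Keep in mind that a blacklist entry only affects automatic loading; it does not unload a module that is already loaded.

```shell
#!/bin/bash
# Hypothetical helper: append a nouveau blacklist entry to blacklist.conf.
# The target directory defaults to /etc/modprobe.d; pass another path for a dry run.
write_nouveau_blacklist() {
    local dir="${1:-/etc/modprobe.d}"
    mkdir -p "$dir"
    # Append so that any existing entries in the file are preserved
    cat >> "$dir/blacklist.conf" <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
}
```

For example, write_nouveau_blacklist /tmp/modprobe-test writes the file under a scratch directory, so the result can be inspected before touching /etc/modprobe.d.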
Back to your message: as you say, the hostOS includes the nouveau drivers, and they may be loaded automatically if the kernel needs them, for example to display a splash screen. You can try to explicitly run rmmod in your application container before using modprobe to load the nvidia drivers.
Regarding your last comment about the unknown symbols: this usually means that the module you are trying to load depends on other modules that need to be loaded first. To work around this, I suggest you use modprobe to load modules instead of insmod, as modprobe resolves and loads those dependencies for you.
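A rough sketch of what that could look like is below. The function name load_with_deps is hypothetical. It assumes the nvidia modules are installed where modprobe can find them (under /lib/modules for the running kernel) and that the dependency index has been generated with depmod; if the modules only exist as loose .ko files, this will not work as written.

```shell
#!/bin/bash
# Hypothetical helper: show what modprobe would load for a module, then load it.
load_with_deps() {
    local module="$1"
    # Print the insmod commands modprobe would run, in dependency order,
    # without loading anything. Fails if the module cannot be resolved.
    modprobe --show-depends "$module" || return 1
    # Load the module together with any dependencies that are not loaded yet
    modprobe "$module"
}
```

Calling load_with_deps nvidia-drm first lists the modules that would be inserted, which also shows which dependency insmod was missing, and then performs the load.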
Thank you so much for getting back to me. I tried running rmmod, and I also tried to blacklist the drivers from my service, but neither disabled the nouveau drivers, which is why I went into the hostOS and made the change there. I know that it is not advisable and is a very janky way of doing it. I will keep testing and see if I can make the changes from my services. Thank you for taking the time to answer my questions.
I’m having a similar issue to the one described in this thread and I’m hoping someone can help. I have a Generic x86_64 application with devices that have an nvidia GPU in them. I ultimately want to be able to run nvidia drivers on the device, but am getting stuck before I can even try to load the nvidia drivers. The device seems to automatically load the nouveau drivers and does not allow me to reliably unload them.
For my setup, I’m running balenaOS 2.58.6+rev1. I’ve started with a super simple dockerfile and docker compose to just try and remove the nouveau drivers. I have the io.balena.features.kernel-modules label set to ‘1’ as suggested by @alexgg. The only thing going on in the Dockerfile.template is to blacklist the nouveau drivers and then run a spin.sh script that echoes some text, followed by sleep. The blacklist lines were obtained from https://linuxconfig.org/how-to-disable-nouveau-nvidia-driver-on-ubuntu-18-04-bionic-beaver-linux.
When the machine starts up, I enter the main service and try to remove the nouveau drivers (eventually I’d want to move this to the dockerfile.template instead of running it manually). However, I get an error that says nouveau is in use.
root@a09bf0e:/# rmmod nouveau
rmmod: ERROR: Module nouveau is in use
I’ve tried running with the -f force option - no luck. The only way I’ve been able to successfully rmmod nouveau is if there was no monitor plugged into the balena device on startup. That leads me to believe there is some process that automatically starts up in the host OS or base image that is forcing the nouveau drivers to load and be used (perhaps it’s the balena splash screen?).
So my questions are the following:
Why are the nouveau drivers still being loaded despite being blacklisted?
Is there a process in the host OS or base balena image that I can try to kill, so the nouveau driver is not in use when I startup with a monitor plugged in?
I confirmed that the suggested dbus command to stop plymouthd turned off the splash display. Unfortunately, with the display off and just a plain terminal showing on the screen, I’m still getting an error that nouveau is in use.
root@a09bf0e:/# rmmod nouveau
rmmod: ERROR: Module nouveau is in use
Hi @jgordon, what does spin.sh run? Is it possible for that script to be using the nouveau driver?
It’d be great if you could share a minimal reproduction example here so we could see what the code is doing and reproduce this issue on our end in case we need to.
Another path that might be interesting to explore is running rmmod nouveau in the dockerfile after the blacklisting part.
I also pinged our devices team internally to see if they have other suggestions.
Hi, the Generic x86_64 device type includes the mainline nouveau driver for nvidia GPUs. When the hostOS boots and plymouth starts, this module is loaded in order to display the splash screen. This hostOS driver cannot be blacklisted from a container, because by the time the container starts the module is already loaded. It also cannot be removed from a container while it is in use.
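One way to check whether the module is still in use before attempting rmmod is to read its reference count from sysfs. The sketch below assumes the kernel exposes /sys/module/nouveau/refcnt, which it does when module unloading is enabled. The function name module_refcount and the optional sysfs root argument are hypothetical; the second argument is there only so that the function can be tried against a fake directory tree.

```shell
#!/bin/bash
# Hypothetical helper: print the reference count of a module, or "unavailable"
# if the module is not loaded or the kernel does not expose a refcnt file.
module_refcount() {
    local module="$1" sysfs="${2:-/sys}"
    local refcnt_file="$sysfs/module/$module/refcnt"
    if [ -r "$refcnt_file" ]; then
        cat "$refcnt_file"
    else
        echo "unavailable"
    fi
}
```

A non-zero result from module_refcount nouveau means something still holds the driver, and rmmod will keep reporting that the module is in use.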
To understand what can be done I suggest you focus on the hostOS for a bit. Let’s try the following:
Move the device to an empty application so no container app is running
Reboot the device
SSH into the hostOS
Run lsmod to see the nouveau driver loaded
Stop plymouthd with systemctl start plymouth-quit
Try to rmmod nouveau
If that works, then the same procedure can be run from your app’s startup code. If it does not work, we will need to reproduce it in-house to see what else in the hostOS may be using the graphics driver.
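The hostOS part of the steps above (everything after SSHing in) can be sketched as a small function. The name try_unload_nouveau is hypothetical, and the function is assumed to run as root in a hostOS shell.

```shell
#!/bin/bash
# Hypothetical helper: stop the splash screen and try to unload nouveau.
try_unload_nouveau() {
    # Nothing to do if the module is not listed by lsmod
    if ! lsmod | grep -q '^nouveau '; then
        echo "nouveau is not loaded"
        return 0
    fi
    # Starting plymouth-quit asks plymouthd to exit
    systemctl start plymouth-quit
    # Fails with "Module nouveau is in use" if something else still holds it
    rmmod nouveau
}
```

If rmmod still fails after this, something other than plymouth is holding the driver.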
@gelbal, the spin.sh script was just echoing some text and sleeping: while true; do echo "spinning"; sleep 5; done.
@alexgg, that explains why blacklisting nouveau will not work.
I made some progress that I’d like to share in case someone else encounters the issue with failing to remove the nouveau drivers. The short description is that in addition to stopping plymouth for the splash display, I had to also disable the virtual console. I believe that is the only other item that was utilizing the nouveau drivers and preventing their removal. I included the relevant docker-compose, dockerfile and spin script for reference below.
FROM balenalib/genericx86-64-ext-ubuntu:bionic
RUN install_packages dbus
ENV UDEV=1
COPY spin.sh /
CMD ["/spin.sh"]
spin.sh
#!/bin/bash
## Disable all display elements that could prevent removal of nouveau module
# Stop plymouth service used for splash screen display
DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket dbus-send \
  --system --dest=org.freedesktop.systemd1 --type=method_call --print-reply \
  /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager.StartUnit \
  string:"plymouth-quit.service" string:"replace"
# Disable virtual console
echo 0 > /sys/class/vtconsole/vtcon0/bind
# Wait some time for plymouth and virtual console to stop
echo "Waiting to stop display services.."
sleep 2
# Remove the nouveau modules
rmmod nouveau
# TODO: Insert nvidia drivers and run application...
while true; do
  echo "spinning..."
  sleep 5
done
The above example solves my nouveau removal problem.
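On my device the console to unbind was vtcon0, but on other hardware the framebuffer console may be a different vtcon entry. As an untested variation, the sketch below unbinds every virtual console whose name file mentions "frame buffer". The function name unbind_framebuffer_consoles and the optional sysfs root argument are hypothetical, and the assumption that the name file contains that string is based on how the kernel usually labels the framebuffer console.

```shell
#!/bin/bash
# Hypothetical helper: unbind all virtual consoles backed by a framebuffer.
# The sysfs root defaults to /sys; pass another path to try it on a fake tree.
unbind_framebuffer_consoles() {
    local sysfs="${1:-/sys}" vt
    for vt in "$sysfs"/class/vtconsole/vtcon*; do
        # Skip the unexpanded glob and entries without a name file
        [ -e "$vt/name" ] || continue
        if grep -q "frame buffer" "$vt/name"; then
            echo 0 > "$vt/bind"
        fi
    done
}
```

In spin.sh this would take the place of the single echo 0 > /sys/class/vtconsole/vtcon0/bind line.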