Running CUDA service(s)

I’m trying to run a CUDA-based service on balenaOS (on a Generic x86_64 device), but I’m running into some issues. Based on this blog post, I see that there is a planned feature for “hostapp extensions”, which would allow layering additional software, like NVIDIA drivers, on top of the balenaOS host OS. Based on this comment from @alexgg, it seems that the balena team is still actively working on this and that no release date is planned yet.

  1. From my understanding, I would need hostapp extensions at least for the NVIDIA driver and the NVIDIA Container Toolkit. That should allow me to pass my GPU devices through to my services. I am unsure whether I would need a CUDA hostapp extension as well, or whether it is sufficient to install CUDA within the service itself. Is my understanding of hostapp extensions correct here? (See the first sketch after this list for the workflow I have in mind.)

  2. Is this already possible? Even if it requires a beta or experimental version of balenaOS, I would love to give it a shot.

  3. If this is not possible (yet), is there a known workaround to get CUDA services to work (on Generic x86_64 devices) on balenaOS?

  4. With qemu I can pass the GPU through to a virtual machine using IOMMU, which allows raw access but prevents the host OS from accessing the GPU. It would mean being limited to a single service, since that service would “own” the GPU, but that would work for my case. I have found it difficult to find resources online about something like this. Is such a thing possible with Docker (and therefore balenaOS)? (See the second sketch after this list for what I imagine the Docker equivalent could look like.)
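
For reference, the first sketch below is the kind of workflow I am hoping to reproduce on balenaOS. On a stock Linux host with the NVIDIA driver and the NVIDIA Container Toolkit installed, GPU access is requested per container like this (`nvidia/cuda` is NVIDIA’s official CUDA base image on Docker Hub):

```bash
# Standard NVIDIA Container Toolkit usage on a regular Linux host: the
# toolkit injects the driver libraries and device nodes into the container,
# so the image only needs the CUDA user-space components.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```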
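
As for question 4, my current understanding is that containers share the host kernel, so there is no IOMMU-style “ownership” as with qemu; the closest equivalent I can think of is exposing the driver’s device nodes to a single service. A minimal sketch, assuming the nvidia kernel module is already loaded on the host and `my-cuda-service` is a placeholder image bundling the matching CUDA user-space libraries:

```bash
# Sketch: instead of PCI passthrough, hand the driver's device nodes to one
# container. Assumes /dev/nvidia* already exist on the host, i.e. the nvidia
# kernel module is loaded.
docker run --rm \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  my-cuda-service
```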

Hello @hgaiser, your summary and analysis are correct: “Hostapp Extensions” are a feature we are still working on, and they are not ready yet. And indeed, they will allow you to extend the OS with additional components, in this case the NVIDIA runtime components.

In the meantime, however, there is no turnkey way that I am aware of to accomplish the passthrough.

I have pinged a colleague to have a look; perhaps he has some ideas, but I can’t make any guarantees. :slight_smile:

Hey @dtischler. I’ve been in email contact with a colleague of yours, @joehounsham. He shared a method with me for compiling the NVIDIA kernel module against the Linux kernel running in the host OS. Hopefully that will allow me to load the driver in the container while I’m waiting for the hostapp extensions to be completed.
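
To give a rough idea without pasting his instructions verbatim, the general technique looks something like the sketch below. This is my own generic outline, not Joe’s exact method; the driver version and the kernel source path are placeholders and must match the host OS:

```bash
# Generic sketch: build the NVIDIA kernel module inside a container against
# the kernel version that the balenaOS host is running.
NVIDIA_VERSION=470.86        # placeholder: pick a driver that supports your GPU
KERNEL_VERSION=$(uname -r)   # containers share the host kernel, so this matches the host OS

# Fetch and unpack the official NVIDIA driver installer.
curl -fsSLO "https://us.download.nvidia.com/XFree86/Linux-x86_64/${NVIDIA_VERSION}/NVIDIA-Linux-x86_64-${NVIDIA_VERSION}.run"
chmod +x "NVIDIA-Linux-x86_64-${NVIDIA_VERSION}.run"
"./NVIDIA-Linux-x86_64-${NVIDIA_VERSION}.run" --extract-only

# Build only the kernel modules, pointing SYSSRC at kernel sources/headers
# matching the exact balenaOS kernel version (placeholder path).
cd "NVIDIA-Linux-x86_64-${NVIDIA_VERSION}/kernel"
make SYSSRC="/usr/src/kernel-source-${KERNEL_VERSION}" modules

# Load the freshly built module from a privileged container.
insmod nvidia.ko
```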

I’m not sure if I’m allowed to share the method @joehounsham provided, but I would be happy to. :)

Regardless, thank you for your response.

@hgaiser Oh wow, great, glad to hear you have a working solution then! I’ll follow up with Joe and see if we can publish a sample GitHub repo for this. Thanks!