Enable GPU on Container

Hi, so, the above will hopefully provide a working solution for amd64 platforms. For non-amd64 platforms like the Nvidia Jetson, we are working on adding CUDA support to balenaOS. I have opened an issue so that we can track progress on this, and we will update this ticket when support is available: https://github.com/balena-os/balena-jetson/issues/109

We are able to run CUDA-accelerated applications and even deepstream 5 based applications in a container on balenaOS 2.56.0+rev1 on Jetson Nano.
It’s a matter of including all required packages and libs in the container, so there are no dependencies on the host. This does result in a container roughly 6 GB large…

Sweet, @robin

Could you share the details of the container configuration?

Hi, @alexgg

I am not sure how to get the GPU working on amd64 with docker-compose.yml. Do I have to launch a service (container) manually on the host OS with the --gpus all parameter?

Hi there,

You are correct: until full support is merged, you’ll need to create the container manually with the --gpus flag, as my colleague mentioned. You can do that from the host OS.

@xginn8

I have just tried the --gpus all flag on my amd64 machine, but it failed:

root@balena:~# cat /etc/os-release 
ID="balena-os"
NAME="balenaOS"
VERSION="2.54.2+rev1"
VERSION_ID="2.54.2+rev1"
PRETTY_NAME="balenaOS 2.54.2+rev1"
MACHINE="surface-go"
VARIANT="Development"
VARIANT_ID="dev"
META_BALENA_VERSION="2.54.2"
RESIN_BOARD_REV="d6bfe9c"
META_RESIN_REV="abdd15e"
SLUG="surface-go"
root@balena:~# balena run --gpus all nvidia/cuda:10.2-cudnn7-devel nvidia-smi
balena: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[2020-10-06T02:09:24.698529715Z] error waiting for container: context canceled 

Do I have to flash another version of balenaOS?

Hi, could you check that the nvidia container runtime is installed?

which nvidia-container-runtime-hook

There are some other troubleshooting pointers at https://collabnix.com/introducing-new-docker-cli-api-support-for-nvidia-gpus-under-docker-engine-19-03-0-beta-release/
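A slightly more verbose variant of the check above, as a rough sketch: `which` prints nothing when the binary is absent, so this wrapper reports the result explicitly. The function name `check_nvidia_hook` is made up for this example; only the `nvidia-container-runtime-hook` name comes from this thread.

```shell
#!/bin/sh
# Hypothetical helper: report whether a binary is on the PATH.
# Defaults to the hook name mentioned in this thread.
check_nvidia_hook() {
    bin="${1:-nvidia-container-runtime-hook}"
    if hook_path="$(command -v "$bin" 2>/dev/null)"; then
        echo "found: $hook_path"
    else
        echo "missing: $bin"
        return 1
    fi
}
```

Calling `check_nvidia_hook` with no arguments on the host OS should print either the path to the hook or a `missing:` line.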

That is not installed, of course.

I have just downloaded the latest Surface Go image (development) and flashed it.

root@balena:~# which nvidia-container-runtime-hook
root@balena:~# cat /etc/os-release 
ID="balena-os"
NAME="balenaOS"
VERSION="2.54.2+rev1"
VERSION_ID="2.54.2+rev1"
PRETTY_NAME="balenaOS 2.54.2+rev1"
MACHINE="surface-go"
VARIANT="Development"
VARIANT_ID="dev"
META_BALENA_VERSION="2.54.2"
RESIN_BOARD_REV="d6bfe9c"
META_RESIN_REV="abdd15e"
SLUG="surface-go"

Do I have to do something to install this on balenaOS?

@alexgg

This is the output from my Jetson TX2 board:

root@balena:~# whereis nvidia-container-runtime-hook
nvidia-container-runtime-hook: /usr/bin/nvidia-container-runtime-hook
root@balena:~# cat /etc/os-release 
ID="balena-os"
NAME="balenaOS"
VERSION="2.56.0+rev4"
VERSION_ID="2.56.0+rev4"
PRETTY_NAME="balenaOS 2.56.0+rev4"
MACHINE="jetson-tx2"
VARIANT="Development"
VARIANT_ID="dev"
META_BALENA_VERSION="2.56.0"
RESIN_BOARD_REV="b9defdf"
META_RESIN_REV="6c6df5f"
SLUG="jetson-tx2"

So nvidia-container-runtime-hook is installed on it.
But I couldn’t find it on my amd64 PC with a GPU. Its META_BALENA_VERSION is 2.54.2.
I assume that was because of the lower version?

Could you guys update the balenaOS image of Intel NUC to the latest?

Cheers

Hey Shane,

For your information, at balena we have been having our annual summit (get-together), which is why support has been a bit slower than usual. We are, however, trying to get to the bottom of this, but we’re struggling to find someone on the team with an amd64 machine with an Nvidia GPU we can test on.

What you need to do is get a container with all of the Nvidia libraries and dependencies installed (in that container, not the host OS) and then run it with the --gpus all option.

Phil

@phil-d-wilson

Thanks for your reply, mate.

But as you can see in the messages above, it was working last year with nvidia-driver-435.

Can you just build the latest Intel-NUC image and update your balenaOS webpage? It’s 2.54.0, which is fairly old and doesn’t support nvidia-container-runtime-hook?

Cheers,
Shane.

Shane

Can you just build the latest Intel-NUC image and update your balenaOS webpage?

It’s not quite that simple, I’m afraid.

Can you confirm that you’ve tried installing nvidia-container-runtime in the app container, and not the host OS?

Hey, @richbayliss

Yes, I am pretty sure. I have been pulling my hair out with this issue for over a week…

Shane, I realize this has been a long running thread, and we are certainly trying our best to assist, but without having your specific hardware, or seeing your precise troubleshooting process, it’s rather difficult.

However, I have to wonder if a package got accidentally moved, renamed, or dropped upstream, and was fixed once that was noticed. The reason I say this is that I just ran a build using the Dockerfile from your very first post, and it worked just fine:

[Info]     Uploading images
[Success]  Successfully uploaded images
[Info]     Built on x64_01
[Success]  Release successfully created!
[Info]     Release: d830fd64a4a3b82c7c0782535fa25508 (id: 1561507)
[Info]     ┌─────────┬────────────┬───────────────────────┐
[Info]     │ Service │ Image Size │ Build Time            │
[Info]     ├─────────┼────────────┼───────────────────────┤
[Info]     │ main    │ 1.95 GB    │ 2 minutes, 51 seconds │
[Info]     └─────────┴────────────┴───────────────────────┘
[Info]     Build finished in 4 minutes, 22 seconds
                            \
			     \
			      \\
			       \\
			        >\/7
			    _.-(6'  \
			   (=___._/` \
			        )  \ |
			       /   / |
			      /    > /
			     j    < _\
			 _.-' :      ``.
			 \ r=._\        `.
			<`\\_  \         .`-.
			 \ r-7  `-. ._  ' .  `\
			  \`,      `-.`7  7)   )
			   \/         \|  \'  / `-._
			              ||    .'
			               \\  (
			                >\  >
			            ,.-' >.'
			           <.'_.''
			             <'

So, after this long back and forth, I’d actually like to have you simply try your original steps again:

FROM balenalib/%%BALENA_MACHINE_NAME%%-ubuntu-python:3.6-bionic-build

ENV RESINOS_VERSION=2.50.1%2Brev1.prod
ENV DEBIAN_FRONTEND=noninteractive
ENV YOCTO_VERSION=5.2.10

RUN wget https://files.resin.io/images/intel-nuc/${RESINOS_VERSION}/kernel_modules_headers.tar.gz
RUN tar -xf kernel_modules_headers.tar.gz && rm -rf kernel_modules_headers.tar.gz
RUN mkdir -p /lib/modules/${YOCTO_VERSION}-yocto-standard
RUN mv ./kernel_modules_headers /lib/modules/${YOCTO_VERSION}-yocto-standard/build
RUN ln -s /lib64/ld-linux-x86-64.so.2 /lib/ld-linux-x86-64.so.2
RUN apt-get update && apt-get install -y apt-transport-https
RUN apt-get install -y nvidia-driver-435 libboost-all-dev

ENV UDEV=1
ENV INITSYSTEM on

CMD [ "sleep", "infinity"]

Let us know what happens, thanks.

@dtischler
Thanks for your help, mate.

Yes, this is a very long running thread. Did you try to execute the nvidia-smi command on the flashed service?

There is no error in the build stage… It just doesn’t detect my GTX970 board via the nvidia-smi command.

Cheers,
Shane.

Ah, well, I apologize. As we have both said, this is a long thread. In my review, I noticed in this post that the build was failing, so when mine succeeded I thought perhaps the issue had resolved itself, ha. Unfortunately, like the rest of us here, I don’t have an Nvidia GPU, so I can’t test the output or troubleshoot any further.

At this point, I’m not sure there’s much more we can do here. Perhaps someone in the community can lend a hand as many folks read / browse threads on a regular basis.

@phil-d-wilson

Can I provide SSH access to my amd64 PC with the GTX970?

Please email me and I will discuss it with the team.

Cheers,
Shane.

Hi Shane, a couple of further questions here. I assume that you have tested this natively on the same OS that you are then using in the container (let’s say Ubuntu, for example), and also that you have checked you are using the same kernel driver (that is, you load the same kernel driver in the container that you already verified works on your native OS), right?

Yes, that is correct, @alexgg
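Related to the kernel question above, one thing that may be worth ruling out: the Dockerfile earlier in this thread pins the kernel headers through RESINOS_VERSION and YOCTO_VERSION, so a driver module built against them would only be expected to load if the running kernel matches. Below is a minimal sketch of that comparison, not a confirmed diagnosis. The function name `kernel_matches` is hypothetical, and it assumes the kernel release string has the form `<version>-yocto-standard`, as implied by the module path in that Dockerfile.

```shell
#!/bin/sh
# Hypothetical check: does the running kernel match the headers the image was built against?
# YOCTO_VERSION mirrors the ENV value in the Dockerfile posted in this thread.
YOCTO_VERSION="${YOCTO_VERSION:-5.2.10}"

kernel_matches() {
    # $1: kernel release to test (defaults to the running kernel)
    running="${1:-$(uname -r)}"
    expected="${YOCTO_VERSION}-yocto-standard"
    if [ "$running" = "$expected" ]; then
        echo "match: $running"
    else
        echo "mismatch: running $running, headers built for $expected"
        return 1
    fi
}
```

Running `kernel_matches` with no arguments inside the container or on the host compares the output of `uname -r` against the pinned version.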

hey @scarlyon,

could you maybe enable support access on your device via balenaCloud and share the device link (or uuid) with me here? you can also send me a pm with it if you’d like…

I would like to take a look directly, that might speed up the process of getting to an answer 🙂