Hi, so, the above will hopefully provide a working solution for amd64 platforms. For non-amd64 platforms such as the Nvidia Jetson, we are working on adding CUDA support to balenaOS. I have opened an issue so that we can track progress on this and update this ticket when support is available: https://github.com/balena-os/balena-jetson/issues/109
We are able to run CUDA-accelerated applications, and even DeepStream 5-based applications, in a container on balenaOS 2.56.0+rev1 on a Jetson Nano.
It's a matter of including all required packages and libs in the container, so there are no dependencies on the host. This does result in a container roughly 6 GB in size...
Sweet, @robin
Could you share the details of the container configuration?
Hi, @alexgg
I am not sure how to get the GPU working on amd64 with docker-compose.yml. Do I have to launch a service (container) manually on the host OS with the --gpus all parameter?
Hi there,
You are correct: until full support is merged, you'll need to create the container manually with that --gpus flag, as my colleague mentioned. You can do that from the host OS.
I have just tried the --gpus all flag on my AMD64 machine, but it failed:
root@balena:~# cat /etc/os-release
ID="balena-os"
NAME="balenaOS"
VERSION="2.54.2+rev1"
VERSION_ID="2.54.2+rev1"
PRETTY_NAME="balenaOS 2.54.2+rev1"
MACHINE="surface-go"
VARIANT="Development"
VARIANT_ID="dev"
META_BALENA_VERSION="2.54.2"
RESIN_BOARD_REV="d6bfe9c"
META_RESIN_REV="abdd15e"
SLUG="surface-go"
root@balena:~# balena run --gpus all nvidia/cuda:10.2-cudnn7-devel nvidia-smi
balena: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[2020-10-06T02:09:24.698529715Z] error waiting for container: context canceled
Do I have to flash another version of balenaOS?
Hi, could you check that the nvidia container runtime is installed?
which nvidia-container-runtime-hook
There are some other troubleshooting pointers at https://collabnix.com/introducing-new-docker-cli-api-support-for-nvidia-gpus-under-docker-engine-19-03-0-beta-release/
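In case it helps, here is a minimal sketch of that check as a shell snippet. It assumes the hook binary is named nvidia-container-runtime-hook, as above; `has_command` and `gpu_hook_status` are hypothetical helper names for illustration, not part of balenaOS or the balena CLI.

```shell
# Sketch: report whether the NVIDIA runtime hook is on PATH before trying --gpus.

# Returns 0 if the named executable is found on PATH, non-zero otherwise.
has_command() {
    command -v "$1" >/dev/null 2>&1
}

# Prints "found: <name>" or "missing: <name>" and returns 1 when missing.
# $1 = hook name (assumption: defaults to the name used in this thread).
gpu_hook_status() {
    hook="${1:-nvidia-container-runtime-hook}"
    if has_command "$hook"; then
        echo "found: $hook"
    else
        echo "missing: $hook"
        return 1
    fi
}
```

For example, `gpu_hook_status && balena run --gpus all nvidia/cuda:10.2-cudnn7-devel nvidia-smi` would only attempt the GPU container when the hook is present on the host.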
That is not installed, of course.
I have just downloaded the latest Surface Go image (development) and flashed it.
root@balena:~# which nvidia-container-runtime-hook
root@balena:~# cat /etc/os-release
ID="balena-os"
NAME="balenaOS"
VERSION="2.54.2+rev1"
VERSION_ID="2.54.2+rev1"
PRETTY_NAME="balenaOS 2.54.2+rev1"
MACHINE="surface-go"
VARIANT="Development"
VARIANT_ID="dev"
META_BALENA_VERSION="2.54.2"
RESIN_BOARD_REV="d6bfe9c"
META_RESIN_REV="abdd15e"
SLUG="surface-go"
Do I have to do something to install this on balenaOS?
This is the output from my Jetson TX2 board:
root@balena:~# whereis nvidia-container-runtime-hook
nvidia-container-runtime-hook: /usr/bin/nvidia-container-runtime-hook
root@balena:~# cat /etc/os-release
ID="balena-os"
NAME="balenaOS"
VERSION="2.56.0+rev4"
VERSION_ID="2.56.0+rev4"
PRETTY_NAME="balenaOS 2.56.0+rev4"
MACHINE="jetson-tx2"
VARIANT="Development"
VARIANT_ID="dev"
META_BALENA_VERSION="2.56.0"
RESIN_BOARD_REV="b9defdf"
META_RESIN_REV="6c6df5f"
SLUG="jetson-tx2"
So nvidia-container-runtime-hook is installed on it.
But I couldn't find it on my amd64 PC with a GPU. Its META_BALENA_VERSION is 2.54.2.
I assume that is because of the lower version?
Could you guys update the balenaOS image of Intel NUC to the latest?
Cheers
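For comparing devices like this, a small helper that pulls a single field out of /etc/os-release may save some scrolling. This is only a sketch: `os_release_field` is a hypothetical name, it assumes `awk` is available on the device, and it assumes the value itself contains no `=` character (true for the fields shown above).

```shell
# Sketch: print the value of one KEY="value" field from an os-release style file.
# $1 = key name, $2 = file to read (defaults to /etc/os-release).
os_release_field() {
    key="$1"
    file="${2:-/etc/os-release}"
    # Split each line on "=", match the key exactly, strip the double quotes.
    awk -F= -v k="$key" '$1 == k { gsub(/"/, "", $2); print $2 }' "$file"
}
```

Running `os_release_field META_BALENA_VERSION` on each device gives just the version string to compare.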
Hey Shane,
For your information, at balena we have been having our annual summit (get-together), which is why support has been a bit slower than usual. We are, however, trying to get to the bottom of this, but we're struggling to find someone on the team with an AMD64 machine with an Nvidia GPU we can test on.
What you need to do is get a container with all of the Nvidia libraries and dependencies installed (in that container, not the host OS) and then run it with the --gpus all option.
Phil
Thanks for your reply, mate.
But as you can see in the messages above, it was working last year with nvidia-driver-435.
Can you just build the latest Intel-NUC image and update your balenaOS webpage? It's 2.54.0, which is fairly old and doesn't support nvidia-container-runtime-hook?
Cheers,
Shane.
Shane
Can you just build the latest Intel-NUC image and update your balenaOS webpage?
It's not quite that simple, I'm afraid.
Can you confirm that you've tried installing nvidia-container-runtime in the app container, and not the host OS?
Hey, @richbayliss
Yes, I am pretty sure. I have been pulling my hair out with this issue for over a week...
Shane, I realize this has been a long-running thread, and we are certainly trying our best to assist, but without having your specific hardware, or seeing your precise troubleshooting process, it's rather difficult.
However, I have to wonder if a package got accidentally moved, renamed, or dropped upstream... and after it was realized, it was fixed. The reason I say this is that I just ran a build using the Dockerfile you posted in your very first post, and it worked just fine:
[Info] Uploading images
[Success] Successfully uploaded images
[Info] Built on x64_01
[Success] Release successfully created!
[Info] Release: d830fd64a4a3b82c7c0782535fa25508 (id: 1561507)
[Info] ┌─────────┬────────────┬───────────────────────┐
[Info] │ Service │ Image Size │ Build Time            │
[Info] ├─────────┼────────────┼───────────────────────┤
[Info] │ main    │ 1.95 GB    │ 2 minutes, 51 seconds │
[Info] └─────────┴────────────┴───────────────────────┘
[Info] Build finished in 4 minutes, 22 seconds
So, after this long back and forth, I'd actually like to have you simply try your original steps again:
FROM balenalib/%%BALENA_MACHINE_NAME%%-ubuntu-python:3.6-bionic-build
ENV RESINOS_VERSION=2.50.1%2Brev1.prod
ENV DEBIAN_FRONTEND=noninteractive
ENV YOCTO_VERSION=5.2.10
RUN wget https://files.resin.io/images/intel-nuc/${RESINOS_VERSION}/kernel_modules_headers.tar.gz
RUN tar -xf kernel_modules_headers.tar.gz && rm -rf kernel_modules_headers.tar.gz
RUN mkdir -p /lib/modules/${YOCTO_VERSION}-yocto-standard
RUN mv ./kernel_modules_headers /lib/modules/${YOCTO_VERSION}-yocto-standard/build
RUN ln -s /lib64/ld-linux-x86-64.so.2 /lib/ld-linux-x86-64.so.2
RUN apt-get update && apt-get install -y apt-transport-https
RUN apt-get install -y nvidia-driver-435 libboost-all-dev
ENV UDEV=1
ENV INITSYSTEM on
CMD [ "sleep", "infinity"]
Let us know what happens, thanks.
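One thing that might be worth checking inside the running container: the Dockerfile above places the kernel headers under /lib/modules/${YOCTO_VERSION}-yocto-standard, while module tooling generally looks under /lib/modules/$(uname -r). A minimal sketch of that check follows; `modules_dir_present` is a hypothetical helper name, and whether a mismatch is actually the cause here is only an assumption.

```shell
# Sketch: check that a modules directory exists for a given kernel release.
# $1 = kernel release (defaults to the running kernel, from `uname -r`).
# $2 = modules root (defaults to /lib/modules).
modules_dir_present() {
    release="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    # Succeeds only if <root>/<release> is an existing directory.
    [ -d "$root/$release" ]
}
```

Inside the container, `modules_dir_present || echo "no /lib/modules/$(uname -r)"` would flag the case where YOCTO_VERSION does not match the kernel the host is actually running.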
@dtischler
Thanks for your help, mate.
Yes, this is a very long-running thread. Did you try to execute the nvidia-smi command on the flashed service?
There is no error in the build stage... It just doesn't detect my GTX970 board via the nvidia-smi command.
Cheers,
Shane.
Ah, well, I apologize. As we have both stated, yes, this is a long thread. In my review, I noticed in this post that the build was failing, so when mine succeeded, I thought perhaps the issue had resolved itself, ha. Unfortunately, just like the rest of us here, I don't have an Nvidia GPU either, so I can't test the output or troubleshoot any further.
At this point, I'm not sure there's much more we can do here. Perhaps someone in the community can lend a hand, as many folks read and browse threads on a regular basis.
Can I provide SSH access to my amd64 PC with GTX970?
Please email me and I will discuss with team.
Cheers,
Shane.
Hi Shane, a couple of further questions here. I assume that you have tested this natively on the same OS that you are then using in the container (let's say Ubuntu, for example), and also that you have checked you are using the same kernel driver (that is, you load the same kernel driver in the container that you already verified works on your native OS), right?
Yes, that is correct, @alexgg
hey @scarlyon,
could you maybe enable support access on your device via balenaCloud and share the device link (or uuid) with me here? you can also send me a pm with it if you'd like...
I would like to take a look directly, that might speed up the process of getting to an answer