Enable GPU on Container

probably 2.45.0

you are using the same base-image and all as back then on the newer os?

Yeah, intel-nuc

I mean the distro part :slight_smile:

Nothing changed, mate.

So, the nvidia-smi command was working after RUN apt-get install -y nvidia-driver-435 with the latest balenaOS version in November 2019, but is no longer working with balenaOS v2.50.1+rev1. And the hardware last year was an Intel NUC, right? For the Intel NUC, these balenaOS versions would be candidates:

  • v2.45.0+rev4 - released on 19 Nov 2019
  • v2.45.0+rev3 - released on 13 Nov 2019
  • v2.45.0+rev2 - released on 6 Nov 2019
  • v2.45.0+rev1 - released on 1 Nov 2019
  • v2.44.0+rev1 - released 3 Oct 2019

To be sure that nothing else has changed, you could try flashing one of these releases and confirm that the driver works. Also, could the Intel NUC hardware have changed? The NUC has had 10 generations, each with changes to the GPU chipset: https://en.wikipedia.org/wiki/Next_Unit_of_Computing

The hardware device is not changed.

At risk of suggesting something you have already tried, I’ve noticed that the following command does not include the --privileged flag:

root@balena:~# balena run --gpus all nvidia/cuda:10.2-cudnn7-devel nvidia-smi
balena: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[2020-10-06T02:09:24.698529715Z] error waiting for container: context canceled 

Does it make a difference if you add --privileged?

@pdcastro

Hmm, I didn’t use --privileged at that time. Will do in the next testing.

By the way, where can I download the old version? 2.45.0+rev1, for instance?

Cheers,
Shane.

Any idea how to download the images of such versions?

Nevermind, my balena-cli was logged in to my openbalena server! :joy:

Let us know what happens after you add the --privileged flag.

I was using v2.44.0+rev1 last year, but I cannot download it now?

anydeskpc@anydeskpc:~$ balena os versions intel-nuc
v2.50.1+rev1.prod (recommended)
v2.50.1+rev1.dev
v2.48.0+rev3.prod
v2.48.0+rev3.dev
v2.47.1+rev1.prod
v2.47.1+rev1.dev
v2.47.0+rev1.prod
v2.47.0+rev1.dev
v2.46.0+rev1.prod
v2.46.0+rev1.dev
v2.45.1+rev2.prod
v2.45.1+rev2.dev
v2.41.1+rev1.prod
v2.41.1+rev1.dev
v2.38.3+rev5.prod
v2.38.3+rev5.dev
... ... ...
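For anyone following along, here is a minimal sketch of picking a release from a listing like the one above and downloading it. The helper name pick_version and the output file name are hypothetical. The `balena os download` flags are taken from the balena CLI documentation as I remember it, and whether the leading "v" is accepted may depend on the CLI version, so treat that part as an assumption.

```shell
# pick_version PREFIX (hypothetical helper)
# Reads `balena os versions <device-type>` output on stdin and prints the
# first production (.prod) release whose version starts with PREFIX.
pick_version() {
  prefix="v$1"
  while read -r version _rest; do
    case "$version" in
      "$prefix".*.prod) echo "$version"; return 0 ;;
    esac
  done
  return 1   # no matching release in the listing
}

# Assumed usage on a machine where balena-cli is logged in to balenaCloud:
#   version=$(balena os versions intel-nuc | pick_version 2.45)
#   balena os download intel-nuc --version "${version#v}" -o balena-intel-nuc.img
```

With the listing above, the prefix 2.45 would select v2.45.1+rev2.prod, while 2.44 finds nothing because that release is no longer listed.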

Hi

The reason you are seeing this is that we had to pull the 2.44 version of the OS for all device types as we found a race condition in the boot process that would sometimes not allow the supervisor to start up correctly.
See a similar forum issue here - Can't update from 2.41.1 to 2.44.1
2.45.1 should work just like 2.44 for you, without the bug.

Ok, v2.45.1+rev2 is not working…

root@balena:/# printenv | grep BALENA
BALENA_SUPERVISOR_VERSION=10.3.7
BALENA_DEVICE_TYPE=intel-nuc
BALENA=1
BALENA_SUPERVISOR_HOST=127.0.0.1
BALENA_SERVICE_NAME=od
BALENA_APP_NAME=localapp
BALENA_APP_ID=1
BALENA_DEVICE_UUID=e01f87419ce9c55149a945473533f5fe
BALENA_SERVICE_HANDOVER_COMPLETE_PATH=/tmp/balena/handover-complete
BALENA_SUPERVISOR_ADDRESS=http://127.0.0.1:48484
BALENA_HOST_OS_VERSION=balenaOS 2.45.1+rev2
BALENA_SUPERVISOR_PORT=48484
BALENA_DEVICE_NAME_AT_INIT=local
BALENA_SUPERVISOR_API_KEY=98a0e59477fa38423febee29e23b8fac8b6d0ef8bbfe48d37d7924b1be3fda
BALENA_APP_LOCK_PATH=/tmp/balena/updates.lock
root@balena:/# nvidia-smi i
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

What should I do? @pdcastro @robertgzr

Hi, just re-reading this thread. It seems the problem boils down to the nvidia driver not being present. Could you please let us know how you had installed the driver in the past?

What might be happening is that you installed it from a pre-built distribution like Debian whose driver happened to be compatible with the kernel balenaOS was running. That was just a coincidence, as the balenaOS kernel releases are not in sync with any distribution.
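One way to check for this kind of mismatch is to compare the running kernel release with the kernel the module was built for. This is only a sketch: the function name vermagic_matches is hypothetical, and it assumes `modinfo` is available in the container and that the module is named nvidia.

```shell
# vermagic_matches KERNEL_RELEASE VERMAGIC (hypothetical helper)
# Succeeds when the first word of a module's vermagic string equals the
# given kernel release, i.e. the module was built for that kernel.
vermagic_matches() {
  kernel_release="$1"
  module_vermagic="${2%% *}"   # keep only the first word of the vermagic string
  [ "$kernel_release" = "$module_vermagic" ]
}

# Assumed usage inside the application container:
#   if vermagic_matches "$(uname -r)" "$(modinfo -F vermagic nvidia)"; then
#     echo "nvidia module was built for the running kernel"
#   else
#     echo "nvidia module was built for a different kernel"
#   fi
```

If the two differ, the module will normally refuse to load, which would produce the nvidia-smi error shown earlier.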

There are several alternatives:

  1. If you have access to the kernel driver source, you can build it for a specific balenaOS version and install it from your application container. You can use the following example project to achieve this: https://github.com/balena-os/kernel-module-build

  2. We are working on a mechanism that will allow us to release drivers and system libraries with the OS in a more dynamic way, and once it’s ready we will release an OS extension with compatible drivers. I have created an issue to track this work: https://github.com/balena-os/balena-intel/issues/334

Hi, @alexgg

Sorry for the delay.

Yeah, I just installed balenaOS 2.44+rev2 (intel-nuc), which is not available at the moment,
and used the above Dockerfile.template to install the NVIDIA driver and CUDA/cuDNN on it.
nvidia-smi worked well and we were able to use the GPU without any issue.

Regarding the alternatives, I don’t think it’s worth building the custom kernel.
When would you be able to release the new OS extension with GPU support? Any rough estimate?

Cheers,
Shane.

Hi, @alexgg
Any update here?

I can see some signs of progress in the meta-balena project?

Cheers,
Shane.

Hi Shane, as you say the OS changes have been merged. We are now working on complementary changes on other parts of the product, like the supervisor and our build infrastructure. Once it’s ready we will do a proper write-up along with instructions and documentation.

Hi @alexgg,

Is there a timeline for this?

Regards,