Supervisor restarting continuously

We can connect to the device now, thanks! I cleaned up the hanging sshd services, and you should be able to connect to the device after a reboot now. The issue was already addressed in balenaOS in the following PR: https://github.com/balena-os/meta-balena/pull/1838, and upgrading should make sure it doesn’t happen again.

Thank you!

I am already on the most recent Balena version available for my TX2. When will that patch trickle down to me?

Hi there – I have created an issue at https://github.com/balena-os/meta-balena/issues/1935 to track your request for the SSH fix to be ported to the CTI Spacely TX2; feel free to subscribe to that issue, but we will also track it ourselves and contact you when a fix has been released.

If I understand correctly, you were also encountering problems accessing the CUDA cores from your application. Can you confirm if that is persisting? If so, are you able to share your docker-compose.yml file?

All the best,
Hugh

Thank you, Hugh.

Not sure if the problem is resolved. I tried to test, but can’t get the container to start up long enough to test it.

Here is the docker-compose.yaml I’m using

version: '2'
volumes:
    myVol:
services:
    my_app:
        hostname: my_app
        build:
            context: ./bin
            dockerfile: Dockerfile
        ports:
          - "12345:22"
        volumes:
            - myVol:/app/data/run:rw
        environment:
            NVIDIA_VISIBLE_DEVICES: 'all'
        devices:
          - "/dev/USB0:/dev/USB0"
        restart: "no"
        labels:
          io.balena.features.dbus: '1'
        command: bash
        privileged: true

Hi @cnr – thanks for that. When you say that your container won’t stay up long enough to test it, is that because it’s crashing with the CUDA Error: all CUDA-capable devices are busy or unavailable error you noted above? Assuming for the moment I’ve got that right, there are a couple of approaches you could take to getting a shell in this container.

One would be to change the CMD or ENTRYPOINT line in your Dockerfile to be something like sleep infinity; that would leave the container idle, and you would be able to connect to it from the dashboard or through the CLI to debug.
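As a concrete illustration of that first approach, the end of the Dockerfile could be changed along these lines (a minimal sketch; the original CMD shown here is a hypothetical placeholder, not taken from your actual Dockerfile):

```Dockerfile
# Temporarily replace the normal startup command so the container idles
# instead of crashing, leaving it available for an interactive shell.
# CMD ["./start-app.sh"]   <-- hypothetical original command, disabled for debugging
CMD ["sleep", "infinity"]
```

Once you are done debugging, restore the original CMD or ENTRYPOINT and redeploy.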

Another approach would be to leave the Dockerfile untouched, and in the host OS run a command like balena run -it my_app /bin/sh; that would also give you a shell to explore in.

Either way, I’d suggest checking out a couple things to rule out simpler problems:

  • Are there utilities included in your Docker container that can verify the presence of the GPU units you wish to use? Are they successful?
  • Are there particular devices that your application is trying to access? Are they present, and does your Dockerfile grant the application enough privileges to access those devices?
  • Are all the kernel modules loaded that are required for your hardware? Does dmesg or the like show any problems?
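The device-visibility check above could be sketched as a small script run inside the container; the device paths listed here are only illustrative examples of Tegra GPU nodes and may differ on your image:

```shell
#!/bin/sh
# Hedged diagnostic sketch: report whether the device nodes an application
# expects are actually visible inside the container. The paths below are
# illustrative examples, not taken from this thread.
for dev in /dev/nvhost-ctrl /dev/nvhost-ctrl-gpu /dev/nvmap; do
  if [ -e "$dev" ]; then
    echo "present: $dev"
  else
    echo "missing: $dev"
  fi
done
```

If a node your application needs shows up as missing, that points at the compose file's devices section or privileges rather than at CUDA itself.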

Additionally, can you post your Dockerfile? It would help to see how you’re setting up your application.

All the best,
Hugh


I will try those things you suggest and get back to you.

I’m a bit concerned about posting the dockerfile publicly, is there a way to share it privately with you and your team?

Hi there, you can DM me here in the forums and send me a message with the file attached. Thanks!

Ok sent!

Update, nothing I do seems to enable the container to successfully start up. Even when I clean up the abandoned ssh connection attempts that you linked to before.

(screenshot attached)

So, testing this other issue is blocked by the issue for the SSH fix, or I need to downgrade to a prior BalenaOS release that did work.

Hi there – thanks for sharing the Dockerfile with us, as it gives us a bit more context for the problems you’re having.

As we asked before, can you let us know why your application continues to crash? Is it crashing with the CUDA Error: all CUDA-capable devices are busy or unavailable error you noted above, or failing for some other reason?

Have you tried changing the CMD or ENTRYPOINT in your Dockerfile to be something like sleep infinity? That would allow you to spin up a container that would sit idle while you connect to it to debug things manually.

Are there utilities included in your Docker container that can verify the presence of the GPU units you wish to use?

Are you able to run your application within your container as root? If that works, you may need to investigate what permissions may be missing for your unprivileged user.

Please give these steps a try and let us know how it goes.

All the best,
Hugh

Hugh, as I said in my previous message, the container does not run because of the SSH issue.

So, testing this other issue is blocked by the issue for the SSH fix, or I need to downgrade to a prior BalenaOS release that did work.

I cannot test anything you have recommended until I can get my container to run. It is constantly in a Stopping state due to the issue that you linked me to. Updating the Dockerfile ENTRYPOINT to sleep infinity or bash has no effect.

In other words, none of the tools that I might try to use to “verify the presence of the GPU units” can be run inside the container if the container cannot run.

I remind you that this is the exact same Dockerfile as I’m successfully running on my other devices (ok, with the exception that this Dockerfile uses an ffmpeg3 build from ppa:cran/ffmpeg-3 instead of ppa:jonathanf/ffmpeg-3, because the latter is no longer available, but that shouldn’t cause me to be unable to access CUDA cores). The only thing that has changed is the version of BalenaOS I am running on.

It goes without saying that upgrading to a new OS version should not break my containers. This is at least part of the reason why we use containers in the first place: to improve the separation between application and OS.

Hi there, I’ve had another look at the device and I was able to SSH into it. The supervisor, however, was stuck in an update loop, so I restarted the balena-engine service; the supervisor was then able to synchronise the device state and appears to be running normally now, although I can’t see any logs from the application container…

How do I detect and stop an update loop on my own?

systemctl restart balena-engine.service?

Yes, that was the command I ran. I was also just reading through this ticket and it appears the problem can be reproduced by simply restarting the application container from the dashboard?

For a quick test, I restarted the app from the dashboard and it appears to have stopped/started correctly…

Can you please check and let us know if you can replicate the issue again after today’s service restart?

I can confirm that I’ve gotten it back into a failed state by:

  1. sshing into the application container
  2. Trying to run a CUDA related command (in this case ./dgTest.sh) which hangs
  3. Restarting the container

I am going to try to restart the balena-engine service to see if I can replicate the fix for the hang that you found.

EDIT: Yes, systemctl restart balena-engine.service has enabled the Stopping container to successfully become Exited; however, now the container won’t start again.

Hi!

Thanks for the information!

I’ve logged into the device and saw that the service wasn’t able to start because there was a container for the same service incorrectly stopped, so I removed it (got the container id with balena ps -a and removed it with balena rm <id>) and then I could start it again.

I was able to reproduce your problem following the steps you indicated, and the ./dgTest.sh script actually hung when Converting output. I checked the process list (running ps) and saw several darknet processes blocked on uninterruptible IO operations, which might be the reason why balena-engine could neither stop nor kill the container. May we know what that process is doing? Is it accessing the GPU? We’d need to know what the process is doing to debug further.

As I understand it, you didn’t face these problems with 2.45.1+rev1. Have you tried that version on this device? I see some voltage errors in the dmesg output which, according to the NVIDIA engineers in their forums, might be caused by faulty hardware; we just want to rule out that possibility. Also, it would help us if we could apply this temporary fix for the dangling ssh problem. Do you agree with us applying this fix?

Thank you again for your help so far.

Restoring a Functioning Container

This is working for me. It’s just very slow. The steps I took were:

balena stop [CONTAINER ID]
sudo systemctl restart balena-engine.service
balena ps -a # to find the Stopping service
balena rm [CONTAINER ID] # to successfully remove the stopping service

However, for several minutes after this I cannot restart the container. Then, mysteriously it will start working! Good enough.
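The recovery sequence above could be semi-automated with a small helper that pulls container IDs out of balena ps -a output by status. This is a hypothetical sketch, not a supported balena tool, and the match on the whole line is deliberately rough:

```shell
#!/bin/sh
# Hypothetical helper: read `balena ps -a` style output on stdin and print
# the IDs (first column) of containers whose line matches a status word.
# The match is on the whole line, so choose a distinctive word like "Stopping".
find_containers() {
  awk -v s="$1" 'NR > 1 && $0 ~ s { print $1 }'
}

# Example usage on the device (not run here):
#   balena ps -a | find_containers Stopping | xargs -r balena rm
```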

For some strange reason, my container image is very small. I ran the command

root@abc:~> balena images
REPOSITORY                  TAG                 IMAGE ID            CREATED             SIZE
<none>                      <none>              c2e453157643        22 hours ago        18B
balena-healthcheck-image    latest              a29f45ccde2a        6 months ago        9.14kB
balena/aarch64-supervisor   v10.6.27            634e52c7fa89        6 months ago        67MB

The image ID for the container I was using shows that it is only 18B in size.

I would like to know why the container image is so small (binary diff?) and why it takes so long to spin up.


Testing CUDA

The dgTest.sh script actually does more than is needed to put the system into a failed state. Merely running /app/darknetMinimal/darknet -h should be enough to put it in a failed state.

The darknet command runs darknet, which makes calls to CUDA.

More debugging:

Following the CUDA Installation Guide, 2.1. Verify You Have a CUDA-Capable GPU, I cannot detect an NVIDIA device with lspci:

sudo apt-get -y install pciutils
sudo lspci -v

returns nothing. Should I expect it to return something when run inside the container?

Using this Verify CUDA Installation guide, I find that checking the version fails

> cat /proc/driver/nvidia/version 
cat: /proc/driver/nvidia/version: No such file or directory

but that is also true for my functioning devices, so this is not the issue

> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Sun_Nov_19_03:16:56_CST_2017
Cuda compilation tools, release 9.0, V9.0.252

PASS (same as other devices)

Testing CUDA Samples

git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/
git checkout tags/v9.2
make TARGET_ARCH=aarch64

Waited until “Finished building CUDA samples” message. Then ran
./bin/aarch64/linux/release/deviceQuery

This just hangs.

Unfortunately, I cannot test this on my other devices because they are connected via cellular which means I can’t download the git repo to test these CUDA samples.


Voltage Problems

This answer on the NVIDIA Forums says, regarding

[337879.650949] vdd-1v8: voltage operation not allowed
[337879.655940] sdhci-tegra 3440000.sdhci: could not set regulator OCR (-1)

that

No, this error is not fatal. It comes from the sdhci device tree in which we fix the vdd-1v8 to 1v8 only which exactly could be 1.65 ~1.95v

and is

just a warning message.

Downgrading

I seem to remember that I had other problems with that version, but I will try to downgrade next week and will report my findings.

Temporary Fix

I tried the fix myself

I’m no longer having that hanging ssh problem, as evidenced by:

> systemctl list-units --state=activating
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.