Moving a device to another fleet produces failures in one of the containers

I’m quite puzzled by this issue. I have several devices working fine in one fleet (the container runs TensorFlow on the NVIDIA GPU). However, when I move any of these devices to another fleet (with the same amd64 architecture), the container keeps restarting, complaining about:

Traceback (most recent call last):
    import tensorflow as tf
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util

The release pushed to the devices is built from the exact same code (just pushed to the different fleets), yet the problem does not occur on the fleet where the devices were originally deployed.
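
For anyone hitting the same thing: a minimal import check along these lines (just a diagnostic sketch, not part of our release; the script name is made up) shows whether the import itself fails or whether TensorFlow loads but simply can’t see the GPU:

# sanity_check.py -- hypothetical diagnostic script, not part of the actual release.
# Shows whether "import tensorflow" itself fails, or whether TensorFlow
# imports cleanly but cannot see the NVIDIA GPU.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))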

Hello, note that for devices running balenaOS version 2.12.0 and above, data in persistent storage (named volumes) is automatically purged when a device is moved to a new fleet. Is it possible this is affecting your container? If not, is there any further text in the error message beyond what you’ve posted here, so we can try to troubleshoot further?

Our devices are currently running balenaOS 2.95.12+rev1. Persistent storage isn’t the problem in our case, because we can easily recreate whatever data needs to be added to the volume(s).

If I try one of the latest balenaOS releases, such as 2.113.15, I hit a different issue (complaints about the CUDA kernel module drivers), even within the same working fleet:

CUDA Error: no CUDA-capable device is detected
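
To dig into that one, a low-level probe like the following (only a sketch, calling the CUDA driver API through ctypes; the script name is made up) can distinguish a missing driver library from a driver that loads but detects no device:

# cuda_probe.py -- hypothetical diagnostic; assumes libcuda.so.1 is on the library path.
import ctypes

try:
    cuda = ctypes.CDLL("libcuda.so.1")  # fails here if the driver library is absent
except OSError as exc:
    raise SystemExit(f"CUDA driver library not found: {exc}")

rc = cuda.cuInit(0)  # returns 0 (CUDA_SUCCESS) when the driver initializes
if rc != 0:
    # 100 is CUDA_ERROR_NO_DEVICE, which matches the error above
    raise SystemExit(f"cuInit failed with driver error code {rc}")

count = ctypes.c_int()
cuda.cuDeviceGetCount(ctypes.byref(count))
print(f"CUDA devices visible: {count.value}")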

Do you think the balena labels added to the docker-compose file have anything to do with this behavior?

If possible, could I schedule a “Private Support” session with a Balena Engineer (I think our Pilot plan supports that)?

Hey there @cjaramillo, yes, being on the Pilot plan does indeed cover private support. It’s accessible via the “Need Help?” tab at the bottom right-hand side of your dashboard when you’re signed in. When you open the support ticket, please include a link to this forum thread and as much context as possible so we can continue helping.


Thanks for your help. It turns out that, for some reason, the other fleet was installing a different version of NumPy in the container. We assumed the installed dependencies would be the same in both releases (same code pushed to different fleets), but that wasn’t the case. The solution (for me, at least) was to explicitly pin the numpy version installed in the containers, e.g. pip3 install numpy==1.23.4
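
In case it helps others, a small start-up guard like this (just a sketch based on the version we pinned; the script name is made up) makes that kind of dependency drift fail fast instead of surfacing as an obscure import error:

# version_guard.py -- hypothetical start-up check, not part of the original code.
import numpy as np

EXPECTED = "1.23.4"  # the version pinned with: pip3 install numpy==1.23.4
if np.__version__ != EXPECTED:
    raise RuntimeError(
        f"numpy {np.__version__} installed, expected {EXPECTED}; "
        "check that the image pins the dependency explicitly"
    )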