Containers won't start on Nvidia Jetson

Hi All,

We’ve had an issue where our containers just won’t start on BalenaOS deployed onto a Jetson Xavier NX eMMC 16GB (the 16GB RAM variant). We have an 8GB RAM version, also with a 16GB eMMC onboard, running fine. I have tried the following:

  • updated to new version of Balena Jetson
  • updated to new jetson-flash
  • tried different boards

We had to build a custom image because our WiFi card is not supported in the official Jetson image. This is the persistent error I keep getting. Any help would be really appreciated.

[error]         at fn (/usr/src/app/dist/app.js:10:9736)
[error]       at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
[error]   Device state apply error Error: Failed to apply state transition steps. (HTTP code 500) server error - failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to apply cgroup configuration: failed to write 13325: write /sys/fs/cgroup/cpuset/system.slice/docker-b876bc18176474ce63dd1b1aafe0b8c79881d42d4631ac884f91620cc66d7023.scope/cgroup.procs: no space left on device: unknown  Steps:["start","start","start"]
[error]         at fn (/usr/src/app/dist/app.js:10:9736)
[error]       at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

Hello @choln, in your logs I see the message:

no space left on device: unknown Steps:["start","start","start"]

Could you please confirm if you have available space on the device? Thanks!

Hi mpous,

This is a really mysterious error and it has something to do with cgroups. I haven’t been able to work out why everything still works fine on our 8GB RAM Jetson Xavier NX eMMC. I don’t know if there’s some drastic hardware difference that may be causing the issue. The thing is, it has randomly worked before on the 16GB RAM models and then all of a sudden stopped. This is the output of df -h on the device:

Filesystem                      Size  Used Avail Use% Mounted on
devtmpfs                        7.4G     0  7.4G   0% /dev
tmpfs                           7.8G  6.4M  7.8G   1% /run
/dev/disk/by-state/resin-rootA  677M  408M  219M  66% /mnt/sysroot/active
/dev/disk/by-state/resin-state   18M  412K   16M   3% /mnt/state
overlay                         677M  408M  219M  66% /
/dev/mmcblk0p13                  13G  282M   12G   3% /mnt/data
tmpfs                           7.8G     0  7.8G   0% /dev/shm
tmpfs                           4.0M     0  4.0M   0% /sys/fs/cgroup
tmpfs                           7.8G  8.0K  7.8G   1% /tmp
/dev/mmcblk0p9                  120M   74K  120M   1% /mnt/boot
tmpfs                           7.8G   68K  7.8G   1% /var/volatile
/dev/mmcblk0p11                 677M   24K  627M   1% /mnt/sysroot/inactive

Thanks for the confirmation @choln

On the other hand, could you please share which WiFi card you are using on this device?

Did you try connecting the device via Ethernet? Do you still see the issue in that case?

I checked our issues database; please give the following a try:

Removing the containers and restarting the supervisor should allow the supervisor to recreate the missing network and start the containers again.

Hi mpous, thanks for your prompt response. We have ruled out network issues, as we are certain our new images get downloaded just fine. I’ve also tried restarting the supervisor and all the basic troubleshooting, and nothing has worked. I’ve also reduced our images to 100MB so that they fit on the device. That didn’t work either. When I get to the office I’ll compare the supervisor version with the one on the 8GB RAM Jetson that works, and look into the cgroup files in /sys/fs, as that is where the write fails according to the error. Sadly, I think we may be the first ones to come across this error on the Nvidia Jetson Xavier NX. If we discover a solution I’ll make sure you guys know of it.

Hello @choln, feel free to grant support access to your device and share your UUID with me via DM, and I will try to check your device.

Let’s stay connected!

Hi @mpous,

We managed to solve this one. The issue was related to the Jetson Xavier NX 16GB eMMC (the 16GB RAM variant was the one we had trouble with). The device had two of its six CPU cores turned off, so only four were available, and this was what caused the issue for us. We created a custom nvpmodel and appended it to the end of /etc/nvpmodel.conf:

< POWER_MODEL ID=9 NAME=MODE_COMPANY_6_CORE >
CPU_ONLINE CORE_0 1
CPU_ONLINE CORE_1 1
CPU_ONLINE CORE_2 1
CPU_ONLINE CORE_3 1
CPU_ONLINE CORE_4 1
CPU_ONLINE CORE_5 1
TPC_POWER_GATING TPC_PG_MASK 1
GPU_POWER_CONTROL_ENABLE GPU_PWR_CNTL_EN on
CPU_DENVER_0 MIN_FREQ 1190400
CPU_DENVER_0 MAX_FREQ 1907200
CPU_DENVER_1 MIN_FREQ 1190400
CPU_DENVER_1 MAX_FREQ 1907200
CPU_DENVER_2 MIN_FREQ 1190400
CPU_DENVER_2 MAX_FREQ 1907200
GPU MIN_FREQ 0
GPU MAX_FREQ 510000000
GPU_POWER_CONTROL_DISABLE GPU_PWR_CNTL_DIS auto
EMC MAX_FREQ 1600000000
CVNAS MAX_FREQ 460800000
NVDEC MAX_FREQ 665600000
NVDEC1 MAX_FREQ 665600000
NVENC MAX_FREQ 499200000
NVENC1 MAX_FREQ 499200000
NVJPG MAX_FREQ 371200000
SE1 MAX_FREQ 473600000
SE2 MAX_FREQ 473600000
SE3 MAX_FREQ 473600000
SE4 MAX_FREQ 473600000
VDDIN_OC_LIMIT WARN 3000
VDDIN_OC_LIMIT CRIT 5000

# mandatory section to configure the default power mode
< PM_CONFIG DEFAULT=9 >
# optional section to configure the default fan mode
< FAN_CONFIG DEFAULT=quiet >

This enabled us to utilize all six cores, which, for whatever reason, solved our problem.
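
For anyone checking their own profile, here is a minimal Python sketch that lists which cores each power mode in an nvpmodel config turns on. It only understands the `< POWER_MODEL ID=… NAME=… >` and `CPU_ONLINE CORE_n 0|1` lines shown above; the function name `online_cores_by_mode` is made up for illustration and is not part of any NVIDIA or Balena tool.

```python
import re

# Matches headers such as "< POWER_MODEL ID=9 NAME=MODE_COMPANY_6_CORE >"
HEADER = re.compile(r"<\s*POWER_MODEL\s+ID=(\d+)\s+NAME=(\S+)\s*>")
# Matches lines such as "CPU_ONLINE CORE_4 1"
CPU_LINE = re.compile(r"CPU_ONLINE\s+CORE_(\d+)\s+([01])\s*$")


def online_cores_by_mode(conf_text):
    """Return {mode_id: [core numbers set to 1]} for each POWER_MODEL block."""
    modes = {}
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        header = HEADER.match(line)
        if header:
            current = int(header.group(1))
            modes[current] = []
            continue
        cpu = CPU_LINE.match(line)
        # Only record cores that the mode switches on (value 1)
        if cpu and current is not None and cpu.group(2) == "1":
            modes[current].append(int(cpu.group(1)))
    return modes
```

Run against the block above, this would report cores 0 through 5 for mode ID 9. On a device you could pass it the contents of /etc/nvpmodel.conf and compare the result for the default mode with the number of CPUs your containers ask for.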

Note that our containers were requesting six CPUs in total and the system only presented four before we applied the nvpmodel change. Our assumption had been that all six cores were available, but unfortunately the default nvpmodel profile in BalenaOS only had four enabled.
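
A quick way to catch this kind of mismatch is to compare the CPUs a container asks for with the CPUs the kernel reports as online. The sketch below is only an illustration under some assumptions: it expects the requested CPUs as a Linux CPU-list string such as "0-5", it reads the online list from /sys/devices/system/cpu/online, and the helper names are invented for this example. It does not prove this was the exact mechanism behind the error, although cgroup v1 cpusets are known to report "no space left on device" when a task is attached to a cpuset that has no usable CPUs.

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-3,5' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input or a trailing comma
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def missing_cpus(requested, online):
    """Return the requested CPUs that are not in the online list, sorted."""
    return sorted(parse_cpu_list(requested) - parse_cpu_list(online))


def read_online_cpus(path="/sys/devices/system/cpu/online"):
    """Read the kernel's list of online CPUs (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()
```

For example, `missing_cpus("0-5", read_online_cpus())` on the host OS should return an empty list once all six cores are enabled; a non-empty list means the containers are asking for cores the current power mode has switched off.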