Nvidia Jetson Orin NX 16GB in Xavier NX Devkit NVME goes offline on its own!

Device Information:

  • Model: Nvidia Jetson Orin NX 16GB
  • Setup: Xavier NX Devkit with NVME
  • HOST OS: balenaOS
  • Version: 3.2.5
  • Supervisor Version: 14.12.0
  • Application Running: Lumeo v1.16.26

Issue Description: After successfully starting and booting the OS, the device seems to operate as expected. It appears online on the balenaCloud dashboard, and I can access both the host terminal and the Lumeo terminal without problems. However, after roughly 4 hours of continuous operation, the device unexpectedly goes offline. Despite this, the device remains powered on, and all diagnostics indicate that it is functioning normally.

Temporary Solution Attempted: To temporarily resolve this issue and bring the device back online, I’ve tried unplugging the power cable and then plugging it back in. This method seems to work, but it’s not a long-term solution.

Additional Information: No significant changes or modifications were made to the device or software before this issue started occurring. The environmental factors, like temperature and humidity, are within standard operating ranges.

I am including device diagnostics output in case it helps!
0f868c45c0e3d0868b83fb96f5de8c26_diagnostics_2023.09.21_00.03.40+0000.txt (554.4 KB)

Also adding supervisor state:
{
"api_port": 48484,
"ip_address": "ADDRESS",
"os_version": "balenaOS 3.2.5+rev2",
"mac_address": "ADDRESS",
"supervisor_version": "14.12.0",
"update_pending": false,
"update_failed": false,
"update_downloaded": false,
"commit": "a16e2c00bfdd3f1c0fd4c9dc7a5ef30f",
"status": "Idle",
"download_progress": null
}

Finally device health checks:

{"diagnose_version":"4.22.13","checks":[{"name":"check_balenaOS","success":true,"status":"Supported balenaOS 2.x detected"},{"name":"check_container_engine","success":true,"status":"No container_engine issues detected"},{"name":"check_localdisk","success":true,"status":"No localdisk issues detected"},{"name":"check_memory","success":true,"status":"93% memory available"},{"name":"check_networking","success":true,"status":"No networking issues detected"},{"name":"check_os_rollback","success":true,"status":"No OS rollbacks detected"},{"name":"check_supervisor","success":true,"status":"Supervisor is running & healthy"},{"name":"check_temperature","success":true,"status":"No temperature issues detected"},{"name":"check_timesync","success":true,"status":"Time is synchronized"},{"name":"check_service_lumeo","success":true,"status":"User service 'lumeo' is running & healthy"}]}

I’d appreciate any insights or solutions that the community might have to address this problem. Has anyone else experienced a similar issue?

Hi @Haytham_Amin,

Does the issue occur only on a single device or on multiple ones? Also, does it happen regardless of the network used? From the logs you shared it appears that you are using Ethernet only for connectivity, so I will set up an Orin NX on Ethernet and leave it running for some time to see if the issue is reproducible on our side.

If you flashed a development variant of the image, you can connect a USB serial cable to the board and log in as root. Alternatively, if the image you flashed is a production one, you can switch it to development mode by modifying /mnt/boot/config.json to add ,"developmentMode":"true". Once the change is made, the web terminal connection will reset and you will see the login prompt on the UART.
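If editing the JSON by hand feels error-prone, the change can be scripted. The following is only a minimal sketch: it assumes Python is available wherever config.json is being edited (the balenaOS host itself may not ship Python, so this may mean mounting the boot partition on another machine), and enable_development_mode is a hypothetical helper name, not part of any balena tooling.

```python
import json
from pathlib import Path


def enable_development_mode(config_path):
    """Add developmentMode to a balenaOS config.json and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text())

    # Written as the string "true", exactly as given in the instruction above.
    config["developmentMode"] = "true"

    # Write to a temporary file next to the original, then rename it over the
    # original, so an interrupted write does not leave a truncated config.json.
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.write_text(json.dumps(config))
    tmp_path.replace(path)
    return config
```

On the device the path would be /mnt/boot/config.json; all existing keys are kept and only developmentMode is added or overwritten.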

Using the UART interface, you can check whether the device still has an IP address on the eth0 interface once it goes offline, and also check whether the VPN is still connected using journalctl | grep openvpn.
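If you capture the output of these checks to text files over the serial console, they can be summarized offline. This is a rough sketch under assumptions: ipv4_addresses and openvpn_lines are hypothetical helper names, the first expects the text printed by ip addr show eth0, and the second expects the text printed by journalctl.

```python
import re


def ipv4_addresses(ip_addr_output):
    """Return the IPv4 addresses listed in `ip addr show <iface>` output."""
    # "inet " (with a trailing space) matches IPv4 entries but not "inet6".
    pattern = r"^\s*inet (\d+\.\d+\.\d+\.\d+)"
    return re.findall(pattern, ip_addr_output, flags=re.MULTILINE)


def openvpn_lines(journal_text, last=5):
    """Return the last few journal lines that mention openvpn."""
    matches = [
        line for line in journal_text.splitlines()
        if "openvpn" in line.lower()
    ]
    return matches[-last:]
```

An empty list from ipv4_addresses would suggest eth0 lost its address, while the most recent lines from openvpn_lines show what the VPN client last reported before the device dropped offline.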

Hello! Issue resolved after flashing the device. Thanks

Hi and thank you for letting us know. On our side the device has been online since I flashed it yesterday. Should the issue occur again, please don’t hesitate to contact us. Thanks!