Container stuck when restarting: OCI runtime exec failed

Hi,

I have a balena device with a container that fails to restart. It appears to get stuck when balena-engine kills the container. This does not happen on my other ~50 devices, which are running the same release.

The device is an Intel NUC running balenaOS 2.68.1+rev1 and supervisor 12.9.3. It was previously running 2.46.0+rev1; I recently upgraded hoping that would fix this issue, but it hasn’t.

I have attached the system logs captured immediately after I tried to restart the container from the web console. The line that jumps out at me is the following:

Jul 27 09:39:05 cfdb910 balenad[1134]: time="2021-07-27T09:39:05.847081513Z" level=error msg="Error running exec b962cf21fd80edb72299bbca771a4d328ec7118b18cf9eee3786326537650ffd in container: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused \"process_linux.go:101: executing setns process caused \\\"exit status 1\\\"\": unknown"
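In case it helps anyone going through the attachment, here is a minimal sketch of how lines like the one above could be filtered out of the log. It assumes every line follows the same journald prefix plus key="value" layout as that line. The names `parse_line` and `engine_errors` are made up for illustration and are not part of any balena tooling.

```python
import re

# Journald-style prefix as seen in the line above:
#   "Jul 27 09:39:05 cfdb910 balenad[1134]: <rest>"
LINE_RE = re.compile(
    r'^(?P<ts>\w{3}\s+\d+\s[\d:]+)\s(?P<host>\S+)\s'
    r'(?P<unit>[\w-]+)\[(?P<pid>\d+)\]:\s(?P<rest>.*)$'
)

# key=value pairs, where the value is either a double-quoted string
# (which may contain backslash-escaped quotes) or a bare token.
FIELD_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line):
    """Return a dict for one log line, or None if the prefix does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    record = {"unit": match.group("unit"), "host": match.group("host")}
    for key, value in FIELD_RE.findall(match.group("rest")):
        if value.startswith('"'):
            # Strip the outer quotes only; inner escapes are left untouched.
            value = value[1:-1]
        record[key] = value
    return record


def engine_errors(lines, unit="balenad"):
    """Collect parsed records from the given unit that have level=error."""
    errors = []
    for line in lines:
        record = parse_line(line)
        if record and record["unit"] == unit and record.get("level") == "error":
            errors.append(record)
    return errors


# Example use against the attached file (path is an assumption):
#   with open("stuck-container.txt") as fh:
#       for rec in engine_errors(fh):
#           print(rec["time"], rec["msg"])
```

This only narrows the log down to the engine's error-level messages; it does not say anything about the underlying cause.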

I have seen this previous issue on these forums: OCI runtime exec failed: exec failed: container_linux.go:348. A balena team member mentions hardware issues as a possible cause. Nothing in dmesg seems to indicate a problem; the only entries there are the boot logs. How could I go about checking whether this is a hardware issue or an issue with the code in the container?

This container does load some kernel modules into the host OS so that we can make use of Nvidia hardware. It also has a web server with a child process that seems to be trying to restart while the container is being killed. I am not sure whether either of these could cause an issue like this, as it has not occurred on other devices running the same setup.

stuck-container.txt (51.5 KB)