Device In Unrecoverable State

We have a prototype device that we attempted to downgrade to a previous version of our release, and now one of the containers isn’t running correctly. When I SSH into the container I see the following:

Connecting to ...
Spawning shell...
OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused \"exit status 1\"": unknown
OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused \"exit status 1\"": unknown
SSH session disconnected
SSH reconnecting...

We had another container consuming 100% of a single CPU core. I successfully paused that container, but I haven’t been able to recover the system to a usable state. A remote reboot fails with a TIMEOUT message, and stopping the containers from the dashboard doesn’t seem to work.

I’m looking for best practices on how to deal with situations like this when we go into production.

I have granted support access to this device.

Please let me know if you need more details.

Thanks,

Hello @WestCoastDaz, it looks like the container is not compatible with your fleet device type!

Could you please share what hardware you are using, the device type of your fleet and, if possible, the container that is running?

@mpous

Can you provide more details on what you mean by not compatible? Our hardware is a CM4 on a custom carrier board that is pretty much identical to a CM4 IO board. We have several devices using this hardware running Balena without issue.

The device got into this state after we tried to upgrade the device to the latest release.

I have tried downgrading the containers on this unit but that is also not working.

My main concern right now is to understand what happened and how best to recover this unit, as it’s deployed with a customer right now.

OK! Sorry for the misunderstanding @WestCoastDaz

I would like to take a look at your device! Could you please grant us support access to the device? Maybe you can share the URL of your device over DM once it is in support mode? Thanks!

@WestCoastDaz it looks like your device has a DNS issue, not sure why! Did you change anything? It is connected over Ethernet, right?

@mpous

I believe my colleague tried updating it late last week and then it entered this state. It is on Ethernet.

What next steps do you think I should take? Also, why would DNS issues cause the container issue I noted in the original post?

Hi @WestCoastDaz,

Quick note. You asked for diagnosing / troubleshooting tips. You may already know these tips, but just in case… We have a Masterclass on the topic, please see Balena Device Debugging Masterclass. Among the things covered in that masterclass is using the Diagnostics tab for the device. It looks like that’s where mpous started above: Diagnostics > Device health checks, which currently shows check_networking failing with specific DNS issues. Diagnostics > Device diagnostics can also be helpful, but it produces a lot of diagnostic info.
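If you want a quick way to confirm whether name resolution is working from a machine or container on the same network, here is a minimal sketch in Python. To be clear, this is not what the check_networking health check does internally; it is just a generic resolver test. It assumes Python 3 is available wherever you run it, and the hostname in the usage note is a placeholder to swap for whatever your device needs to reach.

```python
import socket
from typing import Callable, Dict, Iterable


def can_resolve(hostname: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves to at least one address.

    The resolver is injectable so the function can be exercised without
    touching the network; by default it uses the system resolver.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Raised by the system resolver when the lookup fails.
        return False


def check_dns(hostnames: Iterable[str],
              resolver: Callable = socket.getaddrinfo) -> Dict[str, bool]:
    """Map each hostname to whether it currently resolves."""
    return {name: can_resolve(name, resolver) for name in hostnames}
```

For example, check_dns(["example.com"]) returns a dictionary with True or False for that placeholder name. A False result only tells you the lookup failed from where the script ran; it does not say why.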

I’m happy to take a look this evening.

Hi @WestCoastDaz,

The error message you pasted, OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused \"exit status 1\"", occurs when a balena exec attempt into a container’s terminal fails, as you’ve observed. In the multi-container application that your device is running, I’m able to access the container terminals of all the services except for server, which has currently exited. Of course, if a container is not running, you cannot exec into it. It looks like the issue you observed earlier is no longer present; our OS and Supervisor attempt to be self-healing, and it looks like the system recovered in this case. I’ve seen instances of this error message before, and you can find details (and a fix) in this GitHub issue: Fails to open terminal to an application container with: OCI runtime exec failed · Issue #256 · balena-os/balena-engine · GitHub. Feel free to follow that issue for updates.

For debugging tips both now and in production, when an application is in an inconsistent state, especially during an update, one thing that will provide more information is to check the Supervisor journal logs on the device. You’ll find details about this as well as other valuable debugging tips in the Device Debugging Masterclass linked by my colleagues above, but the TL;DR is that the command for checking journal logs is journalctl -u balena-supervisor -u resin-supervisor -xef -n 200 or something similar. This command is meant to be run from the host OS terminal.
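If you end up checking these logs often, the same idea can be scripted. Below is a minimal sketch in Python, not an official balena tool. It assumes Python 3 is available wherever you process the logs (for example after copying the journal output off the device), and the function names are made up for illustration. The unit names and -n 200 come from the command above; the follow flag is dropped and --no-pager is added so the command exits on its own instead of streaming.

```python
import subprocess
from typing import Iterable, List

# Unit names taken from the journalctl command quoted above.
SUPERVISOR_UNITS = ("balena-supervisor", "resin-supervisor")


def supervisor_log_command(lines: int = 200) -> List[str]:
    """Build a non-following variant of the journalctl command.

    The follow flag is omitted and --no-pager is added so that the
    command terminates, which is what a script needs.
    """
    cmd = ["journalctl"]
    for unit in SUPERVISOR_UNITS:
        cmd += ["-u", unit]
    cmd += ["-x", "--no-pager", "-n", str(lines)]
    return cmd


def find_events(log_lines: Iterable[str],
                needle: str = "Event: Service stop") -> List[str]:
    """Return the log lines containing the given text (case-sensitive)."""
    return [line.rstrip("\n") for line in log_lines if needle in line]


def read_supervisor_logs(lines: int = 200) -> List[str]:
    """Run journalctl and return its output as a list of lines.

    Assumption: this runs somewhere journalctl can see the Supervisor
    units, such as the host OS.
    """
    result = subprocess.run(
        supervisor_log_command(lines),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Calling find_events(read_supervisor_logs()) would list any matching stop events from the last 200 entries. If Python is not available on the host, find_events can be pointed at a saved copy of the journal output instead, since it accepts any iterable of lines.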

Looking at the logs for your device, I see a dashboard stop action for the server container in particular, which manifests as Event: Service stop in the logs. There is nothing else abnormal in recent logs so I can’t say for sure why your device entered this state.

Let us know if there are any other questions!

Regards,
Christina


The device checks & diagnostics that failed aren’t necessarily related to this issue, but may be related to some other abnormality on the device which unluckily was occurring at the same time. In this case I don’t think they’re related.
