We’ve had two or three remote production devices go into a weird state in which our application containers stop running, but the devices still show as “Online” in the balena website.
The “weird state” is characterized by the following: the containers appear not to be running (no balena logs, and no internal logs reaching our infrastructure). Issuing a restart command from the website fails with the message “tunneling socket could not be established: cause=socket hang up”. Issuing a reboot command from the website fails the same way, with the same error message. We discovered that we were able to SSH into the “Host OS” and run “shutdown -r now”, and this seemed to fully restore the devices to operational status. This has been observed on balenaOS 2.29.0+rev1 with supervisor 9.0.1, and possibly on other versions.
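For anyone hitting the same state, here is a rough sketch of what could be captured on the Host OS before applying the reboot workaround, so that the evidence is not lost. Only `shutdown -r now` is what we actually ran. The `resin-supervisor` unit name, the `balena` engine command, and the helper names `run` and `collect_and_reboot` are assumptions for illustration and may differ on other OS versions.

```shell
#!/bin/sh
# Sketch only: gather basic state on the Host OS, then reboot.
# Assumptions (not confirmed above): the supervisor runs as the systemd
# unit "resin-supervisor" and the container engine CLI is "balena".

# Default to a dry run so nothing is executed by accident.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

collect_and_reboot() {
    # Is the supervisor service alive, and what did it log recently?
    run systemctl status resin-supervisor --no-pager
    run journalctl -u resin-supervisor --no-pager -n 200
    # Which containers does the engine know about, including stopped ones?
    run balena ps -a
    # The workaround that restored our devices.
    run shutdown -r now
}
```

With the default `DRY_RUN=1`, calling `collect_and_reboot` only prints the commands. Setting `DRY_RUN=0` on the Host OS would execute them, ending with the reboot.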
Why is this happening, and how can we prevent it from happening? What can we look into?