Our resin device is stuck in an updating state. The containers are responding and a terminal can be opened in the services. However, the reboot and restart commands do not work and fail with the following error:
Request error: tunneling socket could not be established, cause=socket hang up
I am sure that the problem would be resolved by power cycling the device; however, I would prefer an approach that will work when the device is in production.
When I look at the journal with journalctl I get the following output:
journalctl --no-pager -u balena
Oct 08 08:55:40 af1f06f healthdog[809]: time="2018-10-08T08:55:40.360305042Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 3fb4cd7354888137c5776d9f422413eee8716b758e28eaa84b56ed780130957f: rpc error: code = Unknown desc = containerd: container not found"
I am not sure how to check that the resin_supervisor container is running. Here are the services which are displayed through the dashboard. The downloading state is the same as it was before the weekend.
The logs show the supervisor repeatedly restarting at the moment. I have tailed them, as there are a lot of entries and they continually repeat the restart cycle.
root@af1f06f:~# journalctl -u resin-supervisor.service | tail -n 20
Oct 08 10:29:08 af1f06f systemd[1]: Started Resin supervisor.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Watchdog timeout (limit 1min)!
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Killing process 25407 (start-resin-sup) with signal SIGABRT.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Killing process 25408 (exe) with signal SIGABRT.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Killing process 25438 (balena) with signal SIGABRT.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
Oct 08 10:31:38 af1f06f systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Oct 08 10:31:38 af1f06f systemd[1]: resin-supervisor.service: Killing process 25438 (balena) with signal SIGKILL.
Oct 08 10:31:38 af1f06f systemd[1]: resin-supervisor.service: Failed with result 'watchdog'.
Oct 08 10:31:48 af1f06f systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
Oct 08 10:31:48 af1f06f systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 1284.
Oct 08 10:31:48 af1f06f systemd[1]: Stopped Resin supervisor.
Oct 08 10:31:48 af1f06f systemd[1]: Starting Resin supervisor...
Oct 08 10:32:09 af1f06f balena[26803]: resin_supervisor
Oct 08 10:32:09 af1f06f systemd[1]: Started Resin supervisor.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Watchdog timeout (limit 1min)!
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Killing process 26988 (start-resin-sup) with signal SIGABRT.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Killing process 26989 (exe) with signal SIGABRT.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Killing process 27019 (balena) with signal SIGABRT.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
@imrehg would you recommend just power cycling the device? I think this will resolve the issue, but it may mean the root cause is not identified. I can put the device into support mode if that is helpful.
Checked out the device, the supervisor wasn’t running for some reason, not totally sure at this point. Cleared up the running containers and restarted the supervisor, it should be downloading the application. Let’s see what happens.
The supervisor can be checked with systemctl status resin-supervisor, or by checking the logs of the same service, in the host OS.
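As a sketch, those checks from a host OS terminal look like this. The service name is from this thread; the check_unit helper is a hypothetical wrapper added here so the snippet degrades gracefully when systemd is not running (e.g. off-device):

```shell
#!/bin/sh
# On the device's host OS, the supervisor can be inspected directly:
#
#   systemctl status resin-supervisor
#   journalctl -u resin-supervisor.service -f
#
# check_unit (hypothetical helper) prints a unit's state without
# erroring out when systemctl is unavailable or systemd is not running.
check_unit() {
  state=$(systemctl is-active "$1" 2>/dev/null || true)
  echo "${state:-unknown}"
}

check_unit resin-supervisor
```

On a healthy device this prints "active"; during the restart loop shown in the logs above you would see it flip between states.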
It's probably worth keeping an eye out for this, as indeed the root cause is not yet identified. The newer resinOS versions ship a newer balena, which should resolve a bunch of strange balena issues (this might be one of them), but I would recommend waiting for the next release (resinOS 2.20.0 or above).
Thanks for your help, I can see that everything is downloading now. What were the commands that you used to clear up the running containers?
Following my last post yesterday, I had run systemctl stop resin-supervisor in an attempt to resolve the issues with the supervisor, but I got no output from the command for a very long time, so I assumed it had not worked. It looks as though it had worked; it just took its time.
I will be sure to use systemctl status resin-supervisor in future. When resinOS 2.20.0 lands I will update to the latest version and post again if I experience any similar issues.
We run into this issue when the local storage on the device is full, which, considering the number of containers you have, may be the case.
You can check by opening a terminal on the host OS and running df -h; you may find that /mnt/data is at 100%.
Our hack is to run rm -rf /mnt/data/docker (bad, I know).
This wipes out the docker config and all of the container layers that are eating up the storage. Although you can't reboot the device through the UI, you can issue the reboot command to the host OS via the terminal.
After rebooting, it will restart the download process.
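The check-then-clean flow above can be sketched as a small script. The /mnt/data mount point, the 100% condition, and the cleanup path are from this thread; the usage_pct helper is a hypothetical addition for parsing df output:

```shell
#!/bin/sh
# usage_pct MOUNT - print the percent-used figure for a mount point,
# parsed from portable `df -P` output (5th column, e.g. "100%").
usage_pct() {
  df -P "$1" 2>/dev/null | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# On the device: wipe the engine's data directory only when it exists
# and the data partition is actually full. WARNING: this deletes all
# container layers, as described above; the download restarts on reboot.
if [ -d /mnt/data/docker ] && [ "$(usage_pct /mnt/data)" = "100" ]; then
  rm -rf /mnt/data/docker
  reboot
fi
```

Gating the cleanup on the actual usage figure avoids wiping the container layers on devices where storage was not the problem.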