Device stuck in updating state

Our resin device is stuck in an updating state. The containers are responding and a terminal can be opened in the services. However, the reboot and restart commands do not work and fail with the following error:

Request error: tunneling socket could not be established, cause=socket hang up

I am sure the problem would be resolved by power cycling the device, but I would prefer an approach that will work once the device is in production.

When I have a look at the journal with journalctl I get the following output:

journalctl --no-pager -u balena

Oct 08 08:55:40 af1f06f healthdog[809]: time="2018-10-08T08:55:40.360305042Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 3fb4cd7354888137c5776d9f422413eee8716b758e28eaa84b56ed780130957f: rpc error: code = Unknown desc = containerd: container not found"

Hey @hpgmiskin, what device type are you using, and what resinOS version?

I am using a Raspberry Pi 3+ with the following host OS and supervisor versions:

[screenshot of the host OS and supervisor versions]

That error suggests it's a supervisor issue.

Is the resin_supervisor container running? What do you see in the resin-supervisor service’s logs?
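
For example, from a host OS terminal (balena here is the docker-compatible engine CLI on resinOS, so the usual docker-style ps applies):

# list running containers and look for the supervisor
balena ps | grep resin_supervisor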

I am not sure how to check whether the resin_supervisor container is running. Here are the services displayed in the dashboard; the downloading state is the same as it was before the weekend.

[screenshot of the dashboard services]

When I use systemctl to investigate the resin-supervisor service, the unit is listed as disabled:

root@af1f06f:~# systemctl list-unit-files | grep resin
resin\x2ddata.mount                                disabled
bind-etc-resin-supervisor.service                  enabled
bind-etc-systemd-system-resin.target.wants.service enabled
openvpn-resin.service                              enabled
resin-boot.service                                 enabled
resin-data.service                                 enabled
resin-device-api-key.service                       enabled
resin-device-uuid.service                          enabled
resin-filesystem-expand.service                    enabled
resin-hostname.service                             enabled
resin-info@.service                                disabled
resin-init.service                                 enabled
resin-net-config.service                           enabled
resin-ntp-config.service                           enabled
resin-persistent-logs.service                      enabled
resin-proxy-config.service                         enabled
resin-state-reset.service                          enabled
resin-state.service                                enabled
resin-supervisor.service                           disabled
update-resin-supervisor.service                    static
resin.target                                       static
update-resin-supervisor.timer                      disabled

The logs show the supervisor repeatedly restarting at the moment. I have tailed them, as there are a lot and they continually repeat the restart cycle:

root@af1f06f:~# journalctl -u resin-supervisor.service | tail -n 20
Oct 08 10:29:08 af1f06f systemd[1]: Started Resin supervisor.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Watchdog timeout (limit 1min)!
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Killing process 25407 (start-resin-sup) with signal SIGABRT.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Killing process 25408 (exe) with signal SIGABRT.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Killing process 25438 (balena) with signal SIGABRT.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
Oct 08 10:31:38 af1f06f systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Oct 08 10:31:38 af1f06f systemd[1]: resin-supervisor.service: Killing process 25438 (balena) with signal SIGKILL.
Oct 08 10:31:38 af1f06f systemd[1]: resin-supervisor.service: Failed with result 'watchdog'.
Oct 08 10:31:48 af1f06f systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
Oct 08 10:31:48 af1f06f systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 1284.
Oct 08 10:31:48 af1f06f systemd[1]: Stopped Resin supervisor.
Oct 08 10:31:48 af1f06f systemd[1]: Starting Resin supervisor...
Oct 08 10:32:09 af1f06f balena[26803]: resin_supervisor
Oct 08 10:32:09 af1f06f systemd[1]: Started Resin supervisor.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Watchdog timeout (limit 1min)!
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Killing process 26988 (start-resin-sup) with signal SIGABRT.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Killing process 26989 (exe) with signal SIGABRT.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Killing process 27019 (balena) with signal SIGABRT.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT

@imrehg would you recommend just power cycling the device? I think this will resolve the issue, but it may mean the root cause is never identified. I can put the device into support mode if that is helpful.

Yeah, enabling support access and letting us know the UUID would be good, so we can check things, @hpgmiskin. Thanks!

Thanks @hpgmiskin received the UUID.

Checked out the device; the supervisor wasn't running for some reason, and I'm not totally sure why at this point. Cleared up the running containers and restarted the supervisor, so it should now be downloading the application. Let's see what happens.

The supervisor can be checked with systemctl status resin-supervisor, or by looking at the logs of the same service, from the host OS.
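
Something like this, from a host OS terminal:

# current state of the supervisor service
systemctl status resin-supervisor
# follow its logs as they come in (drop -f to page through everything)
journalctl -f -a -u resin-supervisor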

It's probably worth keeping an eye out for this, as the root cause is indeed not yet identified. The newer resinOS versions ship a newer balena, which should resolve a number of strange balena issues (this might be one of them), but I would recommend waiting for the next release (resinOS 2.20.0 or above).

Let us know how it's looking from your side.

Thanks for your help, I can see that everything is downloading now. What were the commands you used to clear up the running containers?

Following my last post yesterday, I had run systemctl stop resin-supervisor in an attempt to resolve the issues with the supervisor, but I got no output from the command for a very long time, so I assumed it had not worked. It looks as though it had worked; it just took its time.
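
In future I might queue the stop without waiting and then poll, rather than staring at a silent command (untested sketch; --no-block and is-active are standard systemctl options):

# queue the stop job and return immediately
systemctl stop --no-block resin-supervisor
# poll until the unit reports inactive
systemctl is-active resin-supervisor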

I will be sure to use systemctl status resin-supervisor in future. When resinOS 2.20.0 lands I will update to the latest version and post again if I experience any similar issues.

Not sure; stopping the supervisor as you did seemed to have worked for me too, but maybe I missed something along the steps.

The container cleanup after the supervisor stop is balena rm -f $(balena ps -a -q).
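
So the full recovery sequence was roughly this (a sketch from memory; run from the host OS, and note it force-removes every service container, which the supervisor then re-creates):

# stop the supervisor so it doesn't restart containers mid-cleanup
systemctl stop resin-supervisor
# force-remove all containers, running or not
balena rm -f $(balena ps -a -q)
# bring the supervisor back; it should re-download the application
systemctl start resin-supervisor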

That status command works, or journalctl -f -a -u resin-supervisor to follow the latest logs (or without -f to see all of them).


We run into this issue when the local storage on the device is full, which, considering the number of containers you have, may be the case here.

You can check by opening a terminal on the host OS and running df -h; you may find that /mnt/data is at 100%.
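
For example:

# check how full the data partition is
df -h /mnt/data
# and see how much of that the docker store is using
du -sh /mnt/data/docker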

Our hack is to run rm -rf /mnt/data/docker (bad, I know).

This wipes out the docker config and all of the container layers that are eating up the storage. Although you can't reboot the device through the UI, you can issue the reboot command to the host OS via the terminal, as sketched below.
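
In full, the heavy-handed recovery looks something like this (a sketch; stopping the engine first is my addition, so the store isn't deleted out from under it; the engine service is named balena, per the journalctl unit earlier in the thread):

# stop the container engine before deleting its on-disk state (assumption: unit is named balena)
systemctl stop balena
# wipe the docker config and all container layers
rm -rf /mnt/data/docker
# the dashboard reboot fails, so reboot from the host OS shell
reboot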

After rebooting, it will restart the download process.