Device stuck in updating state

Our resin device is stuck in an updating state. The containers are responding and a terminal can be opened in the services. However, the reboot and restart commands do not work and fail with the following error:

Request error: tunneling socket could not be established, cause=socket hang up

I am sure the problem would be resolved by power cycling the device, but I would prefer an approach that will work once the device is in production.

When I have a look at the journal with journalctl I get the following output:

journalctl --no-pager -u balena

Oct 08 08:55:40 af1f06f healthdog[809]: time="2018-10-08T08:55:40.360305042Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 3fb4cd7354888137c5776d9f422413eee8716b758e28eaa84b56ed780130957f: rpc error: code = Unknown desc = containerd: container not found"

Hey @hpgmiskin, what device type are you using, and what resinOS version?

I am using a Raspberry Pi 3+ with the following host OS and supervisor versions:

[screenshot of the host OS and supervisor versions]

That error suggests it's a supervisor issue.

Is the resin_supervisor container running? What do you see in the resin-supervisor service’s logs?
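
For example, from a host OS terminal (balena here is the docker-compatible engine CLI on resinOS, so the usual docker-style ps applies):

# list running containers and look for the supervisor
balena ps | grep resin_supervisor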

I am not sure how to check whether the resin_supervisor container is running. Here are the services displayed in the dashboard; the downloading state is the same as it was before the weekend.

[screenshot of the dashboard services]

When I use systemctl to investigate the resin-supervisor service, the unit is listed as disabled:

root@af1f06f:~# systemctl list-unit-files | grep resin
resin\x2ddata.mount                                disabled
bind-etc-resin-supervisor.service                  enabled
bind-etc-systemd-system-resin.target.wants.service enabled
openvpn-resin.service                              enabled
resin-boot.service                                 enabled
resin-data.service                                 enabled
resin-device-api-key.service                       enabled
resin-device-uuid.service                          enabled
resin-filesystem-expand.service                    enabled
resin-hostname.service                             enabled
resin-info@.service                                disabled
resin-init.service                                 enabled
resin-net-config.service                           enabled
resin-ntp-config.service                           enabled
resin-persistent-logs.service                      enabled
resin-proxy-config.service                         enabled
resin-state-reset.service                          enabled
resin-state.service                                enabled
resin-supervisor.service                           disabled
update-resin-supervisor.service                    static
resin.target                                       static
update-resin-supervisor.timer                      disabled

The logs show the supervisor repeatedly restarting at the moment. I have tailed them, as there are a lot and they continually repeat the restart cycle:

root@af1f06f:~# journalctl -u resin-supervisor.service | tail -n 20
Oct 08 10:29:08 af1f06f systemd[1]: Started Resin supervisor.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Watchdog timeout (limit 1min)!
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Killing process 25407 (start-resin-sup) with signal SIGABRT.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Killing process 25408 (exe) with signal SIGABRT.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Killing process 25438 (balena) with signal SIGABRT.
Oct 08 10:30:08 af1f06f systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
Oct 08 10:31:38 af1f06f systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Oct 08 10:31:38 af1f06f systemd[1]: resin-supervisor.service: Killing process 25438 (balena) with signal SIGKILL.
Oct 08 10:31:38 af1f06f systemd[1]: resin-supervisor.service: Failed with result 'watchdog'.
Oct 08 10:31:48 af1f06f systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
Oct 08 10:31:48 af1f06f systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 1284.
Oct 08 10:31:48 af1f06f systemd[1]: Stopped Resin supervisor.
Oct 08 10:31:48 af1f06f systemd[1]: Starting Resin supervisor...
Oct 08 10:32:09 af1f06f balena[26803]: resin_supervisor
Oct 08 10:32:09 af1f06f systemd[1]: Started Resin supervisor.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Watchdog timeout (limit 1min)!
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Killing process 26988 (start-resin-sup) with signal SIGABRT.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Killing process 26989 (exe) with signal SIGABRT.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Killing process 27019 (balena) with signal SIGABRT.
Oct 08 10:33:09 af1f06f systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT

@imrehg would you recommend just power cycling the device? I think this will resolve the issue, but it may mean the root cause is never identified. I can put the device into support mode if that is helpful.

Yeah, enabling support access and letting us know the UUID would be good, so we can check things, @hpgmiskin. Thanks!

Thanks @hpgmiskin received the UUID.

Checked out the device; the supervisor wasn't running for some reason, and I'm not totally sure why at this point. Cleared up the running containers and restarted the supervisor, so it should now be downloading the application. Let's see what happens.

The supervisor can be checked with systemctl status resin-supervisor, or by looking at the logs of the same service, from the host OS.
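
Something like this, from a host OS terminal:

# current state of the supervisor service
systemctl status resin-supervisor
# follow its logs as they come in (drop -f to page through everything)
journalctl -f -a -u resin-supervisor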

It's probably worth keeping an eye out for this, as the root cause is indeed not yet identified. The newer resinOS versions ship a newer balena, which should resolve a number of strange balena issues (this might be one of them), but I would recommend waiting for the next release (resinOS 2.20.0 or above).

Let us know how it's looking from your side.

Thanks for your help, I can see that everything is downloading now. What were the commands you used to clear up the running containers?

Following my last post yesterday, I had run systemctl stop resin-supervisor in an attempt to resolve the issues with the supervisor, but I got no output from the command for a very long time, so I assumed it had not worked. It looks as though it had worked; it just took its time.
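
In future I might queue the stop without waiting and then poll, rather than staring at a silent command (untested sketch; --no-block and is-active are standard systemctl options):

# queue the stop job and return immediately
systemctl stop --no-block resin-supervisor
# poll until the unit reports inactive
systemctl is-active resin-supervisor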

I will be sure to use systemctl status resin-supervisor in future. When resinOS 2.20.0 lands I will update to the latest version and post again if I experience any similar issues.

Not sure; stopping the supervisor as you did seemed to have worked for me too, but maybe I missed something along the steps.

The container cleanup after the supervisor stop is balena rm -f $(balena ps -a -q).
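
So the full recovery sequence was roughly this (a sketch from memory; run from the host OS, and note it force-removes every service container, which the supervisor then re-creates):

# stop the supervisor so it doesn't restart containers mid-cleanup
systemctl stop resin-supervisor
# force-remove all containers, running or not
balena rm -f $(balena ps -a -q)
# bring the supervisor back; it should re-download the application
systemctl start resin-supervisor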

That status command works, or journalctl -f -a -u resin-supervisor to follow the latest logs (or without -f to see all of them).


We run into this issue when the local storage on the device is full, which, considering the number of containers you have, may be the case here.

You can check by opening a terminal on the host OS and running df -h; you may find that /mnt/data is at 100%.
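
For example:

# check how full the data partition is
df -h /mnt/data
# and see how much of that the docker store is using
du -sh /mnt/data/docker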

Our hack is to run rm -rf /mnt/data/docker (bad, I know).

This wipes out the docker config and all of the container layers that are eating up the storage. Although you can't reboot the device through the UI, you can issue the reboot command to the host OS via the terminal, as sketched below.
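
In full, the heavy-handed recovery looks something like this (a sketch; stopping the engine first is my addition, so the store isn't deleted out from under it; the engine service is named balena, per the journalctl unit earlier in the thread):

# stop the container engine before deleting its on-disk state (assumption: unit is named balena)
systemctl stop balena
# wipe the docker config and all container layers
rm -rf /mnt/data/docker
# the dashboard reboot fails, so reboot from the host OS shell
reboot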

After rebooting, it will restart the download process.