How to kill and restart an unhealthy supervisor

Hi,

I noticed one of our devices had an unhealthy heartbeat (VPN only). Running diagnostics showed failures for check_container_engine, check_service_restarts, and check_supervisor. Trying to restart the supervisor via systemctl restart resin-supervisor failed, and even docker kill <supervisor_container_id> hung. Restarting balena-engine via systemctl restarted our application container but not the supervisor:

root@f0d9f9a:~# systemctl restart balena-engine
root@f0d9f9a:~# balena ps
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS                  PORTS               NAMES
d33522f9c72f        f4ea140e4eea                    "/usr/bin/entry.sh /…"   4 days ago          Up 14 seconds                               solmon_2609638_1495878
8f0fc1e48a35        balena/rpi-supervisor:v10.8.0   "./entry.sh"             4 days ago          Up 3 days (unhealthy)                       resin_supervisor
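
In case it is useful for gleaning more detail, the healthcheck history of the unhealthy container should be retrievable with the engine's docker-compatible inspect command (a sketch, assuming the usual docker format strings are supported here):

# Dump the supervisor container's recent healthcheck attempts and their output
balena inspect --format '{{json .State.Health}}' resin_supervisor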

Output of journalctl -u resin-supervisor for the period of interest:

Aug 16 02:11:20 f0d9f9a resin-supervisor[10850]: [api]     GET /v1/healthy 200 - 56.291 ms
Aug 16 02:16:49 f0d9f9a resin-supervisor[10850]: [api]     GET /v1/healthy 200 - 45.236 ms
Aug 17 15:16:55 f0d9f9a systemd[1]: resin-supervisor.service: Stopping timed out. Terminating.
Aug 17 15:16:55 f0d9f9a systemd[1]: resin-supervisor.service: Main process exited, code=killed, status=15/TERM
Aug 17 15:16:55 f0d9f9a systemd[1]: resin-supervisor.service: Control process exited, code=killed, status=15/TERM
Aug 17 15:18:25 f0d9f9a systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Aug 17 15:18:25 f0d9f9a systemd[1]: resin-supervisor.service: Killing process 10998 (balena) with signal SIGKILL.
Aug 17 15:18:25 f0d9f9a systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
Aug 17 15:19:55 f0d9f9a systemd[1]: resin-supervisor.service: Start-pre operation timed out. Terminating.
Aug 17 15:19:55 f0d9f9a systemd[1]: resin-supervisor.service: Control process exited, code=killed, status=15/TERM
Aug 17 15:19:55 f0d9f9a resin-supervisor[5799]: active
Aug 17 15:19:55 f0d9f9a systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
Aug 17 15:19:55 f0d9f9a systemd[1]: Failed to start Balena supervisor.
Aug 17 15:21:36 f0d9f9a systemd[1]: resin-supervisor.service: Start-pre operation timed out. Terminating.
Aug 17 15:21:36 f0d9f9a systemd[1]: resin-supervisor.service: Control process exited, code=killed, status=15/TERM
Aug 17 15:21:36 f0d9f9a resin-supervisor[6085]: active
Aug 17 15:21:36 f0d9f9a systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
Aug 17 15:21:36 f0d9f9a systemd[1]: Failed to start Balena supervisor.

Note the long window with no output (Aug 16 02:16 to Aug 17 15:16), right before I tried to restart it.
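
In case the engine logged something during that gap, the two journals can be interleaved over the same window (a rough sketch; the relative time would need adjusting to cover the Aug 16 to Aug 17 period):

# Show supervisor and engine logs side by side for the silent window
journalctl -u resin-supervisor -u balena-engine --since "3 days ago" --no-pager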

I’m guessing a reboot will address the issue, but I’m wondering whether more can be gleaned from the logs, or whether there is another way of fixing this.

We recently added some changes to the supervisor so the reason for a healthcheck failure is logged accurately: https://github.com/balena-io/balena-supervisor/blob/master/CHANGELOG.md#v1150

What concerns me here is that the supervisor did not report a failing healthcheck, which leads me to believe that the engine may have gotten itself into a strange state. The best thing to do in this situation is to attach the diagnostics output, as it gives us a more holistic view of the device. If there is anything sensitive in your application logs, it would be better to private message the diagnostics file to a team member, who will attach it to the ticket.

Is the device still in a state where the supervisor is being restarted?
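
If it is, one thing that might be worth trying before a reboot is stopping the service, force-removing the stuck supervisor container, and starting the service again so the container can be recreated. This is only a rough sketch: the force-remove may hang the same way docker kill did, in which case a reboot is probably the cleanest option, and update-resin-supervisor can be used if the container is not recreated on start.

systemctl stop resin-supervisor
# Force-remove the stuck container; starting the service should bring it back
balena rm -f resin_supervisor
systemctl start resin-supervisor
# Watch the supervisor come back up
journalctl -fu resin-supervisor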

Hi, we have two open threads on our support system for the same problem. I will close this one and continue the conversation on the other one. I will also ping you there to make sure we are in sync.