Hi,
I noticed one of our devices had an unhealthy heartbeat (VPN only). Running diagnostics showed failures for check_container_engine, check_service_restarts, and check_supervisor. Trying to restart the supervisor via systemctl restart resin-supervisor failed, and even docker kill <supervisor_container_id> hung. Restarting balena-engine via systemctl restarted our application but not the supervisor:
root@f0d9f9a:~# systemctl restart balena-engine
root@f0d9f9a:~# balena ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d33522f9c72f f4ea140e4eea "/usr/bin/entry.sh /…" 4 days ago Up 14 seconds solmon_2609638_1495878
8f0fc1e48a35 balena/rpi-supervisor:v10.8.0 "./entry.sh" 4 days ago Up 3 days (unhealthy) resin_supervisor
Output of journalctl -u resin-supervisor for the period of interest:
Aug 16 02:11:20 f0d9f9a resin-supervisor[10850]: [api] GET /v1/healthy 200 - 56.291 ms
Aug 16 02:16:49 f0d9f9a resin-supervisor[10850]: [api] GET /v1/healthy 200 - 45.236 ms
Aug 17 15:16:55 f0d9f9a systemd[1]: resin-supervisor.service: Stopping timed out. Terminating.
Aug 17 15:16:55 f0d9f9a systemd[1]: resin-supervisor.service: Main process exited, code=killed, status=15/TERM
Aug 17 15:16:55 f0d9f9a systemd[1]: resin-supervisor.service: Control process exited, code=killed, status=15/TERM
Aug 17 15:18:25 f0d9f9a systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Aug 17 15:18:25 f0d9f9a systemd[1]: resin-supervisor.service: Killing process 10998 (balena) with signal SIGKILL.
Aug 17 15:18:25 f0d9f9a systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
Aug 17 15:19:55 f0d9f9a systemd[1]: resin-supervisor.service: Start-pre operation timed out. Terminating.
Aug 17 15:19:55 f0d9f9a systemd[1]: resin-supervisor.service: Control process exited, code=killed, status=15/TERM
Aug 17 15:19:55 f0d9f9a resin-supervisor[5799]: active
Aug 17 15:19:55 f0d9f9a systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
Aug 17 15:19:55 f0d9f9a systemd[1]: Failed to start Balena supervisor.
Aug 17 15:21:36 f0d9f9a systemd[1]: resin-supervisor.service: Start-pre operation timed out. Terminating.
Aug 17 15:21:36 f0d9f9a systemd[1]: resin-supervisor.service: Control process exited, code=killed, status=15/TERM
Aug 17 15:21:36 f0d9f9a resin-supervisor[6085]: active
Aug 17 15:21:36 f0d9f9a systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
Aug 17 15:21:36 f0d9f9a systemd[1]: Failed to start Balena supervisor.
Note the long window with no output (from Aug 16 02:16:49 to Aug 17 15:16:55), which ends right when I tried to restart it.
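In case it helps anyone triaging something similar, here is a minimal Python sketch for locating a silent window like that in a saved log. It assumes journalctl's default short timestamp format (the first 15 characters of each line). The names largest_gap and year are made up for illustration, and the year is only a placeholder because the short format omits it.

```python
from datetime import datetime


def largest_gap(lines, year=2000):
    """Return (gap, before, after) for the widest silence between consecutive lines.

    `year` is a placeholder (assumption): short-format journal timestamps
    carry no year, so a gap that crosses New Year would be computed wrongly.
    """
    stamps = []
    for line in lines:
        try:
            # The first 15 characters look like "Aug 16 02:16:49".
            stamps.append(
                datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
            )
        except ValueError:
            # Skip lines that do not start with a timestamp.
            continue
    if len(stamps) < 2:
        return None
    pairs = zip(stamps, stamps[1:])
    return max(((b - a, a, b) for a, b in pairs), key=lambda t: t[0])


# Three lines copied from the journal output above.
sample = [
    "Aug 16 02:11:20 f0d9f9a resin-supervisor[10850]: [api] GET /v1/healthy 200 - 56.291 ms",
    "Aug 16 02:16:49 f0d9f9a resin-supervisor[10850]: [api] GET /v1/healthy 200 - 45.236 ms",
    "Aug 17 15:16:55 f0d9f9a systemd[1]: resin-supervisor.service: Stopping timed out. Terminating.",
]

gap, before, after = largest_gap(sample)
print(gap)  # 1 day, 13:00:06
```

So the supervisor logged nothing for roughly 37 hours after its last successful health check response.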
I’m guessing a reboot will address the issue, but I’m wondering whether more can be gleaned from the logs, or whether there is another way to fix this.