What does status "unhealthy" mean and why won't this container pause?

I’m trying to stop or pause the main service on this device:
https://dashboard.balena-cloud.com/devices/8057fa160e7781f273c6913344eb70e6/summary
The dashboard didn’t seem to be able to do it, so I tried opening a terminal to the container, but that failed. So I connected to the host OS and listed the containers:

balena container list
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS                  PORTS               NAMES
848d14f2b822        0655bc9bf087                    "/usr/bin/entry.sh /…"   4 weeks ago         Up 2 days                                   main_751517_730175
0fc24b33ab30        resin/amd64-supervisor:v8.0.0   "./entry.sh"             2 months ago        Up 2 days (unhealthy)                       resin_supervisor

Then I tried to pause the container:

root@8057fa1:~# balena pause 848d14f2b822
Error response from daemon: Cannot pause container 
848d14f2b8224ea8da4d4ea068d0bef6ffd7ddcbd68f607963c2d616afea823e: connection error: 
desc = "transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: 
connect: connection refused": unknown

Stopping the container also fails:

root@8057fa1:~# balena stop 848d14f2b822
Error response from daemon: cannot stop container: 848d14f2b822: 
Cannot kill container 848d14f2b8224ea8da4d4ea068d0bef6ffd7ddcbd68f607963c2d616afea823e: 
connection error: desc ="transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: 
connect: connection refused": unknown

  1. What does “unhealthy” mean with respect to the resin_supervisor?
  2. Other than restarting the device, how do I pause the container?

Hi Jason,

The supervisor can become unhealthy for various reasons, and it’s hard to be specific without seeing more of the supervisor logs. Could you check them to see what’s going on?

journalctl -u resin-supervisor -fn100
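
As background on the first question: in Docker-compatible engines, “unhealthy” in the STATUS column means the container’s configured health check has failed several times in a row. The engine keeps a record of the most recent check results, which can be read alongside the journal. A minimal sketch, assuming the host OS `balena` command accepts the same `inspect --format` options as Docker; `container_health` is a hypothetical helper name:

```shell
# Print the health-check record the engine keeps for a container.
# Assumption: `balena` on the host OS is Docker-compatible, so
# `inspect --format` with a Go template works as it does in Docker.
container_health() {
    balena inspect --format '{{json .State.Health}}' "$1"
}

# Usage on the host OS:
# container_health resin_supervisor
```

If the engine itself is in a bad state, this command may fail too, which is useful information in its own right.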

From your first text block, I see that the device is running supervisor v8.0.0. We have released newer versions since then. Is it possible to upgrade the OS of this device?

The ideal solution here would be a clean setup from scratch. That said, I understand that you might not want that if the device is up and running in production. So if you cannot do an update, it would help to look further into the logs.

Cheers…

I’ll look into the logs; however, I have also given support a week of access.

  1. We cannot perform a clean setup from scratch.
  2. There will always be “new and improved” versions of everything. If the only way to fix a problem is to create a new installation and reflash the device, we have to replace the device first, bring it back to a clean work area, and then open up the enclosure. If that were the case, we’d be looking for an alternative to balena. Sorry if that sounds ranty, but we can reset remotely, reboot remotely, even power off and on remotely. Flashing a new image, though, is a PITA.

https://dashboard.balena-cloud.com/devices/8057fa160e7781f273c6913344eb70e6/summary

Here are the logs:

-- Logs begin at Tue 2019-02-12 15:58:38 UTC. --
Feb 12 16:16:37 8057fa1 systemd[1]: Stopped Resin supervisor.
Feb 12 16:16:37 8057fa1 systemd[1]: Starting Resin supervisor...
Feb 12 16:16:39 8057fa1 resin-supervisor[8806]: Error response from daemon: cannot stop container: resin_supervisor: Cannot kill container 0fc24b33ab30b89f59ffc074164aa692216a3f2567432715ba90131b5398fb0c: connection error: desc = "transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused": unknown
Feb 12 16:16:39 8057fa1 resin-supervisor[8817]: active
Feb 12 16:16:39 8057fa1 systemd[1]: Started Resin supervisor.
Feb 12 16:19:39 8057fa1 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
Feb 12 16:19:39 8057fa1 systemd[1]: resin-supervisor.service: Killing process 8818 (start-resin-sup) with signal SIGABRT.
Feb 12 16:19:39 8057fa1 systemd[1]: resin-supervisor.service: Killing process 8819 (exe) with signal SIGABRT.
Feb 12 16:19:39 8057fa1 systemd[1]: resin-supervisor.service: Killing process 8853 (balena) with signal SIGABRT.
Feb 12 16:19:39 8057fa1 systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
Feb 12 16:21:10 8057fa1 systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Feb 12 16:21:10 8057fa1 systemd[1]: resin-supervisor.service: Killing process 8853 (balena) with signal SIGKILL.
Feb 12 16:21:10 8057fa1 systemd[1]: resin-supervisor.service: Failed with result 'watchdog'.
Feb 12 16:21:20 8057fa1 systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
Feb 12 16:21:20 8057fa1 systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 996.
Feb 12 16:21:20 8057fa1 systemd[1]: Stopped Resin supervisor.
Feb 12 16:21:20 8057fa1 systemd[1]: Starting Resin supervisor...
Feb 12 16:21:22 8057fa1 resin-supervisor[9048]: Error response from daemon: cannot stop container: resin_supervisor: Cannot kill container 0fc24b33ab30b89f59ffc074164aa692216a3f2567432715ba90131b5398fb0c: connection error: desc = "transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused": unknown
Feb 12 16:21:22 8057fa1 resin-supervisor[9057]: active
Feb 12 16:21:22 8057fa1 systemd[1]: Started Resin supervisor.
Feb 12 16:24:22 8057fa1 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
Feb 12 16:24:22 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9058 (start-resin-sup) with signal SIGABRT.
Feb 12 16:24:22 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9059 (exe) with signal SIGABRT.
Feb 12 16:24:22 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9093 (balena) with signal SIGABRT.
Feb 12 16:24:22 8057fa1 systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
Feb 12 16:25:52 8057fa1 systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Feb 12 16:25:52 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9093 (balena) with signal SIGKILL.
Feb 12 16:25:52 8057fa1 systemd[1]: resin-supervisor.service: Failed with result 'watchdog'.
Feb 12 16:26:02 8057fa1 systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
Feb 12 16:26:02 8057fa1 systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 997.
Feb 12 16:26:02 8057fa1 systemd[1]: Stopped Resin supervisor.
Feb 12 16:26:02 8057fa1 systemd[1]: Starting Resin supervisor...
Feb 12 16:26:04 8057fa1 resin-supervisor[9285]: Error response from daemon: cannot stop container: resin_supervisor: Cannot kill container 0fc24b33ab30b89f59ffc074164aa692216a3f2567432715ba90131b5398fb0c: connection error: desc = "transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused": unknown
Feb 12 16:26:04 8057fa1 resin-supervisor[9296]: active
Feb 12 16:26:04 8057fa1 systemd[1]: Started Resin supervisor.
Feb 12 16:29:04 8057fa1 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
Feb 12 16:29:04 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9297 (start-resin-sup) with signal SIGABRT.
Feb 12 16:29:04 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9298 (exe) with signal SIGABRT.
Feb 12 16:29:04 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9332 (balena) with signal SIGABRT.
Feb 12 16:29:04 8057fa1 systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
Feb 12 16:30:35 8057fa1 systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Feb 12 16:30:35 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9332 (balena) with signal SIGKILL.
Feb 12 16:30:35 8057fa1 systemd[1]: resin-supervisor.service: Failed with result 'watchdog'.
Feb 12 16:30:45 8057fa1 systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
Feb 12 16:30:45 8057fa1 systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 998.
Feb 12 16:30:45 8057fa1 systemd[1]: Stopped Resin supervisor.
Feb 12 16:30:45 8057fa1 systemd[1]: Starting Resin supervisor...
Feb 12 16:30:47 8057fa1 resin-supervisor[9506]: Error response from daemon: cannot stop container: resin_supervisor: Cannot kill container 0fc24b33ab30b89f59ffc074164aa692216a3f2567432715ba90131b5398fb0c: connection error: desc = "transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused": unknown
Feb 12 16:30:47 8057fa1 resin-supervisor[9515]: active
Feb 12 16:30:47 8057fa1 systemd[1]: Started Resin supervisor.
Feb 12 16:33:47 8057fa1 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
Feb 12 16:33:47 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9516 (start-resin-sup) with signal SIGABRT.
Feb 12 16:33:47 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9517 (exe) with signal SIGABRT.
Feb 12 16:33:47 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9551 (balena) with signal SIGABRT.
Feb 12 16:33:47 8057fa1 systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
Feb 12 16:35:17 8057fa1 systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Feb 12 16:35:17 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9551 (balena) with signal SIGKILL.
Feb 12 16:35:17 8057fa1 systemd[1]: resin-supervisor.service: Failed with result 'watchdog'.
Feb 12 16:35:27 8057fa1 systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
Feb 12 16:35:27 8057fa1 systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 999.
Feb 12 16:35:27 8057fa1 systemd[1]: Stopped Resin supervisor.
Feb 12 16:35:27 8057fa1 systemd[1]: Starting Resin supervisor...
Feb 12 16:35:29 8057fa1 resin-supervisor[9738]: Error response from daemon: cannot stop container: resin_supervisor: Cannot kill container 0fc24b33ab30b89f59ffc074164aa692216a3f2567432715ba90131b5398fb0c: connection error: desc = "transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused": unknown
Feb 12 16:35:29 8057fa1 resin-supervisor[9746]: active
Feb 12 16:35:29 8057fa1 systemd[1]: Started Resin supervisor.
Feb 12 16:38:29 8057fa1 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
Feb 12 16:38:29 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9747 (start-resin-sup) with signal SIGABRT.
Feb 12 16:38:29 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9748 (exe) with signal SIGABRT.
Feb 12 16:38:29 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9784 (balena) with signal SIGABRT.
Feb 12 16:38:29 8057fa1 systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
Feb 12 16:40:00 8057fa1 systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Feb 12 16:40:00 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9784 (balena) with signal SIGKILL.
Feb 12 16:40:00 8057fa1 systemd[1]: resin-supervisor.service: Failed with result 'watchdog'.
Feb 12 16:40:10 8057fa1 systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
Feb 12 16:40:10 8057fa1 systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 1000.
Feb 12 16:40:10 8057fa1 systemd[1]: Stopped Resin supervisor.
Feb 12 16:40:10 8057fa1 systemd[1]: Starting Resin supervisor...
Feb 12 16:40:12 8057fa1 resin-supervisor[9983]: Error response from daemon: cannot stop container: resin_supervisor: Cannot kill container 0fc24b33ab30b89f59ffc074164aa692216a3f2567432715ba90131b5398fb0c: connection error: desc = "transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused": unknown
Feb 12 16:40:12 8057fa1 resin-supervisor[9992]: active
Feb 12 16:40:12 8057fa1 systemd[1]: Started Resin supervisor.
Feb 12 16:43:12 8057fa1 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
Feb 12 16:43:12 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9993 (start-resin-sup) with signal SIGABRT.
Feb 12 16:43:12 8057fa1 systemd[1]: resin-supervisor.service: Killing process 9994 (exe) with signal SIGABRT.
Feb 12 16:43:12 8057fa1 systemd[1]: resin-supervisor.service: Killing process 10027 (balena) with signal SIGABRT.
Feb 12 16:43:12 8057fa1 systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
Feb 12 16:44:42 8057fa1 systemd[1]: resin-supervisor.service: State 'stop-final-sigterm' timed out. Killing.
Feb 12 16:44:42 8057fa1 systemd[1]: resin-supervisor.service: Killing process 10027 (balena) with signal SIGKILL.
Feb 12 16:44:42 8057fa1 systemd[1]: resin-supervisor.service: Failed with result 'watchdog'.
Feb 12 16:44:52 8057fa1 systemd[1]: resin-supervisor.service: Service hold-off time over, scheduling restart.
Feb 12 16:44:52 8057fa1 systemd[1]: resin-supervisor.service: Scheduled restart job, restart counter is at 1001.
Feb 12 16:44:52 8057fa1 systemd[1]: Stopped Resin supervisor.
Feb 12 16:44:52 8057fa1 systemd[1]: Starting Resin supervisor...
Feb 12 16:44:54 8057fa1 resin-supervisor[12692]: Error response from daemon: cannot stop container: resin_supervisor: Cannot kill container 0fc24b33ab30b89f59ffc074164aa692216a3f2567432715ba90131b5398fb0c: connection error: desc = "transport: dial unix /var/run/balena-engine/containerd/balena-engine-containerd.sock: connect: connection refused": unknown
Feb 12 16:44:54 8057fa1 resin-supervisor[12738]: active
Feb 12 16:44:54 8057fa1 systemd[1]: Started Resin supervisor.
Feb 12 16:47:55 8057fa1 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
Feb 12 16:47:55 8057fa1 systemd[1]: resin-supervisor.service: Killing process 12739 (start-resin-sup) with signal SIGABRT.
Feb 12 16:47:55 8057fa1 systemd[1]: resin-supervisor.service: Killing process 12740 (exe) with signal SIGABRT.
Feb 12 16:47:55 8057fa1 systemd[1]: resin-supervisor.service: Killing process 12779 (balena) with signal SIGABRT.
Feb 12 16:47:55 8057fa1 systemd[1]: resin-supervisor.service: Main process exited, code=dumped, status=6/ABRT
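
The excerpt above repeats the same cycle: the supervisor start script cannot stop the old container because the containerd socket refuses connections, the 3-minute watchdog fires, and systemd schedules another restart. A rough way to tally that pattern from journal text is sketched below; `summarize_supervisor_log` is a hypothetical helper, and the patterns it matches are taken from the log lines shown here:

```shell
# Summarize supervisor restart loops from journal text read on stdin.
# The match patterns are copied from the log lines in this thread.
summarize_supervisor_log() {
    awk '
        /Watchdog timeout/ { timeouts++ }
        /restart counter is at/ { counter = $NF; sub(/\.$/, "", counter) }
        /containerd\.sock: connect: connection refused/ { refused++ }
        END {
            printf "watchdog timeouts: %d\n", timeouts
            printf "containerd connection refused: %d\n", refused
            printf "last restart counter: %s\n", (counter == "" ? "n/a" : counter)
        }
    '
}

# Usage on the host OS:
# journalctl -u resin-supervisor --no-pager | summarize_supervisor_log
```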

Thanks for the information! Is it always reproducible that the container cannot be stopped?

A new installation should almost never be needed. We asked about it mostly for troubleshooting reasons, as we didn’t know what your setup and context are like. For dev versions we don’t yet support self-service updates, but we are working on it, and our team can do remote OS updates for dev devices too. Remote supervisor updates are always an option as well (and much more lightweight).

We are checking out the device and will keep you posted. Is it okay to try to stop the container with the dashboard (as I think that’s what you meant by “pausing”)?

A quick look suggests that balenaEngine is not doing well: containerd, which the engine requires, seems to be down. We will try restarting the engine and see if that fixes things.
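
For anyone who runs into the same symptom, the error text from a failed stop or pause is enough to tell this case apart from an ordinary container problem. A minimal sketch, where `engine_needs_restart` is a hypothetical helper and the service name `balena` in the usage comment is an assumption that may differ between OS versions:

```shell
# Return success if an engine error message shows that the containerd
# socket is refusing connections (the failure seen in this thread).
engine_needs_restart() {
    # $1: error text captured from a failed `balena stop` or `balena pause`
    case "$1" in
        *"containerd.sock: connect: connection refused"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the host OS (the service name "balena" is an assumption):
# err=$(balena stop 848d14f2b822 2>&1) || {
#     engine_needs_restart "$err" && systemctl restart balena
# }
```

Restarting the engine interrupts every container on the device, so it is worth confirming the diagnosis first.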

We’ve restarted balenaEngine and the device seems to be working properly now. Because there are a lot of logs on the device, it’s unclear what happened, but we’ll keep looking into it. If you hit such an issue again, please let us know; maybe we can get more logs.

Let us know if you have any problems with this device now!

In the meantime, we are checking with the OS team whether we can add further health checks that would allow us to recover automatically from such issues.

Opened an issue for more robust handling of this.

Thanks @imrehg! Yes, I tried several times and couldn’t stop the service, and I also tried to stop and pause it using the host OS command line. The only place I saw the unhealthy status was when listing containers.
Perhaps the unhealthy status can be forwarded to the dashboard?
Is there any way to get the container health status other than connecting to the host OS through the balena VPN (dashboard)?
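
One possible avenue, offered as a sketch rather than a confirmed answer from this thread: the supervisor exposes an HTTP API to application containers, and its documentation describes a `/v1/healthy` endpoint that returns 200 when the supervisor considers itself healthy. The snippet assumes it runs inside an application container where `BALENA_SUPERVISOR_ADDRESS` is set; `supervisor_health_code` is a hypothetical helper name. Note that this reports the supervisor’s own health, not the engine’s health-check status for arbitrary containers.

```shell
# Print the HTTP status code returned by the supervisor health endpoint.
# Assumptions: run inside an app container with BALENA_SUPERVISOR_ADDRESS
# set, and the supervisor version provides /v1/healthy.
supervisor_health_code() {
    curl -s -o /dev/null -w '%{http_code}' \
        "${BALENA_SUPERVISOR_ADDRESS}/v1/healthy"
}

# Usage inside the application container:
# code=$(supervisor_health_code)
# [ "$code" = "200" ] || echo "supervisor reports unhealthy (HTTP $code)"
```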