Device services stuck at "stopping" state

Hello,

On some devices, all services are getting stuck in a “stopping” state. The device doesn’t accept a restart command, either for a single service or for the whole device.

This was previously working, and I have other devices running the same code that are still working fine.

Raspberry Pi Zero
Device uuid: 58e349441a79591d89053d8a20515f56
Supervisor version: 9.0.1

Thank you

Hey deoqc,
Could you please grant support access to the device so that a support agent can have a look at what might be causing this issue?

Support access granted.

I’ve changed the device; the new UUID is a133379cc20a2cd61adb25c2940c6306.

Thank you, taking a look now.


I took a look at your device and it seems containerd was misbehaving. I restarted that and was monitoring the startup logs when I got dropped out of the shell due to network issues, and I haven’t been able to get back in. I’ll continue to try, and I’d like to find out why containerd wasn’t working correctly, but currently I can’t access the device.

@deoqc, I was also looking at the device for at least half an hour, and some factual observations are:

  • It is rebooting itself frequently, as evidenced both by the “uptime” command on the host OS and by the “last online” information on the web dashboard. I reckon it rebooted 5 times in around half an hour, at very irregular intervals (once it stayed up for at least 13 minutes, once for less than a minute). When the terminal disconnected, at first I thought it was a networking issue, but then I realised the device was actually rebooting itself.
  • Memory usage as shown by the “free” command fluctuates a lot, though that isn’t necessarily wrong in itself. At one point, just before one of the self-reboots, there was only about 10MB (roughly 2%) of available memory, taking the sum of the “free” and “cached” columns. (This device only has 512MB of memory.) As the device doesn’t have swap (which is correct/recommended, to avoid premature SD card wear), running so low on memory runs the risk of the Linux kernel having to resort to force-killing processes to free memory. I don’t have evidence that this actually happened, though.
  • Browsing the logs (commands below), I saw that the balena supervisor was frequently restarting, including being killed by the OS watchdog for not responding to its health checks in time. Less frequently, balenaEngine itself was also being killed and restarted, with Go tracebacks in the logs.
journalctl -ab 
journalctl -au balena
journalctl -au resin-supervisor
dmesg

Excerpt from the balena logs:

May 04 00:22:45 a133379 systemd[1]: balena.service: Watchdog timeout (limit 3min)!
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 694 (balenad) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 696 (exe) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1064 (balena-engine-c) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1068 (balena-engine-c) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1290 (balena-engine-c) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1406 (balena-engine-c) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1497 (balena-healthch) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1521 (balena) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1089 (sh) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1206 (mosquitto) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1097 (sh) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1220 (node) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1350 (python3.6) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1351 (python3.6) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1352 (python3.6) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1447 (bash) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1527 (journalctl) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1305 (node) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1332 (dbus-daemon) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1338 (avahi-daemon) with signal SIGABRT.
May 04 00:22:45 a133379 systemd[1]: balena.service: Killing process 1339 (avahi-daemon) with signal SIGABRT.

The frequent restart of the balena supervisor and balenaEngine explains why services would get stuck or fail to update. In a healthy system, the supervisor and balenaEngine would normally never restart. This system was “very unhealthy”.
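If you want to see this on a device yourself, a quick indicator from a host OS terminal is the “Active: … since …” timestamp that systemctl reports for each unit: a start time much more recent than the device’s uptime means the unit has been restarted. Recent systemd versions also expose a restart counter (I’m assuming the systemd shipped with this OS version is new enough for it):

systemctl status balena.service resin-supervisor.service
systemctl show balena.service -p NRestarts    # restarts since boot (recent systemd only)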

As for the reason why it was unhealthy like that and how to fix it… I considered the possibility of filesystem corruption, but the following host OS test was OK, so I don’t have evidence of corruption.

$ md5sum --quiet -c /resinos.fingerprint

I also considered that it was simply running too low on free memory, but I have not really found evidence in the logs or dmesg that the OS was triggering OOM (out of memory) events, which would kill processes to free memory.
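For completeness, this is the kind of check involved: OOM killer activity shows up in the kernel log, so greps along these lines on the host OS should surface it (without persistent logging they only cover the current boot):

dmesg | grep -iE "out of memory|oom"
journalctl -k | grep -iE "killed process|oom"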

I also could not find evidence in dmesg of common-ish hardware problems, like under-voltage events or errors related to the RAM or CPU. This doesn’t prove that the hardware is OK, simply that I could not find evidence that it’s broken… 🙂
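For reference, under-voltage events on a Raspberry Pi are reported in the kernel log, so a quick way to look for them on the host OS is:

dmesg | grep -iE "under.?voltage|throttl"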

So, unfortunately I don’t have an answer as to why these frequent watchdog failures and device reboots are happening. But the fact that they are happening at least explains the side effects you observed on your app services: getting stuck, failing to update, and not acting on dashboard restart/reboot commands (as those depend on a running, healthy balena supervisor).

@deoqc, I have a couple of suggestions:

  • Enable “persistent logging” through the Device Configuration page of the web dashboard. Logs are normally lost on reboot (to avoid premature SD card wear) unless persistent logging is enabled, and it’s possible that a hint as to what is causing the failures and self-reboots is being logged immediately before each reboot. After you enable the setting, open a host OS terminal and issue the ‘reboot’ command (single word), as the setting only takes effect after a reboot. This manual reboot is not normally necessary, because the supervisor would do it automatically, but in this case the supervisor may not be running. Once the device comes back up, see the journalctl commands after this list for inspecting the logs of previous boots.

  • If you can, as an experiment, try reducing the memory pressure on the system, perhaps by disabling some of your app’s functionality (or not running one of the containers). Reboot the device, then:

    • Check the free memory with the ‘free’ command on the host OS (a minute or so after boot, once memory usage has settled down). To keep watching it: ‘watch -n 3 free’ (quit with CTRL-C).
    • Check the supervisor logs to see if the supervisor is still being restarted. Specifically, look for the following in the logs:
$ journalctl -fau resin-supervisor
...
May 04 10:51:52 a133379 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
May 04 10:51:52 a133379 systemd[1]: resin-supervisor.service: Killing process 2289 (start-resin-sup) with signal SIGABRT.
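
Regarding the persistent logging suggestion above: once it is enabled and the device has rebooted, journalctl can read the journals of previous boots, which is where any hint logged just before a self-reboot would show up. For example:

journalctl --list-boots              # one line per boot still present in the journal
journalctl -b -1 -e                  # jump to the end of the previous boot's log
journalctl -b -1 -au balena          # previous boot, balenaEngine unit only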

By the way, there are a few errors that I spotted in the supervisor logs, such as “events.js Unhandled ‘error’ event” and “cannot stop container: resin_supervisor”, but they don’t happen consistently (sometimes the supervisor starts without those errors, only to be killed by the watchdog a few minutes later), so my working assumption is that they are a consequence rather than the cause, or at least incidental to the root cause of the problem affecting this device.

Let us know how you get on with these suggestions.

Thanks for the suggestions @pdcastro @CameronDiver, I’m looking at them.

Now, with this info, I think the problem might be caused by the application caching data (in one of the two containers) while the connection is down, so that it can be sent when the connection comes back.

I would like to see the output of balena stats from when the device is without a network connection. Is there a practical way to log this, given that it has to run on the host? That should give a clear picture of the memory pressure from the containers, right?

One more thing: I was hoping I could limit the memory of each service, to ensure that the supervisor keeps working even when a service does not.

That would also make it clear when a service is taking more memory (or any other resource, for that matter) than it should.

@deoqc take a look at this thread - maybe some of the information will be useful: RAM/Memory Management

@deoqc, with balena stats specifically, this comment suggests some command-line options for redirecting the output to a CSV file. For monitoring in general, many balena users run a ‘datadog’ container, as described in this blog post.

One complication, however, is that datadog itself takes some system resources, and it may be tricky to fit comfortably on a Pi Zero that is already struggling with memory. I think it’s more commonly used on devices with 1GB of RAM, such as the RPi 3.
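
If datadog is too heavy for this device, a lightweight alternative is to log balena stats to a CSV file yourself from a host OS terminal. A minimal sketch, assuming balenaEngine accepts the same --no-stream and --format flags as Docker’s stats command (the output path and interval are my own choices):

# Append one timestamped snapshot per minute to the persistent data partition
while true; do
  balena stats --no-stream --format "{{.Name}},{{.MemUsage}},{{.MemPerc}},{{.CPUPerc}}" \
    | sed "s/^/$(date -u +%FT%TZ),/" >> /mnt/data/container-stats.csv
  sleep 60
done

Since this runs entirely in the host OS, it covers the “has to run on the host” part of your question, and writing under /mnt/data keeps the file on the data partition so it survives reboots. Keep in mind that the loop stops if the terminal session drops, and that frequent writes add some SD card wear.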

Besides monitoring, you can look at “enforcement” with the docker-compose mem_limit setting and the Dockerfile healthcheck setting, as mentioned in the thread linked by Chris above.
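
For example, adding a line like ‘mem_limit: 192m’ under a service in your docker-compose.yml (the exact value is just an illustration, not a recommendation) asks the engine to cap that container’s memory. One way to confirm from the host OS that a cap is in effect, assuming your container shows up in ‘balena ps’ under a name like main_1 (substitute your actual container name):

balena inspect --format '{{.HostConfig.Memory}}' main_1    # limit in bytes; 0 means no limit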