Watchdog restarts the device during a release update

My raspberry pi is stuck in a restart loop, unable to update to the newest release.
Inside the balenaCloud dashboard I can see the update reaching about 7% progress before a reboot happens.
The release which I have on the device right now doesn’t have a correct docker CMD (my mistake), so the supervisor watchdog fails, but I’m unable to update because of the reboots (which happens roughly every 8 minutes).

You could add an option in the dashboard to disable the watchdog reboots – or to not reboot the device while an update is in progress…

Running “journalctl -a” on the device gives the following (truncated):

Jan 14 21:04:06 6f58c4d fcf800c6d1da[830]: [info] Internet Connectivity: OK
Jan 14 21:04:06 6f58c4d resin-supervisor[1360]: [info] Internet Connectivity: OK
Jan 14 21:08:34 6f58c4d fcf800c6d1da[830]: [api] GET /v1/healthy 200 - 32.079 ms
Jan 14 21:08:34 6f58c4d resin-supervisor[1360]: [api] GET /v1/healthy 200 - 32.079 ms
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Watchdog timeout (limit 6min)!
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 830 (balenad) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 837 (exe) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 886 (balena-engine-c) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 1532 (balena-engine-c) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 1641 (exe) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d balenad[830]: SIGABRT: abort
Jan 14 21:09:34 6f58c4d balenad[830]: PC=0x45a544 m=0 sigcode=0
Jan 14 21:09:34 6f58c4d balenad[830]: goroutine 0 [idle]:
Jan 14 21:09:34 6f58c4d balenad[830]: runtime.futex(0x23fc528, 0x0, 0x0, 0x0, 0x4400000000, 0x4387dc, 0x442005c948, 0x0, 0x0, 0x4338d8, …)
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/sys_linux_arm64.s:321 +0x1c
Jan 14 21:09:34 6f58c4d balenad[830]: runtime.futexsleep(0x23fc528, 0x0, 0xffffffffffffffff)
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 1673 (balena-healthch) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/os_linux.go:45 +0x3c
Jan 14 21:09:34 6f58c4d balenad[830]: runtime.notesleep(0x23fc528)
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/lock_futex.go:151 +0x84
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 1701 (balena) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d balenad[830]: runtime.stoplockedm(SIGABRT: abort)
Jan 14 21:09:34 6f58c4d balenad[830]: PC= 0x45a544/usr/lib/go/src/runtime/proc.go m=:02101 sigcode= +00x60
Jan 14 21:09:34 6f58c4d balenad[830]: goroutine runtime.schedule0( [)
Jan 14 21:09:34 6f58c4d balenad[830]: idle ]:
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/proc.go:2493runtime.futex +0x274(0x23fc528
Jan 14 21:09:34 6f58c4d balenad[830]: , runtime.park_m0x0(, 0x4420093c800x0)
Jan 14 21:09:34 6f58c4d balenad[830]: , 0x0/usr/lib/go/src/runtime/proc.go, :0x44000000002604, +0x2000000000x90,
Jan 14 21:09:34 6f58c4d balenad[830]: 0x1, runtime.mcall0x0(0x0)
Jan 14 21:09:34 6f58c4d balenad[830]: , 0x0 , /usr/lib/go/src/runtime/asm_arm64.s0x4331c8:, …169)
Jan 14 21:09:34 6f58c4d balenad[830]: + 0x50/usr/lib/go/src/runtime/sys_linux_arm64.s
Jan 14 21:09:34 6f58c4d balenad[830]: :321
Jan 14 21:09:34 6f58c4d balenad[830]: +0x1cgoroutine
Jan 14 21:09:34 6f58c4d balenad[830]: 1 [runtime.futexsleepchan receive(, 0x23fc5288 minutes, ]:
Jan 14 21:09:34 6f58c4d balenad[830]: 0x0, 0xffffffffffffffff)
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/os_linux.go:45 +0x3c
Jan 14 21:09:34 6f58c4d balenad[830]: runtime.notesleep(0x23fc528)
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/lock_futex.go:151 +0x84

And also (more recent log):

Jan 14 22:23:40 6f58c4d resin-supervisor[9815]: time=“2020-01-14T22:23:40.324761888Z” level=error msg=“error waiting for container: unexpected EOF”
Jan 14 22:23:40 6f58c4d systemd[1]: balena.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 14 22:23:40 6f58c4d systemd[1]: balena.service: Failed with result ‘watchdog’.
Jan 14 22:23:40 6f58c4d systemd[1]: resin-supervisor.service: Main process exited, code=exited, status=125/n/a
Jan 14 22:23:40 6f58c4d systemd[1]: resin-supervisor.service: Failed with result ‘exit-code’.

Terminal from the balenaCloud dashboard:

14.01.20 22:55:53 (+0200) Supervisor starting
14.01.20 22:56:02 (+0200) Downloading image ‘registry2.balena-cloud.com/v2/fc244d3e7c9552ed3184d0b87abcda2f@sha256:3c6977f8f0f4e7967947cdc7bf31052fe4b4525c30e61eed8e272c8c0b204021
14.01.20 22:58:12 (+0200) Rebooting
14.01.20 23:03:50 (+0200) Supervisor starting
14.01.20 23:04:03 (+0200) Downloading image ‘registry2.balena-cloud.com/v2/fc244d3e7c9552ed3184d0b87abcda2f@sha256:3c6977f8f0f4e7967947cdc7bf31052fe4b4525c30e61eed8e272c8c0b204021
14.01.20 23:11:12 (+0200) Supervisor starting
14.01.20 23:11:31 (+0200) Downloading image ‘registry2.balena-cloud.com/v2/fc244d3e7c9552ed3184d0b87abcda2f@sha256:3c6977f8f0f4e7967947cdc7bf31052fe4b4525c30e61eed8e272c8c0b204021
14.01.20 23:19:01 (+0200) Supervisor starting
14.01.20 23:19:18 (+0200) Downloading image ‘registry2.balena-cloud.com/v2/fc244d3e7c9552ed3184d0b87abcda2f@sha256:3c6977f8f0f4e7967947cdc7bf31052fe4b4525c30e61eed8e272c8c0b204021

Hi @claudiu725,

can you provide some extra info please? I would like to know what is your device’s type(RaspberryPi 3, 4 or other). Which balenaOS version are you running and which Supervisor version are you running (you can find all of these in your balenaCloud dashboard).

Best Regards,
Marios

Hi @mbalamat

Type
Raspberry Pi 3 (using 64bit OS) (BETA)

Host OS Version
balenaOS 2.46.1+rev1

Supervisor Version
10.6.27

What you can do is SSH into the host, find the offending container and its image, forcefully kill the container and remove the image and let the Supervisor pull from scratch. You may have to fight a bit with the Supervisor trying to do its thing so try to be quick. Alternatively, a less offensive strategy which might work is to use the “kill-then-download” update strategy.

Thanks!

I’m having the same issue while trying to add a device to a new 9 services app.
While downloading, the available memory goes from 400M to about 16M (watch -n 1 free), then the supervisor crashes and all downloads are aborted.
After 2-3 minutes, the supervisor boots back up as memory is freed. All downloads start again from scratch. This loop has been repeating for days now.

Looking at the logs using ssh root@ip -p 22222 "journalctl -n 100000 --no-pager" > logs.txt, the first line that shows up when the crash happens is:

Jul 07 20:28:44 f8d1673 systemd[1]: balena.service: Watchdog timeout (limit 6min)!
Jul 07 20:28:44 f8d1673 systemd[1]: balena.service: Killing process 4788 (balenad) with signal SIGABRT.
Jul 07 20:28:44 f8d1673 balenad[4788]: SIGABRT: abort
Jul 07 20:28:44 f8d1673 balenad[4788]: PC=0x7275c m=0 sigcode=0
Jul 07 20:28:44 f8d1673 balenad[4788]: goroutine 0 [idle]:

Then, thousands and thousands of lines from the stack trace of the balena daemon getting aborted.

I don’t have anything running on the device, as it’s a fresh install. It seems the download process just eats up more RAM than available.
I did set the RESIN_SUPERVISOR_UPDATE_STRATEGY device variable to kill-then-download, although as there is no container running other than the supervisor, I don’t think this does much.

Is there something I can do in order to get this working?

Hi there, it’s possible that the Linux kernel is killing the supervisor process due to out-of-memory condition (SIGABRT holds a clue, since malloc() will call it when there is a problem). there should be something to this effect also in the dmesg output. If so, are you able to try join this device to an app which has fewer services/less memory requirements?

Yes, it (a Raspberry Pi 3, should have mentioned earlier) can join other apps with less services. To me, it looks like the supervisor eats up a lot of memory for each download. Is there an option to do each download sequentially instead of parallel?

Is there an option to do each download sequentially instead of parallel?

Hi @edorgeville, we got this feature request earlier as well. I will add your +1 in our internal tracking system. In the newer balenaOS the default value for maximum concurrent downloads is 3. This is not user configurable at the moment. But I personally agree that it makes sense to expose this parameter as a configuration variable.

Perhaps a faster SD card might help you in this case. Another possible workaround is removing most of the services from the docker-compose.yml and then pushing with only few services declared. As the device downloads services, you keep adding a couple of more services to docker-compose.yml and push again. Essentially you manually instruct the device supervisor to download in smaller batches.

Finally, if you are up for it, it’d great if you could do memory profile on this device and share the results with us. I agree with your reasoning. Then it’d be good to confirm with quantitative feedback.

Cheers…

What would be the best way to do a memory profile?

Hi!

Memory profiling methods highly depend on the language used to write the service. For example, the tools you use to profile a Node application are different from the tools you use to profile a Java application. Which languages are being used in your services?

@kaisoz Most of the services are based on balenalib/raspberrypi3-python:3.7-run, some on balenalib/raspberrypi3-node. I doubt this does anything to crash the download though as there isn’t even a single service up yet.
If someone from the team can have a look, I made a fresh install today and granted access to the device:
https://dashboard.balena-cloud.com/devices/c9e178d56d70edbc2c4097ffaf188f7d/summary
Feel free to poke around!

Hi @claudiu725, welcome to the balena forums.

I ran the device health checks and while not conclusive it does give us some hints. I’d like to restart the container engine and the supervisor if that’s okay with you? What brand and speed of SD card are you using? It seems like it might be a bit slow, which could be the source of the issue.

Cheers,
James.

@jimsynz (I think you tagged the original poster instead of me)
Feel free to restart and anything else you want!
The SD card is this one: https://www.amazon.ca/Samsung-Class-Micro-Adapter-MB-MC32DA/dp/B00WR4IJBE

Hi,

I’ve restarted the container engine and the supervisor on the device, and it’s back to trying to update - so we’ll see what happens.

That SD card looks OK, but the device diagnostics still show some write speed issues:

Some localdisk issues detected: 
test_write_latency Slow disk writes detected: mmcblk0: 3061.6ms / write, sample size 1085612
mmcblk0p5: 4840.89ms / write, sample size 2543
mmcblk0p6: 3057.47ms / write, sample size 1083051

which possibly could indicate that the card is beginning to fail.

I also noticed that you have RESIN_HOST_CONFIG_core_freq set, presumably for the PiJuice service? I don’t think this is having an effect, but if memory serves it does affect memory performance. Does your PiJuice service work without this set (or with it overridden)?

Phil

Phil

@phil-d-wilson I’ll order new SD cards, save us some time :sweat_smile:

New SD card plugged in, everything deployed in under 30 minutes without issue.
Guess I’m just stubborn and keep getting surprised at how SD card don’t fail completely but slow down instead. Thanks for the support!