Watchdog restarts the device during a release update

claudiu725 · January 14, 2020, 10:42pm

My raspberry pi is stuck in a restart loop, unable to update to the newest release.
Inside the balenaCloud dashboard I can see the update reaching about 7% progress before a reboot happens.
The release which I have on the device right now doesn’t have a correct docker CMD (my mistake), so the supervisor watchdog fails, but I’m unable to update because of the reboots (which happens roughly every 8 minutes).

You could add an option in the dashboard to disable the watchdog reboots – or to not reboot the device while an update is in progress…

Running “journalctl -a” on the device gives the following (truncated):

Jan 14 21:04:06 6f58c4d fcf800c6d1da[830]: [info] Internet Connectivity: OK
Jan 14 21:04:06 6f58c4d resin-supervisor[1360]: [info] Internet Connectivity: OK
Jan 14 21:08:34 6f58c4d fcf800c6d1da[830]: [api] GET /v1/healthy 200 - 32.079 ms
Jan 14 21:08:34 6f58c4d resin-supervisor[1360]: [api] GET /v1/healthy 200 - 32.079 ms
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Watchdog timeout (limit 6min)!
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 830 (balenad) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 837 (exe) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 886 (balena-engine-c) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 1532 (balena-engine-c) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 1641 (exe) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d balenad[830]: SIGABRT: abort
Jan 14 21:09:34 6f58c4d balenad[830]: PC=0x45a544 m=0 sigcode=0
Jan 14 21:09:34 6f58c4d balenad[830]: goroutine 0 [idle]:
Jan 14 21:09:34 6f58c4d balenad[830]: runtime.futex(0x23fc528, 0x0, 0x0, 0x0, 0x4400000000, 0x4387dc, 0x442005c948, 0x0, 0x0, 0x4338d8, …)
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/sys_linux_arm64.s:321 +0x1c
Jan 14 21:09:34 6f58c4d balenad[830]: runtime.futexsleep(0x23fc528, 0x0, 0xffffffffffffffff)
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 1673 (balena-healthch) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/os_linux.go:45 +0x3c
Jan 14 21:09:34 6f58c4d balenad[830]: runtime.notesleep(0x23fc528)
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/lock_futex.go:151 +0x84
Jan 14 21:09:34 6f58c4d systemd[1]: balena.service: Killing process 1701 (balena) with signal SIGABRT.
Jan 14 21:09:34 6f58c4d balenad[830]: runtime.stoplockedm(SIGABRT: abort)
Jan 14 21:09:34 6f58c4d balenad[830]: PC= 0x45a544/usr/lib/go/src/runtime/proc.go m=:02101 sigcode= +00x60
Jan 14 21:09:34 6f58c4d balenad[830]: goroutine runtime.schedule0( [)
Jan 14 21:09:34 6f58c4d balenad[830]: idle ]:
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/proc.go:2493runtime.futex +0x274(0x23fc528
Jan 14 21:09:34 6f58c4d balenad[830]: , runtime.park_m0x0(, 0x4420093c800x0)
Jan 14 21:09:34 6f58c4d balenad[830]: , 0x0/usr/lib/go/src/runtime/proc.go, :0x44000000002604, +0x2000000000x90,
Jan 14 21:09:34 6f58c4d balenad[830]: 0x1, runtime.mcall0x0(0x0)
Jan 14 21:09:34 6f58c4d balenad[830]: , 0x0 , /usr/lib/go/src/runtime/asm_arm64.s0x4331c8:, …169)
Jan 14 21:09:34 6f58c4d balenad[830]: + 0x50/usr/lib/go/src/runtime/sys_linux_arm64.s
Jan 14 21:09:34 6f58c4d balenad[830]: :321
Jan 14 21:09:34 6f58c4d balenad[830]: +0x1cgoroutine
Jan 14 21:09:34 6f58c4d balenad[830]: 1 [runtime.futexsleepchan receive(, 0x23fc5288 minutes, ]:
Jan 14 21:09:34 6f58c4d balenad[830]: 0x0, 0xffffffffffffffff)
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/os_linux.go:45 +0x3c
Jan 14 21:09:34 6f58c4d balenad[830]: runtime.notesleep(0x23fc528)
Jan 14 21:09:34 6f58c4d balenad[830]: /usr/lib/go/src/runtime/lock_futex.go:151 +0x84

And also (more recent log):

Jan 14 22:23:40 6f58c4d resin-supervisor[9815]: time=“2020-01-14T22:23:40.324761888Z” level=error msg=“error waiting for container: unexpected EOF”
Jan 14 22:23:40 6f58c4d systemd[1]: balena.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 14 22:23:40 6f58c4d systemd[1]: balena.service: Failed with result ‘watchdog’.
Jan 14 22:23:40 6f58c4d systemd[1]: resin-supervisor.service: Main process exited, code=exited, status=125/n/a
Jan 14 22:23:40 6f58c4d systemd[1]: resin-supervisor.service: Failed with result ‘exit-code’.

Terminal from the balenaCloud dashboard:

14.01.20 22:55:53 (+0200) Supervisor starting
14.01.20 22:56:02 (+0200) Downloading image ‘registry2.balena-cloud.com/v2/fc244d3e7c9552ed3184d0b87abcda2f@sha256:3c6977f8f0f4e7967947cdc7bf31052fe4b4525c30e61eed8e272c8c0b204021’
14.01.20 22:58:12 (+0200) Rebooting
14.01.20 23:03:50 (+0200) Supervisor starting
14.01.20 23:04:03 (+0200) Downloading image ‘registry2.balena-cloud.com/v2/fc244d3e7c9552ed3184d0b87abcda2f@sha256:3c6977f8f0f4e7967947cdc7bf31052fe4b4525c30e61eed8e272c8c0b204021’
14.01.20 23:11:12 (+0200) Supervisor starting
14.01.20 23:11:31 (+0200) Downloading image ‘registry2.balena-cloud.com/v2/fc244d3e7c9552ed3184d0b87abcda2f@sha256:3c6977f8f0f4e7967947cdc7bf31052fe4b4525c30e61eed8e272c8c0b204021’
14.01.20 23:19:01 (+0200) Supervisor starting
14.01.20 23:19:18 (+0200) Downloading image ‘registry2.balena-cloud.com/v2/fc244d3e7c9552ed3184d0b87abcda2f@sha256:3c6977f8f0f4e7967947cdc7bf31052fe4b4525c30e61eed8e272c8c0b204021’

mbalamat · January 15, 2020, 10:32am

Hi @claudiu725,

can you provide some extra info please? I would like to know what is your device’s type(RaspberryPi 3, 4 or other). Which balenaOS version are you running and which Supervisor version are you running (you can find all of these in your balenaCloud dashboard).

Best Regards,
Marios

claudiu725 · January 15, 2020, 11:15pm

Hi @mbalamat

Type
Raspberry Pi 3 (using 64bit OS) (BETA)

Host OS Version
balenaOS 2.46.1+rev1

Supervisor Version
10.6.27

dfunckt · January 16, 2020, 3:41pm

What you can do is SSH into the host, find the offending container and its image, forcefully kill the container and remove the image and let the Supervisor pull from scratch. You may have to fight a bit with the Supervisor trying to do its thing so try to be quick. Alternatively, a less offensive strategy which might work is to use the “kill-then-download” update strategy.

claudiu725 · January 16, 2020, 5:02pm

Thanks!

edorgeville · July 7, 2020, 10:16pm

I’m having the same issue while trying to add a device to a new 9 services app.
While downloading, the available memory goes from 400M to about 16M (watch -n 1 free), then the supervisor crashes and all downloads are aborted.
After 2-3 minutes, the supervisor boots back up as memory is freed. All downloads start again from scratch. This loop has been repeating for days now.

Looking at the logs using ssh root@ip -p 22222 "journalctl -n 100000 --no-pager" > logs.txt, the first line that shows up when the crash happens is:

Jul 07 20:28:44 f8d1673 systemd[1]: balena.service: Watchdog timeout (limit 6min)!
Jul 07 20:28:44 f8d1673 systemd[1]: balena.service: Killing process 4788 (balenad) with signal SIGABRT.
Jul 07 20:28:44 f8d1673 balenad[4788]: SIGABRT: abort
Jul 07 20:28:44 f8d1673 balenad[4788]: PC=0x7275c m=0 sigcode=0
Jul 07 20:28:44 f8d1673 balenad[4788]: goroutine 0 [idle]:

Then, thousands and thousands of lines from the stack trace of the balena daemon getting aborted.

I don’t have anything running on the device, as it’s a fresh install. It seems the download process just eats up more RAM than available.
I did set the RESIN_SUPERVISOR_UPDATE_STRATEGY device variable to kill-then-download, although as there is no container running other than the supervisor, I don’t think this does much.

Is there something I can do in order to get this working?

ab77 · July 7, 2020, 10:24pm

Hi there, it’s possible that the Linux kernel is killing the supervisor process due to out-of-memory condition (SIGABRT holds a clue, since malloc() will call it when there is a problem). there should be something to this effect also in the dmesg output. If so, are you able to try join this device to an app which has fewer services/less memory requirements?

edorgeville · July 7, 2020, 11:00pm

Yes, it (a Raspberry Pi 3, should have mentioned earlier) can join other apps with less services. To me, it looks like the supervisor eats up a lot of memory for each download. Is there an option to do each download sequentially instead of parallel?

gelbal · July 8, 2020, 10:06am

Is there an option to do each download sequentially instead of parallel?

Hi @edorgeville, we got this feature request earlier as well. I will add your +1 in our internal tracking system. In the newer balenaOS the default value for maximum concurrent downloads is 3. This is not user configurable at the moment. But I personally agree that it makes sense to expose this parameter as a configuration variable.

Perhaps a faster SD card might help you in this case. Another possible workaround is removing most of the services from the docker-compose.yml and then pushing with only few services declared. As the device downloads services, you keep adding a couple of more services to docker-compose.yml and push again. Essentially you manually instruct the device supervisor to download in smaller batches.

Finally, if you are up for it, it’d great if you could do memory profile on this device and share the results with us. I agree with your reasoning. Then it’d be good to confirm with quantitative feedback.

Cheers…

edorgeville · July 8, 2020, 5:18pm

What would be the best way to do a memory profile?

kaisoz · July 9, 2020, 9:52am

Hi!

Memory profiling methods highly depend on the language used to write the service. For example, the tools you use to profile a Node application are different from the tools you use to profile a Java application. Which languages are being used in your services?

edorgeville · July 13, 2020, 10:10pm

@kaisoz Most of the services are based on balenalib/raspberrypi3-python:3.7-run, some on balenalib/raspberrypi3-node. I doubt this does anything to crash the download though as there isn’t even a single service up yet.
If someone from the team can have a look, I made a fresh install today and granted access to the device:
https://dashboard.balena-cloud.com/devices/c9e178d56d70edbc2c4097ffaf188f7d/summary
Feel free to poke around!

jimsynz · July 13, 2020, 10:29pm

Hi @claudiu725, welcome to the balena forums.

I ran the device health checks and while not conclusive it does give us some hints. I’d like to restart the container engine and the supervisor if that’s okay with you? What brand and speed of SD card are you using? It seems like it might be a bit slow, which could be the source of the issue.

Cheers,
James.

edorgeville · July 14, 2020, 2:57pm

@jimsynz (I think you tagged the original poster instead of me)
Feel free to restart and anything else you want!
The SD card is this one: https://www.amazon.ca/Samsung-Class-Micro-Adapter-MB-MC32DA/dp/B00WR4IJBE

phil-d-wilson · July 14, 2020, 3:47pm

Hi,

I’ve restarted the container engine and the supervisor on the device, and it’s back to trying to update - so we’ll see what happens.

That SD card looks OK, but the device diagnostics still show some write speed issues:

Some localdisk issues detected: 
test_write_latency Slow disk writes detected: mmcblk0: 3061.6ms / write, sample size 1085612
mmcblk0p5: 4840.89ms / write, sample size 2543
mmcblk0p6: 3057.47ms / write, sample size 1083051

which possibly could indicate that the card is beginning to fail.

I also noticed that you have RESIN_HOST_CONFIG_core_freq set, presumably for the PiJuice service? I don’t think this is having an effect, but if memory serves it does affect memory performance. Does your PiJuice service work without this set (or with it overridden)?

Phil

edorgeville · July 14, 2020, 8:44pm

@phil-d-wilson I’ll order new SD cards, save us some time

edorgeville · July 15, 2020, 8:41pm

New SD card plugged in, everything deployed in under 30 minutes without issue.
Guess I’m just stubborn and keep getting surprised at how SD card don’t fail completely but slow down instead. Thanks for the support!

Topic		Replies	Views
Device stuck in Updating loop Product support	16	1782	July 4, 2019
Slow download speed, causes restart of download Product support	13	1549	February 7, 2020
Services are in a constant restart loop! balenaOS support , balenafin	51	4020	January 15, 2021
"Update successful, rebooting" persists through reboots, Balena Application Container Engine fails Product support raspberrypi3 , docker	4	947	September 16, 2021
Supervisor restart loop after OS update balenaOS raspberrypi3	8	421	August 4, 2023

Watchdog restarts the device during a release update

Related topics