Unexpected supervisor crash/container restart on network change

I don’t have owner access to add developers, but I could get it done if needed. We don’t have any additional developer licenses on our account, though, so I’m not sure how to do that without being charged for an extra seat. Is there a good way to handle that?

Ah, in that case I can try another way. No action is needed from you, but just be aware that I’ll roll out the version in the next 2 hours and let you know when it’s done. It’s a quick rollout; I just have to prepare a few more things.

Ok, no rush. Let me know whenever it’s done and what to look for/do.

Hey Adam, I’ve successfully got the modified Supervisor version onto the device and everything seems stable. This is running our latest commit, so it hasn’t gone through our release process at all; please keep that in mind if an issue comes up. We’ll revert the device to the previous version when we’re done.

The changes consist of just dumping the network configurations that the device thinks have changed, so we can see exactly what differs. Feel free to perform the steps you said reliably show the “Network changes detected” message and share the logs! I should be able to reproduce this as well and will see what I can find on my end.
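Roughly, the debug change looks like the sketch below. The names are made up and the real Supervisor is Node.js, so take this as a Go illustration of the idea rather than the actual diff: whenever the current and target network configurations differ, both get dumped so the journal shows exactly what the device thinks changed.

package main

import (
    "encoding/json"
    "log"
    "reflect"
)

type networkConfig map[string]interface{}

// checkNetworkChanges logs both configurations whenever they differ,
// right where the "Network changes detected" message is emitted.
func checkNetworkChanges(current, target networkConfig) bool {
    if reflect.DeepEqual(current, target) {
        return false
    }
    log.Println("Network changes detected")
    cur, _ := json.MarshalIndent(current, "", "  ")
    tgt, _ := json.MarshalIndent(target, "", "  ")
    log.Printf("current: %s", cur)
    log.Printf("target:  %s", tgt)
    return true
}

func main() {
    // Example values only; the real configs come from the device state.
    current := networkConfig{"ssid": "field-unit", "metric": 600}
    target := networkConfig{"ssid": "field-unit", "metric": 100}
    checkNetworkChanges(current, target)
}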

Just tried it. It happened for sure. This time it took ~2 minutes instead of 20 seconds.

Take a look at the journal from the reboot at 23:30:15 onward. Nemo starts at 23:30:19, is killed by the supervisor at 23:31:25 and a second time at 23:31:37, and finally comes back at 23:32:36. Not sure what’s relevant in the supervisor prints. I don’t see the network changes detected message this time though…

Yup, I do see the following logs, with no reason given for why the container was killed.

Nov 03 23:31:25 40fa62f resin-supervisor[2667]: [event]   Event: Service kill {"service":{"appId":1623312,"serviceId":494586,"serviceName":"nemo","releaseId":1582887}}
Nov 03 23:31:37 40fa62f resin-supervisor[2667]: [event]   Event: Service exit {"service":{"appId":1623312,"serviceId":494586,"serviceName":"nemo","releaseId":1582887}}
Nov 03 23:31:37 40fa62f resin-supervisor[2667]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494586,"serviceName":"nemo","releaseId":1582887}}
Nov 03 23:31:37 40fa62f resin-supervisor[2667]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494586,"serviceName":"nemo","releaseId":1582887}}

Keep messing around with it and hopefully we can get the “Network changes detected” message to log. This behavior of stopping the service without logging why seems new. I’ll continue this tomorrow (8 pm for me).

Also, I fully understand the frustration of having your services restart/stop/start unexpectedly. A few weeks ago Balena had an internal hackathon, and I personally worked on a feature that tells you which of your containers were killed/started, when, and why. It needs more work before we can release it to production, but it’s something we want to add.

Ok, will do. By the way, I managed to trigger another supervisor crash. I hadn’t done anything since I posted earlier; all I did was restart Nemo once just now (~00:25 UTC), and then the supervisor crashed around 00:31 with what looks like a watchdog failure. Not sure why.

Nov 04 00:25:03 40fa62f balenad[1075]: time="2020-11-04T00:25:03.366241256Z" level=info msg="shim reaped" id=8df9c0597cc53d1ced79110961913f76bdd25c239e44136c7dfbdfec55a5e765
Nov 04 00:25:03 40fa62f balenad[1075]: time="2020-11-04T00:25:03.372051267Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 04 00:25:04 40fa62f af6bce370f97[1075]: 2020-11-04 00:25:04,903 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
...
Nov 04 00:31:03 40fa62f systemd[1]: balena.service: Watchdog timeout (limit 6min)!
Nov 04 00:31:03 40fa62f systemd[1]: balena.service: Killing process 1075 (balenad) with signal SIGABRT.
Nov 04 00:31:03 40fa62f balenad[1075]: SIGABRT: abort
Nov 04 00:31:03 40fa62f balenad[1075]: PC=0x45a52c m=0 sigcode=0
Nov 04 00:31:03 40fa62f balenad[1075]: goroutine 0 [idle]:
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.futex(0x23fd528, 0x0, 0x0, 0x0, 0x0, 0x432ef4, 0x1a0, 0x0, 0x0, 0x4331b0, ...)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/sys_linux_arm64.s:321 +0x1c
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.futexsleep(0x23fd528, 0x0, 0xffffffffffffffff)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/os_linux.go:45 +0x3c
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.notesleep(0x23fd528)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/lock_futex.go:151 +0x84
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.stopm()
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/proc.go:1952 +0xa0
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.findrunnable(0x442004aa00, 0x0)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/proc.go:2415 +0x50c
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.schedule()
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/proc.go:2541 +0xe8
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.park_m(0x442083b200)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/proc.go:2604 +0x90
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.mcall(0x0)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/asm_arm64.s:169 +0x50
Nov 04 00:31:03 40fa62f balenad[1075]: goroutine 1 [chan receive, 6 minutes]:
Nov 04 00:31:03 40fa62f balenad[1075]: github.com/docker/docker/cmd/dockerd.(*DaemonCli).start(0x44207be5a0, 0x44202b0000, 0x0, 0x0)
...
Nov 04 00:31:03 40fa62f balenad[1075]:  /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.17-dev+git2ab17e0536b6a4528b33c75e8f350447e9882af0-r0/git/src/import/.gopath/src/github.com/docker/docker/daemon/attach.go:177 +0xd0
Nov 04 00:31:03 40fa62f balenad[1075]: github.com/docker/docker/daemon.(*Daemon).ContainerAttach(0x442041c3c0, 0x44201033c7, 0x40, 0x4420a08b40, 0x0, 0x0)
Nov 04 00:31:03 40fa62f balenad[1075]:  /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.17-dev+git2ab17e0536b6a4528b33c75e8f350447e9882af0-r0/git/src/import/.gopath/src/github.com/docker/docker/daemon/attach.go:74 +0x438
Nov 04 00:31:03 40fa62f balenad[1075]: github.com/docker/docker/api/server/router/container.(*containerRouter).postContainersAttach(0x4420a293c0, 0x1984600, 0x4420a08a80, 0x1982200, 0x4421604700, 0x4420a40a00, 0x4420a089f0, 0x17ae68f, 0x4)
Nov 04 00:31:03 40fa62f balenad[1075]:  /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.17-dev+git2ab17e0536b6a4528b33c75e8f350447e9882af0-r0/git/src/import/.gopath/src/github.com/docker/docker/api/server/router/container/container_routes.go:568 +0x2d

Might be a coincidence, but 6 minutes is almost exactly the amount of time between when I hit restart and when the watchdog fired.
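For reference, my understanding of the systemd watchdog contract (from the systemd docs, not balena internals): the service has to send a WATCHDOG=1 keep-alive on the notify socket at least once per WatchdogSec, and if none arrives in time systemd kills it with SIGABRT, which matches the “Watchdog timeout (limit 6min)!” line above. A rough Go sketch of the daemon side using the coreos/go-systemd package; the wiring is my assumption, not balenad’s actual code:

package main

import (
    "time"

    "github.com/coreos/go-systemd/daemon"
)

func main() {
    // Ask systemd for the configured watchdog interval (WATCHDOG_USEC);
    // zero means no watchdog is enabled for this service.
    interval, err := daemon.SdWatchdogEnabled(false)
    if err != nil || interval == 0 {
        return
    }
    // A healthy daemon pings well within the interval. If the process
    // wedges (like goroutine 1 above, stuck in a chan receive for
    // 6 minutes), the pings stop and systemd sends SIGABRT.
    for range time.Tick(interval / 2) {
        daemon.SdNotify(false, daemon.SdNotifyWatchdog)
    }
}

So if the engine’s main loop blocked right after I hit restart, the 6-minute gap would just be the watchdog limit counting down.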

Hi Adam,

How did you restart the service? Did you use the UI?

I’m seeing two watchdog errors in the logs, preceded by

Handler for POST /v1.39/containers/create returned error: mount already exists with name

which might tell us something about what is happening.

Would you mind if I try triggering the restart of the service?

Thanks
Felipe

Yeah, I’ve been restarting via the dashboard. Basically: power cycle the device (physically, not the reboot button), restart nemo in the dashboard, power cycle again.

It’s just sitting there right now so feel free to mess with it.

Thank you, I will go ahead and perform some tests.

Hey Adam, it’s me again :stuck_out_tongue: Were you able to get the “network changes detected” log again?

Nah, I messed with it a tiny bit (the post above about the supervisor crash) but then got stuck resolving something else all night. Whenever @pipex is done messing with it I can give it a go.

He won’t be using the device anymore so you can use it. I’ll try to see if I can find any leads regarding the watchdog error.

Hmm, I’ve tried a couple of times and haven’t been able to replicate it yet. Don’t know what’s different; my trusty multi-step approach worked a ton yesterday… I’ll keep trying.

Dang, that’s pretty frustrating. To be honest, my reliable method actually has not produced the network changes detected message either… Maybe that was a regression that got fixed by other code improvements. We can both keep trying, but if it doesn’t happen again we can simplify this thread by just focusing on the watchdog issue. Sound good?

So the last time I was able to replicate it was just before you updated the supervisor. We’re using the production dart imx8mm host OS image, which has supervisor version 11.4.10. Do you know when the next update is planned for that platform? If the issue was fixed in some newer version of the supervisor that’s definitely good (though it would be nice to understand what actually fixed it), but until there’s a new OS release our customers are still gonna hit it.

What I’ve gathered so far is that you restarted nemo and then the Supervisor crashed (restarted). You found the watchdog logs, which tell us the engine timed out, so the watchdog terminated it to be run again. Checking some more logs, we found the following:

Nov 04 00:28:04 40fa62f balenad[1075]: time="2020-11-04T00:28:04.000026956Z" level=error msg="Handler for POST /v1.39/containers/create returned error: mount already exists with name"
Nov 04 00:28:04 40fa62f balenad[1075]: balena: Error response from daemon: mount already exists with name.

This indicates a bug in the engine: it tried to create something that already existed, which probably caused it to crash and thus not respond to the watchdog. The logs, however, don’t show a restart command from the Supervisor; we just see:

Nov 04 00:25:15 40fa62f dd7538ad9f90[1075]: [info]    Starting API server
Nov 04 00:25:15 40fa62f dd7538ad9f90[1075]: [info]    Supervisor API successfully started on port 48484
Nov 04 00:25:15 40fa62f dd7538ad9f90[1075]: [info]    Applying target state
Nov 04 00:25:15 40fa62f dd7538ad9f90[1075]: [debug]   Ensuring device is provisioned
Nov 04 00:25:16 40fa62f dd7538ad9f90[1075]: [debug]   Starting current state report
Nov 04 00:25:16 40fa62f dd7538ad9f90[1075]: [debug]   Starting target state poll
Nov 04 00:25:16 40fa62f dd7538ad9f90[1075]: [debug]   Spawning journald with: chroot  /mnt/root journalctl -a --follow -o json _SYSTEMD_UNIT=balena.service
Nov 04 00:25:16 40fa62f dd7538ad9f90[1075]: [debug]   Finished applying target state
Nov 04 00:25:16 40fa62f dd7538ad9f90[1075]: [success] Device state apply success
Nov 04 00:25:16 40fa62f dd7538ad9f90[1075]: [info]    Applying target state
Nov 04 00:25:17 40fa62f dd7538ad9f90[1075]: [debug]   Finished applying target state
Nov 04 00:25:17 40fa62f dd7538ad9f90[1075]: [success] Device state apply success
Nov 04 00:25:24 40fa62f dd7538ad9f90[1075]: [info]    VPN connection is active.
Nov 04 00:25:24 40fa62f dd7538ad9f90[1075]: [info]    VPN connection is active.
Nov 04 00:25:24 40fa62f dd7538ad9f90[1075]: [info]    VPN connection is active.
Nov 04 00:25:24 40fa62f dd7538ad9f90[1075]: [info]    VPN connection is active.
Nov 04 00:25:25 40fa62f dd7538ad9f90[1075]: [info]    Internet Connectivity: OK
Nov 04 00:28:04 40fa62f balenad[1075]: time="2020-11-04T00:28:04.000026956Z" level=error msg="Handler for POST /v1.39/containers/create returned error: mount already exists with name"
Nov 04 00:28:04 40fa62f balenad[1075]: balena: Error response from daemon: mount already exists with name.
Nov 04 00:28:04 40fa62f balenad[1075]: See 'balena run --help'.
Nov 04 00:30:13 40fa62f dd7538ad9f90[1075]: [api]     GET /v1/healthy 200 - 18.667 ms
Nov 04 00:31:03 40fa62f systemd[1]: balena.service: Watchdog timeout (limit 6min)!
Nov 04 00:31:03 40fa62f systemd[1]: balena.service: Killing process 1075 (balenad) with signal SIGABRT.
Nov 04 00:31:03 40fa62f balenad[1075]: SIGABRT: abort
Nov 04 00:31:03 40fa62f balenad[1075]: PC=0x45a52c m=0 sigcode=0
...
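To make the mount error more concrete, here’s a hypothetical sketch of the failure shape (made-up code, not the engine’s actual implementation): the engine keeps a name-keyed registry of mounts, and creating a container fails if its mount name is already registered, which is what you’d see if the old container’s mount was never cleaned up before the new create.

package main

import (
    "fmt"
    "sync"
)

// mountStore is a stand-in for a name-keyed mount registry that
// rejects duplicate registrations.
type mountStore struct {
    mu     sync.Mutex
    byName map[string]struct{}
}

func (s *mountStore) Add(name string) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, exists := s.byName[name]; exists {
        return fmt.Errorf("mount already exists with name")
    }
    s.byName[name] = struct{}{}
    return nil
}

func (s *mountStore) Remove(name string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    delete(s.byName, name)
}

func main() {
    store := &mountStore{byName: map[string]struct{}{}}
    store.Add("nemo-data") // hypothetical mount name

    // If the old container's mount is never Remove()d, say because
    // cleanup raced with the restart, the next create fails:
    if err := store.Add("nemo-data"); err != nil {
        fmt.Println(err) // mount already exists with name
    }
}

If something like that is going on, the interesting question is why the cleanup never completed, which might also be what left the engine unresponsive until the watchdog fired.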

I’ll keep looking but will probably ask for help from the engine team soon.

Yeah, I totally agree that not knowing the cause isn’t a good outcome. I could do some testing to find which version in the v11.4.10 to v12.1.1 range fixed the issue, and look at what changes were made, to provide a more concrete answer.

Hey Adam, I’ve asked our devices team whether we should expect a new OS release for your device, and whether someone from the engine team could help out with the error we saw. I’ll also see if I can find out which Supervisor version stops the “network changes detected” log from appearing.

Ok, much appreciated on both fronts.