OK, a quick update: I looked in journalctl around the time of the restart, and it looks like the supervisor restarted, or at least restarted my service, because of a claimed network settings change:
Oct 11 21:20:39 40fa62f NetworkManager[1024]: <info> [1602451239.7688] manager: (veth08a6fc0): new Veth device (/org/freedesktop/NetworkManager/Devices/48)
...
Oct 11 21:20:42 40fa62f NetworkManager[1024]: <info> [1602451242.6981] device (vethd45bf23): carrier: link connected
Oct 11 21:20:42 40fa62f NetworkManager[1024]: <info> [1602451242.7016] device (veth2400642): carrier: link connected
...
Oct 11 21:20:44 40fa62f avahi-daemon[1018]: Registering new address record for fe80::a466:1dff:fe33:68bd on veth2400642.*.
Oct 11 21:20:54 40fa62f resin-supervisor[14273]: resin_supervisor
Oct 11 21:20:54 40fa62f resin-supervisor[14353]: active
Oct 11 21:20:54 40fa62f resin-supervisor[14354]: Container config has not changed
Oct 11 21:20:57 40fa62f systemd-journald[603]: Suppressed 1345 messages from balena.service
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [info] Supervisor v11.4.10 starting up...
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [info] Supervisor v11.4.10 starting up...
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [debug] Starting event tracker
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [debug] Starting event tracker
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [debug] Starting logging infrastructure
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [debug] Starting logging infrastructure
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [debug] Starting api binder
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [debug] Starting api binder
Oct 11 21:20:58 40fa62f 28769d824dbb[13183]: [event] Event: Supervisor start {}
Oct 11 21:20:58 40fa62f resin-supervisor[14354]: [event] Event: Supervisor start {}
Oct 11 21:20:58 40fa62f 28769d824dbb[13183]: [debug] Performing database cleanup for container log timestamps
Oct 11 21:20:58 40fa62f resin-supervisor[14354]: [debug] Performing database cleanup for container log timestamps
Oct 11 21:20:58 40fa62f resin-supervisor[14354]: [info] Previous engine snapshot was not stored. Skipping cleanup.
Oct 11 21:20:58 40fa62f 28769d824dbb[13183]: [info] Previous engine snapshot was not stored. Skipping cleanup.
...
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [debug] Replacing container for service nautilus because of config changes:
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [debug] Replacing container for service nautilus because of config changes:
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [debug] Network changes detected
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [debug] Network changes detected
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [event] Event: Service kill {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376}}
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [event] Event: Service kill {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":155837>
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [event] Event: Service already stopped {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId">
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [event] Event: Service already stopped {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releas>
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [event] Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376}}
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [event] Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":155837>
Oct 11 21:21:00 40fa62f 28769d824dbb[13183]: [event] Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376>
Oct 11 21:21:00 40fa62f resin-supervisor[14354]: [event] Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":155>
Oct 11 21:21:00 40fa62f 28769d824dbb[13183]: [event] Event: Service installed {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":15583>
Oct 11 21:21:00 40fa62f resin-supervisor[14354]: [event] Event: Service installed {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1>
Oct 11 21:21:00 40fa62f 28769d824dbb[13183]: [event] Event: Service start {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376}}
Oct 11 21:21:00 40fa62f resin-supervisor[14354]: [event] Event: Service start {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":15583>
Oct 11 21:21:00 40fa62f NetworkManager[1024]: <info> [1602451260.5810] manager: (vethbb2a6d5): new Veth device (/org/freedesktop/NetworkManager/Devices/54)
Oct 11 21:21:00 40fa62f kernel: br-8d5301e37d5e: port 9(vethf8042b2) entered blocking state
Oct 11 21:21:00 40fa62f kernel[893]: br-8d5301e37d5e: port 9(vethf8042b2) entered blocking state
...
Oct 11 21:21:04 40fa62f resin-supervisor[14354]: [success] Device state apply success
Oct 11 21:21:04 40fa62f 28769d824dbb[13183]: [success] Device state apply success
Oct 11 21:21:08 40fa62f 28769d824dbb[13183]: [info] Internet Connectivity: OK
Oct 11 21:21:08 40fa62f resin-supervisor[14354]: [info] Internet Connectivity: OK
Not 100% sure I'm reading this right, but it looks like there was a quick disconnect/reconnect on the ethernet interface (plausible - this device is connected through a WiFi bridge), and for some reason that triggered a full supervisor restart. Is that correct? Why would a change on an external network interface - in this case just a brief reconnect - cause the supervisor or the containers to restart? Our devices need to operate continuously through brief internet outages, and the services can't be shutting down or restarting over something like this.
I granted support access to the device in case anyone wants to take a look. I’ve also attached the full journalctl output for the above snippet here:
atlas-container-crash.log (19.0 KB)
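For reference, I pulled the snippet above from the journal with roughly the following (the time window here is approximate, so treat it as illustrative rather than the exact command I ran):
journalctl --no-pager --since "2020-10-11 21:20" --until "2020-10-11 21:22"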
I also noticed a ton of "blob data" prints from my containers that seem to be flooding the log output a little. Not really sure what those are?
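(My best guess is that these are the placeholders journald shows for unprintable/binary message payloads - something like the following should expand them, assuming the container output is routed through balena.service as it appears to be in the journal above - but I don't know what in my containers would be emitting binary output:)
journalctl -a -u balena.service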
Also, not sure if it's relevant, but just before that it looks like balenad (the container engine) segfaulted, which then took the supervisor down with it:
Oct 11 21:20:33 40fa62f balenad[1090]: fatal error: unexpected signal during runtime execution
Oct 11 21:20:33 40fa62f balenad[1090]: [signal SIGSEGV: segmentation violation code=0x2 addr=0xffff9e626150 pc=0xffff9e4b0a94]
Oct 11 21:20:33 40fa62f balenad[1090]: runtime stack:
Oct 11 21:20:33 40fa62f balenad[1090]: runtime.throw(0x17ffed4, 0x2a)
Oct 11 21:20:33 40fa62f balenad[1090]: /usr/lib/go/src/runtime/panic.go:616 +0x5c
Oct 11 21:20:33 40fa62f balenad[1090]: runtime.sigpanic()
Oct 11 21:20:33 40fa62f balenad[1090]: /usr/lib/go/src/runtime/signal_unix.go:372 +0x23c
Oct 11 21:20:33 40fa62f balenad[1090]: goroutine 18518 [syscall]:
Oct 11 21:20:33 40fa62f balenad[1090]: runtime.cgocall(0x131d930, 0x4420c50bf8, 0x100)
...
Oct 11 21:20:33 40fa62f balenad[1090]: goroutine 26 [select]:
Oct 11 21:20:33 40fa62f balenad[1090]: github.com/docker/docker/libcontainerd/supervisor.(*remote).monitorDaemon(0x44200db380, 0x1984540, 0x44207e4740)
Oct 11 21:20:33 40fa62f balenad[1090]: /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.17-dev+git2ab17e0536b6a4528b33c75e8f350447e9882af0-r0/git/src/import/.gopath/src/github.com/docker/docker/libcontainerd/supervisor/remote_daemon.go:252 +0x624
Oct 11 21:20:33 40fa62f balenad[1090]: created by github.com/docker/docker/libcontainerd/supervisor.Start
...
Oct 11 21:20:33 40fa62f systemd[1]: balena.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 11 21:20:33 40fa62f systemd[1]: balena.service: Failed with result 'exit-code'.
Oct 11 21:20:33 40fa62f resin-supervisor[4196]: time="2020-10-11T21:20:33.139838050Z" level=error msg="error waiting for container: unexpected EOF"
Oct 11 21:20:33 40fa62f systemd[1]: resin-supervisor.service: Main process exited, code=exited, status=125/n/a
Oct 11 21:20:33 40fa62f systemd[1]: resin-supervisor.service: Failed with result 'exit-code'.
I attached a text file with the complete stack trace (it's super long). Not sure how helpful it is, but it's there just in case.
atlas-supervisor-crash.log (149.6 KB)
Thanks,
Adam