Unexpected supervisor crash/container restart on network change

Hi guys,

Quick update: we added update locking to our nautilus service, and that mostly seems to be working to prevent the supervisor from killing it while it's active. However, we are definitely still seeing the supervisor's "Network changes detected" condition, followed by the supervisor restarting a service unnecessarily.
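
For context, by "update locking" I mean roughly the pattern below. This is a minimal sketch rather than our exact code, and it assumes a flock-style lock on /tmp/balena/updates.lock as described in the balena update-locks docs (the path and exact semantics can differ between supervisor/OS versions):

```python
import fcntl
from contextlib import contextmanager

# Assumed default lockfile path from the update-locks docs; older hostOS
# releases may expect /tmp/resin/resin-updates.lock instead.
LOCKFILE = "/tmp/balena/updates.lock"

@contextmanager
def update_lock(path=LOCKFILE):
    """Hold an exclusive lock so the supervisor defers updates/restarts of this service."""
    with open(path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # block until we own the lock
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)  # release so any queued update can proceed

# Usage: wrap the work that must not be interrupted.
# with update_lock():
#     run_critical_section()
```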

I’ve been doing a lot of testing today, trying to handle unexpected supervisor communication failures at power-on more gracefully (see https://forums.balena.io/t/supervisor-v1-device-api-missing-commit-if-update-pending/). I seem to have found a way to replicate the restart condition pretty reliably, but on our nemo service this time.
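
(The "more gracefully" part is basically polling the Supervisor API until it answers instead of giving up on the first failure. A rough sketch is below; it assumes the standard BALENA_SUPERVISOR_ADDRESS / BALENA_SUPERVISOR_API_KEY variables that come with the io.balena.features.supervisor-api label, and the retry numbers are made up.)

```python
import os
import time
import requests

# Provided to the container when io.balena.features.supervisor-api is set.
SUPERVISOR = os.environ["BALENA_SUPERVISOR_ADDRESS"]
API_KEY = os.environ["BALENA_SUPERVISOR_API_KEY"]

def wait_for_supervisor(retries=12, delay=5):
    """Poll GET /v1/device until the supervisor responds, instead of failing at power-on."""
    for _ in range(retries):
        try:
            r = requests.get(f"{SUPERVISOR}/v1/device",
                             params={"apikey": API_KEY}, timeout=5)
            r.raise_for_status()
            return r.json()  # may still be missing fields (e.g. commit) right after boot
        except (requests.RequestException, ValueError):
            time.sleep(delay)  # supervisor not ready yet; back off and retry
    raise RuntimeError(f"Supervisor API not reachable after {retries} attempts")
```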

Basically:

  1. Power cycle the device
  2. Restart the nemo service manually (via the dashboard; see the sketch after this list for a scripted equivalent)
  3. Power cycle the device again
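
For step 2 I’ve been using the dashboard button, but if it helps to script the repro, something like the following should request the same restart through the Supervisor API. This is a sketch only: it assumes the v2 restart-service endpoint, the supervisor-api label, and the BALENA_APP_ID variable, and I haven’t verified it exercises exactly the same code path as the dashboard.

```python
import os
import requests

SUPERVISOR = os.environ["BALENA_SUPERVISOR_ADDRESS"]
API_KEY = os.environ["BALENA_SUPERVISOR_API_KEY"]
APP_ID = os.environ["BALENA_APP_ID"]

def restart_service(service_name="nemo"):
    """Ask the local supervisor to restart one service of this app."""
    r = requests.post(
        f"{SUPERVISOR}/v2/applications/{APP_ID}/restart-service",
        params={"apikey": API_KEY},
        json={"serviceName": service_name},
        timeout=30,
    )
    r.raise_for_status()

restart_service()
```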

When the device boots, nemo starts automatically, and then about 20 seconds later the supervisor restarts it. I’m seeing repeated bursts of 2 or 3 copies of:

Oct 30 18:24:35 40fa62f 28769d824dbb[1074]: [debug]   Replacing container for service nemo because of config changes:
Oct 30 18:24:35 40fa62f 28769d824dbb[1074]: [debug]     Network changes detected

They seem to start printing about 5 seconds before the nemo service has started, or at least before its first console print (it’s not clear from the journal when the Docker container actually started), and they keep appearing in bursts of 2 or 3 for another ~15 seconds before the supervisor actually performs the service restart. It then seems to take another ~10 seconds before the container is running again.

Throughout these ~25 seconds, the supervisor messages are interleaved with messages from openvpn, including at one point:

Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 NOTE: Pulled options changed on restart, will need to close and reopen TUN/TAP device.
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 Closing TUN/TAP interface
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 /sbin/ip addr del dev resin-vpn local 10.240.13.170 peer 52.4.252.97
Oct 30 18:24:41 40fa62f openvpn[1067]: ip: RTNETLINK answers: Operation not permitted
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 Linux ip addr del failed: external program exited with error status: 2
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 /etc/openvpn-misc/downscript.sh resin-vpn 1500 1555 10.240.13.170 52.4.252.97 init
Oct 30 18:24:42 40fa62f openvpn[1067]: Fri Oct 30 18:24:42 2020 ROUTE_GATEWAY 192.168.0.1/255.255.255.0 IFACE=eth0 HWADDR=f8:dc:7a:43:67:c4
Oct 30 18:24:42 40fa62f openvpn[1067]: Fri Oct 30 18:24:42 2020 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Oct 30 18:24:42 40fa62f openvpn[1067]: Fri Oct 30 18:24:42 2020 Exiting due to fatal error
Oct 30 18:24:42 40fa62f systemd[1]: openvpn.service: Main process exited, code=exited, status=1/FAILURE
Oct 30 18:24:42 40fa62f systemd[1]: openvpn.service: Failed with result 'exit-code'.

Not sure if the two are related though.

Any thoughts on what is triggering the restart behavior? I honestly have no idea why restarting the service once manually before power cycling lets me replicate this, while simply power cycling does not. Power cycling without restarting a service does have a separate issue, which I wrote up here: Duplicate entries in `balena logs` after boot.

Now that I can replicate this easily, if there’s any journal output or anything else that would help diagnose the problem, please let me know.

Thanks,

Adam