Unexpected supervisor crash/container restart on network change

From what I've gathered of the investigation so far, it appears to be the network configuration defined in the docker-compose file rather than something in the network the device is attached to. As they zero in it should become clearer, and we'll certainly keep you updated.

Gotcha. Shouldn’t be anything funky in our docker compose network-wise, but here it is for reference:

version: '2.1'
networks: {}
volumes:
  resin-data: {}
services:
  nautilus:
    build: ./nautilus
    # Note: Disabling 'always' restart until Nautilus application handles
    # start/stop behavior internally.
    restart: on-failure
    devices:
      - "/dev/ttymxc1:/dev/ttymxc1"
      - "/dev/ttymxc2:/dev/ttymxc2"
    cap_add:
      - SYS_RAWIO
    volumes:
      - 'resin-data:/data'
    ports:
      - "5500:5500" # Vision ZMQ
      - "30200:30200" # NMEA-0183
      - "30201:30201" # FusionEngine Client
    labels:
      io.balena.features.procfs: '1' # Needed to enable core dumps
  nemo:
    build: ./nemo
    privileged: true
    restart: always
    volumes:
      - 'resin-data:/data'
    ports:
      - "8888:8888" # Nemo REST Interface
    labels:
      io.balena.features.supervisor-api: '1'
      io.balena.features.dbus: '1'
      io.balena.features.procfs: '1'
    environment:
      - DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket
  ui-bridge:
    build: ./nautilus
    restart: always
    ports:
      - "3000:3000" # UI Websocket
    command: ["/home/point_one/bootstrap.sh", "--force-start",
              "./bin/ui_websocket_bridge",
              "--ui-multicast-port=30108",
              "--ws-port=3000"]
  ntrip-proxy:
    build: ./nautilus
    restart: always
    ports:
      - "2101:2101" # Proxied NTRIP Output
    command: ["/home/point_one/bootstrap.sh", "--force-start",
              "./bin/ntrip_proxy",
              "--polaris-api-key=${P1_POLARIS_API_KEY}",
              "--polaris-unique-id=${P1_DEVICE_ID:-${BALENA_DEVICE_NAME_AT_INIT}}-ntrip-proxy",
              "0.0.0.0", "2101", "point_one/applications/ntrip_proxy"]
  eos:
    build: ./eos
    restart: always
    volumes:
      - 'resin-data:/data'
    ports:
      - "80:80" # EOS UI

There is one thing we have noticed that may or may not be related, which we haven't had a chance to pinpoint yet. The nemo service (Python) talks to the supervisor to control things, query the software version, etc. When the device first powers on, more often than not we get an infinite loop of connection refused errors from the supervisor. Restarting the nemo container fixes it every time. Power cycling the device fixes it most of the time. Waiting and continuing to retry never resolves it. For customers, power cycling has been the only solution (since they can't restart containers).

I was planning to make a ticket for it when I had some time to dig into it further, and also to make a slightly better work-around in our code where the service restarts itself if it detects the condition, but since it may be related I figured I’d mention it.

When the device first powers on, more often than not we get an infinite loop of connection refused errors from the supervisor.

On boot, your application is started separately from the Supervisor, which means the Supervisor may come up after your app and take a while to start accepting requests. Your application will need to handle these errors gracefully.

Ah, interesting. We were under the impression that the supervisor always starts and stops the containers. We didn’t think that boot would be special and that they could start on their own independently.

In theory our code does have an initial check that tries to communicate with the supervisor and is supposed to restart the container on failure. The symptom we've been seeing suggests that check passes, but subsequent queries, triggered by status requests from the UI, are failing. I need to do a little more digging when I have some time, but I'll at least make sure to add an async testing loop and some delay between query attempts. That might help.
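
For what it's worth, this is roughly the shape of the retry loop I have in mind. It's only a sketch: the /ping endpoint and the BALENA_SUPERVISOR_ADDRESS/BALENA_SUPERVISOR_API_KEY variables assume the supervisor API exposed via io.balena.features.supervisor-api, and names like wait_for_supervisor and the retry counts are illustrative rather than our actual code.

# Sketch only: wait for the supervisor to start accepting requests after boot.
# wait_for_supervisor() and the retry counts are illustrative, not our real code.
import os
import time

import requests

SUPERVISOR_ADDRESS = os.environ["BALENA_SUPERVISOR_ADDRESS"]
SUPERVISOR_API_KEY = os.environ["BALENA_SUPERVISOR_API_KEY"]


def wait_for_supervisor(max_attempts=30, retry_delay_sec=2.0):
    """Poll the supervisor's /ping endpoint until it responds or we give up."""
    for _ in range(max_attempts):
        try:
            response = requests.get(
                "%s/ping" % SUPERVISOR_ADDRESS,
                params={"apikey": SUPERVISOR_API_KEY},
                timeout=5.0,
            )
            if response.ok:
                return True
        except requests.ConnectionError:
            # Supervisor not accepting connections yet (common right after
            # boot, since it starts independently of the app containers).
            pass
        time.sleep(retry_delay_sec)
    return False


if __name__ == "__main__":
    if not wait_for_supervisor():
        # Exit non-zero so the restart policy brings the container back up
        # instead of looping on connection refused forever.
        raise SystemExit("Supervisor unreachable after retries; restarting.")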

Hey Adam,
Yes, please keep us posted on your findings.

Hi guys,

Quick update: we added update locking to our nautilus service, and that seems to be working more or less to prevent the supervisor from killing it while it's active. However, we are definitely still seeing the supervisor's "Network changes detected" condition, followed by the supervisor restarting a service unnecessarily.
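
For context, the update locking we added is basically the documented lockfile approach. A rough sketch of what it looks like on our side, assuming the standard /tmp/balena/updates.lock path (the helper name and usage are illustrative, not our exact code):

# Rough sketch of the update locking added to nautilus. The lock path assumes
# balena's documented /tmp/balena/updates.lock; supervisor_update_lock() is an
# illustrative helper name, not our real code.
import fcntl
import os
from contextlib import contextmanager

LOCK_PATH = "/tmp/balena/updates.lock"


@contextmanager
def supervisor_update_lock():
    """Hold an exclusive flock on the update lockfile while work is active.

    While the lock is held, the supervisor should defer killing/updating this
    service; the lock is released as soon as the block exits.
    """
    os.makedirs(os.path.dirname(LOCK_PATH), exist_ok=True)
    lock_file = open(LOCK_PATH, "w")
    try:
        # LOCK_NB raises immediately if something else already holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()


# Usage: wrap the section that must not be interrupted.
# with supervisor_update_lock():
#     run_active_session()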

I’ve been doing a lot of testing today trying to handle unexpected supervisor communication failures at power on more gracefully (see https://forums.balena.io/t/supervisor-v1-device-api-missing-commit-if-update-pending/). I seem to have found a way to replicate the reset condition pretty reliably, but on our nemo service this time.

Basically:

  1. Power cycle the device
  2. Restart the nemo service manually (via dashboard)
  3. Power cycle the device again

When the device boots, nemo will start automatically, and then about 20 seconds later the supervisor will restart it. I’m seeing a lot of bursts of 2 or 3 repeats of:

Oct 30 18:24:35 40fa62f 28769d824dbb[1074]: [debug]   Replacing container for service nemo because of config changes:
Oct 30 18:24:35 40fa62f 28769d824dbb[1074]: [debug]     Network changes detected

It seems as though they start printing about 5 seconds before the nemo service has started, or at least before its first console print (it's not clear from the journal when the Docker container actually started), and they continue printing in bursts of 2 or 3 occasionally for another ~15 seconds before the supervisor actually performs the service restart. Then it seems to take another ~10 seconds before the container is actually running again.

Throughout these ~25 seconds, the supervisor messages seem to be interleaved with messages about openvpn, including at one point:

Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 NOTE: Pulled options changed on restart, will need to close and reopen TUN/TAP device.
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 Closing TUN/TAP interface
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 /sbin/ip addr del dev resin-vpn local 10.240.13.170 peer 52.4.252.97
Oct 30 18:24:41 40fa62f openvpn[1067]: ip: RTNETLINK answers: Operation not permitted
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 Linux ip addr del failed: external program exited with error status: 2
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 /etc/openvpn-misc/downscript.sh resin-vpn 1500 1555 10.240.13.170 52.4.252.97 init
Oct 30 18:24:42 40fa62f openvpn[1067]: Fri Oct 30 18:24:42 2020 ROUTE_GATEWAY 192.168.0.1/255.255.255.0 IFACE=eth0 HWADDR=f8:dc:7a:43:67:c4
Oct 30 18:24:42 40fa62f openvpn[1067]: Fri Oct 30 18:24:42 2020 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Oct 30 18:24:42 40fa62f openvpn[1067]: Fri Oct 30 18:24:42 2020 Exiting due to fatal error
Oct 30 18:24:42 40fa62f systemd[1]: openvpn.service: Main process exited, code=exited, status=1/FAILURE
Oct 30 18:24:42 40fa62f systemd[1]: openvpn.service: Failed with result 'exit-code'.

Not sure if the two are related though.

Any thoughts on what is triggering the restart behavior? I honestly have no clue why restarting the service once manually before power cycling lets me replicate this, but simply power cycling does not. Power cycling without restarting a service does have a separate issue, which I wrote up here: Duplicate entries in `balena logs` after boot.

Now that it seems like I can replicate this easily, if there’s any journal output or anything you’d like that might be helpful for diagnosing the problem please let me know.

Thanks,

Adam

@dfunckt re the connection refused errors:

We added a small sleep/retry loop to wait for the supervisor to come up, and that seems to work after a few seconds the majority of the time. We have seen a couple of times, though, where the container gets stuck in that state and the connection refused errors just keep happening forever. It's not clear yet why, or how to replicate it. Restarting the service is the only way to resolve it. The fact that we can do that, though, is proof that the supervisor is running and connected to the internet, at least.

I’ll keep an eye out for it and hopefully can give more information if it happens again.

In the meantime, the restarting issue on boot is still a problem that we need to resolve.

Hey Adam, I've figured out a way to trigger that "Network changes detected" condition. All we can do for now is add better logging to see what about the docker network has changed, so we can then work on preventing that change from happening. For example, in my setup I have an ha-proxy container, and whenever I restarted that container I saw the "Network changes detected" message.

I want to confirm something first, but do you have a development device we could put a new version of the Supervisor on, which would give us better logs on what's happening?

Sure, feel free to put it on 40fa62f7a85eb88f5548acf3a13812c0. That’s the device I’m doing all of this crash/restart debugging on. I enabled support access on it.

Great! I'll put the version on the device tomorrow at 21:00 UTC so you can keep doing your thing. I'll let you know if something comes up and I can't build the version in time.

Tomorrow being November 3rd.

Ok awesome, thanks. I’ll keep you posted.

Hey Adam, I forgot that support access is read-only, so I cannot make any changes to the device. Are you able to add my support account as a developer to your application so I can put the custom Supervisor build on it? Sorry for the short notice relative to my ETA for performing the task. I'll receive an email when I'm invited and can then put the custom version on the test device.

I don't have owner access to add developers, but I could get it done if needed. We don't have any additional developer licenses on our account though, so I'm not sure how to do that without being charged for an extra seat. Is there a good way to handle that?

Ah, in that case I can try another way. No action is needed from you; just be aware that I'll roll out the version in the next 2 hours and let you know when it's done. It's a quick rollout, I just have to get a few more things in place.

Ok, no rush. Let me know whenever it’s done and what to look for/do.

Hey Adam, I've gotten the modified Supervisor version onto the device and everything seems stable. It's running our latest commit, so it hasn't gone through our release process at all; please keep that in mind if an issue comes up. We'll revert the device to the previous version when we're done.

The changes just dump the network configuration that the device thinks has changed, so we can see exactly what is different. Feel free to perform the steps you said reliably show the "Network changes detected" message and share the logs! I should be able to reproduce this as well and will see what I can find on my end.

Just tried it. It happened for sure. This time it took ~2 minutes instead of 20 seconds.

Take a look at the journal from the reboot at 23:30:15 onward. Nemo starts at 23:30:19, is killed by the supervisor at 23:31:25 and a second time at 23:31:37, and finally comes back at 23:32:36. I'm not sure what's relevant in the supervisor prints. I don't see the "Network changes detected" message this time, though…

Yup, I do see the following logs, but with no reason given for why this container was killed.

Nov 03 23:31:25 40fa62f resin-supervisor[2667]: [event]   Event: Service kill {"service":{"appId":1623312,"serviceId":494586,"serviceName":"nemo","releaseId":1582887}}
Nov 03 23:31:37 40fa62f resin-supervisor[2667]: [event]   Event: Service exit {"service":{"appId":1623312,"serviceId":494586,"serviceName":"nemo","releaseId":1582887}}
Nov 03 23:31:37 40fa62f resin-supervisor[2667]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494586,"serviceName":"nemo","releaseId":1582887}}
Nov 03 23:31:37 40fa62f resin-supervisor[2667]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494586,"serviceName":"nemo","releaseId":1582887}}

Keep messing around with it and hopefully we can get the "Network changes detected" message to log. This behavior of stopping the service without logging why seems new. I'll continue this tomorrow (8 pm for me).

Also, I fully understand the frustration of having your services restart/stop/start unexpectedly. A few weeks ago Balena had an internal hackathon, and I personally worked on a feature that tells you which of your containers were killed/started, when, and why. It needs more work before we can release it to production, but it's something we want to add.

Ok, will do. By the way, I managed to trigger another supervisor crash. All I did was restart Nemo once just now (~00:25 UTC) without doing anything since I posted earlier, and then the supervisor crashed around 00:31 with what looks like a watchdog failure. Not sure why.

Nov 04 00:25:03 40fa62f balenad[1075]: time="2020-11-04T00:25:03.366241256Z" level=info msg="shim reaped" id=8df9c0597cc53d1ced79110961913f76bdd25c239e44136c7dfbdfec55a5e765
Nov 04 00:25:03 40fa62f balenad[1075]: time="2020-11-04T00:25:03.372051267Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 04 00:25:04 40fa62f af6bce370f97[1075]: 2020-11-04 00:25:04,903 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
...
Nov 04 00:31:03 40fa62f systemd[1]: balena.service: Watchdog timeout (limit 6min)!
Nov 04 00:31:03 40fa62f systemd[1]: balena.service: Killing process 1075 (balenad) with signal SIGABRT.
Nov 04 00:31:03 40fa62f balenad[1075]: SIGABRT: abort
Nov 04 00:31:03 40fa62f balenad[1075]: PC=0x45a52c m=0 sigcode=0
Nov 04 00:31:03 40fa62f balenad[1075]: goroutine 0 [idle]:
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.futex(0x23fd528, 0x0, 0x0, 0x0, 0x0, 0x432ef4, 0x1a0, 0x0, 0x0, 0x4331b0, ...)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/sys_linux_arm64.s:321 +0x1c
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.futexsleep(0x23fd528, 0x0, 0xffffffffffffffff)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/os_linux.go:45 +0x3c
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.notesleep(0x23fd528)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/lock_futex.go:151 +0x84
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.stopm()
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/proc.go:1952 +0xa0
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.findrunnable(0x442004aa00, 0x0)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/proc.go:2415 +0x50c
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.schedule()
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/proc.go:2541 +0xe8
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.park_m(0x442083b200)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/proc.go:2604 +0x90
Nov 04 00:31:03 40fa62f balenad[1075]: runtime.mcall(0x0)
Nov 04 00:31:03 40fa62f balenad[1075]:  /usr/lib/go/src/runtime/asm_arm64.s:169 +0x50
Nov 04 00:31:03 40fa62f balenad[1075]: goroutine 1 [chan receive, 6 minutes]:
Nov 04 00:31:03 40fa62f balenad[1075]: github.com/docker/docker/cmd/dockerd.(*DaemonCli).start(0x44207be5a0, 0x44202b0000, 0x0, 0x0)
...
Nov 04 00:31:03 40fa62f balenad[1075]:  /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.17-dev+git2ab17e0536b6a4528b33c75e8f350447e9882af0-r0/git/src/import/.gopath/src/github.com/docker/docker/daemon/attach.go:177 +0xd0
Nov 04 00:31:03 40fa62f balenad[1075]: github.com/docker/docker/daemon.(*Daemon).ContainerAttach(0x442041c3c0, 0x44201033c7, 0x40, 0x4420a08b40, 0x0, 0x0)
Nov 04 00:31:03 40fa62f balenad[1075]:  /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.17-dev+git2ab17e0536b6a4528b33c75e8f350447e9882af0-r0/git/src/import/.gopath/src/github.com/docker/docker/daemon/attach.go:74 +0x438
Nov 04 00:31:03 40fa62f balenad[1075]: github.com/docker/docker/api/server/router/container.(*containerRouter).postContainersAttach(0x4420a293c0, 0x1984600, 0x4420a08a80, 0x1982200, 0x4421604700, 0x4420a40a00, 0x4420a089f0, 0x17ae68f, 0x4)
Nov 04 00:31:03 40fa62f balenad[1075]:  /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.17-dev+git2ab17e0536b6a4528b33c75e8f350447e9882af0-r0/git/src/import/.gopath/src/github.com/docker/docker/api/server/router/container/container_routes.go:568 +0x2d

Might be coincidence but 6 minutes is almost exactly the amount of time between when I hit restart and when the watchdog fired.