Unexpected supervisor crash/container restart on network change

Hey folks,

I’ve been running one of our devices today, and twice so far it seems to have suddenly stopped operating and restarted our nautilus service (the main application, which runs continuously once started). The device logs show normal operation and then suddenly:

[Logs]    [10/11/2020, 5:20:57 PM] Supervisor starting
[Logs]    [10/11/2020, 5:20:59 PM] Killing service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:20:59 PM] Service is already stopped, removing container 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:20:59 PM] Killed service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:21:00 PM] Installing service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:21:00 PM] Installed service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:21:00 PM] Starting service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:21:02 PM] Started service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'

I tried to ssh into the nemo container - the service that issues control commands to nautilus via Python, and monitors nautilus for errors - and got the following:

OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused "process_linux.go:83: starting setns process caused \"fork/exec /proc/self/exe: no such file or directory\"": unknown

I can ssh into nautilus just fine. I’m guessing both containers died for some reason and nautilus restarted (above), which is why I can ssh into it, but nemo did not. For what it’s worth, in docker compose nautilus is set to restart: on-failure and nemo is set to restart: always.

I’m not quite sure how to diagnose this. It’s happening on device 40fa62f7a85eb88f5548acf3a13812c0.

Any advice?

Thanks,

Adam

OK, a quick update: I looked in journalctl around the time of the restart, and it looks like the supervisor restarted, or at least restarted my service, due to a claimed network settings change:

Oct 11 21:20:39 40fa62f NetworkManager[1024]: <info>  [1602451239.7688] manager: (veth08a6fc0): new Veth device (/org/freedesktop/NetworkManager/Devices/48)
...
Oct 11 21:20:42 40fa62f NetworkManager[1024]: <info>  [1602451242.6981] device (vethd45bf23): carrier: link connected
Oct 11 21:20:42 40fa62f NetworkManager[1024]: <info>  [1602451242.7016] device (veth2400642): carrier: link connected
...
Oct 11 21:20:44 40fa62f avahi-daemon[1018]: Registering new address record for fe80::a466:1dff:fe33:68bd on veth2400642.*.
Oct 11 21:20:54 40fa62f resin-supervisor[14273]: resin_supervisor
Oct 11 21:20:54 40fa62f resin-supervisor[14353]: active
Oct 11 21:20:54 40fa62f resin-supervisor[14354]: Container config has not changed
Oct 11 21:20:57 40fa62f systemd-journald[603]: Suppressed 1345 messages from balena.service
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [info]    Supervisor v11.4.10 starting up...
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [info]    Supervisor v11.4.10 starting up...
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [debug]   Starting event tracker
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [debug]   Starting event tracker
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [debug]   Starting logging infrastructure
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [debug]   Starting logging infrastructure
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [debug]   Starting api binder
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [debug]   Starting api binder
Oct 11 21:20:58 40fa62f 28769d824dbb[13183]: [event]   Event: Supervisor start {}
Oct 11 21:20:58 40fa62f resin-supervisor[14354]: [event]   Event: Supervisor start {}
Oct 11 21:20:58 40fa62f 28769d824dbb[13183]: [debug]   Performing database cleanup for container log timestamps
Oct 11 21:20:58 40fa62f resin-supervisor[14354]: [debug]   Performing database cleanup for container log timestamps
Oct 11 21:20:58 40fa62f resin-supervisor[14354]: [info]    Previous engine snapshot was not stored. Skipping cleanup.
Oct 11 21:20:58 40fa62f 28769d824dbb[13183]: [info]    Previous engine snapshot was not stored. Skipping cleanup.
...
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [debug]   Replacing container for service nautilus because of config changes:
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [debug]   Replacing container for service nautilus because of config changes:
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [debug]     Network changes detected
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [debug]     Network changes detected
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [event]   Event: Service kill {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376}}
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [event]   Event: Service kill {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":155837>
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [event]   Event: Service already stopped {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId">
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [event]   Event: Service already stopped {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releas>
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376}}
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":155837>
Oct 11 21:21:00 40fa62f 28769d824dbb[13183]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376>
Oct 11 21:21:00 40fa62f resin-supervisor[14354]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":155>
Oct 11 21:21:00 40fa62f 28769d824dbb[13183]: [event]   Event: Service installed {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":15583>
Oct 11 21:21:00 40fa62f resin-supervisor[14354]: [event]   Event: Service installed {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1>
Oct 11 21:21:00 40fa62f 28769d824dbb[13183]: [event]   Event: Service start {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376}}
Oct 11 21:21:00 40fa62f resin-supervisor[14354]: [event]   Event: Service start {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":15583>
Oct 11 21:21:00 40fa62f NetworkManager[1024]: <info>  [1602451260.5810] manager: (vethbb2a6d5): new Veth device (/org/freedesktop/NetworkManager/Devices/54)
Oct 11 21:21:00 40fa62f kernel: br-8d5301e37d5e: port 9(vethf8042b2) entered blocking state
Oct 11 21:21:00 40fa62f kernel[893]: br-8d5301e37d5e: port 9(vethf8042b2) entered blocking state
...
Oct 11 21:21:04 40fa62f resin-supervisor[14354]: [success] Device state apply success
Oct 11 21:21:04 40fa62f 28769d824dbb[13183]: [success] Device state apply success
Oct 11 21:21:08 40fa62f 28769d824dbb[13183]: [info]    Internet Connectivity: OK
Oct 11 21:21:08 40fa62f resin-supervisor[14354]: [info]    Internet Connectivity: OK

Not 100% sure if I’m reading this right, but it looks like there was a quick disconnect/reconnect on the ethernet interface (possible - this one is going through a wifi bridge), and for some reason that triggered a full supervisor restart. Is that correct? Why would a change, or in this case a simple reconnect, on an external network interface cause the supervisor or containers to restart? Our devices need to operate continuously through brief internet outages, and the services can’t shut down or restart for something like this.

I granted support access to the device in case anyone wants to take a look. I’ve also attached the full journalctl output for the above snippet here:

atlas-container-crash.log (19.0 KB)

I also noticed a ton of blob data prints from my containers that seem to be flooding the log output. I’m not really sure what those are?

Also, not sure if it’s relevant, but just before that it looks like something segfaulted in balenad and took the supervisor down with it:

Oct 11 21:20:33 40fa62f balenad[1090]: fatal error: unexpected signal during runtime execution
Oct 11 21:20:33 40fa62f balenad[1090]: [signal SIGSEGV: segmentation violation code=0x2 addr=0xffff9e626150 pc=0xffff9e4b0a94]
Oct 11 21:20:33 40fa62f balenad[1090]: runtime stack:
Oct 11 21:20:33 40fa62f balenad[1090]: runtime.throw(0x17ffed4, 0x2a)
Oct 11 21:20:33 40fa62f balenad[1090]:         /usr/lib/go/src/runtime/panic.go:616 +0x5c
Oct 11 21:20:33 40fa62f balenad[1090]: runtime.sigpanic()
Oct 11 21:20:33 40fa62f balenad[1090]:         /usr/lib/go/src/runtime/signal_unix.go:372 +0x23c
Oct 11 21:20:33 40fa62f balenad[1090]: goroutine 18518 [syscall]:
Oct 11 21:20:33 40fa62f balenad[1090]: runtime.cgocall(0x131d930, 0x4420c50bf8, 0x100)
...
Oct 11 21:20:33 40fa62f balenad[1090]: goroutine 26 [select]:
Oct 11 21:20:33 40fa62f balenad[1090]: github.com/docker/docker/libcontainerd/supervisor.(*remote).monitorDaemon(0x44200db380, 0x1984540, 0x44207e4740)
Oct 11 21:20:33 40fa62f balenad[1090]:         /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.17-dev+git2ab17e0536b6a4528b33c75e8f350447e9882af0-r0/git/src/import/.gopath/src/github.com/docker/docker/libcontainerd/supervisor/remote_daemon.go:252 +0x624
Oct 11 21:20:33 40fa62f balenad[1090]: created by github.com/docker/docker/libcontainerd/supervisor.Start
...
Oct 11 21:20:33 40fa62f systemd[1]: balena.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 11 21:20:33 40fa62f systemd[1]: balena.service: Failed with result 'exit-code'.
Oct 11 21:20:33 40fa62f resin-supervisor[4196]: time="2020-10-11T21:20:33.139838050Z" level=error msg="error waiting for container: unexpected EOF"
Oct 11 21:20:33 40fa62f systemd[1]: resin-supervisor.service: Main process exited, code=exited, status=125/n/a
Oct 11 21:20:33 40fa62f systemd[1]: resin-supervisor.service: Failed with result 'exit-code'.

I attached a text file with the complete stack trace (it’s super long). Not sure how helpful it is but it’s there just in case.

atlas-supervisor-crash.log (149.6 KB)

Thanks,

Adam

Hi there – thanks very much for getting in contact with us, and for all the info you’ve provided. After taking a look at the logs, I have a couple of questions and observations:

  • The crashes in the supervisor and engine are definitely concerning. We should keep in mind the possibility that there may be larger problems on this device.

  • Can you confirm whether this device has eMMC storage? SD cards generally have a shorter lifespan than eMMC, and are often a source of bad behaviour once they begin degrading.

  • We have seen similar reports from other users about a network configuration change being logged immediately before container restart, and we are investigating. We will be checking with the supervisor team to get their input on this.

  • Are you seeing this behaviour on any other devices in your application? Are your other devices in this application running the same OS and supervisor version? I’m trying to determine if this problem seems correlated with one particular version.

  • The blob messages you’re seeing can be expanded by running journalctl with the --all option; you can read more about this in our balenaOS Masterclass.

Thanks again,
Hugh

Hi Hugh,

  • Agreed, it’s possible it’s a device-specific issue. This is a brand new device and does not seem to be exhibiting any other symptoms though.
  • Yes, the device is running off of eMMC, flashed via the 2.50.1 flasher image. It does not have an SD card attached.
  • We have not noticed this issue on other devices, but most of those are customer devices, and unfortunately I don’t know whether they would have reported the issue to us or simply rebooted the device. As for our other non-customer devices, we haven’t noticed it there either, but honestly we haven’t been looking this closely until now.
    • All devices in the application are using the same OS and supervisor version (except one old model that we’re not actively using).

Thanks,

Adam

Hi Adam – thanks for the additional info. We will take this to the team investigating this, and will get back to you as soon as we have more news.

All the best,
Hugh

Hi Adam,

While we are working on adding additional logging to help detect what specifically caused the restart you experienced, I want to mention that the network changes message refers to configuration changes and not changes in the actual network. You may want to add an update lock (https://www.balena.io/docs/learn/deploy/release-strategy/update-locking/) in your service to prevent restarts while we continue to look into this. We are also investigating the crashes you reported and will be in touch with any findings or additional questions as they arise.
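
In case it helps, here is a minimal sketch of taking that lock from a Python service. The path and locking style below are assumptions based on the linked docs (newer supervisors look for /tmp/balena/updates.lock, older ones for /tmp/resin/resin-updates.lock), so please double-check them against the docs for your OS/supervisor version:

import fcntl
import os

# Default lockfile path per the update-locking docs above; older supervisors /
# base images use /tmp/resin/resin-updates.lock instead -- adjust if needed.
LOCKFILE = "/tmp/balena/updates.lock"

def take_update_lock():
    """Create the lockfile and hold an exclusive lock on it.

    While the lock is held the supervisor should defer killing/restarting
    this service; close the returned fd whenever a restart is acceptable.
    """
    os.makedirs(os.path.dirname(LOCKFILE), exist_ok=True)
    fd = os.open(LOCKFILE, os.O_CREAT | os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # raises if the lock is already held
    return fd

def release_update_lock(fd):
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)

The key is to hold the lock only while nautilus is doing work that must not be interrupted, and to release it otherwise, so that legitimate updates can still go through.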

Hi Alan,

Thanks for the update. Please let me know if there’s anything I can do to help debug at any point.

I’m not 100% sure I follow what you mean by configuration changes. Are you talking about user network parameters, or is something else going on under the hood? When this was happening the device was simply sitting on the table, so there were definitely no user (i.e., my) configuration changes. If this could be triggered by the wifi -> ethernet bridge it’s connected to blinking out briefly or something, then maybe there’s something to that. In that case it’s definitely something to resolve once we can pinpoint what’s going on.

Adding an update lock is a good idea. We’ll do that.

Thanks,

Adam

From what I’ve gathered of the investigation so far, it’s the configuration of the network defined in the docker-compose file, as opposed to something in the network the device is attached to. As the team zeroes in it should become clearer, and we’ll certainly keep you updated.

Gotcha. Shouldn’t be anything funky in our docker compose network-wise, but here it is for reference:

version: '2.1'
networks: {}
volumes:
  resin-data: {}
services:
  nautilus:
    build: ./nautilus
    # Note: Disabling 'always' restart until Nautilus application handles
    # start/stop behavior internally.
    restart: on-failure
    devices:
      - "/dev/ttymxc1:/dev/ttymxc1"
      - "/dev/ttymxc2:/dev/ttymxc2"
    cap_add:
      - SYS_RAWIO
    volumes:
      - 'resin-data:/data'
    ports:
      - "5500:5500" # Vision ZMQ
      - "30200:30200" # NMEA-0183
      - "30201:30201" # FusionEngine Client
    labels:
      io.balena.features.procfs: '1' # Needed to enable core dumps
  nemo:
    build: ./nemo
    privileged: true
    restart: always
    volumes:
      - 'resin-data:/data'
    ports:
      - "8888:8888" # Nemo REST Interface
    labels:
      io.balena.features.supervisor-api: '1'
      io.balena.features.dbus: '1'
      io.balena.features.procfs: '1'
    environment:
      - DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket
  ui-bridge:
    build: ./nautilus
    restart: always
    ports:
      - "3000:3000" # UI Websocket
    command: ["/home/point_one/bootstrap.sh", "--force-start",
              "./bin/ui_websocket_bridge",
              "--ui-multicast-port=30108",
              "--ws-port=3000"]
  ntrip-proxy:
    build: ./nautilus
    restart: always
    ports:
      - "2101:2101" # Proxied NTRIP Output
    command: ["/home/point_one/bootstrap.sh", "--force-start",
              "./bin/ntrip_proxy",
              "--polaris-api-key=${P1_POLARIS_API_KEY}",
              "--polaris-unique-id=${P1_DEVICE_ID:-${BALENA_DEVICE_NAME_AT_INIT}}-ntrip-proxy",
              "0.0.0.0", "2101", "point_one/applications/ntrip_proxy"]
  eos:
    build: ./eos
    restart: always
    volumes:
      - 'resin-data:/data'
    ports:
      - "80:80" # EOS UI

There is one thing we have noticed that may or may not be related, and that we haven’t had a chance to pinpoint yet. The nemo service (Python) talks to the supervisor to control things, query the software version, etc. When the device first powers on, more often than not we get an infinite loop of connection refused errors from the supervisor. Restarting the nemo container fixes it every time. Power cycling the device fixes it most of the time. Waiting and continuing to retry never resolves it. For customers, power cycling has been the only solution (since they can’t restart containers).

I was planning to make a ticket for it when I had some time to dig into it further, and also to make a slightly better work-around in our code where the service restarts itself if it detects the condition, but since it may be related I figured I’d mention it.
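
For what it’s worth, the work-around I have in mind is roughly the sketch below. This is hypothetical code (the names and the threshold are made up); it just relies on the fact that nemo is set to restart: always, so exiting the process gets the container recreated, which is effectively the same as the manual restart that clears the condition today:

import sys

import requests  # or whatever HTTP client nemo already uses

MAX_CONSECUTIVE_REFUSALS = 30  # arbitrary "we look stuck" threshold

_consecutive_refusals = 0

def supervisor_get(url, api_key):
    """Query the supervisor; bail out of the container if we appear stuck."""
    global _consecutive_refusals
    try:
        resp = requests.get(url, params={"apikey": api_key}, timeout=5)
        _consecutive_refusals = 0
        return resp
    except requests.ConnectionError:
        _consecutive_refusals += 1
        if _consecutive_refusals >= MAX_CONSECUTIVE_REFUSALS:
            # With restart: always, the engine recreates the container after exit.
            sys.exit("Supervisor unreachable for too long; exiting so the container restarts")
        raise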

When the device first powers on, more often than not we get an infinite loop of connection refused errors from the supervisor.

On boot, your application containers are started independently of the Supervisor, which means the Supervisor may come up after your app and take a while to start accepting requests, so your application will need to handle these errors gracefully.
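
Something along these lines is usually enough. This is just a rough sketch, assuming the io.balena.features.supervisor-api label (which your nemo service already has) so that BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY are injected into the container; the timeouts are arbitrary:

import os
import time

import requests  # assuming an HTTP client is available in the image

SUPERVISOR = os.environ["BALENA_SUPERVISOR_ADDRESS"]
API_KEY = os.environ["BALENA_SUPERVISOR_API_KEY"]  # most endpoints require this key

def wait_for_supervisor(timeout_s=120, interval_s=2.0):
    """Poll the Supervisor's /ping endpoint until it responds or we give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{SUPERVISOR}/ping",
                            params={"apikey": API_KEY}, timeout=5).ok:
                return True
        except requests.ConnectionError:
            pass  # supervisor not listening yet -- the "connection refused" case
        time.sleep(interval_s)
    return False

# Run once at startup, before issuing any real supervisor requests.
if not wait_for_supervisor():
    raise SystemExit("Supervisor never became reachable")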

Ah, interesting. We were under the impression that the supervisor always starts and stops the containers. We didn’t think that boot would be special and that they could start on their own independently.

In theory our code does have an initial test that tries to communicate with the supervisor and should restart the container on failure. The symptom we’ve been seeing suggests that the check passes, but the queries after that, triggered by status requests from the UI, are the ones that fail. I need to do a little more digging when I have some time, but I’ll at least make sure to add an async testing loop and some delay between query attempts. That might help.

Hey Adam,
Yes, please keep us posted on your findings.

Hi guys,

Quick update: we added update locking to our nautilus service, and that seems to be more or less working to prevent the supervisor from killing it while it’s active, but we are definitely still seeing the supervisor’s Network changes detected condition followed by the supervisor restarting a service unnecessarily.

I’ve been doing a lot of testing today trying to handle unexpected supervisor communication failures at power on more gracefully (see https://forums.balena.io/t/supervisor-v1-device-api-missing-commit-if-update-pending/). I seem to have found a way to replicate the reset condition pretty reliably, but on our nemo service this time.

Basically:

  1. Power cycle the device
  2. Restart the nemo service manually (via dashboard)
  3. Power cycle the device again

When the device boots, nemo will start automatically, and then about 20 seconds later the supervisor will restart it. I’m seeing a lot of bursts of 2 or 3 repeats of:

Oct 30 18:24:35 40fa62f 28769d824dbb[1074]: [debug]   Replacing container for service nemo because of config changes:
Oct 30 18:24:35 40fa62f 28769d824dbb[1074]: [debug]     Network changes detected

It seems as though they start printing about 5 seconds before the nemo service has started, or at least before its first console print (it’s not clear from the journal when the Docker container actually started), and they continue printing in bursts of 2 or 3 occasionally for another ~15 seconds before the supervisor actually performs the service restart. Then it seems to take another ~10 seconds before the container is actually running again.

Throughout this ~25 seconds, the supervisor messages seem to be interleaved with messages about openvpn, including at one point:

Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 NOTE: Pulled options changed on restart, will need to close and reopen TUN/TAP device.
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 Closing TUN/TAP interface
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 /sbin/ip addr del dev resin-vpn local 10.240.13.170 peer 52.4.252.97
Oct 30 18:24:41 40fa62f openvpn[1067]: ip: RTNETLINK answers: Operation not permitted
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 Linux ip addr del failed: external program exited with error status: 2
Oct 30 18:24:41 40fa62f openvpn[1067]: Fri Oct 30 18:24:41 2020 /etc/openvpn-misc/downscript.sh resin-vpn 1500 1555 10.240.13.170 52.4.252.97 init
Oct 30 18:24:42 40fa62f openvpn[1067]: Fri Oct 30 18:24:42 2020 ROUTE_GATEWAY 192.168.0.1/255.255.255.0 IFACE=eth0 HWADDR=f8:dc:7a:43:67:c4
Oct 30 18:24:42 40fa62f openvpn[1067]: Fri Oct 30 18:24:42 2020 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Oct 30 18:24:42 40fa62f openvpn[1067]: Fri Oct 30 18:24:42 2020 Exiting due to fatal error
Oct 30 18:24:42 40fa62f systemd[1]: openvpn.service: Main process exited, code=exited, status=1/FAILURE
Oct 30 18:24:42 40fa62f systemd[1]: openvpn.service: Failed with result 'exit-code'.

Not sure if the two are related though.

Any thoughts on what is triggering the restarting behavior? I honestly have no clue why restarting the service once manually before power cycling lets me replicate this, but simply power cycling does not. Power cycling without restarting a service does have a separate issue, which I wrote up here: Duplicate entries in `balena logs` after boot.

Now that it seems like I can replicate this easily, if there’s any journal output or anything you’d like that might be helpful for diagnosing the problem please let me know.

Thanks,

Adam

@dfunckt re the connection refused errors:

We added a small sleep/retry loop to wait for the supervisor to come up. That seems to work after a few seconds the majority of the time. We have seen a couple of times, though, where the container gets stuck in that state and the connection refused errors just keep happening forever. It’s not clear why, or how to replicate it, yet. Restarting the service is the only way to resolve it. The fact that we can do that, though, is at least proof that the supervisor is operating and on the internet.

I’ll keep an eye out for it and hopefully can give more information if it happens again.

In the meantime, the restarting issue on boot is still a problem that we need to resolve.

Hey Adam, I’ve figured out a way to trigger that Network changes detected condition myself. All we can do now is add better logging to see what about the docker network has changed to trigger this, so we can then work on preventing that change from happening. For example, in my setup I have an ha-proxy container, and whenever I restarted that container I saw the Network changes detected message.

I want to confirm something first: do you have a development device we can put a new version of the Supervisor on, one that will give us better logs on what’s happening?

Sure, feel free to put it on 40fa62f7a85eb88f5548acf3a13812c0. That’s the device I’m doing all of this crash/restart debugging on. I enabled support access on it.

Great! I’ll put the version on the device tomorrow at 21:00 UTC so you can keep doing your thing. I’ll let you know if something comes up and I can’t build the version in time.

Tomorrow being November 3rd.

Ok awesome, thanks. I’ll keep you posted.

Hey Adam, I forgot that support access is read-only, so I can’t make any changes to the device. Are you able to add my support account as a developer on your application so I can put the custom Supervisor build on it? Sorry for the short notice relative to my ETA for performing the task. I’ll receive an email when I’m invited, and then I can put the custom version on the test device.