Unexpected supervisor crash/container restart on network change

Hey folks,

I’ve been running one of our devices today, and twice so far it has suddenly stopped operating and restarted our nautilus service (the main application, which runs continuously once started). The logs for the device show normal operation and then suddenly:

[Logs]    [10/11/2020, 5:20:57 PM] Supervisor starting
[Logs]    [10/11/2020, 5:20:59 PM] Killing service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:20:59 PM] Service is already stopped, removing container 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:20:59 PM] Killed service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:21:00 PM] Installing service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:21:00 PM] Installed service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:21:00 PM] Starting service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'
[Logs]    [10/11/2020, 5:21:02 PM] Started service 'nautilus sha256:9356c1596f0f21f6788009f2ab0dc2b2412a999bb8ff54b9ab9eede216b3bc57'

I tried to ssh into the nemo container - the service that issues control commands to nautilus via Python, and monitors nautilus for errors - and got the following:

OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused "process_linux.go:83: starting setns process caused \"fork/exec /proc/self/exe: no such file or directory\"": unknown

I can ssh into nautilus just fine. I’m guessing both containers died for some reason and nautilus restarted (above), which is why I can ssh into it, but nemo did not. For what it’s worth, in docker compose nautilus is set to restart: on-failure and nemo is set to restart: always.

I’m not quite sure how to diagnose this. It’s happening on device 40fa62f7a85eb88f5548acf3a13812c0.

Any advice?

Thanks,

Adam


Ok, a quick update: I looked in journalctl around the time of the restart, and it looks like the supervisor restarted, or at least restarted my service, due to a claimed network settings change:

Oct 11 21:20:39 40fa62f NetworkManager[1024]: <info>  [1602451239.7688] manager: (veth08a6fc0): new Veth device (/org/freedesktop/NetworkManager/Devices/48)
...
Oct 11 21:20:42 40fa62f NetworkManager[1024]: <info>  [1602451242.6981] device (vethd45bf23): carrier: link connected
Oct 11 21:20:42 40fa62f NetworkManager[1024]: <info>  [1602451242.7016] device (veth2400642): carrier: link connected
...
Oct 11 21:20:44 40fa62f avahi-daemon[1018]: Registering new address record for fe80::a466:1dff:fe33:68bd on veth2400642.*.
Oct 11 21:20:54 40fa62f resin-supervisor[14273]: resin_supervisor
Oct 11 21:20:54 40fa62f resin-supervisor[14353]: active
Oct 11 21:20:54 40fa62f resin-supervisor[14354]: Container config has not changed
Oct 11 21:20:57 40fa62f systemd-journald[603]: Suppressed 1345 messages from balena.service
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [info]    Supervisor v11.4.10 starting up...
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [info]    Supervisor v11.4.10 starting up...
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [debug]   Starting event tracker
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [debug]   Starting event tracker
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [debug]   Starting logging infrastructure
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [debug]   Starting logging infrastructure
Oct 11 21:20:57 40fa62f 28769d824dbb[13183]: [debug]   Starting api binder
Oct 11 21:20:57 40fa62f resin-supervisor[14354]: [debug]   Starting api binder
Oct 11 21:20:58 40fa62f 28769d824dbb[13183]: [event]   Event: Supervisor start {}
Oct 11 21:20:58 40fa62f resin-supervisor[14354]: [event]   Event: Supervisor start {}
Oct 11 21:20:58 40fa62f 28769d824dbb[13183]: [debug]   Performing database cleanup for container log timestamps
Oct 11 21:20:58 40fa62f resin-supervisor[14354]: [debug]   Performing database cleanup for container log timestamps
Oct 11 21:20:58 40fa62f resin-supervisor[14354]: [info]    Previous engine snapshot was not stored. Skipping cleanup.
Oct 11 21:20:58 40fa62f 28769d824dbb[13183]: [info]    Previous engine snapshot was not stored. Skipping cleanup.
...
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [debug]   Replacing container for service nautilus because of config changes:
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [debug]   Replacing container for service nautilus because of config changes:
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [debug]     Network changes detected
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [debug]     Network changes detected
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [event]   Event: Service kill {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376}}
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [event]   Event: Service kill {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":155837>
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [event]   Event: Service already stopped {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId">
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [event]   Event: Service already stopped {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releas>
Oct 11 21:20:59 40fa62f 28769d824dbb[13183]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376}}
Oct 11 21:20:59 40fa62f resin-supervisor[14354]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":155837>
Oct 11 21:21:00 40fa62f 28769d824dbb[13183]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376>
Oct 11 21:21:00 40fa62f resin-supervisor[14354]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":155>
Oct 11 21:21:00 40fa62f 28769d824dbb[13183]: [event]   Event: Service installed {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":15583>
Oct 11 21:21:00 40fa62f resin-supervisor[14354]: [event]   Event: Service installed {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1>
Oct 11 21:21:00 40fa62f 28769d824dbb[13183]: [event]   Event: Service start {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1558376}}
Oct 11 21:21:00 40fa62f resin-supervisor[14354]: [event]   Event: Service start {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":15583>
Oct 11 21:21:00 40fa62f NetworkManager[1024]: <info>  [1602451260.5810] manager: (vethbb2a6d5): new Veth device (/org/freedesktop/NetworkManager/Devices/54)
Oct 11 21:21:00 40fa62f kernel: br-8d5301e37d5e: port 9(vethf8042b2) entered blocking state
Oct 11 21:21:00 40fa62f kernel[893]: br-8d5301e37d5e: port 9(vethf8042b2) entered blocking state
...
Oct 11 21:21:04 40fa62f resin-supervisor[14354]: [success] Device state apply success
Oct 11 21:21:04 40fa62f 28769d824dbb[13183]: [success] Device state apply success
Oct 11 21:21:08 40fa62f 28769d824dbb[13183]: [info]    Internet Connectivity: OK
Oct 11 21:21:08 40fa62f resin-supervisor[14354]: [info]    Internet Connectivity: OK

Not 100% sure if I’m reading this right, but it looks like there was a quick disconnect/reconnect on the ethernet interface (possible - this one is going through a wifi bridge), and for some reason that triggered a full supervisor restart. Is that correct? Why would a change, or in this case a simple reconnect, on an external network interface cause the supervisor or containers to restart? Our devices need to operate continuously through brief internet outages, and the services can’t shut down or restart for something like this.

I granted support access to the device in case anyone wants to take a look. I’ve also attached the full journalctl output for the above snippet here:

atlas-container-crash.log (19.0 KB)

I also noticed a ton of blob data prints from my containers that seem to be flooding the log output a little. I’m not really sure what those are.

Also, not sure if it’s relevant or not, but just before that it looks like balenad (the container engine) segfaulted, taking the supervisor down with it:

Oct 11 21:20:33 40fa62f balenad[1090]: fatal error: unexpected signal during runtime execution
Oct 11 21:20:33 40fa62f balenad[1090]: [signal SIGSEGV: segmentation violation code=0x2 addr=0xffff9e626150 pc=0xffff9e4b0a94]
Oct 11 21:20:33 40fa62f balenad[1090]: runtime stack:
Oct 11 21:20:33 40fa62f balenad[1090]: runtime.throw(0x17ffed4, 0x2a)
Oct 11 21:20:33 40fa62f balenad[1090]:         /usr/lib/go/src/runtime/panic.go:616 +0x5c
Oct 11 21:20:33 40fa62f balenad[1090]: runtime.sigpanic()
Oct 11 21:20:33 40fa62f balenad[1090]:         /usr/lib/go/src/runtime/signal_unix.go:372 +0x23c
Oct 11 21:20:33 40fa62f balenad[1090]: goroutine 18518 [syscall]:
Oct 11 21:20:33 40fa62f balenad[1090]: runtime.cgocall(0x131d930, 0x4420c50bf8, 0x100)
...
Oct 11 21:20:33 40fa62f balenad[1090]: goroutine 26 [select]:
Oct 11 21:20:33 40fa62f balenad[1090]: github.com/docker/docker/libcontainerd/supervisor.(*remote).monitorDaemon(0x44200db380, 0x1984540, 0x44207e4740)
Oct 11 21:20:33 40fa62f balenad[1090]:         /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.17-dev+git2ab17e0536b6a4528b33c75e8f350447e9882af0-r0/git/src/import/.gopath/src/github.com/docker/docker/libcontainerd/supervisor/remote_daemon.go:252 +0x624
Oct 11 21:20:33 40fa62f balenad[1090]: created by github.com/docker/docker/libcontainerd/supervisor.Start
...
Oct 11 21:20:33 40fa62f systemd[1]: balena.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 11 21:20:33 40fa62f systemd[1]: balena.service: Failed with result 'exit-code'.
Oct 11 21:20:33 40fa62f resin-supervisor[4196]: time="2020-10-11T21:20:33.139838050Z" level=error msg="error waiting for container: unexpected EOF"
Oct 11 21:20:33 40fa62f systemd[1]: resin-supervisor.service: Main process exited, code=exited, status=125/n/a
Oct 11 21:20:33 40fa62f systemd[1]: resin-supervisor.service: Failed with result 'exit-code'.

I attached a text file with the complete stack trace (it’s super long). Not sure how helpful it is but it’s there just in case.

atlas-supervisor-crash.log (149.6 KB)

Thanks,

Adam


Hi there – thanks very much for getting in contact with us, and for all the info you’ve provided. After taking a look at the logs, there are a couple questions/observations I have:

  • The crashes in the supervisor and engine are definitely concerning. We should keep in mind the possibility that there may be larger problems on this device.

  • Can you confirm whether this device has eMMC storage? SD cards generally have a shorter lifespan than eMMC, and are often a source of bad behaviour once they begin degrading.

  • We have seen similar reports from other users about a network configuration change being logged immediately before container restart, and we are investigating. We will be checking with the supervisor team to get their input on this.

  • Are you seeing this behaviour on any other devices in your application? Are your other devices in this application running the same OS and supervisor version? I’m trying to determine if this problem seems correlated with one particular version.

  • The blob messages you’re seeing can be expanded by running journalctl with the --all option; you can read more about this in our balenaOS Masterclass.

Thanks again,
Hugh

Hi Hugh,

  • Agreed, it’s possible it’s a device-specific issue. This is a brand new device and does not seem to be exhibiting any other symptoms though.
  • Yes, the device is running off of eMMC, flashed via the 2.50.1 flasher image. It does not have an SD card attached.
  • We have not noticed this issue on other devices, but most of those are customer devices and unfortunately I don’t know that they would have reported the issue to us or just rebooted the device. As for other non-customer devices, we haven’t noticed there either but honestly we haven’t been looking this closely until now.
    • All devices in the application are using the same OS and supervisor version (except one old model that we’re not actively using)

Thanks,

Adam

Hi Adam – thanks for the additional info. We will take this to the team investigating this, and will get back to you as soon as we have more news.

All the best,
Hugh

Hi Adam,

While we are working on adding additional logging to help detect what specifically caused the restart you experienced, I want to mention that the network changes message refers to configuration changes and not changes in the actual network. You may want to add an update lock (https://www.balena.io/docs/learn/deploy/release-strategy/update-locking/) in your service to prevent restarts while we continue to look into this. We are also investigating the crashes you reported and will be in touch with any findings or additional questions as they arise.
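
For reference, a minimal sketch of holding the update lock from Python, assuming the documented lockfile mechanism (an exclusive flock on `/tmp/balena/updates.lock`) described at the link above; treat the exact path and wiring as something to verify against the docs for your OS/supervisor version:

```python
import fcntl
import os

# Path documented in balena's update-locking guide (assumption: verify
# against your balenaOS/supervisor version).
LOCKFILE = "/tmp/balena/updates.lock"

class UpdateLock:
    """Hold an exclusive flock on the balena update lockfile.

    While the lock is held, the supervisor should defer applying updates
    (and the restarts they imply) to this service.
    """

    def __init__(self, path=LOCKFILE):
        self.path = path
        self.fd = None

    def __enter__(self):
        os.makedirs(os.path.dirname(self.path), exist_ok=True)
        self.fd = os.open(self.path, os.O_CREAT | os.O_RDWR)
        # Non-blocking exclusive lock; raises if someone else holds it.
        fcntl.flock(self.fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return self

    def __exit__(self, *exc):
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        os.close(self.fd)
        self.fd = None

# Usage sketch: wrap the section that must not be interrupted, e.g.
# with UpdateLock():
#     run_navigation_loop()
```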

Hi Alan,

Thanks for the update. Please let me know if there’s anything I can do to help debug at any point.

I’m not 100% sure I follow what you mean by configuration changes. Are you talking about user-set network parameters, or is something else going on under the hood? When this was happening, the device was simply sitting on the table, so there were definitely no configuration changes from me. If this could be triggered by the wifi -> ethernet bridge it’s connected to blinking out briefly or something, then maybe there’s something to that. In that case it’s definitely something to resolve once we can pinpoint what’s going on.

Adding an update lock is a good idea. We’ll do that.

Thanks,

Adam

From what I’ve gathered of the investigation so far, it is the configuration of the network defined in the docker-compose file as opposed to something in the network the device is attached to. As they zero in it should become clearer and we’ll certainly keep you updated.

Gotcha. Shouldn’t be anything funky in our docker compose network-wise, but here it is for reference:

version: '2.1'
networks: {}
volumes:
  resin-data: {}
services:
  nautilus:
    build: ./nautilus
    # Note: Disabling 'always' restart until Nautilus application handles
    # start/stop behavior internally.
    restart: on-failure
    devices:
      - "/dev/ttymxc1:/dev/ttymxc1"
      - "/dev/ttymxc2:/dev/ttymxc2"
    cap_add:
      - SYS_RAWIO
    volumes:
      - 'resin-data:/data'
    ports:
      - "5500:5500" # Vision ZMQ
      - "30200:30200" # NMEA-0183
      - "30201:30201" # FusionEngine Client
    labels:
      io.balena.features.procfs: '1' # Needed to enable core dumps
  nemo:
    build: ./nemo
    privileged: true
    restart: always
    volumes:
      - 'resin-data:/data'
    ports:
      - "8888:8888" # Nemo REST Interface
    labels:
      io.balena.features.supervisor-api: '1'
      io.balena.features.dbus: '1'
      io.balena.features.procfs: '1'
    environment:
      - DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket
  ui-bridge:
    build: ./nautilus
    restart: always
    ports:
      - "3000:3000" # UI Websocket
    command: ["/home/point_one/bootstrap.sh", "--force-start",
              "./bin/ui_websocket_bridge",
              "--ui-multicast-port=30108",
              "--ws-port=3000"]
  ntrip-proxy:
    build: ./nautilus
    restart: always
    ports:
      - "2101:2101" # Proxied NTRIP Output
    command: ["/home/point_one/bootstrap.sh", "--force-start",
              "./bin/ntrip_proxy",
              "--polaris-api-key=${P1_POLARIS_API_KEY}",
              "--polaris-unique-id=${P1_DEVICE_ID:-${BALENA_DEVICE_NAME_AT_INIT}}-ntrip-proxy",
              "0.0.0.0", "2101", "point_one/applications/ntrip_proxy"]
  eos:
    build: ./eos
    restart: always
    volumes:
      - 'resin-data:/data'
    ports:
      - "80:80" # EOS UI

There is one thing we have noticed that may or may not be related, and which we haven’t had a chance to pinpoint yet. The nemo service (Python) talks to the supervisor to control things, query the software version, etc. When the device first powers on, more often than not we get an endless stream of connection refused errors from the supervisor. Restarting the nemo container fixes it every time. Power cycling the device fixes it most of the time. Waiting and continuing to retry never resolves it. For customers, power cycling has been the only solution (since they can’t restart containers).

I was planning to make a ticket for it when I had some time to dig into it further, and also to make a slightly better work-around in our code where the service restarts itself if it detects the condition, but since it may be related I figured I’d mention it.

When the device first powers on, more often than not we get an infinite loop of connection refused errors from the supervisor.

On boot, your application containers are started independently of the Supervisor, which means the Supervisor may come up after your app and take a while to start accepting requests, so your application will need to handle these errors gracefully.
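
To make that concrete, here is a minimal, hypothetical sketch of a startup retry loop. The `BALENA_SUPERVISOR_ADDRESS`/`BALENA_SUPERVISOR_API_KEY` environment variables referenced in the commented usage are the usual ones exposed when the `io.balena.features.supervisor-api` label is set, but treat that wiring as an assumption; the probe is injectable so the retry logic itself stays testable:

```python
import time

def wait_for_supervisor(probe, attempts=30, delay=2.0):
    """Poll `probe` (any callable returning True once the supervisor is
    reachable, e.g. a GET to its /ping endpoint) until it succeeds or
    attempts run out. Returns True on success, False otherwise.
    """
    for _ in range(attempts):
        try:
            if probe():
                return True
        except (ConnectionError, OSError):
            pass  # supervisor not listening yet; treat like a failed probe
        time.sleep(delay)
    return False

# Usage sketch inside nemo's startup (urllib to avoid extra dependencies):
#
# import os, urllib.request
#
# def ping():
#     addr = os.environ["BALENA_SUPERVISOR_ADDRESS"]
#     key = os.environ["BALENA_SUPERVISOR_API_KEY"]
#     with urllib.request.urlopen(f"{addr}/ping?apikey={key}", timeout=5) as r:
#         return r.status == 200
#
# if not wait_for_supervisor(ping):
#     raise SystemExit("supervisor never became reachable")
```

Keeping `probe` injectable means the backoff behaviour can be unit-tested without a live supervisor.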

Ah, interesting. We were under the impression that the supervisor always starts and stops the containers. We didn’t realize that boot was special and that they could start on their own, independently.

In theory, our code does have an initial check that tries to communicate with the supervisor and restarts the container on failure. The symptom we’ve been seeing suggests that the check passes, but subsequent queries (triggered by status requests from the UI) then fail. I need to do a little more digging when I have some time, but I’ll at least make sure to add an async testing loop and some delay between query attempts. That might help.

Hey Adam,
Yes, please keep us posted on your findings.