Unexpected supervisor crash/container restart on network change

Hi @20k-ultra, I wanted to update this issue. We’re now seeing another supervisor-death symptom under similar circumstances, and quite frequently.

We have someone driving across the country at the moment. His device is set to start our application on boot so he doesn’t have to deal with it, but the auto-start seems to fail roughly 50% of the time, and fairly catastrophically.

The auto-start works as follows (a simplified sketch of this flow is included after the list):

  1. Control service (nemo) always starts automatically
  2. If it detects a reboot (as opposed to a later container restart), nemo waits 30 seconds before contacting the supervisor to avoid a race condition where the supervisor hasn’t started yet.
  3. After the wait, it does a /v1/device query. If that fails, it will retry up to 10 times before exiting to force the container to restart – in the past (above) we found that restarting the container sometimes resolved supervisor communication issues.
    • The docker-compose file for that container is configured with restart: always
  4. After getting the device status, it performs some setup and then starts the nautilus container with a /v2/applications/.../restart-service request
  5. That request is failing inside the supervisor with Error: (HTTP code 500) server error - mount already exists (reported by the Docker engine, as shown below), and the supervisor returns a 503 error back to nemo
  6. After that, the supervisor becomes completely unresponsive
  7. As a work-around for the supervisor communication issues earlier in this ticket, months ago we added a sanity check: the nemo container restarts itself by exiting if it gets a 500-level response during the restart-service call. This 503 response is triggering that check, but for some reason Docker never restarts the container, even though the docker-compose file is configured to do so
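
To make steps 2-5 concrete, here is a simplified Python sketch of the boot logic (not our actual code; the helper names, retry counts, and timeouts are illustrative, and the supervisor address/API key are assumed to come from the standard balena-provided environment variables):

```python
# Simplified sketch of the nemo boot flow (steps 2-5 above); the real logic
# lives in atlas_balena_runner.py. Names, retry counts, and timeouts are
# illustrative only.
import os
import sys
import time

import requests

SUPERVISOR_ADDRESS = os.environ['BALENA_SUPERVISOR_ADDRESS']
SUPERVISOR_API_KEY = os.environ['BALENA_SUPERVISOR_API_KEY']
APP_ID = 1623312


def wait_for_supervisor(max_attempts=10):
    # Step 2: after a reboot, give the supervisor 30 seconds to come up.
    time.sleep(30)
    # Step 3: poll /v1/device; if it never responds, exit non-zero so that
    # Docker restarts the nemo container (restart: always).
    for _ in range(max_attempts):
        try:
            r = requests.get(f'{SUPERVISOR_ADDRESS}/v1/device',
                             params={'apikey': SUPERVISOR_API_KEY},
                             timeout=10.0)
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            time.sleep(5)
    sys.exit(1)


def start_nautilus(force=False):
    # Steps 4-5: ask the supervisor to (re)start the nautilus service. This
    # is the request that comes back with the 503 shown in the logs below.
    r = requests.post(
        f'{SUPERVISOR_ADDRESS}/v2/applications/{APP_ID}/restart-service',
        params={'apikey': SUPERVISOR_API_KEY},
        json={'serviceName': 'nautilus', 'force': force},
        timeout=10.0)
    r.raise_for_status()
    return r
```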

As a result of 6 and 7, the device appears totally unresponsive from a customer perspective. The only option customers have is to reboot the device and hope it doesn’t happen again. Obviously that’s very bad, and in some cases it isn’t even trivial, since the device may be powered by the car and they have to turn off the car to reboot it. We can’t even reboot the device for them through the dashboard – none of the dashboard buttons have any effect since the supervisor is dead. Issuing a reboot via ssh does work, though, so we have been using that option for now.

Unfortunately, even if we were to disable the sanity check in step 7 so that nemo doesn’t exit and the UI remains available, there is no way for us to restart the supervisor. Not even manually through the dashboard.

This issue appears to be a race condition at startup, but we’re really not sure what exactly is racing. It also seems to be much worse in this one particular vehicle, where the device connects to the internet via a wifi hotspot in the car. We’re not sure why, or whether internet connection stability is a problem for the supervisor at boot.

The following is the journal output at the time of the crash:

supervisor begins restarting nautilus container --v

Apr 30 00:44:38 4c02e8e 926b40eabc4a[1074]: [event]   Event: Service exit {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331}}
Apr 30 00:44:38 4c02e8e resin-supervisor[2642]: [event]   Event: Service exit {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331>
Apr 30 00:44:38 4c02e8e 926b40eabc4a[1074]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331}}
Apr 30 00:44:38 4c02e8e resin-supervisor[2642]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331>
Apr 30 00:44:38 4c02e8e resin-supervisor[2642]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765>
Apr 30 00:44:38 4c02e8e 926b40eabc4a[1074]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331}}



supervisor error --v

Apr 30 00:44:39 4c02e8e balenad[1074]: time="2021-04-30T00:44:39.057786603Z" level=error msg="Handler for POST /containers/create returned error: mount already exists wi>
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]   Error on POST /v2/applications/1623312/restart-service:  Error: (HTTP code 500) server error - mount already exists>
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]         at /usr/src/app/dist/app.js:14:1391724
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]   Error on POST /v2/applications/1623312/restart-service:  Error: (HTTP code 500) server error - mount already ex>
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]         at /usr/src/app/dist/app.js:14:1391724
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at /usr/src/app/dist/app.js:14:1391656
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at Modem.buildPayload (/usr/src/app/dist/app.js:14:1391676)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:14:1390929)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at /usr/src/app/dist/app.js:14:1391656
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at Modem.buildPayload (/usr/src/app/dist/app.js:14:1391676)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:14:1390929)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at IncomingMessage.emit (events.js:322:22)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at endReadableNT (_stream_readable.js:1187:12)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at IncomingMessage.emit (events.js:322:22)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at endReadableNT (_stream_readable.js:1187:12)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at processTicksAndRejections (internal/process/task_queues.js:84:21)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at processTicksAndRejections (internal/process/task_queues.js:84:21)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [api]     POST /v2/applications/1623312/restart-service 503 - 1486.971 ms
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [api]     POST /v2/applications/1623312/restart-service 503 - 1486.971 ms



nemo application catches error and exits, but then the container doesn't get restarted --v

Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]: 2021-04-30 00:44:39,104 - point_one.nautilus.runner.atlas.balena - WARNING - Unable to start navigation engine: request faile>
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]: Traceback (most recent call last):
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:   File "/home/point_one/nemo/nautilus/runner/atlas_balena_runner.py", line 630, in start
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:     payload={'force': self.force_start}, timeout_sec=10.0)
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:   File "/home/point_one/nemo/nautilus/runner/atlas_balena_runner.py", line 2117, in _send_service_request
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:     json=arguments, timeout_sec=timeout_sec)
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:   File "/home/point_one/nemo/nautilus/runner/atlas_balena_runner.py", line 2159, in _post
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:     response.raise_for_status()
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:   File "/root/.local/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:     raise HTTPError(http_error_msg, response=self)
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]: requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: http://10.114.104.1:48484/v2/applications/16233>
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]: 
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]: 2021-04-30 00:44:39,104 - point_one.nautilus.runner.atlas.balena - ERROR - Restarting Nemo to resolve supervisor communicatio>
Apr 30 00:44:39 4c02e8e balenad[1074]: time="2021-04-30T00:44:39.321143435Z" level=error msg="failed to process event" container=04bd6f46ac1e2ff147690fc2223186c8936b7f3b>
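
For reference, the step-7 sanity check that fires here boils down to something like the following (again a simplified sketch, not our actual code, using the hypothetical start_nautilus() helper from the sketch above):

```python
# Simplified sketch of the step-7 sanity check: on a 500-level response from
# the supervisor, nemo exits non-zero, expecting Docker's restart: always
# policy to bring the container back up. In this failure mode the exit
# happens, but the restart never does.
import logging
import sys

import requests

log = logging.getLogger('point_one.nautilus.runner.atlas.balena')

try:
    start_nautilus(force=False)
except requests.HTTPError as e:
    log.warning('Unable to start navigation engine: request failed.')
    if e.response is not None and e.response.status_code >= 500:
        log.error('Restarting Nemo to resolve supervisor communication issues.')
        sys.exit(1)
    raise
```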

I found the following existing GitHub issue that may be related, since it has a similar error: Supervisor sometimes fails to start up container with mount error · Issue #1131 · balena-os/balena-supervisor · GitHub. I’m not sure, though, since there’s really no other context in that issue and it hasn’t been updated since 2019. @CameronDiver, any idea whether this is related and whether that issue was ever resolved?

I have enabled support access on device 4c02e8eb94f358387cf1136a19a9517a if someone wants to take a look. It is in a car, so it is only online when the car is being driven. He’s driving long distance for the next few days, though, so it will be online for long stretches of time. I’m happy to coordinate access if someone needs it.

As I’ve mentioned a few times, we’re still using host OS 2.50.1+rev1 with supervisor 11.4.10 because of kernel issues that are preventing us from upgrading to a newer host OS version, which we are still attempting to diagnose (UART Failure after upgrading to 2.71.3+rev1 from 2.50.1+rev1 on Variscite IMX8M - #10 by adamshapiro0). If this is a known issue that has been resolved in a newer supervisor version, please let me know.

Thanks,

Adam