Unexpected supervisor crash/container restart on network change

Hi again Adam,

You can disable the navigation for now if you want. It’s the end of the day here, so I won’t be looking into this until tomorrow. If you can, please leave the device on for now; it’d be nice to see whether the network changes issue recurs.

Could you tell me more about how you are interacting with DBus? What operations are you doing? In which cases are you performing those operations? Are you interacting with the network?

Thanks
Felipe

Ok, thanks. Looks like it stopped on its own a while ago - it appears it was sent a SIGTERM. I assume that was you?

We use dbus to talk to NetworkManager. If someone has the UI open, it polls nemo periodically for device status. Part of that status is network status, which we get by reading Ip4Config, AvailableConnections, etc. from org.freedesktop.NetworkManager.Device, as well as by querying org.freedesktop.NetworkManager.Settings.Connection. I think the UI does that once every 5 seconds, if I remember right.
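
Roughly, the status poll does something like this (a simplified sketch using dbus-python, not our exact code; the interfaces and property names are the standard NetworkManager ones):

import dbus

NM = 'org.freedesktop.NetworkManager'

bus = dbus.SystemBus()
nm = dbus.Interface(bus.get_object(NM, '/org/freedesktop/NetworkManager'), NM)

for dev_path in nm.GetDevices():
    dev_props = dbus.Interface(bus.get_object(NM, dev_path),
                               'org.freedesktop.DBus.Properties')
    # Current IPv4 configuration and the connections this device could use.
    ip4_config_path = dev_props.Get(NM + '.Device', 'Ip4Config')
    available = dev_props.Get(NM + '.Device', 'AvailableConnections')
    # Each AvailableConnections entry is a Settings.Connection object path;
    # GetSettings() returns the stored profile for that connection.
    for conn_path in available:
        conn = dbus.Interface(bus.get_object(NM, conn_path),
                              NM + '.Settings.Connection')
        settings = conn.GetSettings()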

I did happen to have the UI open on my screen for the device while you were looking at it, just by chance, so that was happening.

The only other dbus/NetworkManager operation we perform is when someone changes the network settings manually in the UI settings page. When we’ve run into the network changes issue, nobody has been doing that.

That being said, I do think some of the devices that have experienced this issue are configured for wifi as well as ethernet. I could imagine wifi going in and out, an access point scan, or something like that triggering it. I’m not sure if all of the devices with this issue have wifi enabled. Mine does not, but that doesn’t mean the OS isn’t doing periodic AP scans.

Hi again Adam,

Yeah, it’s probably something I did that stopped the navigation. I was trying to simulate a service crash by stopping some services directly with the engine.

Looking at the code, ‘network changes’ refers to changes in the Docker network configuration for the service (aliases and IPAM configuration), so I would not expect changes to the network interfaces to trigger the issue.

That being said, I’m not ruling out some buggy behavior on the supervisor side, so if you think the problem is more likely to happen on devices with both wifi and ethernet connections, it might be worth connecting your device to wifi and waiting to see if the issue recurs.

I will continue to monitor your device through the day.

Ok, I just connected it to my wifi and started the nav engine again. I’ll leave it navigating for a little while.

Thank you Adam. I just saw a service kill and a service start of the nautilus service right after you connected to wifi. Is that expected? There is a call to the /restart-service API endpoint, which I assume is performed by the nemo service.

Yeah, that was probably me clicking the start button on the UI. That sends a REST request to nemo, which issues a restart-service.
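
For reference, the request nemo issues to the supervisor is just the standard restart-service endpoint, roughly like this (a simplified sketch using requests; the restart_service helper is illustrative, not our exact code):

import os
import requests

# Standard Balena-provided environment variables (available when the container
# has the io.balena.features.supervisor-api label).
SUPERVISOR = os.environ['BALENA_SUPERVISOR_ADDRESS']   # e.g. http://10.114.104.1:48484
API_KEY = os.environ['BALENA_SUPERVISOR_API_KEY']
APP_ID = os.environ['BALENA_APP_ID']

def restart_service(service_name, force=False, timeout_sec=10.0):
    url = '{}/v2/applications/{}/restart-service'.format(SUPERVISOR, APP_ID)
    response = requests.post(url,
                             params={'apikey': API_KEY},
                             json={'serviceName': service_name, 'force': force},
                             timeout=timeout_sec)
    response.raise_for_status()   # raises on any 4xx/5xx response

restart_service('nautilus')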

Hi Adam,

While monitoring the device running version 12.1.4 of the supervisor and performing some tests, I wasn’t able to replicate the ‘Network changes detected’ message or observe the behavior. Although we are still investigating to find the exact cause, we are fairly confident that this new version fixes the issue.

We are working on getting some details ready on a new release that we expect to put into production soon for the dart imx8mm device type.

In the meantime let us know of any new issues you observe on your devices.

Cheers

Ok, hopefully that is the case. The device with the 12.1.4 supervisor is just sitting on my desk, whereas our customer devices are in vehicles driving around, using spotty cell connections for internet via a wifi router in the car. I’m not really sure how representative my device is for this particular issue, unfortunately.

We’ll continue to keep an eye on it in the meantime, and hopefully the new OS update is available soon and resolves the issue.

Hey Adam, I stumbled upon this thread again and we didn’t really get much closure. How are your devices doing?

We’re still experiencing a number of host OS/supervisor-related issues that are affecting our customers. In particular, we’re seeing problems communicating with the supervisor if a device boots and the local network is available but the internet is not yet available. In that case, our nemo service periodically either hits the ‘OCI runtime exec failed’ issue described above, where it appears to be running but can’t reach DNS and we can’t SSH into it, or DNS works but it gets 404 responses from the supervisor when trying to hit the restart-service API endpoint.

The 404 is definitely very strange - the request is clearly reaching the supervisor since we get a response at all, and we know the REST query is good since the exact same query works normally. Why would it get a 404?

We’re still running a very old Balena release and supervisor on our devices (2.50.1). We very much want to update to the latest, particularly to solve system time issues when booting a device (Chronyc config is bad if device comes online without internet - #51 by alexgg), but right now we’re having issues with the UART dying on the latest Balena release (2.71.3), so we’re trying to get that resolved as soon as possible (UART Failure after upgrading to 2.71.3+rev1 from 2.50.1+rev1 on Variscite IMX8M - #22 by acostach).

Hi @20k-ultra, I wanted to update this issue. We’re now seeing another supervisor death symptom under similar circumstances, and quite frequently.

We have someone driving across the country at the moment. His device is set to start our application on boot so he doesn’t have to deal with it, but it seems to fail basically 50% of the time, and pretty catastrophically.

The auto-start works as follows (see the sketch after this list):

  1. Control service (nemo) always starts automatically
  2. If it detects a reboot (vs a container restart later), nemo waits 30 seconds before contacting the supervisor to avoid race conditions where the supervisor hasn’t started yet.
  3. After the wait, it does a /v1/device query. If that fails, it will retry up to 10 times before exiting to force the container to restart – in the past (above) we found that restarting the container sometimes resolved supervisor communication issues.
    • The docker-compose entry for that container is configured with restart: always
  4. After getting the device status, it goes on to set some stuff up and then starts up the nautilus container with a /v2/applications/.../restart-service request
  5. That request is failing with ‘Error: (HTTP code 500) server error - mount already exists’ as shown below, and the supervisor returns a 503 error back to nemo
  6. After that, the supervisor becomes completely unresponsive
  7. As a workaround for the supervisor communication issues earlier in this ticket, months ago we added a sanity check where the nemo container restarts itself by exiting if it gets a 500-level response during the restart-service call. This 503 response is triggering that, but then for some reason Docker never restarts the container, even though the docker-compose file is configured to do so.
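
Boiled down, the startup logic looks roughly like this (heavily simplified sketch; function names and constants are approximate, not our exact code):

import sys
import time
import requests

SUPERVISOR = 'http://10.114.104.1:48484'   # from BALENA_SUPERVISOR_ADDRESS
API_KEY = 'REDACTED'                       # from BALENA_SUPERVISOR_API_KEY
APP_ID = 1623312

def get_device_status():
    # Step 3: query /v1/device, retrying up to 10 times before giving up.
    for _ in range(10):
        try:
            r = requests.get(SUPERVISOR + '/v1/device',
                             params={'apikey': API_KEY}, timeout=10.0)
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            time.sleep(5.0)
    # Exit to force a container restart (docker-compose has restart: always).
    sys.exit(1)

def start_nautilus():
    # Step 4: start the nautilus service via restart-service.
    r = requests.post(SUPERVISOR + '/v2/applications/%d/restart-service' % APP_ID,
                      params={'apikey': API_KEY},
                      json={'serviceName': 'nautilus', 'force': False},
                      timeout=10.0)
    if r.status_code >= 500:
        # Step 7: sanity check - exit so Docker restarts the container.
        sys.exit(1)

time.sleep(30.0)   # Step 2: wait out supervisor startup after a reboot
get_device_status()
start_nautilus()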

As a result of steps 6 and 7, the device appears to be totally unresponsive from a customer perspective. The only option customers have is to reboot the device and hope it doesn’t happen again. Obviously that’s very bad, and in some cases it isn’t trivial, since their device may be powered by the car and they need to turn off the car to reboot it. We can’t even reboot it manually through the dashboard for them – none of the dashboard buttons have any effect since the supervisor is dead. Issuing a reboot via ssh works, though, so we have been using that option for now.

Unfortunately, even if we were to disable the sanity check in step 7 so that nemo doesn’t exit and the UI stays available, there is no way for us to restart the supervisor, not even manually through the dashboard.

This issue appears to be a race condition at startup, but we’re really not sure where exactly. It also seems to be much worse in this one particular vehicle, where the device connects to the internet via a wifi hotspot in the car. We’re not sure why, or whether internet connection stability is a problem for the supervisor at boot.

The following is the journal output at the time of the crash:

supervisor begins restarting nautilus container --v

Apr 30 00:44:38 4c02e8e 926b40eabc4a[1074]: [event]   Event: Service exit {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331}}
Apr 30 00:44:38 4c02e8e resin-supervisor[2642]: [event]   Event: Service exit {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331>
Apr 30 00:44:38 4c02e8e 926b40eabc4a[1074]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331}}
Apr 30 00:44:38 4c02e8e resin-supervisor[2642]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331>
Apr 30 00:44:38 4c02e8e resin-supervisor[2642]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765>
Apr 30 00:44:38 4c02e8e 926b40eabc4a[1074]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331}}



supervisor error --v

Apr 30 00:44:39 4c02e8e balenad[1074]: time="2021-04-30T00:44:39.057786603Z" level=error msg="Handler for POST /containers/create returned error: mount already exists wi>
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]   Error on POST /v2/applications/1623312/restart-service:  Error: (HTTP code 500) server error - mount already exists>
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]         at /usr/src/app/dist/app.js:14:1391724
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]   Error on POST /v2/applications/1623312/restart-service:  Error: (HTTP code 500) server error - mount already ex>
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]         at /usr/src/app/dist/app.js:14:1391724
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at /usr/src/app/dist/app.js:14:1391656
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at Modem.buildPayload (/usr/src/app/dist/app.js:14:1391676)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:14:1390929)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at /usr/src/app/dist/app.js:14:1391656
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at Modem.buildPayload (/usr/src/app/dist/app.js:14:1391676)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:14:1390929)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at IncomingMessage.emit (events.js:322:22)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at endReadableNT (_stream_readable.js:1187:12)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at IncomingMessage.emit (events.js:322:22)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at endReadableNT (_stream_readable.js:1187:12)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at processTicksAndRejections (internal/process/task_queues.js:84:21)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at processTicksAndRejections (internal/process/task_queues.js:84:21)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [api]     POST /v2/applications/1623312/restart-service 503 - 1486.971 ms
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [api]     POST /v2/applications/1623312/restart-service 503 - 1486.971 ms



nemo application catches error and exits, but then the container doesn't get restarted --v

Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]: 2021-04-30 00:44:39,104 - point_one.nautilus.runner.atlas.balena - WARNING - Unable to start navigation engine: request faile>
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]: Traceback (most recent call last):
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:   File "/home/point_one/nemo/nautilus/runner/atlas_balena_runner.py", line 630, in start
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:     payload={'force': self.force_start}, timeout_sec=10.0)
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:   File "/home/point_one/nemo/nautilus/runner/atlas_balena_runner.py", line 2117, in _send_service_request
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:     json=arguments, timeout_sec=timeout_sec)
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:   File "/home/point_one/nemo/nautilus/runner/atlas_balena_runner.py", line 2159, in _post
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:     response.raise_for_status()
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:   File "/root/.local/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]:     raise HTTPError(http_error_msg, response=self)
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]: requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: http://10.114.104.1:48484/v2/applications/16233>
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]: 
Apr 30 00:44:39 4c02e8e 04bd6f46ac1e[1074]: 2021-04-30 00:44:39,104 - point_one.nautilus.runner.atlas.balena - ERROR - Restarting Nemo to resolve supervisor communicatio>
Apr 30 00:44:39 4c02e8e balenad[1074]: time="2021-04-30T00:44:39.321143435Z" level=error msg="failed to process event" container=04bd6f46ac1e2ff147690fc2223186c8936b7f3b>

I found an existing GitHub ticket that may be related to this issue since it has a similar error: Supervisor sometimes fails to start up container with mount error · Issue #1131 · balena-os/balena-supervisor · GitHub. I’m not sure, though, since there’s really no other context mentioned in the ticket, and it hasn’t been updated since 2019. @CameronDiver, any idea if this is related and whether that issue was ever resolved?

I have enabled support access on device 4c02e8eb94f358387cf1136a19a9517a if someone wants to take a look. It is in a car, so it is only online when the car is being driven. He’s driving long distance for the next few days, though, so it will be on for long stretches of time. I’m happy to coordinate access if someone needs it.

As I’ve mentioned a few times, we’re still using host OS 2.50.1+rev1 with supervisor 11.4.10 because of kernel issues that are stopping us from upgrading to a newer host OS version, which we are attempting to diagnose (UART Failure after upgrading to 2.71.3+rev1 from 2.50.1+rev1 on Variscite IMX8M - #10 by adamshapiro0). If this is a known issue that has been resolved in a newer version of the supervisor, please let me know.

Thanks,

Adam

Just hit another case this morning – this time the supervisor returned a 404 response:

May 01 14:21:10 4c02e8e 926b40eabc4a[1076]: [api]     POST /v2/applications/1623312/restart-service 404 - 42.556 ms
May 01 14:21:10 4c02e8e resin-supervisor[2471]: [api]     POST /v2/applications/1623312/restart-service 404 - 42.556 ms
May 01 14:21:10 4c02e8e 05b53894b39e[1076]: 2021-05-01 14:21:10,817 - point_one.nautilus.runner.atlas.balena - WARNING - Unable to start navigation engine: request failed.
May 01 14:21:10 4c02e8e 05b53894b39e[1076]: Traceback (most recent call last):
May 01 14:21:10 4c02e8e 05b53894b39e[1076]:   File "/home/point_one/nemo/nautilus/runner/atlas_balena_runner.py", line 630, in start
May 01 14:21:10 4c02e8e 05b53894b39e[1076]:     payload={'force': self.force_start}, timeout_sec=10.0)
May 01 14:21:10 4c02e8e 05b53894b39e[1076]:   File "/home/point_one/nemo/nautilus/runner/atlas_balena_runner.py", line 2117, in _send_service_request
May 01 14:21:10 4c02e8e 05b53894b39e[1076]:     json=arguments, timeout_sec=timeout_sec)
May 01 14:21:10 4c02e8e 05b53894b39e[1076]:   File "/home/point_one/nemo/nautilus/runner/atlas_balena_runner.py", line 2159, in _post
May 01 14:21:10 4c02e8e 05b53894b39e[1076]:     response.raise_for_status()
May 01 14:21:10 4c02e8e 05b53894b39e[1076]:   File "/root/.local/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
May 01 14:21:10 4c02e8e 05b53894b39e[1076]:     raise HTTPError(http_error_msg, response=self)
May 01 14:21:10 4c02e8e 05b53894b39e[1076]: requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://10.114.104.1:48484/v2/applications/1623312/restart-service?apikey=d139faa6536a6a81bc82eee5f94cad51
May 01 14:21:10 4c02e8e 05b53894b39e[1076]: 
May 01 14:21:10 4c02e8e 05b53894b39e[1076]: 2021-05-01 14:21:10,817 - point_one.nautilus.runner.atlas.balena - ERROR - Restarting Nemo to resolve supervisor communication error.
May 01 14:21:11 4c02e8e balenad[1076]: time="2021-05-01T14:21:11.305097201Z" level=info msg="shim reaped" id=05b53894b39e0ce03127c91caf722995d2ca80037f32aeb4dbbddb04cd54e5b2
May 01 14:21:11 4c02e8e balenad[1076]: time="2021-05-01T14:21:11.310227058Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
May 01 14:21:11 4c02e8e balenad[1076]: time="2021-05-01T14:21:11.371556588Z" level=warning msg="Failed to delete conntrack state for 172.17.0.4: invalid argument"

I don’t even know how a 404 makes any sense whatsoever as a response to restart-service. Either the supervisor isn’t running, in which case you should get no response or connection refused, or it is running and the request should either succeed or fail with some 500-level error. The URL is the same as it always is, so how is it possible to get a 404?

As with the 503 error, nemo exited to force itself to restart and try to fix communication with the supervisor. This time, again, Docker did not restart the container, even though its docker-compose is set to restart: always. That’s really bad.

This just happened again after a brief stop and power cycle. This time Docker did restart the container, multiple times in fact, but of course that didn’t fix anything since the issue seems to be on the supervisor side and not the container’s fault. We got two 503s in a row, followed by many 404s (>10) before I manually rebooted the device.

For reference, it happened around 15:16 UTC in case anyone takes a look at the journal.

An update from some experimenting over the last day or two as the vehicle drives:

We believe many of the supervisor issues relate back to how network connectivity changes are handled at boot. Through a lot of experimenting by the customer, we have found that the error seems to happen most often under the following circumstances:

  1. Wifi network (cell phone tether) is activated
  2. Boot device
  3. Device detects the wifi network and NetworkManager changes the network configuration some time after boot, seemingly after the supervisor has begun to start up
  4. Issue a restart-service request to the supervisor 30 seconds after boot – encounter 503 or 404 error most of the time

If instead we leave the wifi network offline during boot, wait a few minutes, and then turn it on after the restart-service request has been made and our application is running, the supervisor seems to work OK. The customer is using this workaround for now, but obviously it’s not really a sustainable solution.

Hey @adamshapiro0, I’ve been a bit backlogged on my emails and just got around to seeing this. I’ll get back to you shortly.