How are the production devices doing generally? Are they pretty stable, but sometimes experiencing that “Network changes detected” restart?
Well, we’re hitting a few pretty nasty issues that seem to happen at power up. Once the device is powered up and stable, it tends to run nicely; getting to that point can be tricky. Honestly, the supervisor restarting Nemo at boot, i.e. the “network changes” issue, is not quite as critical at the moment.
My biggest concern is this ticket: “Chronyc config is bad if device comes online without internet”, which has expanded since we first wrote it up and now actually covers at least two issues, maybe three. Basically, we’ve experienced multiple problems on boot where nemo can’t talk to the supervisor, can’t seem to reach the internet, and we can’t get a shell into the container (we get a fork error from docker instead). In these cases, nemo doesn’t get restarted, intentionally or accidentally, the way it does with the network changes error. The result is that the software running on nemo just spins. Nemo is the UI back end, so customers see this directly as “device is busted” and have to power cycle once, or sometimes more than once, to get it to recover.
I think at least one of the causes has to do with powering the device down for multiple days or longer and then turning it on in an environment where internet isn’t available (which is typical for some customers). Our devices don’t have an RTC, so when they come up their date is stale until chrony gets an NTP sync. Either the stale date, or the time sync itself, or who knows what seems to cause all kinds of havoc.
Thanks for clarifying the urgency of the issues you face. It helps me know which ones to prioritize with my limited time on support. I am one of the maintainers of the Supervisor, so I can’t dig much into the time sync issue. I should have more info about a new release tomorrow, and first thing tomorrow morning I also plan to find out more about why “Network changes detected” isn’t being logged anymore.
Yeah, I know that story. I appreciate the help.
Hey @20k-ultra, “quick” update: we believe we just hit the network changes detected issue live on one of our vehicles (not the device with the updated supervisor, unfortunately), and it hit a race condition with a pretty nasty consequence.
From what I can tell, the supervisor decided to restart our service right after we told it to start, after the container had started executing, and before the application was actually running. Since the application wasn’t actually running yet, our bootstrap wrapper script ended up forwarding the SIGTERM to nobody, and then the application started anyway.
The application then ran normally, unaware that the shutdown request had happened, and the supervisor eventually killed it as non-responsive. I assume that kill kills the docker container outright and doesn’t forward a SIGKILL to the application. As a result, it couldn’t clean up the update lock. The next time we tried to start it, the update lock already existed and nautilus refused to start (it doesn’t know whether the lock is from us or the supervisor).
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [event] Event: Service started {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1595407}}
Nov 12 03:00:37 293028d resin-supervisor[2356]: [event] Event: Service started {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1595>
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [api] POST /v2/applications/1623312/restart-service 200 - 4196.242 ms
Nov 12 03:00:37 293028d resin-supervisor[2356]: [api] POST /v2/applications/1623312/restart-service 200 - 4196.242 ms
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [api] GET /v1/device 200 - 1488.962 ms
Nov 12 03:00:37 293028d resin-supervisor[2356]: [api] GET /v1/device 200 - 1488.962 ms
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [api] GET /v1/device 200 - 108.816 ms
Nov 12 03:00:37 293028d resin-supervisor[2356]: [api] GET /v1/device 200 - 108.816 ms
Nov 12 03:00:37 293028d resin-supervisor[2356]: [debug] Replacing container for service nautilus because of config changes:
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [debug] Replacing container for service nautilus because of config changes:
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [debug] Network changes detected
Nov 12 03:00:37 293028d resin-supervisor[2356]: [debug] Network changes detected
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [debug] Replacing container for service nautilus because of config changes:
Nov 12 03:00:37 293028d resin-supervisor[2356]: [debug] Replacing container for service nautilus because of config changes:
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [debug] Network changes detected
Nov 12 03:00:37 293028d resin-supervisor[2356]: [debug] Network changes detected
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [debug] Replacing container for service nautilus because of config changes:
Nov 12 03:00:37 293028d resin-supervisor[2356]: [debug] Replacing container for service nautilus because of config changes:
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [debug] Network changes detected
Nov 12 03:00:37 293028d resin-supervisor[2356]: [debug] Network changes detected
Nov 12 03:00:37 293028d 28769d824dbb[1075]: [event] Event: Service kill {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1595407}}
Nov 12 03:00:37 293028d resin-supervisor[2356]: [event] Event: Service kill {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1595407>
Nov 12 03:00:38 293028d 8548385b0773[1075]: Forwarding shutdown request to application...
Nov 12 03:00:38 293028d 8548385b0773[1075]: Nautilus started with PID 18.
...
Nov 12 03:00:47 293028d balenad[1075]: time="2020-11-12T03:00:47.873128521Z" level=info msg="Container 8548385b0773e94655e2dd6ecfc90f6e142e4db9187b527a43a4f0c30686ce3b failed to exit within 10 seconds of signal 15 - using the force">
...
Nov 12 03:00:55 293028d d3bf40e104ad[1075]: 2020-11-12 03:00:55,726 - point_one.nautilus.runner.atlas.balena - INFO - Starting Nautilus on Atlas...
Nov 12 03:00:55 293028d resin-supervisor[2356]: Unhandled rejection UpdatesLockedError: Updates are locked: EEXIST: file already exists, open '/mnt/root/tmp/balena-super>
Nov 12 03:00:55 293028d resin-supervisor[2356]: at /usr/src/app/dist/app.js:2:371477
Nov 12 03:00:55 293028d resin-supervisor[2356]: at tryCatcher (/usr/src/app/dist/app.js:2:220284)
Nov 12 03:00:55 293028d resin-supervisor[2356]: at Promise._settlePromiseFromHandler (/usr/src/app/dist/app.js:2:1708447)
Nov 12 03:00:55 293028d resin-supervisor[2356]: at Promise._settlePromise (/usr/src/app/dist/app.js:2:1709730)
Nov 12 03:00:55 293028d resin-supervisor[2356]: at Promise._settlePromise0 (/usr/src/app/dist/app.js:2:1710770)
Nov 12 03:00:55 293028d resin-supervisor[2356]: at Promise._settlePromises (/usr/src/app/dist/app.js:2:1712344)
Nov 12 03:00:55 293028d resin-supervisor[2356]: at _drainQueueStep (/usr/src/app/dist/app.js:2:1715412)
Nov 12 03:00:55 293028d resin-supervisor[2356]: at _drainQueue (/usr/src/app/dist/app.js:2:1715308)
Nov 12 03:00:55 293028d resin-supervisor[2356]: at Async._drainQueues (/usr/src/app/dist/app.js:2:1716625)
Nov 12 03:00:55 293028d resin-supervisor[2356]: at Immediate.Async.drainQueues (/usr/src/app/dist/app.js:2:1715217)
Nov 12 03:00:55 293028d resin-supervisor[2356]: at processImmediate (internal/timers.js:456:21)
Nov 12 03:00:55 293028d 28769d824dbb[1075]: Unhandled rejection UpdatesLockedError: Updates are locked: EEXIST: file already exists, open '/mnt/root/tmp/balena-superviso>
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at /usr/src/app/dist/app.js:2:371477
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at tryCatcher (/usr/src/app/dist/app.js:2:220284)
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at Promise._settlePromiseFromHandler (/usr/src/app/dist/app.js:2:1708447)
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at Promise._settlePromise (/usr/src/app/dist/app.js:2:1709730)
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at Promise._settlePromise0 (/usr/src/app/dist/app.js:2:1710770)
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at Promise._settlePromises (/usr/src/app/dist/app.js:2:1712344)
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at _drainQueueStep (/usr/src/app/dist/app.js:2:1715412)
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at _drainQueue (/usr/src/app/dist/app.js:2:1715308)
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at Async._drainQueues (/usr/src/app/dist/app.js:2:1716625)
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at Immediate.Async.drainQueues (/usr/src/app/dist/app.js:2:1715217)
Nov 12 03:00:55 293028d 28769d824dbb[1075]: at processImmediate (internal/timers.js:456:21)
Nov 12 03:01:05 293028d 28769d824dbb[1075]: [api] POST /v2/applications/1623312/restart-service - ms
Nov 12 03:01:05 293028d resin-supervisor[2356]: [api] POST /v2/applications/1623312/restart-service - ms
Nov 12 03:01:05 293028d d3bf40e104ad[1075]: 2020-11-12 03:01:05,778 - point_one.nautilus.nemo.main - ERROR - Traceback (most recent call last):
Nov 12 03:01:05 293028d d3bf40e104ad[1075]: File "/home/point_one/nemo/nautilus/nemo/application_handler.py", line 24, in start
Nov 12 03:01:05 293028d d3bf40e104ad[1075]: app.runner.start(cold_start=clear_state)
Nov 12 03:01:05 293028d d3bf40e104ad[1075]: File "/home/point_one/nemo/nautilus/runner/atlas_balena_runner.py", line 419, in start
Nov 12 03:01:05 293028d d3bf40e104ad[1075]: 'progress?') from None
Nov 12 03:01:05 293028d d3bf40e104ad[1075]: RuntimeError: Unable to start navigation engine: request timed out. Is a software update in progress?
We’re going to try to take some precautions against the restart lockout (rough sketch below):
- Put some content in the update lock file so we can tell if it’s ours and possibly left over
- Check /proc from nemo to see if the nautilus binary is actually running (vs. just the service)
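A rough sketch of what we have in mind (the lock path, helper name, and process name here are placeholders, not necessarily our exact setup):

#!/bin/bash
# Sketch: tag the update lock so a stale copy is recognizable, and use /proc
# to confirm the nautilus binary is actually running before trusting the lock.

LOCK_FILE=/tmp/balena/updates.lock   # assumed lock location

take_lock() {
    # Record who took the lock and when, so a leftover lock from a killed
    # container can be told apart from one held by a live process.
    echo "owner=bootstrap pid=$$ taken_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$LOCK_FILE"
}

nautilus_running() {
    # Walk /proc and look for a process whose comm name is "nautilus".
    local pid
    for pid in /proc/[0-9]*; do
        [ "$(cat "$pid/comm" 2>/dev/null)" = "nautilus" ] && return 0
    done
    return 1
}

# If a lock exists but the binary is not running, treat it as stale and remove it...
if [ -f "$LOCK_FILE" ] && ! nautilus_running; then
    rm -f "$LOCK_FILE"
fi

# ...then take the (tagged) lock before starting the application.
take_lock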
Do you have any suggestions of protections we can try?
If the supervisor kills the container outright (i.e., the application itself never even gets a SIGKILL) and it hits this race condition, there’s no chance our bootstrap script will be able to clean up the update lock. At the very least, it seems like if the supervisor is truly killing a service, it needs to delete the update lock itself.
The cause of the restart is still very concerning though. I don’t understand what the “network changes” are that it thinks it’s detecting, or why changes to the network configuration restart the containers.
Thanks,
Adam
I’m also working with @adamshapiro0 here. Unfortunately, this is causing our customer to get very frustrated with our device. The symptom from their perspective is that half the time when they turn it on (from no power, i.e. a cold start), some of the services just never work.
Our only resolution has been to have them call us; we then go into the balena dashboard, disable the update lock check, and manually restart the service that is hung up.
We desperately need a workaround for this ASAP.
Hi Adam and Aaron,
I know this is a long thread and there are many things we are trying to handle. For the moment I will just focus on your last message.
Who creates the lock on your service? Does the bootstrap wrapper create it? Why is the lock not released when the service ends? Could it be that you are creating the lock too early?
Is there any chance you could enable support on your customer device to take a look? There are some things I’d like to check out related to the network restart.
Thank you
Felipe
Hey Felipe,
The order is basically:
- container start
- bootstrap.sh (CMD)
- create_lock (C application that just makes the lock file)
- nautilus (run until signaled via SIGTERM)
- EXIT trap in bootstrap.sh deletes update lock

bootstrap.sh has an EXIT trap that deletes the update lock after nautilus exits, and a second SIGTERM trap that forwards the shutdown request to nautilus. I tested that previously and confirmed that A) the SIGTERM trap always fires before the EXIT trap, and B) the EXIT trap catches both SIGINT and SIGTERM correctly.
I assumed the only way the exit trap would ever not fire and the update lock would be left is on power loss, in which case /tmp would be cleared anyway. I didn’t realize the supervisor would force-kill the container, which is basically the same as a power loss to the container, BUT the host OS doesn’t reboot, so /tmp doesn’t get cleared and the update lock doesn’t get deleted.
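For context, here’s a stripped-down sketch of what bootstrap.sh does (simplified, with an assumed lock path; not the literal script):

#!/bin/bash
# Simplified sketch of the bootstrap wrapper described above.

LOCK_FILE=/tmp/balena/updates.lock   # assumed lock location
NAUTILUS_PID=""

cleanup() {
    # EXIT trap: runs after nautilus exits for any reason and removes the lock.
    rm -f "$LOCK_FILE"
}

forward_sigterm() {
    # SIGTERM trap: forward the shutdown request to nautilus if it is running.
    # If nautilus has not started yet (the race above), this forwards to nobody.
    echo "Forwarding shutdown request to application..."
    [ -n "$NAUTILUS_PID" ] && kill -TERM "$NAUTILUS_PID" 2>/dev/null
}

trap cleanup EXIT
trap forward_sigterm TERM

create_lock "$LOCK_FILE"   # C helper that just creates the lock file

nautilus &                 # run until signaled via SIGTERM
NAUTILUS_PID=$!
echo "Nautilus started with PID $NAUTILUS_PID."

# wait returns early when a trapped signal arrives, so loop until nautilus
# has actually exited.
while kill -0 "$NAUTILUS_PID" 2>/dev/null; do
    wait "$NAUTILUS_PID"
done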
I enabled support access to 293028dc465f08e8eb7cd1256625e441. That’s our vehicle that I pasted the journal stuff from above. I have not done anything to it since pasting, so it should still be in that state if you scroll back in the journal.
Hi Adam, thanks for sharing support access. I have gathered some logs on the device.
Regarding the lock issue: if the container does not terminate within a certain time after being stopped, the engine will send a kill signal.
I think your assessment is correct: the supervisor probably sends a SIGTERM after the bootstrap script has set the lock but before the application is ready to receive it. Since the application cannot respond to the signal, the engine kills the container and the lock remains.
Your proposed checks sound right: check that the binary is actually running before setting the lock, or have the binary itself set the lock.
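For illustration, the first option might look something like this in the bootstrap wrapper (the names and lock path are assumptions, not your actual setup):

# Start nautilus first, and only take the update lock once the process is
# confirmed to be alive.
nautilus &
NAUTILUS_PID=$!

sleep 1   # give the binary a moment to come up
if kill -0 "$NAUTILUS_PID" 2>/dev/null; then
    create_lock /tmp/balena/updates.lock
else
    echo "nautilus did not start; not taking the update lock" >&2
    exit 1
fi

wait "$NAUTILUS_PID"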
We are continuing to investigate the network changes issue, and hopefully we’ll have some more information with the update of the supervisor.
In the meantime we are keeping this ticket open and will continue to investigate the issues one by one.
Thanks Felipe. We currently only have one device with the updated supervisor - a different device than the one you just looked at. We can put the updated supervisor on this device as well if needed, but ideally we’d like to continue running what our customers are running. I haven’t really seen this issue recently on the device with the modified supervisor though.
Is there an OS/supervisor update for the dart imx8mm platform in the works any time soon?
@pipex I forgot to say that even if we move the lock capture and release into the binary itself, that still won’t resolve the issue where the lock is left behind if the supervisor kills the container as unresponsive.
What do you think in that situation? The supervisor knows that it’s killing the service, and also that the service currently holds the lock. Would it make sense for the supervisor to delete the lock?
Hi Adam,
I have replicated the error you encountered when starting the service: if a lock is left over from a service crash, the start-service supervisor API endpoint throws the error you shared. I think this is a bug, since the lock should not prevent the service from starting, so I have created an issue here: https://github.com/balena-io/balena-supervisor/issues/1523. Once that is fixed, it should at least allow you to restart the services. Your idea to have the supervisor remove the lock on killing the service would unfortunately not work, since it’s the engine that kills the service, not the supervisor.
I’m inquiring with our devices team about the update of the imx8mm platform to the latest balenaOS.
Unfortunately, I still have no new leads on the ‘Network changes’ issue; I had a hypothesis but could not replicate the behavior. About your device with supervisor 12.1.1: could you re-enable support on it? I’d like to monitor it for a while and see if I can try a few things.
Thanks
Felipe
Hey Felipe,
I enabled support on it again and turned it on (device 40fa62f7a85eb88f5548acf3a13812c0). Have at it.
Cheers,
Adam
Thank you Adam. This is a debugging device, right? Do I have your permission to try configuration changes or reboot it to attempt to trigger the issue? Thanks again.
Yep, feel free. This is the device that’s on my desk so I’m the only one the changes would matter to.
Also, can you confirm this is running the same version of your application as your customer’s device?
It’s running a recent test version at the moment, but I’ll pin it to the latest release right now.
Thank you, I see it updating
I just started the device navigating (running the nautilus service) and configured it to start running automatically on boot in case you need to reboot. Just let me know whenever you’re done messing with it so I can stop it (so the disk doesn’t fill up). We have some customer devices that start automatically and some that don’t; nemo always runs continuously. I’m not 100% sure which mode is best for testing the network changes issue, but I think operating is probably more stressful on the device, if that has any impact.
To be clear though, on my device the nautilus service always runs. When it’s not navigating (i.e., running our software), it does a sleep infinity so the container stays up. To start navigating, nemo issues a restart-service.
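Roughly, the flow looks like this (a simplified sketch; the env var names and endpoint usage follow the standard supervisor API, not necessarily our exact code):

# nautilus container: when not navigating, keep the container up without
# running the engine.
if [ "$NAVIGATION_ENABLED" != "1" ]; then   # placeholder flag
    sleep infinity
fi

# nemo: to start navigating, ask the supervisor to restart the nautilus service
# (requires the io.balena.features.supervisor-api label):
curl -X POST \
    "$BALENA_SUPERVISOR_ADDRESS/v2/applications/$BALENA_APP_ID/restart-service?apikey=$BALENA_SUPERVISOR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"serviceName": "nautilus"}'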
Thanks for that; it helps me understand a bit of the behavior I have seen. I think having the navigation on makes more sense, since it replicates the regular operating conditions.