In debugging https://forums.balena.io/t/unexpected-supervisor-crash-container-restart-on-network-change I found another, possibly related issue. Whenever I power cycle from a “fresh” state, meaning I haven’t restarted any containers since the previous power cycle, balena logs seems to display the same output from our nemo service twice, going back to roughly one second after the start of the container: duplicate timestamps and everything. For example:
I’ve definitely seen duplicate messages sometimes, and I know they are duplicates because the timestamps are identical. Even weirder, though, is that the timestamps are older than some of the other logs that came before them. I found https://github.com/balena-io/balena-supervisor/issues/1314 which might be related, even though it mentions local mode.
Are you using local mode by chance? Also, are you able to reproduce the duplicate log behavior consistently when you do these restarts? If so, could you describe in more detail what steps you’re performing? It’s a bit unclear to me how I could try to reproduce this.
Hey, sorry, I missed this reply. Didn’t get a notification for some reason.
No, we are not using local mode. I do appear to be able to reproduce this reliably as described above. Basically:
1. Power cycle the device so it’s in a “nothing touched by me” state (i.e., I haven’t manually restarted any services)
2. Power cycle a second time
That’s it. I haven’t tried a soft reboot, so I’m not sure whether that would trigger it too.
Step 1 seems to be important, though I have no clue why, since I would have thought the second power cycle would reset the state of things either way. It does seem to matter, though. I noticed that if, after the first power cycle, I restart our nemo container in the dashboard and then do step 2, I do not get the duplicate prints. Instead, the supervisor restarts my container unexpectedly about 20 seconds after it started on its own after boot. I wrote that issue up here: Unexpected supervisor crash/container restart on network change
Our other services don’t really print much on startup, so I haven’t noticed whether this is happening to them too, or if nemo is special somehow. In that other ticket, nemo seems to be the only service the supervisor restarts after 20 seconds, so it might be. I don’t really know why, though.
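In case it helps, one way to double-check that these really are byte-for-byte duplicates (timestamp prefix included), rather than nemo logging the same lines twice, would be something like this, with DEVICE standing in for the actual UUID:

balena logs --tail DEVICE | tee nemo-logs.txt    # let it stream for a bit, then Ctrl+C
grep "\[nemo\]" nemo-logs.txt | sort | uniq -d   # print any nemo line that appears more than once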
Could this have something to do with the CLI --tail option? I basically always use that option. Right now I’m debugging something on a different device, and the logs output is very much out of order and also not updating as new messages are printed (or if it is, I don’t know where the new prints are going, since things are out of order).
Here’s an example with some stuff removed for easier reading but in the exact same order as printed by balena logs --tail DEVICE:
If I run balena logs DEVICE without --tail, everything prints in order as it should. The Dashboard console viewer also shows everything in the correct order.
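To be concrete, these are the two invocations I’m comparing:

balena logs DEVICE          # one-shot dump: prints in the correct order
balena logs --tail DEVICE   # continuous stream: this is where I see the out-of-order output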
Not sure what’s going on or if it’s even related to the duplicate entries problem. This is certainly the worst I’ve ever seen it. Like I said, I use --tail all of the time and have never seen anything like this, but it’s making it pretty difficult to debug this particular device right now.
Hi Adam, I’ve tailed the logs of one of my devices here for a bit and I can’t reproduce this issue. I am using the latest balena CLI on macOS (v12.26.1). I am wondering if this is a function of network latency and the Node.js async event loop. All of the logging infrastructure is in US/East. What is your network latency generally like to that geographic location?
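If you want a rough number, something like this from your workstation should do it; I’m using api.balena-cloud.com here just as a stand-in for our US/East infrastructure (the logging endpoint itself may be a different host):

ping -c 10 api.balena-cloud.com   # average round-trip time gives a rough latency estimate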
Huh, interesting. I’m in Buffalo, NY, so latency should be pretty low.
It seems to work fine most of the time. Other than the really bad session earlier when I posted, and some other random occurrences, the only way I’ve found to replicate it reliably is power cycling as described.
OK, that’s probably not it then. Are you able to check the time sync logs on the device? If the time is constantly drifting and chronyd is fixing it up, then the log timestamps in between those corrections may be the cause of the broken ordering…
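Something along these lines from a host OS shell should show whether the clock is being corrected (I’m assuming the unit is named chronyd on your balenaOS version):

chronyc tracking                                 # current offset and frequency error
chronyc sources -v                               # state of the configured NTP sources
journalctl -u chronyd --no-pager | tail -n 50    # recent time sync log entries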
If you want to give it a shot, you should be able to replicate it by rebooting twice as described above. That seems to work for me every time.
FYI, I’ve been trying to replicate yet another issue on that device, where the nemo container can’t talk to the supervisor (connection refused) or can’t be mounted (fork/exec /proc/self/exe: no such file or directory) when it boots after being offline for a few days. We thought it might be related to chrony (https://forums.balena.io/t/chronyc-config-is-bad-if-device-comes-online-without-internet), but from my testing yesterday and today it’s no longer entirely clear that that’s the cause. In any case, I’ve been setting the time backwards and clearing the journal to try to replicate that issue, so don’t worry if you notice some weird stuff in the journal.
I just had a go rebooting your device twice and running the logs. I ran balena logs 40fa62f7a85eb88f5548acf3a13812c0 --tail from the command line and got this result:
I just looked and I don’t see duplicates either. I’ve been physically power cycling though instead of rebooting. Not sure why that would have any effect but maybe?
I can try right now - are you watching the device?
@ljewalsh That replicated it - I physically power cycled one time and didn’t touch anything else. Now when I run balena logs --tail 40fa62 I see prints from 14:01 UTC (your reboot?), followed by some out-of-order prints from 14:25 UTC, then a duplicate copy of the 14:01 prints followed by the 14:25 prints:
[Logs] [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,052 - point_one.nautilus.runner.atlas.balena - INFO - Upload support enabled.
[Logs] [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,140 - point_one.nautilus.runner.atlas.balena - INFO - Initial device status:
[Logs] [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO - Device: adam (40fa62f7a85eb88f5548acf3a13812c0)
[Logs] [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO - Balena release: 4925d5e525ec509201b2cbf4610041f4
...
[Logs] [11/3/2020, 9:01:01 AM] [nemo] * Debug mode: off
[Logs] [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,247 - werkzeug - INFO - * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)
[Logs] [11/3/2020, 9:25:59 AM] [nemo] 2020-11-03 14:25:59,079 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
...
[Logs] [11/3/2020, 9:26:00 AM] [nemo] * Debug mode: off
[Logs] [11/3/2020, 9:26:00 AM] [nemo] 2020-11-03 14:26:00,898 - werkzeug - INFO - * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)
[Logs] [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,052 - point_one.nautilus.runner.atlas.balena - INFO - Upload support enabled.
[Logs] [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,140 - point_one.nautilus.runner.atlas.balena - INFO - Initial device status:
[Logs] [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO - Device: adam (40fa62f7a85eb88f5548acf3a13812c0)
[Logs] [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO - Balena release: 4925d5e525ec509201b2cbf4610041f4
...
[Logs] [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,247 - werkzeug - INFO - * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)
[Logs] [11/3/2020, 9:25:59 AM] [nemo] 2020-11-03 14:25:59,079 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
...
[Logs] [11/3/2020, 9:26:00 AM] [nemo] * Debug mode: off
[Logs] [11/3/2020, 9:26:00 AM] [nemo] 2020-11-03 14:26:00,898 - werkzeug - INFO - * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)
I opened a new shell and re-ran the command, and I still see the duplicate/out-of-order output. Hopefully you’ll be able to see it too. I’ll leave the device exactly as it is right now.