Duplicate entries in `balena logs` after boot.

Hey all,

In debugging https://forums.balena.io/t/unexpected-supervisor-crash-container-restart-on-network-change I found another, possibly related issue. Whenever I power cycle from a “fresh” state, meaning haven’t restarted any containers since the previous power cycle, it seems display the same output from our nemo service twice in balena logs going back ~1 second or so since the start of the container: duplicate timestamps and everything. For example:

[Logs]    [10/30/2020, 2:07:22 PM] [nemo] 2020-10-30 18:07:22,313 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
[Logs]    [10/30/2020, 2:07:23 PM] [nemo] 2020-10-30 18:07:23,751 - point_one.nautilus.runner.atlas.balena - INFO -   Device: adam (40fa62f7a85eb88f5548acf3a13812c0)
...

[Logs]    [10/30/2020, 2:07:22 PM] [nemo] 2020-10-30 18:07:22,313 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
...

After that it seems to work normally. I checked and the duplicates do not show up in journalctl.

Not sure why that’s happening, seems like a bug in the log upload somewhere? I’m not really sure what process is responsible for that on the host.

Cheers,

Adam

I’ve definitely seen duplicate messages sometimes and I know they are duplicates because I see the timestamps being the same. Even weirder though is that the timestamp are older then some of the other logs that came before it. I found https://github.com/balena-io/balena-supervisor/issues/1314 which might be related even though it mentions local mode.

Are you using local mode by chance ? Also, are you able to produce the duplicate log behavior consistently when you do these restarts ? If so could you describe some more what steps you’re performing ? It’s a bit unclear to me how I could try to reproduce this.

Thanks!

Hey, sorry missed this reply. Didn’t get a notification for some reason.

No, we are not using local mode. I do appear to be able to reproduce this reliably as described above. Basically:

  1. Power cycle the device so it’s in a “nothing touched by me” state (i.e., I haven’t manually restarted any services)
  2. Power cycle a second time

That’s it. I haven’t tried soft rebooting, not sure if that would cause it too.

Step 1 seems to be important, though I have no clue why since I would think the second power cycle would reset the state of things either way. Seems to be though. I noticed that if after the first power cycle I restart our nemo container in the dashboard, then do step 2, I do not get the duplicate prints. Instead, the supervisor restarts my container unexpectedly about 20 seconds after it started on its own after boot. I wrote that issue up here: Unexpected supervisor crash/container restart on network change

Our other services don’t really print much on startup so I haven’t noticed if this is happening to them too, or if nemo is special somehow. In that other ticket, nemo seems to be the only service the supervisor restarts after 20 seconds, so it might be. Don’t really know why though.

What are the OS and supervisor versions your device is running? If you run the device’s diagnostics, is everything green?

Host OS 2.50.1+rev1 (for Variscite DART-MX8M Mini)
Supervisor 11.4.10

Yes, all green on the diag page.

Could this have something to do with the CLI --tail option? I basically always use that option. Right now I’m debugging something on a different device and my logs output is very much out of order and also not updating as new messages are printed (or if it is I don’t know where the new prints are going since things are out of order).

Here’s an example with some stuff removed for easier reading but in the exact same order as printed by balena logs --tail DEVICE:

[Logs]    [11/2/2020, 4:40:16 PM] [eos] 192.168.32.32 - - [02/Nov/2020:21:40:16 +0000] "GET /favicon.ico HTTP/1.1" 200 3561 "http://192.168.32.77/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36" "-"
[Logs]    [11/2/2020, 4:41:07 PM] [eos] 52.4.252.97 - - [02/Nov/2020:21:41:07 +0000] "GET /notification.woff HTTP/1.1" 200 975 "http://localhost:8000/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36" "-"
...
[Logs]    [11/2/2020, 4:25:06 PM] [nemo] 2020-11-02 21:25:06,087 - point_one.log.uploader - INFO - Requesting upload of log /data/logs/2020-11-02/93b5a5d0-68c8-4073-84df-f49a1a493e12.
[Logs]    [11/2/2020, 4:25:06 PM] [nemo] 2020-11-02 21:25:06,204 - botocore.credentials - INFO - Found credentials in environment variables.
...
[Logs]    [11/2/2020, 4:41:04 PM] [nemo] 2020-11-02 21:41:04,157 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
[Logs]    [11/2/2020, 4:41:04 PM] [nemo] 2020-11-02 21:41:04,158 - point_one.nautilus.runner.atlas.balena - INFO -   BALENA_SUPERVISOR_ADDRESS=http://10.114.104.1:48484
...
[Logs]    [11/2/2020, 4:23:05 PM] [nautilus] I20201102 21:23:05.028426    18 common_runner.cc:635] ---> Log Manager...
[Logs]    [11/2/2020, 4:23:05 PM] [nautilus] I20201102 21:23:05.028962    18 common_runner.cc:639] Runner shutdown complete.
[Logs]    [11/2/2020, 4:23:05 PM] [nautilus] I20201102 21:23:05.029014    18 common_runner.h:143] Shutdown complete.
[Logs]    [11/2/2020, 4:23:05 PM] [nautilus] Releasing update lock file (/tmp/balena/updates.lock).

If I run balena logs DEVICE without --tail everything prints in order as it should. Also, the Dashboard console viewer is in the correct order.

Not sure what’s going on or if it’s even related to the duplicate entries problem. This is certainly the worst I’ve ever seen. Like I said, I use --tail all of the time and have never anything like this, but it’s making it pretty difficult to debug particular device right now.

Thanks,

Adam

Are you able to ssh in and run journalctl -a -f to see if the logs are out of order at the source?

Yep, the issue only affects balena logs. The journal looks fine, no duplicates and everything in order (with and without -f).

Thank you, this could be a bug in the cli, let me confirm if this is the case internally first and open a GitHub issue…

Hi Adam, I’ve tailed one of my device logs here for a bit and I can’t reproduce this issue. I am using the latest balena CLI on macOS (v12.26.1). I am wondering if this is a function of network latency and nodejs async. loop. All of the logging infrastructure is on US/East. What is your network latency generally like to that geographic location?

Huh interesting. I’m in Buffalo, NY so pretty low.

It seems to work most of the time. Other than the really bad time I had earlier when I posted, and other random times, the only way I’ve found to replicate it reliably is power cycling as described.

OK, that’s probably not it then. Are you able to check the time sync logs on the device. If the time is constantly drifting and cronyd is fixing it up, then in between those times, the timestamps on the logs may be the cause of the broken ordering…

OK, can you please grant support access to this device and share the uuid. We’ll see if we can narrow down where this ordering is taking place…

Yes, it’s device 40fa62f7a85eb88f5548acf3a13812c0. I actually already granted support access so @20k-ultra could load on a debug supervisor for another ticket related to device restarts (https://forums.balena.io/t/unexpected-supervisor-crash-container-restart-on-network-change). Feel free to take a look.

If you want to give it a shot, you should be able to replicate it by rebooting twice as described above. That seems to work for me every time.

FYI, I’ve been trying to replicate yet another issue on that device where the nemo container can’t talk to the supervisor (connection refused) or can’t be mounted (fork/exec /proc/self/exe: no such file or directory) when it boots after being offline for a few days. We thought it might be related to chrony, (https://forums.balena.io/t/chronyc-config-is-bad-if-device-comes-online-without-internet) but it’s not entirely clear if that’s the cause anymore from my testing yesterday and today. In any case, I’ve been changing setting time backwards and clearing the journal to try to replicate that issue. Don’t worry if you notice some weird stuff in the journal.

Hey Adam,

I just had a go rebooting your device twice and running the logs. I ran balena logs 40fa62f7a85eb88f5548acf3a13812c0 --tail from the command line and got this result:

From what I can see there aren’t any duplicate messages in there. Do you get the same result if you run the log command now?

I just looked and I don’t see duplicates either. I’ve been physically power cycling though instead of rebooting. Not sure why that would have any effect but maybe?

I can try right now - are you watching the device?

@ljewalsh That replicated it - I physically power cycled one time. Didn’t touch anything else. Now when I run balena logs --tail 40fa62 I see prints from 14:01 UTC (your reboot?), followed by some out of order prints from 14:25 UTC, then a duplicate copy of 14:01 followed by 14:25:

[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,052 - point_one.nautilus.runner.atlas.balena - INFO - Upload support enabled.
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,140 - point_one.nautilus.runner.atlas.balena - INFO - Initial device status:
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO -   Device: adam (40fa62f7a85eb88f5548acf3a13812c0)
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO -   Balena release: 4925d5e525ec509201b2cbf4610041f4
...
[Logs]    [11/3/2020, 9:01:01 AM] [nemo]  * Debug mode: off
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,247 - werkzeug - INFO -  * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)
[Logs]    [11/3/2020, 9:25:59 AM] [nemo] 2020-11-03 14:25:59,079 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
...
[Logs]    [11/3/2020, 9:26:00 AM] [nemo]  * Debug mode: off
[Logs]    [11/3/2020, 9:26:00 AM] [nemo] 2020-11-03 14:26:00,898 - werkzeug - INFO -  * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,052 - point_one.nautilus.runner.atlas.balena - INFO - Upload support enabled.
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,140 - point_one.nautilus.runner.atlas.balena - INFO - Initial device status:
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO -   Device: adam (40fa62f7a85eb88f5548acf3a13812c0)
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO -   Balena release: 4925d5e525ec509201b2cbf4610041f4
...
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,247 - werkzeug - INFO -  * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)
[Logs]    [11/3/2020, 9:25:59 AM] [nemo] 2020-11-03 14:25:59,079 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
...
[Logs]    [11/3/2020, 9:26:00 AM] [nemo]  * Debug mode: off
[Logs]    [11/3/2020, 9:26:00 AM] [nemo] 2020-11-03 14:26:00,898 - werkzeug - INFO -  * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)

I opened a new shell and re-ran the command and I still see the duplicates/out of order stuff. Hopefully you should see it too? I’ll leave the device exactly as is right now.

Ah that must be the difference. Thanks for checking that for us

I’ll try running logs now and see if I see the same