Duplicate entries in `balena logs` after boot.

Are you able to ssh in and run journalctl -a -f to see if the logs are out of order at the source?

Yep, the issue only affects balena logs. The journal looks fine, no duplicates and everything in order (with and without -f).

Thank you, this could be a bug in the cli, let me confirm if this is the case internally first and open a GitHub issue…

Hi Adam, I’ve tailed one of my device logs here for a bit and I can’t reproduce this issue. I am using the latest balena CLI on macOS (v12.26.1). I am wondering if this is a function of network latency and nodejs async. loop. All of the logging infrastructure is on US/East. What is your network latency generally like to that geographic location?

Huh interesting. I’m in Buffalo, NY so pretty low.

It seems to work most of the time. Other than the really bad time I had earlier when I posted, and other random times, the only way I’ve found to replicate it reliably is power cycling as described.

OK, that’s probably not it then. Are you able to check the time sync logs on the device. If the time is constantly drifting and cronyd is fixing it up, then in between those times, the timestamps on the logs may be the cause of the broken ordering…

OK, can you please grant support access to this device and share the uuid. We’ll see if we can narrow down where this ordering is taking place…

Yes, it’s device 40fa62f7a85eb88f5548acf3a13812c0. I actually already granted support access so @20k-ultra could load on a debug supervisor for another ticket related to device restarts (https://forums.balena.io/t/unexpected-supervisor-crash-container-restart-on-network-change). Feel free to take a look.

If you want to give it a shot, you should be able to replicate it by rebooting twice as described above. That seems to work for me every time.

FYI, I’ve been trying to replicate yet another issue on that device where the nemo container can’t talk to the supervisor (connection refused) or can’t be mounted (fork/exec /proc/self/exe: no such file or directory) when it boots after being offline for a few days. We thought it might be related to chrony, (https://forums.balena.io/t/chronyc-config-is-bad-if-device-comes-online-without-internet) but it’s not entirely clear if that’s the cause anymore from my testing yesterday and today. In any case, I’ve been changing setting time backwards and clearing the journal to try to replicate that issue. Don’t worry if you notice some weird stuff in the journal.

Hey Adam,

I just had a go rebooting your device twice and running the logs. I ran balena logs 40fa62f7a85eb88f5548acf3a13812c0 --tail from the command line and got this result:

From what I can see there aren’t any duplicate messages in there. Do you get the same result if you run the log command now?

I just looked and I don’t see duplicates either. I’ve been physically power cycling though instead of rebooting. Not sure why that would have any effect but maybe?

I can try right now - are you watching the device?

@ljewalsh That replicated it - I physically power cycled one time. Didn’t touch anything else. Now when I run balena logs --tail 40fa62 I see prints from 14:01 UTC (your reboot?), followed by some out of order prints from 14:25 UTC, then a duplicate copy of 14:01 followed by 14:25:

[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,052 - point_one.nautilus.runner.atlas.balena - INFO - Upload support enabled.
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,140 - point_one.nautilus.runner.atlas.balena - INFO - Initial device status:
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO -   Device: adam (40fa62f7a85eb88f5548acf3a13812c0)
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO -   Balena release: 4925d5e525ec509201b2cbf4610041f4
...
[Logs]    [11/3/2020, 9:01:01 AM] [nemo]  * Debug mode: off
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,247 - werkzeug - INFO -  * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)
[Logs]    [11/3/2020, 9:25:59 AM] [nemo] 2020-11-03 14:25:59,079 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
...
[Logs]    [11/3/2020, 9:26:00 AM] [nemo]  * Debug mode: off
[Logs]    [11/3/2020, 9:26:00 AM] [nemo] 2020-11-03 14:26:00,898 - werkzeug - INFO -  * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,052 - point_one.nautilus.runner.atlas.balena - INFO - Upload support enabled.
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,140 - point_one.nautilus.runner.atlas.balena - INFO - Initial device status:
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO -   Device: adam (40fa62f7a85eb88f5548acf3a13812c0)
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,141 - point_one.nautilus.runner.atlas.balena - INFO -   Balena release: 4925d5e525ec509201b2cbf4610041f4
...
[Logs]    [11/3/2020, 9:01:01 AM] [nemo] 2020-11-03 14:01:01,247 - werkzeug - INFO -  * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)
[Logs]    [11/3/2020, 9:25:59 AM] [nemo] 2020-11-03 14:25:59,079 - point_one.nautilus.runner.atlas.balena - INFO - Running locally.
...
[Logs]    [11/3/2020, 9:26:00 AM] [nemo]  * Debug mode: off
[Logs]    [11/3/2020, 9:26:00 AM] [nemo] 2020-11-03 14:26:00,898 - werkzeug - INFO -  * Running on http://0.0.0.0:8888/ (Press CTRL+C to quit)

I opened a new shell and re-ran the command and I still see the duplicates/out of order stuff. Hopefully you should see it too? I’ll leave the device exactly as is right now.

Ah that must be the difference. Thanks for checking that for us

I’ll try running logs now and see if I see the same

Yes I can see the duplicates too. I will open an issue for this now

I have opened an issue here: https://github.com/balena-io/balena-cli/issues/2083

Thanks again for reporting this and for investigating it with me

Are you still seeing the issue with the nemo container?

The duplicates issue, or one of the other issues? I’m currently still seeing three issues:

  1. The duplicate entries on power cycle if I touch nothing (this ticket)
  2. Nemo unexpectedly restarted by supervisor if I power cycle after restarting Nemo manually from teh dashboard (https://forums.balena.io/t/unexpected-supervisor-crash-container-restart-on-network-change)
  3. Docker occasionally has a fork/exec error trying to SSH into Nemo, and when that is happening Nemo cannot seem to reach the internet (AWS queries failing) that we think is related to the device starting up with the wrong date (https://forums.balena.io/t/chronyc-config-is-bad-if-device-comes-online-without-internet)

#3 is the most critical for us right now since the only way to resolve it is to reboot the device and hope it doesn’t happen again, or to restart Nemo from the dashboard. Our customers can’t restart the service, and having to reboot and cross their fingers is a pretty bad solution. Unfortunately I can’t seem to replicate it reliably or figure out exactly what triggers it. I’ve been able to replicate it occasionally (maybe 1 in 10 times) with a sequence of:

  1. Power cycle with ethernet unplugged
  2. Use date -s to set the date back a month or more (not sure what the actual min time is so just using a lot)
  3. Wait a bit (a minute, a few? sometimes helps, sometimes doesn’t)
  4. Plug the ethernet in

I figured I’d update https://forums.balena.io/t/chronyc-config-is-bad-if-device-comes-online-without-internet to continue tracking this issue. I was waiting until I found a way to replicate it, or at least gathered a little more information. I can update it now, though I’m not sure if maybe it should actually be an entirely new thread?

Ah, thanks for summarizing. I’ll see what I can find out about those other issues. I’ll stick to discussing them on their separate threads