Confusing forensics on device that went offline for 12 days

We had a device in the field go offline for 12 days. After a technician visit and power cycle, it came back online. I see some confusing things:

First:

root@0fce263:~# uptime
19:46:06  up   8:57,  2 users,  load average: 0.70, 0.95, 1.09

But:

root@0fce263:~# balena ps
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS                 PORTS               NAMES
6d544ea7a5a7        e6ad3a09401b                    "/usr/bin/entry.sh /…"   12 days ago         Up 12 days                                 solmon_2660289_1512247
97b3e5978d72        balena/rpi-supervisor:v10.8.0   "./entry.sh"             3 months ago        Up 12 days (healthy)                       resin_supervisor

I don’t understand how the container “uptime” can be greater than the device “uptime”.

Then, looking at the journal logs I see the following snippet:

Sep 06 11:50:01 0fce263 chronyd[776]: 2020-09-06T11:50:01Z Source 2606:4700:f1::123 replaced with 51.255.197.148
Sep 06 11:50:02 0fce263 6d544ea7a5a7[1458]: [157B blob data]
Sep 06 11:50:08 0fce263 chronyd[776]: 2020-09-06T11:50:08Z Selected source 51.255.197.148
Sep 18 14:50:08 0fce263 chronyd[776]: 2020-09-06T11:50:08Z System clock wrong by 1047600.109068 seconds, adjustment started
Sep 18 14:50:08 0fce263 chronyd[776]: 2020-09-18T14:50:08Z System clock was stepped by 1047600.109068 seconds
Sep 18 14:50:08 0fce263 journalctl[18774]: Vacuuming done, freed 0B of archived journals from /run/log/journal/0af5b029b9f04117bf64c4ce46a27cbc.
Sep 18 14:50:08 0fce263 openvpn[8786]: Fri Sep 18 14:50:08 2020 [vpn.balena-cloud.com] Inactivity timeout (--ping-restart), restarting
Sep 18 14:50:08 0fce263 openvpn[8786]: Fri Sep 18 14:50:08 2020 /etc/openvpn-misc/downscript.sh resin-vpn 1500 1555 10.240.30.115 52.4.252.97 restart
Sep 18 14:50:08 0fce263 openvpn[8786]: Fri Sep 18 14:50:08 2020 SIGUSR1[soft,ping-restart] received, process restarting
Sep 18 14:50:08 0fce263 openvpn[8786]: Fri Sep 18 14:50:08 2020 Restart pause, 5 second(s)
Sep 18 14:50:09 0fce263 6d544ea7a5a7[1458]: [152B blob data]
Sep 18 14:50:12 0fce263 6d544ea7a5a7[1458]: [147B blob data]
Sep 18 14:50:12 0fce263 6d544ea7a5a7[1458]: [143B blob data]
Sep 18 14:50:12 0fce263 6d544ea7a5a7[1458]: [179B blob data]
Sep 18 14:50:13 0fce263 97b3e5978d72[1458]: [info]    Waiting for connectivity...
Sep 18 14:50:13 0fce263 resin-supervisor[9197]: [info]    Waiting for connectivity...

Nothing between Sep 6 and Sep 18. So I suspect that the results of balena ps are incorrect.

Any thoughts on how else I can check if the device was actually powered up and alive between Sep 6 and 18?

Hello @0xff ,

It does seem the device wasn’t alive between Sep-06 and Sep-18, otherwise, we should be seeing some supervisor/engine logs. What’s strange is that there are no logs of the device either booting (if it was turned off) or recovering (if it was stuck somehow).

I’ve reached out to the engine team to see if they have any ideas about why the container uptime might be higher than the device uptime.

Cheers,
Nico.

Does the system have a working RTC?

One possible explanation is that the system booted with the last “good” time (Sep 06) before syncing and setting the correct time (Sep 18). If the containers are started before time synchronization occurs in the boot process then the start date of the containers will be Sep 06. Therefore, if you run balena ps after the time has been updated the system will think they have been running for 12 days.

That’s an interesting idea. However, we just saw this same problem in the field where the device went offline for 8 hours and the logs were completely missing, even boot logs. The device had been working properly and the time was right before it went offline.

Hi there, is it possible these devices are losing network connectivity intermittently and continuing to run offline? Also, what type of hardware is this running on? Is it possible the device is being put to sleep or quiesced like a virtual machine?

@ab77 I don’t think the devices are being put to sleep in any manner that I know of. This is a pi0w device. Unfortunately we’re blind to what’s going on because the logs disappear, and as pointed above by @ntzovanis , weird that there is no bootup or recovery information in the logs

Hi there,

Just as a sanity check, if the device is not configured for persistent logging there wouldn’t be any logs from previous boots. Can you double check that you have persistent logging enabled for the device(s) in question?

Hi, the device did have persistent boots. Note that in the log provided in the post, the issue is missing logs between Sep 6 and Sep18. I suspect if there was a persistent logging issue, we wouldn’t see the Sep6 logs either.

Hi there – taking another look at the log you orginally posted, I notice that chronyd makes a rather large adjustment to the system clock:

Sep 06 11:50:01 0fce263 chronyd[776]: 2020-09-06T11:50:01Z Source 2606:4700:f1::123 replaced with 51.255.197.148
Sep 06 11:50:02 0fce263 6d544ea7a5a7[1458]: [157B blob data]
Sep 06 11:50:08 0fce263 chronyd[776]: 2020-09-06T11:50:08Z Selected source 51.255.197.148
Sep 18 14:50:08 0fce263 chronyd[776]: 2020-09-06T11:50:08Z System clock wrong by 1047600.109068 seconds, adjustment started
Sep 18 14:50:08 0fce263 chronyd[776]: 2020-09-18T14:50:08Z System clock was stepped by 1047600.109068 seconds

If my math is right, the discrepancy that chronyd notes (1,046,600 seconds) is just about 12 days and 3 hours – which in turn matches the jump in the logged dates (Sep 6, 11:50 -> Sep 18, 14:50). Given that, I’m inclined to agree with the original suggestion: that the system clock came up incorrect; balena engine recorded that incorrect start date; chronyd corrected the time; and now it looks like the containers have been up longer than the device. This seems even more likely if your device does not have an RTC module.

I would speculate that the incorrect time came from the chrony drift file; if this file were saved (which happens once per hour), then the device lost power for some time, then got power back and booted, it could explain the logs you’re seeing.

If I understand correctly, you mentioned another device where you saw similar behaviour. If you’re able to provide info on that device, we can try to determine what happened with it.

All the best,
Hugh