Chrony config causes problems if device comes online without internet

Our devices don’t always have internet access on first power-up, because the upstream router takes a few minutes to boot.

This seems to conflict with the current chrony settings, which attempt to sync aggressively with NTP servers for the first few minutes the device is on, but then back off to roughly every 4 hours.
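
For illustration, chrony expresses this kind of burst-then-back-off behaviour with the iburst and maxpoll options on its server/pool lines. A hypothetical config line (not necessarily what balenaOS actually ships) would look like:

    pool pool.ntp.org iburst maxpoll 14
    # iburst: send a rapid burst of requests at startup for a fast initial sync
    # maxpoll 14: once synced, back off to polling every 2^14 seconds (~4.5 hours)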

This causes nasty SSL issues and other problems usually caused by bad time.

We’ve been able to work around this by manually running chronyc makestep, but this can’t be done from the containers (we think?), which makes it not really an option for production.
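
For reference, this is roughly what we run by hand on the host OS (chronyc tracking is only there to confirm the step actually happened):

    chronyc makestep   # force an immediate step of the clock to the NTP-estimated time
    chronyc tracking   # confirm the offset and sync status afterwards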

Any ideas on how we could fix this? Is there a way to force clock sync through the supervisor api?

Hey Aaron, this is interesting, I’ll reach out to the OS team.

Hey Aaron, just to add a bit more here: one of the OS team members is currently working on improving the time sync subsystem of the OS, because we have had other cases where people have been bitten by the 4-hour poll time. He is out today but should be back early next week to weigh in on this thread and hopefully give you some good news about what is coming in the OS.

Hey Aaron, we are working on a solution for this and will update this thread as soon as it is implemented. Thank you for your patience!

Great. I’ll say my preference is to have the ability to force a sync through the supervisor API, since waiting for a full OS update would take some time. If there is a way to do so via DBus, that would be OK too.

Hey Aaron, I am checking with the OS team whether it makes sense to run chronyc makestep through DBus in the meantime, while we get the fix out.

Chrony doesn’t have a DBus interface, but you can use the systemd interface to restart the service, which will trigger the initial synchronisation burst, like so:

DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket dbus-send --system --dest=org.freedesktop.systemd1 --type=method_call --print-reply /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager.RestartUnit string:"chronyd.service" string:"replace"

To access DBus from within a container you would need to follow the instructions here: https://www.balena.io/docs/learn/develop/runtime/#dbus-communication-with-host-os
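
In short (a sketch based on those docs rather than a tested snippet): enable the DBus feature label for your service in docker-compose.yml, then point DBus clients at the host socket, as in the command above:

    # docker-compose.yml, under the relevant service
    labels:
      io.balena.features.dbus: '1'

    # inside the container, before running dbus-send
    export DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket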

Points for creativity! We’ll give this a shot.

Hey gang,

Just wanted to see how things are going with the time sync improvements and update you on our progress tracking this down.

The customer that reported this has been hitting it a fair amount. They tend to leave their devices off for days at a time, and then start them up in an environment that does not have internet. Right now, the only way for them to resolve the issue is to physically power cycle the unit one or more times after they leave that environment and hope that it resolves itself, which is not a good solution for them (or anyone generally).

I dug into this issue a bit more this weekend. I do believe the chrony 4 hour issue is a problem. If they have the device up without internet for long enough, chrony won’t update the time for hours (not sure how long that wait is).

That said, I believe something else, possibly more problematic, is going on, and it all may be related. I’ve been doing the following in an attempt to replicate the “chrony didn’t update the time” issue:

  1. Turn off the device and unplug ethernet
  2. Turn the device on
  3. In the host OS, use date -s to set the date backward by 2 months (to Aug 1); rough commands are sketched after this list
    • I’m not sure what the actual minimum “offline” time is to trigger this issue so I’ve just been going way back
  4. Wait some amount of time - usually at least a minute, sometimes a few
  5. Plug in the ethernet
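
For reference, this is roughly what I run on the host OS for steps 3 and 5 (a sketch; the exact date and the number of journal lines are arbitrary):

    # step 3: wind the clock back about two months
    date -s "2020-08-01 00:00:00"

    # step 5: after plugging ethernet back in, check whether chrony steps the clock
    chronyc tracking
    journalctl -u chronyd.service --no-pager | tail -n 20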

Doing this, I’ve found that chrony does sync the OS time within a couple of seconds, and that time does propagate to the containers as expected. I don’t know if I just need to wait a lot longer before plugging it in to cause chrony not to update. Despite that, however, about 1 in 5-10 times our nemo container gets into some weird state with the following symptoms:

  • It can’t seem to reach the internet, or at least its attempts to validate AWS credentials are failing repeatedly with EndpointConnectionError in Python (generated by the requests library, I believe caused by socket.gaierror - I haven’t had a chance to dig into this further to see what OS request/syscall is causing this and what the actual errno is)
    • When this happens, our code retries in a thread once every 5 seconds. We have that thread print the current OS time, so we know that chrony’s updated time is making it into the container, but it isn’t resolving anything (a rough shell equivalent of this check is sketched after this list)
  • Docker can’t open a shell into the nemo container: balena exec -it NEMO_ID bash is failing with:
    rpc error: code = 2 desc = "oci runtime error: exec failed: fork/exec /proc/self/exe: no such file or directory"
    
  • One time when I did this I hit the previously reported supervisor crash (https://forums.balena.io/t/unexpected-supervisor-crash-container-restart-on-network-change), but only once and I haven’t seen it since (yet)
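
To help narrow down whether this is a DNS problem, here is a rough shell equivalent of what our retry thread does (illustrative only; our real code is Python, sts.amazonaws.com is just a stand-in for the AWS endpoint we actually hit, and it assumes getent exists in the container image):

    while true; do
        echo "$(date -u) checking name resolution..."
        # a failure here would line up with the socket.gaierror we see in Python
        getent hosts sts.amazonaws.com || echo "lookup failed"
        sleep 5
    done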

I haven’t figured out what combination of waiting, trying again, etc. is actually causing it. Unfortunately, since the date is set backwards, the journal is also a little tricky to read. I’ve tried clearing the journal just before starting this process, which helps but I still haven’t found anything obvious.

When this happens, the supervisor and all of the other services seem to be running just fine. Only nemo has this issue for some reason. Restarting nemo from the dashboard resolves the issue, but this is not something a customer can do.

I tried googling as much as I could for this and the best I could find was the following ticket: https://github.com/moby/moby/issues/25381. In it, people seemed to have the same symptom and their recommended resolution was to restart Docker (also not a viable solution for us). It doesn’t really look like anyone in it actually figured out what was happening though.

We could really use help with this one. I don’t know if we want to treat the chrony thing and this forking issue as separate; it’s not clear if they are related or not, but they certainly exhibit the same symptoms from our customers’ perspective. Please let me know if you’d like me to open another thread.

Any advice would be much appreciated. I’m also happy to carve out time to work in real time with someone trying to replicate this since it requires physical access to the unit.

Thanks,

Adam

Hello again,

Apologies for the delay in getting back to you. We just released a PR that improves time handling on startup: https://github.com/balena-os/meta-balena/pull/2014

I am still tracking down the issue for improving chrony synchronisation on boot. We will send it to you soon.

Hi @ljewalsh,

That PR looks like it might help a little, though it’s still not clear to me what exactly is causing this problem in the first place and I don’t imagine this will resolve the whole issue. Any idea when it’s going to be mainlined for DART MX8M Mini? The last host OS update for it was 2.50.1+rev1 from May, which is using meta-balena 2.47.1 from February.

Cheers,

Adam

Hello @adamshapiro0
I have opened a specific issue for this so that you can monitor progress: https://github.com/balena-os/meta-balena/issues/2044

Hi guys,

Just checking up on this issue - it’s been a while.

We had an issue a few weeks ago with a customer device where it recorded 10 data logs with nearly identical start timestamps, all recorded at different times throughout the course of the day. The customer powered down the device each time (turned off their vehicle), and the device is set to start automatically on boot when they turn the vehicle back on. The boot time is pretty deterministic, so if the host OS was starting up with the same timestamp on each boot we would expect to see exactly this symptom - the OS timestamp at the start of each log was the same to roughly +/- 1 second.

We believe the issue is that chrony was not updating the “latest timestamp” value that it stores on disk before the device was powered off, so it started with the same initial timestamp after each boot (our devices do not currently have an RTC). It’s not clear whether chrony was able to get time from the Balena time servers at all throughout the day. Unfortunately we don’t have journal output from the device at the time to confirm one way or the other.

The cell modem in their vehicle takes some time to boot and connect, so it’s possible chrony couldn’t reach the internet immediately after boot. From my testing back at the beginning of this ticket, it seems that chrony does not update the “latest timestamp” on disk every time it syncs, only at some infrequent interval. So even if the device did sync some time after it booted and started recording, the next boot would likely still use the same old timestamp.

That being said, we have not seen this issue before or since in that particular vehicle, so it’s not clear if it was the cell modem having trouble connecting or something else. It could also have been an upstream time server issue or similar (this happened on Feb 9 in case you know of any time server issues from around then). We have confirmed from the data in the logs that the device definitely did have working internet access throughout the day.

We think some of the following features might help if they could be used from within a container:

  • Ability to query chrony and/or the host OS time sync service (tracked in balena-os/meta-balena on GitHub, not yet merged) to check if time has synced successfully since the last boot
  • Ability to check for an RTC
    • We do not currently have one but plan to add one for new devices in a future hardware update
  • Ability to control how often the time service updates its “last timestamp”
    • I haven’t found how this is controlled currently in chrony, or even where it’s stored on disk, but it seems to be very infrequent
  • Ability to manually tell the time service/chrony to do a sync
  • Ability to add alternative time servers
    • Devices currently only use the Balena time servers, but in the case of an upstream connectivity issue or something it would be nice to fall back to time.gov or similar

For now, we can maybe modify our software to manually hit time.gov or similar when it starts to at least check whether chrony is incorrect, but that still won’t help if there’s no internet connection. Obviously, having an RTC is the best long-term solution, but in the meantime having chrony save time to disk more frequently would help a lot.
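
As a rough idea of the kind of sanity check we have in mind (a sketch only; it assumes curl and GNU date are available in the container, and it uses a generic HTTPS endpoint’s Date header rather than a dedicated time API):

    # compare the local clock against the Date header of an HTTPS response
    remote_epoch=$(date -d "$(curl -sI https://www.google.com | tr -d '\r' | grep -i '^date:' | cut -d' ' -f2-)" +%s)
    local_epoch=$(date +%s)
    echo "clock offset: $(( local_epoch - remote_epoch )) seconds"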

What do you think? Any ideas for ways we can mitigate this?

Thanks,

Adam

Hi Adam, thanks for the details about your use case. Time synchronization with and without an RTC is a critical feature for BalenaOS and we are working on improving the way time is handled.

Recent releases already address the problem chrony had with updating the last timestamp. We have replaced this chrony functionality with a fake.hwclock service; see systemd/timeinit: add fake.hwclock to maintain system time over reboots · balena-os/meta-balena@96c2c49 · GitHub. This change has been available since 2.60.

About the NTP servers: by default BalenaOS uses a pool of servers provided by ntp.org, which are already geographically distributed (see pool.ntp.org: the internet cluster of ntp servers). However, you can customize your own NTP server list if you prefer; see Configuration - Balena Documentation for details.
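
For example, a minimal sketch of that customization, assuming the config.json field is named ntpServers (a space-separated list) as described in the configuration docs:

    "ntpServers": "0.pool.ntp.org 1.pool.ntp.org time.nist.gov"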

You mention querying the OS time sync service to check if time was synchronized and manually triggering a time sync - however, this should not be needed if the time sync service is working properly, and we are actively working on making that happen.

We are still working on ways to improve time synchronization, for example looking at alternatives to NTP for when it is not available; see systemd/timeinit: add HTTPS time synchronisation service by markcorbinuk · Pull Request #2074 · balena-os/meta-balena · GitHub.

However, I believe that a recent BalenaOS release will probably solve your time synchronization problems.

Hi Alex,

Thanks for the update. Any idea when the Variscite Dart-MX8M image will be updated? Our hardware is based on that platform, and we’re trying as hard as we can to use Balena’s mainline host OS images instead of custom builds, which we originally had to do before official support was added for the platform. The last update for the dart image was BalenaOS 2.50.1+rev1 nearly a year ago.

Also, it looks from the update like fake.hwclock saves once an hour. Is there any way we can speed that up a bit (or have a container manually tell it to save)? Most car trips are much shorter than that, and in those cases the timestamp wouldn’t get updated and we would see the same repeated-timestamp issue.

Cheers,

Adam

Hi Adam, I have created an update request for our devices team to prioritize at Update to a recent BalenaOS version (>2.60) to take advantage of time sync improvements · Issue #73 · balena-os/balena-variscite-mx8 · GitHub. We will update this ticket once the release is available.
OS customization is not something that we currently support, but we are working on improving the configuration framework so that these use cases become feasible. I will open a discussion internally so that we can provide a solution for your use case in the short term.

Ok, thanks. Much appreciated.

One thing that occurs to me is that your application container could start the fake-hwclock-update service via DBus to sync the time to disk on demand. Would that work?

Yes, absolutely that would work for us. We have a management container that already does other background stuff like that, so it would be reasonable to add. Is there an example of starting a service via dbus?

Hi Adam, we do have some examples in the OS masterclass, available at BalenaOS Masterclass - Balena Documentation. I haven’t tried the fake-hwclock-update service in particular, but it should work in the same way.
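
Something along these lines should work, analogous to the RestartUnit call earlier in this thread (the exact unit name here is an assumption, so it is worth confirming on the host first, e.g. with systemctl list-units | grep hwclock):

DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket dbus-send --system --dest=org.freedesktop.systemd1 --type=method_call --print-reply /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager.StartUnit string:"fake-hwclock-update.service" string:"replace"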