Chronyc config is bad if device comes online without internet

Our devices don’t always have internet on first power-up, because the upstream router takes a few minutes to boot.

This seems to conflict with the latest chrony settings, which attempt to sync aggressively with NTP servers for the first few minutes the device is on, but then back off to every ~4 hours.

This causes nasty SSL issues and other problems usually caused by bad time.

We’ve been able to manually run chronyc makestep to work around this, but this can’t be done from the containers (we think?), which makes it not really an option for production.

Any ideas on how we could fix this? Is there a way to force clock sync through the supervisor api?

Hey Aaron, this is interesting, I’ll reach out to the OS team.

Hey Aaron, just to add a bit more here: one of the OS team members is currently working on improving the time sync subsystem of the OS, because we have had other cases where people have been bitten by the 4 hour poll time. He is out today but should be back early next week to weigh in on this thread and hopefully give you some good news about future things coming in the OS.

Hey Aaron, we are working on a solution for this and will update this thread as soon as it is implemented. Thank you for your patience!

Great. I’ll say my preference is to have the ability to force a sync through the supervisor API, since right now waiting for a full OS update would take some time. If there is a way to do so via DBus, that would be OK too.

Hey Aaron, I am checking with the OS team whether it makes sense to run chronyc makestep through DBus until we get the fix out.

Chrony doesn’t have a dbus interface but you can use the systemd interface to restart the service which will trigger the initial synchronisation burst, like so:

DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket dbus-send --system --dest=org.freedesktop.systemd1 --type=method_call --print-reply /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager.RestartUnit string:"chronyd.service" string:"replace"

To access DBus from within a container you would need to follow the instructions here: https://www.balena.io/docs/learn/develop/runtime/#dbus-communication-with-host-os
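If the container's code is in Python, the same call could be wrapped roughly as below. This is only a sketch: it assumes dbus-send is installed in the container image and that the host DBus socket is exposed at the path used in the command above. The function names are made up for illustration.

```python
import os
import subprocess

# Host DBus socket path, copied from the dbus-send command above.
HOST_DBUS_SOCKET = "unix:path=/host/run/dbus/system_bus_socket"


def build_restart_command(unit="chronyd.service", mode="replace"):
    """Return the dbus-send argv that asks systemd to restart `unit`."""
    return [
        "dbus-send",
        "--system",
        "--dest=org.freedesktop.systemd1",
        "--type=method_call",
        "--print-reply",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager.RestartUnit",
        f"string:{unit}",
        f"string:{mode}",
    ]


def restart_chronyd(timeout=30):
    """Restart chronyd on the host; raises CalledProcessError on failure."""
    # Point dbus-send at the host bus instead of any bus inside the container.
    env = dict(os.environ, DBUS_SYSTEM_BUS_ADDRESS=HOST_DBUS_SOCKET)
    return subprocess.run(
        build_restart_command(),
        env=env,
        check=True,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

Calling restart_chronyd() once the application notices that connectivity has come back should then trigger the initial synchronisation burst described above.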

Points for creativity! We’ll give this a shot.

Hey gang,

Just wanted to see how things are going with the time sync improvements and update you on our progress tracking this down.

The customer that reported this has been hitting it a fair amount. They tend to leave their devices off for days at a time, and then start them up in an environment that does not have internet. Right now, the only way for them to resolve the issue is to physically power cycle the unit one or more times after they leave that environment and hope that it resolves itself, which is not a good solution for them (or anyone generally).

I dug into this issue a bit more this weekend. I do believe the chrony 4 hour issue is a problem. If they have the device up without internet for long enough, chrony won’t update the time for hours (not sure how long that wait is).
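On the "not sure how long that wait is" point, here is a rough sketch of the arithmetic. chrony's minpoll and maxpoll options are exponents of two, in seconds, so a poll value of 14 works out to about 4.5 hours, which would line up with the ~4 hour figure. That the OS config actually uses 14 is an assumption on my part and not confirmed in this thread; the helper names are made up.

```python
def poll_interval_seconds(poll_exponent):
    """chrony poll values are log2 of the interval in seconds (non-negative here)."""
    return 2 ** poll_exponent


def format_interval(seconds):
    """Render a whole number of seconds as hours, minutes and seconds."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


# Assumed value: 14 is a guess that matches the "~4 hours" seen in practice.
ASSUMED_POLL_EXPONENT = 14
print(format_interval(poll_interval_seconds(ASSUMED_POLL_EXPONENT)))  # 4h 33m 4s
```

If that assumption holds, a device that misses the initial burst could sit on a bad clock for up to roughly four and a half hours before the next poll.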

That said, I believe something else possibly more problematic is going on and this all may be related. I’ve been doing the following in an attempt to replicate the “chrony didn’t update the time” issue:

  1. Turn off the device and unplug ethernet
  2. Turn the device on
  3. In the host OS, use date -s to set the date backward by 2 months (to Aug 1)
    • I’m not sure what the actual minimum “offline” time is to trigger this issue so I’ve just been going way back
  4. Wait some amount of time - usually at least a minute, sometimes a few
  5. Plug in the ethernet

Doing this, I’ve found that chrony does sync the OS time within a couple of seconds, and that time does propagate to the containers as expected. Don’t know if I just need to wait a lot longer before plugging it in to cause chrony not to update. Despite that, however, about 1 in 5-10 times our nemo container gets in some weird state with the following symptoms:

  • It can’t seem to reach the internet, or at least its attempts to validate AWS credentials are failing repeatedly with EndpointConnectionError in Python (generated by the requests library, I believe caused by socket.gaierror; I haven’t had a chance to dig into this further to see which OS request/syscall is causing this and what the actual errno is)
    • When this happens, our code retries in a thread once every 5 seconds. We have that thread print the current OS time, so we know that chrony’s updated time is making it into the container but it isn’t resolving anything
  • Docker can’t open a shell into the nemo container: balena exec -it NEMO_ID bash is failing with:
    rpc error: code = 2 desc = "oci runtime error: exec failed: fork/exec /proc/self/exe: no such file or directory"
    
  • One time when I did this I hit the previously reported supervisor crash (https://forums.balena.io/t/unexpected-supervisor-crash-container-restart-on-network-change), but only once and I haven’t seen it since (yet)
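To pin down the actual error code behind the suspected socket.gaierror, the retry thread could log something like the following next to the OS time. It is a minimal sketch, not our real code: probe_dns and its resolver parameter are hypothetical names, and the resolver is injectable only so the function can be exercised without a network.

```python
import socket
from datetime import datetime, timezone


def probe_dns(host, port=443, resolver=socket.getaddrinfo):
    """Try to resolve `host` and report the gaierror code if it fails."""
    now = datetime.now(timezone.utc).isoformat()
    try:
        resolver(host, port)
    except socket.gaierror as exc:
        # exc.errno holds the EAI_* code (e.g. socket.EAI_AGAIN),
        # exc.strerror the resolver's human-readable message.
        return {"time": now, "host": host, "ok": False,
                "code": exc.errno, "reason": exc.strerror}
    return {"time": now, "host": host, "ok": True,
            "code": None, "reason": None}
```

Printing probe_dns(...) for whichever endpoint the credential check talks to, once per retry, would show whether name resolution itself is failing inside the container and with which EAI_* code, while the timestamp confirms the clock has already been corrected.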

I haven’t figured out what combination of waiting, trying again, etc. is actually causing it. Unfortunately, since the date is set backwards, the journal is also a little tricky to read. I’ve tried clearing the journal just before starting this process, which helps but I still haven’t found anything obvious.

When this happens, the supervisor and all of the other services seem to be running just fine. Only nemo has this issue for some reason. Restarting nemo from the dashboard resolves the issue, but this is not something a customer can do.

I tried googling as much as I could for this and the best I could find was the following ticket: https://github.com/moby/moby/issues/25381. In it, people seemed to have the same symptom and their recommended resolution was to restart Docker (also not a viable solution for us). It doesn’t really look like anyone in it actually figured out what was happening though.

We could really use help with this one. Don’t know if we want to treat the chrony thing and this forking issue as separate; not clear if they are related or not, but they certainly exhibit the same symptoms from our customers’ perspective. Please let me know if you’d like me to open another thread.

Any advice would be much appreciated. I’m also happy to carve out time to work in real time with someone trying to replicate this since it requires physical access to the unit.

Thanks,

Adam

Hello again,

Apologies for the delay in getting back to you. We just released a PR that improves time handling on startup: https://github.com/balena-os/meta-balena/pull/2014

I am trying to track down the issue for improving chrony synchronisation on boot. We will send it to you soon.

Hi @ljewalsh,

That PR looks like it might help a little, though it’s still not clear to me what exactly is causing this problem in the first place and I don’t imagine this will resolve the whole issue. Any idea when it’s going to be mainlined for DART MX8M Mini? The last host OS update for it was 2.50.1+rev1 from May, which is using meta-balena 2.47.1 from February.

Cheers,

Adam

Hello @adamshapiro0
I have opened a specific issue for this so that you can monitor progress: https://github.com/balena-os/meta-balena/issues/2044