Chronyc config is bad if device comes online without internet

Hey gang,

Just wanted to see how things are going with the time sync improvements and update you on our progress tracking this down.

The customer that reported this has been hitting it a fair amount. They tend to leave their devices off for days at a time, and then start them up in an environment that does not have internet. Right now, the only way for them to resolve the issue is to physically power cycle the unit one or more times after they leave that environment and hope that it resolves itself, which is not a good solution for them (or anyone generally).

I dug into this issue a bit more this weekend. I do believe the chrony 4-hour issue is a real problem: if the device is up without internet for long enough, chrony won’t update the time for hours (I’m not sure exactly how long that wait is).

That said, I believe something else, possibly more problematic, is going on, and the two may be related. I’ve been doing the following in an attempt to replicate the “chrony didn’t update the time” issue:

  1. Turn off the device and unplug ethernet
  2. Turn the device on
  3. In the host OS, use date -s to set the date backward by 2 months (to Aug 1)
    • I’m not sure what the actual minimum “offline” time is to trigger this issue so I’ve just been going way back
  4. Wait some amount of time - usually at least a minute, sometimes a few
  5. Plug in the ethernet

Doing this, I’ve found that chrony does sync the OS time within a couple of seconds of plugging in, and that time does propagate to the containers as expected. I don’t know if I just need to wait a lot longer before plugging it in to cause chrony not to update. Even so, roughly 1 time in every 5 to 10 attempts, our nemo container gets into a weird state with the following symptoms:

  • It can’t seem to reach the internet, or at least its attempts to validate AWS credentials are failing repeatedly with EndpointConnectionError in Python (generated by the requests library, I believe, and caused by socket.gaierror; I haven’t had a chance to dig further to see which OS request/syscall is failing and what the actual errno is)
    • When this happens, our code retries in a thread once every 5 seconds. We have that thread print the current OS time, so we know that chrony’s updated time is making it into the container, but the corrected time doesn’t make the failures go away
  • Docker can’t open a shell into the nemo container: balena exec -it NEMO_ID bash is failing with:
    rpc error: code = 2 desc = "oci runtime error: exec failed: fork/exec /proc/self/exe: no such file or directory"
    
  • One time when I did this I hit the previously reported supervisor crash (https://forums.balena.io/t/unexpected-supervisor-crash-container-restart-on-network-change), but only once and I haven’t seen it since (yet)
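
To pin down the actual error code behind the socket.gaierror, something like the following could run inside the retry thread. This is only a rough sketch, not our real code: probe_dns and EAI_NAMES are made-up names, and the host you pass in is whatever endpoint the credential check talks to (I haven’t confirmed which one that is).

```python
import socket
import time

# Map numeric getaddrinfo error codes to their symbolic EAI_* names.
# A list is used per code because some platforms alias two names to one value.
EAI_NAMES = {}
for _name in dir(socket):
    if _name.startswith("EAI_"):
        EAI_NAMES.setdefault(getattr(socket, _name), []).append(_name)


def probe_dns(host, port=443, resolve=socket.getaddrinfo):
    """Try to resolve `host` once and report what happened.

    `resolve` is injectable so the function can be exercised without a network.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")  # wall-clock time of the attempt
    try:
        results = resolve(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # gaierror carries (code, message); the code is an EAI_* value,
        # which is distinct from the errno values used by ordinary syscalls.
        return {
            "time": stamp,
            "ok": False,
            "code": exc.errno,
            "names": EAI_NAMES.get(exc.errno, []),
            "message": exc.strerror,
        }
    # Each result is (family, type, proto, canonname, sockaddr).
    addresses = sorted({info[4][0] for info in results})
    return {"time": stamp, "ok": True, "addresses": addresses}
```

Printing the result of probe_dns(host) next to the existing time print would at least tell us whether the failure is a temporary resolver failure (EAI_AGAIN) or something else.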

I haven’t figured out what combination of waiting, trying again, etc. is actually causing it. Unfortunately, since the date is set backwards, the journal is also a little tricky to read. I’ve tried clearing the journal just before starting this process, which helps but I still haven’t found anything obvious.
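
One thing that might make the logs easier to line up is recording when the wall clock jumps, by comparing it against the monotonic clock (which isn’t moved by date -s or by a chrony step). A minimal sketch, assuming it would be called from our existing 5-second retry loop; ClockStepDetector and the 2-second threshold are hypothetical:

```python
import time


class ClockStepDetector:
    """Flag jumps in wall-clock time by comparing it against the monotonic clock."""

    def __init__(self, threshold=2.0, wall=time.time, mono=time.monotonic):
        # The clock sources are injectable so the detector can be tested
        # without actually changing the system time.
        self._wall = wall
        self._mono = mono
        self.threshold = threshold
        self._last_wall = wall()
        self._last_mono = mono()

    def check(self):
        """Return the wall-clock step (seconds) since the last call, or 0.0."""
        now_wall, now_mono = self._wall(), self._mono()
        # If the wall clock was not stepped, both clocks advance by about
        # the same amount and the difference stays near zero.
        step = (now_wall - self._last_wall) - (now_mono - self._last_mono)
        self._last_wall, self._last_mono = now_wall, now_mono
        return step if abs(step) >= self.threshold else 0.0
```

A nonzero return from check() marks the point in the log where the time was corrected, so entries before and after it can be ordered even though their timestamps are months apart.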

When this happens, the supervisor and all of the other services seem to be running just fine. Only nemo has this issue for some reason. Restarting nemo from the dashboard resolves the issue, but this is not something a customer can do.

I tried googling as much as I could for this and the best I could find was the following ticket: https://github.com/moby/moby/issues/25381. In it, people seemed to have the same symptom and their recommended resolution was to restart Docker (also not a viable solution for us). It doesn’t really look like anyone in it actually figured out what was happening though.

We could really use help with this one. Don’t know if we want to treat the chrony thing and this forking issue as separate; not clear if they are related or not, but they certainly exhibit the same symptoms from our customers’ perspective. Please let me know if you’d like me to open another thread.

Any advice would be much appreciated. I’m also happy to carve out time to work in real time with someone trying to replicate this since it requires physical access to the unit.

Thanks,

Adam