Hey gang,
Just wanted to see how things are going with the time sync improvements and update you on our progress tracking this down.
The customer that reported this has been hitting it a fair amount. They tend to leave their devices off for days at a time, and then start them up in an environment that does not have internet. Right now, the only way for them to resolve the issue is to physically power cycle the unit one or more times after they leave that environment and hope that it resolves itself, which is not a good solution for them (or anyone generally).
I dug into this issue a bit more this weekend. I do believe the chrony 4 hour issue is a problem. If they have the device up without internet for long enough, chrony won’t update the time for hours (not sure how long that wait is).
That said, I believe something else possibly more problematic is going on and this all may be related. I’ve been doing the following in an attempt to replicate the “chrony didn’t update the time” issue:
- Turn off the device and unplug ethernet
- Turn the device on
- In the host OS, use date -s to set the date backward by 2 months (to Aug 1)
  - I’m not sure what the actual minimum “offline” time is to trigger this issue so I’ve just been going way back
- Wait some amount of time - usually at least a minute, sometimes a few
- Plug in the ethernet
Doing this, I’ve found that chrony does sync the OS time within a couple of seconds, and that time does propagate to the containers as expected. Don’t know if I just need to wait a lot longer before plugging it in to cause chrony not to update. Despite that, however, about 1 in 5-10 times our nemo container gets in some weird state with the following symptoms:
I haven’t figured out what combination of waiting, trying again, etc. is actually causing it. Unfortunately, since the date is set backwards, the journal is also a little tricky to read. I’ve tried clearing the journal just before starting this process, which helps but I still haven’t found anything obvious.
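One thing that might make the timeline easier to reconstruct is to log clock jumps against the monotonic clock, which keeps counting forward when the wall clock is stepped by date -s or by a time sync. Below is a minimal sketch of that idea, assuming Python 3 is available wherever it runs (for example in a test container). The names clock_step and watch are hypothetical helpers written for this post, not part of chrony, Docker, or the supervisor.

```python
import time


def clock_step(prev_wall, prev_mono, wall, mono, threshold=5.0):
    """Return the wall-clock step in seconds, or 0.0 if it is below threshold.

    Between two samples, wall time and monotonic time should advance by
    roughly the same amount. A large difference means the wall clock was
    stepped (manually, or by a time sync).
    """
    step = (wall - prev_wall) - (mono - prev_mono)
    return step if abs(step) >= threshold else 0.0


def watch(interval=1.0, threshold=5.0):
    """Print one line per detected wall-clock jump; runs until interrupted."""
    prev_wall, prev_mono = time.time(), time.monotonic()
    while True:
        time.sleep(interval)
        wall, mono = time.time(), time.monotonic()
        step = clock_step(prev_wall, prev_mono, wall, mono, threshold)
        if step:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(wall))
            # The monotonic value orders events correctly even when the
            # wall clock has gone backwards.
            print(f"mono={mono:.1f}s wall={stamp}Z step={step:+.1f}s", flush=True)
        prev_wall, prev_mono = wall, mono
```

Calling watch() before starting the steps above and leaving it running should print one line when the date is set back and another when chrony corrects it. The monotonic values give the elapsed time between the two, regardless of how the journal timestamps look.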
When this happens, the supervisor and all of the other services seem to be running just fine. Only nemo has this issue for some reason. Restarting nemo from the dashboard resolves the issue, but this is not something a customer can do.
I tried googling as much as I could for this and the best I could find was the following ticket: https://github.com/moby/moby/issues/25381. In it, people seemed to have the same symptom and their recommended resolution was to restart Docker (also not a viable solution for us). It doesn’t really look like anyone in it actually figured out what was happening though.
We could really use help with this one. Don’t know if we want to treat the chrony thing and this forking issue as separate; not clear if they are related or not, but they certainly exhibit the same symptoms from our customers’ perspective. Please let me know if you’d like me to open another thread.
Any advice would be much appreciated. I’m also happy to carve out time to work in real time with someone trying to replicate this since it requires physical access to the unit.
Thanks,
Adam