Recently our RPi4 devices running BalenaOS 2.105.21 stopped reporting the correct day/time. The online docs suggest that Balena hosts the time service, but I wonder if this is only for cloud customers. Can someone confirm where openBalena gets the correct time and why devices may be failing to get the proper time?
Hey @brownster!
I think NTP is managed in the same way for both openBalena and balenaCloud, pointing to our own NTP servers (docs here). You might want to check that UDP port 123 is open to ensure it’s able to connect. But you can also choose to configure your own NTP servers in the config.json if you prefer (docs here).
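To check whether UDP port 123 is reachable from a given network, a minimal SNTP query can help. The sketch below is only an illustration and is not part of balenaOS: the function names are hypothetical, and the server passed in should be whichever one your devices are configured to use. It sends a single client request and converts the transmit timestamp in the reply to Unix time.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_sntp_request() -> bytes:
    """Return a 48-byte SNTP client request (LI=0, version 3, mode 3)."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(packet: bytes) -> float:
    """Extract the server transmit timestamp (bytes 40-47) as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("NTP reply is shorter than 48 bytes")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def query_ntp(host: str, port: int = 123, timeout: float = 3.0) -> float:
    """Send one request over UDP and return the server time as Unix seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (host, port))
        packet, _ = sock.recvfrom(512)
    return parse_transmit_time(packet)
```

Calling `query_ntp("pool.ntp.org")` (a hostname chosen only as an example) should return a value close to `time.time()`. A timeout suggests that UDP 123 is blocked or the server is unreachable from that network.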
Let me know if that’s helpful; if not, tell me more about the situation (are you seeing errors? is the time just off? what do you see coming from chronyd?) and I’ll reach out to our OS engineers to find out a bit more.
Hello @the-real-kenna,
We would like to take you up on your offer to check with the devs.
We have confirmed that the HostOS for OpenBalena is dependent on some NTP service that lives within our hosted OpenBalena services. There is no named service that makes it clear where such a service would live. (see pic below) We think it might be openbalena-haproxy since we’ve seen this service hang and then cause problems.
The problems are significant and would be considered an outage. If the openBalena backend is struggling, the local time snaps to an old date and time that is way off, killing HTTPS connections. Our assumption was that the backend is for deployments only and not really needed for operation. Not true, unfortunately.
Questions:
- Is there a way to change the NTP servers on our current fleet?
- What other services are the devices dependent on to run properly? As we end-of-life this device we were hoping to bring the Balena backend down and allow customers to continue to run the last state at their own risk as long as they want.
Reminder that we standardized on BalenaOS 2.105.21
Hey @brownster,
The team shared some details with me, so I’ll avoid trying to rephrase and just give it to you verbatim.
Question 1: Is there a way to change the NTP servers on our current fleet?
“openBalena” is not an NTP time source, nor a client. We can’t comment on whatever (if anything) is providing NTP to the network where it is running. Devices (clients of openBalena) running balenaOS will get their NTP configuration as per the docs we sent (unless they’ve changed that via custom config.json params).
Also, there isn’t a way to change the NTP remotely for the whole fleet; it’s the same as balenaCloud. You’ll need to do this on the config.json per device via hostOS terminal access and I think reboot the device.
Question 2: What other services are the devices dependent on to run properly? As we end-of-life this device we were hoping to bring the Balena backend down and allow customers to continue to run the last state at their own risk as long as they want.
I assume they are trying to understand if there is anything that an operational device needs from the backend to stay operational? The answer is no, they should be able to take the backend down and the devices will just continue to run as usual. However, they won’t have remote access, etc., so they might want to put SSH keys on the device for their users to SSH in locally.
Hopefully that’s the insight you needed; let us know.
@the-real-kenna - This is not what we are seeing. We can reproduce this issue consistently. Keep in mind we are not talking about devices that exist behind one firewall or on one network with issues. Our devices are deployed in hundreds of very different customer environments, yet all behave the same way when the openBalena backend goes down. (Loses local time causing a cascade of other issues)
Although not easy to demo, we can demo this for you.
Can you share a bit more about what happens when you bring the backend down? What other services are running that they might be dependent on?
The screenshot you shared of your Kubernetes dashboard has some services that aren’t built by us, so I’m wondering if those are contributing to the issue.
@brownster one other piece of information that would be helpful (in addition to the OS version that you already provided) is which version of each of the components in the openbalena stack you are running (i.e. open-balena-api, open-balena-vpn, etc). We should then look at the release timing of that vs your host os version.
In our experience, the openbalena stack is compatible with balenaos versions at or prior to when it was released (because balena cloud needs to support legacy devices and not force upgrades of host os’s), but not the other way around - and new features / services could be (and regularly are) introduced in newer versions of balena os that require newer versions of openbalena. This isn’t a problem for balena cloud because balena regularly updates the versions of the stack they run in balena cloud, but openbalena customers need to be aware of this dynamic.
We have had to update our openbalena stack many times for this reason, and we lock the host os versions that devices are running to known compatible versions, so that we can coordinate updates of them with updates to openbalena.
Let me provide a little more info about what it seems like we are seeing. It seems that something is happening when our openbalena server is down that is preventing it from properly sync’ing with ntp servers. Not that the openbalena server itself is providing the ntp services.
Is there documentation somewhere about when/how the balena devices sync with ntp that might provide a bit more info about what could be happening?
Hi @jwdev,
Thanks for the additional insight.
The NTP settings are configured on the devices themselves, as part of the config.json, so it’s not about an actual balena service, but something about communication to the devices that’s problematic. (It could be a balena service that’s the issue, but the point is that what’s breaking, it seems, is the communication between the devices and their time server.)
I’m not sure if you’ve found this document on our site, it’s a bit buried, but it might help you get to the root of things: Time management - Balena Documentation.
Are you able to share more about how your configuration works? i.e. what is between these devices and their time server, what of your services that is being stopped could break communication between the two, etc.?
If it were me, I would probably do a network trace as well. See what the trace looks like when the time server is running properly, and see what might be failing or stopping when you bring down the openBalena stack.
If you’re able to answer @drcnyc’s question about which version of each component you are running, that would be helpful too. It may be that there’s something about certain versions that would give us a clue.
Looking forward to finding the answer to this one… it’s mysterious and I want to know, lol.
Hey @jwdev,
Just wanted to check in with you. I’m checking Forums each day to make sure I don’t miss your reply. Quick list of open questions:
- Are you able to share more about your setup / configuration (particularly what services, firewalls, etc. might be between your devices and the time server)?
- What were the results of a network trace? Are you able to notice any differences between a device working as expected when the openBalena services are running and when they are shut down?
- Can you share the version of each openBalena component you’re running?
Thank you!
Sorry for the lack of response here. I was able to figure out the issue and it is the same issue described in this post: NTP (chrony) not starting on newly provisioned device - #10 by nebbles.
By default, it looks like the openbalena backend does indeed need to be up and running in order for ntp to work correctly. On our devices we saw the same behavior as in that post, but the error occurred because our openbalena server was down. Basically, the timesync-https service just keeps making calls to “[apiEndpoint]/connectivity-check” forever if it can’t connect, and chronyd.service waits for that service to complete/exit before it will start up. We aren’t yet certain how we will work around this behavior. It would be nice to just be able to disable the timesync-https service, but I don’t believe that is possible?
Hey @jwdev
I’m (unfortunately) familiar with this particular part of BalenaOS now, as you’ve seen. The timesync-https service does make calls to the connectivity-check endpoint, but this is configurable in the config.json file (see below); you just change the os.network.connectivity.uri field.
"os": {
  "network": {
    "connectivity": {
      "interval": "0",
      "uri": "https://api.balena-cloud.com/connectivity-check"
    }
  }
},
timesync-https is waiting for a simple response from an endpoint. It’s so simple, in fact, that we created our own one on our server. All that’s needed is an endpoint that responds 204 (No Content) to a GET request and includes a Date header with the current datetime. This allows timesync-https to complete and unblock the chronyd ntp client to do its thing.
We had to do it for our own reasons (we’re not using OpenBalena) but this might help for you too. Note you don’t even need to have a server running, you could use https://httpstat.us/204 if you’re willing to trust it to return the correct time on the Date header, and be available!
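For anyone who wants to host such an endpoint themselves, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the implementation described above: the handler and helper names are hypothetical, it answers every GET path the same way, and it serves plain HTTP, so TLS would need to be terminated in front of it (for example by a reverse proxy) if the device expects an https URI.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class ConnectivityCheckHandler(BaseHTTPRequestHandler):
    """Answer every GET with 204 No Content and a Date header."""

    def do_GET(self):
        # send_response() adds the Server and Date headers automatically;
        # the Date value comes from this machine's clock, so it must be correct.
        self.send_response(204)
        self.end_headers()

    def log_message(self, format, *args):
        # Keep the sketch quiet; remove this override to get request logs.
        pass


def make_server(host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Create the server; port 0 asks the OS for any free port."""
    return HTTPServer((host, port), ConnectivityCheckHandler)
```

Running `make_server("0.0.0.0", 8080).serve_forever()` would expose it on port 8080 (the address and port are arbitrary choices). The host serving it needs a reliable clock of its own, since devices would take their initial time from its Date header.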
Hope this helps
Thank you @jwdev and @nebbles - So when @the-real-kenna reported that the Balena devs commented “they should be able to take the backend down and the devices will just continue to run as usual” this is not true. There is a connectivity check that points to the openBalena backend which can cause serious problems.
Thanks again, all. Now to figure out the best way to update the config.json file on several thousand openBalena devices!
@brownster doesn’t sound like fun, sorry to hear that. I know there is a tool called configizer that is supposed to help make mass config.json updates, but I haven’t tried it, and you might be better off building a tailored shell script to handle it. Just a thought, but you might want to explore setting up a forwarding DNS entry for the endpoint that the devices are currently pointing to, which redirects to one of the public servers noted above.
Hi, probably mixing two different things.
One is the initial HTTP time sync check that balenaOS uses to update the system time before trying to use certificates to authenticate. This is mostly so that devices without an RTC can authenticate with the cloud early on, and for use cases like captive portals where NTP does not initially run or is blocked by a firewall.
The time sync uses the connectivity check URL, which by default is the cloud environment but can be configured to another endpoint that responds in the same way.
The connectivity URL can also be set to “null” to disable both the time sync and connectivity checks.
So devices can continue running when the backend is down but they need to be configured for that event.
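Before repointing devices at a replacement endpoint, it may be worth checking that it responds the way described earlier in the thread: a 204 status plus a Date header carrying the current time. The sketch below shows one way to validate a response that has already been fetched. The function names and the 5-second tolerance are assumptions made for illustration, and the exact criteria that timesync-https applies are not spelled out in this thread.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def date_header_skew_seconds(date_header: str, now: datetime) -> float:
    """Return (server time - now) in seconds for an HTTP Date header value."""
    server_time = parsedate_to_datetime(date_header)
    if server_time.tzinfo is None:
        # A "-0000" zone parses as naive; HTTP dates are always UTC.
        server_time = server_time.replace(tzinfo=timezone.utc)
    return (server_time - now).total_seconds()


def looks_like_valid_check(status, date_header, now, max_skew=5.0) -> bool:
    """True if the response is a 204 with a Date header close to `now`."""
    if status != 204 or not date_header:
        return False
    try:
        skew = date_header_skew_seconds(date_header, now)
    except (TypeError, ValueError):
        # Unparseable Date header (the exception type varies by Python version).
        return False
    return abs(skew) <= max_skew
```

Pass `datetime.now(timezone.utc)` from a machine with a trusted clock as `now`. A large skew would mean devices using that endpoint start out with the wrong time.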
@alexgg we are also facing this issue on thousands of devices.
Are you saying that what @nebbles has said above is correct, but you can just change the uri parameter to be null instead of the actual uri?
Also, how long without the openBalena backend will the devices work before the NTP issue starts to occur?
We took it down in August and already it’s happened.
It is a shame that this isn’t documented somewhere very obvious in the openBalena docs. Is 1 to 2 months the approx time?
Hey Aaron, if you refer to the initial time sync, it is documented in meta-balena.
The connectivity check URL is used for both the initial time sync and periodic connectivity checks. It defaults to the initial cloud environment but it can be changed in config.json, and can also be set to null if no time sync/connectivity check is desired.
@alexgg The behaviour we have seen is that if you take down the backend, it loses the time on the SBC eventually and you then at some point get TLS errors due to date/time issues.
What is not clear is why it needs the backend to get a time sync from NTP?
Hey Aaron, it doesn’t need the backend to sync NTP time. However, if the connectivity URL is left as default (it defaults to $API_ENDPOINT/connectivity-check), the device is configured to use it to check connectivity, and if the endpoint is not reachable it assumes there is no network connectivity and no network services, including NTP, are started.
The default configuration is optimal for balenaCloud connected devices, and it allows connection attempts to be delayed until the network is available, for example in cases where captive portals or slow cellular devices are used and connectivity is delayed.
If your devices do not have a backend, the correct configuration for them is to set the connectivity URL to null to specify that it should not be used for connectivity checks.
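As a rough illustration of that change, the sketch below updates the os.network.connectivity.uri field of an already parsed config.json. The helper names are hypothetical, and the path to config.json is left to the caller. The thread does not say whether the value should be JSON null or the string "null", so the value is a parameter and that choice should be confirmed against the meta-balena documentation. How the script gets onto each device, and whether a reboot is needed afterwards, is outside its scope.

```python
import copy
import json
import os


def with_connectivity_uri(config: dict, uri) -> dict:
    """Return a copy of `config` with os.network.connectivity.uri set to `uri`."""
    updated = copy.deepcopy(config)
    connectivity = (
        updated.setdefault("os", {})
        .setdefault("network", {})
        .setdefault("connectivity", {})
    )
    connectivity["uri"] = uri
    return updated


def rewrite_config_file(path: str, uri) -> None:
    """Rewrite the JSON file at `path` via a temp file and an atomic rename."""
    with open(path) as f:
        config = json.load(f)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(with_connectivity_uri(config, uri), f, indent=2)
    os.replace(tmp_path, path)
```

Passing `None` as `uri` writes JSON null; passing the string `"null"` writes it quoted. All other keys in the file are left untouched.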