We have had numerous devices this morning (previously working fine) for which we have been unable to access the public device url. Is there an unreported problem or outage related to the public device urls? Or could something else be causing this problem?
Public device urls are a critical component of our operations, and I’d greatly appreciate any help resolving this.
Hi, we have no reported outages with our Cloud Link service today (which the public URL feature depends on).
Are you still experiencing the issue? Can you access the device via the dashboard terminal? What is the status of these devices?
We were still experiencing the issue this morning, but it has cleared up on at least some of the devices I’ve tested in the last few minutes. This is occurring on devices that are ‘Online’, and, no, I cannot access the device via the dashboard terminal when the public_device_url is not accessible.
I’d appreciate any insight into what caused this and how to prevent it in the future, or resolve it if it does recur.
As my colleague Felipe (pipex) shared earlier, we did not have a VPN service (i.e. ‘cloudlink’) outage in that time window.
Can you share answers to the following questions, please? They will help us narrow down the problem.
- How many devices encountered this issue?
- Did they face the issue at around the same time?
- Is this the first time you experienced this?
- Did all the devices eventually recover? How long did it take? Were reboots needed?
From what you describe, this does not seem like a device-side (i.e. balenaOS) issue. As someone who is fairly familiar with this part of balenaCloud, my guess is that the devices got disconnected during a scale-down event of the cloudlink service (in the balenaCloud backend). We are working on making the scaling seamless from the devices’ point of view.
Rebooting the device should definitely resolve this issue. In fact, just restarting the openvpn service on the device will resolve the problem. However, without VPN connectivity, logging into the device may not be feasible (unless you have physical access or another way to ssh into the device).
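For anyone who does have a shell on the device's host OS, the restart mentioned above could be scripted roughly as follows. This is only a sketch: it assumes the VPN client runs as a systemd unit named `openvpn` and that `systemctl` is available on the host, neither of which is confirmed in this thread. The `restart_vpn` helper and its `run` parameter are names made up for illustration.

```python
import subprocess

# Assumption: the VPN client is a systemd unit called "openvpn" on the host OS.
VPN_UNIT = "openvpn"


def restart_vpn(run=subprocess.run, unit=VPN_UNIT):
    """Ask systemd to restart the VPN unit; return True if systemctl exited with 0."""
    cmd = ["systemctl", "restart", unit]
    # check=False so a non-zero exit is reported as False instead of raising.
    result = run(cmd, check=False)
    return result.returncode == 0
```

Calling `restart_vpn()` on the device would run `systemctl restart openvpn`. The `run` argument exists only so the function can be exercised without touching a real service.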
Hope that helps.
Thanks and regards,
Thanks for your help figuring this out.
Our main fleet had about 200 devices at the time of the issue. We did not attempt to access all of their public device urls, but more than 75% of those we did attempt to access were experiencing this issue.
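In case it helps others measure the same thing, here is a rough sketch of how the share of unreachable public device URLs could be sampled from outside the fleet. It is not an official balena tool. The helper names are hypothetical, and treating any 5xx response or connection failure as "unreachable" is an assumption about how an affected URL behaves.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # 4xx still means something answered; 5xx is counted as unreachable.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return False


def unreachable_share(urls, check=is_reachable):
    """Return (failed_urls, fraction_failed) for the sampled URLs."""
    failed = [u for u in urls if not check(u)]
    share = len(failed) / len(urls) if urls else 0.0
    return failed, share
```

Passing a list of public device URLs from your own inventory to `unreachable_share` gives the failing URLs and the fraction that failed, which can be logged over time to see when an incident started and cleared.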
Yes, this definitely occurred around the same time, and it was the first time we experienced this issue.
Yes, all the devices recovered. It took a little over 24 hours. Reboots were not required (and, as you noted, without physical access to the devices, rebooting while experiencing connectivity issues is not always feasible). It is definitely helpful to know that rebooting will resolve the problem if we encounter it in the future, as it will allow us to resolve the situation for those devices we do have physical access to.
I hope this information helps you narrow down the problem.
Thanks for taking the time to document this, Kathryn. We’ve captured the notes in our patterns. Please let us know if you experience any further issues.