Polling the VPN connection for status

Hey Guys,

We ran into a strange connectivity issue this Friday.

We had a support tech out with a portable access point so we could get in and debug one of our locations.
I was able to get most of the devices online, but a random 2 or 3 out of the group would not come up. @CameronDiver was helping me out, but we couldn’t figure out the root cause.

Some of the devices would only be connected to the temporary debugging network, but could see the production network.
A simple:
nmcli c up client_connection or systemctl restart NetworkManager.service
brought that connection right back.

However, later I noticed I could ping some of the offline devices on IPs listed under them from when they were online. Using the hostvia tool, I was able to ssh into two of them through another device.

That revealed a strange status: everything but the VPN was working fine… I could ping 8.8.8.8 and other servers. I could even ping what I think is a balena VPN server, 54.4.252.97, via this route (from ip route s):
52.4.252.97 dev resin-vpn scope link src 10.240.14.47
I verified this with traceroute, it was a direct connection.
However, the device still showed up as offline.
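For anyone wanting to script the same route check, here is a minimal sketch that picks out which destinations in `ip route show` output go over the VPN interface. The function name `destinations_via` and the `default via ...` sample line are made up for illustration; only the `resin-vpn` line comes from the output above. Note that a healthy-looking route here did not mean the backend saw the device as online.

```python
def destinations_via(ip_route_output, dev="resin-vpn"):
    """Return the destinations in `ip route show` output routed over `dev`."""
    found = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        # Each route line starts with the destination; "dev <name>" follows.
        if "dev" in fields:
            i = fields.index("dev")
            if i + 1 < len(fields) and fields[i + 1] == dev:
                found.append(fields[0])
    return found


# Hypothetical sample: the first line is invented, the second is from the post.
sample = """default via 192.168.1.1 dev eth0
52.4.252.97 dev resin-vpn scope link src 10.240.14.47"""
print(destinations_via(sample))  # ['52.4.252.97']
```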
Here are the OpenVPN logs.

I was able to get the device back online with:
systemctl status openvpn

I can do this from within my container, so my question, other than why this is happening, is: how can I detect this happening and then reset the VPN from within the device? I know how to do the second part via D-Bus, but my best guess for the first part is this endpoint, which I can’t seem to hit when I am not in local mode.
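For reference, the D-Bus part could look roughly like the sketch below. It asks systemd on the host to restart the unit through the standard `org.freedesktop.systemd1.Manager.RestartUnit` method by shelling out to `dbus-send`. Assumptions: `dbus-send` is installed in the container, the host bus socket is mounted at `/host/run/dbus/system_bus_socket` (on balena that needs the `io.balena.features.dbus` label), and the unit is called `openvpn.service`. The helper names are hypothetical.

```python
import os
import subprocess

# Assumption: the host's system bus socket is exposed at this path.
HOST_BUS = "unix:path=/host/run/dbus/system_bus_socket"


def restart_unit_command(unit="openvpn.service"):
    """Build the dbus-send argv that asks systemd to restart `unit`."""
    return [
        "dbus-send",
        "--system",
        "--print-reply",
        "--dest=org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager.RestartUnit",
        f"string:{unit}",
        "string:replace",  # job mode: replace any conflicting queued job
    ]


def restart_unit(unit="openvpn.service"):
    """Run the restart; raises CalledProcessError if dbus-send fails."""
    env = dict(os.environ)
    # Keep an address that is already set, otherwise fall back to HOST_BUS.
    env.setdefault("DBUS_SYSTEM_BUS_ADDRESS", HOST_BUS)
    subprocess.run(restart_unit_command(unit), check=True, env=env)
```

Calling `restart_unit()` only makes sense once some other check has decided the VPN is stuck. Restarting on every poll would just cause extra disconnects.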

Thanks
-Thomas

Hi @taclog thanks for the detailed report! Could you let us know what OS version and devices you are using?

Hey @lucianbuzzo,

This is from the Odroid XU4 running 2.38.3 rev3, and sometimes rev2.

-Thomas

Thanks for the report, we are passing it on to our networking/VPN expert, who’s looking into this, and we will get back to you soon with more information.

If you could enable persistent logging on these devices, that might help, and try to catch the devices in the act (of not being connected, but still reachable over hostvia). Would that work for you? We are looking into a couple of other reports, and digging into what might be happening.

Also, for the “how can I detect this happening” part, I heard a recommendation. It’s a bit hacky, but if you query the device’s own status on the API from the device itself, you can see whether the backend marks the device “online” (i.e. connected to the VPN) or not: https://www.balena.io/docs/reference/api/resources/device/
It’s likely not recommended, but could work.
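A rough sketch of that API check, under several assumptions: the container has `BALENA_API_URL`, `BALENA_API_KEY` and `BALENA_DEVICE_UUID` in its environment (the first two need the `io.balena.features.balena-api` label), the device resource exposes an `is_online` field, and the API version in the path is `v6`, which may need adjusting for your setup. The function names are made up for illustration.

```python
import json
import os
import urllib.parse
import urllib.request


def device_status_url(api_url, uuid, version="v6"):
    """Build an OData query that selects only is_online for one device."""
    query = urllib.parse.urlencode(
        {"$filter": f"uuid eq '{uuid}'", "$select": "is_online"},
        quote_via=urllib.parse.quote,
        safe="$'",
    )
    return f"{api_url.rstrip('/')}/{version}/device?{query}"


def parse_is_online(body):
    """Return True/False from the response, or None if no device matched."""
    rows = json.loads(body).get("d", [])
    return bool(rows[0].get("is_online")) if rows else None


def backend_says_online(timeout=10):
    """Ask the API whether the backend currently marks this device online."""
    url = device_status_url(
        os.environ["BALENA_API_URL"], os.environ["BALENA_DEVICE_UUID"]
    )
    req = urllib.request.Request(
        url, headers={"Authorization": "Bearer " + os.environ["BALENA_API_KEY"]}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_is_online(resp.read())
```

If you poll this, it is probably worth requiring several consecutive `False` results before resetting the VPN, since a single stale answer from the backend would otherwise trigger a needless restart.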

But we’ll get back to you with more information soon!

Hi @taclog, just an update about the VPN issue you were facing. I just talked to our VPN engineer, and the issue that we had with the VPN for the past couple of weeks should be resolved now, so you shouldn’t be experiencing this issue anymore. Let us know if there is anything else we can help with.

Thanks guys,

@imrehg
I found that we don’t really need to know this now that everything is working great again.

@sradevski
Give Will my best. Everything has been great on our end (minus the API going down during an install last night, but that was resolved quickly enough).

Overall the problem with our devices isn’t a problem with balena. :slight_smile:

-Thomas

Hi, a heads up that we have finished our investigation and released a post-mortem for the long-running VPN incident, in which online devices were incorrectly marked offline. You can read it here: https://status.balena.io/incidents/xg4n3sh37qnt and please don’t hesitate to let us know if you have any further questions or issues regarding this! Thank you!