Unable to connect to devices

Hello,
This morning I have been trying, unsuccessfully, to connect to any of the devices in our fleet. I have tried both the balena-cli and the web app, and both seem to hang. The devices show as “online” and appear to be sending logs, but I have not been able to connect. status.balena.io indicates all systems are functioning. Any thoughts?

Hey, when you say connect to the devices, what exactly do you mean? Could you link to one of these devices and enable support access please?

I am unable to connect to the hostOS through either balena ssh or through the web console.
I just spun up a new device for you to check and enabled support access, what is the best way to link?

That’s very strange, I’m not sure why that would be the case.

The url of the device status page on the dashboard is enough.

https://dashboard.balena-cloud.com/devices/956462cf19d77d87f5f6d214f30b224d/summary

thank you

I can definitely reproduce this. It’s strange: it’s as if the SSH traffic itself is being blocked, which is consistent with the communication between the supervisor and API still working.

Has something changed with the network that these devices are part of recently? Without being able to open a shell on the device, it’s fairly difficult for me to do any debugging.

You mentioned that you just brought this device up for me to check? That certainly implies it’s something to do with the network that the devices are part of. I’ve checked some of my devices and they all have no problem accessing the host OS implying it’s not a problem with our backend.

It is strange. These devices are spread out across many different networks, and I am having this issue with both our main application and our testing application. We do tunnel connections using the redsocks config, but nothing has changed in the last few days. I will check our proxy server, though it seems strange that the devices can still heartbeat to the balena app.
Thanks for checking it out, I will let you know if I find anything strange.

I just noticed that your application has the VPN disabled (on the device configuration page). This is why a connection to the host OS cannot be made: that connection goes via the VPN.

More information can be found here: https://www.balena.io/docs/reference/supervisor/bandwidth-reduction/#side-effectwarning

That seems odd. There is no fleet-wide configuration defined for the VPN, and the device-specific config has it enabled. I am still unable to access the device. Is there a different global switch that I am missing?

Sorry, you’re right, I’m misinterpreting the switch - I saw that and ran with it as it makes so much sense. Are you on the same network as the test device you just brought up?

It would be interesting to try a local ssh into the device with balena ssh <device-ip> and make sure the SSH daemon is running.

If you do manage to connect, the journal logs would be interesting, you can retrieve them with

journalctl -f -n 1000

If there is any sensitive information in your logs, it would be better to PM them to me.

Hi, thanks again for looking into it. We have one device that is not using the redsocks config and it seems to be working just fine. This seems more likely to be an issue with our proxy server. We are digging into that now, I’ll let you know if we find anything strange that might be relevant to your system. Thanks.

That does sound fairly likely, please do let us know what you find.

Ok, I think I have tracked down the issue. In our redsocks configuration we put local IP addresses in the no-proxy file. It appears that the resin-vpn interface has changed from a 172.16.0.0/12 type address to a 100.64.0.0/16 address. I was able to solve the login issue by adding that range to the no-proxy file. I am wondering if there is some documentation somewhere that explains the proper range that I should be using here? Additionally, wondering if there is any good way now to update the no-proxy file on ~100 devices without being able to connect through the vpn?

This seems to be related to OpenBalena OpenVPN Service

Is there a way to set an environment variable and use the old VPN subnet?

Seems to be the only mention of the IP range that I can find.

Hi @nleonardi,

That is not something you can change, the settings referenced there are infrastructure-level not device-level.

I am not sure I follow what you are saying about the no-proxy file. The previous VPN subnets were inside the 10.240.0.0/12 range (not 172.16/12 which is likely to be a docker network subnet), is/was this subnet present in your no-proxy file?

Interesting. Honestly, I couldn’t remember the exact subnet that it used to be; I had assumed it was in the 172.x range.
Looking at the no-proxy file, I did not have 10.240.0.0/12, but I did have 10.0.0.0/8, so that would have covered it.
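
The coverage question discussed here can be checked with Python's standard ipaddress module. This is only a minimal sketch: the function name is hypothetical, and the no-proxy list is illustrative (only 10.0.0.0/8 is confirmed in this thread; the other entries are assumed typical private ranges).

```python
import ipaddress


def covered_by_no_proxy(subnet, no_proxy_entries):
    """Return True if `subnet` lies entirely inside one of the no-proxy entries."""
    target = ipaddress.ip_network(subnet)
    return any(
        target.subnet_of(ipaddress.ip_network(entry))
        for entry in no_proxy_entries
    )


# Illustrative no-proxy list; only 10.0.0.0/8 is confirmed by the thread.
no_proxy = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]

# The old VPN range sits inside 10.0.0.0/8; the new one matches no entry.
old_vpn_covered = covered_by_no_proxy("10.240.0.0/12", no_proxy)
new_vpn_covered = covered_by_no_proxy("100.64.0.0/10", no_proxy)
print(old_vpn_covered, new_vpn_covered)
```

With a list like this, the old range is bypassed by the proxy and the new one is not, which matches the behaviour seen on the devices.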

Oh, I think I see the problem; because you’ve got a proxy in front of ssh, the proxy is trying to connect to the VPN interface (on the 100.64/10 address) which is presumably getting redirected by redsocks. I can’t immediately think of an easy remote fix for this.

Can you configure tinyproxy to rewrite the requests from 100.64/10 to localhost?

Another, possibly better, option is to have your application use the supervisor API to add the VPN subnet to the network.proxy.noProxy list using the /v1/device/host-config endpoint.
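
A rough sketch of the supervisor API option, in Python, might look like the following. It assumes the container has the supervisor API enabled (the io.balena.features.supervisor-api label), which should expose BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY in the environment (older releases may use RESIN_-prefixed names), and that GET and PATCH on /v1/device/host-config use a body shaped like {"network": {"proxy": {...}}}. The function names are hypothetical, and the exact request and response shapes should be checked against the supervisor version running on the devices.

```python
import json
import os
import urllib.request

VPN_SUBNET = "100.64.0.0/10"  # range quoted in this thread


def merge_no_proxy(host_config, subnet):
    """Build a PATCH body that keeps the existing proxy settings and adds `subnet`."""
    proxy = dict((host_config.get("network") or {}).get("proxy") or {})
    no_proxy = list(proxy.get("noProxy") or [])
    if subnet not in no_proxy:
        no_proxy.append(subnet)
    proxy["noProxy"] = no_proxy
    return {"network": {"proxy": proxy}}


def _call(method, body=None):
    """Send one request to the (assumed) host-config endpoint and return raw bytes."""
    url = "{}/v1/device/host-config?apikey={}".format(
        os.environ["BALENA_SUPERVISOR_ADDRESS"],
        os.environ["BALENA_SUPERVISOR_API_KEY"],
    )
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def ensure_vpn_subnet_bypasses_proxy():
    """Read the current host config and PATCH it only if the subnet is missing."""
    current = json.loads(_call("GET") or b"{}")
    patch = merge_no_proxy(current, VPN_SUBNET)
    if patch["network"]["proxy"] != (current.get("network") or {}).get("proxy"):
        _call("PATCH", patch)


# Only runs on a device where the supervisor variables are present.
if __name__ == "__main__" and "BALENA_SUPERVISOR_ADDRESS" in os.environ:
    ensure_vpn_subnet_bypasses_proxy()
```

Keeping the merge step as a pure function means the existing proxy type, address, and port are passed back unchanged, and running it repeatedly does not duplicate the entry.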

Didn’t realize you had the whole /10. OK, I will update with that then.

I will see if we can do something like that through tinyproxy, in the meantime, we are figuring out a way to update from a privileged container using some ssh shenanigans.

Are there release announcements that we can subscribe to for changes like this in the future? I found the PR on GitHub and can see why you would change from 10.240/12, but it would be nice to know before things blow up. Our configuration is probably not the most standard out there, but I’m guessing this will affect other people using redsocks.

We’ve actually changed the VPN subnets before without concern, so I didn’t think we’d have any issues this time either. The reason for the change is that the 100.64/10 subnet is reserved for Carrier-Grade NAT and we were seeing some issues when people were using something inside 10.240/12 locally. While what we’re doing isn’t strictly speaking CGN, the address space felt like a good fit.

If I’m understanding the issue fully, then I think application updates should still work? In which case you could query the current VPN subnet at startup and add it to noProxy, which would future-proof you against any further (albeit unlikely) topology changes in the VPN.