Device offline on web dashboard and CLI

After we placed an Intel NUC device at the production site, it went offline on the web dashboard and CLI, but it still reports logs, deploys container updates, and shows the correct container status…

How can I debug this? I suspect the device cannot connect to the VPN service, but it would be useful to verify this assumption on the device itself.

That sounds like it. Do you happen to have another device in the same network that we can use to hop through and investigate this issue? Otherwise, are you able to deploy one for debugging purposes?

Currently there is no other balena device on the site, but we will install one in the near future. Thanks!

Also, was the device showing connected before and dropped out of nowhere? If yes, are you aware of any changes on the local network side (especially firewall related)?

The connection issues are firewall related: we did some testing with our own VPN servers and detected connection resets from the third-party firewall.

We integrated some basic Supervisor API features (shutdown/reboot/update) into our main app container, but I wonder whether balenaOS itself can be updated without the VPN.

Hi,

Here is the list of network requirements for balenaOS:
https://www.balena.io/docs/reference/OS/network/2.x/#network-requirements

If port 443, which the VPN uses, is not open, you won’t be able to access your boards.
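To check the VPN assumption from a machine on the same network as the device, a minimal sketch like the following tests whether an outbound TCP connection on port 443 can be opened at all. The function name `can_open_tcp` is hypothetical, and the host to test is not given in this thread; take it from the network requirements page linked above.

```python
import socket


def can_open_tcp(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused or reset connections, and timeouts.
        return False
```

For example, `can_open_tcp("vpn.example.com")` with a placeholder host name. Note that a successful handshake only shows the port is reachable; a firewall that resets connections later, as described earlier in this thread, could still break the VPN session.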

Same problem here, but no firewall in place.

That is, one of my devices has been marked offline for 21 hours, but it still communicates with my own server. The device is a Raspberry Pi 3 running balenaOS 2.32.0+rev1 and Supervisor 9.14.0. It connects to the Internet via a 3G USB dongle. It has been seen online before, but now the balena VPN connection seems to be down and I can neither SSH into it nor reboot it to see if the VPN comes back.
Is there a kind of automatic retry on the VPN connection?
Other suggestions to get the VPN back?

Hi,

Thank you for the report. We’re looking into this, but in the short term, is the device running a development release, or is there another online device on the same network? Either would allow access to try and investigate what is occurring on the device.

Best regards,

Heds

No, sorry. The device runs a production image and it only has a 3G USB dongle with a common SIM with private addressing.

Is there a kind of automatic retry on the VPN connection?
That is, is the VPN expected to retry infinitely to connect, once every 24 hours or something like that?

@daghemo, the VPN should reconnect almost immediately after a disconnect, but we are currently seeing some instances where the client does not restart when the connection times out. If you are able to power cycle your device, this should get it back online, but I cannot currently provide any more information on the actual issue.

Hi @daghemo,

While we are still investigating what caused this issue with the VPN in the first place, we have performed some triage from the server side and most devices should have now reconnected. Again, we are still actively investigating this issue, but please let us know if you observe the behavior at all going forward.

Thanks for reporting this, and we will update you when we have more information on the underlying issues!

Hi @xginn8 and @wrboyce.
I have another device with the same problem. I can also provide its URL, but it cannot be managed at the moment because the VPN is down and there is no direct access.

We are seeing similar issues when switching between SSIDs.

Our setup always has an ethernet connection as well. We change SSIDs on the fly, and when that happens the device seems to go offline in the balena dashboard, but we keep receiving logs from the device via our software.

It seems that whenever we change the WiFi network on the device, the whole VPN goes down, and we are also seeing the ethernet connection being interrupted.

Is it possible to bind the VPN to the ethernet port if there is connectivity on it?

I had the same problem here in the past. I should check again.
Some systems certainly just lost their connection entirely, that is, neither VPN nor log output.

To @agherzan, @xginn8, @wrboyce and @hedss, would it be possible to reset/restart the VPN on the host device from within an application container?
That is, having no active VPN on the device means that the device itself cannot be reached or managed remotely, but I could put a small script in the application container that checks whether the VPN is gone and tries to restart it when needed.
I think that the VPN is not managed directly by NetworkManager, but by the Supervisor.
Is there any way to force the VPN to restart?

Hi,
If only the VPN is not working but the device can reach our API, you can trigger a VPN restart by changing the device configuration “Enable/Disable VPN” in the dashboard. The supervisor polls our API directly via HTTPS, not via the VPN. It polls about every 10 minutes, so please make sure to wait for a couple of minutes after changing that setting so that the supervisor syncs it. The device itself could also set this in our cloud API. We don’t offer any service to perform this locally.
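A minimal sketch of how a container-side watchdog could combine these pieces, under several assumptions that are not confirmed in this thread. `set_vpn_enabled` is a hypothetical callable that changes the “Enable/Disable VPN” device configuration through the cloud API; the actual API call is deliberately not shown. The Supervisor endpoint `/v2/device/vpn` and its response shape are assumed from the Supervisor API documentation and may not exist on older Supervisor versions such as the 9.14.0 mentioned earlier.

```python
import json
import time
import urllib.request
from typing import Callable


def vpn_connected(payload: dict) -> bool:
    """Interpret a Supervisor VPN status payload (assumed shape:
    {"vpn": {"enabled": bool, "connected": bool}})."""
    vpn = payload.get("vpn") or {}
    return bool(vpn.get("enabled")) and bool(vpn.get("connected"))


def fetch_vpn_status(address: str, api_key: str, timeout: float = 10.0) -> dict:
    """Ask the local Supervisor for the VPN status (endpoint path is an assumption)."""
    url = f"{address}/v2/device/vpn?apikey={api_key}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def restart_vpn(
    set_vpn_enabled: Callable[[bool], None],
    wait_seconds: float = 15 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Toggle the VPN setting off and on again.

    The wait is longer than the roughly 10 minute polling interval so the
    supervisor has a chance to see the disabled state before it is re-enabled.
    """
    set_vpn_enabled(False)  # hypothetical cloud API call, supplied by the caller
    sleep(wait_seconds)
    set_vpn_enabled(True)
```

Inside a container with Supervisor API access, the address and key would typically come from the `BALENA_SUPERVISOR_ADDRESS` and `BALENA_SUPERVISOR_API_KEY` environment variables (again an assumption about the setup). The watchdog would then call `restart_vpn(...)` only when `vpn_connected(fetch_vpn_status(...))` is false for a sustained period.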

Hi, a heads up: we have finished our investigation and released a post-mortem for the long-running VPN incident in which online devices were incorrectly marked offline. You can read it here: https://status.balena.io/incidents/xg4n3sh37qnt. Please don’t hesitate to let us know if you have any further questions or issues regarding this. Thank you!

I am also facing this issue. Things were working fine until yesterday, but now, when I create a new device, it shows online as soon as I boot it up and then goes offline after updating the Docker images.

I tried many times but was not able to figure out the exact reason.

I have devices in this account that have been running for some time, and they are doing fine. The new devices, however, appear online as soon as I boot them from a fresh image, but go offline after downloading the Docker images.

I am using Raspberry Pi 4.

Hello @vikalp

Does a power cycle bring your device online again?
If yes, I would suggest enabling persistent logging (in the device configuration) and trying to reproduce the problem. If the device comes back online after a reboot with persistent logging enabled, you can get relevant information from the device logs (which you can access with journalctl in a terminal session).

Let us know how it goes.