Stuck Cycling Between VPN-only and offline

One of my remote Intel NUCs has gotten stuck in offline mode and, at intervals that vary from about 4 to 60 minutes, it goes into VPN-only mode for a few minutes.

While in VPN-only mode, I’ve tried the Balena Dashboard tools: reboot button, logging in to terminal, stopping the services, all of which fail to execute.

It is looking like a manual reboot is the only way out. In your experience, what types of error could be triggering this cycling between offline mode and VPN-only mode?
Thanks,
Sandy

Hi, welcome to the forums. VPN-only mode happens when the supervisor running on the device is not able to reach the balenaCloud API. The first thing to check is that the network the device is connected to complies with the balena network requirements, as specified in Network Setup on balenaOS - Balena Documentation.

If the network is compliant, log into the hostOS shell and take a look at the supervisor logs with:

journalctl -u balena-supervisor --no-pager

Please paste the output of that here.
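
If the full output is long, a small filter can help surface the relevant lines before pasting. This is only a sketch: it assumes the supervisor tags problem lines with `[error]` or `[warn]` in the same way it tags `[info]` lines, and the function name `supervisor_problems` is made up for illustration.

```shell
# Keep only lines tagged [error] or [warn]; reads log text from stdin.
# Assumption: problem lines carry these tags, mirroring the [info] tag.
supervisor_problems() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E '\[(error|warn)\]' || true
}

# Usage on the device (hostOS shell):
#   journalctl -u balena-supervisor --no-pager | supervisor_problems
```

The unfiltered log is still the more useful thing to share, since the lines around an error often explain it.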

Alex,
Thanks for the speedy reply. I was unable to open the hostOS shell during VPN-only mode (it never advanced past “Connecting…”), and finally resorted to a manual reboot. After manually rebooting the NUC, I was able to examine the supervisor logs and network port assignments.
Sandy

Supervisor Logs
root@2321700:~# journalctl -u balena-supervisor --no-pager
-- Journal begins at Tue 2022-08-09 19:23:48 UTC, ends at Tue 2022-08-09 19:25:45 UTC. --
Aug 09 19:24:02 2321700 balena-supervisor[2981]: [info] Reported current state to the cloud
Aug 09 19:24:12 2321700 balena-supervisor[2981]: [info] Reported current state to the cloud
Aug 09 19:24:33 2321700 balena-supervisor[2981]: [info] Reported current state to the cloud
Aug 09 19:24:44 2321700 balena-supervisor[2981]: [info] Reported current state to the cloud
Aug 09 19:25:05 2321700 balena-supervisor[2981]: [info] Reported current state to the cloud
Aug 09 19:25:16 2321700 balena-supervisor[2981]: [info] Reported current state to the cloud

Network Setup Ports
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root 22u IPv6 31406 0t0 TCP 10.245.46.53:22222->52.4.252.97:62715 (ESTABLISHED)
systemd 1 root 68u IPv6 13774 0t0 TCP *:22222 (LISTEN)
NetworkMa 1022 root 23u IPv4 26838 0t0 UDP 192.168.13.100:68->192.168.13.31:67
dnsmasq 1051 nobody 4u IPv4 1893 0t0 UDP 10.114.102.1:53
dnsmasq 1051 nobody 5u IPv4 1894 0t0 TCP 10.114.102.1:53 (LISTEN)
dnsmasq 1051 nobody 6u IPv4 1895 0t0 UDP 127.0.0.2:53
dnsmasq 1051 nobody 7u IPv4 1896 0t0 TCP 127.0.0.2:53 (LISTEN)
avahi-dae 1194 avahi 11u IPv4 13953 0t0 UDP *:5353
avahi-dae 1194 avahi 12u IPv6 13954 0t0 UDP *:5353
avahi-dae 1194 avahi 13u IPv4 13955 0t0 UDP *:52431
avahi-dae 1194 avahi 14u IPv6 13956 0t0 UDP *:56405
openvpn 1204 openvpn 3u IPv4 26822 0t0 TCP 192.168.13.100:62402->52.7.228.224:443 (ESTABLISHED)
balena-en 1793 root 8u IPv4 20646 0t0 TCP *:80 (LISTEN)
balena-en 1842 root 8u IPv6 20130 0t0 TCP *:80 (LISTEN)
balena-en 1865 root 8u IPv4 20155 0t0 TCP *:502 (LISTEN)
balena-en 1939 root 8u IPv6 22787 0t0 TCP *:502 (LISTEN)
balena-en 1952 root 8u IPv4 22810 0t0 TCP *:2021 (LISTEN)
balena-en 2017 root 8u IPv6 20289 0t0 TCP *:2021 (LISTEN)
node 3020 root 21u IPv6 22196 0t0 TCP *:48484 (LISTEN)
node 3020 root 22u IPv4 27700 0t0 TCP 192.168.13.100:55544->18.232.229.138:443 (ESTABLISHED)
chronyd 3274 root 3u IPv4 27838 0t0 UDP 127.0.0.1:323
chronyd 3274 root 4u IPv6 27839 0t0 UDP [::1]:323
chronyd 3274 root 7u IPv4 27877 0t0 UDP *:1234
chronyd 3274 root 8u IPv6 27878 0t0 UDP *:1234
sshd 5006 root 3u IPv6 31406 0t0 TCP 10.245.46.53:22222->52.4.252.97:62715 (ESTABLISHED)
sshd 5006 root 4u IPv6 31406 0t0 TCP 10.245.46.53:22222->52.4.252.97:62715 (ESTABLISHED)

Hi,
Thanks for sharing an excerpt of the supervisor logs. From the logs, it is apparent that the device is able to reach balena’s API endpoint at least some of the time. We can see that there are two TCP connections established to port 443. One is by openvpn and the other is by a node daemon to IP address 18.232.229.138 (which is an AWS EC2 IP). This process is most likely the supervisor. Sharing journalctl logs for the supervisor or for all services will help us narrow down the issue effectively.
In my experience, such errors are caused by network appliances between the device and balena’s API servers. Do you happen to have a firewall or a transparent proxy or a Deep Packet Inspection appliance that could be causing this?

Best regards,
Pranav
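
The two port-443 connections mentioned above can be picked out of a listing like the one posted with a short filter. This is a rough sketch, assuming the listing came from something like `lsof -i -n -P` (the thread does not say which command was used); the function name `established_to_port` is made up for illustration.

```shell
# Print "COMMAND remote_ip:port" for established TCP connections to a given
# remote port, reading lsof-style lines from stdin.
established_to_port() {
    port="$1"
    awk -v port="$port" '
        # Established connections end in "local->remote (ESTABLISHED)".
        $NF == "(ESTABLISHED)" && $(NF-1) ~ /->/ {
            split($(NF-1), ends, "->")        # ends[2] is the remote side
            n = split(ends[2], parts, ":")    # last piece is the remote port
            if (parts[n] == port) print $1, ends[2]
        }
    '
}

# Usage (the lsof flags are an assumption, not taken from the thread):
#   lsof -i -n -P | established_to_port 443
```

Running it periodically while the device cycles would show whether the openvpn connection stays up while the node (supervisor) connection drops.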

Missed asking this - do you have any other devices that exhibit similar/identical behaviour? Are those devices in the same network as this device?

I had this happen because my ISP and router were not properly routing IPv6 traffic. Since my application does not need IPv6 at this time, I disabled it in NetworkManager.
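
For anyone wanting to try the same thing, one way to express it is in the connection's NetworkManager keyfile. This is a sketch under assumptions, not necessarily what was done here: it assumes NetworkManager 1.20 or later (earlier versions only accept `method=ignore`) and that the connection is defined by a keyfile.

```ini
# Fragment of a NetworkManager connection keyfile (assumed setup).
[ipv6]
# "disabled" turns IPv6 off for this connection (NetworkManager >= 1.20).
method=disabled
```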

Pranav,
Thanks for weighing in on this. We do not have a firewall, transparent proxy or deep packet inspection appliance. The NUC unit running BalenaOS usually experiences uninterrupted access to the Balena API endpoint via cell modem.
Today:

  • balenaCloud Terminal is unable to connect to either hostOS or any container. “Red Dot”
  • Cannot access journalctl logs because balenaCloud Terminal is unable to connect
  • balenaCloud Log shows the NUC unit is running its installed program

Attached:

  • Device Health Check
  • Device Diagnostics
  • Screenshot of container status & Terminal session

Host OS: balenaOS 2.98.33
Supervisor: 14.0.13

What else can I do to regain access to my system, and get you the necessary journalctl information?

Thanks,
Sandy

BalenaCloudDashboardScreenshot-FH-Terminal.pdf (37.6 KB)

Fierce_Hill_23217009bdacc06c967b41cf0ab1c97e_checks_2022.08.16_16.26.21+0000 (1).pdf (18.2 KB)

23217009bdacc06c967b41cf0ab1c97e_diagnostics_2022.08.16_16.28.27+0000.pdf (588 KB)

Barrett,
Thanks for passing along your experiences with this issue. I don’t need IPv6 either. I’ll check this out.

Hi, when you say:

The NUC unit running BalenaOS usually experiences uninterrupted access to the Balena API endpoint via cell modem.

How is the cell modem configured? The diagnostics you attached show mmcli not detecting any modem. There is an ethernet interface with an assigned IP address.

How is the device connected to the internet, via ethernet or via cellular?

Alex, the cell modem is connected to the NUC via the NUC’s ethernet card.

Hi, so I understand that you use an external cellular router and the device connects to it via ethernet, and that you are running balenaOS 2.98.33, which is fairly new.

I am wondering whether this is something that has been introduced in recent balenaOS versions. Do you have other devices in your fleet also connecting via the same type of cellular router on older balenaOS releases? If so, do they experience the problem?

I am thinking, for example, of networkmanager: Use default DHCP timeout by majorz · Pull Request #2597 · balena-os/meta-balena · GitHub, which reverts an NM setting related to DHCP timeouts. It might be worth adding:

ipv4.dhcp-timeout=2147483647

to the ethernet connection on the faulty device to see if it has any effect.
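
As a rough illustration of where that property goes: in a NetworkManager keyfile, `ipv4.dhcp-timeout` is written as `dhcp-timeout` under the `[ipv4]` section. The sketch below assumes the ethernet connection is defined by a keyfile in the `system-connections` folder; the `id` and `interface-name` values are hypothetical and should match the existing connection.

```ini
# Hypothetical ethernet connection keyfile; id and interface-name are
# placeholders, not values taken from this device.
[connection]
id=my-ethernet
type=ethernet
interface-name=eth0

[ipv4]
method=auto
# 2147483647 (max 32-bit signed int) means "never give up waiting for DHCP".
dhcp-timeout=2147483647
```

With an infinite timeout, NetworkManager keeps waiting for a lease instead of failing the connection when the cellular router is slow to answer DHCP.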

Alex,
Thanks for the suggestion about adding the DHCP timeout setting. I’ll give it a try.

We have three NUC setups, all are on the same balenaOS & supervisor versions:

  1. The one in question “Fierce_Hill”: NUC, external, dedicated cellular device connected via ethernet (cellular device is not shared with other devices)
  2. NUC, external, dedicated cellular device connected via ethernet (cellular device is not shared with other devices)
  3. NUC, connected via ethernet to my home network

Only Fierce_Hill had the stuck-in-VPN problems.