Device "bricked": NetworkManager not changing route to network interface with full connectivity

In some situations the device goes completely offline, and it is not possible to recover it with the standard openBalena tools such as NetworkManager.

We are using some gateways with the raspberrypi4-64 image.
Each device has a built-in Quectel EC25 modem and 2 wired Ethernet interfaces.

Now we have a huge issue: we allow our customers to change the network settings (DHCP, static IP, default gateway, etc.), and the modem serves as our fallback interface.
=> Even if the configuration is faulty, we should always be able to connect to our device and give support.

But if the device is installed in a machine with a machine intranet (sometimes the machine's router blocks the internet connection), it is not possible to connect to the gateway via balena VPN SSH.

But in the mobile provider's web app we can see that the SIM card is online.
If the customer removes the wired connection, a connection to the device over VPN is possible again.

I was able to reproduce this issue in my network.

Test 1:

  1. LAN1 static IP, metric 1

  2. Cellular, metric 4

  3. Start with full connectivity => gateway online

  4. Block internet connectivity of the static IP interface at the MAC layer using the router's access control

  5. Gateway goes offline, although cellular has full internet access
    

Test 2:

  1. LAN1 static IP, metric 1

  2. Cellular, metric 4

  3. Power down gateway

  4. Block internet connectivity of the static IP interface using the router's access control

  5. Power up gateway

  6. Gateway stays offline, although cellular has full internet access
    

Test 3:

  1. LAN1 DHCP IP, metric 1

  2. Cellular, metric 4

  3. Power off device

  4. Block internet connectivity of the static IP interface using the router's access control

  5. Power on device

  6. Gateway never goes online, even though the Onomondo portal says the mobile connection is up
    

In none of these cases does NetworkManager do its job.
But this issue occurs only with the combination of USB modem and wired network.

If I use WiFi and a wired network, NetworkManager is able to switch the connection.

In our case I added some functions to our main application in the Docker container to detect this situation and resolve it ourselves.

But is there a NetworkManager configuration that allows it to change the metric in favour of a mobile device?

I also saw that DNS resolution was not possible in this situation when wired (no internet) had metric 1 and mobile (full internet) had metric 4.

As soon as mobile got metric 1, DNS resolution for the nmcli connectivity check worked and the device came online.

Some cli commands for debugging:

nmcli general logging level DEBUG domains ALL

nmcli networking connectivity check

journalctl -f

route -n
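
To see at a glance which default route the kernel will prefer, the `route -n` output can be filtered with a few lines of Python. This is only a rough sketch: the sample table below is made up (interface names and gateways are assumptions, only the metrics 1 and 4 come from the tests above), and `default_routes` is a hypothetical helper name.

```python
# Hypothetical sample of `route -n` output; addresses and interface names are invented.
SAMPLE = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.1.1     0.0.0.0         UG    1      0        0 eth0
0.0.0.0         10.64.64.64     0.0.0.0         UG    4      0        0 wwan0
192.168.1.0     0.0.0.0         255.255.255.0   U     1      0        0 eth0
"""


def default_routes(route_n_output):
    """Return (metric, iface, gateway) for every default route, lowest metric first."""
    routes = []
    for line in route_n_output.splitlines():
        fields = line.split()
        # Data rows have 8 columns; a default route has destination and
        # genmask 0.0.0.0. This also skips the two header lines.
        if len(fields) != 8 or fields[0] != "0.0.0.0" or fields[2] != "0.0.0.0":
            continue
        routes.append((int(fields[4]), fields[7], fields[1]))
    return sorted(routes)


if __name__ == "__main__":
    for metric, iface, gateway in default_routes(SAMPLE):
        print(f"metric {metric}: {iface} via {gateway}")
```

The first entry is the route that wins, whether or not that interface actually has internet access, which matches what the tests above show.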

I believe I’ve experienced more or less the same issue, and at the very least I’ve been looking to learn more about using cellular or well anything else to act as a fallback using Network Manager.

I’m curious whether the specific route-metric takes full control here, i.e. because the wired ethernet has a lower metric, it is always used instead of the cellular connection.

Wild guess, but what happens if you do not specify a specific route-metric for either wired or cellular; maybe this allows Network Manager to figure it out by itself on the fly?

I’m really not sure about best practices here, but I’m also looking to learn more about how e.g. cellular fallback and net-interface prioritisation are typically done with e.g. Network Manager.

If I specify no metric, I see the same behaviour.

In this case eth0 has metric 100, wlan has metric 600 and mobile has metric 1000

=> mobile has the highest metric, so NetworkManager will never switch to it.

It is really strange, but I think there must be a bug in NetworkManager…

Hi,

If your server has a static IP, you could try adding a static route in your network connection profile.
Something like
nmcli connection modify MyConnection +ipv4.routes "10.20.30.40/32 10"
To add a route with metric 10 to 10.20.30.40.

In your connection file, this will look something like

[ipv4]
route1=10.1.1.12/32,0.0.0.0,10

The 0.0.0.0 in this entry is the next hop, which basically means you don’t know/care who will take care of it.
Note that you may need to reload your connection to make the route actually work.
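
If you need to generate such entries for several devices, a small helper can build the line for the connection file. This is just a sketch, assuming the `routeN=destination/prefix,next-hop,metric` layout shown above; `keyfile_route` is a made-up name, not something NetworkManager provides.

```python
import ipaddress


def keyfile_route(index, destination, metric, next_hop="0.0.0.0"):
    """Build one routeN= line for the [ipv4] section of a connection file."""
    # ip_network() raises ValueError for malformed input or set host bits.
    network = ipaddress.ip_network(destination)
    gateway = ipaddress.ip_address(next_hop)
    if network.version != 4 or gateway.version != 4:
        raise ValueError("only IPv4 routes belong in the [ipv4] section")
    if not 0 <= metric <= 2**32 - 1:
        raise ValueError("metric must fit in an unsigned 32-bit integer")
    return f"route{index}={network.with_prefixlen},{gateway},{metric}"


if __name__ == "__main__":
    print(keyfile_route(1, "10.1.1.12/32", 10))
```

Validating with `ipaddress` first means a typo in the address fails in your tooling rather than on the device.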

Hi,

this could work on a single device, but I have to manage tens of devices, in many cases without knowing the IP for the route.
For me the only question is which interface has internet access.

Unfortunately, there must be a bug in NetworkManager.
Maybe at some point I will find the time to dig into the NetworkManager source to find the cause of the issue.

For me, the solution was to modify the network connection from my application.
I check the state of the interfaces every minute.
If I detect a bad state, I automatically set the metric of mobile to 1 and the metric of the eth/wifi interfaces to 4.

With this “fix” I got the device to 100%, even if the machine network randomly blocks internet access during the day.
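
For anyone who wants to try the same approach, here is a rough sketch of how such a watchdog could look. It is not the actual application code. The assumptions: "bad state" means no eth/wifi connection reaches the internet while mobile does, the metrics are switched back once a wired connection works again, the connection names are placeholders, and `probe` is a hypothetical callable you would have to supply (for example a per-interface ping). The nmcli calls use the `ipv4.route-metric` property and re-activate the connection so the new metric is applied.

```python
import subprocess
import time

PREFERRED_METRIC = 1  # values taken from the post above
FALLBACK_METRIC = 4


def plan_metrics(has_internet, mobile="mobile"):
    """Map connection name -> route metric.

    has_internet: dict of connection name -> bool, as returned by a probe.
    Assumed "bad state": mobile has internet and no other connection does.
    """
    others = [name for name in has_internet if name != mobile]
    bad_state = has_internet.get(mobile, False) and not any(
        has_internet[name] for name in others
    )
    metrics = {
        name: FALLBACK_METRIC if bad_state else PREFERRED_METRIC for name in others
    }
    metrics[mobile] = PREFERRED_METRIC if bad_state else FALLBACK_METRIC
    return metrics


def nmcli_commands(metrics):
    """Build the nmcli invocations that would apply the planned metrics."""
    commands = []
    for name, metric in sorted(metrics.items()):
        commands.append(
            ["nmcli", "connection", "modify", name, "ipv4.route-metric", str(metric)]
        )
        # Re-activate so the new metric reaches the routing table.
        commands.append(["nmcli", "connection", "up", name])
    return commands


def watchdog(probe, interval_s=60):
    """Run forever; probe() is a hypothetical callable returning has_internet."""
    current = None
    while True:
        wanted = plan_metrics(probe())
        if wanted != current:  # only touch NetworkManager when the plan changes
            for command in nmcli_commands(wanted):
                subprocess.run(command, check=False)
            current = wanted
        time.sleep(interval_s)
```

Keeping the decision in `plan_metrics` separate from the nmcli calls makes it testable without a modem attached. Note that re-activating a connection interrupts it briefly, which is why the sketch only acts when the plan changes.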