BalenaOS DNS failure

Hi all,

Occasionally we experience a DNS failure on some devices: Raspberry Pi 3 (containers) and Raspberry Pi Zero W.

I’ve read some topics on this forum but, although sometimes insightful, couldn’t find one that helps with this.

Both the containers and the host can ping IP addresses but cannot ping domain names.

This causes the devices to lose track of time, I guess because they cannot resolve the NTP server names.

NetworkManager and dnsmasq are both running on the host. I tried restarting them, but that didn’t solve the problem.

Rebooting the system fixes it, but I was looking for a smoother recovery, with a reboot only as a last resort.

Any hints?

Thank you and best regards

Hi, when the problem occurs can you ping the default gateway and DNS server by IP address?
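For anyone running the same check, here is a minimal shell sketch. The function names are made up for illustration, and the path /etc/resolv.dnsmasq in the usage line is an assumption based on the file name mentioned in this thread; point it at whichever file holds the upstream servers on your device.

```shell
nameservers_from() {
    # Print the address on every "nameserver <ip>" line of the file in $1.
    awk '$1 == "nameserver" { print $2 }' "$1"
}

default_gateways() {
    # Print the next-hop address of every default route that has one.
    ip route show default | awk '$2 == "via" { print $3 }'
}

ping_all() {
    # Ping each address given as an argument once and report the result.
    for addr in "$@"; do
        if ping -c 1 -W 2 "$addr" >/dev/null 2>&1; then
            echo "OK   $addr"
        else
            echo "FAIL $addr"
        fi
    done
}
```

Usage on the host: ping_all $(default_gateways) $(nameservers_from /etc/resolv.dnsmasq)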

Hello @alexgg ,

Thanks for replying.
As soon as I catch a similar situation, I’ll make the test you’re suggesting!

Hi @alexgg

I’m facing a similar situation: I can ping IP addresses but not domains.

Screenshot 2023-07-06 at 16.03.56

It somehow lost the reference to the 8.8.8.8 DNS server in resolv.dnsmasq.

Screenshot 2023-07-06 at 16.18.05

A BIT OF CONTEXT
This is a Raspberry Pi Zero 2 W (64-bit) device
running balenaOS in development mode.

It is connected via an LTE 4G modem and connects to the balena VPN: I have access, via the dashboard, to both the host OS and a container.

It is also connected via Ethernet to a router which I disconnected from the internet (gateway: 192.168.2.10).

Despite being connected to the internet via the 4G modem, it seems to be routing DNS through the router (which is disconnected from the internet).

I’m trying to wrap my head around this.

If I understood your suggestion correctly: yes, it can ping the gateway (192.168.2.10).

Any ideas ?

FINAL EDIT:
As soon as I unplugged the Ethernet cable from the router (which is not connected to the internet), all was fine again…

Screenshot 2023-07-06 at 16.44.18

My guess is that it somehow uses the DNS servers from the Ethernet connection first, in preference to the connection that is actually active.

PS: When I plug the Ethernet cable in again, it goes back to being unable to resolve DNS.

I’d appreciate any help understanding this behavior and finding a way to avoid it.

Thank you very much
If need be I can provide you with more details about the fleet and device

Best regards

Hi,

Could it be that this is simply an issue of metrics?
In general, Ethernet devices get assigned a better routing metric than wireless or GSM.
You can check this with ip route.
You can try to force the route to use your 4G by adding a route to your connection settings (the route goes in quotes so that nmcli reads it as a single value):
nmcli connection modify <name> ipv4.routes "8.8.8.8/32 <metric>"

Regarding your DNS entries getting replaced altogether, you can try to add ipv4.ignore-auto-dns to your Ethernet connection settings; that should prevent DHCP from adding new DNS servers.
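To compare metrics at a glance, here is a small sketch that reads ip route output on stdin and prints each default route’s metric and device, lowest (preferred) metric first. The function name is hypothetical, and a route without an explicit metric is shown as 0.

```shell
default_routes_by_metric() {
    # Read "ip route" output on stdin; print "<metric> <device>" for each
    # default route, sorted so the preferred (lowest-metric) route is first.
    awk '$1 == "default" {
        dev = "?"; metric = 0
        for (i = 2; i < NF; i++) {
            if ($i == "dev") dev = $(i + 1)
            if ($i == "metric") metric = $(i + 1)
        }
        print metric, dev
    }' | sort -n
}
```

Usage: ip route show default | default_routes_by_metric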

Hi @TJvV !

Thank you for the quick reply.
I also thought about that possibility, but we had already defined the route metrics as follows:

Ethernet: 6
nmcli connection show eth0

Cellular (4G): 3
nmcli connection show cellular

To be more concise:

With the ETH cable plugged (links to a router without internet)

nmcli: cellular connection is prioritized due to the route metric

ping domain - NOK, ping IP - OK

ip route

nslookup

Screenshot 2023-07-07 at 16.08.50

Without ETH cable

nmcli

ping domain - OK, ping IP - OK

ip route

nslookup

IN SHORT

Despite the route metric favoring the cellular connection over the Ethernet connection, it seems that whenever I plug in the Ethernet cable, the device automatically tries to resolve DNS “through it”.

Thank you and best regards

Hi,

The configuration does sound sane.

Can you show your /etc/resolv.conf in both scenarios?
My guess is that your DHCP adds a server in the 192.168.2.9/24 subnet, which would get picked up with metric 6 as the cellular only has a /30 subnet.

Again, can you also try it with ipv4.ignore-auto-dns enabled on your ethernet connection?
That should prevent DHCP from adding new servers.

Hi @TJvV

Without ethernet cable

With ethernet cable plugged

After adding:

nmcli connection modify eth0 ipv4.ignore-auto-dns yes

The behavior stops, as you suggested, and the resolv files remain unchanged.
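For reference, the change can be wrapped in a small helper that also re-activates the connection, since edits to a profile only take effect on the next activation. This is a sketch, not a balenaOS-specific recipe: the function name and the NMCLI override variable are made up for illustration, and eth0 is the profile name used in this thread.

```shell
# NMCLI can be overridden (e.g. NMCLI=echo) to preview the commands
# without touching the device.
NMCLI=${NMCLI:-nmcli}

ignore_dhcp_dns() {
    # $1: connection profile name, e.g. eth0
    # Stop NetworkManager from using the DNS servers offered by DHCP.
    "$NMCLI" connection modify "$1" ipv4.ignore-auto-dns yes || return 1
    # Re-activate the profile so the modified setting is applied.
    "$NMCLI" connection up "$1"
}
```

Usage: ignore_dhcp_dns eth0. Re-activating briefly takes that connection down, so run it over a link that does not depend on it.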

Now we have to consider whether this is a good overall configuration for our purposes.
Thank you a lot for the suggestion and best regards

Hi,

Glad to hear it’s working :)

If you don’t want to discard the DHCP DNS altogether, it seems there’s also a setting, ipv4.dns-priority, that might help in fixing the order of your nameservers.
Maybe try setting that on both connections, with the cellular one having the lower value (lower values mean higher priority); note that the default values are 50 for VPN and 100 for others, and 0 selects the default.

Based on your routing table, it should then first try the 172.30.8.5 and 172.30.8.6 via cdc-wdm0, only going to 192.168.2.10 via eth0
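A minimal sketch of that idea, untested on balenaOS: the values 10 and 100 are arbitrary illustrative choices, the function names and the NMCLI override variable are hypothetical, and cellular and eth0 are the profile names from earlier in the thread.

```shell
# NMCLI can be overridden (e.g. NMCLI=echo) to preview the commands.
NMCLI=${NMCLI:-nmcli}

set_dns_priority() {
    # $1: connection profile name, $2: priority (lower value = preferred)
    "$NMCLI" connection modify "$1" ipv4.dns-priority "$2"
}

prefer_cellular_dns() {
    # Give the cellular profile the lower value so its DNS servers come first.
    set_dns_priority cellular 10 || return 1
    set_dns_priority eth0 100
}
```

After running prefer_cellular_dns, re-activate both connections for the change to apply; nmcli -g ipv4.dns-priority connection show cellular should then print the stored value. Keep the values positive, because negative priorities have a special meaning in NetworkManager (they exclude the DNS servers of connections with a greater value).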