Supervisor fails to resolve DNS on v4, v5 in offline/air-gapped setup using open-balena

We deploy open-balena to an air-gapped network. The router resolves all the required balena domains (e.g. api.aivero.lan) and advertises that DNS server via DHCP.

We run balena os configure on RaspberryPi3 images with balenaOS v2.80.3, and these devices connect nicely.

We also tried adding a dnsServers: "null" entry to config.json to disable the automatic injection of 8.8.8.8 into the list of DNS servers. In certain cases, having 8.8.8.8 in the list caused a timeout while waiting for a response from that server, which is not reachable from our air-gapped network.
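
As a minimal sketch of the config.json change described above: the helper name disable_default_dns is hypothetical, and it assumes the file is plain JSON and that the path passed in points at the device's config.json (for example on the mounted boot partition).

```python
import json
from pathlib import Path


def disable_default_dns(config_path: str) -> dict:
    """Set dnsServers to "null" in a balenaOS config.json and return the result.

    Hypothetical helper for illustration; the caller supplies the path.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    # The value is the string "null" (as in the post above), not JSON null.
    config["dnsServers"] = "null"
    # Write the file back, leaving all other keys untouched.
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling disable_default_dns with the path to config.json rewrites the file in place, so it is worth keeping a backup copy before trying it on a real image.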

However, these old images don’t have the fixed/updated HQ camera sensor-mode 5 so we need a newer version.

The newest versions (v5.0.8 or v2.115.18+rev2), however, do not connect to open balena. The supervisor errors with getaddrinfo EAI_AGAIN api.aivero.lan:

EDIT: The latest balenaOS version for RaspberryPi3 that has the HQ camera fix AND connects correctly is v2.94.4.
For the RaspberryPi4 we are using v2.88.4+rev0, which has both the HQ fix AND connects correctly.

root@9dc1123:~# balena ps
CONTAINER ID   IMAGE                                                            COMMAND                  CREATED          STATUS                             PORTS     NAMES
c699ff174f56   registry2.balena-cloud.com/v2/c5636e5430e2762232e60e19e79c773f   "/usr/src/app/entry.…"   49 seconds ago   Up 41 seconds (health: starting)             balena_supervisor
root@9dc1123:~# balena logs c699ff174f56 -f
INFO: Found device /dev/mmcblk0p1 on current boot device mmcblk0, using as mount for '(resin|balena)-boot'.
INFO: Found device /dev/mmcblk0p5 on current boot device mmcblk0, using as mount for '(resin|balena)-state'.
INFO: Found device /dev/mmcblk0p6 on current boot device mmcblk0, using as mount for '(resin|balena)-data'.
find: /mnt/root/tmp/balena-supervisor/services: No such file or directory
[info]    Supervisor v15.0.4 starting up...
[info]    Setting host to discoverable
[debug]   Starting systemd unit: avahi-daemon.service
[debug]   Starting systemd unit: avahi-daemon.socket
[debug]   Starting logging infrastructure
[info]    Starting firewall
[warn]    Invalid firewall mode: . Reverting to state: off
[info]    Applying firewall mode: off
[success] Firewall mode applied
[debug]   Starting api binder
[debug]   Performing database cleanup for container log timestamps
[info]    Previous engine snapshot was not stored. Skipping cleanup.
[debug]   Handling of local mode switch is completed
(node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
[info]    API Binder bound to: https://api.aivero.lan/v6/
[event]   Event: Supervisor start {}
[info]    Starting API server
[info]    Supervisor API successfully started on port 48484
[debug]   Ensuring device is provisioned
[debug]   Connectivity check enabled: true
[debug]   Starting periodic check for IP addresses
[event]   Event: Device bootstrap {}
[info]    Waiting for connectivity...
[info]    VPN connection is not active.
[info]    New device detected. Provisioning...
[success] Initialised splash image backend
[info]    Reporting initial state, supervisor version and API info
[info]    Attempting to load any preloaded applications
[error]   LogBackend: unexpected error: Error: getaddrinfo EAI_AGAIN api.aivero.lan
[error]         at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:119:26)
[event]   Event: Device bootstrap failed, retrying {"delay":30000,"error":{"cause":{},"isOperational":true,"errno":-3001,"code":"EAI_AGAIN","syscall":"getaddrinfo","hostname":"api.aivero.lan"}}
^C
root@9dc1123:~# ^C
root@9dc1123:~# ping api.aivero.lan
PING api.aivero.lan (192.168.88.243): 56 data bytes
64 bytes from 192.168.88.243: seq=0 ttl=64 time=1.528 ms
64 bytes from 192.168.88.243: seq=1 ttl=64 time=1.777 ms
^C

In the hostOS we can nslookup, ping or curl api.aivero.lan just fine.

Inside the supervisor container nslookup resolves it to the correct IP, but shows it as a Non-Authoritative answer.
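
One way to narrow this down further is to test resolution through getaddrinfo rather than nslookup. nslookup queries the DNS server itself, whereas getaddrinfo goes through the C library resolver, so the two can disagree. The sketch below is only an approximation of what the supervisor does (the supervisor is a Node.js process, and this uses Python's socket module); check_resolution is a hypothetical helper name.

```python
import socket


def check_resolution(hostname: str, port: int = 443) -> str:
    """Resolve hostname via getaddrinfo and describe the outcome."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as err:
        # EAI_AGAIN is the "temporary failure in name resolution" code
        # that appears in the supervisor log above.
        if err.errno == socket.EAI_AGAIN:
            return f"temporary failure (EAI_AGAIN) resolving {hostname}"
        return f"failed to resolve {hostname}: {err}"
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    addresses = sorted({info[4][0] for info in infos})
    return f"{hostname} resolves to {', '.join(addresses)}"
```

Running check_resolution("api.aivero.lan") both on the hostOS and inside the supervisor container (assuming a Python interpreter is available there, which is not a given) would show whether the two environments differ at the getaddrinfo level.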


@acostach any insights here? Thank you :slight_smile:

We have a temporary workaround:

The latest balenaOS version for RaspberryPi3 that has the HQ camera fix AND connects correctly is v2.94.4.

For the RaspberryPi4, version v2.88.4+rev0 has both the HQ fix AND connects correctly.


The question is how we can get the new 5.x.x versions fixed such that RPI3 and RPI4 connect correctly.

Any news on a potential real fix or any guidance on how to debug this further?

FYI @alexgg This is the forum entry we have for the GitHub issue "Supervisor fails to resolve DNS on v4, v5 in offline/air-gapped setup using open-balena" (Issue #2237, balena-os/balena-supervisor).

I will document how to reproduce this. One thing we noticed was that our version of open balena was rather old, so we are in the process of upgrading it.

After a long back and forth, it turns out the main blocker was the DNS resolver on the router, plus some caching on both the router and within open balena that obscured the problem.

Instead of using the DNS resolver on the router we are now running https://knot-resolver.readthedocs.io/ on a machine in the network and having the router advertise the DNS server through DHCP.

We are using the steps from "Quick tips: Server Name Indication for internal domains with Turris Omnia" (dennogumi.org) to configure a DNS entry for our aivero.lan domain and all of its sub- and sub-sub-domains.

I will create an MR to open balena that allows running open balena fully air-gapped, and I’ll need some suggestions on how best to integrate that with the usage flow around make.