Healthcheck and supervisor container DNS resolution issues

Greetings,
I have an issue regarding the OpenBalena setup.

I wish to deploy OpenBalena for a few Raspberry Pi 4 devices on my local network. I have followed the setup guide on the openbalena documentation page, and have successfully set up the following:

  • OpenBalena server on a local device (an Intel NUC used as a development server)
  • Balena-cli on my local machine with working certificates (balena login, balena scan, balena deploy etc, they all work)
  • A local DNS server running via dnsmasq on another local NUC, with all the necessary addresses pointing to my openbalena server (api, registry, vpn, s3, tunnel)
  • Ran balena os-configure to configure a newly downloaded BalenaOS image for the Raspberry Pi 4s in the fleet
  • Configured static IPs for all of the devices, including the BalenaOS devices via the resin-ethernet file in system-connections (includes the dns field with my DNS address)
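For readers following along, a static-IP connection file of the kind described in the last bullet might look roughly like the sketch below. This is not the poster's actual file: the helper name write_static_ethernet and every address are placeholders, and the keys are the standard NetworkManager keyfile ones (method=manual, address1=<address/prefix>,<gateway>, dns=<server>;).

```shell
# Hypothetical helper: write a minimal static-IP NetworkManager keyfile.
# Arguments: <output file> <address/prefix> <gateway> <dns server>
# All values passed in are placeholders chosen by the caller.
write_static_ethernet() {
    cat > "$1" <<EOF
[connection]
id=resin-ethernet
type=ethernet

[ipv4]
method=manual
address1=$2,$3
dns=$4;
EOF
}
```

Example call, with made-up addresses: write_static_ethernet system-connections/resin-ethernet 192.168.1.50/24 192.168.1.1 192.168.1.10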

The working segments of the setup are as follows:

  • The devices (when flashed with the preconfigured image) boot normally and connect to the network as specified in the network file.
  • The balena scan command shows the devices’ info (I have them running in development mode) and the version of balenaOS (2.88.5+rev1)
  • I am able to ssh onto the devices.

This is where the issues start:

The clock on the devices does not sync (I left a device running for a day and the clock still did not sync). I then successfully forced the time sync with /usr/sbin/chronyd.
After getting the correct clock, the device then manages to get the configuration for the fleet, pulls the docker images, and spins up the containers. Everything seems to be functioning normally docker-wise.
However, upon issuing “balena devices” on the server, both of the devices are reported offline. I can even access their logs, but the healthcheck seems to be failing. After SSHing into the devices, I can confirm the following:

  • The dnsmasq service still uses the Google DNS address (8.8.8.8) in addition to my local DNS IP. I have checked all of the configuration files used, and changed the only file that contains 8.8.8.8 as the nameserver (/run/dnsmasq.servers) to the IP of my DNS. The file is overwritten with the Google DNS address every time dnsmasq is restarted.
  • The supervisor container periodically reports “Event: Device state report failure {"error":"getaddrinfo ENOTFOUND api.mydomain"}”. I have tried to curl -k api.mydomain/ping from inside the balena_supervisor container, and the address resolution fails. However, if I run the same command on the device itself, the address is resolved and I get OK as the response.
  • If I manually set the “dns” value inside a created /etc/docker/daemon.json that points to my address, the balena docker engine fails to start.
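One way to see the mismatch described above is to compare the nameservers the host OS sees with those visible inside the supervisor container. The following is only a sketch: the function names are made up for illustration, and it assumes the balena-engine CLI is available on the device and that the container is named balena_supervisor, as in the post.

```shell
# Print the nameserver addresses found in resolv.conf-style text on stdin.
nameservers() {
    awk '$1 == "nameserver" { print $2 }'
}

# Print host and supervisor-container nameservers for comparison.
# Assumption: balena-engine CLI and a container named balena_supervisor.
compare_dns() {
    echo "host:       $(nameservers < /etc/resolv.conf | tr '\n' ' ')"
    echo "supervisor: $(balena-engine exec balena_supervisor cat /etc/resolv.conf | nameservers | tr '\n' ' ')"
}
```

Running compare_dns on the device prints both lists; if the two differ, or 8.8.8.8 shows up where only the local DNS server is expected, that narrows down where the configuration is being lost.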

I suspect that DNS resolution in the balena/docker engine is the main culprit, and that it prevents the devices from being shown as online on the openbalena server side.

Any and all insight on the topic is most welcome, as I have been trying to debug the situation for days, to no avail. Thank you!

Hello,

thanks for the detailed information. I’m trying to go through the issues you report inline:

The dnsmasq service still uses the Google DNS address (8.8.8.8) in addition to my local DNS IP. I have checked all of the configuration files used, and changed the only file that contains 8.8.8.8 as the nameserver (/run/dnsmasq.servers) to the IP of my DNS. The file is overwritten with the Google DNS address every time dnsmasq is restarted.

Where did you change this, on the devices themselves? I cannot fully follow where these changes took place.

The supervisor container periodically reports “Event: Device state report failure {"error":"getaddrinfo ENOTFOUND api.mydomain"}”. I have tried to curl -k api.mydomain/ping from inside the balena_supervisor container, and the address resolution fails. However, if I run the same command on the device itself, the address is resolved and I get OK as the response.

This indicates that DNS forwarding is not working properly. Can you please run nslookup api.mydomain to check the DNS resolution?
ping and nslookup work differently. In addition, I’d like to confirm that the certificates work for the local domain name, so please check the certificate with openssl:
openssl s_client -servername api.mydomain -connect api.mydomain:443
Can you also try this command on the host OS and inside the balena_supervisor container:
curl -v https://api.mydomain
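The three checks above can be bundled into a small helper so they are run the same way each time. This is just a convenience sketch, not an official tool: run and check_endpoint are hypothetical names, and DRY_RUN is an invented switch that prints the commands instead of executing them.

```shell
# Execute a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run the DNS, TLS and HTTPS checks against one hostname.
check_endpoint() {
    host="$1"
    # 1. Does the name resolve at all?
    run nslookup "$host"
    # 2. Which certificate does the server present? stdin is redirected from
    #    /dev/null so s_client exits after the handshake instead of waiting.
    run openssl s_client -servername "$host" -connect "$host:443" </dev/null
    # 3. Does a full HTTPS request succeed with certificate verification?
    run curl -v "https://$host"
}
```

On the host OS this would be called as check_endpoint api.mydomain. Inside the container, the individual commands can be run through the engine, for example balena-engine exec balena_supervisor curl -v https://api.mydomain (assuming the balena-engine CLI and that container name).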

Additionally, balenaOS and the supervisor treat .local domains as a special case. Can you please share with us if the domain is ending with .local?

If I manually set the “dns” value inside a created /etc/docker/daemon.json that points to my address, the balena docker engine fails to start.

I don’t see the value of this and it should not be applied.

Best Regards
Harald

Hello,

thank you for the swift and comprehensive reply.

can you please share with us, if the domain is ending with .local

To address the last question first, I am not using .local as my domain ending anymore. I did so for my first setup, then found a thread on this forum suggesting that it can cause issues, and proceeded to set up everything from scratch with a different domain name (without .local).

Where did you change this, on the devices itself? I cannot fully follow where these changes took place.

The networking files in which I changed the DNS server IP setting were /etc/systemd/system/dnsmasq.service.d/dnsmasq.conf, /etc/resolv.dnsmasq, and /run/dnsmasq.servers. I am aware that these files should not be changed, but I did so only in an attempt to debug the situation.

Update as of 9.3.2022:
I have found the reason DNS resolution wasn’t working. The config.json file was missing the “dnsServers” key (I must have overlooked it while copying to a new device), and adding it solved the issue of 8.8.8.8 being used as the default DNS server. I have used the nslookup command as suggested, and verified that the domain resolution now works from the container. The curl -v https://api.mydomain command also works with the correct certificates from the supervisor container and balenaOS itself.
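For anyone hitting the same problem, a quick way to catch a missing key before flashing is to check config.json for it. The sketch below uses a hypothetical helper name and a plain grep so it does not depend on any JSON tooling; as far as I understand the balenaOS documentation, the value is a space-separated string of server addresses, e.g. "dnsServers": "192.168.1.10" (placeholder address).

```shell
# Succeed if the given config.json contains a "dnsServers" key.
# This is a text match only; it does not validate the JSON or the value.
has_dns_servers() {
    grep -q '"dnsServers"[[:space:]]*:' "$1"
}
```

Example: has_dns_servers config.json || echo "dnsServers is missing". On a running device the file is usually found at /mnt/boot/config.json, but treat that path as an assumption for your OS version.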

I have since done a clean flash of another Raspberry Pi with my configured image, put in my network files, config.txt (I’m using a PiCamera) and config.json, and the device started up successfully, but the configuration was not pulled from the server because “certificate was not valid yet”. I was able to ssh onto the device, and saw that chronyd wasn’t running(?). When I started it manually, the system clock was stepped, and the configuration was then pulled normally.
The deploy command on the server works, and the configuration is updated correctly if I change the docker-compose file of my services used in the balena deploy command. The command curl -k https://api.openbalena.livs/os/v1/config successfully returns the device config both on the BalenaOS and in its supervisor container. My deployed services are also fully functional. :slightly_smiling_face:
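Regarding the “certificate was not valid yet” error mentioned above: a device without a synced clock can report a date earlier than the certificate's start date, so TLS validation fails until the time is corrected. A rough way to inspect this on the device is sketched below; the function names are hypothetical, the year threshold is an arbitrary sanity check, and the report assumes systemd and the chronyc client are present.

```shell
# Succeed if the system clock's year is at least the given year (default 2022).
# A crude sanity check for a clock that was never synced.
clock_is_plausible() {
    [ "$(date -u +%Y)" -ge "${1:-2022}" ]
}

# Print the current time-sync state.
# Assumption: systemd manages a unit named chronyd and chronyc is installed.
time_sync_report() {
    date -u
    systemctl is-active chronyd
    chronyc tracking
}
```

If time_sync_report shows chronyd as inactive, starting it and then running chronyc makestep asks chrony to step the clock immediately rather than slewing it gradually.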

However, the healthcheck (the balena devices command on the server) still shows the device as Status: Idle Is_Online: false. Do you have any insight into why this would be happening? The only error I see in journalctl on the device is LogBackend: server responded with status code: 504. Could this be the issue?