DNS failure not caught by supervisor

Hi Aidan,
Something strange we found before rebooting the containers is that the default balena-engine bridge only contained the mqtt container. Name resolution was working from inside that container (via the 127.0.0.11 default resolver), but not from inside the other containers.
In comparison, after rebooting the containers all of them are part of the default network.
So what we are looking for is the cause for containers to leave the default bridge.
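For reference, one quick way to check which containers are attached to that network from the host is something along these lines (a sketch; it assumes the stock balena-engine CLI, invoked as "balena" on the host OS, and the app's default network, which on this device is named 1390781_default):

# List every container currently attached to the app's default network,
# along with the IP and MAC it holds on that network.
balena network inspect 1390781_default \
  --format '{{range .Containers}}{{println .Name .IPv4Address .MacAddress}}{{end}}'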
We will discuss internally to try and hypothesize something and come back to you.

Ok, cool, thank you very much.

Hi,

So this is indeed a very weird one. Once a container is created it isn't possible to change its network, so how it ended up with some of the services missing from the default bridge is beyond me. Also, you mentioned that it's only DNS which was affected, and that direct IP comms were OK; this doesn't tally with the services being removed from the network.

I think for this we're going to need a reproduction of the issue and then try and determine what steps led up to it. If this happens again then please let us know and we can try and dive in and investigate ASAP.

Well, lucky you. It looks suspiciously like it's doing it right now.
I've enabled support for a week - why bother with days at this point.

Same device.

As an example: the UPS container happens to have curl installed, and maintenance has a ping endpoint we can hit:
[screenshot]

Hitting the name fails - could not resolve host.
Hitting the IP succeeds.
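(In text form, this is roughly what the screenshot shows; the endpoint path and IP here are illustrative placeholders rather than the exact values.)

# Run from inside the UPS container; "maintenance" is the service name on the app network.
curl http://maintenance/ping     # fails: "Could not resolve host: maintenance"
curl http://172.17.0.5/ping      # succeeds when hitting the container's IP directly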

Good afternoon,

Any thoughts on the matter yet?

I thought I'd just reiterate a note that may have been missed in my original post:

The device was running v2.38.0+rev1/9.15.7, and not once in nearly a year of running on those OS/supervisor versions did it have this issue.

It's only since updating it to the 2.51.1+rev1/11.4.10 versions that it has started to do this.

I think it's possibly a fairly common occurrence, actually, that may have been/is being hidden by the PiJuice hardware watchdog restarting the units before we notice. This particular unit is old enough that it doesn't have a PiJuice in it, thus standing out.

I'm currently taking a look at the PiWatcher as a watchdog replacement for the PiJuice that we're phasing out, and my dev device has just gone into the same state (no longer hidden by the PiJuice restarting, because it's got a non-configured PiWatcher in it instead).

Hi there, we've got this ticket escalated to engineering at present, since the investigation on Friday revealed the device has somehow managed to be running two containers on the default bridge with the same MAC/IP:

balena inspect 5303ea44ca39 (mqtt)

            "Networks": {
                "1390781_default": {
                    "Aliases": [
                        "mqtt",
                        "5303ea44ca39"
                    ],
                    "NetworkID": "182e9c726b791bd9b3c21634c2b219f4d34b1c5c870758d2418af4418888f38c",
                    "EndpointID": "d99d28bb60f97613ddd8ad59cc0a47ac0be45839028eb54eb7c22819d0dae15d",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "MacAddress": "02:42:ac:11:00:02", ...



balena inspect 649b2d719d47 (networking)

            "Networks": {
                "1390781_default": {
                    "Aliases": [
                        "networking",
                        "649b2d719d47"
                    ],
                    "NetworkID": "182e9c726b791bd9b3c21634c2b219f4d34b1c5c870758d2418af4418888f38c",
                    "EndpointID": "99a1e937f2d736ed7429ef2a4c54d173f8052182d37857f480a3e8e757c5be6e",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "MacAddress": "02:42:ac:11:00:02", ...

This obviously shouldn't be happening, so we are keen to find out why.
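In case it helps with checking other devices, here's a rough way to scan for this kind of clash from the host (a sketch only; it assumes the stock engine CLI, invoked as "balena" on balenaOS):

# Dump name, IP and MAC for every running container, sorted so duplicates sit next to each other.
for c in $(balena ps -q); do
  balena inspect --format \
    '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}} {{.MacAddress}}{{end}}' "$c"
done | sort -k2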

Ah, interesting. Fair enough, thanks for the update.

Do you want the device kept in the current state or can I restart it and get it capturing data again (for however long it'll stay working, at least...)?

Hi Aidan, I'll just try to capture some more logs now, and you can restart the device after that; I'll let you know when I'm done. Thanks for helping out!

I've fetched all the diagnostics that we need, so you can go ahead and restart the device. Let us know if you hit the same issue again, and we will reach out if we need to look into the device more.

Thanks a lot. I've restarted and it's doing its thing.

I'll keep it set up to the side, check in on it as time ticks on, and yell whenever it dies again.

Do you think you could share the app that made it get to this state? Or some sort of way of reproducing the problem on our side?

I've granted support access to the whole app for you - once again for a whole week, because why not.

The device you've been looking at so far (once again in the error state, btw) is more or less the only one without a PiJuice. All the others have one and so, if they happen to also get into that state, should only display as such for ~10m before getting restarted and the issue disappearing.

I have no idea how to reproduce the issue, unfortunately, or else I'd have been doing that to help track down the cause myself.

One bit of info I can provide is the time at which this seems to have happened most recently...


Somewhere between 17:21:04 and 17:21:10 (UTC) yesterday.
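If it's of any use, something like this on the host should pull the engine and supervisor journals around that window (a sketch; unit names vary slightly between balenaOS releases, e.g. balena vs. balena-engine and resin-supervisor vs. balena-supervisor, and the device clock is UTC):

# Grab logs from just before and after the failure window noted above.
journalctl -u balena -u resin-supervisor \
  --since "yesterday 17:15:00" --until "yesterday 17:25:00"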

Device literally just gained the issue 4 minutes ago!

So looking at this issue again, the mqtt container seems to be the only one for which DNS queries work: it can resolve external names, and it can resolve itself via the Docker local DNS server on the default bridge:

# cat /etc/resolv.conf 
nameserver 127.0.0.11
...

# nslookup google.com 127.0.0.11
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:
Name:   google.com
Address: 2a00:1450:4009:808::200e

# nslookup mqtt 127.0.0.11
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:
Non-authoritative answer:
Name:   mqtt
Address: 172.17.0.2

Is there anything different about how that container is configured? Are there any iptables rules specified for example?
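One thing that might be worth checking from inside one of the failing containers (a sketch, assuming balena-engine behaves like upstream Docker here and that the image has iptables available): the 127.0.0.11 resolver is just a NAT redirect into the engine's embedded DNS server, set up per container network namespace, so we can see whether that plumbing is still in place.

# Run inside an affected container.
iptables -t nat -S DOCKER_OUTPUT
iptables -t nat -S DOCKER_POSTROUTING
# If these chains are missing or empty, queries to 127.0.0.11:53 have nowhere to go,
# which would match "names fail, direct IPs work".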

Edit: the MQTT container on a device currently showing the issue cannot resolve names, unlike what your logs suggest it managed:
[screenshot]

Nyet. Not unless MQTT does something itself under the hood to that effect.
The entire sum total of that container:

mosquitto.conf
[screenshot]

docker.template
[screenshot]

docker-compose.yml entry
[screenshot]

It is, by a large margin, the most basic of our containers.

The only container we have that plays with IP routing is hotspot, enabling internet access for things connected to the wifi point we run.

P.S. Additional note: Any and all configuration, if any is done, is done once at container/app startup, for all containers. Once they're up and running, they just do their thing. We don't monitor anything in particular outside of our own stuff, and we don't really react to x, y or z happening with the device, other than simply rebooting if something (of ours) isn't working and hasn't managed to recover. We don't keep playing with settings/configs as time goes on at all.

Hey there! We haven't seen similar issues on any other balena app yet, which makes me think there might be something in the application code sporadically causing this. While we continue investigating, and given that the hotspot container is the only one that plays with routing, would it be possible to comment out the hotspot container, push an update to just one device, and see if the issue happens again?

Possible, yes, so I guess I'll see about doing just that when I have a moment.

On the OS/supervisor side of things, assuming you can enumerate such things, what changes have been made between 2.38.0+rev1/9.15.7 and 2.51.1+rev1/11.4.10 that may be related to this?

As I've said multiple times now, this device had no issues while it was on the older OS/supervisor for an entire year - it only started displaying this issue once pulled up to date. Therefore, what has happened on the OS/supervisor side that may contribute to this?

Hi there - thanks for getting back to us. I understand your frustration, and we thank you for your patience as we work to find out what's going on here.

As my colleague noted, we have not seen this error on other devices, so we're doing our best to investigate this and understand what's happening. We will update you on that as soon as we can. In the meantime, as you mentioned, if you're able to try your application with the hotspot container commented out, this may help narrow down the problem.

The supervisor and balenaOS are both open source applications, with changelogs published as part of their repos on Github:

  • Supervisor
  • balenaOS (note that the repo is named "meta-balena", but is used for building balenaOS images)

One thing I note in that changelog is a bug, similar to what we're seeing, regarding DNS for local containers. A fix for this has been released as part of balenaOS v2.54.2, which is currently available on our staging environment (https://dashboard.balena-staging.com). I'll check with our team to see when we should have that out in production, or if there is a way to test this by manually adding a firewall rule that matches this fix. The firewall code has changed a fair bit between what you're running and the newer version, so I'm not certain that this applies; however, it's worth investigating. We'll get back to you with more information.

Thanks again,
Hugh

Hola,

We very much appreciate the attention you are showing our issue. The fact that there are so many names all jumping into the conversation makes us feel warm and fuzzy.
That or there's only really one of you that can't decide who he wants to be each day...

Thanks for the reminder that it is open source - surprisingly, in this day and age, not something I tend to think about. I'll have a poke through, but it's no substitute for those who know it inside and out. From the horse's mouth and all that.

[Time has passed]

Yep, distracted by the logs.

I can see about 5 things between the 9.15.4-11.4.10 versions that directly relate in some fashion (other than the JS rewrites, which could cover any number of things) - i.e. contain any reference to network/ip/dns/fw/etc.
And since 11.4.10 there look to be about 5 more network-related entries, up to the current 11.12.11.

Similarly, 3/2 split for the OS versions.

Ok then, I guess the sensible course is to do as desired and get one up and rocking without hotspot. If that results in no change, then - short of any other immediate suggestions regarding our containers - we play with the staging builds to see if any of those fixes do the job?

Hi,

One of my teammates previously noticed that two containers - mqtt and networking - are getting the same MAC address, which could be hosing the rest of DNS. A couple of things you could try: explicitly assigning a MAC address to each of those services in your docker-compose.yml, or disabling one of the two (and temporarily removing any depends_on references) in your next build and seeing if DNS works without that container running.
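If a rebuild is a hassle, a rough host-side approximation of that second suggestion might look like the sketch below (the container ID is from the earlier inspects, the endpoint path is a placeholder, and the supervisor will likely restart the stopped service shortly, so it's only a quick check):

# Stop one of the two clashing containers, then retry name resolution from another service.
balena stop 649b2d719d47      # the "networking" container
balena exec -it "$(balena ps -qf name=ups)" curl http://maintenance/ping
balena start 649b2d719d47     # put it back afterwards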

Also, are you using any other build tools as part of your build (such as Portainer), or are you just running balena push <app-name> with docker-compose.yml and Dockerfiles?

Thanks,
John