DNS failure not caught by supervisor

Hi Aidan,
Something strange we found before rebooting the containers is that the default balena-engine bridge only contained the mqtt container. Name resolution was working from inside that container (via the 127.0.0.11 default resolver), but not from inside the other containers.
In comparison, after rebooting the containers all of them are part of the default network.
So what we are looking for is the cause for containers to leave the default bridge.
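For reference, one quick way to check which containers are attached to that network from the host is something along these lines (a sketch; it assumes the stock balena-engine CLI, invoked as "balena" on the host OS, and the app's default network, which on this device is named 1390781_default):

# List every container currently attached to the app's default network,
# along with the IP and MAC it holds on that network.
balena network inspect 1390781_default \
  --format '{{range .Containers}}{{println .Name .IPv4Address .MacAddress}}{{end}}'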
We will discuss internally to try and hypothesize something and come back to you.

Ok, cool, thank you very much.

Hi,

So this is indeed a very weird one. Once a container is created it isn't possible to change its network, so how it ended up with some of the services missing from the default bridge is beyond me. Also, you mentioned that it's only DNS which was affected, and that direct IP comms were OK; this doesn't tally with the services being removed from the network.

I think for this we're going to need a reproduction of the issue and then try and determine what steps led up to it. If this happens again then please let us know and we can try and dive in and investigate ASAP.

Well, lucky you. It looks suspiciously like it's doing it right now.
I've enabled support for a week - why bother with days at this point.

Same device.

As an example: the UPS container happens to have curl installed, and maintenance has a ping endpoint we can hit:
[screenshot]

Hitting the name fails - could not resolve host.
Hitting the IP succeeds.
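(In text form, this is roughly what the screenshot shows; the endpoint path and IP here are illustrative placeholders rather than the exact values.)

# Run from inside the UPS container; "maintenance" is the service name on the app network.
curl http://maintenance/ping     # fails: "Could not resolve host: maintenance"
curl http://172.17.0.5/ping      # succeeds when hitting the container's IP directly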

Good afternoon,

Any thoughts on the matter yet?

I thought I'd just reiterate a note that may have been missed in my original post:

The device was running v2.38.0+rev1/9.15.7, and not once in nearly a year of running on those OS/supervisor versions did it have this issue.

It's only since updating it to the 2.51.1+rev1/11.4.10 versions that it has started to do this.

I think it's possibly a fairly common occurrence, actually, that may have been/is being hidden by the PiJuice hardware watchdog restarting the units before we notice. This particular unit is old enough that it doesn't have a PiJuice in it, thus standing out.

I'm currently taking a look at the PiWatcher as a watchdog replacement for the PiJuice that we're phasing out, and my dev device has just gone into the same state (no longer hidden by the PiJuice restarting, because it's got a non-configured PiWatcher in it instead).

Hi there, we've got this ticket escalated to engineering at present, since the investigation on Friday revealed the device has somehow managed to be running two containers on the default bridge with the same MAC/IP:

balena inspect 5303ea44ca39 (mqtt)

            "Networks": {
                "1390781_default": {
                    "Aliases": [
                        "mqtt",
                        "5303ea44ca39"
                    ],
                    "NetworkID": "182e9c726b791bd9b3c21634c2b219f4d34b1c5c870758d2418af4418888f38c",
                    "EndpointID": "d99d28bb60f97613ddd8ad59cc0a47ac0be45839028eb54eb7c22819d0dae15d",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "MacAddress": "02:42:ac:11:00:02", ...



balena inspect 649b2d719d47 (networking)

            "Networks": {
                "1390781_default": {
                    "Aliases": [
                        "networking",
                        "649b2d719d47"
                    ],
                    "NetworkID": "182e9c726b791bd9b3c21634c2b219f4d34b1c5c870758d2418af4418888f38c",
                    "EndpointID": "99a1e937f2d736ed7429ef2a4c54d173f8052182d37857f480a3e8e757c5be6e",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "MacAddress": "02:42:ac:11:00:02", ...

This obviously shouldn't be happening, so we are keen to find out why.
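In case it helps with checking other devices, here's a rough way to scan for this kind of clash from the host (a sketch only; it assumes the stock engine CLI, invoked as "balena" on balenaOS):

# Dump name, IP and MAC for every running container, sorted so duplicates sit next to each other.
for c in $(balena ps -q); do
  balena inspect --format \
    '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}} {{.MacAddress}}{{end}}' "$c"
done | sort -k2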

Ah, interesting. Fair enough, thanks for the update.

Do you want the device kept in the current state or can I restart it and get it capturing data again (for however long it'll stay working, at least...)?

Hi Aidan, I'll just try to capture some more logs now, and you can restart the device after that; I'll let you know when I'm done. Thanks for helping out!

I've fetched all the diagnostics that we need, so you can go ahead and restart the device. Let us know if you hit the same issue again, and we will reach out if we need to look into the device more.

Thanks a lot. I've restarted and it's doing its thing.

I'll keep it set up to the side, check in on it as time ticks on, and yell whenever it dies again.

Do you think you could share the app that made it get to this state? Or some sort of way of reproducing the problem on our side?

I've granted support access to the whole app for you - once again for a whole week, because why not.

The device you've been looking at so far (once again in the error state, btw) is more or less the only one without a PiJuice. All the others have one and so, if they happen to also get into that state, should only display as such for ~10m before getting restarted and the issue disappearing.

I have no idea how to reproduce the issue, unfortunately, or else I'd have been doing that to help track down the cause myself.

One bit of info I can provide is the time at which this seems to have happened most recently...


Somewhere between 17:21:04 and 17:21:10 (UTC) yesterday.
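If it's of any use, something like this on the host should pull the engine and supervisor journals around that window (a sketch; unit names vary slightly between balenaOS releases, e.g. balena vs. balena-engine and resin-supervisor vs. balena-supervisor, and the device clock is UTC):

# Grab logs from just before and after the failure window noted above.
journalctl -u balena -u resin-supervisor \
  --since "yesterday 17:15:00" --until "yesterday 17:25:00"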

Device literally just gained the issue 4 minutes ago!

So looking at this issue again, the mqtt container seems to be the only one for which DNS queries work: it can resolve external names, and it can resolve itself via the Docker local DNS server on the default bridge:

# cat /etc/resolv.conf 
nameserver 127.0.0.11
...

# nslookup google.com 127.0.0.11
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:
Name:   google.com
Address: 2a00:1450:4009:808::200e

# nslookup mqtt 127.0.0.11
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:
Non-authoritative answer:
Name:   mqtt
Address: 172.17.0.2

Is there anything different about how that container is configured? Are there any iptables rules specified for example?
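One thing that might be worth checking from inside one of the failing containers (a sketch, assuming balena-engine behaves like upstream Docker here and that the image has iptables available): the 127.0.0.11 resolver is just a NAT redirect into the engine's embedded DNS server, set up per container network namespace, so we can see whether that plumbing is still in place.

# Run inside an affected container.
iptables -t nat -S DOCKER_OUTPUT
iptables -t nat -S DOCKER_POSTROUTING
# If these chains are missing or empty, queries to 127.0.0.11:53 have nowhere to go,
# which would match "names fail, direct IPs work".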

Edit: the MQTT container on a device currently showing the issue cannot resolve names, unlike what your logs suggest it managed:
[screenshot]

Nyet. Not unless MQTT does something itself under the hood to that effect.
The entire sum total of that container:

mosquitto.conf
[screenshot]

docker.template
[screenshot]

docker-compose.yml entry
[screenshot]

It is, by a large margin, the most basic of our containers.

The only container we have that plays with IP routing is hotspot, enabling internet access for things connected to the wifi point we run.

P.S. Additional note: Any and all configuration, if any is done, is done once at container/app startup, for all containers. Once they're up and running, they just do their thing. We don't monitor anything in particular outside of our own stuff, and we don't really react to x, y or z happening with the device, other than simply rebooting if something (of ours) isn't working and hasn't managed to recover. We don't keep playing with settings/configs as time goes on at all.

Hey there! We haven't seen similar issues on any other balena app yet, which makes me think there might be something in the application code sporadically causing this. While we continue investigating, and given that the hotspot container is the only one that plays with routing, would it be possible to comment out the hotspot container, push an update to just one device, and see if the issue happens again?

Possible, yes, so I guess I'll see about doing just that when I have a moment.

On the OS/supervisor side of things, assuming you can enumerate such things, what changes have been made between 2.38.0+rev1/9.15.7 and 2.51.1+rev1/11.4.10 that may be related to this?

As I've said multiple times now, this device had no issues while it was on the older OS/supervisor for an entire year - it only started displaying this issue once pulled up to date. Therefore, what has happened on the OS/supervisor side that may contribute to this?

Hi there - thanks for getting back to us. I understand your frustration, and we thank you for your patience as we work to find out what's going on here.

As my colleague noted, we have not seen this error on other devices, so we're doing our best to investigate this and understand what's happening. We will update you on that as soon as we can. In the meantime, as you mentioned, if you're able to try your application with the hotspot container commented out, this may help narrow down the problem.

The supervisor and balenaOS are both open source applications, with changelogs published as part of their repos on Github:

  • Supervisor
  • balenaOS (note that the repo is named "meta-balena", but is used for building balenaOS images)

One thing I note in that changelog is a bug, similar to what we're seeing, regarding DNS for local containers. A fix for this has been released as part of balenaOS v2.54.2, which is currently available on our staging environment (https://dashboard.balena-staging.com). I'll check with our team to see when we should have that out in production, or if there is a way to test this by manually adding a firewall rule that matches this fix. The firewall code has changed a fair bit between what you're running and the newer version, so I'm not certain that this applies; however, it's worth investigating. We'll get back to you with more information.

Thanks again,
Hugh

Hola,

We very much appreciate the attention you are showing our issue. The fact that there are so many names all jumping into the conversation makes us feel warm and fuzzy.
That or there's only really one of you that can't decide who he wants to be each day...

Thanks for the reminder that it is open source - surprisingly, in this day and age, not something I tend to think about. I'll have a poke through, but it's no substitute for those who know it inside and out. From the horse's mouth and all that.

[Time has passed]

Yep, distracted by the logs.

I can see about 5 things between the 9.15.4-11.4.10 versions that directly relate in some fashion (other than the JS rewrites, which could cover any number of things) - i.e. contain any reference to network/ip/dns/fw/etc.
And since 11.4.10 there look to be about 5 more network-related entries, up to the current 11.12.11.

Similarly, 3/2 split for the OS versions.

Ok then, I guess the sensible course is to do as desired and get one up and rocking without hotspot. If that results in no change, then - short of any other immediate suggestions regarding our containers - we play with the staging builds to see if any of those fixes do the job?

Hi,

One of my teammates previously noticed that two containers - mqtt and networking - are getting the same MAC address, which could be hosing the rest of DNS. A couple of things you could try: explicitly assigning a MAC address to each of those services in your docker-compose.yml, or disabling one of the two (and temporarily removing any depends_on references) in your next build and seeing if DNS works without that container running.
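If a rebuild is a hassle, a rough host-side approximation of that second suggestion might look like the sketch below (the container ID is from the earlier inspects, the endpoint path is a placeholder, and the supervisor will likely restart the stopped service shortly, so it's only a quick check):

# Stop one of the two clashing containers, then retry name resolution from another service.
balena stop 649b2d719d47      # the "networking" container
balena exec -it "$(balena ps -qf name=ups)" curl http://maintenance/ping
balena start 649b2d719d47     # put it back afterwards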

Also, are you using any other build tools as part of your build (such as Portainer), or are you just running balena push <app-name> with docker-compose.yml and Dockerfiles?

Thanks,
John