DNS failure not caught by supervisor

It looks like this:

> node-gyp rebuild

gyp ERR! configure error
gyp ERR! stack Error: Can't find Python executable "python", you can set the PYTHON env variable.
gyp ERR! stack     at PythonFinder.failNoPython (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:484:19)
gyp ERR! stack     at PythonFinder.<anonymous> (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:406:16)
gyp ERR! stack     at F (/usr/local/lib/node_modules/npm/node_modules/which/which.js:68:16)
gyp ERR! stack     at E (/usr/local/lib/node_modules/npm/node_modules/which/which.js:80:29)
gyp ERR! stack     at /usr/local/lib/node_modules/npm/node_modules/which/which.js:89:16
gyp ERR! stack     at /usr/local/lib/node_modules/npm/node_modules/isexe/index.js:42:5
gyp ERR! stack     at /usr/local/lib/node_modules/npm/node_modules/isexe/mode.js:8:5
gyp ERR! stack     at FSReqWrap.oncomplete (fs.js:154:21)
gyp ERR! System Linux 4.15.0-45-generic
gyp ERR! command "/usr/local/bin/node" "/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild"
gyp ERR! cwd /usr/src/app/node_modules/abstract-socket
gyp ERR! node -v v10.14.0
gyp ERR! node-gyp -v v3.8.0
gyp ERR! not ok

added 191 packages from 229 contributors and audited 878 packages in 12.729s
found 5433 vulnerabilities (3916 low, 13 moderate, 1503 high, 1 critical)
  run `npm audit fix` to fix them, or `npm audit` for details
npm WARN basestation-networking@1.0.0 No repository field.
npm WARN basestation-networking@1.0.0 No license field.
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: abstract-socket@2.0.0 (node_modules/abstract-socket):
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: abstract-socket@2.0.0 install: `node-gyp rebuild`
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: Exit status 1

Oh, you’re looking at build logs. Yes, those are gyp errors that aren’t relevant. I don’t recall the particulars as it was explained to me over a year ago now, but something something attempts to rebuild a specific version, something something falls back to some pre-built version. Cleaning our builds of all such warnings/errors is on our rather large to-do list, however.
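(Judging by the log, it’s the usual node-gyp complaint that the build stage has no Python/C toolchain, so the optional abstract-socket addon fails to compile and npm simply skips it. If we ever get to that to-do item, something like the following would probably quiet it down; this assumes an Alpine-based node image and is a sketch, not our actual Dockerfile:)

# Sketch of a fix for the build stage (Alpine-based node image assumed;
# node-gyp 3.x wants Python 2, so newer Alpine may need "python2" instead,
# and Debian-based images would use apt-get).
apk add --no-cache python make g++
npm install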

Our networking container is working fine with no running errors that we’re aware of. Networking simply ensures that the relevant settings are in place (using DBus) for the modem/sim so that we have 3/4G connectivity.
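(For anyone following along: the usual way a balena service talks to the host’s NetworkManager/ModemManager is to carry the io.balena.features.dbus label in docker-compose.yml and point DBUS_SYSTEM_BUS_ADDRESS at the host’s system bus. A generic sketch, not necessarily our exact calls:)

# Generic sketch of host D-Bus access from a container (needs the
# io.balena.features.dbus label on the service and dbus tools in the image).
export DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket
dbus-send --system --print-reply \
  --dest=org.freedesktop.NetworkManager \
  /org/freedesktop/NetworkManager \
  org.freedesktop.DBus.Properties.Get \
  string:org.freedesktop.NetworkManager string:ActiveConnections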

Hi Aidan,

Ok. Just wanted to confirm. Is it possible to restart all the containers so we can recheck name resolution?

John

Restarting all now.

Containers appear to be able to talk to each other again - what black magic went on in the background to make that happen?

Hi Aidan,
Something we found before rebooting the containers, and which we find strange, is that the default balena-engine bridge only contained the mqtt container. And name resolution was working from inside that container (via the 127.0.0.2 default resolver), but not from inside the other containers.
In comparison, after rebooting the containers all of them are part of the default network.
So what we are looking for is the cause of containers leaving the default bridge.
We will discuss internally to try and hypothesize something and come back to you.
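(For reference, a quick way to check bridge membership from the host OS next time this happens; the network name follows the <appId>_default pattern, so on this device it’s 1390781_default:)

# List every container attached to the app's default network, with its IP.
balena network inspect 1390781_default \
  --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'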

Ok, cool, thank you very much.

Hi,

So this is indeed a very weird one. Once a container is created it isn’t possible to change its network, so how it ended up with some of the services missing from the default bridge is beyond me. Also, you mentioned that it’s only DNS which was affected, and that direct IP comms was OK; this doesn’t tally with the services being removed from the network.

I think for this we’re going to need a reproduction of the issue and then try and determine what steps led up to it. If this happens again then please let us know and we can try and dive in and investigate ASAP.
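In the meantime, if it does recur before we get there, a capture along these lines from the host OS would give us a starting point (a rough sketch; unit and file names may differ slightly between OS versions):

# Grab engine logs and the network/container state while the device is broken.
journalctl -u balena.service --no-pager > /tmp/engine.log
balena network inspect 1390781_default > /tmp/default-network.json
for c in $(balena ps -q); do balena inspect "$c"; done > /tmp/containers.json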

Well, lucky you. It looks suspiciously like it’s doing it right now.
I’ve enabled support for a week - why bother with days at this point.

Same device.

As an example, the UPS container happens to have curl installed, and maintenance has a ping endpoint we can hit:

Hitting the name fails - could not resolve host.
Hitting the IP succeeds.
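Roughly, from inside the UPS container (the /ping path and the IP shown here are illustrative rather than exact):

curl http://maintenance/ping     # fails: curl: (6) Could not resolve host: maintenance
curl http://172.17.0.4/ping      # succeeds when aimed at the container's IP directly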

Good afternoon,

Any thoughts on the matter yet?

I thought I’d just reiterate a note that may have been missed in my original post:

The device was running v2.38.0+rev1/9.15.7, and not once in nearly a year of running on those OS/Supervisor versions had we seen this issue.

It’s only since updating it to the 2.51.1+rev1/11.4.10 versions that it has started doing this.

I think it’s possibly a fairly common occurrence, actually, that may have been/is being hidden by the PiJuice hardware watchdog restarting the units before we notice. This particular unit is old enough that it doesn’t have a PiJuice in it, thus standing out.

I’m currently taking a look at PiWatcher as a watchdog replacement for the PiJuice that we’re phasing out, and my dev device has just gone into the same state (no longer hidden by the PiJuice restarting, because it has a non-configured PiWatcher in it instead).

Hi there, we’ve got this ticket escalated to engineering at present, since the investigation on Friday revealed the device has somehow managed to be running two containers on the default bridge with the same MAC/IP:

balena inspect 5303ea44ca39 (mqtt)

            "Networks": {
                "1390781_default": {
                    "Aliases": [
                        "mqtt",
                        "5303ea44ca39"
                    ],
                    "NetworkID": "182e9c726b791bd9b3c21634c2b219f4d34b1c5c870758d2418af4418888f38c",
                    "EndpointID": "d99d28bb60f97613ddd8ad59cc0a47ac0be45839028eb54eb7c22819d0dae15d",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "MacAddress": "02:42:ac:11:00:02", ...



balena inspect 649b2d719d47 (networking)

            "Networks": {
                "1390781_default": {
                    "Aliases": [
                        "networking",
                        "649b2d719d47"
                    ],
                    "NetworkID": "182e9c726b791bd9b3c21634c2b219f4d34b1c5c870758d2418af4418888f38c",
                    "EndpointID": "99a1e937f2d736ed7429ef2a4c54d173f8052182d37857f480a3e8e757c5be6e",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "MacAddress": "02:42:ac:11:00:02", ...

This obviously shouldn’t be happening, so we are keen to find out why.
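(For reference, a scripted way to run the same check from the host OS: print each container’s name, IP and MAC, and look for duplicates.)

# Duplicate 172.17.0.2 / 02:42:ac:11:00:02 entries stand out immediately.
for c in $(balena ps -q); do
  balena inspect --format '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}} {{.MacAddress}}{{end}}' "$c"
done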

Ah, interesting. Fair enough, thanks for the update.

Do you want the device kept in the current state or can I restart it and get it capturing data again (for however long it’ll stay working, at least…)?

Hi Aidan, I’ll just try to capture some more logs now, and you can restart the device after that, I’ll let you know when I’m done. Thanks for helping out!


I’ve fetched all the diagnostics that we need, so you can go ahead and restart the device. Let us know if you hit the same issue again, and we will reach out if we need to look into the device more.


Thanks a lot. I’ve restarted and it’s doing its thing.

I’ll keep it running off to the side, check in on it as time ticks on, and yell whenever it dies again.

Do you think you could share the app that made it get to this state? Or some sort of way of reproducing the problem on our side?

I’ve granted support access to the whole app for you - once again for a whole week because why not.

The device you’ve been looking at so far (once again in the error state, btw) is more or less the only one without a PiJuice. All the others have one and so, if they happen to also get into that state, should only display as such for ~10m before getting restarted and the issue disappearing.

I have no idea how to reproduce the issue, unfortunately, or else I’d have been doing that to help track down the cause myself.

One bit of info I can provide is the time at which this seems to have happened most recently: somewhere between 17:21:04 and 17:21:10 (UTC) yesterday.

Device literally just gained the issue 4 minutes ago!

So, looking at this issue again, the mqtt container seems to be the only one for which DNS queries work; it can resolve external names and can resolve itself via the Docker local DNS server on the default bridge:

# cat /etc/resolv.conf 
nameserver 127.0.0.11
...

# nslookup google.com 127.0.0.11
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:
Name:   google.com
Address: 2a00:1450:4009:808::200e

# nslookup mqtt 127.0.0.11
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:
Non-authoritative answer:
Name:   mqtt
Address: 172.17.0.2

Is there anything different about how that container is configured? Are there any iptables rules specified for example?
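(If it helps with comparing: the embedded DNS at 127.0.0.11:53 is reached via NAT rules that the engine installs inside each container’s own network namespace, so a side-by-side of a working and a broken container might look like this, assuming iptables and nslookup exist in the images:)

# Run inside both a working and a broken container:
cat /etc/resolv.conf       # should point at nameserver 127.0.0.11
iptables -t nat -L -n      # the rules that redirect 127.0.0.11:53 to the embedded DNS
nslookup mqtt 127.0.0.11   # does the embedded DNS answer at all?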

Edit: the MQTT container on a device currently showing the issue cannot resolve anything, unlike what your logs above suggest it managed.

Nyet. Not unless MQTT does something itself under the hood to that effect.
The entire sum total of that container:

mosquitto.conf (screenshot)

docker.template (screenshot)

docker-compose.yml entry (screenshot)

It is, by a large margin, the most basic of our containers.

The only container we have that plays with IP routing is hotspot, enabling internet access for things connected to the wifi point we run.
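To give an idea of the shape of that, it’s standard forwarding/NAT applied once at startup, along these lines (interface names here are placeholders, not our exact config):

# Typical hotspot plumbing, applied once when the container starts.
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o wwan0 -j MASQUERADE
iptables -A FORWARD -i wlan0 -o wwan0 -j ACCEPT
iptables -A FORWARD -i wwan0 -o wlan0 -m state --state RELATED,ESTABLISHED -j ACCEPT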

P.S. One additional note: any configuration that does happen is done once, at container/app startup, for all containers. Once they’re up and running, they just do their thing. We don’t monitor anything in particular outside of our own stuff, and we don’t really react to x, y or z happening with the device, other than simply rebooting if something (of ours) isn’t working and hasn’t managed to recover. We don’t keep playing with settings/configs as time goes on at all.