DNS failure not caught by supervisor

Good morning,

We seem to have an occasionally recurring issue wherein our containers no longer have working DNS and cannot talk to each other.

As an aside on this, we believe it to be a Balena-side issue that has been introduced sometime since OS/supervisor v2.38.0+rev1/9.15.7.

We have had one device up and running with nearly zero issues for just shy of a year now - previously running that OS/supervisor combination. This device was literally one of the first we put out there and frankly we have expected it to fall over a lot sooner (our software has improved since then, of course).

One thing of note about this device is that it doesn’t have a UPS (PiJuice) as it was before we introduced them. Therefore, we have had no hardware watchdog in place for this device at all.

On Friday, we decided it would make a good test device to take and update to the latest 2.21.1+rev1/11.4.10. If any issues popped up, this would give some insight into whether it’s our own software or anything underlying on the Balena-side of things that may be a cause.

Over the weekend, an issue popped up. It’s one we’ve seen a few times before on more recent devices (cannot provide anything more accurate than ‘recent’, I’m afraid).

All of the containers are unable to contact each other via name at all.
[image]

They can successfully communicate via IP:
[image]

This doesn’t help, however, because all the apps are using names so we don’t have to deal with IPs.
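For anyone wanting to detect this state from inside a container, a minimal sketch of a name-resolution check follows. It only tests whether each service name resolves, which matches the symptom described above (names fail, IPs work). The function name and the service names used in the usage note are illustrative assumptions, not taken from our code.

```python
import socket


def unresolved_names(names, resolve=socket.gethostbyname):
    """Return the names that fail to resolve, in the order given.

    `resolve` is injectable so the check can be exercised without real DNS;
    by default it uses the container's normal resolver.
    """
    failed = []
    for name in names:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            failed.append(name)
    return failed
```

Calling unresolved_names(["hotspot", "networking"]) periodically and treating a non-empty result as unhealthy would be one way to use it. What to do on failure (log, alert, restart) is a separate decision.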

Is this a known issue? Has anyone else come across this?

A simple restart solves the issue, but the issue itself isn’t detected automatically by the supervisor as one would hope/expect. A hardware WD also solves the issue - without our containers able to contact each other we can’t send a keepalive, and so it gets rebooted.

Given that our chosen test device has no hardware WD capability, that the only thing to have changed is the Balena OS/supervisor versions (not our own code), and that we know it has not had this issue for the past year, this seems to indicate an issue on the Balena side of the stack.

I realise that we can implement our own check for inter-container connectivity, but it doesn’t look like it’ll do us any good as any requests to the supervisor get refused, so we wouldn’t be able to issue a restart:

Any suggested/expected fixes?

Please note that I have not attempted any restarts of this device at the moment as I want to keep it in this state throughout this process should any diagnostics/troubleshooting steps be requested. While the device is in this state we are losing data, but figure that is a fair trade if we can fix this issue.

Thanks

Hello Aidan, my first suggestion would be to run device health checks and diagnostics from the Diagnostics tab.
That would get us an overview of the device’s state and all relevant logs to hopefully catch the problem. Good call on not rebooting the device, as unless you have persistent logging enabled (from the Device Configuration tab) you’ll lose all device state information.
Feel free to post here your results (screenshot or diagnostic logs) and we’ll take a look. You can DM me the files if you don’t want to expose any sensitive data.

Also, can you share what device you are running? balenaOS 2.21.1+rev1 might be a rather old version depending on your device (for some devices we are on 2.53 for instance).

The device is a BalenaFin!

We do have persistent logging enabled, but the device doesn’t have a Diagnostics tab. There’s a shiny button that says Grant support access - would that do in place?

That way you can have a poke around without needing to work through my fingers.

Hello Aidan, sure thing, grant us support access and share the device UUID with us, and we’ll take a look

Done: 99bc729de242c8617e2db2d3c35a11ed

Hello Aidan,
Is there any chance you can share the Dockerfile with us? More specifically, we want to check whether the services are using network:host.
We looked into the health check and found the following errors:
(The health checks you can find under Diagnostics on the vertical menu on the left of your dashboard)

Hola!

As mentioned originally, I don’t have a diagnostics option for this device. I have for others, but not this one:

The only change to networking mode is for our hotspot container within our docker-compose.yml:
[image]

On startup, Hotspot takes ownership of the wlan0 interface from Network Manager and runs its own instances of Dnsmasq and HostApd in order to control a wireless access point for our sensors to connect.

Other than that, our Networking container controls the modem settings via Dbus commands:
[image]
But as you can see, contains no network mode changes.

There’s no sign of :network:host changes in any of our Dockerfile.template files or the compose.

Hi @JuanFRidano,

Is there anything else to be gained for leaving the device in the current state? If so, fine, if not then we’ll get it restarted so that we’re no longer losing data.

Please let us know.

Hello,

“Is there anything else to be gained for leaving the device in the current state? If so, fine, if not then we’ll get it restarted so that we’re no longer losing data.”

If the device has not been rebooted, please allow support access to it so we can have another look.

A bit late, but to answer this:

“I realise that we can implement our own check for inter-container connectivity, but it doesn’t look like it’ll do us any good as any requests to the supervisor get refused, so we wouldn’t be able to issue a restart:”

In order to communicate with the supervisor, you need to add the io.balena.features.supervisor-api label to your container and use the BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY env vars. See https://www.balena.io/docs/reference/supervisor/supervisor-api/#http-api-reference
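As a rough illustration of the above, here is a minimal sketch of a reboot request to the supervisor from a container that carries the io.balena.features.supervisor-api label. It assumes the POST /v1/reboot endpoint with an apikey query parameter, as I recall it from the reference linked above; please verify against that page. The function names are illustrative, and whether the supervisor accepts the request while the device is in the broken DNS state is untested here.

```python
import os
import urllib.parse
import urllib.request


def build_reboot_request(address, api_key):
    """Build (but do not send) a POST to the supervisor's reboot endpoint."""
    query = urllib.parse.urlencode({"apikey": api_key})
    return urllib.request.Request(
        f"{address}/v1/reboot?{query}",
        data=b"",
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def reboot_device(timeout=10):
    """Send the reboot request using the env vars injected by the label."""
    request = build_reboot_request(
        os.environ["BALENA_SUPERVISOR_ADDRESS"],
        os.environ["BALENA_SUPERVISOR_API_KEY"],
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the request construction separate from the sending makes it possible to check the URL without touching a real device.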

Hi @zvin,

Still not rebooted, no - I have enabled support access for another day, please poke at will.

“In order to communicate with the supervisor, you need to add the io.balena.features.supervisor-api label to your container”

Ah, I thought it just gave you access to the env variables for it, not that it whitelisted your container. Fair enough.

Thanks, I’m having a look.

Hi,

I notice in the logs for your networking container that there are quite a few errors, starting with a Can't find Python executable "python" error. Can you double-check the Dockerfile for that service?

Thanks,
John

Good afternoon!

Where are you seeing these errors?
We’re aware of a variety of errors that get spewed out of that container for a variety of reasons, but that’s not a familiar one.

It certainly can find python, because it’s running…

Hi Aidan,

If you look at the latest Releases (under the app dashboard view, not the device view), click on the last successful release and then scroll down past your docker-compose.yml, then click on “networking” under Build logs. The error isn’t immediately apparent, but scrolling reveals errors in red.

John

It looks like this:

> node-gyp rebuild
gyp ERR! configure error
gyp ERR! stack Error: Can't find Python executable "python", you can set the PYTHON env variable.
gyp ERR! stack     at PythonFinder.failNoPython (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:484:19)
gyp ERR! stack     at PythonFinder.<anonymous> (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:406:16)
gyp ERR! stack     at F (/usr/local/lib/node_modules/npm/node_modules/which/which.js:68:16)
gyp ERR! stack     at E (/usr/local/lib/node_modules/npm/node_modules/which/which.js:80:29)
gyp ERR! stack     at /usr/local/lib/node_modules/npm/node_modules/which/which.js:89:16
gyp ERR! stack     at /usr/local/lib/node_modules/npm/node_modules/isexe/index.js:42:5
gyp ERR! stack     at /usr/local/lib/node_modules/npm/node_modules/isexe/mode.js:8:5
gyp ERR! stack     at FSReqWrap.oncomplete (fs.js:154:21)
gyp ERR! System Linux 4.15.0-45-generic
gyp ERR! command "/usr/local/bin/node" "/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild"
gyp ERR! cwd /usr/src/app/node_modules/abstract-socket
gyp ERR! node -v v10.14.0
gyp ERR! node-gyp -v v3.8.0
gyp ERR! not ok

added 191 packages from 229 contributors and audited 878 packages in 12.729s
found 5433 vulnerabilities (3916 low, 13 moderate, 1503 high, 1 critical)
  run `npm audit fix` to fix them, or `npm audit` for details
npm WARN basestation-networking@1.0.0 No repository field.
npm WARN basestation-networking@1.0.0 No license field.
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: abstract-socket@2.0.0 (node_modules/abstract-socket):
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: abstract-socket@2.0.0 install: `node-gyp rebuild`
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: Exit status 1

Oh, you’re looking at build logs. Yes, those are gyp errors that aren’t relevant. I don’t recall the particulars, as it was explained to me over a year ago now, but something something attempts to rebuild a specific version, something something falls back to some pre-built version. Cleaning our builds of all such warnings/errors is on our rather large to-do list, however.

Our networking container is working fine with no running errors that we’re aware of. Networking simply ensures that the relevant settings are in place (using DBus) for the modem/sim so that we have 3/4G connectivity.

Hi Aidan,

Ok. Just wanted to confirm. Is it possible to restart all the containers so we can recheck name resolution?

John

Restarting all now.

Containers appear to be able to talk to each other again - what black magic went on in the background to make that happen?