Oh, you’re looking at build logs. Yes, those are gyp errors that aren’t relevant. I don’t recall the particulars, as it was explained to me over a year ago now, but something something attempts to rebuild a specific version, something something falls back to some pre-built version. Cleaning our builds of all such warnings/errors is on our rather large to-do list, however.
Our networking container is working fine, with no runtime errors that we’re aware of. It simply ensures (using DBus) that the relevant settings are in place for the modem/SIM so that we have 3G/4G connectivity.
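For the curious, the check is conceptually something like this minimal sketch using dbus-python against NetworkManager (the bus/property names are standard NetworkManager ones; this is illustrative, not our exact code):

```python
import dbus

# NetworkManager lives on the system bus
bus = dbus.SystemBus()
nm = bus.get_object('org.freedesktop.NetworkManager',
                    '/org/freedesktop/NetworkManager')
props = dbus.Interface(nm, 'org.freedesktop.DBus.Properties')

# NM_STATE_CONNECTED_GLOBAL (70) means full connectivity
state = int(props.Get('org.freedesktop.NetworkManager', 'State'))
print('connected' if state >= 70 else 'not connected (state=%d)' % state)
```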
Hi Aidan,
Something strange we found before rebooting the containers: the default balena-engine bridge contained only the mqtt container. And name resolution was working from inside that container (via the 127.0.0.2 default resolver), but not from inside the other containers.
By comparison, after rebooting the containers, all of them are part of the default network.
So what we are looking for is the cause for containers to leave the default bridge.
We will discuss internally to try and hypothesize something and come back to you.
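For reference, this is roughly how we checked bridge membership, as a minimal sketch using the Docker SDK (docker-py) pointed at the balenaEngine socket (the path and approach are illustrative, not our exact tooling):

```python
import docker

# balenaEngine speaks the Docker API; point the SDK at its socket
client = docker.DockerClient(base_url='unix:///var/run/balena-engine.sock')

bridge = client.networks.get('bridge')  # the default bridge network
bridge.reload()                         # ensure the Containers map is populated

for info in bridge.attrs['Containers'].values():
    print(info['Name'], info['IPv4Address'], info['MacAddress'])
```

Before the reboot this listed only mqtt; afterwards it listed every service.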
So this is indeed a very weird one. Once a container is created it isn’t possible to change its network, so how it ended up with some of the services missing from the default bridge is beyond me. Also, you mentioned that it’s only DNS which was affected and that direct IP comms were OK; this doesn’t tally with the services being removed from the network.
I think for this we’re going to need a reproduction of the issue and then try and determine what steps led up to it. If this happens again then please let us know and we can try and dive in and investigate ASAP.
I thought I’d just reiterate a note that may have been missed in my original post:
The device was running v2.38.0+rev1/9.15.7, and not once in nearly a year of running on those OS/Supervisor versions did this issue occur.
It’s only since updating it to v2.51.1+rev1/11.4.10 that it has started to do this.
I actually think it’s possibly a fairly common occurrence that may have been, and is still being, hidden by the PiJuice hardware watchdog restarting the units before we notice. This particular unit is old enough that it doesn’t have a PiJuice in it, so it stands out.
I’m currently taking a look at the PiWatcher as a watchdog replacement for the PiJuice that we’re phasing out, and my dev device has just gone into the same state (no longer hidden by the PiJuice restarting, because it has an unconfigured PiWatcher in it instead).
Hi there, we’ve got this ticket escalated to engineering at present, since the investigation on Friday revealed the device has somehow managed to be running two containers on the default bridge with the same MAC/IP:
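For anyone wanting to check their own devices for this, a quick sketch with the same Docker SDK/socket assumptions as above:

```python
from collections import defaultdict

import docker

client = docker.DockerClient(base_url='unix:///var/run/balena-engine.sock')
bridge = client.networks.get('bridge')
bridge.reload()

# Group the bridge's endpoints by (MAC, IPv4) and report any pair
# claimed by more than one container
seen = defaultdict(list)
for info in bridge.attrs['Containers'].values():
    seen[(info['MacAddress'], info['IPv4Address'])].append(info['Name'])

for (mac, ip), names in seen.items():
    if len(names) > 1:
        print('duplicate %s / %s: %s' % (mac, ip, ', '.join(names)))
```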
Hi Aidan, I’ll just try to capture some more logs now, and you can restart the device after that; I’ll let you know when I’m done. Thanks for helping out!
I’ve fetched all the diagnostics that we need, so you can go ahead and restart the device. Let us know if you hit the same issue again, and we will reach out if we need to look into the device more.
I’ve granted support access to the whole app for you - once again for a whole week because why not.
The device you’ve been looking at so far (once again in the error state, btw) is more or less the only one without a PiJuice. All the others have one and so, if they happen to also get into that state, should only display as such for ~10m before getting restarted and the issue disappearing.
I have no idea how to reproduce the issue, unfortunately, or else I’d have been doing that to help track down the cause myself.
One bit of info I can provide is the time at which this seems to have happened most recently…
So, looking at this issue again, the mqtt container seems to be the only one for which DNS queries work: it can resolve external names, and it can resolve itself via the Docker-local DNS server on the default bridge:
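The kind of test involved, sketched here with dnspython (an assumption on our part; any resolver pointed at 127.0.0.2 would do, and google.com is just an example external name):

```python
import dns.resolver  # dnspython

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ['127.0.0.2']  # balenaEngine's embedded DNS on the bridge

# Try an internal service name and an external name
for name in ('mqtt', 'google.com'):
    try:
        answer = resolver.resolve(name)
        print(name, '->', [rr.to_text() for rr in answer])
    except Exception as exc:
        print(name, 'failed:', exc)
```

Run from inside mqtt, both lookups succeed; run from any of the other containers, they fail.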
P.S. Additional note: any configuration we do happens once, at container/app startup, for all containers. Once they’re up and running, they just do their thing. We don’t monitor anything in particular outside of our own stuff, and we don’t really react to x, y, or z happening with the device, other than simply rebooting if something (of ours) isn’t working and hasn’t managed to recover. We don’t keep playing with settings/configs as time goes on at all.