Hi Aidan,
Something we found strange before rebooting the containers is that the default balena-engine bridge contained only the mqtt container. Name resolution was working from inside that container (via the 127.0.0.11 embedded resolver), but not from inside the other containers.
In comparison, after rebooting the containers, all of them are part of the default network.
So what we are looking for is what causes containers to leave the default bridge.
We will discuss this internally, try to come up with a hypothesis, and come back to you.
Ok, cool, thank you very much.
Hi,
So this is indeed a very weird one. Once a container is created it isn't possible to change its network, so how it ended up with some of the services missing from the default bridge is beyond me. Also, you mentioned that it's only DNS which was affected, and that direct IP comms were OK; this doesn't tally with the services being removed from the network.
I think for this we're going to need a reproduction of the issue, and then we can try to determine what steps led up to it. If this happens again, please let us know and we can try to dive in and investigate ASAP.
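If or when it recurs, one quick thing to capture from the host would be the network's own view of its members - something along these lines, where <appid>_default stands in for the app's default network name:

# balena network inspect <appid>_default -f '{{range .Containers}}{{println .Name .IPv4Address .MacAddress}}{{end}}'

That gives us the engine-side membership list to compare against what each container reports.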
Well, lucky you. It looks suspiciously like it's doing it right now.
I've enabled support for a week - why bother with days at this point.
Same device.
As an example, the UPS container happens to have curl installed, and maintenance has a ping endpoint we can hit:
Hitting the name fails - could not resolve host.
Hitting the IP succeeds.
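For the record, roughly what that looks like from inside the UPS container (the path and IP below are placeholders - substitute whatever the maintenance service actually exposes and whatever balena inspect reports for it):

# curl http://maintenance/ping
curl: (6) Could not resolve host: maintenance
# curl http://172.17.0.5/ping
(normal response from the maintenance service)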
Good afternoon,
Any thoughts on the matter yet?
I thought I'd just reiterate a note that may have been missed in my original post:
The device was running v2.38.0+rev1/9.15.7, and not once in nearly a year of running on those OS/supervisor versions did it have this issue.
It's only since updating it to the 2.51.1+rev1/11.4.10 versions that it has started doing this.
I think it's possibly a fairly common occurrence, actually, that may have been/is being hidden by the PiJuice hardware watchdog restarting the units before we notice. This particular unit is old enough that it doesn't have a PiJuice in it, which is why it stands out.
I'm currently taking a look at PiWatcher as a watchdog replacement for the PiJuice that we're phasing out, and my dev device has just gone into the same state (no longer hidden by the PiJuice restarting, because it's got a non-configured PiWatcher in it instead).
Hi there, we've got this ticket escalated to engineering at present, since the investigation on Friday revealed the device has somehow managed to be running two containers on the default bridge with the same MAC/IP:
balena inspect 5303ea44ca39 (mqtt)
    "Networks": {
        "1390781_default": {
            "Aliases": [
                "mqtt",
                "5303ea44ca39"
            ],
            "NetworkID": "182e9c726b791bd9b3c21634c2b219f4d34b1c5c870758d2418af4418888f38c",
            "EndpointID": "d99d28bb60f97613ddd8ad59cc0a47ac0be45839028eb54eb7c22819d0dae15d",
            "Gateway": "172.17.0.1",
            "IPAddress": "172.17.0.2",
            "IPPrefixLen": 16,
            "MacAddress": "02:42:ac:11:00:02", ...
balena inspect 649b2d719d47 (networking)
    "Networks": {
        "1390781_default": {
            "Aliases": [
                "networking",
                "649b2d719d47"
            ],
            "NetworkID": "182e9c726b791bd9b3c21634c2b219f4d34b1c5c870758d2418af4418888f38c",
            "EndpointID": "99a1e937f2d736ed7429ef2a4c54d173f8052182d37857f480a3e8e757c5be6e",
            "Gateway": "172.17.0.1",
            "IPAddress": "172.17.0.2",
            "IPPrefixLen": 16,
            "MacAddress": "02:42:ac:11:00:02", ...
This obviously shouldn't be happening, so we are keen to find out why.
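For reference, one quick way to spot duplicates like this across everything on the bridge is to dump each container's IP/MAC and sort the output - a rough sketch using standard docker inspect templating against the balena-engine CLI on the host:

# for c in $(balena ps -q); do balena inspect -f '{{.Name}}: {{range .NetworkSettings.Networks}}{{.IPAddress}} {{.MacAddress}}{{end}}' "$c"; done | sort -k2

Any two lines sharing the same address pair point at the clashing endpoints.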
Ah, interesting. Fair enough, thanks for the update.
Do you want the device kept in the current state, or can I restart it and get it capturing data again (for however long it'll stay working, at least...)?
Hi Aidan, I'll just try to capture some more logs now; you can restart the device after that, and I'll let you know when I'm done. Thanks for helping out!
I've fetched all the diagnostics that we need, so you can go ahead and restart the device. Let us know if you hit the same issue again, and we will reach out if we need to look into the device more.
Thanks a lot. I've restarted and it's doing its thing.
I'll keep it up off to the side, check in on it as time ticks on, and yell whenever it dies again.
Do you think you could share the app that made it get to this state? Or some sort of way of reproducing the problem on our side?
I've granted support access to the whole app for you - once again for a whole week, because why not.
The device you've been looking at so far (once again in the error state, btw) is more or less the only one without a PiJuice. All the others have one and so, if they happen to also get into that state, should only display as such for ~10m before getting restarted and the issue disappearing.
I have no idea how to reproduce the issue, unfortunately, or else I'd have been doing that to help track down the cause myself.
One bit of info I can provide is the time at which this seems to have happened most recently...
Somewhere between 17:21:04 and 17:21:10 (UTC) yesterday.
So, looking at this issue again: the mqtt container seems to be the only one for which DNS works, both for external names and for resolving itself via the Docker local DNS server on the default bridge:
# cat /etc/resolv.conf
nameserver 127.0.0.11
...
# nslookup google.com 127.0.0.11
Server: 127.0.0.11
Address: 127.0.0.11:53
Non-authoritative answer:
Name: google.com
Address: 2a00:1450:4009:808::200e
# nslookup mqtt 127.0.0.11
Server: 127.0.0.11
Address: 127.0.0.11:53
Non-authoritative answer:
Name: mqtt
Address: 172.17.0.2
Is there anything different about how that container is configured? Are there any iptables rules specified for example?
Edit: the MQTT container on a device currently showing the issue cannot resolve, unlike what your logs suggest it managed:
Nyet. Not unless MQTT does something itself under the hood to that effect.
The entire sum total of that container:
mosquitto.conf
docker.template
docker-compose.yml entry
It is, by a large margin, the most basic of our containers.
The only container we have that plays with IP routing is hotspot, enabling internet access for things connected to the wifi point we run.
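To be explicit about what "plays with IP routing" means: it is essentially the standard access-point NAT recipe, something along these lines (illustrative only, not our exact script; the interface names are placeholders):

# sysctl -w net.ipv4.ip_forward=1
# iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# iptables -A FORWARD -i wlan0 -o eth0 -j ACCEPT
# iptables -A FORWARD -i eth0 -o wlan0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

At least in that form, nothing touches the balena bridge or port 53 directly.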
P.S. Additional note: any and all configuration, if any is done, is done once at container/app startup, for all containers. Once they're up and running, they just do their thing. We don't monitor anything in particular outside of our own stuff, and we don't really react to x, y, or z happening with the device, other than simply rebooting if something (of ours) isn't working and hasn't managed to recover. We don't keep playing with settings/configs as time goes on at all.
Hey there! We haven't seen similar issues on any other balena app yet, which makes me think there might be something in the application code sporadically causing this. While we continue investigating, given that the hotspot container is the only one that plays with routing, would it be possible to comment out the hotspot container, push an update to just one device, and see if the issue happens again?
Possible, yes, so I guess I'll see about doing just that when I have a moment.
On the OS/supervisor side of things, assuming you can enumerate such things, what changes have been made between 2.38.0+rev1/9.15.7 and 2.51.1+rev1/11.4.10 that may be related to this?
As I've said multiple times now, this device had no issues for an entire year while it was on the older OS/supervisor; it only started displaying this issue once pulled up to date. Therefore, what has happened OS/supervisor-side that may contribute to this?
Hi there - thanks for getting back to us. I understand your frustration, and we thank you for your patience as we work to find out what's going on here.
As my colleague noted, we have not seen this error on other devices, so we're doing our best to investigate this and understand what's happening. We will update you on that as soon as we can. In the meantime, as you mentioned, if you're able to try your application with the hotspot container commented out, this may help narrow down the problem.
The supervisor and balenaOS are both open source applications, with changelogs published as part of their repos on Github:
- Supervisor
- balenaOS (note that the repo is named "meta-balena", but is used for building balenaOS images)
One thing I note in that changelog is a bug, similar to what we're seeing, regarding DNS for local containers. A fix for this has been released as part of balenaOS v2.54.2, which is currently available on our staging environment (https://dashboard.balena-staging.com). I'll check with our team to see when we should have that out in production, or if there is a way to test this by manually adding a firewall rule that matches this fix. The firewall code has changed a fair bit between what you're running and the newer version, so I'm not certain that this applies; however, it's worth investigating. We'll get back to you with more information.
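In the meantime, if you'd like to compare what is currently active on the device against that fix, the firewall rules in force can be dumped from the host OS shell (read-only and safe to run; the grep pattern is just a convenience):

# iptables-save | grep -iE 'dns|dport 53|balena|docker'

The exact rule to add, if we go that route, should come from the meta-balena commit rather than a guess, so we'll confirm it before suggesting anything.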
Thanks again,
Hugh
Hola,
We very much appreciate the attention you are showing our issue. The fact that there are so many names all jumping into the conversation makes us feel warm and fuzzy.
That or there's only really one of you that can't decide who he wants to be each day...
Thanks for the reminder that it is open source - surprisingly, in this day and age, not something I tend to think about. I'll have a poke through, but it's no substitute for those who know it inside and out. From the horse's mouth and all that.
[Time has passed]
Yep, distracted by the logs.
I can see about 5 things between the 9.15.7-11.4.10 versions that directly relate in some fashion (other than the js rewrites that could cover any number of things) - i.e. contain any reference to network/ip/dns/fw/etc.
And since 11.4.10 there look to be about 5 more network-related entries up to the current 11.12.11.
Similarly, there's a 3/2 split for the OS versions.
Ok then, I guess the sensible course is to do as desired and get one up and rocking without hotspot. If that results in no change, then - short of any other immediate suggestions regarding our containers - we play with the staging builds to see if any of those fixes do the job?
Hi,
One of my teammates previously noticed that two containers - mqtt and networking - are getting the same MAC address, which could be hosing the rest of DNS. A couple of things you could try: explicitly assigning a MAC address in your docker-compose.yml, or disabling one of those two services (and temporarily removing any depends_on references) in your next build and seeing if DNS works without that container running.
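For the first suggestion, the compose syntax would look roughly like this (the address below is only an example - any locally-administered MAC that doesn't clash will do, and please double-check that mac_address is honoured by the supervisor/compose version you're on before relying on it):

services:
  mqtt:
    mac_address: "02:42:ac:11:00:0a"

That should at least stop mqtt from being auto-assigned the same address as networking.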
Also, are you using any other build tools as part of your build (such as Portainer), or are you just running balena push <app-name> with docker-compose.yml and Dockerfiles?
Thanks,
John