We seem to have an occasionally recurring issue wherein our containers no longer have working DNS and cannot talk to each other.
As an aside, we believe this to be a Balena-side issue introduced sometime after OS/supervisor v2.38.0+rev1/9.15.7.
We have had one device up and running with nearly zero issues for just shy of a year now, previously running that OS/supervisor combination. It was literally one of the first we put out there, and frankly we expected it to fall over much sooner (our software has improved since then, of course).
One thing of note about this device is that it doesn't have a UPS (PiJuice), as it was deployed before we introduced them. As a result, it has had no hardware watchdog in place at all.
On Friday, we decided it would make a good test device and updated it to the latest 2.21.1+rev1/11.4.10. If any issues popped up, this would give some insight into whether the cause is our own software or something underlying on the Balena side of things.
Over the weekend, an issue popped up. It’s one we’ve seen a few times before on more recent devices (cannot provide anything more accurate than ‘recent’, I’m afraid).
All of the containers are unable to contact each other by name. They can, however, communicate successfully via IP. That doesn't help, though, because our apps address each other by name precisely so that we don't have to deal with IPs.
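For anyone wanting to reproduce the symptom, this is roughly the check we run from inside an affected container: name resolution fails while the raw socket path works. A minimal sketch, assuming a peer service called `worker` (the name is from our compose file; substitute your own):

```python
import socket

def can_resolve(name: str) -> bool:
    """Return True if this container's DNS can resolve `name`."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    # "worker" is a hypothetical service name; on an affected device this
    # prints "DNS FAILED" even though the peer is reachable by IP.
    for peer in ("worker",):
        print(f"{peer}: {'ok' if can_resolve(peer) else 'DNS FAILED'}")
```

On a healthy device this resolves instantly via Docker's embedded DNS; in the broken state it raises `gaierror` for every service name.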
Is this a known issue? Has anyone else come across this?
A simple restart solves the issue, but the issue itself isn't detected automatically by the supervisor, as one would hope/expect. A hardware watchdog also solves it: with our containers unable to contact each other we can't send a keepalive, and so the device gets rebooted.
Given that our chosen test device has no hardware watchdog; that the only thing to have changed is the Balena OS/supervisor version, not our own code; and that it has not had this issue in the past year, everything seems to point to an issue on the Balena side of the stack.
I realise that we could implement our own check for inter-container connectivity, but it doesn't look like it would do us any good: any requests to the supervisor get refused, so we wouldn't be able to issue a restart.
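For reference, this is the kind of self-heal call that gets refused. A minimal sketch using the supervisor's documented `/v1/reboot` endpoint and the `BALENA_SUPERVISOR_ADDRESS`/`BALENA_SUPERVISOR_API_KEY` environment variables (the fallback address and timeout below are assumptions, not values from our setup):

```python
import os
import urllib.request

def supervisor_reboot_url() -> str:
    """Build the supervisor reboot URL from the env vars balena injects
    when the supervisor-api label is enabled for the service."""
    addr = os.environ.get("BALENA_SUPERVISOR_ADDRESS", "http://127.0.0.1:48484")
    key = os.environ.get("BALENA_SUPERVISOR_API_KEY", "")
    return f"{addr}/v1/reboot?apikey={key}"

def request_reboot() -> None:
    req = urllib.request.Request(supervisor_reboot_url(), data=b"", method="POST")
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            print("supervisor responded:", resp.status)
    except OSError as exc:
        # In the broken state described above, this is where we see the
        # connection being refused.
        print("supervisor unreachable:", exc)
```

In the failure state, `request_reboot()` lands in the `except` branch rather than rebooting the device, which is why a software-only watchdog doesn't save us here.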
Any suggested/expected fixes?
Please note that I have not attempted any restarts of this device yet, as I want to keep it in this state throughout this process in case any diagnostics/troubleshooting steps are requested. While the device is in this state we are losing data, but we figure that's a fair trade if we can fix this issue.