Container network issue with 2.89.15 and cp-zookeeper/Redhat ubi8 containers

When testing the new 2.89.15 release of Balena I came across a container networking issue: I can’t ping certain containers by their name, although DNS resolution and pinging by IP work fine.

Setup
We’re using a multi-container setup in our fleet of UpBoards running balena. Some of these containers run a Kafka stack, e.g. the cp-zookeeper image.
I noticed that the confluent containers couldn’t connect to each other, and I could reproduce this: pinging the confluent containers by name fails. The same setup apparently works fine on the old 2.68 Balena version.

Test setup
I simplified the issue by looking up the base image used (Redhat ubi8) and was able to write a simple docker-compose.yml that reproduces it:

version: "2.1"

services:
  test1:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
    container_name: test1
  test2:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
    container_name: test2
  zookeeper:
    image: registry.access.redhat.com/ubi8/ubi-minimal
    entrypoint: ["tail", "-f", "/dev/null"]
    container_name: zookeeper

I can ping the containers test1 and test2 from each other, but I cannot ping zookeeper by hostname. DNS seems to resolve correctly and ping by IP works, but ping by hostname doesn’t: I only get a report of one packet, and only after stopping it with Ctrl+C.

Container test1:

root@37df5e0c5fe5:/# nslookup zookeeper
Server: 127.0.0.11
Address: 127.0.0.11#53

Non-authoritative answer:
Name: zookeeper
Address: 172.17.0.2

root@37df5e0c5fe5:/# ping zookeeper
PING zookeeper (172.17.0.2) 56(84) bytes of data.
^C64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=0.229 ms

--- zookeeper ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.229/0.229/0.229/0.000 ms

root@37df5e0c5fe5:/# ping 172.17.0.2
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=0.259 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=64 time=0.196 ms

Notice the “^C” before the printout in the second command. I’ve never seen this behavior of the ping command before.
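
One possible way to narrow this down, as a sketch rather than a confirmed diagnosis: output that only appears after Ctrl+C would be consistent with ping waiting on a slow reverse lookup of the reply address, since ping by IP normally skips that step. This is an assumption, not something verified on this device. The helper names `timed_lookup` and `check_name` are made up for illustration; they only wrap Python's standard `socket.gethostbyname` and `socket.gethostbyaddr` so the forward and reverse lookups can be timed separately.

```python
import socket
import time


def timed_lookup(lookup, arg):
    """Run lookup(arg); return (result or the OSError raised, seconds elapsed)."""
    start = time.monotonic()
    try:
        result = lookup(arg)
    except OSError as exc:  # gaierror and herror are both OSError subclasses
        result = exc
    return result, time.monotonic() - start


def check_name(name):
    """Time the forward lookup of name and the reverse lookup of its address.

    Returns (forward_seconds, reverse_seconds); reverse_seconds is None
    when the forward lookup failed and there is no address to reverse.
    """
    ip, forward_seconds = timed_lookup(socket.gethostbyname, name)
    print(f"forward {name!r}: {ip!r} in {forward_seconds:.3f}s")
    if not isinstance(ip, str):
        return forward_seconds, None
    reverse, reverse_seconds = timed_lookup(socket.gethostbyaddr, ip)
    print(f"reverse {ip!r}: {reverse!r} in {reverse_seconds:.3f}s")
    return forward_seconds, reverse_seconds
```

Running `check_name("zookeeper")` and `check_name("test2")` from inside test1 would show whether the reverse lookup for the zookeeper address is noticeably slower than for the other containers. If Python is not available in the container, `ping -n zookeeper` (numeric output, no reverse lookup) is a quick manual check of the same idea.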

I’m looking for advice on how to pinpoint the issue. If it’s useful, I can upload further information, such as the balena inspect output for the containers.

Hi @hesch,

Thanks for all the info. It would definitely be helpful to include the inspect output for the containers, as well as the inspect output for the Docker networks on the device. By default, if no network is specified, the Supervisor on the device adds all the containers to a managed bridge network, which avoids the Docker issue of containers not being pingable on the default bridge network. With a managed bridge network, the containers should be able to ping each other as long as they’re on the same subnet and don’t have conflicting IPs; you’ll be able to see this when you inspect the containers. Docker networks normally shouldn’t produce conflicting IPs, but we’ve seen instances of IP conflicts, possibly caused by unclean engine cleanup. For example, see issue #272 in balena-os/balena-engine on GitHub (“Port already in use, because proxy keeps binding to the wrong container IP”), although I don’t know whether this is related.
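
As a rough sketch of the conflicting-IP check described above: assuming the network inspect output follows the usual Docker JSON layout (a list of networks, each with a `Containers` map whose entries carry `Name` and `IPv4Address`), something like the following could flag addresses that are assigned to more than one container. The function name `find_ip_conflicts` and the sample data are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def find_ip_conflicts(networks):
    """Return {ip: [container names]} for IPs assigned to more than one container."""
    owners = defaultdict(list)
    for network in networks:
        # "Containers" may be missing or null for networks with nothing attached.
        for info in (network.get("Containers") or {}).values():
            ip = info.get("IPv4Address", "").split("/")[0]  # drop the "/16" suffix
            if ip:
                owners[ip].append(info.get("Name", "<unknown>"))
    return {ip: names for ip, names in owners.items() if len(names) > 1}


# Hypothetical sample data, shaped like network inspect output.
SAMPLE = [
    {
        "Name": "example_default",
        "Containers": {
            "aaa": {"Name": "zookeeper", "IPv4Address": "172.17.0.2/16"},
            "bbb": {"Name": "test1", "IPv4Address": "172.17.0.2/16"},
            "ccc": {"Name": "test2", "IPv4Address": "172.17.0.4/16"},
        },
    }
]

print(find_ip_conflicts(SAMPLE))  # {'172.17.0.2': ['zookeeper', 'test1']}
```

With real data, the inspect output saved from the device could be loaded with `json.load` and passed to the function in place of `SAMPLE`; an empty result would suggest that duplicate addresses are not the cause.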

Thanks,
Christina