Balena Containers leaving Bridge network

Hi,

I’ve recently been having an issue with a few containers in my multi-container application. I need to set a custom bridge network on the containers (defined via docker-compose, because I can’t use the default address range of the regular bridge network). They worked fine for a long time, but I have noticed that since the latest updates to the OS and the Supervisor, the containers appear to be abandoning the network.

Any ideas why this would be happening? I have 4 containers on this network and now 0 appear… the image below is the output of balena network inspect ${NETWORK_ID}
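
For reference, this is roughly what I mean (illustrative only; the Containers section is the relevant part, and the IDs shown are placeholders, not from my actual device):

# on an affected device the Containers map comes back empty:
balena network inspect ${NETWORK_ID} --format '{{json .Containers}}'
{}

# on a healthy device all four services are listed, along the lines of:
# {"<container_id>":{"Name":"agent_...","IPv4Address":"172.239.239.10/24", ...}, ...}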

On a healthy device it looks like this:

I already read a thread in the forums about some people having a similar issue… Any ideas what the root cause of this could be? And why would we see it so often after the release of Supervisor 12.5.10 and OS 2.68.1+rev1 for Intel NUC?

Attached are the device diagnostics obtained from Balena Dashboard

4344fdf33df6309a9d41fec7a05db7fb_diagnostics_2021.07.04_01.13.22+0000.txt (669.2 KB)

Thanks in advance

Hey there, would you be able to share the compose file from your project so we can attempt to reproduce? How often are you experiencing this issue when running 2.68.1+rev1?

Would you be willing to update to the latest supervisor via the dashboard actions? It may resolve any issues you are seeing.

Hi @klutchell ,

We are seeing it pretty often. I don’t have exact numbers for how often it happens per device, but it happens frequently across our fleet, at least 3 or 4 different devices every day. It is normally fixed after a reboot. I have had a hard time pinpointing the reason for it, but the containers leaving the network is definitely what is causing the rest of the issues we see on those devices, such as losing DNS functionality.

I’m willing to update the supervisor to the latest version. You say it may help, but is that because there are known bug fixes for issues like this, or just a general expectation that updating might improve things?

I’m comfortable sharing the compose file; I just added placeholders for a few of the service names.

services:
  test-reverse-proxy-agent:
    container_name: reverse-proxy-agent
    image: ${DOCKER_REPO}/test-reverse-proxy-agent:intel-nuc-579d5e6ae3d3372ac2d21a97dff3816d5bc7998b
    network_mode: "host"
    ports:
      - "443:443"
      - "127.0.0.1:8080:8080"
    privileged: true
    restart: always
    volumes:
      - rpdata:/nginx_certs

  test-timeseries:
    container_name: timeseries
    depends_on:
      - test-networkmanager
    image: ${DOCKER_REPO}/test-timeseries:intel-nuc-0a4ab06830b8155a157783b8c00c7ca423947e11
    networks:
      balena:
        ipv4_address: 172.239.239.12
        aliases:
          - timeseries
    ports:
      - "127.0.0.1:8086:8086"
    restart: always
    volumes:
      - timeseries_data:/etc/influxdb
      - timeseries_conf:/var/lib/influxdb
      - timeseries_balena_data:/data

  test-ai-detector-intel:
    container_name: ai-detector-intel
    depends_on:
      - test-iotgateway
      - test-agent
    image: ${DOCKER_REPO}/test-ai-detector-intel:intel-nuc-01d701b1ab784cfca91d6771e16e321b98022298
    network_mode: "host"
    privileged: true
    restart: always
    volumes:
      - ai_data:/ai_data

  test-iotgateway:
    container_name: iotgateway
    depends_on:
      - test-networkmanager
    image: ${DOCKER_REPO}/test-iotgateway:intel-nuc-5467bf461ea536a0d42c140f8d5d487df83f3dfb
    networks:
      balena:
        ipv4_address: 172.239.239.11
        aliases:
          - iotgateway
    ports:
      - "127.0.0.1:8888:8888"
    restart: always
    volumes:
      - iotgateway_certificates:/certificates

  test-agent:
    container_name: agent
    depends_on:
      - test-networkmanager
    environment:
      - DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket
    image: ${DOCKER_REPO}/test-agent:intel-nuc-d13217d4e1b8902716494780744991a958f91655
    labels:
      io.balena.features.dbus: '1'
      io.balena.features.balena-api: '1'
      io.balena.features.supervisor-api: '1'
    networks:
      balena:
        ipv4_address: 172.239.239.10
        aliases:
          - test-agent
    ports:
      - "127.0.0.1:8800:8800"
      - "127.0.0.1:554:554"
      - "127.0.0.1:1935:1935"
      - "127.0.0.1:11935:11935"
      - "127.0.0.1:8081:8081"
      - "127.0.0.1:8082:8082"
    privileged: true
    restart: always
    volumes:
      - image_data:/www/data
      - test_conf:/etc/test
      - test_log:/var/log/test
      - reencoding_data:/reencode/custom
      - test_service_conf:/service_data/conf

  test-networkmanager:
    container_name: networkmanager
    environment:
      - DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket
      - NETWORK_MANAGER_ENVIRONMENT=production
      - NETWORK_MANAGER_APP_SETTINGS=src.components.config.ProductionConfig
    expose:
      - "5000"
    image: ${DOCKER_REPO}/test-networkmanager:intel-nuc-9b35be81e7a063f27b2ea5bf3018be2b248c807f
    labels:
      io.balena.features.dbus: '1'
      io.balena.features.balena-api: '1'
      io.balena.features.supervisor-api: '1'
      # io.balena.features.procfs: '1'
      io.balena.features.sysfs: '1'
    network_mode: "host"
    ports:
      - "127.0.0.1:5000:5000"
    privileged: true
    restart: always
    volumes:
      - network_manager_db:/db
      - network_manager_migrations:/src/migrations
      - network_manager_data:/data

  test-gui:
    container_name: gui
    depends_on:
      - test-networkmanager
    devices:
      - "/dev/tty0:/dev/tty0"
      - "/dev/tty2:/dev/tty2"
      - "/dev/fb0:/dev/fb0"
      - "/dev/input:/dev/input"
      - "/dev/snd:/dev/snd"
    environment:
      - DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket

    image: ${DOCKER_REPO}/test-gui:intel-nuc-b3ed9275a82a71d972015202e4edd694414177b3
    labels:
      io.balena.features.dbus: '1'
    network_mode: "host"
    privileged: true
    restart: always

  test-tunnel-client:
    container_name: tunnel-client
    depends_on:
      - test-iotgateway
    image: ${DOCKER_REPO}/test-tunnel-client:intel-nuc-c470e5145db19263b2d62631f8878498f93642ee
    network_mode: "host"
    restart: always

  test-discovery:
    container_name: discovery
    depends_on:
      - test-networkmanager
    image: ${DOCKER_REPO}/test-discovery:intel-nuc-64152e20476dde1ce64876fa73a230755c40bec1
    network_mode: "host"
    ports:
      - "127.0.0.1:3000:3000"
    restart: always
    volumes:
      - discovery_data:/discoverydata

  test-restreamer:
    container_name: restreamer
    depends_on:
      - test-iotgateway
    image: ${DOCKER_REPO}/test-restreamer:intel-nuc-910028e123e07b15146ba93d050d061e5b3013a0
    networks:
      balena:
        ipv4_address: 172.239.239.13
        aliases:
          - restreamer
    restart: always
    volumes:
      - restreamer_conf:/service_data/conf

  test-integrations:
    container_name: integrations
    depends_on:
      - test-iotgateway
    image: ${DOCKER_REPO}/test-integrations:intel-nuc-37708dc009ed4fe40d07b1fb115fd2e9cf499c2a
    network_mode: "host"
    restart: always
    volumes:
      - integrations_data:/data

volumes:
  rpdata:
  network_manager_db:
  network_manager_migrations:
  network_manager_data:
  image_data:
  test_conf:
  test_log:
  test_service_conf:
  reencoding_data:
  hostname_conf:
  timeseries_data:
  timeseries_conf:
  timeseries_balena_data:
  iotgateway_certificates:
  discovery_data:
  restreamer_conf:
  integrations_data:
  ai_data:

networks:
  balena:
    ipam:
      driver: default
      config:
        - subnet: 172.239.239.0/24
          gateway: 172.239.239.1
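
For what it’s worth, this is a quick way I can check the bridge itself from the host OS, independent of what the engine reports (just plain ping; the addresses are the gateway and static IPs from the compose file above):

# the gateway and the four static addresses should all answer while things are healthy
for ip in 172.239.239.1 172.239.239.10 172.239.239.11 172.239.239.12 172.239.239.13; do
  ping -c1 -W1 "$ip" >/dev/null 2>&1 && echo "$ip reachable" || echo "$ip unreachable"
done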

This is pretty scary, because previously my fleet was working as expected and we never saw issues like this. Now, because the containers leave the network and lose DNS resolution, some of the services that export telemetry data to our cloud are no longer able to do so, which causes all kinds of problems for keeping track of device health.
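
To illustrate the DNS symptom, these are the kinds of checks that start failing inside the bridge-networked containers once they drop off the network (a rough sketch; it assumes nslookup/getent are present in the image, which may not be the case for every service):

# the compose aliases normally resolve via the engine's embedded DNS at 127.0.0.11
nslookup iotgateway 127.0.0.11
getent hosts timeseries

# the static addresses also stop answering once the container is off the bridge
ping -c1 -W1 172.239.239.11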

Regards,

Hello there,
Another question to help diagnose this: if you do a balena inspect on the containers that have left the network, which network are they in afterwards? Did the containers crash/restart when this happened?
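
Something like this would show it at a glance (just a sketch using docker-style format templates; the container and network names are placeholders):

# which networks does the engine think the container is attached to?
balena inspect --format '{{json .NetworkSettings.Networks}}' <container_name>

# and which containers does the network list on its side?
balena network inspect --format '{{json .Containers}}' <network_name>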

Are the diagnostic logs that you attached from a healthy or an unhealthy device?

Hi @Hades32 ,

The logs I attached are from the same device that had the containers leaving the network, i.e. from an unhealthy device. The containers did not crash as far as I know; if I restart one, it goes back onto the network.

81cb75d7aea150401276ec2c1e3d2775_checks_2021.07.06_13.01.15+0000.txt (1.6 KB)

I just ran balena inspect ${CONTAINER_ID}. The containers still show up on the proper network, while if I run balena network inspect ${NETWORK_ID} the network does not include the containers.

I have also attached the device health checks output produced by the Balena Dashboard for an unhealthy device with the same issues.

Regards,

Thanks for providing the additional logs! Do you have any devices still running the old OS release that were not experiencing the issue? Do you recall exactly what version they were running? If something changed in the OS or the engine it would help to narrow it down.

Hi @klutchell ,

We have a few devices running version 11.14.0 of the Supervisor and OS 2.58.4.

For those we had to make a custom build of balenaOS because of the lack of support for the Nvidia drivers and kernel 5.9 at the time of the build. We only added that; we did not modify or remove anything related to the Supervisor or the balena Engine, we just added the extra pieces we needed. On these devices we almost never, practically never in fact, see these issues.

These devices are running the same source code as the ones with the latest release of the Balena Supervisor and BalenaOS.

If you know of any way to mitigate this issue, please let me know; we do not feel comfortable having this situation on a lot of devices in our fleet.

Regards,

Hey @eeb, could you grab the output of these commands so I can open an investigation in the balena-engine?

balena-engine version
balena-engine info

Hi @klutchell

Here is the output of balena-engine version from a device that just had the issue with the containers leaving the network:

Client:
 Version:           19.03.13-dev
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        074a481789174b4b6fd2d706086e8ffceb72e924
 Built:             Mon Feb  1 20:12:05 2021
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-dev
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       074a481789174b4b6fd2d706086e8ffceb72e924
  Built:            Mon Feb  1 20:12:05 2021
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.2.0+unknown
  GitCommit:        
 runc:
  Version:          
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty

Here is the output of balena-engine info from a device that just had the issue with the containers leaving the network:

Client:
 Debug Mode: false

Server:
 Containers: 12
  Running: 12
  Paused: 0
  Stopped: 0
 Images: 15
 Server Version: 19.03.13-dev
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 331
  Dirperm1 Supported: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: bare runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: 949e6fa-dirty (expected: fec3683b971d9)
 Kernel Version: 5.8.18-yocto-standard
 Operating System: balenaOS 2.68.1+rev1
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.691GiB
 Name: 40623112549f.videolink.io
 ID: ODX3:BQOU:LFIR:MXUE:WZX4:L37E:BY42:VT3K:6AZL:FDZI:RDVT:UQ2B
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: the aufs storage-driver is deprecated, and will be removed in a future release

For a device running an OS and Supervisor version where we do not see this issue, here is the balena-engine version output:

Client:
 Version:           19.03.13-dev
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        074a481789174b4b6fd2d706086e8ffceb72e924
 Built:             Sun Nov  8 20:41:10 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-dev
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       074a481789174b4b6fd2d706086e8ffceb72e924
  Built:            Sun Nov  8 20:41:10 2020
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.2.0+unknown
  GitCommit:        
 runc:
  Version:          
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty

For a device running an OS and Supervisor version where we do not see this issue, here is the balena-engine info output:

Client:
 Debug Mode: false

Server:
 Containers: 12
  Running: 12
  Paused: 0
  Stopped: 0
 Images: 13
 Server Version: 19.03.13-dev
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: bare runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: 949e6fa-dirty (expected: fec3683b971d9)
 Kernel Version: 5.9.0-yoctodev-standard
 Operating System: balenaOS 2.58.4+rev1
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 31.28GiB
 Name: 18c04d09969e.videolink.io
 ID: DLI6:WODP:VCOB:UU3K:HQ56:L3QD:UPGA:KN77:U3ZR:2ANQ:57TI:BQAC
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Thanks in advance

Thanks for the additional details! I’ve opened a GH issue here but the first step will be trying to reproduce it internally.

I see you have several services running, but only the 4 on the balena network are dropping, correct? I think from your diagnostics the CPU/Memory usage on these devices is within the normal range?

Hi @klutchell ,

Thanks so much! Yes, the containers that are on the balena network are the ones dropping off it. The CPU/RAM values are in range, but they can certainly vary over time depending on the load placed on the devices.

Regards,

Interesting that the engine appears unchanged between those releases, so maybe it’s something else in the OS or supervisor that could impact this.

If you wouldn’t mind trying to update the supervisor on one of the impacted devices, it would help narrow down where to look for the root cause!

Also, when the 4 containers disconnect from the bridge, is it all 4 at once or could it happen to some containers and not others? Could you grab the output of nmcli on the host when the containers are disconnected?

When the containers disconnect, it is not always all of them at once; it can happen to a subset of the 4. The other day I saw it happen to 3 of them instead of all 4. About the supervisor version, I would gladly try the update on a few devices, but I’m not sure how we would tell whether it has helped, since, as I said, the issue appears to happen randomly and not on all devices at the same time. Are there bug fixes in the Supervisor 12.8.x releases related to this? I would not like to update my production fleet blindly, but I can certainly test on my test and development devices.
I’ll post the nmcli output in a few minutes; is there a specific option I should pass to nmcli?

Regards,

No specific options are required for nmcli in this case; it will just print the status of your interfaces. I wanted to see whether the bridge showed as disconnected.

I’m not aware of any supervisor changes that would cause or fix this issue, but if you could test on a development device and we see a change in behaviour it will narrow down our troubleshooting.

Hi @klutchell

Here is the output of the nmcli command on a device that has the issue.

enp2s0: connected to Wired connection 2
        "enp2s0"
        ethernet (r8169), 40:62:31:11:48:09, hw, mtu 1500
        ip4 default
        inet4 10.138.1.91/24
        route4 0.0.0.0/0
        route4 10.138.1.0/24
        inet6 fe80::b221:20fa:47ce:5c74/64
        route6 fe80::/64
        route6 ff00::/8
        route6 ff00::/8

supervisor0: connected (externally) to supervisor0
        "supervisor0"
        bridge, 02:42:6D:79:DD:0B, sw, mtu 1500
        inet4 10.114.104.1/25
        route4 10.114.104.0/25
        inet6 fe80::42:6dff:fe79:dd0b/64
        route6 fe80::/64
        route6 ff00::/8

enp1s0: unavailable
        "enp1s0"
        ethernet (r8169), 40:62:31:11:48:08, hw, mtu 1500

balena0: unmanaged
        "balena0"
        bridge, 02:42:A0:CD:7C:22, sw, mtu 1500

br-42e206f2f0f7: unmanaged
        "br-42e206f2f0f7"
        bridge, 02:42:59:99:46:9D, sw, mtu 1500

br-b77374435ce0: unmanaged
        "br-b77374435ce0"
        bridge, 02:42:F9:19:93:09, sw, mtu 1500

resin-dns: unmanaged
        "resin-dns"
        bridge, FA:1E:E1:B3:BD:68, sw, mtu 1500

veth0a630f1: unmanaged
        "veth0a630f1"
        ethernet (veth), 8E:4C:C6:69:C0:94, sw, mtu 1500

veth4e9766c: unmanaged
        "veth4e9766c"
        ethernet (veth), C2:97:55:F4:23:E9, sw, mtu 1500

veth4e98c8c: unmanaged
        "veth4e98c8c"
        ethernet (veth), F2:6D:9A:55:C5:8B, sw, mtu 1500

veth96f4ca7: unmanaged
        "veth96f4ca7"
        ethernet (veth), F6:22:3B:05:C6:9A, sw, mtu 1500

vetha02f02e: unmanaged
        "vetha02f02e"
        ethernet (veth), 06:6E:01:A1:46:A9, sw, mtu 1500

sit0: unmanaged
        "sit0"
        iptunnel (sit), 00:00:00:00, sw, mtu 1480

lo: unmanaged
        "lo"
        loopback (unknown), 00:00:00:00:00:00, sw, mtu 65536

resin-vpn: unmanaged
        "resin-vpn"
        tun, sw, mtu 1500

DNS configuration:
        servers: 10.0.0.68 10.0.0.88 10.0.0.77
        domains: CityofSouthFultonga.gov
        interface: enp2s0

Use "nmcli device show" to get complete information about known devices and
"nmcli connection show" to get an overview on active connection profiles.

Consult nmcli(1) and nmcli-examples(7) manual pages for complete usage details.

I’ll update the development related device and I’ll let you know here if I find something improved.

Regards

Hey, did you see any change in behaviour with the latest supervisor version? If not, we might have to start looking more closely at your device logs, specifically the supervisor and the kernel, to see if there is any indication or log entry when a container loses its network bridge.

  • Do any of your containers restart frequently (either by design, or crashing, etc)?
  • Do any of them perform host OS tasks like wifi network management or similar?
  • Is it always the same containers that lose the bridge?
  • You mentioned another forum thread where you saw this issue, could you link it here for context?
  • In the diagnostics you attached I see a lot of errors related to sda; you might need to look into disk errors on that specific device

I think at this point we should enable persistent logging via Dashboard → Device Configuration if you haven’t already and start collecting logs. Next time it happens you can attach the output of the following commands, along with the answers to my many questions above!

  • journalctl -a --no-pager
  • dmesg

Let us know how it’s going!
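
If it’s easier to capture, something along these lines should do it (a sketch; the unit names are assumptions based on recent balenaOS releases, so check systemctl list-units on your device first):

journalctl -a --no-pager > journal.txt
dmesg > dmesg.txt

# optionally scoped to the engine and supervisor units
journalctl -a --no-pager -u balena.service > engine.log
journalctl -a --no-pager -u balena-supervisor.service > supervisor.log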

Hello @klutchell, we updated our development devices but not the production devices; we don’t take updating core software like the Balena Supervisor lightly without lots of testing.

  • Our devices reboot every 15 days by design.
  • Our devices reboot in case of critical failure of a service.
  • One container performs network management tasks (network interfaces), but only when needed.
  • It’s always the same containers that lose the bridge network. We only have host-networked and bridge-networked containers, and the bridge-networked ones are the ones that lose it.
  • The thread is not entirely about this, but something similar happens in this thread → DNS failure not caught by supervisor - #50 by alexgg
  • About sda, yes there is something to resolve there, but I do not think it is related to this issue; sda is an external device.

About persistent logging, we would have to enable it on all our devices, because the issue does not happen all the time… I’ll enable it on a few devices to see if I can capture it.
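
To try to catch it in the act, I’m thinking of leaving something like this running on a couple of the devices with persistent logging (just a sketch; the network name is a placeholder for the real prefixed name from balena network ls, and the log path is whatever survives on the host):

#!/bin/sh
# log which containers are attached to the custom bridge once a minute,
# so the moment they drop off gets a timestamp
NET="<appId>_balena"   # placeholder, take the real name from `balena network ls`
while true; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') $(balena network inspect "$NET" --format '{{range .Containers}}{{.Name}} {{end}}')"
  sleep 60
done >> /mnt/data/network-watch.log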

Regards

Hey there, any changes since we last spoke? We’ve seen similar behaviour with one other device type in the field but haven’t reproduced it internally or determined a root cause, so I appreciate you taking the time to troubleshoot!

  • Any changes noticed on the development devices that were upgraded?
  • What kind of network management tasks are you performing on these interfaces? This may impact how the engine responds to interface changes.
  • Is your project open-source or do you have a simplified version we could run internally to reproduce?
  • Was it reproduced on any of the devices with persistent logging?

The other device in the field encountering a similar issue is a custom ARM-based device, with containers that crash/restart frequently. So there are differences to be sure and we haven’t found a common element yet.

I don’t see that I mentioned this above, but I just double-checked that the balena-engine version did NOT change between v2.58.4 and v2.68.1, so this points to something else in the OS/kernel that changed.
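
For completeness, the relevant fields can be compared quickly on a healthy and an affected device with something like:

# run on both devices and compare the output
balena-engine version | grep -E 'Version|Git commit'
uname -r    # the kernel does differ between the two releases (5.8.18 vs 5.9.0 above)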

For those we had to make a custom build of balenaOS because of the lack of support for the Nvidia drivers and kernel 5.9 at the time of the build. We only added that; we did not modify or remove anything related to the Supervisor or the balena Engine, we just added the extra pieces we needed. On these devices we almost never, practically never in fact, see these issues.

Maybe I missed this in our thread above, but are the devices at v2.68.1 also running a custom OS build with modifications for drivers/kernel? If so, is it possible to run your application on an official image from our balenaOS downloads page, or will that break all functionality? Can we see the changes you made to the image?