Google Coral Dev Board running Balena OS 2.108.26 crashes when the nf_conntrack_netlink kernel module (needed to run the k3s agent) is loaded

This is the same issue that a team member of mine reported in https://forums.balena.io/t/reboot-loop-on-google-coral-dev-board

Summary of our setup

We are working on enrolling the Google Coral Dev Board into our existing balena fleet, which runs a collection of Raspberry Pis and NVIDIA Jetson Nanos in the following configuration:

  • Flashed with the corresponding balena OS kernels
  • Running our code that installs k3s and, for the Jetson Nano and Xavier, builds and loads the ipip and wireguard modules out-of-tree in this script

After the above steps, the devices are able to start the k3s agent in one container and wireguard in another, and to join our k3s cluster, which runs Calico as its CNI.

Our process for enrolling the Google Coral Dev Board

Here are the logs from starting the k3s agent with ALL the kernel modules loaded

Dmesg logs on the host kernel

[   40.838652] random: crng init done
[   40.851547] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
[   40.874565] EXT4-fs (mmcblk0p6): mounted filesystem with ordered data mode. Opts: (null)
[   41.028904] systemd[1]: System time before build time, advancing clock.
[   41.097585] systemd[1]: File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
[   41.097598] systemd[1]: Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[   41.209113] systemd[1]: /lib/systemd/system/chronyd.service:25: Unknown key name 'ProcSubset' in section 'Service', ignoring.
[   41.209146] systemd[1]: /lib/systemd/system/chronyd.service:28: Unknown key name 'ProtectHostname' in section 'Service', ignoring.
[   41.209163] systemd[1]: /lib/systemd/system/chronyd.service:29: Unknown key name 'ProtectKernelLogs' in section 'Service', ignoring.
[   41.209194] systemd[1]: /lib/systemd/system/chronyd.service:32: Unknown key name 'ProtectProc' in section 'Service', ignoring.
[   42.247023] imx-sdma 30bd0000.sdma: no iram assigned, using external mem
[   42.256266] imx-sdma 30bd0000.sdma: loaded firmware 4.2
[   42.259899] imx-sdma 302c0000.sdma: no iram assigned, using external mem
[   42.268096] imx-sdma 302c0000.sdma: loaded firmware 4.2
[   42.348589] ina2xx 1-0040: error configuring the device: -6
[   42.361241] ina2xx 1-0041: error configuring the device: -6
[   42.750411] zram: Can't change algorithm for initialized device
[   43.627353] Adding 503584k swap on /dev/zram0.  Priority:-2 extents:1 across:503584k SS
[   43.910583] wlan: loading out-of-tree module taints kernel.
[   43.975040] wlan: loading driver v4.5.23.1
[   43.975387] hif_pci_probe:, con_mode= 0x0
[   43.975397] PCI device id is 003e :003e
[   43.975417] hif_pci 0000:01:00.0: BAR 0: assigned [mem 0x18000000-0x181fffff 64bit]
[   43.975548] hif_pci 0000:01:00.0: enabling device (0000 -> 0002)
[   43.976718]
                hif_pci_configure : num_desired MSI set to 1
[   44.054114] hif_pci_probe: ramdump base 0xffff800024e00000 size 2095136
[   44.126366] NUM_DEV=1 FWMODE=0x2 FWSUBMODE=0x0 FWBR_BUF 0
[   44.779370] +HWT
[   44.796852] -HWT
[   44.820250] HTT: full reorder offload enabled
[   44.860930] Pkt log is disabled
[   44.865835] Host SW:4.5.23.1, FW:2.0.1.1048, HW:QCA6174_REV3_2
[   44.866430] ol_pktlog_init: pktlogmod_init successfull
[   44.866722] wlan: driver loaded in 892000
[   44.870061] target uses HTT version 3.50; host uses 3.28
[   47.488191] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   47.495084] Generic PHY 30be0000.ethernet-1:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=30be0000.ethernet-1:00, irq=POLL)
[   47.495751] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   47.534226] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   47.534572] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   47.668851] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   51.593483] fec 30be0000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   51.593510] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   65.517525] Bridge firewalling registered
[   65.630300] Initializing XFRM netlink socket
[   65.641279] Netfilter messages via NETLINK v0.30.
[   65.900887] IPv6: ADDRCONF(NETDEV_UP): supervisor0: link is not ready
[   65.995140] IPv6: ADDRCONF(NETDEV_UP): balena0: link is not ready
[   68.796835] ipip: IPv4 and MPLS over IPv4 tunneling driver
[   73.869504] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   79.811546] EXT4-fs (mmcblk0p3): mounted filesystem with ordered data mode. Opts: (null)
[  235.903492] ctnetlink v0.93: registering with nfnetlink.
[  255.787535] ip_set: protocol 6
[  256.065155] IPVS: [rr] scheduler registered.
[  256.688073] Unable to handle kernel NULL pointer dereference at virtual address 00000040
[  256.704251] Mem abort info:
[  256.709945]   Exception class = DABT (current EL), IL = 32 bits
[  256.721880]   SET = 0, FnV = 0
[  256.728142]   EA = 0, S1PTW = 0
[  256.734520] Data abort info:
[  256.740377]   ISV = 0, ISS = 0x00000006
[  256.748146]   CM = 0, WnR = 0
[  256.754179] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80001f9f3000
[  256.767329] [0000000000000040] *pgd=000000005f9f8003, *pud=000000005fa76003, *pmd=0000000000000000
[  256.785345] Internal error: Oops: 96000006 [#1] PREEMPT SMP

k3s agent logs (v1.23.17, but it also crashes on the latest stable)

INFO[0001] Starting k3s agent v1.23.17+k3s1 (abb8d7d4)
INFO[0001] Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [129.114.34.140:6443 dev.edge.chameleoncloud.org:6443]
WARN[0001] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
INFO[0003] Module overlay was already loaded
INFO[0003] Module nf_conntrack was already loaded
INFO[0003] Module br_netfilter was already loaded
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
INFO[0003] Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log
INFO[0003] Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
INFO[0004] Containerd is now running
INFO[0004] Getting list of apiserver endpoints from server
INFO[0005] Tunnel authorizer set Kubelet Port 10250
INFO[0005] Updating load balancer k3s-agent-load-balancer default server address -> 129.114.34.140:6443
INFO[0005] Connecting to proxy                           url="wss://129.114.34.140:6443/v1-k3s/connect"
WARN[0005] Disabling CPU quotas due to missing cpu controller or cpu.cfs_period_us
INFO[0005] Running kubelet --address=0.0.0.0 --allowed-unsafe-sysctls=net.ipv4.ip_forward,net.ipv6.conf.all.forwarding --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/var/lib/rancher/k3s/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-runtime=remote --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --cpu-cfs-quota=false --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=8fb60a5 --kubeconfig=/var/lib/rancher/k3s/agent/kubelet.kubeconfig --kubelet-cgroups=/k3s --node-labels= --pod-manifest-path=/var/lib/rancher/k3s/agent/pod-manifests --read-only-port=0 --resolv-conf=/etc/resolv.conf --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/k3s/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/k3s/agent/serving-kubelet.key --volume-plugin-dir=/opt/libexec/kubernetes/kubelet-plugins/volume/exec
Flag --cloud-provider has been deprecated, will be removed in 1.24 or later, in favor of removing cloud provider code from Kubelet.
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
I0416 18:57:06.107611    1338 server.go:442] "Kubelet version" kubeletVersion="v1.23.17+k3s1"
I0416 18:57:06.111875    1338 dynamic_cafile_content.go:156] "Starting controller" name="client-ca-bundle::/var/lib/rancher/k3s/agent/client-ca.crt"
INFO[0005] Annotations and labels have already set on node: 8fb60a5
INFO[0006] Running kube-proxy --cluster-cidr=192.168.64.0/18 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout-established=0s --healthz-bind-address=127.0.0.1 --hostname-override=8fb60a5 --kubeconfig=/var/lib/rancher/k3s/agent/kubeproxy.kubeconfig --proxy-mode=iptables
I0416 18:57:06.604015    1338 server.go:224] "Warning, all flags other than --config, --write-config-to, and --cleanup are deprecated, please begin using a config file ASAP"
INFO[0006] Starting the netpol controller version v1.5.2-0.20221026101626-e01045262706, built on 2023-03-10T21:33:49Z, go1.19.6
I0416 18:57:06.623003    1338 network_policy_controller.go:163] Starting network policy controller
I0416 18:57:06.626245    1338 proxier.go:652] "Failed to load kernel module with modprobe, you can ignore this message when kube-proxy is running inside container without mounting /lib/modules" moduleName="ip_vs_wrr"
I0416 18:57:06.631820    1338 proxier.go:652] "Failed to load kernel module with modprobe, you can ignore this message when kube-proxy is running inside container without mounting /lib/modules" moduleName="ip_vs_sh"
I0416 18:57:06.679697    1338 network_policy_controller.go:175] Starting network policy controller full sync goroutine
I0416 18:57:06.811294    1338 node.go:163] Successfully retrieved node IP: 192.168.1.201
I0416 18:57:06.811465    1338 server_others.go:138] "Detected node IP" address="192.168.1.201"
I0416 18:57:06.895718    1338 server_others.go:206] "Using iptables Proxier"
I0416 18:57:06.896101    1338 server_others.go:213] "kube-proxy running in dual-stack mode" ipFamily=IPv4
I0416 18:57:06.896288    1338 server_others.go:214] "Creating dualStackProxier for iptables"
I0416 18:57:06.896494    1338 server_others.go:502] "Detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6"
I0416 18:57:06.898929    1338 server.go:656] "Version info" version="v1.23.17+k3s1"
I0416 18:57:06.911637    1338 config.go:444] "Starting node config controller"
I0416 18:57:06.912495    1338 shared_informer.go:240] Waiting for caches to sync for node config
I0416 18:57:06.911638    1338 config.go:226] "Starting endpoint slice config controller"
I0416 18:57:06.912773    1338 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
I0416 18:57:06.911692    1338 config.go:317] "Starting service config controller"
I0416 18:57:06.912992    1338 shared_informer.go:240] Waiting for caches to sync for service config
I0416 18:57:07.013755    1338 shared_informer.go:247] Caches are synced for node config
I0416 18:57:07.113248    1338 shared_informer.go:247] Caches are synced for endpoint slice config
I0416 18:57:07.113317    1338 shared_informer.go:247] Caches are synced for service config

Debugging

After manually loading the kernel modules one by one, I managed to identify the kernel module that causes the crash: nf_conntrack_netlink. The k3s agent starts fine with all the other kernel modules loaded, but the kernel crashes as soon as the agent is started with the offending module loaded.
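
For anyone who wants to repeat the one-by-one test without babysitting the board, here is a minimal Python sketch of that procedure. It is not the script we actually ran. The function names, the progress-file approach and the `check` callback (whatever starts the k3s agent on your device) are all assumptions made for illustration. The idea is to write each module name to disk before loading it, so that if the kernel oopses and the board reboots, the last recorded name is the suspect.

```python
import os
import subprocess
from typing import Callable, Iterable, Optional


def record_attempt(progress_path: str, name: str) -> None:
    """Append the module name and force it to disk before anything risky runs."""
    with open(progress_path, "a", encoding="utf-8") as f:
        f.write(name + "\n")
        f.flush()
        os.fsync(f.fileno())  # best effort: the entry should survive a crash


def last_attempted(progress_path: str) -> Optional[str]:
    """Return the last module name recorded, or None if nothing was recorded."""
    try:
        with open(progress_path, encoding="utf-8") as f:
            names = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return None
    return names[-1] if names else None


def modprobe(name: str) -> None:
    """Load one module on the host (needs root and a matching /lib/modules)."""
    subprocess.run(["modprobe", name], check=True)


def probe_modules(
    modules: Iterable[str],
    check: Callable[[], None],
    progress_path: str,
    load: Callable[[str], None] = modprobe,
) -> None:
    """Load modules one at a time, running `check` after each load.

    `check` is a hypothetical hook: on the device it would start the k3s
    agent and wait long enough for a crash to show up.
    """
    for name in modules:
        record_attempt(progress_path, name)
        load(name)
        check()
```

On the device, the progress file should live on a persistent partition. After a reboot, `last_attempted(path)` gives the module that was loaded just before the crash, which in our case would be nf_conntrack_netlink.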

The Question

Do you guys happen to have any idea why this module could be causing the crash? Furthermore, could you tell me more about which balena OS version is recommended for the Coral, and how this issue could arise? Lastly, is there another balena OS version that I could try this with on the Coral?
