This is the same issue that a team member of mine reported in https://forums.balena.io/t/reboot-loop-on-google-coral-dev-board
Summary of our setup
We are working on enrolling the Google Coral Dev Board into our existing balena fleet, which runs a collection of Raspberry Pis and NVIDIA Jetson Nanos in the following configuration:
- Each device is flashed with the corresponding balenaOS release.
- Each device runs our code that installs k3s; for the Jetson Nano and Xavier, this script also builds and loads the ipip and WireGuard kernel modules out-of-tree.
After the above steps, the devices are able to start the k3s agent in one container and WireGuard in another, and join our k3s cluster, which runs Calico as its CNI.
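For context, the split of responsibilities between the two containers looks roughly like the sketch below. The interface name (wg0) and the K3S_URL/K3S_TOKEN variables are illustrative placeholders, not our exact compose configuration:

```shell
#!/bin/sh
# Per-device startup sketch (wg0, K3S_URL, and K3S_TOKEN are illustrative
# placeholders, not our exact configuration).

start_wireguard() {
    # WireGuard container: bring up the tunnel that carries cluster traffic.
    wg-quick up wg0
}

start_k3s_agent() {
    # k3s container: join the cluster as an agent; in our fleet the server
    # URL and join token are injected as balena service variables.
    exec k3s agent --server "$K3S_URL" --token "$K3S_TOKEN"
}
```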
Our process for enrolling the Google Coral Dev Board
- We first flash the Google Coral with the recommended balenaOS version 2.108.26.
- We then use this code to load the following kernel modules out-of-tree:
  - the IPIP kernel module, including ip_tunnel and tunnel4
  - xt_mark, xt_statistic, xt_comment, xt_multiport, nf_conntrack_netlink, xt_bpf, and xt_u32, all required for running the Calico CNI
- Code to build the kernel modules
- Code to load the kernel modules and start the k3s agent
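Condensed, the load step looks something like the sketch below. KDIR is a hypothetical build-output directory, not a path from our actual script; the module list and ordering follow the bullets above, with ip_tunnel and tunnel4 placed before ipip so its dependencies resolve:

```shell
#!/bin/sh
# Out-of-tree module load step, condensed. KDIR is a hypothetical build
# output directory -- substitute wherever your .ko files land.
KDIR="${KDIR:-/usr/src/kmods}"

# Dependency order matters: ip_tunnel and tunnel4 must precede ipip.
MODULES="ip_tunnel tunnel4 ipip xt_mark xt_statistic xt_comment xt_multiport nf_conntrack_netlink xt_bpf xt_u32"

load_all() {
    for mod in $MODULES; do
        # Fall back to modprobe in case the module is already indexed in-tree.
        insmod "$KDIR/$mod.ko" 2>/dev/null || modprobe "$mod"
    done
}

# Only attempt loading when invoked with 'load', so sourcing is side-effect free.
if [ "${1:-}" = "load" ]; then
    load_all
fi
```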
Here are the logs of what happens when the k3s agent is started with ALL of the kernel modules loaded:
Dmesg logs on the host kernel

```
[ 40.838652] random: crng init done
[ 40.851547] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
[ 40.874565] EXT4-fs (mmcblk0p6): mounted filesystem with ordered data mode. Opts: (null)
[ 41.028904] systemd[1]: System time before build time, advancing clock.
[ 41.097585] systemd[1]: File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
[ 41.097598] systemd[1]: Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[ 41.209113] systemd[1]: /lib/systemd/system/chronyd.service:25: Unknown key name 'ProcSubset' in section 'Service', ignoring.
[ 41.209146] systemd[1]: /lib/systemd/system/chronyd.service:28: Unknown key name 'ProtectHostname' in section 'Service', ignoring.
[ 41.209163] systemd[1]: /lib/systemd/system/chronyd.service:29: Unknown key name 'ProtectKernelLogs' in section 'Service', ignoring.
[ 41.209194] systemd[1]: /lib/systemd/system/chronyd.service:32: Unknown key name 'ProtectProc' in section 'Service', ignoring.
[ 42.247023] imx-sdma 30bd0000.sdma: no iram assigned, using external mem
[ 42.256266] imx-sdma 30bd0000.sdma: loaded firmware 4.2
[ 42.259899] imx-sdma 302c0000.sdma: no iram assigned, using external mem
[ 42.268096] imx-sdma 302c0000.sdma: loaded firmware 4.2
[ 42.348589] ina2xx 1-0040: error configuring the device: -6
[ 42.361241] ina2xx 1-0041: error configuring the device: -6
[ 42.750411] zram: Can't change algorithm for initialized device
[ 43.627353] Adding 503584k swap on /dev/zram0. Priority:-2 extents:1 across:503584k SS
[ 43.910583] wlan: loading out-of-tree module taints kernel.
[ 43.975040] wlan: loading driver v4.5.23.1
[ 43.975387] hif_pci_probe:, con_mode= 0x0
[ 43.975397] PCI device id is 003e :003e
[ 43.975417] hif_pci 0000:01:00.0: BAR 0: assigned [mem 0x18000000-0x181fffff 64bit]
[ 43.975548] hif_pci 0000:01:00.0: enabling device (0000 -> 0002)
[ 43.976718] hif_pci_configure : num_desired MSI set to 1
[ 44.054114] hif_pci_probe: ramdump base 0xffff800024e00000 size 2095136
[ 44.126366] NUM_DEV=1 FWMODE=0x2 FWSUBMODE=0x0 FWBR_BUF 0
[ 44.779370] +HWT
[ 44.796852] -HWT
[ 44.820250] HTT: full reorder offload enabled
[ 44.860930] Pkt log is disabled
[ 44.865835] Host SW:4.5.23.1, FW:2.0.1.1048, HW:QCA6174_REV3_2
[ 44.866430] ol_pktlog_init: pktlogmod_init successfull
[ 44.866722] wlan: driver loaded in 892000
[ 44.870061] target uses HTT version 3.50; host uses 3.28
[ 47.488191] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 47.495084] Generic PHY 30be0000.ethernet-1:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=30be0000.ethernet-1:00, irq=POLL)
[ 47.495751] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 47.534226] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 47.534572] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 47.668851] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 51.593483] fec 30be0000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[ 51.593510] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 65.517525] Bridge firewalling registered
[ 65.630300] Initializing XFRM netlink socket
[ 65.641279] Netfilter messages via NETLINK v0.30.
[ 65.900887] IPv6: ADDRCONF(NETDEV_UP): supervisor0: link is not ready
[ 65.995140] IPv6: ADDRCONF(NETDEV_UP): balena0: link is not ready
[ 68.796835] ipip: IPv4 and MPLS over IPv4 tunneling driver
[ 73.869504] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 79.811546] EXT4-fs (mmcblk0p3): mounted filesystem with ordered data mode. Opts: (null)
[ 235.903492] ctnetlink v0.93: registering with nfnetlink.
[ 255.787535] ip_set: protocol 6
[ 256.065155] IPVS: [rr] scheduler registered.
[ 256.688073] Unable to handle kernel NULL pointer dereference at virtual address 00000040
[ 256.704251] Mem abort info:
[ 256.709945] Exception class = DABT (current EL), IL = 32 bits
[ 256.721880] SET = 0, FnV = 0
[ 256.728142] EA = 0, S1PTW = 0
[ 256.734520] Data abort info:
[ 256.740377] ISV = 0, ISS = 0x00000006
[ 256.748146] CM = 0, WnR = 0
[ 256.754179] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80001f9f3000
[ 256.767329] [0000000000000040] *pgd=000000005f9f8003, *pud=000000005fa76003, *pmd=0000000000000000
[ 256.785345] Internal error: Oops: 96000006 [#1] PREEMPT SMP
```
k3s agent logs (v1.23.17, but it also crashes on the latest stable)

```
INFO[0001] Starting k3s agent v1.23.17+k3s1 (abb8d7d4)
INFO[0001] Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [129.114.34.140:6443 dev.edge.chameleoncloud.org:6443]
WARN[0001] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
INFO[0003] Module overlay was already loaded
INFO[0003] Module nf_conntrack was already loaded
INFO[0003] Module br_netfilter was already loaded
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
INFO[0003] Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log
INFO[0003] Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
INFO[0004] Containerd is now running
INFO[0004] Getting list of apiserver endpoints from server
INFO[0005] Tunnel authorizer set Kubelet Port 10250
INFO[0005] Updating load balancer k3s-agent-load-balancer default server address -> 129.114.34.140:6443
INFO[0005] Connecting to proxy url="wss://129.114.34.140:6443/v1-k3s/connect"
WARN[0005] Disabling CPU quotas due to missing cpu controller or cpu.cfs_period_us
INFO[0005] Running kubelet --address=0.0.0.0 --allowed-unsafe-sysctls=net.ipv4.ip_forward,net.ipv6.conf.all.forwarding --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/var/lib/rancher/k3s/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-runtime=remote --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --cpu-cfs-quota=false --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=8fb60a5 --kubeconfig=/var/lib/rancher/k3s/agent/kubelet.kubeconfig --kubelet-cgroups=/k3s --node-labels= --pod-manifest-path=/var/lib/rancher/k3s/agent/pod-manifests --read-only-port=0 --resolv-conf=/etc/resolv.conf --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/k3s/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/k3s/agent/serving-kubelet.key --volume-plugin-dir=/opt/libexec/kubernetes/kubelet-plugins/volume/exec
Flag --cloud-provider has been deprecated, will be removed in 1.24 or later, in favor of removing cloud provider code from Kubelet.
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
I0416 18:57:06.107611 1338 server.go:442] "Kubelet version" kubeletVersion="v1.23.17+k3s1"
I0416 18:57:06.111875 1338 dynamic_cafile_content.go:156] "Starting controller" name="client-ca-bundle::/var/lib/rancher/k3s/agent/client-ca.crt"
INFO[0005] Annotations and labels have already set on node: 8fb60a5
INFO[0006] Running kube-proxy --cluster-cidr=192.168.64.0/18 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout-established=0s --healthz-bind-address=127.0.0.1 --hostname-override=8fb60a5 --kubeconfig=/var/lib/rancher/k3s/agent/kubeproxy.kubeconfig --proxy-mode=iptables
I0416 18:57:06.604015 1338 server.go:224] "Warning, all flags other than --config, --write-config-to, and --cleanup are deprecated, please begin using a config file ASAP"
INFO[0006] Starting the netpol controller version v1.5.2-0.20221026101626-e01045262706, built on 2023-03-10T21:33:49Z, go1.19.6
I0416 18:57:06.623003 1338 network_policy_controller.go:163] Starting network policy controller
I0416 18:57:06.626245 1338 proxier.go:652] "Failed to load kernel module with modprobe, you can ignore this message when kube-proxy is running inside container without mounting /lib/modules" moduleName="ip_vs_wrr"
I0416 18:57:06.631820 1338 proxier.go:652] "Failed to load kernel module with modprobe, you can ignore this message when kube-proxy is running inside container without mounting /lib/modules" moduleName="ip_vs_sh"
I0416 18:57:06.679697 1338 network_policy_controller.go:175] Starting network policy controller full sync goroutine
I0416 18:57:06.811294 1338 node.go:163] Successfully retrieved node IP: 192.168.1.201
I0416 18:57:06.811465 1338 server_others.go:138] "Detected node IP" address="192.168.1.201"
I0416 18:57:06.895718 1338 server_others.go:206] "Using iptables Proxier"
I0416 18:57:06.896101 1338 server_others.go:213] "kube-proxy running in dual-stack mode" ipFamily=IPv4
I0416 18:57:06.896288 1338 server_others.go:214] "Creating dualStackProxier for iptables"
I0416 18:57:06.896494 1338 server_others.go:502] "Detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6"
I0416 18:57:06.898929 1338 server.go:656] "Version info" version="v1.23.17+k3s1"
I0416 18:57:06.911637 1338 config.go:444] "Starting node config controller"
I0416 18:57:06.912495 1338 shared_informer.go:240] Waiting for caches to sync for node config
I0416 18:57:06.911638 1338 config.go:226] "Starting endpoint slice config controller"
I0416 18:57:06.912773 1338 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
I0416 18:57:06.911692 1338 config.go:317] "Starting service config controller"
I0416 18:57:06.912992 1338 shared_informer.go:240] Waiting for caches to sync for service config
I0416 18:57:07.013755 1338 shared_informer.go:247] Caches are synced for node config
I0416 18:57:07.113248 1338 shared_informer.go:247] Caches are synced for endpoint slice config
I0416 18:57:07.113317 1338 shared_informer.go:247] Caches are synced for service config
```
Debugging
After manually loading the kernel modules one by one, I managed to identify the module that causes the crash: nf_conntrack_netlink. The k3s agent starts fine with all the other kernel modules loaded, but crashes the kernel as soon as it is started with this offending module loaded.
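The bisection itself was nothing fancy; roughly the loop below, with a reboot in between whenever the kernel oopsed. The module list comes from our enrollment script; the k3s invocation, timings, and variable names here are simplified assumptions:

```shell
#!/bin/sh
# Rough shape of the bisection: load one candidate module at a time, start
# the k3s agent, and watch dmesg for an oops before moving on. The k3s
# invocation and timings are simplified assumptions.

SUSPECTS="xt_mark xt_statistic xt_comment xt_multiport nf_conntrack_netlink xt_bpf xt_u32"

test_module() {
    mod="$1"
    echo ">>> testing with $mod loaded"
    modprobe "$mod"
    k3s agent --server "$K3S_URL" --token "$K3S_TOKEN" &
    agent_pid=$!
    sleep 30
    # A surviving box means this module is fine; an oops in dmesg (or an
    # outright reboot) points at the module we just added.
    dmesg | grep -i "unable to handle kernel" || echo "$mod looks ok"
    kill "$agent_pid" 2>/dev/null
    rmmod "$mod"
}

bisect() {
    for mod in $SUSPECTS; do
        test_module "$mod"
    done
}
```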
The Question
Do you happen to have an idea why this module could be causing the crash? Furthermore, could you tell me more about the balenaOS version recommended for the Coral and how this issue could arise on it? Lastly, is there another balenaOS version that I could try on the Coral?