Help debugging kernel oops on Pi Zero W

My application (based on balenalib/raspberry-pi-alpine:edge), running on a Raspberry Pi Zero W, crashes with a kernel oops quite reliably within about an hour of use. The app listens on an RS485/USB dongle and runs video with omxplayer in the background. Three oopses are below; they seem to be triggered by different processes. One is from my app videopdu, but another seems to be from the node process in the Balena supervisor.

Can anyone suggest how to begin deciphering the oops messages I am seeing to track down the root cause?

[ 1734.499522] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
[ 1734.499548] pgd = d672c000
[ 1734.499558] [0000000c] *pgd=1659e831, *pte=00000000, *ppte=00000000
[ 1734.499585] Internal error: Oops: 17 [#1] ARM
[ 1734.499595] Modules linked in: ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink br_netfilter bnep xt_owner hci_uart btbcm serdev bluetooth ecdh_generic i2c_dev spidev ftdi_sio usbserial brcmfmac brcmutil cfg80211 rfkill snd_bcm2835(C) snd_pcm snd_timer snd i2c_bcm2835 spi_bcm2835 uio_pdrv_genirq uio fixed sch_fq_codel
[ 1734.499717] CPU: 0 PID: 1244 Comm: node Tainted: G         C      4.14.98 #1
[ 1734.499724] Hardware name: BCM2835
[ 1734.499734] task: d6840000 task.stack: d6ba2000
[ 1734.499765] PC is at do_work_pending+0x30/0xf0
[ 1734.499793] LR is at finish_task_switch+0x5c/0x1d8
[ 1734.499802] pc : [<c0013ae0>]    lr : [<c00495f8>]    psr: 60000013
[ 1734.499811] sp : d6ba3f90  ip : d6ba3ee0  fp : d6ba3fac
[ 1734.499819] r10: 00000000  r9 : d6ba2000  r8 : c0010444
[ 1734.499829] r7 : d6ba3fb0  r6 : c0010444  r5 : ffffe000  r4 : 00000002
[ 1734.499838] r3 : 00000000  r2 : 9f259198  r1 : 00000002  r0 : c11cc8c8
[ 1734.499848] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[ 1734.499858] Control: 00c5387d  Table: 1672c008  DAC: 00000055
[ 1734.499867] Process node (pid: 1244, stack limit = 0xd6ba2188)
[ 1734.499876] Stack: (0xd6ba3f90 to 0xd6ba4000)
[ 1734.499889] 3f80:                                     01de4bc8 01de4b84 01de4b84 00000003
[ 1734.499904] 3fa0: 00000000 d6ba3fb0 c00102d4 c0013abc 00001000 01b2ae20 00001000 00000000
[ 1734.499919] 3fc0: 01de4bc8 01de4b84 01de4b84 00000003 00000000 00c8d1e8 beef88a0 beef87a4
[ 1734.499934] 3fe0: 00000000 beef8628 beef8604 b6d11594 80000010 00000010 00000000 00000000
[ 1734.499967] [<c0013ae0>] (do_work_pending) from [<c00102d4>] (slow_work_pending+0xc/0x20)
[ 1734.499987] Code: e1a06002 eb02ecf0 ea000007 eb1c44ca (f10c0080)
[ 1734.500000] ---[ end trace 905f8b6d7aa9c8b1 ]---
[ 3521.563116] export_store: invalid GPIO 128
[ 3523.271624] export_store: invalid GPIO 128
[10520.696964] export_store: invalid GPIO 128


[   40.149088] IPv6: ADDRCONF(NETDEV_UP): br-ac30c84a0aa1: link is not ready
[   65.446263] export_store: invalid GPIO 128
[   68.102121] ip6_tables: (C) 2000-2006 Netfilter Core Team
[  563.760578] Unable to handle kernel NULL pointer dereference at virtual address 000002f4
[  563.760601] pgd = ca42c000
[  563.760611] [000002f4] *pgd=00000000
[  563.760628] Internal error: Oops: 5 [#1] ARM
[  563.760638] Modules linked in: ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink br_netfilter bnep xt_owner hci_uart btbcm serdev bluetooth ecdh_generic i2c_dev spidev brcmfmac ftdi_sio brcmutil usbserial cfg80211 rfkill snd_bcm2835(C) snd_pcm snd_timer snd i2c_bcm2835 spi_bcm2835 uio_pdrv_genirq fixed uio sch_fq_codel
[  563.760757] CPU: 0 PID: 1264 Comm: videopdu Tainted: G         C      4.14.98 #1
[  563.760765] Hardware name: BCM2835
[  563.760775] task: d6560de0 task.stack: d6bc2000
[  563.760801] PC is at ret_fast_syscall+0x0/0x28
[  563.760818] LR is at SyS_ioctl+0x58/0x68
[  563.760828] pc : [<c00102a0>]    lr : [<c017f674>]    psr: 20000013
[  563.760836] sp : d6bc3fa8  ip : d6bc3f80  fp : 00000000
[  563.760845] r10: 00000000  r9 : d6bc2000  r8 : c0010444
[  563.760854] r7 : 00000036  r6 : b6e9b274  r5 : 0000c40c  r4 : b6c9d18c
[  563.760862] r3 : d6baf97c  r2 : 00000002  r1 : 00000000  r0 : 00000000
[  563.760873] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  563.760882] Control: 00c5387d  Table: 0a42c008  DAC: 00000055
[  563.760891] Process videopdu (pid: 1264, stack limit = 0xd6bc2188)
[  563.760901] Stack: (0xd6bc3fa8 to 0xd6bc4000)
[  563.760915] 3fa0:                   b6c9d18c 0000c40c 00000004 0000c40c 00006003 beb9c650
[  563.760929] 3fc0: b6c9d18c 0000c40c b6e9b274 00000036 00000000 beb9c774 00000000 ffffffff
[  563.760944] 3fe0: b6c9cf38 beb9c638 b6c8b95c b6ecd4a8 a0000010 00000004 00000000 00000000
[  563.760965] Code: 00000000 00000000 00000000 00000000 (e5ad0008) 
[  563.761021] ---[ end trace 4c1d8b90a7723bec ]---
[  773.837264] export_store: invalid GPIO 128



[ 1704.429506] Unable to handle kernel NULL pointer dereference at virtual address 00000084
[ 1704.429536] pgd = c0004000
[ 1704.429547] [00000084] *pgd=00000000
[ 1704.429567] Internal error: Oops: 5 [#2] ARM
[ 1704.429578] Modules linked in: ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink br_netfilter bnep xt_owner hci_uart btbcm serdev bluetooth ecdh_generic i2c_dev spidev brcmfmac ftdi_sio brcmutil usbserial cfg80211 rfkill snd_bcm2835(C) snd_pcm snd_timer snd i2c_bcm2835 spi_bcm2835 uio_pdrv_genirq fixed uio sch_fq_codel
[ 1704.429709] CPU: 0 PID: 522 Comm: jbd2/mmcblk0p6- Tainted: G      D  C      4.14.98 #1
[ 1704.429718] Hardware name: BCM2835
[ 1704.429729] task: c0c94560 task.stack: c0d58000
[ 1704.429757] PC is at elv_iosched_show+0xc8/0x1c8
[ 1704.429769] LR is at elv_iosched_show+0xb8/0x1c8
[ 1704.429778] pc : [<c0383e74>]    lr : [<c0383e64>]    psr: a0000093
[ 1704.429788] sp : c0d59c60  ip : c0d59c60  fp : c0d59c84
[ 1704.429796] r10: c0d59c84  r9 : c1218888  r8 : 00000000
[ 1704.429805] r7 : 00000000  r6 : c0d59c84  r5 : d619cba0  r4 : c121a738
[ 1704.429815] r3 : 00000000  r2 : 00000000  r1 : c121a7a1  r0 : ffffffff
[ 1704.429827] Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[ 1704.429836] Control: 00c5387d  Table: 1646c008  DAC: 00000055
[ 1704.429845] Process jbd2/mmcblk0p6- (pid: 522, stack limit = 0xc0d58188)
[ 1704.429857] Stack: (0xc0d59c60 to 0xc0d5a000)
[ 1704.429873] 9c60: d619cba0 c11c1028 d6a3fa00 00000003 00000000 00000000 c0d59cbc c0d59c88
[ 1704.429889] 9c80: c038c02c c03835fc d7d33330 d650d5a0 c0d59cd4 d6a3fa00 00000000 df16dd33
[ 1704.429906] 9ca0: c0d59cc4 ffffffff c11c1028 d619cba0 c0d59d04 c0d59cc0 c038a1d8 c038bd24
[ 1704.429921] 9cc0: 00400000 00000000 00000000 00000000 00000000 df16dd33 c0d59d34 d650d5a0
[ 1704.429935] 9ce0: c11c1028 00060801 00060800 00000008 00000000 d578cdc0 c0d59d74 c0d59d08
[ 1704.429950] 9d00: c038a3c0 c038a0ec c0d59d74 c9e5c800 d7179ac0 00000001 01400000 00000000
[ 1704.429965] 9d20: c11c1028 d578cdc0 c0d59d74 c0d59d38 c0380828 c03807c0 c0d59d64 c01a494c
[ 1704.429984] 9d40: c0d59d74 df16dd33 c01a494c c9e5c800 d650d5a0 00060801 00060800 00000000
[ 1704.430000] 9d60: 00000000 d578cdc0 c0d59d9c c0d59d78 c01a4c30 c038a370 d72aa400 c0d59de0
[ 1704.430017] 9d80: c9e5c800 c11c1028 c0d59e94 ffffffff c0d59db4 c0d59da0 c01a5500 c01a4a9c
[ 1704.430031] 9da0: 00000000 c0d59de0 c0d59e54 c0d59db8 c0254d60 c01a54e8 5d081c60 00000000
[ 1704.430047] 9dc0: 2822c20e c11c0760 c0d59dfc c0d59dd8 c0725358 c0057670 c11c1028 df16dd33
[ 1704.430062] 9de0: 00000000 d6e1e98c 00000000 c9ea752c c0d59e3c c0d59e00 c07253f4 c07252b8
[ 1704.430077] 9e00: c0d59dfc c9ea7500 00000002 c02533f0 c0d59e44 c0d59e20 c02533f0 c02524bc
[ 1704.430092] 9e20: d72aa400 d6e1e960 00000000 df16dd33 d72aa400 d6e1e960 00000000 d6e1e98c
[ 1704.430107] 9e40: 00000000 d578cdec c0d59f2c c0d59e58 c0256000 c0254b00 00020beb 00000000
[ 1704.430122] 9e60: d6944f7e 0000018c c0d59eb8 00000000 d6e1e9b4 d6e1e960 ffffffff 0000000c
[ 1704.430136] 9e80: 00000000 00000fd8 d629a000 c77b7028 c0d59eb4 00000000 00000658 0000d1f2
[ 1704.430151] 9ea0: 00020beb 00000000 c0d59ea8 c0d59ea8 c0d59eb0 c0d59eb0 c0d59eb8 c0d59eb8
[ 1704.430166] 9ec0: c0d59ec0 c0d59ec0 c0d59ec8 c0d59ec8 c0072d58 c00ceea8 00000000 00000000
[ 1704.430180] 9ee0: 00000001 00000000 00000000 00022498 c0d59f1c 00000001 00000002 df16dd33
[ 1704.430195] 9f00: 40000113 d72aa400 00000000 00000000 ffffe000 d72aa5fc d72aa438 c11d1bc0
[ 1704.430210] 9f20: c0d59f74 c0d59f30 c025b5b0 c0254eec 00000000 c0c94560 c0057c80 c0d59f3c
[ 1704.430226] 9f40: c0d59f3c df16dd33 c025b4dc c0c5dbc0 d62f4a40 00000000 c0d58000 d72aa400
[ 1704.430241] 9f60: c025b4dc c0ef9c28 c0d59fac c0d59f78 c0041590 c025b4e8 c0c5dbd8 c0c5dbd8
[ 1704.430255] 9f80: 00000000 d62f4a40 c0041440 00000000 00000000 00000000 00000000 00000000
[ 1704.430269] 9fa0: 00000000 c0d59fb0 c001034c c004144c 00000000 00000000 00000000 00000000
[ 1704.430283] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 1704.430296] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[ 1704.430321] Code: e51b2030 e3500000 0a00002d e5d4207c (e5981084) 
[ 1704.430336] ---[ end trace 4c1d8b90a7723bed ]---


[ 5107.696660] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
[ 5107.696724] pgd = ca4b4000
[ 5107.696735] [0000000c] *pgd=0a48b831, *pte=00000000, *ppte=00000000
[ 5107.696762] Internal error: Oops: 17 [#3] ARM
[ 5107.696773] Modules linked in: ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink br_netfilter bnep xt_owner hci_uart btbcm serdev bluetooth ecdh_generic i2c_dev spidev brcmfmac ftdi_sio brcmutil usbserial cfg80211 rfkill snd_bcm2835(C) snd_pcm snd_timer snd i2c_bcm2835 spi_bcm2835 uio_pdrv_genirq fixed uio sch_fq_codel
[ 5107.696904] CPU: 0 PID: 1256 Comm: node Tainted: G      D  C      4.14.98 #1
[ 5107.696912] Hardware name: BCM2835
[ 5107.696921] task: d6bf0000 task.stack: d6bec000
[ 5107.696960] PC is at do_work_pending+0x30/0xf0
[ 5107.696993] LR is at finish_task_switch+0x5c/0x1d8
[ 5107.697003] pc : [<c0013ae0>]    lr : [<c00495f8>]    psr: 60000013
[ 5107.697011] sp : d6bedf90  ip : d6bedee0  fp : d6bedfac
[ 5107.697021] r10: 017bd06c  r9 : d6bec000  r8 : 00000000
[ 5107.697029] r7 : d6bedfb0  r6 : 00000000  r5 : ffffe000  r4 : 00000002
[ 5107.697038] r3 : 00000000  r2 : 5a70d99e  r1 : 00000002  r0 : c11cc8c8
[ 5107.697049] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[ 5107.697058] Control: 00c5387d  Table: 0a4b4008  DAC: 00000055
[ 5107.697068] Process node (pid: 1256, stack limit = 0xd6bec188)
[ 5107.697080] Stack: (0xd6bedf90 to 0xd6bee000)
[ 5107.697095] df80:                                     b4b32e8c 20000010 ffffffff 00c5387d
[ 5107.697112] dfa0: 00000000 d6bedfb0 c00102d4 c0013abc 0000000e 5ec3dba5 54c94699 550883b5
[ 5107.697128] dfc0: 00000160 017d7030 3c108261 3bb9d289 00006902 00000420 017bd06c bea0db54
[ 5107.697143] dfe0: 54c94699 bea0db3c 5113f07c b4b32e8c 20000010 ffffffff 00000000 00000000
[ 5107.697179] [<c0013ae0>] (do_work_pending) from [<c00102d4>] (slow_work_pending+0xc/0x20)
[ 5107.697201] Code: e1a06002 eb02ecf0 ea000007 eb1c44ca (f10c0080) 
[ 5107.697214] ---[ end trace 4c1d8b90a7723bee ]---
bash-5.0#

Hi Toby,
Typically, once you have had a kernel oops you cannot be sure that your system is still in a good state. A later oops, although occurring in a different process, might still be caused by the first one. Does the oops originating in your app always occur first?
Kind Regards
Thomas

@samothx Those are three separate oopses, from different runs, rebooting between each. I agree, once an oops has occurred, the system is not reliable.

How can I debug this oops with a Balena system to get more information on the root cause? This is preventing me from deploying with customers.

Hi Toby,
Unfortunately you can only make so much sense of kernel stack traces, and I am not an expert at it.
If the oopses you posted are each the first of their run, and your container is not running a node process, the problem could be anywhere, even unrelated to your container.
My strategy at this point would be to get more insight into what triggers the oops.

  • How is it connected to your application? Can you shut down components of your app, and does that keep the oops from occurring?
  • Does it occur on older releases of your app? What was changed?
  • Is it connected to the device or hardware? Does it happen on a different device or device type?
  • Is it connected to the Balena OS release? What release of Balena are you running?

Unfortunately all of this is rather time-consuming…
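
To compare reports across runs while working through these questions, it may help to reduce each oops to a few header fields. Below is a minimal Python sketch, not a definitive tool. The function name summarize_oopses and the field names are made up for illustration, and the regular expressions only assume the line formats seen in the 4.14.98 ARM logs posted above. The reading of the taint letters follows the kernel's tainted-kernels documentation, where D means the kernel has already had an oops or BUG since boot.

```python
import re
from typing import Dict, List

# Patterns modelled on the header lines in the logs posted in this thread.
# They are assumptions about this kernel's output, not a general oops parser.
FAULT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] Unable to handle kernel (.+?) at virtual address ([0-9a-f]+)"
)
OOPS_RE = re.compile(r"Internal error: Oops: ([0-9a-f]+) \[#(\d+)\]")
COMM_RE = re.compile(r"Comm: (\S+) Tainted: (.+?)\s+\d+\.\d+\.\d+")
PC_RE = re.compile(r"PC is at (\S+)")
LR_RE = re.compile(r"LR is at (\S+)")


def summarize_oopses(log_text: str) -> List[Dict[str, str]]:
    """Return one dict per oops report found in saved dmesg output."""
    reports: List[Dict[str, str]] = []
    current = None
    for line in log_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            # A new report starts with the "Unable to handle kernel ..." line.
            current = {"time": m.group(1), "fault": m.group(2), "address": m.group(3)}
            reports.append(current)
            continue
        if current is None:
            continue  # ignore lines before the first report
        m = OOPS_RE.search(line)
        if m:
            current["code"], current["seq"] = m.group(1), m.group(2)
        m = COMM_RE.search(line)
        if m:
            current["comm"] = m.group(1)
            # Collapse the padded flag field, e.g. "G      D  C" becomes "GDC".
            current["taint"] = "".join(m.group(2).split())
        m = PC_RE.search(line)
        if m:
            current["pc"] = m.group(1)
        m = LR_RE.search(line)
        if m:
            current["lr"] = m.group(1)
    return reports
```

Running it over a saved log, for example summarize_oopses(open("dmesg.txt").read()) with dmesg.txt being whatever file the output was saved to, gives one record per report. Comparing the pc, comm, seq and taint fields across runs shows whether the same code path keeps faulting and whether a report is the first of its boot. In the logs above, the reports numbered #2 and #3 carry the D flag, so under that reading they would not be the first oops of their boot.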