Hey folks,
I’ve hit an issue that produces several kernel errors and sporadic crashes. It appears after long-running processes that use serial I/O and an Arducam, and only since updating several RPi 4 devices to balenaOS 2.108.18+rev2.
I also raised an issue on raspberrypi/linux here, which includes further dmesg output and lsusb logs.
Evidence
After long periods of activity I see a number of red flags:
Failed to get suitable pool for 0000:01:00.0
...
xhci_hcd 0000:01:00.0: Ring expansion failed
...
usb 1-1.1: usbfs: usb_submit_urb returned -12
...
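For anyone wanting to check their own devices for the same symptoms, here is a minimal sketch that counts these three messages in captured kernel log lines. The names `SIGNATURES` and `classify_kernel_log` are made up for illustration, and the patterns are only the strings quoted above; note that `-12` is `-ENOMEM`, i.e. the URB submission failed for lack of memory.

```python
import re

# Substrings taken from the kernel messages quoted above.
# The dictionary keys are hypothetical labels, not kernel terminology.
SIGNATURES = {
    "dma_pool": re.compile(r"Failed to get suitable pool"),
    "ring_expansion": re.compile(r"xhci_hcd .*Ring expansion failed"),
    "urb_enomem": re.compile(r"usb_submit_urb returned -12"),  # -12 == -ENOMEM
}


def classify_kernel_log(lines):
    """Count how many lines match each known error signature."""
    hits = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name] += 1
    return hits


if __name__ == "__main__":
    # Sample lines copied from the log excerpt in this post.
    sample = [
        "Failed to get suitable pool for 0000:01:00.0",
        "xhci_hcd 0000:01:00.0: Ring expansion failed",
        "usb 1-1.1: usbfs: usb_submit_urb returned -12",
    ]
    print(classify_kernel_log(sample))
```

In practice the lines could come from saved `dmesg` output rather than the hard-coded sample.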
There is also a debugfs trbs file which appears to show an unhealthy ring state for one device endpoint (262144 records, far more than 512):
> wc -l /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/02/ep04/trbs
262144 /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/02/ep04/trbs
> cut -d ' ' -f2 /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/02/ep04/trbs | sort | uniq -c
257024 Buffer
1024 LINK
4096 type
> head /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/02/ep04/trbs
0x0000000440037000: Buffer 000000043e394000 length 16384 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C
0x0000000440037010: Buffer 000000043e398000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037020: Buffer 000000043e39c000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037030: Buffer 000000043e3a0000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037040: Buffer 000000043e3a4000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037050: Buffer 000000043e3a8000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037060: Buffer 000000043e3ac000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037070: Buffer 000000043e3b0000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037080: Buffer 000000043e3b4000 length 16384 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C
0x0000000440037090: Buffer 000000043e3b8000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
> tail /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/02/ep04/trbs
0x000000041b103f60: Buffer 000000043e380000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:c
0x000000041b103f70: Buffer 000000043e384000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:c
0x000000041b103f80: Buffer 000000043e388000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:c
0x000000041b103f90: Buffer 000000043e38c000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:c
0x000000041b103fa0: Buffer 000000043e390000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:c
0x000000041b103fb0: LINK 0000000440037000 intr 0 type 'Link' flags i:C:T:c
0x000000041b103fc0: type 'UNKNOWN' -> raw 00000000 00000000 00000000 00000000
0x000000041b103fd0: type 'UNKNOWN' -> raw 00000000 00000000 00000000 00000000
0x000000041b103fe0: type 'UNKNOWN' -> raw 00000000 00000000 00000000 00000000
0x000000041b103ff0: type 'UNKNOWN' -> raw 00000000 00000000 00000000 00000000
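The `cut | sort | uniq` summary above groups on the second column, which lumps the zeroed entries under the word "type". As a rough sketch, the same dump can instead be grouped by the quoted TRB type. The function name `summarize_trbs` is hypothetical, and the regular expression assumes the line format shown in the output above.

```python
import re
from collections import Counter

# Matches the quoted TRB type, e.g. type 'Normal', type 'Link', type 'UNKNOWN'.
# Assumes the debugfs line format shown in this post.
TRB_TYPE = re.compile(r"type '([^']+)'")


def summarize_trbs(lines):
    """Count TRB entries by their quoted type; lines without a type are skipped."""
    counts = Counter()
    for line in lines:
        match = TRB_TYPE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the head/tail output above.
    sample = [
        "0x0000000440037000: Buffer 000000043e394000 length 16384 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C",
        "0x000000041b103fb0: LINK 0000000440037000 intr 0 type 'Link' flags i:C:T:c",
        "0x000000041b103fc0: type 'UNKNOWN' -> raw 00000000 00000000 00000000 00000000",
    ]
    for trb_type, count in summarize_trbs(sample).items():
        print(trb_type, count)
```

To run it against a real device, pass the lines of the trbs file (opened as text) to `summarize_trbs` instead of the sample list.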
There were no changes to our internal (IP) software around the time these errors appeared; our “reproduction” has simply required the updated balenaOS.
I am currently working to produce a minimal and self-contained script to reproduce the issue.
Existing fixes?
Interestingly, related issues (see links at end of post) resulted in fixes that were merged into kernel 5.15.y (e.g. this commit). The changes in balenaOS 2.108.18+rev2 seem to reference a kernel bump to 5.15 (from 5.10?), but I’m not sure whether the respective meta-raspberrypi bump to 5.15.34 picks up the referenced fixes.
Some initial responses suggest that a later kernel version might pick up the fixes (see comments), and that these could be incorporated into balenaOS via an updated layers/meta-balena, picking up commit 1d8a12ffa7fb496c2691d4130ec9fc3c3c00a9f7. It is also possible that those fixes don’t address the issue we are seeing at all.
Workaround
Currently, I have a limited understanding of the problem. Simply reverting to a “known good” release, balenaOS 2.105.1+rev1, has so far avoided repeat incidents.
Any advice on how I might identify the underlying issue would be appreciated, thank you!
Edits:
- Noted possible meta-balena commit which could incorporate underlying kernel fixes