USB issues with xHCI ring expansion failures

Hey folks,

I’ve hit an issue that produces several kernel errors and sporadic crashes. It shows up after long-running processes with serial I/O and an Arducam, and only since updating several RPi 4 devices to balenaOS 2.108.18+rev2.

I also raised an issue on raspberrypi/linux here, which includes further dmesg output and lsusb logs.

Evidence

After long periods of activity I see a number of red flags:

Failed to get suitable pool for 0000:01:00.0
...
xhci_hcd 0000:01:00.0: Ring expansion failed
...
usb 1-1.1: usbfs: usb_submit_urb returned -12
...
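
To track how often these messages recur over a long run, something like the following could be used. It is only a rough sketch: the function and pattern names are my own, and the patterns are copied from the three messages quoted above, so they may need adjusting if the wording differs on other kernels.

```python
import re

# Patterns taken from the kernel messages quoted above (names are hypothetical).
PATTERNS = {
    "pool": re.compile(r"Failed to get suitable pool for \S+"),
    "ring_expansion": re.compile(r"xhci_hcd \S+ Ring expansion failed"),
    # -12 is -ENOMEM, i.e. the URB submission failed for lack of memory.
    "urb_enomem": re.compile(r"usbfs: usb_submit_urb returned -12"),
}


def count_xhci_errors(log_text):
    """Count how many log lines match each of the patterns above."""
    counts = {name: 0 for name in PATTERNS}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

On a device, the input could come from something like subprocess.run(["dmesg"], capture_output=True, text=True).stdout, sampled periodically so the counts can be compared against uptime.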

The xHCI debugfs trbs file for one endpoint also appears to be in an unhealthy state: it holds 262144 records, way over the 512 records I would expect:

> wc -l /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/02/ep04/trbs
262144 /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/02/ep04/trbs
> cut -d ' ' -f2 /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/02/ep04/trbs | sort | uniq -c
 257024 Buffer
   1024 LINK
   4096 type
> head /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/02/ep04/trbs
0x0000000440037000: Buffer 000000043e394000 length 16384 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C
0x0000000440037010: Buffer 000000043e398000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037020: Buffer 000000043e39c000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037030: Buffer 000000043e3a0000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037040: Buffer 000000043e3a4000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037050: Buffer 000000043e3a8000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037060: Buffer 000000043e3ac000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037070: Buffer 000000043e3b0000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
0x0000000440037080: Buffer 000000043e3b4000 length 16384 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C
0x0000000440037090: Buffer 000000043e3b8000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:C
> tail /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/02/ep04/trbs
0x000000041b103f60: Buffer 000000043e380000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:c
0x000000041b103f70: Buffer 000000043e384000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:c
0x000000041b103f80: Buffer 000000043e388000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:c
0x000000041b103f90: Buffer 000000043e38c000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:c
0x000000041b103fa0: Buffer 000000043e390000 length 16384 TD size 31 intr 0 type 'Normal' flags b:i:i:C:s:I:e:c
0x000000041b103fb0: LINK 0000000440037000 intr 0 type 'Link' flags i:C:T:c
0x000000041b103fc0: type 'UNKNOWN' -> raw 00000000 00000000 00000000 00000000
0x000000041b103fd0: type 'UNKNOWN' -> raw 00000000 00000000 00000000 00000000
0x000000041b103fe0: type 'UNKNOWN' -> raw 00000000 00000000 00000000 00000000
0x000000041b103ff0: type 'UNKNOWN' -> raw 00000000 00000000 00000000 00000000
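
The same tally can be scripted so it can be logged over time. This is a minimal sketch with hypothetical function names. It assumes the ring uses one LINK entry per segment, which the figures above are consistent with (1024 LINK entries and 262144 records gives 256 records per segment), but I have not confirmed that against the driver source.

```python
from collections import Counter


def summarize_trbs(lines):
    """Count records by their second space-separated field, like `cut -d ' ' -f2`."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


def records_per_segment(counts):
    """Estimate ring layout, assuming one LINK record terminates each segment."""
    total = sum(counts.values())
    segments = counts.get("LINK", 0)
    if segments == 0:
        return None  # no LINK records seen, so the layout cannot be estimated
    return total / segments
```

For example, summarize_trbs(open(path)) with path set to the trbs file shown above should reproduce the Buffer/LINK/type counts, and a growing LINK count between samples would suggest the ring is still expanding.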

There are no internal (IP) software changes around this new set of errors; our “reproduction” has simply required the updated balenaOS.

I am currently working to produce a minimal and self-contained script to reproduce the issue.

Existing fixes?

Interestingly, related issues (see links at end of post) resulted in fixes that were merged into kernel 5.15.y (e.g. this commit). The changes in balenaOS 2.108.18+rev2 seem to reference a kernel bump to 5.15 (from 5.10?), but I’m not sure whether the respective meta-raspberrypi bump to 5.15.34 picks up the referenced fixes.

Some initial responses suggest that a later kernel version might pick up fixes (see comments), which could then be incorporated into balenaOS (via updated layers/meta-balena, picking up commit 1d8a12ffa7fb496c2691d4130ec9fc3c3c00a9f7). It is also possible that those fixes don’t address the issue we are seeing.

Workaround

Currently, I have a limited understanding of the problem. Simply reverting to a “known good”, balenaOS 2.105.1+rev1, has avoided repeated incidents.

Any advice on how I might identify the underlying issue would be appreciated, thank you!


Edits:

  • Noted possible meta-balena commit which could incorporate underlying kernel fixes

Just noting a possibly related thread which suggests a version as late as 6.4 might be in a healthier state. Possibly, @alexgg you might have some thoughts based on your recent kernel bump commits for balenaOS?

Having read through the thread (217242 – CPU hard lockup related to xhci/dma), in which kernel contributors picked up what we believe to be our issue, it looks fairly promising that this could be our fix.

@alexgg what is Balena’s process for updating the kernel for balena-raspberrypi? It seems it’s still tracking pretty far behind.

From the comments in the kernel thread, it seems like this has been patched in v6.4 (and possibly in older stable releases, but I couldn’t trace it to which ones).

Thanks

Hi Ben, there is already an open PR to update to the latest stable head (see linux-raspberrypi: update to latest stable head 5.15.92 by alexgg · Pull Request #1018 · balena-os/balena-raspberrypi · GitHub). Once that is merged and a new release is deployed, please re-test and let us know.

In general, balena-raspberrypi follows meta-raspberrypi at the corresponding Yocto version. In this case kirkstone is using the 5.15 kernel version. The policy is usually to use the latest Long Term Support kernel.

In terms of stable branches updates, we are looking into automating recipe updates so we always use the branch head, but until that happens we have to do it manually and usually on support requests.

Reading through the thread above, it seems the patch you refer to is kernel/git/torvalds/linux.git - Linux kernel source tree and it has just landed upstream.
It has also been backported to 5.15.117 (kernel/git/stable/linux.git - Linux kernel stable tree) but unfortunately it hasn’t yet made it to linux-raspberrypi (History for drivers/usb/host/xhci-ring.c - raspberrypi/linux · GitHub).
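
As a quick sanity check on a device, the running kernel release (as printed by uname -r) can be compared against 5.15.117. This is only a sketch with a hypothetical helper name. A version number says nothing about whether a vendor tree such as linux-raspberrypi actually carries the patch, and the comparison is deliberately limited to the same major.minor series.

```python
import re


def in_series_at_least(release, minimum=(5, 15, 117)):
    """Return True if `release` is in the same major.minor series as `minimum`
    and its patch level is at least that of `minimum`.

    `release` is a string such as "5.15.92-v8"; any suffix after the
    numeric version is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    version = tuple(int(part) if part is not None else 0 for part in match.groups())
    return version[:2] == minimum[:2] and version >= minimum
```

For example, in_series_at_least("5.15.92-v8") is False, which matches the point above that a bump to 5.15.92 alone would not include a backport first released in 5.15.117.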

As you have already raised the problem with the linux-raspberrypi maintainers, we will update as soon as they update the 5.15.y branch.

Thanks @alexgg - I couldn’t find the backport to 5.15.117 myself, so I appreciate that.

Now that we know that’s in place, we’ll chase up on linux-raspberrypi before returning to this!

Thank you for getting back to us @alexgg. Hopefully the bump to 5.15.92 will incorporate some of the possible fixes, but yes, it might be that some later fixes (from the thread) also need to be brought into linux-raspberrypi. It looks like your PR is blocked for some reason? Hopefully that gets resolved soon; it would be nice to test as soon as possible.

Hi @alexgg, we’ve heard back from the team over at raspberrypi-linux, which warrants forwarding the question: have the Balena team considered moving to kernel 6.1?

Hi @alexgg, it looks like the updated kernel PR has been merged - great! Thanks for helping push those changes. It would be useful to get a sense of when the new balenaOS release will be available for testing? What is the typical turnaround/build time? Thanks!

Hi, sorry for the delay in coming back to you all.

The 2.115.18 balenaOS release has been deployed automatically after the merge for those devices that are automatically tested, so I expect it to be available. Please let me know if you are using a device type that has not automatically deployed v2.115.18 and I will look into it. This release contains an update to kernel v5.15.92, which may not address all the issues reported.

As for the update to a 6.x kernel, that is not something we can do as a standalone exercise; it needs to come with a Yocto version update. We are currently using Yocto Kirkstone, which is a Long Term Support release and is supported until April 2026. The v5.15 kernel is also an LTS release, maintained until October 2026.

So unfortunately we are not considering switching to another Yocto version until a new LTS version is announced.

As per the comments in the Linux raspberrypi kernel, the update to 5.15.117 is also not easy as there are multiple conflicts that need resolution and re-testing.

What I suggest as an alternative path is that we attempt to backport just the patch that allegedly solves this issue and release this to our staging environment so that you can perform testing that confirms the issue is addressed.

Is this an acceptable solution?

Thank you for getting back @alexgg!

We are currently testing 2.115.18, and so far haven’t encountered any issues (whereas they are quite easy to reproduce on a balenaOS version running kernel v5.15.34) - fantastic!

But I wouldn’t want to say that we are clear quite yet, and will report back once we are entirely confident – or, if it seems necessary to backport some specific xHCI patches. Given the challenge in rolling onto 6.x, it’s very generous of you to offer some specific backports (but I appreciate this might be quite involved).

Hi, just following up, have you managed to continue testing and gain any more confidence that we’re in the clear?

Thanks for following up @myarmolinsky! So far, we are relatively confident that this has addressed the issue (based on ongoing testing), but of course we’ll be continuing to monitor closely for a bit longer.

Hi Tom, thanks for letting us know.