Dear Linux folks,
On a Dell XPS 13 9360/0596KF, BIOS 2.21.0 06/02/2022 with Debian
sid/unstable, at least Linux 6.8-rc7 and 6.9-rc1+ log the error below,
when unplugging a USB headset from the USB Type A port of the laptop.
usb 1-1: USB disconnect, device number 10
xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of
current TD ep_index 1 comp_code 1
xhci_hcd 0000:00:14.0: Looking for event-dma 0000000128897080
trb-start 0000000128897090 trb-end 0000000128897090 seg-start
0000000128897000 seg-end 0000000128897ff0
[40189.694083] xhci_hcd 0000:00:14.0: Transfer error for slot 12 ep
4 on endpoint
[40189.694191] xhci_hcd 0000:00:14.0: Port change event, 1-1, id 1,
portsc: 0x202a0
[40189.694208] xhci_hcd 0000:00:14.0: handle_port_status: starting
usb1 port polling.
[40189.694298] hub 1-0:1.0: state 7 ports 12 chg 0000 evt 0002
[40189.694357] xhci_hcd 0000:00:14.0: Get port status 1-1 read:
0x202a0, return 0x10100
[40189.694468] xhci_hcd 0000:00:14.0: clear port1 connect change,
portsc: 0x2a0
[40189.694531] usb usb1-port1: status 0100, change 0001, 12 Mb/s
[40189.694549] usb 1-1: USB disconnect, device number 13
[40189.694560] usb 1-1: unregistering device
[40189.694569] usb 1-1: unregistering interface 1-1:1.0
[40189.694880] xhci_hcd 0000:00:14.0: Transfer error for slot 12 ep
4 on endpoint
[40189.694958] usb 1-1: cannot submit urb (err = -19)
[40189.694983] xhci_hcd 0000:00:14.0: Cancel URB 0000000088a88b25,
dev 1, ep 0x82, starting at offset 0x123666540
[40189.695045] xhci_hcd 0000:00:14.0: // Ding dong!
[40189.695057] xhci_hcd 0000:00:14.0: Cancel URB 00000000177d21ae,
dev 1, ep 0x82, starting at offset 0x123666550
[40189.695071] xhci_hcd 0000:00:14.0: Cancel URB 0000000019497848,
dev 1, ep 0x82, starting at offset 0x123666560
[40189.695084] xhci_hcd 0000:00:14.0: Cancel URB 00000000daf20e4f,
dev 1, ep 0x82, starting at offset 0x123666570
[40189.695097] xhci_hcd 0000:00:14.0: Cancel URB 00000000b36dc96a,
dev 1, ep 0x82, starting at offset 0x123666580
[40189.695109] xhci_hcd 0000:00:14.0: Cancel URB 00000000832a286d,
dev 1, ep 0x82, starting at offset 0x123666590
[40189.695123] xhci_hcd 0000:00:14.0: Cancel URB 00000000f343e418,
dev 1, ep 0x82, starting at offset 0x1236664f0
[40189.695135] xhci_hcd 0000:00:14.0: Cancel URB 00000000c2568709,
dev 1, ep 0x82, starting at offset 0x123666500
[40189.695148] xhci_hcd 0000:00:14.0: Cancel URB 00000000fe8ee0be,
dev 1, ep 0x82, starting at offset 0x123666510
[40189.695161] xhci_hcd 0000:00:14.0: Cancel URB 0000000071a78e52,
dev 1, ep 0x82, starting at offset 0x123666520
[40189.695173] xhci_hcd 0000:00:14.0: Cancel URB 00000000c1d48999,
dev 1, ep 0x82, starting at offset 0x123666530
[40189.695192] xhci_hcd 0000:00:14.0: Stopped on Transfer TRB for
slot 12 ep 4
[40189.695206] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x123666540 (dma) in stream 0 URB 0000000088a88b25
[40189.695219] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x123666550 (dma) in stream 0 URB 00000000177d21ae
[40189.695231] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x123666560 (dma) in stream 0 URB 0000000019497848
[40189.695242] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x123666570 (dma) in stream 0 URB 00000000daf20e4f
[40189.695252] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x123666580 (dma) in stream 0 URB 00000000b36dc96a
[40189.695263] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x123666590 (dma) in stream 0 URB 00000000832a286d
[40189.695274] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x1236664f0 (dma) in stream 0 URB 00000000f343e418
[40189.695285] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x123666500 (dma) in stream 0 URB 00000000c2568709
[40189.695296] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x123666510 (dma) in stream 0 URB 00000000fe8ee0be
[40189.695307] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x123666520 (dma) in stream 0 URB 0000000071a78e52
[40189.695317] xhci_hcd 0000:00:14.0: Removing canceled TD starting
at 0x123666530 (dma) in stream 0 URB 00000000c1d48999
[40189.695360] xhci_hcd 0000:00:14.0: Set TR Deq ptr 0x123666500,
cycle 1
[40189.695371] xhci_hcd 0000:00:14.0: // Ding dong!
[40189.695379] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Giveback cancelled URB 0000000088a88b25 TD
[40189.695420] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Giveback cancelled URB 00000000177d21ae TD
[40189.695456] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Giveback cancelled URB 0000000019497848 TD
[40189.695492] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Giveback cancelled URB 00000000daf20e4f TD
[40189.695528] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Giveback cancelled URB 00000000b36dc96a TD
[40189.695564] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Giveback cancelled URB 00000000832a286d TD
[40189.695599] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Keep cancelled URB 00000000f343e418 TD as
cancel_status is 2
[40189.695610] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Giveback cancelled URB 00000000c2568709 TD
[40189.695647] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Giveback cancelled URB 00000000fe8ee0be TD
[40189.695682] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Giveback cancelled URB 0000000071a78e52 TD
[40189.695718] xhci_hcd 0000:00:14.0:
xhci_giveback_invalidated_tds: Giveback cancelled URB 00000000c1d48999 TD
[40189.695782] xhci_hcd 0000:00:14.0: Successful Set TR Deq Ptr
cmd, deq = @123666500
[40189.695793] xhci_hcd 0000:00:14.0: xhci_handle_cmd_set_deq:
Giveback cancelled URB 00000000f343e418 TD
[40189.695879] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA
ptr not part of current TD ep_index 1 comp_code 1
[40189.695893] xhci_hcd 0000:00:14.0: Looking for event-dma
0000000128897aa0 trb-start 0000000128897ab0 trb-end 0000000128897ab0
seg-start 0000000128897000 seg-end 0000000128897ff0
[40189.695932] xhci_hcd 0000:00:14.0: Transfer error for slot 12 ep
14 on endpoint
[40189.695967] xhci_hcd 0000:00:14.0: Soft-reset ep 14, slot 12
[40189.695978] xhci_hcd 0000:00:14.0: // Ding dong!
[40189.695993] xhci_hcd 0000:00:14.0: Ignoring reset ep completion
code of 1
[40189.696047] usb 1-1: Unable to submit urb #2: -19 at
snd_usb_queue_pending_output_urbs
[40189.696065] xhci_hcd 0000:00:14.0: Cancel URB 00000000b476b73a,
dev 1, ep 0x1, starting at offset 0x128897ac0
[40189.696113] xhci_hcd 0000:00:14.0: // Ding dong!
[40189.696124] xhci_hcd 0000:00:14.0: Cancel URB 000000005ef222e5,
dev 1, ep 0x1, starting at offset 0x128897b20
[40189.696141] xhci_hcd 0000:00:14.0: Stopped on Transfer TRB for
slot 12 ep 1
There is no error doing, when connecting the headset to the USB Type A
port of the Dell DA300 port adapter connected over USB Type C port.
As the errors show up in the logs, can I address this somehow on my side?
Kind regards,
Paul
Hi,
> On a Dell XPS 13 9360/0596KF, BIOS 2.21.0 06/02/2022 with Debian
> sid/unstable, at least Linux 6.8-rc7 and 6.9-rc1+ log the error
> below, when unplugging a USB headset from the USB Type A port of the
> laptop.
Can you reproduce this on other computers or with other USB devices?
(I just tested a few random audio dongles on my PC, nothing happens).
Is there any kernel version known to be free of this problem?
> xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part
> of current TD ep_index 1 comp_code 1
> xhci_hcd 0000:00:14.0: Looking for event-dma 0000000128897080
> trb-start 0000000128897090 trb-end 0000000128897090 seg-start
> 0000000128897000 seg-end 0000000128897ff0
This looks like the xHCI driver freed data structures related to a USB
transfer concurrently with the hardware executing this transfer. It's
a bad thing which should never happen, but sometimes it does because of
bugs in USB controllers or the driver.
Can you run patched kernels on the affected laptop? I attach a patch
I used to track down similar issues on UVC webcams. It doesn't fix
anything, but adds some dmesg noise which may be helpful in figuring
out what goes wrong. The patch should work on v6.9-rc1 or -rc2.
Regards,
Michal
Dear Michał,
Thank you for your reply.
Am 05.04.24 um 11:32 schrieb Michał Pecio:
>> On a Dell XPS 13 9360/0596KF, BIOS 2.21.0 06/02/2022 with Debian
>> sid/unstable, at least Linux 6.8-rc7 and 6.9-rc1+ log the error
>> below, when unplugging a USB headset from the USB Type A port of the
>> laptop.
>
> Can you reproduce this on other computers or with other USB devices?
I have not tried it on another GNU/Linux system yet. It does *not*
happen with a Plantronics Blackwire 315.1 USB headset.
> (I just tested a few random audio dongles on my PC, nothing happens).
>
> Is there any kernel version known to be free of this problem?
Unfortunately no. It is reproducible with Debian’s Linux kernel 6.7.9-2.
But in the available logs I found a 6.8+ (“6.9-rc0”) boot with no such
error messages, but I forgot to save them between reboots now, and lost
them. Sorry.
>> xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
>> xhci_hcd 0000:00:14.0: Looking for event-dma 0000000128897080 trb-start 0000000128897090 trb-end 0000000128897090 seg-start 0000000128897000 seg-end 0000000128897ff0
>
> This looks like the xHCI driver freed data structures related to a USB
> transfer concurrently with the hardware executing this transfer. It's
> a bad thing which should never happen, but sometimes it does because of
> bugs in USB controllers or the driver.
Interesting. Thank you for the analysis.
> Can you run patched kernels on the affected laptop? I attach a patch
> I used to track down similar issues on UVC webcams. It doesn't fix
> anything, but adds some dmesg noise which may be helpful in figuring
> out what goes wrong. The patch should work on v6.9-rc1 or -rc2.
Thank you. Please find the log attached with the patch applied on:
$ git log --no-decorate --oneline -2
627920e87b52 usb: xhci: Add debug prints
6c6e47d69d82 Merge tag 'firewire-fixes-6.9-rc2' of
git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Kind regards,
Paul
> > Is there any kernel version known to be free of this problem?
>
> Unfortunately no. It is reproducible with Debian’s Linux kernel
> 6.7.9-2. But in the available logs I found a 6.8+ (“6.9-rc0”) boot
> with no such error messages, but I forgot to save them between
> reboots now, and lost them. Sorry.
Maybe it's a random error which doesn't occur on every disconnection.
Case in point, it didn't happen when you recorded the log today.
Please try again until you catch it happening, then post the log
showing "ERROR Transfer event blah blah" and the debug messages.
Dear Michał,
Thank you again for your prompt reply even on the weekend.
Am 06.04.24 um 18:36 schrieb Michał Pecio:
>>> Is there any kernel version known to be free of this problem?
>>
>> Unfortunately no. It is reproducible with Debian’s Linux kernel
>> 6.7.9-2. But in the available logs I found a 6.8+ (“6.9-rc0”) boot
>> with no such error messages, but I forgot to save them between
>> reboots now, and lost them. Sorry.
>
> Maybe it's a random error which doesn't occur on every disconnection.
>
> Case in point, it didn't happen when you recorded the log today.
Indeed, I just looked for red lines, and the logs only contained.
usb 1-1: Unable to submit urb #2: -19 at
snd_usb_queue_pending_output_urbs
(Should I report a separate issue for this?)
> Please try again until you catch it happening, then post the log
> showing "ERROR Transfer event blah blah" and the debug messages.
No, I was more vigilant and was able to reproduce it.
Kind regards,
Paul
This (and the absence of any earlier errors on the endpoint) looks
like the hardware may be confirming a "successful" transfer twice or
the driver may be processing one such confirmation twice.
[ 94.088594] usb 1-2: USB disconnect, device number 8
[ 94.089370] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
[ 94.089403] xhci_hcd 0000:00:14.0: Looking for event-dma 00000001250310f0 trb-start 0000000125031100 trb-end 0000000125031100 seg-start 0000000125031000 seg-end 0000000125031ff0
[ 94.089427] xhci_hcd 0000:00:14.0: last xhci_td_cleanup: first_dma 1250310f0 last_dma 1250310f0 status -115 from finish_td
(I say "successful" but it really isn't - the device is no longer
listening. But there is no delivery confirmation on isochronous OUT
endpoints so the xHC doesn't suspect anything.)
Could you try again with this updated debug patch to get more info?
Regards,
Michal
On 7.4.2024 15.25, Michał Pecio wrote:
> This (and the absence of any earlier errors on the endpoint) looks
> like the hardware may be confirming a "successful" transfer twice or
> the driver may be processing one such confirmation twice.
It's also possible this TD/TRB was cancelled due to the disconnect.
Could be that even if driver removes the TD from the list and cleans out the TRB
from the ring buffer (turns TRB to no-op) hardware may have read ahead and cached the TRB,
and process it anyway.
>
> [ 94.088594] usb 1-2: USB disconnect, device number 8
> [ 94.089370] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
> [ 94.089403] xhci_hcd 0000:00:14.0: Looking for event-dma 00000001250310f0 trb-start 0000000125031100 trb-end 0000000125031100 seg-start 0000000125031000 seg-end 0000000125031ff0
> [ 94.089427] xhci_hcd 0000:00:14.0: last xhci_td_cleanup: first_dma 1250310f0 last_dma 1250310f0 status -115 from finish_td
>
> (I say "successful" but it really isn't - the device is no longer
> listening. But there is no delivery confirmation on isochronous OUT
> endpoints so the xHC doesn't suspect anything.)
>
> Could you try again with this updated debug patch to get more info?
Would also be helpful to add xhci dynamic debug and xhci tracing (two separate logs)
These will show in detail everything that is going on.
Steps:
mount -t debugfs none /sys/kernel/debug
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
< Reproduce issue >
Send output of dmesg
Send content of /sys/kernel/debug/tracing/trace
please copy the /sys/kernel/debug/tracing/trace file somewhere as soon
as possible after reproducing the issue. It grows fast.
Thanks
Mathias
[one addition]
Am 08.04.24 um 18:37 schrieb Paul Menzel:
> Dear Mathias, dear Michał,
>
>
> Thank you very much for your assistance.
>
>
> Am 08.04.24 um 09:17 schrieb Mathias Nyman:
>> On 7.4.2024 15.25, Michał Pecio wrote:
>>> This (and the absence of any earlier errors on the endpoint) looks
>>> like the hardware may be confirming a "successful" transfer twice or
>>> the driver may be processing one such confirmation twice.
>>
>> It's also possible this TD/TRB was cancelled due to the disconnect.
>> Could be that even if driver removes the TD from the list and cleans
>> out the TRB from the ring buffer (turns TRB to no-op) hardware may
>> have read ahead and cached the TRB, and process it anyway.
>>
>>> [ 94.088594] usb 1-2: USB disconnect, device number 8
>>> [ 94.089370] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
>>> [ 94.089403] xhci_hcd 0000:00:14.0: Looking for event-dma 00000001250310f0 trb-start 0000000125031100 trb-end 0000000125031100 seg-start 0000000125031000 seg-end 0000000125031ff0
>>> [ 94.089427] xhci_hcd 0000:00:14.0: last xhci_td_cleanup: first_dma 1250310f0 last_dma 1250310f0 status -115 from finish_td
>>>
>>> (I say "successful" but it really isn't - the device is no longer
>>> listening. But there is no delivery confirmation on isochronous OUT
>>> endpoints so the xHC doesn't suspect anything.)
>>>
>>> Could you try again with this updated debug patch to get more info?
>>
>> Would also be helpful to add xhci dynamic debug and xhci tracing (two
>> separate logs). These will show in detail everything that is going on.
>>
>> Steps:
>>
>> mount -t debugfs none /sys/kernel/debug
>> echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
>> echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
>> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
>> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
>> echo 1 > /sys/kernel/debug/tracing/tracing_on
>> < Reproduce issue >
>> Send output of dmesg
>> Send content of /sys/kernel/debug/tracing/trace
>>
>> please copy the /sys/kernel/debug/tracing/trace file somewhere as soon
>> as possible after reproducing the issue. It grows fast.
>
> For posterity I created Linux Kernel Bugzilla issue #218695 [1], and
> attached the files there. Hopefully everything was captured, that you
> need. The discussion could continue on this list, but do as it suits you
> best.
Just to clarify, although there are six error log lines I only plugged
and unplugged the USB headset once.
> Kind regards,
>
> Paul
>
>
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=218695
Dear Mathias, dear Michał,
Thank you very much for your assistance.
Am 08.04.24 um 09:17 schrieb Mathias Nyman:
> On 7.4.2024 15.25, Michał Pecio wrote:
>> This (and the absence of any earlier errors on the endpoint) looks
>> like the hardware may be confirming a "successful" transfer twice or
>> the driver may be processing one such confirmation twice.
>
> It's also possible this TD/TRB was cancelled due to the disconnect.
> Could be that even if driver removes the TD from the list and cleans out
> the TRB from the ring buffer (turns TRB to no-op) hardware may have read
> ahead and cached the TRB, and process it anyway.
>
>> [ 94.088594] usb 1-2: USB disconnect, device number 8
>> [ 94.089370] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
>> [ 94.089403] xhci_hcd 0000:00:14.0: Looking for event-dma 00000001250310f0 trb-start 0000000125031100 trb-end 0000000125031100 seg-start 0000000125031000 seg-end 0000000125031ff0
>> [ 94.089427] xhci_hcd 0000:00:14.0: last xhci_td_cleanup: first_dma 1250310f0 last_dma 1250310f0 status -115 from finish_td
>>
>> (I say "successful" but it really isn't - the device is no longer
>> listening. But there is no delivery confirmation on isochronous OUT
>> endpoints so the xHC doesn't suspect anything.)
>>
>> Could you try again with this updated debug patch to get more info?
>
> Would also be helpful to add xhci dynamic debug and xhci tracing (two
> separate logs). These will show in detail everything that is going on.
>
> Steps:
>
> mount -t debugfs none /sys/kernel/debug
> echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
> echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> echo 1 > /sys/kernel/debug/tracing/tracing_on
> < Reproduce issue >
> Send output of dmesg
> Send content of /sys/kernel/debug/tracing/trace
>
> please copy the /sys/kernel/debug/tracing/trace file somewhere as soon
> as possible after reproducing the issue. It grows fast.
For posterity I created Linux Kernel Bugzilla issue #218695 [1], and
attached the files there. Hopefully everything was captured, that you
need. The discussion could continue on this list, but do as it suits you
best.
Kind regards,
Paul
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=218695
> It's also possible this TD/TRB was cancelled due to the disconnect.
> Could be that even if driver removes the TD from the list and cleans
> out the TRB from the ring buffer (turns TRB to no-op) hardware may
> have read ahead and cached the TRB, and process it anyway.
I thought about it, but my debug patch says that the missing TD was
freed by finish_td(), which is called on TDs considered completed by
hardware. A cancelled TD would show giveback_invalidated_tds().
Anyway, we now have new information from the reporter. My v2 patch
keeps a log of the last five events processed on each transfer ring
and dumps the log on TRB mismatch errors.
Unfortunately, it looks like the host controller is broken and signals
completion of those transfers twice. The log below shows two distinct
events for TRB 32959a1c0 and that the coresponding TD has just been
freed by finish_td().
[33819.676314] usb 1-2: USB disconnect, device number 7
[33819.676331] usb 1-2: unregistering device
[33819.676347] usb 1-2: unregistering interface 1-2:1.0
[33819.677861] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
[33819.677886] xhci_hcd 0000:00:14.0: Looking for event-dma 000000032959a1c0 trb-start 000000032959a1d0 trb-end 000000032959a1d0 seg-start 000000032959a000 seg-end 000000032959aff0
[33819.677902] xhci_hcd 0000:00:14.0: last xhci_td_cleanup: first_dma 32959a1c0 last_dma 32959a1c0 status -115 from finish_td
[33819.677915] xhci_hcd 0000:00:14.0: handle_tx_event log -4: event ffff9e19c38eb1f0 ep_trb_dma 32959a190 comp_code 1 len 0 slot 6 ep 1
[33819.677925] xhci_hcd 0000:00:14.0: handle_tx_event log -3: event ffff9e19c38eb200 ep_trb_dma 32959a1a0 comp_code 1 len 0 slot 6 ep 1
[33819.677934] xhci_hcd 0000:00:14.0: handle_tx_event log -2: event ffff9e19c38eb220 ep_trb_dma 32959a1b0 comp_code 1 len 0 slot 6 ep 1
[33819.677941] xhci_hcd 0000:00:14.0: handle_tx_event log -1: event ffff9e19c38eb230 ep_trb_dma 32959a1c0 comp_code 1 len 0 slot 6 ep 1
[33819.677949] xhci_hcd 0000:00:14.0: handle_tx_event log 0: event ffff9e19c38eb260 ep_trb_dma 32959a1c0 comp_code 1 len 0 slot 6 ep 1
At this point we have set the "interesting" flag on this transfer ring,
so the handling of the next event is logged. We observe TD 32959a1d0
completing normally here.
[33819.677965] xhci_hcd 0000:00:14.0: handle_tx_event event ffff9e19c38eb270 ep_trb_dma 32959a1d0 comp_code 1 len 0 slot 6 ep 1
[33819.677979] xhci_hcd 0000:00:14.0: handle_tx_event event ffff9e19c38eb270 first_trb 32959a1d0 last_trb 32959a1d0 ep_seg 32959a000
[33819.677993] xhci_hcd 0000:00:14.0: process_isoc_td event ffff9e19c38eb270 requested 192 ep_trb_len 192 remaining 0
[33819.678004] xhci_hcd 0000:00:14.0: finish_td td_status -115 comp_code 1 frame_status 0 frame_actual_length 192
And then we get a TRB mismatch error on this same TD.
The dumped log shows duplicate events once again.
[33819.678022] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
[33819.678034] xhci_hcd 0000:00:14.0: Looking for event-dma 000000032959a1d0 trb-start 000000032959a1e0 trb-end 000000032959a1e0 seg-start 000000032959a000 seg-end 000000032959aff0
[33819.678047] xhci_hcd 0000:00:14.0: last xhci_td_cleanup: first_dma 32959a1d0 last_dma 32959a1d0 status -115 from finish_td
[33819.678058] xhci_hcd 0000:00:14.0: handle_tx_event log -4: event ffff9e19c38eb220 ep_trb_dma 32959a1b0 comp_code 1 len 0 slot 6 ep 1
[33819.678066] xhci_hcd 0000:00:14.0: handle_tx_event log -3: event ffff9e19c38eb230 ep_trb_dma 32959a1c0 comp_code 1 len 0 slot 6 ep 1
[33819.678074] xhci_hcd 0000:00:14.0: handle_tx_event log -2: event ffff9e19c38eb260 ep_trb_dma 32959a1c0 comp_code 1 len 0 slot 6 ep 1
[33819.678082] xhci_hcd 0000:00:14.0: handle_tx_event log -1: event ffff9e19c38eb270 ep_trb_dma 32959a1d0 comp_code 1 len 0 slot 6 ep 1
[33819.678089] xhci_hcd 0000:00:14.0: handle_tx_event log 0: event ffff9e19c38eb280 ep_trb_dma 32959a1d0 comp_code 1 len 0 slot 6 ep 1
Regards,
Michal
On 8.4.2024 22.05, Michał Pecio wrote:
>> It's also possible this TD/TRB was cancelled due to the disconnect.
>> Could be that even if driver removes the TD from the list and cleans
>> out the TRB from the ring buffer (turns TRB to no-op) hardware may
>> have read ahead and cached the TRB, and process it anyway.
>
> I thought about it, but my debug patch says that the missing TD was
> freed by finish_td(), which is called on TDs considered completed by
> hardware. A cancelled TD would show giveback_invalidated_tds().
>
>
> Anyway, we now have new information from the reporter. My v2 patch
> keeps a log of the last five events processed on each transfer ring
> and dumps the log on TRB mismatch errors.
>
> Unfortunately, it looks like the host controller is broken and signals
> completion of those transfers twice. The log below shows two distinct
> events for TRB 32959a1c0 and that the coresponding TD has just been
> freed by finish_td().
The trace confirms this, we get double completion events for several Isoc TRBs.
These double completions are seen after a transaction error on the same device (different endpoint).
Transfer events for TRB ..a1c0 twice, with a transaction error in between:
<idle>-0 [000] d.h2. 33819.709897: xhci_handle_event: EVENT: TRB 000000032959a1c0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
<idle>-0 [000] d.h2. 33819.709904: xhci_handle_event: EVENT: TRB 000000041547d010 status 'USB Transaction Error' len 4 slot 6 ep 15 type 'Transfer Event' flags e:c
systemd-journal-395 [000] d.H1. 33819.711886: xhci_handle_event: EVENT: TRB 000000032959a1c0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
Transfer events for TRB ..a1d0 twice (the next TRB)
systemd-journal-395 [000] d.H1. 33819.712001: xhci_handle_event: EVENT: TRB 000000032959a1d0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
systemd-journal-395 [000] d.H1. 33819.712059: xhci_handle_event: EVENT: TRB 000000032959a1d0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
Transfer events for TRB ..a1e0 twice
systemd-journal-395 [000] d.H1. 33819.712139: xhci_handle_event: EVENT: TRB 000000032959a1e0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
systemd-journal-395 [000] d.h1. 33819.712871: xhci_handle_event: EVENT: TRB 000000032959a1e0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
etc..
Driver can cope with these extra events, but if this is common we should
probably handle it silently and not concern users with that ERROR message.
We are actually at the moment looking at improving handle_tx_event() with
Niklas (cc), and will take this into consideration.
Thanks
Mathias
Dear Mathias, dear Michał,
Am 09.04.24 um 13:22 schrieb Mathias Nyman:
> On 8.4.2024 22.05, Michał Pecio wrote:
>>> It's also possible this TD/TRB was cancelled due to the disconnect.
>>> Could be that even if driver removes the TD from the list and cleans
>>> out the TRB from the ring buffer (turns TRB to no-op) hardware may
>>> have read ahead and cached the TRB, and process it anyway.
>>
>> I thought about it, but my debug patch says that the missing TD was
>> freed by finish_td(), which is called on TDs considered completed by
>> hardware. A cancelled TD would show giveback_invalidated_tds().
>>
>> Anyway, we now have new information from the reporter. My v2 patch
>> keeps a log of the last five events processed on each transfer ring
>> and dumps the log on TRB mismatch errors.
>>
>> Unfortunately, it looks like the host controller is broken and signals
>> completion of those transfers twice. The log below shows two distinct
>> events for TRB 32959a1c0 and that the coresponding TD has just been
>> freed by finish_td().
>
> The trace confirms this, we get double completion events for several
> Isoc TRBs. These double completions are seen after a transaction
> error on the same device (different endpoint). >
> Transfer events for TRB ..a1c0 twice, with a transaction error in between:
> <idle>-0 [000] d.h2. 33819.709897: xhci_handle_event: EVENT: TRB 000000032959a1c0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
> <idle>-0 [000] d.h2. 33819.709904: xhci_handle_event: EVENT: TRB 000000041547d010 status 'USB Transaction Error' len 4 slot 6 ep 15 type 'Transfer Event' flags e:c
> systemd-journal-395 [000] d.H1. 33819.711886: xhci_handle_event: EVENT: TRB 000000032959a1c0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>
> Transfer events for TRB ..a1d0 twice (the next TRB)
> systemd-journal-395 [000] d.H1. 33819.712001: xhci_handle_event: EVENT: TRB 000000032959a1d0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
> systemd-journal-395 [000] d.H1. 33819.712059: xhci_handle_event: EVENT: TRB 000000032959a1d0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>
> Transfer events for TRB ..a1e0 twice
> systemd-journal-395 [000] d.H1. 33819.712139: xhci_handle_event: EVENT: TRB 000000032959a1e0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
> systemd-journal-395 [000] d.h1. 33819.712871: xhci_handle_event: EVENT: TRB 000000032959a1e0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>
> etc..
>
> Driver can cope with these extra events, but if this is common we should
> probably handle it silently and not concern users with that ERROR message.
Thank you for the detailed analysis. Excuse my ignorance, but do you
have an idea, what this Sennheiser USB headset does differently than
other USB devices? Additionally, is this a known problem with this Intel
xHCI controller, meaning, is there an errata about this problem?
> We are actually at the moment looking at improving handle_tx_event() with
> Niklas (cc), and will take this into consideration.
That would be great. Thank you for working on this.
Kind regards,
Paul
> Driver can cope with these extra events, but if this is common we
> should probably handle it silently and not concern users with that
> ERROR message.
The error message in itself is harmless, it means the driver gets an
event it doesn't know how to handle and ignores it. Further events are
processed normally (in this specific case).
What's problematic is that the controller is apparently still working
on a TD which the driver considers to be finished already. The driver
can overwrite the TD and reuse its data buffer for other transfers,
while the hardware may still need the original TD for proper operation
and, if we are very unlucky, could attempt DMA to/from the data buffer,
causing data corruption or information leak to a malicious USB dongle.
For all we know, Paul's buggy chipset may not only be confirming the
transfer twice, but really performing it twice for some stupid reason.
> We are actually at the moment looking at improving handle_tx_event()
> with Niklas (cc), and will take this into consideration.
Given the number of bugs so far, maybe it would make sense to count
transfer ring slots of the last completed TD as still "in use" until
the next TD is known to at least begin executing.
Unfortunately, "quarantining" URB data buffers in similar manner would
be harder AFAIK.
I recently found one more bug of this kind: the Etron EJ168 controller
produces two events for failed single-TRB isochronous IN transfers -
one event indicating the failure, and then a "success". The full extent
of the bug (does it affect OUT or non-isoch, what happens on multi-TRB)
is unknown because the controller is very prone to crashing under my
workloads, which doesn't help debugging.
Regards,
Michal
On 10.4.2024 10.59, Paul Menzel wrote:
> Dear Mathias, dear Michał,
>
>
> Am 09.04.24 um 13:22 schrieb Mathias Nyman:
>> On 8.4.2024 22.05, Michał Pecio wrote:
>>>> It's also possible this TD/TRB was cancelled due to the disconnect.
>>>> Could be that even if driver removes the TD from the list and cleans
>>>> out the TRB from the ring buffer (turns TRB to no-op) hardware may
>>>> have read ahead and cached the TRB, and process it anyway.
>>>
>>> I thought about it, but my debug patch says that the missing TD was
>>> freed by finish_td(), which is called on TDs considered completed by
>>> hardware. A cancelled TD would show giveback_invalidated_tds().
>>>
>>> Anyway, we now have new information from the reporter. My v2 patch
>>> keeps a log of the last five events processed on each transfer ring
>>> and dumps the log on TRB mismatch errors.
>>>
>>> Unfortunately, it looks like the host controller is broken and signals
>>> completion of those transfers twice. The log below shows two distinct
>>> events for TRB 32959a1c0 and that the coresponding TD has just been
>>> freed by finish_td().
>>
>> The trace confirms this, we get double completion events for several
>> Isoc TRBs. These double completions are seen after a transaction
>> error on the same device (different endpoint). >
>> Transfer events for TRB ..a1c0 twice, with a transaction error in between:
>> <idle>-0 [000] d.h2. 33819.709897: xhci_handle_event: EVENT: TRB 000000032959a1c0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>> <idle>-0 [000] d.h2. 33819.709904: xhci_handle_event: EVENT: TRB 000000041547d010 status 'USB Transaction Error' len 4 slot 6 ep 15 type 'Transfer Event' flags e:c
>> systemd-journal-395 [000] d.H1. 33819.711886: xhci_handle_event: EVENT: TRB 000000032959a1c0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>>
>> Transfer events for TRB ..a1d0 twice (the next TRB)
>> systemd-journal-395 [000] d.H1. 33819.712001: xhci_handle_event: EVENT: TRB 000000032959a1d0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>> systemd-journal-395 [000] d.H1. 33819.712059: xhci_handle_event: EVENT: TRB 000000032959a1d0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>>
>> Transfer events for TRB ..a1e0 twice
>> systemd-journal-395 [000] d.H1. 33819.712139: xhci_handle_event: EVENT: TRB 000000032959a1e0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>> systemd-journal-395 [000] d.h1. 33819.712871: xhci_handle_event: EVENT: TRB 000000032959a1e0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>>
>> etc..
>>
>> Driver can cope with these extra events, but if this is common we should
>> probably handle it silently and not concern users with that ERROR message.
>
> Thank you for the detailed analysis. Excuse my ignorance, but do you have an idea, what this Sennheiser USB headset does differently than other USB devices? Additionally, is this a known problem with this Intel xHCI controller, meaning, is there an errata about this problem?
There are a few related erratas in older 9 series chipsets that possibly
could explain this, but those issues are no longer listed for newer chipsets.
The Sennheiser headset is a full-speed (FS) device that use 192 byte Isoch transfers.
Series 9 chipset xHC has issues with exactly those FS Isoch transfers over 189 bytes, see
" 1. USB Isoch In Transfer Error Issue"
There are some issues related to FS device removal:
" 13. USB Full-/low-speed Device Removal Issue"
And some related to resending transfer events for "chached" TRBs after FS device
disconnect/reconnect.
"25. USB xHCI may Execute a Stale Transfer Request Block (TRB)"
https://www.intel.co.jp/content/dam/www/public/us/en/documents/specification-updates/9-series-chipset-pch-spec-update.pdf
Thanks
Mathias
Dear Mathias,
Thank you for your reply.
Am 11.04.24 um 09:18 schrieb Mathias Nyman:
> On 10.4.2024 10.59, Paul Menzel wrote:
>> Am 09.04.24 um 13:22 schrieb Mathias Nyman:
>>> On 8.4.2024 22.05, Michał Pecio wrote:
>>>>> It's also possible this TD/TRB was cancelled due to the disconnect.
>>>>> Could be that even if driver removes the TD from the list and cleans
>>>>> out the TRB from the ring buffer (turns TRB to no-op) hardware may
>>>>> have read ahead and cached the TRB, and process it anyway.
>>>>
>>>> I thought about it, but my debug patch says that the missing TD was
>>>> freed by finish_td(), which is called on TDs considered completed by
>>>> hardware. A cancelled TD would show giveback_invalidated_tds().
>>>>
>>>> Anyway, we now have new information from the reporter. My v2 patch
>>>> keeps a log of the last five events processed on each transfer ring
>>>> and dumps the log on TRB mismatch errors.
>>>>
>>>> Unfortunately, it looks like the host controller is broken and signals
>>>> completion of those transfers twice. The log below shows two distinct
>>>> events for TRB 32959a1c0 and that the coresponding TD has just been
>>>> freed by finish_td().
>>>
>>> The trace confirms this, we get double completion events for several
>>> Isoc TRBs. These double completions are seen after a transaction
>>> error on the same device (different endpoint). >
>>> Transfer events for TRB ..a1c0 twice, with a transaction error in
>>> between:
>>> <idle>-0 [000] d.h2. 33819.709897: xhci_handle_event: EVENT: TRB 000000032959a1c0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>>> <idle>-0 [000] d.h2. 33819.709904: xhci_handle_event: EVENT: TRB 000000041547d010 status 'USB Transaction Error' len 4 slot 6 ep 15 type 'Transfer Event' flags e:c
>>> systemd-journal-395 [000] d.H1. 33819.711886: xhci_handle_event: EVENT: TRB 000000032959a1c0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>>>
>>> Transfer events for TRB ..a1d0 twice (the next TRB)
>>> systemd-journal-395 [000] d.H1. 33819.712001: xhci_handle_event: EVENT: TRB 000000032959a1d0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>>> systemd-journal-395 [000] d.H1. 33819.712059: xhci_handle_event: EVENT: TRB 000000032959a1d0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>>>
>>> Transfer events for TRB ..a1e0 twice
>>> systemd-journal-395 [000] d.H1. 33819.712139: xhci_handle_event: EVENT: TRB 000000032959a1e0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>>> systemd-journal-395 [000] d.h1. 33819.712871: xhci_handle_event: EVENT: TRB 000000032959a1e0 status 'Success' len 0 slot 6 ep 2 type 'Transfer Event' flags e:c
>>>
>>> etc..
>>>
>>> Driver can cope with these extra events, but if this is common we should
>>> probably handle it silently and not concern users with that ERROR
>>> message.
>>
>> Thank you for the detailed analysis. Excuse my ignorance, but do you
>> have an idea, what this Sennheiser USB headset does differently than
>> other USB devices? Additionally, is this a known problem with this
>> Intel xHCI controller, meaning, is there an errata about this problem?
>
> There are a few related erratas in older 9 series chipsets that possibly
> could explain this, but those issues are no longer listed for newer
> chipsets.
>
> The Sennheiser headset is a full-speed (FS) device that use 192 byte
> Isoch transfers.
>
> Series 9 chipset xHC has issues with exactly those FS Isoch transfers
> over 189 bytes, see
> " 1. USB Isoch In Transfer Error Issue"
>
> There are some issues related to FS device removal:
> " 13. USB Full-/low-speed Device Removal Issue"
>
> And some related to resending transfer events for "cached" TRBs after
> FS device disconnect/reconnect.
> "25. USB xHCI may Execute a Stale Transfer Request Block (TRB)"
>
> https://www.intel.co.jp/content/dam/www/public/us/en/documents/specification-updates/9-series-chipset-pch-spec-update.pdf
Thank you for digging this up. Judging from the document, these erratas
were not addressed in any firmware update.
As another data point, I was able to reproduce this issue with the
Sennheiser USB headset and a Dell XPS 15 7590.
$ lspci -nn | grep USB
00:14.0 USB controller [0c03]: Intel Corporation Cannon Lake PCH
USB 3.1 xHCI Host Controller [8086:a36d] (rev 10)
3a:00.0 USB controller [0c03]: Intel Corporation JHL6340
Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016]
[8086:15db] (rev 02)
I uploaded the logs to *Hardware for Linux* [1].
Kind regards,
Paul
[1]: https://linux-hardware.org/?probe=904c918345
> Thank you for the detailed analysis. Excuse my ignorance, but do you
> have an idea, what this Sennheiser USB headset does differently than
> other USB devices?
It uses isochronous out transfers, which off the top of my head are
almost only used for audio playback in the real world.
However, I don't know what it is doing differently from your other USB
headset. A few random hypotheses:
- the other is not a "full speed" device (but it most likely is)
- you haven't tried enough times to reproduce the bug
- you have some software which automatically starts playback to the
Sennheiser when it's connected, but not to the other headset
- different playback format and different USB packet size, although
192B/ms corresponds to 16 bit 48kHz stereo - the most comomn format.
As for your bugzilla comment and the list of other TRB mismatch errors
on linux-hardware - yes, these things happen due to a variety of bugs
in host controllers and the driver. They are not all the same bug and
they can't be analyzed or fixed if people don't report them and don't
cooperate in debugging. These messages alone don't provide information
about what specifically went wrong and under which circumstances.
Regards,
Michal