After a few more weeks of digging, I have come to the tentative
conclusion that either the XHCI driver, or the USB sound driver,
or both, fail to handle USB errors correctly.
I have some questions at the bottom, after a (brief-ish) explanation
of exactly what seems to go wrong.
TL;DR: arecord from a misbehaving device can hang forever
after a USB error, due to poll on /dev/snd/timer never returning.
The details: under some mysterious circumstances, the PCM290x
family sound chips can send more data than expected during an
isochronous transfer, leading to a babble error. Those
circumstances seem to in part depend on the USB host controller
and/or the electrical environment, since the chips work just
fine for most people.
Receiving data past the end of the isochronous transfer window
scheduled for a device results in the XHCI controller throwing
a babble error, which moves the endpoint into halted state.
This is followed by the host controller software sending a
reset endpoint command, and moving the endpoint into stopped
state, as specified on pages 164-165 of the XHCI specification.
However, the USB sound driver seems to have no idea that this
error happened. The function retire_capture_urb looks at the
status of each isochronous frame, but seems to be under the
assumption that the sound device just keeps on running.
The function snd_complete_urb seems to only detect that the
device is not running if usb_submit_urb returns a failure.
err = usb_submit_urb(urb, GFP_ATOMIC);
if (err == 0)
return;
usb_audio_err(ep->chip, "cannot submit urb (err = %d)\n", err);
if (ep->data_subs && ep->data_subs->pcm_substream) {
substream = ep->data_subs->pcm_substream;
snd_pcm_stop_xrun(substream);
}
However, the XHCI driver will happily submit an URB to a
stopped device. Looking at the call trace usb_submit_urb ->
xhci_urb_enqueue -> xhci_queue_isoc_tx_prepare -> prepare_ring,
you can see this code:
/* Make sure the endpoint has been added to xHC schedule */
switch (ep_state) {
...
case EP_STATE_HALTED:
xhci_dbg(xhci, "WARN halted endpoint, queueing URB anyway.\n");
case EP_STATE_STOPPED:
case EP_STATE_RUNNING:
break;
This leads me to a few questions:
- should retire_capture_urb call snd_pcm_stop_xrun,
or another function like it, if it sees certain
errors in the iso frame in the URB?
- should snd_complete_urb do something with these
errors, too, in case they happen on the sync frames
and not the data frames?
- does the XHCI code need to ring the doorbell when
submitting an URB to a stopped device, or is it
always up to the higher-level driver to fully reset
the device before it can do anything useful?
- if a device in stopped state does not do anything
useful, should usb_submit_urb return an error?
- how should the USB sound driver recover from these
occasional and/or one-off errors? stop the sound
stream, or try to reinitialize the device and start
recording again?
I am willing to write patches and can test with my
setup, but both the sound code and the USB code are
new to me so I would like to know what direction I
should go in :)
--
All Rights Reversed.
On Wed, 20 May 2020 13:26:57 +0200,
Rik van Riel wrote:
>
> After a few more weeks of digging, I have come to the tentative
> conclusion that either the XHCI driver, or the USB sound driver,
> or both, fail to handle USB errors correctly.
>
> I have some questions at the bottom, after a (brief-ish) explanation
> of exactly what seems to go wrong.
>
> TL;DR: arecord from a misbehaving device can hang forever
> after a USB error, due to poll on /dev/snd/timer never returning.
>
> The details: under some mysterious circumstances, the PCM290x
> family sound chips can send more data than expected during an
> isochronous transfer, leading to a babble error. Those
> circumstances seem to in part depend on the USB host controller
> and/or the electrical environment, since the chips work just
> fine for most people.
>
> Receiving data past the end of the isochronous transfer window
> scheduled for a device results in the XHCI controller throwing
> a babble error, which moves the endpoint into halted state.
>
> This is followed by the host controller software sending a
> reset endpoint command, and moving the endpoint into stopped
> state, as specified on pages 164-165 of the XHCI specification.
>
> However, the USB sound driver seems to have no idea that this
> error happened. The function retire_capture_urb looks at the
> status of each isochronous frame, but seems to be under the
> assumption that the sound device just keeps on running.
>
> The function snd_complete_urb seems to only detect that the
> device is not running if usb_submit_urb returns a failure.
>
> err = usb_submit_urb(urb, GFP_ATOMIC);
> if (err == 0)
> return;
>
> usb_audio_err(ep->chip, "cannot submit urb (err = %d)\n", err);
>
> if (ep->data_subs && ep->data_subs->pcm_substream) {
> substream = ep->data_subs->pcm_substream;
> snd_pcm_stop_xrun(substream);
> }
>
> However, the XHCI driver will happily submit an URB to a
> stopped device. Looking at the call trace usb_submit_urb ->
> xhci_urb_enqueue -> xhci_queue_isoc_tx_prepare -> prepare_ring,
> you can see this code:
>
> /* Make sure the endpoint has been added to xHC schedule */
> switch (ep_state) {
> ...
> case EP_STATE_HALTED:
> xhci_dbg(xhci, "WARN halted endpoint, queueing URB anyway.\n");
> case EP_STATE_STOPPED:
> case EP_STATE_RUNNING:
> break;
>
> This leads me to a few questions:
> - should retire_capture_urb call snd_pcm_stop_xrun,
> or another function like it, if it sees certain
> errors in the iso frame in the URB?
I guess it makes sense, yes.
> - should snd_complete_urb do something with these
> errors, too, in case they happen on the sync frames
> and not the data frames?
Ditto, the error can be handled similarly.
> - does the XHCI code need to ring the doorbell when
> submitting an URB to a stopped device, or is it
> always up to the higher-level driver to fully reset
> the device before it can do anything useful?
> - if a device in stopped state does not do anything
> useful, should usb_submit_urb return an error?
> - how should the USB sound driver recover from these
> occasional and/or one-off errors? stop the sound
> stream, or try to reinitialize the device and start
> recording again?
When snd_pcm_stop_xrun() is called, it stops the stream and sets in
XRUN state. Then the application receives -EPIPE error upon the next
access, and the application needs to re-setup the stream and restart.
Takashi
>
> I am willing to write patches and can test with my
> setup, but both the sound code and the USB code are
> new to me so I would like to know what direction I
> should go in :)
>
> --
> All Rights Reversed.
> [2 This is a digitally signed message part <application/pgp-signature (7bit)>]
>
On 20.5.2020 14.26, Rik van Riel wrote:
> After a few more weeks of digging, I have come to the tentative
> conclusion that either the XHCI driver, or the USB sound driver,
> or both, fail to handle USB errors correctly.
>
> I have some questions at the bottom, after a (brief-ish) explanation
> of exactly what seems to go wrong.
>
> TL;DR: arecord from a misbehaving device can hang forever
> after a USB error, due to poll on /dev/snd/timer never returning.
>
> The details: under some mysterious circumstances, the PCM290x
> family sound chips can send more data than expected during an
> isochronous transfer, leading to a babble error. Those
> circumstances seem to in part depend on the USB host controller
> and/or the electrical environment, since the chips work just
> fine for most people.
>
> Receiving data past the end of the isochronous transfer window
> scheduled for a device results in the XHCI controller throwing
> a babble error, which moves the endpoint into halted state.
>
> This is followed by the host controller software sending a
> reset endpoint command, and moving the endpoint into stopped
> state, as specified on pages 164-165 of the XHCI specification.
Note that isoch endpoints should generate buffer overrun instead of
babble detect error on TD babble conditions.
See xHCI spec 6.4.5 additional note 115.
Or maybe a frame babble could halt an isoc endpoint, see xhci 4.10.2.4.1
but then you should see a port error and port going to disabled state.
Any logs of this?
mount -t debugfs none /sys/kernel/debug
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
< trigger the issue >
Send output of dmesg
Send content of /sys/kernel/debug/tracing/trace
>
> However, the USB sound driver seems to have no idea that this
> error happened. The function retire_capture_urb looks at the
> status of each isochronous frame, but seems to be under the
> assumption that the sound device just keeps on running.
>
> The function snd_complete_urb seems to only detect that the
> device is not running if usb_submit_urb returns a failure.
>
> err = usb_submit_urb(urb, GFP_ATOMIC);
> if (err == 0)
> return;
>
> usb_audio_err(ep->chip, "cannot submit urb (err = %d)\n", err);
>
> if (ep->data_subs && ep->data_subs->pcm_substream) {
> substream = ep->data_subs->pcm_substream;
> snd_pcm_stop_xrun(substream);
> }
>
> However, the XHCI driver will happily submit an URB to a
> stopped device. Looking at the call trace usb_submit_urb ->
> xhci_urb_enqueue -> xhci_queue_isoc_tx_prepare -> prepare_ring,
> you can see this code:
>
> /* Make sure the endpoint has been added to xHC schedule */
> switch (ep_state) {
> ...
> case EP_STATE_HALTED:
> xhci_dbg(xhci, "WARN halted endpoint, queueing URB anyway.\n");
> case EP_STATE_STOPPED:
> case EP_STATE_RUNNING:
> break;
>
> This leads me to a few questions:
> - should retire_capture_urb call snd_pcm_stop_xrun,
> or another function like it, if it sees certain
> errors in the iso frame in the URB?
> - should snd_complete_urb do something with these
> errors, too, in case they happen on the sync frames
> and not the data frames?
> - does the XHCI code need to ring the doorbell when
> submitting an URB to a stopped device, or is it
> always up to the higher-level driver to fully reset
> the device before it can do anything useful?
xhci driver should ring the doorbell.
xhci_queue_isoc_tx()
giveback_first_trb()
xhci_ring_ep_doorbell()
when we are talking about babble or transaction errors the device might be
completely unaware of the situation. Device side of the endpoint is not necessarily halted.
xHC host will only halt its internal endpoint state, and it needs a reset
endopoint command from the xhci driver to clear the internal halt state.
-Mathias
> - if a device in stopped state does not do anything
> useful, should usb_submit_urb return an error?
> - how should the USB sound driver recover from these
> occasional and/or one-off errors? stop the sound
> stream, or try to reinitialize the device and start
> recording again?
>
> I am willing to write patches and can test with my
> setup, but both the sound code and the USB code are
> new to me so I would like to know what direction I
> should go in :)
>
On Wed, May 20, 2020 at 07:26:57AM -0400, Rik van Riel wrote:
> After a few more weeks of digging, I have come to the tentative
> conclusion that either the XHCI driver, or the USB sound driver,
> or both, fail to handle USB errors correctly.
>
> I have some questions at the bottom, after a (brief-ish) explanation
> of exactly what seems to go wrong.
>
> TL;DR: arecord from a misbehaving device can hang forever
> after a USB error, due to poll on /dev/snd/timer never returning.
>
> The details: under some mysterious circumstances, the PCM290x
> family sound chips can send more data than expected during an
> isochronous transfer, leading to a babble error. Those
Do these chips connect as USB-3 devices or as USB-2? (I wouldn't expect
an audio device to use USB-3; it shouldn't need the higher bandwidth.)
> circumstances seem to in part depend on the USB host controller
> and/or the electrical environment, since the chips work just
> fine for most people.
>
> Receiving data past the end of the isochronous transfer window
> scheduled for a device results in the XHCI controller throwing
> a babble error, which moves the endpoint into halted state.
>
> This is followed by the host controller software sending a
> reset endpoint command, and moving the endpoint into stopped
> state, as specified on pages 164-165 of the XHCI specification.
In general, errors such as babble are not supposed to stop isochronous
endpoints.
> However, the USB sound driver seems to have no idea that this
> error happened. The function retire_capture_urb looks at the
> status of each isochronous frame, but seems to be under the
> assumption that the sound device just keeps on running.
This is appropriate, for the reason mentioned above.
> The function snd_complete_urb seems to only detect that the
> device is not running if usb_submit_urb returns a failure.
>
> err = usb_submit_urb(urb, GFP_ATOMIC);
> if (err == 0)
> return;
>
> usb_audio_err(ep->chip, "cannot submit urb (err = %d)\n", err);
>
> if (ep->data_subs && ep->data_subs->pcm_substream) {
> substream = ep->data_subs->pcm_substream;
> snd_pcm_stop_xrun(substream);
> }
>
> However, the XHCI driver will happily submit an URB to a
> stopped device.
Do you mean "stopped device" or "stopped endpoint"?
> Looking at the call trace usb_submit_urb ->
> xhci_urb_enqueue -> xhci_queue_isoc_tx_prepare -> prepare_ring,
> you can see this code:
>
> /* Make sure the endpoint has been added to xHC schedule */
> switch (ep_state) {
> ...
> case EP_STATE_HALTED:
> xhci_dbg(xhci, "WARN halted endpoint, queueing URB anyway.\n");
> case EP_STATE_STOPPED:
> case EP_STATE_RUNNING:
> break;
>
> This leads me to a few questions:
> - should retire_capture_urb call snd_pcm_stop_xrun,
> or another function like it, if it sees certain
> errors in the iso frame in the URB?
No. Isochronous endpoints are expected to encounter errors from time to
time; that is the nature of isochronous communications. You're supposed
to ignore the errors (skip over any bad data) and keep going.
> - should snd_complete_urb do something with these
> errors, too, in case they happen on the sync frames
> and not the data frames?
> - does the XHCI code need to ring the doorbell when
> submitting an URB to a stopped device, or is it
> always up to the higher-level driver to fully reset
> the device before it can do anything useful?
In this case it is not up to the higher-level driver.
> - if a device in stopped state does not do anything
> useful, should usb_submit_urb return an error?
The notion of "stopped state" is not part of USB-2. As a result, it
should be handled entirely within the xhci-hcd driver.
(A non-isochronous endpoint can be in the "halted" state. But obviously
this isn't what you're talking about.)
> - how should the USB sound driver recover from these
> occasional and/or one-off errors? stop the sound
> stream, or try to reinitialize the device and start
> recording again?
As far as I know, it should do its best to continue (perhaps fill in
missing data with zeros).
Alan Stern
> I am willing to write patches and can test with my
> setup, but both the sound code and the USB code are
> new to me so I would like to know what direction I
> should go in :)
On Wed, 2020-05-20 at 12:38 -0400, Alan Stern wrote:
> On Wed, May 20, 2020 at 07:26:57AM -0400, Rik van Riel wrote:
> > After a few more weeks of digging, I have come to the tentative
> > conclusion that either the XHCI driver, or the USB sound driver,
> > or both, fail to handle USB errors correctly.
> >
> > I have some questions at the bottom, after a (brief-ish)
> > explanation
> > of exactly what seems to go wrong.
> >
> > TL;DR: arecord from a misbehaving device can hang forever
> > after a USB error, due to poll on /dev/snd/timer never returning.
> >
> > The details: under some mysterious circumstances, the PCM290x
> > family sound chips can send more data than expected during an
> > isochronous transfer, leading to a babble error. Those
>
> Do these chips connect as USB-3 devices or as USB-2? (I wouldn't
> expect
> an audio device to use USB-3; it shouldn't need the higher
> bandwidth.)
USB-2
> In general, errors such as babble are not supposed to stop
> isochronous
> endpoints.
However, it seems that they do. The urb never
gets an answer after snd_complete_urb refiles
it with usb_submit_urb.
> > However, the USB sound driver seems to have no idea that this
> > error happened. The function retire_capture_urb looks at the
> > status of each isochronous frame, but seems to be under the
> > assumption that the sound device just keeps on running.
>
> This is appropriate, for the reason mentioned above.
Having arecord get stuck forever does not seem like
the right behavior, though :)
> > This leads me to a few questions:
> > - should retire_capture_urb call snd_pcm_stop_xrun,
> > or another function like it, if it sees certain
> > errors in the iso frame in the URB?
>
> No. Isochronous endpoints are expected to encounter errors from time
> to
> time; that is the nature of isochronous communications. You're
> supposed
> to ignore the errors (skip over any bad data) and keep going.
...
> The notion of "stopped state" is not part of USB-2. As a result, it
> should be handled entirely within the xhci-hcd driver.
Interesting. That makes me really curious why things are
getting stuck, now...
> > - how should the USB sound driver recover from these
> > occasional and/or one-off errors? stop the sound
> > stream, or try to reinitialize the device and start
> > recording again?
>
> As far as I know, it should do its best to continue (perhaps fill in
> missing data with zeros).
That was my first intuition as well, given the documented
behavior of isochronous frames.
However, given that the device appears to stop sending
frames after that error, at least usbmon is not seeing
any, I am not sure what needs to happen in order to get
that behavior.
--
All Rights Reversed.
On Wed, 2020-05-20 at 16:50 +0300, Mathias Nyman wrote:
> On 20.5.2020 14.26, Rik van Riel wrote:
> > After a few more weeks of digging, I have come to the tentative
> > conclusion that either the XHCI driver, or the USB sound driver,
> > or both, fail to handle USB errors correctly.
> >
> > I have some questions at the bottom, after a (brief-ish)
> > explanation
> > of exactly what seems to go wrong.
> >
> > TL;DR: arecord from a misbehaving device can hang forever
> > after a USB error, due to poll on /dev/snd/timer never returning.
> >
> > The details: under some mysterious circumstances, the PCM290x
> > family sound chips can send more data than expected during an
> > isochronous transfer, leading to a babble error. Those
> > circumstances seem to in part depend on the USB host controller
> > and/or the electrical environment, since the chips work just
> > fine for most people.
> >
> > Receiving data past the end of the isochronous transfer window
> > scheduled for a device results in the XHCI controller throwing
> > a babble error, which moves the endpoint into halted state.
> >
> > This is followed by the host controller software sending a
> > reset endpoint command, and moving the endpoint into stopped
> > state, as specified on pages 164-165 of the XHCI specification.
>
> Note that isoch endpoints should generate buffer overrun instead of
> babble detect error on TD babble conditions.
> See xHCI spec 6.4.5 additional note 115.
Maybe it should, but I'm hitting this printk in handle_tx_event:
case COMP_BABBLE_DETECTED_ERROR:
xhci_dbg(xhci, "Babble error for slot %u ep %u on
endpoint\n",
slot_id, ep_index);
status = -EOVERFLOW;
break;
Tracing in the sound driver suggests that packet belongs to
the sound data device (not the sync device), because the
URB with -EOVERFLOW status for the isochronous packet in that
URB is the last one seen there.
> Or maybe a frame babble could halt an isoc endpoint, see xhci
> 4.10.2.4.1
> but then you should see a port error and port going to disabled
> state.
>
> Any logs of this?
>
> mount -t debugfs none /sys/kernel/debug
> echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
> echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> < trigger the issue >
> Send output of dmesg
> Send content of /sys/kernel/debug/tracing/trace
dmesg:
https://drive.google.com/open?id=1S2Qc8lroqA5-RMukuLBLWFGx10vEjG-i
usbtrace:
https://drive.google.com/open?id=1cbLcOnAtQRW0Chgak6PNC0l4yJv__4uO
> > However, the USB sound driver seems to have no idea that this
> > error happened. The function retire_capture_urb looks at the
> > status of each isochronous frame, but seems to be under the
> > assumption that the sound device just keeps on running.
> >
> > The function snd_complete_urb seems to only detect that the
> > device is not running if usb_submit_urb returns a failure.
> >
> > err = usb_submit_urb(urb, GFP_ATOMIC);
> > if (err == 0)
> > return;
> >
> > usb_audio_err(ep->chip, "cannot submit urb (err = %d)\n",
> > err);
> >
> > if (ep->data_subs && ep->data_subs->pcm_substream) {
> > substream = ep->data_subs->pcm_substream;
> > snd_pcm_stop_xrun(substream);
> > }
> >
> > However, the XHCI driver will happily submit an URB to a
> > stopped device. Looking at the call trace usb_submit_urb ->
> > xhci_urb_enqueue -> xhci_queue_isoc_tx_prepare -> prepare_ring,
> > you can see this code:
> >
> > /* Make sure the endpoint has been added to xHC schedule */
> > switch (ep_state) {
> > ...
> > case EP_STATE_HALTED:
> > xhci_dbg(xhci, "WARN halted endpoint, queueing URB
> > anyway.\n");
> > case EP_STATE_STOPPED:
> > case EP_STATE_RUNNING:
> > break;
> >
> > This leads me to a few questions:
> > - should retire_capture_urb call snd_pcm_stop_xrun,
> > or another function like it, if it sees certain
> > errors in the iso frame in the URB?
> > - should snd_complete_urb do something with these
> > errors, too, in case they happen on the sync frames
> > and not the data frames?
> > - does the XHCI code need to ring the doorbell when
> > submitting an URB to a stopped device, or is it
> > always up to the higher-level driver to fully reset
> > the device before it can do anything useful?
>
> xhci driver should ring the doorbell.
>
> xhci_queue_isoc_tx()
> giveback_first_trb()
> xhci_ring_ep_doorbell()
>
> when we are talking about babble or transaction errors the device
> might be
> completely unaware of the situation. Device side of the endpoint is
> not necessarily halted.
> xHC host will only halt its internal endpoint state, and it needs a
> reset
> endopoint command from the xhci driver to clear the internal halt
> state.
Interesting. I had no idea the endpoint on each side
was controlled separately.
--
All Rights Reversed.
On Wed, May 20, 2020 at 03:21:44PM -0400, Rik van Riel wrote:
> On Wed, 2020-05-20 at 12:38 -0400, Alan Stern wrote:
> > On Wed, May 20, 2020 at 07:26:57AM -0400, Rik van Riel wrote:
> > > After a few more weeks of digging, I have come to the tentative
> > > conclusion that either the XHCI driver, or the USB sound driver,
> > > or both, fail to handle USB errors correctly.
> > >
> > > I have some questions at the bottom, after a (brief-ish)
> > > explanation
> > > of exactly what seems to go wrong.
> > >
> > > TL;DR: arecord from a misbehaving device can hang forever
> > > after a USB error, due to poll on /dev/snd/timer never returning.
> > >
> > > The details: under some mysterious circumstances, the PCM290x
> > > family sound chips can send more data than expected during an
> > > isochronous transfer, leading to a babble error. Those
> >
> > Do these chips connect as USB-3 devices or as USB-2? (I wouldn't
> > expect
> > an audio device to use USB-3; it shouldn't need the higher
> > bandwidth.)
>
> USB-2
Okay. In that case, xhci-hcd should appear to act just like a USB-2
host controller.
> > In general, errors such as babble are not supposed to stop
> > isochronous
> > endpoints.
>
> However, it seems that they do. The urb never
> gets an answer after snd_complete_urb refiles
> it with usb_submit_urb.
>
> > > However, the USB sound driver seems to have no idea that this
> > > error happened. The function retire_capture_urb looks at the
> > > status of each isochronous frame, but seems to be under the
> > > assumption that the sound device just keeps on running.
> >
> > This is appropriate, for the reason mentioned above.
>
> Having arecord get stuck forever does not seem like
> the right behavior, though :)
No, it isn't.
> > > This leads me to a few questions:
> > > - should retire_capture_urb call snd_pcm_stop_xrun,
> > > or another function like it, if it sees certain
> > > errors in the iso frame in the URB?
> >
> > No. Isochronous endpoints are expected to encounter errors from time
> > to
> > time; that is the nature of isochronous communications. You're
> > supposed
> > to ignore the errors (skip over any bad data) and keep going.
>
> ...
>
> > The notion of "stopped state" is not part of USB-2. As a result, it
> > should be handled entirely within the xhci-hcd driver.
>
> Interesting. That makes me really curious why things are
> getting stuck, now...
This could be a bug in xhci-hcd. Perhaps the controller's endpoint
state needs to be updated after one of these errors occurs. Mathias
will know all about that.
> > > - how should the USB sound driver recover from these
> > > occasional and/or one-off errors? stop the sound
> > > stream, or try to reinitialize the device and start
> > > recording again?
> >
> > As far as I know, it should do its best to continue (perhaps fill in
> > missing data with zeros).
>
> That was my first intuition as well, given the documented
> behavior of isochronous frames.
>
> However, given that the device appears to stop sending
> frames after that error, at least usbmon is not seeing
> any, I am not sure what needs to happen in order to get
> that behavior.
The device won't send any data unless the host controller tells it to.
If something is wrong with the host controller then usbmon won't see any
data.
Alan Stern
On Wed, 2020-05-20 at 16:34 -0400, Alan Stern wrote:
> On Wed, May 20, 2020 at 03:21:44PM -0400, Rik van Riel wrote:
> >
> > Interesting. That makes me really curious why things are
> > getting stuck, now...
>
> This could be a bug in xhci-hcd. Perhaps the controller's endpoint
> state needs to be updated after one of these errors occurs. Mathias
> will know all about that.
I am seeing something potentially interesting in the
giant trace. First the final enqueue/dequeue before
the babble error:
<idle>-0 [005] d.s. 776367.638233: xhci_inc_enq: ISOC
0000000033a6879e: enq 0x0000001014070420(0x0000001014070000) deq
0x0000001014070360(0x0000001014070000) segs 2 stream 0 free_trbs 497
bounce 196 cycle 1
The next reference to 0x0000001014070360 is the babble error,
and some info on the ISOC buffer itself:
<idle>-0 [005] d.h. 776367.639187: xhci_handle_event:
EVENT: TRB 0000001014070360 status 'Babble Detected' len 196 slot 15 ep
9 type 'Transfer Event' flags e:C
<idle>-0 [005] d.h. 776367.639195: xhci_handle_transfer:
ISOC: Buffer 0000000e2676f400 length 196 TD size 0 intr 0 type 'Isoch'
flags b:i:I:c:s:I:e:C
Immediately after the babble error, the next request is enqueued,
and the doorbell is rung:
<idle>-0 [005] d.h. 776367.639196: xhci_inc_deq: ISOC 0000000033a6879e: enq 0x0000001014070420(0x0000001014070000) deq 0x0000001014070370(0x0000001014070000) segs 2 stream 0 free_trbs 498 bounce 196 cycle 1
<idle>-0 [005] d.h. 776367.639197: xhci_urb_giveback: ep4in-isoc: urb 0000000072126553 pipe 135040 slot 15 length 196/196 sgs 0/0 stream 0 flags 00000206
<idle>-0 [005] d.h. 776367.639197: xhci_inc_deq: EVENT 0000000097f84b16: enq 0x00000010170b5000(0x00000010170b5000) deq 0x00000010170b5670(0x00000010170b5000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
<idle>-0 [005] ..s. 776367.639212: xhci_urb_enqueue: ep4in-isoc: urb 0000000072126553 pipe 135040 slot 15 length 0/196 sgs 0/0 stream 0 flags 00000206
<idle>-0 [005] d.s. 776367.639214: xhci_queue_trb: ISOC: Buffer 0000000e2676f400 length 196 TD size 0 intr 0 type 'Isoch' flags b:i:I:c:s:I:e:c
<idle>-0 [005] d.s. 776367.639214: xhci_inc_enq: ISOC 0000000033a6879e: enq 0x0000001014070430(0x0000001014070000) deq 0x0000001014070370(0x0000001014070000) segs 2 stream 0 free_trbs 497 bounce 196 cycle 1
<idle>-0 [005] d.s. 776367.639215: xhci_ring_ep_doorbell: Ring doorbell for Slot 15 ep4in
However, after that point, no more xhci_handle_transfer: ISOC
lines ar seen in the log. The doorbell line above is the last
line in the log for ep4in.
Is this some area where USB3 and USB2 behave differently?
dmesg:
https://drive.google.com/open?id=1S2Qc8lroqA5-RMukuLBLWFGx10vEjG-i
usb trace, as requested by Mathias:
https://drive.google.com/open?id=1cbLcOnAtQRW0Chgak6PNC0l4yJv__4uO
--
All Rights Reversed.
On 21.5.2020 6.45, Rik van Riel wrote:
> On Wed, 2020-05-20 at 16:34 -0400, Alan Stern wrote:
>> On Wed, May 20, 2020 at 03:21:44PM -0400, Rik van Riel wrote:
>>>
>>> Interesting. That makes me really curious why things are
>>> getting stuck, now...
>>
>> This could be a bug in xhci-hcd. Perhaps the controller's endpoint
>> state needs to be updated after one of these errors occurs. Mathias
>> will know all about that.
>
> I am seeing something potentially interesting in the
> giant trace. First the final enqueue/dequeue before
> the babble error:
>
> <idle>-0 [005] d.s. 776367.638233: xhci_inc_enq: ISOC
> 0000000033a6879e: enq 0x0000001014070420(0x0000001014070000) deq
> 0x0000001014070360(0x0000001014070000) segs 2 stream 0 free_trbs 497
> bounce 196 cycle 1
>
> The next reference to 0x0000001014070360 is the babble error,
> and some info on the ISOC buffer itself:
>
> <idle>-0 [005] d.h. 776367.639187: xhci_handle_event:
> EVENT: TRB 0000001014070360 status 'Babble Detected' len 196 slot 15 ep
> 9 type 'Transfer Event' flags e:C
> <idle>-0 [005] d.h. 776367.639195: xhci_handle_transfer:
> ISOC: Buffer 0000000e2676f400 length 196 TD size 0 intr 0 type 'Isoch'
> flags b:i:I:c:s:I:e:C
>n
> Immediately after the babble error, the next request is enqueued,
> and the doorbell is rung:
>
> <idle>-0 [005] d.h. 776367.639196: xhci_inc_deq: ISOC 0000000033a6879e: enq 0x0000001014070420(0x0000001014070000) deq 0x0000001014070370(0x0000001014070000) segs 2 stream 0 free_trbs 498 bounce 196 cycle 1
> <idle>-0 [005] d.h. 776367.639197: xhci_urb_giveback: ep4in-isoc: urb 0000000072126553 pipe 135040 slot 15 length 196/196 sgs 0/0 stream 0 flags 00000206
> <idle>-0 [005] d.h. 776367.639197: xhci_inc_deq: EVENT 0000000097f84b16: enq 0x00000010170b5000(0x00000010170b5000) deq 0x00000010170b5670(0x00000010170b5000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
> <idle>-0 [005] ..s. 776367.639212: xhci_urb_enqueue: ep4in-isoc: urb 0000000072126553 pipe 135040 slot 15 length 0/196 sgs 0/0 stream 0 flags 00000206
> <idle>-0 [005] d.s. 776367.639214: xhci_queue_trb: ISOC: Buffer 0000000e2676f400 length 196 TD size 0 intr 0 type 'Isoch' flags b:i:I:c:s:I:e:c
> <idle>-0 [005] d.s. 776367.639214: xhci_inc_enq: ISOC 0000000033a6879e: enq 0x0000001014070430(0x0000001014070000) deq 0x0000001014070370(0x0000001014070000) segs 2 stream 0 free_trbs 497 bounce 196 cycle 1
> <idle>-0 [005] d.s. 776367.639215: xhci_ring_ep_doorbell: Ring doorbell for Slot 15 ep4in
>
> However, after that point, no more xhci_handle_transfer: ISOC
> lines ar seen in the log. The doorbell line above is the last
> line in the log for ep4in.
>
> Is this some area where USB3 and USB2 behave differently?
It acts as if the endpoint is no longer running.
If the endpoint would be halted then xhci_requires_manual_halt_cleanup() should
reset the endpoints and it would show in the traces.
Could you add the code below and take new traces, it will show the endpoint
state after the Babble error.
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 0fda0c0f4d31..373d89ef7275 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2455,6 +2455,7 @@ static int handle_tx_event(struct xhci_hcd *xhci,
case COMP_BABBLE_DETECTED_ERROR:
xhci_dbg(xhci, "Babble error for slot %u ep %u on endpoint\n",
slot_id, ep_index);
+ trace_xhci_handle_tx_event(ep_ctx);
status = -EOVERFLOW;
break;
/* Completion codes for endpoint error state */
diff --git a/drivers/usb/host/xhci-trace.h b/drivers/usb/host/xhci-trace.h
index b19582b2a72c..5081df079f4a 100644
--- a/drivers/usb/host/xhci-trace.h
+++ b/drivers/usb/host/xhci-trace.h
@@ -360,6 +360,11 @@ DEFINE_EVENT(xhci_log_ep_ctx, xhci_add_endpoint,
TP_ARGS(ctx)
);
+DEFINE_EVENT(xhci_log_ep_ctx, xhci_handle_tx_event,
+ TP_PROTO(struct xhci_ep_ctx *ctx),
+ TP_ARGS(ctx)
+);
+
DECLARE_EVENT_CLASS(xhci_log_slot_ctx,
TP_PROTO(struct xhci_slot_ctx *ctx),
TP_ARGS(ctx),
[keeping old context since it's been a month...]
On Mon, 2020-05-25 at 12:37 +0300, Mathias Nyman wrote:
> On 21.5.2020 6.45, Rik van Riel wrote:
> > On Wed, 2020-05-20 at 16:34 -0400, Alan Stern wrote:
> > > On Wed, May 20, 2020 at 03:21:44PM -0400, Rik van Riel wrote:
> > > > Interesting. That makes me really curious why things are
> > > > getting stuck, now...
> > >
> > > This could be a bug in xhci-hcd. Perhaps the controller's
> > > endpoint
> > > state needs to be updated after one of these errors
> > > occurs. Mathias
> > > will know all about that.
> >
> > I am seeing something potentially interesting in the
> > giant trace. First the final enqueue/dequeue before
> > the babble error:
> >
> > <idle>-0 [005] d.s. 776367.638233: xhci_inc_enq: ISOC
> > 0000000033a6879e: enq 0x0000001014070420(0x0000001014070000) deq
> > 0x0000001014070360(0x0000001014070000) segs 2 stream 0 free_trbs
> > 497
> > bounce 196 cycle 1
> >
> > The next reference to 0x0000001014070360 is the babble error,
> > and some info on the ISOC buffer itself:
> >
> > <idle>-0 [005] d.h. 776367.639187: xhci_handle_event:
> > EVENT: TRB 0000001014070360 status 'Babble Detected' len 196 slot
> > 15 ep
> > 9 type 'Transfer Event' flags e:C
> > <idle>-0 [005] d.h. 776367.639195:
> > xhci_handle_transfer:
> > ISOC: Buffer 0000000e2676f400 length 196 TD size 0 intr 0 type
> > 'Isoch'
> > flags b:i:I:c:s:I:e:C
> > n
> > Immediately after the babble error, the next request is enqueued,
> > and the doorbell is rung:
> >
> > <idle>-0 [005] d.h. 776367.639196: xhci_inc_deq: ISOC
> > 0000000033a6879e: enq 0x0000001014070420(0x0000001014070000) deq
> > 0x0000001014070370(0x0000001014070000) segs 2 stream 0 free_trbs
> > 498 bounce 196 cycle 1
> > <idle>-0 [005] d.h. 776367.639197: xhci_urb_giveback:
> > ep4in-isoc: urb 0000000072126553 pipe 135040 slot 15 length 196/196
> > sgs 0/0 stream 0 flags 00000206
> > <idle>-0 [005] d.h. 776367.639197: xhci_inc_deq:
> > EVENT 0000000097f84b16: enq 0x00000010170b5000(0x00000010170b5000)
> > deq 0x00000010170b5670(0x00000010170b5000) segs 1 stream 0
> > free_trbs 254 bounce 0 cycle 1
> > <idle>-0 [005] ..s. 776367.639212: xhci_urb_enqueue:
> > ep4in-isoc: urb 0000000072126553 pipe 135040 slot 15 length 0/196
> > sgs 0/0 stream 0 flags 00000206
> > <idle>-0 [005] d.s. 776367.639214: xhci_queue_trb:
> > ISOC: Buffer 0000000e2676f400 length 196 TD size 0 intr 0 type
> > 'Isoch' flags b:i:I:c:s:I:e:c
> > <idle>-0 [005] d.s. 776367.639214: xhci_inc_enq: ISOC
> > 0000000033a6879e: enq 0x0000001014070430(0x0000001014070000) deq
> > 0x0000001014070370(0x0000001014070000) segs 2 stream 0 free_trbs
> > 497 bounce 196 cycle 1
> > <idle>-0 [005] d.s. 776367.639215:
> > xhci_ring_ep_doorbell: Ring doorbell for Slot 15 ep4in
> >
> > However, after that point, no more xhci_handle_transfer: ISOC
> > lines ar seen in the log. The doorbell line above is the last
> > line in the log for ep4in.
> >
> > Is this some area where USB3 and USB2 behave differently?
>
> It acts as if the endpoint is no longer running.
>
> If the endpoint would be halted then
> xhci_requires_manual_halt_cleanup() should
> reset the endpoints and it would show in the traces.
>
> Could you add the code below and take new traces, it will show the
> endpoint
> state after the Babble error.
Hi Mathias,
I have finally rebooted into a kernel with your tracepoint.
After a babble error, I get the following info in the trace.
[ 556.716334] xhci_hcd 0000:00:14.0: Babble error for slot 13 ep 8 on
endpoint
28672.016 :0/0 xhci-hcd:xhci_handle_tx_event(info: 196609, info2:
12845096, deq: 69501877488, tx_info: 12845252)
34816.037 :0/0 xhci-hcd:xhci_handle_tx_event(info: 196609, info2:
12845096, deq: 69501877856, tx_info: 12845252)
38912.043 :0/0 xhci-hcd:xhci_handle_tx_event(info: 196609, info2:
12845096, deq: 69501870176, tx_info: 12845252)
I hope this is useful :)
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-
> ring.c
> index 0fda0c0f4d31..373d89ef7275 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -2455,6 +2455,7 @@ static int handle_tx_event(struct xhci_hcd
> *xhci,
> case COMP_BABBLE_DETECTED_ERROR:
> xhci_dbg(xhci, "Babble error for slot %u ep %u on
> endpoint\n",
> slot_id, ep_index);
> + trace_xhci_handle_tx_event(ep_ctx);
> status = -EOVERFLOW;
> break;
> /* Completion codes for endpoint error state */
> diff --git a/drivers/usb/host/xhci-trace.h b/drivers/usb/host/xhci-
> trace.h
> index b19582b2a72c..5081df079f4a 100644
> --- a/drivers/usb/host/xhci-trace.h
> +++ b/drivers/usb/host/xhci-trace.h
> @@ -360,6 +360,11 @@ DEFINE_EVENT(xhci_log_ep_ctx, xhci_add_endpoint,
> TP_ARGS(ctx)
> );
>
> +DEFINE_EVENT(xhci_log_ep_ctx, xhci_handle_tx_event,
> + TP_PROTO(struct xhci_ep_ctx *ctx),
> + TP_ARGS(ctx)
> +);
> +
> DECLARE_EVENT_CLASS(xhci_log_slot_ctx,
> TP_PROTO(struct xhci_slot_ctx *ctx),
> TP_ARGS(ctx),
>
>
>
--
All Rights Reversed.
On Mon, 2020-06-29 at 23:21 -0400, Rik van Riel wrote:
> > Could you add the code below and take new traces, it will show the
> > endpoint
> > state after the Babble error.
>
> Hi Mathias,
>
> I have finally rebooted into a kernel with your tracepoint.
> After a babble error, I get the following info in the trace.
>
> [ 556.716334] xhci_hcd 0000:00:14.0: Babble error for slot 13 ep 8
> on
> endpoint
>
> 28672.016 :0/0 xhci-hcd:xhci_handle_tx_event(info: 196609, info2:
> 12845096, deq: 69501877488, tx_info: 12845252)
> 34816.037 :0/0 xhci-hcd:xhci_handle_tx_event(info: 196609, info2:
> 12845096, deq: 69501877856, tx_info: 12845252)
> 38912.043 :0/0 xhci-hcd:xhci_handle_tx_event(info: 196609, info2:
> 12845096, deq: 69501870176, tx_info: 12845252)
OK, this is strange indeed.
info: 0x30001
info2: 0xc40028
tx_info: c400c4
That suggests the device state is EP_STATE_DISABLED, but
we never got the error from the EP_STATE_DISABLED test near
the start of handle_tx_event(). If we had, the big switch
statement containing the code below would have been bypassed.
Unless I am mistaken, does that mean the endpoint context
(*ep_ctx) got modified while the code was in the middle of
handle_tx_event()?
What would cause that? A subsequent transfer to an endpoint
while it is in EP_STATE_HALTED, which the comment suggests
is the expected endpoint state for a babble error?
> > diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-
> > ring.c
> > index 0fda0c0f4d31..373d89ef7275 100644
> > --- a/drivers/usb/host/xhci-ring.c
> > +++ b/drivers/usb/host/xhci-ring.c
> > @@ -2455,6 +2455,7 @@ static int handle_tx_event(struct xhci_hcd
> > *xhci,
> > case COMP_BABBLE_DETECTED_ERROR:
> > xhci_dbg(xhci, "Babble error for slot %u ep %u on
> > endpoint\n",
> > slot_id, ep_index);
> > + trace_xhci_handle_tx_event(ep_ctx);
> > status = -EOVERFLOW;
> > break;
> > /* Completion codes for endpoint error state */
> > diff --git a/drivers/usb/host/xhci-trace.h b/drivers/usb/host/xhci-
> > trace.h
> > index b19582b2a72c..5081df079f4a 100644
> > --- a/drivers/usb/host/xhci-trace.h
> > +++ b/drivers/usb/host/xhci-trace.h
> > @@ -360,6 +360,11 @@ DEFINE_EVENT(xhci_log_ep_ctx,
> > xhci_add_endpoint,
> > TP_ARGS(ctx)
> > );
> >
> > +DEFINE_EVENT(xhci_log_ep_ctx, xhci_handle_tx_event,
> > + TP_PROTO(struct xhci_ep_ctx *ctx),
> > + TP_ARGS(ctx)
> > +);
> > +
> > DECLARE_EVENT_CLASS(xhci_log_slot_ctx,
> > TP_PROTO(struct xhci_slot_ctx *ctx),
> > TP_ARGS(ctx),
> >
> >
> >
--
All Rights Reversed.
On 30.6.2020 16.08, Rik van Riel wrote:
> I misread the code, it's not a bitfield, so state 1 means an endpoint marked with running state. The next urb is never getting a response, though.
>
> However, the xhci spec says an endpoint is halted upon a babble error.
I was looking at the same, so according to specs this state shouldn't be possible.
>
> The code right above the babble handling case adds halted into the endpoint state itself. Does the code handling the babble error need to do something similar to trigger cleanup elsewhere?
It's a flag to prevent ringing the doorbell for a halted endpoint.
Anyway, reset endpoint is meant to recover an endpoint in a halted state.
Resetting non-halted endpoints will just lead to a context state error, and
besides, isoc endpoints shouldn't halt.
Anyways, I haven't got any better idea at the moment.
You can try and see what a forced reset does with:
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 2c255d0620b0..d79aca0df6d4 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1900,8 +1900,7 @@ static int xhci_requires_manual_halt_cleanup(struct xhci_hcd *xhci,
* endpoint anyway. Check if a babble halted the
* endpoint.
*/
- if (GET_EP_CTX_STATE(ep_ctx) == EP_STATE_HALTED)
- return 1;
+ return 1;
return 0;
}
Traces also showed thet endpoint doorbell was rang after th babble error, so
we know that didn't help restarting the endpoint.
-Mathias
On Tue, 2020-06-30 at 17:27 +0300, Mathias Nyman wrote:
> On 30.6.2020 16.08, Rik van Riel wrote:
> > I misread the code, it's not a bitfield, so state 1 means an
> > endpoint marked with running state. The next urb is never getting a
> > response, though.
> >
> > However, the xhci spec says an endpoint is halted upon a babble
> > error.
>
> I was looking at the same, so according to specs this state shouldn't
> be possible.
The PCM2903B chip, and potentially the hub it is behind,
are USB2 devices, though. Does USB2 know about halted
endpoints?
> > The code right above the babble handling case adds halted into the
> > endpoint state itself. Does the code handling the babble error need
> > to do something similar to trigger cleanup elsewhere?
>
> It's a flag to prevent ringing the doorbell for a halted endpoint.
> Anyway, reset endpoint is meant to recover an endpoint in a halted
> state.
> Resetting non-halted endpoints will just lead to a context state
> error, and
> besides, isoc endpoints shouldn't halt.
>
> Anyways, I haven't got any better idea at the moment.
> You can try and see what a forced reset does with:
So close. Looks like something in the XHCI/USB stack isn't
doing the reset because something is in an unexpected state?
[ 51.706798] xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed
due to incorrect slot or ep state.
[ 51.706802] got overflow, ep->flags = 2
[ 51.932550] usb 3-9.7.5: reset high-speed USB device number 18 using
xhci_hcd
[ 68.830396] xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed
due to incorrect slot or ep state.
[ 68.830409] got overflow, ep->flags = 2
[ 70.077981] rfkill: input handler disabled
[ 157.992374] got overflow, ep->flags = 2
[ 157.992406] xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed
due to incorrect slot or ep state.
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-
> ring.c
> index 2c255d0620b0..d79aca0df6d4 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -1900,8 +1900,7 @@ static int
> xhci_requires_manual_halt_cleanup(struct xhci_hcd *xhci,
> * endpoint anyway. Check if a babble halted the
> * endpoint.
> */
> - if (GET_EP_CTX_STATE(ep_ctx) == EP_STATE_HALTED)
> - return 1;
> + return 1;
>
> return 0;
> }
>
> Traces also showed thet endpoint doorbell was rang after th babble
> error, so
> we know that didn't help restarting the endpoint.
>
> -Mathias
>
--
All Rights Reversed.