2021-03-27 16:19:47

by Maciej S. Szmigiero

[permalink] [raw]
Subject: >20 KB URBs + EHCI = bad performance due to stalls

Hi,

Is there any specific reason that URBs without URB_SHORT_NOT_OK flag that
span multiple EHCI qTDs have Alternate Next qTD pointer set to the dummy
qTD in their every qTD besides the last one (instead of to the first qTD
of the next URB to that endpoint)?

This causes that endpoint queue to stall in case of a short read that
does not reach the last qTD (I guess this condition persists until an
URB is (re)submitted to that endpoint, but I am not sure here).

One of affected drivers here is drivers/net/usb/r8152.c.

If I simply reduce its per-URB transfer buffer to 20 KB (the maximum
that fits in a well-aligned qTD) the RX rate increases from around
100 Mbps to 200+ Mbps (on an ICH8M controller):
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -6554,6 +6556,9 @@
break;
}

+ if (tp->udev->speed == USB_SPEED_HIGH)
+ tp->rx_buf_sz = min(tp->rx_buf_sz, (u32)20 * 1024);
+
return ret;
}

The driver default is to use 32 KB buffers (which span two qTDs),
but the device rarely fully fills the first qTD resulting in
repetitive stalls and more than halving the performance.

As far as I can see, the relevant code in
drivers/usb/host/ehci-q.c::qh_urb_transaction() predates the git era.
The comment in that function before setting the Alternate Next qTD
pointer:
> /*
> * short reads advance to a "magic" dummy instead of the next
> * qtd ... that forces the queue to stop, for manual cleanup.
> * (this will usually be overridden later.)
> */

...suggests the idea was to override that pointer when
URB_SHORT_NOT_OK is not set, but this is actually done only for
the last qTD from the URB (also, that's the only one that ends
with interrupt flag set).

Looking at OHCI and UHCI host controller drivers the equivalent
limits seem to be different there (8 KB and 2 KB), while I don't
see any specific limit in the XHCI case.

Because of that variance in the URB buffer limit it seems strange
to me that this should be managed by a particular USB device driver
rather than by the host controller driver, because this would mean
every such driver would need to either use the lowest common
denominator for the URB buffer size (which is very small) or
hardcode the limit for every host controller that the device can
be connected to, which seems a bit inefficient.

Thanks,
Maciej


2021-03-29 15:24:11

by Alan Stern

[permalink] [raw]
Subject: Re: >20 KB URBs + EHCI = bad performance due to stalls

On Sat, Mar 27, 2021 at 04:55:20PM +0100, Maciej S. Szmigiero wrote:
> Hi,
>
> Is there any specific reason that URBs without URB_SHORT_NOT_OK flag that
> span multiple EHCI qTDs have Alternate Next qTD pointer set to the dummy
> qTD in their every qTD besides the last one (instead of to the first qTD
> of the next URB to that endpoint)?

Quick answer: I don't know. I can't think of any good reason. This
code was all written a long time ago. Maybe the issue was overlooked
or the details were misunderstood.

> This causes that endpoint queue to stall in case of a short read that
> does not reach the last qTD (I guess this condition persists until an
> URB is (re)submitted to that endpoint, but I am not sure here).

It persists until the driver cleans up the queue.

> One of affected drivers here is drivers/net/usb/r8152.c.
>
> If I simply reduce its per-URB transfer buffer to 20 KB (the maximum
> that fits in a well-aligned qTD) the RX rate increases from around
> 100 Mbps to 200+ Mbps (on an ICH8M controller):

> The driver default is to use 32 KB buffers (which span two qTDs),
> but the device rarely fully fills the first qTD resulting in
> repetitive stalls and more than halving the performance.
>
> As far as I can see, the relevant code in
> drivers/usb/host/ehci-q.c::qh_urb_transaction() predates the git era.

Like I said, a long time ago.

> The comment in that function before setting the Alternate Next qTD
> pointer:
> > /*
> > * short reads advance to a "magic" dummy instead of the next
> > * qtd ... that forces the queue to stop, for manual cleanup.
> > * (this will usually be overridden later.)
> > */
>
> ...suggests the idea was to override that pointer when
> URB_SHORT_NOT_OK is not set, but this is actually done only for
> the last qTD from the URB (also, that's the only one that ends
> with interrupt flag set).

The hw_alt_next field should be updated for all the qTDs in the URB.
Failure to this was probably an oversight. Or maybe the omission was
to simplify the procedure for cleaning up the queue after a short
transfer.

> Looking at OHCI and UHCI host controller drivers the equivalent
> limits seem to be different there (8 KB and 2 KB), while I don't
> see any specific limit in the XHCI case.

I'd have to review the details of ohci-hcd and uhci-hcd to make
sure. In principle, the queue isn't supposed to stop merely because
of a short transfer unless URB_SHORT_NOT_OK is set. However, the UHCI
hardware in particular may offer no other way to handle a short transfer.

> Because of that variance in the URB buffer limit it seems strange
> to me that this should be managed by a particular USB device driver
> rather than by the host controller driver, because this would mean
> every such driver would need to either use the lowest common
> denominator for the URB buffer size (which is very small) or
> hardcode the limit for every host controller that the device can
> be connected to, which seems a bit inefficient.

I don't understand what you're saying in this paragraph. What do you
think USB device drivers are supposed to be managing? The URB buffer
size? They should set that field without regard to the type of host
controller in use.

In short, the behavior you observed is a bug, resulting in a loss of
throughput (though not in any loss of data). It needs to be fixed.

If you would like to write and submit a patch, that would be great.
Otherwise, I'll try to find time to work on it.

I would appreciate any effort you could make toward checking the code
in qh_completions(); I suspect that the checks it does involving
EHCI_LIST_END may not be right. At the very least, they should be
encapsulated in a macro so that they are easier to understand.

Alan Stern

2021-03-31 18:24:04

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: >20 KB URBs + EHCI = bad performance due to stalls

On 29.03.2021 17:22, Alan Stern wrote:
> On Sat, Mar 27, 2021 at 04:55:20PM +0100, Maciej S. Szmigiero wrote:
>> Hi,
>>
>> Is there any specific reason that URBs without URB_SHORT_NOT_OK flag that
>> span multiple EHCI qTDs have Alternate Next qTD pointer set to the dummy
>> qTD in their every qTD besides the last one (instead of to the first qTD
>> of the next URB to that endpoint)?
>
> Quick answer: I don't know. I can't think of any good reason. This
> code was all written a long time ago. Maybe the issue was overlooked
> or the details were misunderstood.

I've dug out the original EHCI driver, that landed in 2.5.2:
https://marc.info/?l=linux-usb-devel&m=100875066109269&w=2
https://marc.info/?l=linux-usb-devel&m=100982880716373&w=2

It already had the following qTD setup code, roughly similar to what
the current one does:
> /* previous urb allows short rx? maybe optimize. */
> if (!(last_qtd->urb->transfer_flags & USB_DISABLE_SPD)
> && (epnum & 0x10)) {
> // only the last QTD for now
> last_qtd->hw_alt_next = hw_next;

The "for now" language seems to suggest that ultimately other-than-last
qTDs were supposed to be set not to stall the queue, too.
Just the code to handle this case was never written.

It seems to me though, this should be possible with relatively few
changes to the code:
qh_append_tds() will need to patch these other-than-last qTDs
hw_alt_next pointer to point to the (new) dummy qTD (instead of just
pointing the last submitted qTD hw_next to it and adding the remaining
qTDs verbatim to the qH qTD list).

Then qh_completions() will need few changes:
*
> } else if (IS_SHORT_READ (token)
> && !(qtd->hw_alt_next
> & EHCI_LIST_END(ehci))) {
This branch will need to be modified not to mark the queue as stopped
and request its unlinking when such type of short qTD was processed.

* The ACTIVE bit should be treated as unset on any qTD following the
one that hits the above condition until a qTD for a different URB is
encountered.
Otherwise the unprocessed remaining qTDs from that short URB will be
considered pending active qTDs and the code will wait forever for their
processing,

* The code that patches the previous qTD hw_next pointer when removing a
qTD that isn't currently at the qH will need changing to also patch
hw_alt_next pointers of the qTDs belonging to the previous URB in case
the previous URB was one of these short-read-ok ones.

That's was my quick assessment what is required to handle these
transactions effectively in the EHCI driver.

I suspect, however, there may be some corner cases involving
non-ordinary qTD unlinking which might need fixing, too (like caused
by usb_unlink_urb(), system suspend or HC removal).
But I am not sure about this since I don't know this code well.

>
>> This causes that endpoint queue to stall in case of a short read that
>> does not reach the last qTD (I guess this condition persists until an
>> URB is (re)submitted to that endpoint, but I am not sure here).
>
> It persists until the driver cleans up the queue.

I guess by "the driver" you mean the host controller driver, not the USB
device driver.

As far as I can see from the code in case of a short read the whole qH
for that endpoint is going to be unlinked.

Which in turn requires observing two Async Advance interrupts.
EHCI spec says that in the worst case each such interrupt will be
generated after two async list traversals.

Only then the qH is going to be re-linked with the remaining URBs (qTDs).

Although it does not seem like this all would take a lot of time when
the bus is otherwise idle the delays are actually enough to cause a drop
of 100+ Mbps in throughput for me.

If one assumes that the device has a 32 KB on-chip buffer (don't know
that for sure, but I assume that the maximum URB size the driver used
was based on that size) then a drop of 100+ Mbps would mean a delay
on the order of 2 msecs in servicing that endpoint.

By the way, it seems that others were getting even worse throughput
from a r8152 device.
People at https://bugzilla.kernel.org/show_bug.cgi?id=205923 report
getting only around 60 Mbps.

It looks like that the URB buffer size in the r8152 driver was now
reduced to 16 KB to work around this issue.

>> One of affected drivers here is drivers/net/usb/r8152.c.
>>
>> If I simply reduce its per-URB transfer buffer to 20 KB (the maximum
>> that fits in a well-aligned qTD) the RX rate increases from around
>> 100 Mbps to 200+ Mbps (on an ICH8M controller):
>
>> The driver default is to use 32 KB buffers (which span two qTDs),
>> but the device rarely fully fills the first qTD resulting in
>> repetitive stalls and more than halving the performance.
>>
>> As far as I can see, the relevant code in
>> drivers/usb/host/ehci-q.c::qh_urb_transaction() predates the git era.
>
> Like I said, a long time ago.
>
>> The comment in that function before setting the Alternate Next qTD
>> pointer:
>>> /*
>>> * short reads advance to a "magic" dummy instead of the next
>>> * qtd ... that forces the queue to stop, for manual cleanup.
>>> * (this will usually be overridden later.)
>>> */
>>
>> ...suggests the idea was to override that pointer when
>> URB_SHORT_NOT_OK is not set, but this is actually done only for
>> the last qTD from the URB (also, that's the only one that ends
>> with interrupt flag set).
>
> The hw_alt_next field should be updated for all the qTDs in the URB.
> Failure to this was probably an oversight. Or maybe the omission was
> to simplify the procedure for cleaning up the queue after a short
> transfer.

Might be, as I have said above, the cleanup procedure will also need
changes to handle these short qTDs.

>> Looking at OHCI and UHCI host controller drivers the equivalent
>> limits seem to be different there (8 KB and 2 KB), while I don't
>> see any specific limit in the XHCI case.
>
> I'd have to review the details of ohci-hcd and uhci-hcd to make
> sure. In principle, the queue isn't supposed to stop merely because
> of a short transfer unless URB_SHORT_NOT_OK is set. However, the UHCI
> hardware in particular may offer no other way to handle a short transfer.

Here I think it is lesser of an issue due to sheer slowness of these
devices.

So even if an URB needs some extra processing time the device should
still be able to maintain that 12 Mbps.
But I might be wrong here for USB devices with super-small on-chip
FIFOs.

>> Because of that variance in the URB buffer limit it seems strange
>> to me that this should be managed by a particular USB device driver
>> rather than by the host controller driver, because this would mean
>> every such driver would need to either use the lowest common
>> denominator for the URB buffer size (which is very small) or
>> hardcode the limit for every host controller that the device can
>> be connected to, which seems a bit inefficient.
>
> I don't understand what you're saying in this paragraph. What do you
> think USB device drivers are supposed to be managing? The URB buffer
> size?

Yes, I've meant the URB "transfer_buffer_length".

> They should set that field without regard to the type of host
> controller in use.

That's what I had on mind by saying that it seems strange to me that
the URB buffer size should be managed by a particular USB device driver
depending on the host controller in use.

> In short, the behavior you observed is a bug, resulting in a loss of
> throughput (though not in any loss of data). It needs to be fixed.
>
> If you would like to write and submit a patch, that would be great.
> Otherwise, I'll try to find time to work on it.

Unfortunately, I doubt I will be able to work on this in coming weeks
due to time constraints, I'm sorry :(

> I would appreciate any effort you could make toward checking the code
> in qh_completions(); I suspect that the checks it does involving
> EHCI_LIST_END may not be right. At the very least, they should be
> encapsulated in a macro so that they are easier to understand.

I've went through the (short) URB linking and unlinking code
(including qh_completions()) and I haven't found anything suspicious
there, besides one thing that's actually on the URB *linking* path:
in qh_append_tds() the dummy qTD that is the last qTD in that
endpoint queue is being overwritten using an assignment operator.

While both this dummy qTD and the source qTD that overwrites it have
the HALT bit set it looks a bit uncomfortable to me to see a qTD that
the HC might just be fetching (while trying to advance the queue) being
overwritten.

Like, is C standard giving guarantees that no intermediate values are
being written to a struct when that struct is a target of an assignment
operator?

But apparently this doesn't cause trouble, so I guess in practice
this works okay.

> Alan Stern
>

Thanks,
Maciej

2021-03-31 19:58:15

by Alan Stern

[permalink] [raw]
Subject: Re: >20 KB URBs + EHCI = bad performance due to stalls

On Wed, Mar 31, 2021 at 08:20:56PM +0200, Maciej S. Szmigiero wrote:
> On 29.03.2021 17:22, Alan Stern wrote:
> > On Sat, Mar 27, 2021 at 04:55:20PM +0100, Maciej S. Szmigiero wrote:
> > > Hi,
> > >
> > > Is there any specific reason that URBs without URB_SHORT_NOT_OK flag that
> > > span multiple EHCI qTDs have Alternate Next qTD pointer set to the dummy
> > > qTD in their every qTD besides the last one (instead of to the first qTD
> > > of the next URB to that endpoint)?
> >
> > Quick answer: I don't know. I can't think of any good reason. This
> > code was all written a long time ago. Maybe the issue was overlooked
> > or the details were misunderstood.
>
> I've dug out the original EHCI driver, that landed in 2.5.2:
> https://marc.info/?l=linux-usb-devel&m=100875066109269&w=2
> https://marc.info/?l=linux-usb-devel&m=100982880716373&w=2
>
> It already had the following qTD setup code, roughly similar to what
> the current one does:
> > /* previous urb allows short rx? maybe optimize. */
> > if (!(last_qtd->urb->transfer_flags & USB_DISABLE_SPD)
> > && (epnum & 0x10)) {
> > // only the last QTD for now
> > last_qtd->hw_alt_next = hw_next;
>
> The "for now" language seems to suggest that ultimately other-than-last
> qTDs were supposed to be set not to stall the queue, too.
> Just the code to handle this case was never written.

Probably it just slipped out of the developer's mind.

> It seems to me though, this should be possible with relatively few
> changes to the code:
> qh_append_tds() will need to patch these other-than-last qTDs
> hw_alt_next pointer to point to the (new) dummy qTD (instead of just
> pointing the last submitted qTD hw_next to it and adding the remaining
> qTDs verbatim to the qH qTD list).

Right.

> Then qh_completions() will need few changes:
> *
> > } else if (IS_SHORT_READ (token)
> > && !(qtd->hw_alt_next
> > & EHCI_LIST_END(ehci))) {
> This branch will need to be modified not to mark the queue as stopped
> and request its unlinking when such type of short qTD was processed.

This would be a good place to introduce a macro. For example:

} else if (IS_SHORT_READ(token) &&
EHCI_PTR_IS_SET(qtd->hw_alt_next)) {

or something similar.

> * The ACTIVE bit should be treated as unset on any qTD following the
> one that hits the above condition until a qTD for a different URB is
> encountered.
> Otherwise the unprocessed remaining qTDs from that short URB will be
> considered pending active qTDs and the code will wait forever for their
> processing,

The treatment shouldn't be exactly the same as if ACTIVE is clear. The
following qTDs can be removed from the list and deallocated immediately,
since the hardware won't look at them. And they shouldn't affect the
URB's status.

> * The code that patches the previous qTD hw_next pointer when removing a
> qTD that isn't currently at the qH will need changing to also patch
> hw_alt_next pointers of the qTDs belonging to the previous URB in case
> the previous URB was one of these short-read-ok ones.

Yes. Awkward but necessary.

Although I know nothing at all about the USB API in Windows, I suspect
that it manages to avoid this awkwardness entirely by not allowing URBs
in the middle of the queue to be unlinked. Or perhaps allowing it only
for endpoint 0. I've often wished Linux's API had been written that
way.

> That's was my quick assessment what is required to handle these
> transactions effectively in the EHCI driver.
>
> I suspect, however, there may be some corner cases involving
> non-ordinary qTD unlinking which might need fixing, too (like caused
> by usb_unlink_urb(), system suspend or HC removal).
> But I am not sure about this since I don't know this code well.

Those shouldn't present any difficulty. There are inherently easier to
handle because the QH won't be actively running when they occur.

> > > This causes that endpoint queue to stall in case of a short read that
> > > does not reach the last qTD (I guess this condition persists until an
> > > URB is (re)submitted to that endpoint, but I am not sure here).
> >
> > It persists until the driver cleans up the queue.
>
> I guess by "the driver" you mean the host controller driver, not the USB
> device driver.

Yes, I meant ehci-hcd.

> > > Looking at OHCI and UHCI host controller drivers the equivalent
> > > limits seem to be different there (8 KB and 2 KB), while I don't
> > > see any specific limit in the XHCI case.
> >
> > I'd have to review the details of ohci-hcd and uhci-hcd to make
> > sure. In principle, the queue isn't supposed to stop merely because
> > of a short transfer unless URB_SHORT_NOT_OK is set. However, the UHCI
> > hardware in particular may offer no other way to handle a short transfer.
>
> Here I think it is lesser of an issue due to sheer slowness of these
> devices.
>
> So even if an URB needs some extra processing time the device should
> still be able to maintain that 12 Mbps.
> But I might be wrong here for USB devices with super-small on-chip
> FIFOs.
>
> > > Because of that variance in the URB buffer limit it seems strange
> > > to me that this should be managed by a particular USB device driver
> > > rather than by the host controller driver, because this would mean
> > > every such driver would need to either use the lowest common
> > > denominator for the URB buffer size (which is very small) or
> > > hardcode the limit for every host controller that the device can
> > > be connected to, which seems a bit inefficient.
> >
> > I don't understand what you're saying in this paragraph. What do you
> > think USB device drivers are supposed to be managing? The URB buffer
> > size?
>
> Yes, I've meant the URB "transfer_buffer_length".
>
> > They should set that field without regard to the type of host
> > controller in use.
>
> That's what I had on mind by saying that it seems strange to me that
> the URB buffer size should be managed by a particular USB device driver
> depending on the host controller in use.
>
> > In short, the behavior you observed is a bug, resulting in a loss of
> > throughput (though not in any loss of data). It needs to be fixed.
> >
> > If you would like to write and submit a patch, that would be great.
> > Otherwise, I'll try to find time to work on it.
>
> Unfortunately, I doubt I will be able to work on this in coming weeks
> due to time constraints, I'm sorry :(

All right, then I'll work on it when time permits.

> > I would appreciate any effort you could make toward checking the code
> > in qh_completions(); I suspect that the checks it does involving
> > EHCI_LIST_END may not be right. At the very least, they should be
> > encapsulated in a macro so that they are easier to understand.
>
> I've went through the (short) URB linking and unlinking code
> (including qh_completions()) and I haven't found anything suspicious
> there, besides one thing that's actually on the URB *linking* path:
> in qh_append_tds() the dummy qTD that is the last qTD in that
> endpoint queue is being overwritten using an assignment operator.
>
> While both this dummy qTD and the source qTD that overwrites it have
> the HALT bit set it looks a bit uncomfortable to me to see a qTD that
> the HC might just be fetching (while trying to advance the queue) being
> overwritten.

I agree. But there's no way around it; if you're going to change the
contents of the qTD queue while the QH is running, at some point you
have to overwrite something that the controller might be accessing
concurrently.

> Like, is C standard giving guarantees that no intermediate values are
> being written to a struct when that struct is a target of an assignment
> operator?

THe C standard doesn't say anything like that, but the kernel does
generally rely on such behavior. However, it wouldn't hurt to mark this
special case by using WRITE_ONCE.

> But apparently this doesn't cause trouble, so I guess in practice
> this works okay.

Yes, it does.

Alan Stern

2021-03-31 21:26:29

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: >20 KB URBs + EHCI = bad performance due to stalls

On 31.03.2021 21:55, Alan Stern wrote:
> On Wed, Mar 31, 2021 at 08:20:56PM +0200, Maciej S. Szmigiero wrote:
>> On 29.03.2021 17:22, Alan Stern wrote:
>>> On Sat, Mar 27, 2021 at 04:55:20PM +0100, Maciej S. Szmigiero wrote:
>>>> Hi,
>>>>
>>>> Is there any specific reason that URBs without URB_SHORT_NOT_OK flag that
>>>> span multiple EHCI qTDs have Alternate Next qTD pointer set to the dummy
>>>> qTD in their every qTD besides the last one (instead of to the first qTD
>>>> of the next URB to that endpoint)?
>>>
>>> Quick answer: I don't know. I can't think of any good reason. This
>>> code was all written a long time ago. Maybe the issue was overlooked
>>> or the details were misunderstood.
>>
>> I've dug out the original EHCI driver, that landed in 2.5.2:
>> https://marc.info/?l=linux-usb-devel&m=100875066109269&w=2
>> https://marc.info/?l=linux-usb-devel&m=100982880716373&w=2
>>
>> It already had the following qTD setup code, roughly similar to what
>> the current one does:
>>> /* previous urb allows short rx? maybe optimize. */
>>> if (!(last_qtd->urb->transfer_flags & USB_DISABLE_SPD)
>>> && (epnum & 0x10)) {
>>> // only the last QTD for now
>>> last_qtd->hw_alt_next = hw_next;
>>
>> The "for now" language seems to suggest that ultimately other-than-last
>> qTDs were supposed to be set not to stall the queue, too.
>> Just the code to handle this case was never written.
>
> Probably it just slipped out of the developer's mind.
>
>> It seems to me though, this should be possible with relatively few
>> changes to the code:
>> qh_append_tds() will need to patch these other-than-last qTDs
>> hw_alt_next pointer to point to the (new) dummy qTD (instead of just
>> pointing the last submitted qTD hw_next to it and adding the remaining
>> qTDs verbatim to the qH qTD list).
>
> Right.
>
>> Then qh_completions() will need few changes:
>> *
>>> } else if (IS_SHORT_READ (token)
>>> && !(qtd->hw_alt_next
>>> & EHCI_LIST_END(ehci))) {
>> This branch will need to be modified not to mark the queue as stopped
>> and request its unlinking when such type of short qTD was processed.
>
> This would be a good place to introduce a macro. For example:
>
> } else if (IS_SHORT_READ(token) &&
> EHCI_PTR_IS_SET(qtd->hw_alt_next)) {
>
> or something similar.

I agree, this certainly looks more readable.

>> * The ACTIVE bit should be treated as unset on any qTD following the
>> one that hits the above condition until a qTD for a different URB is
>> encountered.
>> Otherwise the unprocessed remaining qTDs from that short URB will be
>> considered pending active qTDs and the code will wait forever for their
>> processing,
>
> The treatment shouldn't be exactly the same as if ACTIVE is clear. The
> following qTDs can be removed from the list and deallocated immediately,
> since the hardware won't look at them. And they shouldn't affect the
> URB's status.

From my understanding of qh_completions() if these "remaining" qTDs from
a short read are treated as !ACTIVE then none of the conditions in
!ACTIVE branch will hit: they don't have DBE or HALT set and they aren't
queue-stopping short read qTDs (I am assuming here that the
aforementioned qtd->hw_alt_next condition will be changed to exclude
them).

So the qTD processing loop will reach "if (last_status == -EINPROGRESS)",
this will be false since previous qTD (that one that has actually
partially completed) had already set the last_status to -EREMOTEIO.
Then the code will delete this qTD from qTD list and go to the next qTD
(either next "remaining" qTD from that URB or from a different URB).

Once a qTD from a different URB is encountered the special treatment of
ACTIVE qTDs as !ACTIVE has to end.

>> * The code that patches the previous qTD hw_next pointer when removing a
>> qTD that isn't currently at the qH will need changing to also patch
>> hw_alt_next pointers of the qTDs belonging to the previous URB in case
>> the previous URB was one of these short-read-ok ones.
>
> Yes. Awkward but necessary.
>
> Although I know nothing at all about the USB API in Windows, I suspect
> that it manages to avoid this awkwardness entirely by not allowing URBs
> in the middle of the queue to be unlinked. Or perhaps allowing it only
> for endpoint 0. I've often wished Linux's API had been written that
> way.

According to Microsoft docs every IRP that might remain queued for an
indefinite interval has to have a cancel handler.
URBs are submitted wrapped in IRPs, so at least in theory it should be
possible to cancel them.
But I don't know how it works in practice.

I also remember getting "warning: guest updated active QH" often when
launching Windows VMs in QEMU.
That does not seem like a good sign, but it might ultimately be a
deficiency in QEMU EHCI HC emulation, not Windows.

In Linux at least we could (theoretically) change the API and modify
all the client drivers.
But I think the benefit is not worth the cost at that point.

>> That's was my quick assessment what is required to handle these
>> transactions effectively in the EHCI driver.
>>
>> I suspect, however, there may be some corner cases involving
>> non-ordinary qTD unlinking which might need fixing, too (like caused
>> by usb_unlink_urb(), system suspend or HC removal).
>> But I am not sure about this since I don't know this code well.
>
> Those shouldn't present any difficulty. There are inherently easier to
> handle because the QH won't be actively running when they occur.

I've meant that these can exercise a different code path than ordinary
unlink so one has to check this, too.

(...)
>> That's what I had on mind by saying that it seems strange to me that
>> the URB buffer size should be managed by a particular USB device driver
>> depending on the host controller in use.
>>
>>> In short, the behavior you observed is a bug, resulting in a loss of
>>> throughput (though not in any loss of data). It needs to be fixed.
>>>
>>> If you would like to write and submit a patch, that would be great.
>>> Otherwise, I'll try to find time to work on it.
>>
>> Unfortunately, I doubt I will be able to work on this in coming weeks
>> due to time constraints, I'm sorry :(
>
> All right, then I'll work on it when time permits.

That's great, thanks.

>>> I would appreciate any effort you could make toward checking the code
>>> in qh_completions(); I suspect that the checks it does involving
>>> EHCI_LIST_END may not be right. At the very least, they should be
>>> encapsulated in a macro so that they are easier to understand.
>>
>> I've went through the (short) URB linking and unlinking code
>> (including qh_completions()) and I haven't found anything suspicious
>> there, besides one thing that's actually on the URB *linking* path:
>> in qh_append_tds() the dummy qTD that is the last qTD in that
>> endpoint queue is being overwritten using an assignment operator.
>>
>> While both this dummy qTD and the source qTD that overwrites it have
>> the HALT bit set it looks a bit uncomfortable to me to see a qTD that
>> the HC might just be fetching (while trying to advance the queue) being
>> overwritten.
>
> I agree. But there's no way around it; if you're going to change the
> contents of the qTD queue while the QH is running, at some point you
> have to overwrite something that the controller might be accessing
> concurrently.

I agree, that's a bit unfortunate.

Unless one unlinks the qH temporarily (but as we have observed that's
rather slow) or disables the async list for a moment (probably slow,
too, and impacts other endpoints).

>> Like, is C standard giving guarantees that no intermediate values are
>> being written to a struct when that struct is a target of an assignment
>> operator?
>
> THe C standard doesn't say anything like that, but the kernel does
> generally rely on such behavior.

I see, thanks.

> However, it wouldn't hurt to mark this
> special case by using WRITE_ONCE.

I think WRITE_ONCE() at least to the hw_token would make a lot of sense
here.

>> But apparently this doesn't cause trouble, so I guess in practice
>> this works okay.
>
> Yes, it does.
>
> Alan Stern
>

Thanks,
Maciej