2020-04-15 22:12:36

by John Stultz

[permalink] [raw]
Subject: More dwc3 gadget issues with adb

Hey Felipe,
Last week or so, a change[1] in AOSP to adbd seemingly uncovered
another issue with dwc3 gadget scatter-gather support on HiKey960.

Interestingly it doesn't seem to affect the Dragonboard 845c, which
uses the same dwc3 driver and has had its own issues in the past.

The behavior is the same as we saw last time around with both devices.
After booting the device, running "adb logcat -d" (or really any adb
command that transfers more than a trivial amount of data) on the host
will result in the adb output seeming to stall. Any further adb
invocations to the device will hang indefinitely.

I've captured trace events for before the change (works), after the
change (broken), and after the change with the sg_enabled flag turned
off (which works around the problem).

Let me know if there is anything else useful for me to share.

thanks
-john

[1] https://android.googlesource.com/platform/system/core/+/0871824de6c5b868f2a2e67c9b5673cb2181b4c6


Attachments:
hikey960.tar.xz (90.17 kB)

2020-04-16 08:22:54

by Felipe Balbi

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb


Hi,

John Stultz <[email protected]> writes:
> Hey Felipe,
> Last week or so, a change[1] in AOSP to adbd seemingly uncovered
> another issue with dwc3 gadget scatter-gather support on HiKey960.
>
> Interestingly it doesn't seem to affect the Dragonboard 845c, which
> uses the same dwc3 driver and has had its own issues in the past.
>
> The behavior is the same as we saw last time around with both devices.
> After booting the device, running "adb logcat -d" (or really any adb
> command that transfers more than a trivial amount of data) on the host
> will result in the adb output seeming to stall. Any further adb
> invocations to the device will hang indefinitely.
>
> I've captured trace events for before the change (works), after the
> change (broken), and after the change with the sg_enabled flag turned
> off (which works around the problem).
>
> Let me know if there is anything else useful for me to share.

First the obvious questions: Which kernel version is this? What does
"before" and "after" refer to in our traces? What are first working and
first failing versions? Can you run git bisect?

One thing I noticed is that we're missing a giveback on ep1out. Here's a
working case:

UsbFfs-worker-580 [002] d..1 66.704886: dwc3_alloc_request: ep1out: req 0000000011c55648 length 0/0 zsI ==> 0
UsbFfs-worker-580 [002] d..2 66.704889: dwc3_ep_queue: ep1out: req 0000000011c55648 length 0/16384 zsI ==> -115
UsbFfs-worker-580 [002] d..2 66.704892: dwc3_prepare_trb: ep1out: trb 000000003559c11c (E27:D7) buf 000000008843b000 size 16384 ctrl 00000819 (HlcS:sC:normal)
UsbFfs-worker-580 [002] d..2 66.704897: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
irq/65-dwc3-522 [000] d..1 66.705053: dwc3_event: event (00006084): ep1out: Transfer In Progress [0] (SIm)
irq/65-dwc3-522 [000] d..1 66.705054: dwc3_complete_trb: ep1out: trb 000000008c350fb3 (E27:D8) buf 0000000089d6b000 size 16360 ctrl 00000818 (hlcS:sC:normal)
irq/65-dwc3-522 [000] d..1 66.705058: dwc3_gadget_giveback: ep1out: req 0000000001b9ed3f length 24/16384 zsI ==> 0
kworker/u16:2-260 [001] .... 66.705097: dwc3_free_request: ep1out: req 0000000001b9ed3f length 24/16384 zsI ==> 0

and the failure point:

UsbFfs-worker-580 [002] d..1 66.705129: dwc3_alloc_request: ep1out: req 0000000067a34de4 length 0/0 zsI ==> 0
UsbFfs-worker-580 [002] d..2 66.705131: dwc3_ep_queue: ep1out: req 0000000067a34de4 length 0/16384 zsI ==> -115
UsbFfs-worker-580 [002] d..2 66.705134: dwc3_prepare_trb: ep1out: trb 00000000f3db4076 (E28:D8) buf 000000008843f000 size 16384 ctrl 00000819 (HlcS:sC:normal)
UsbFfs-worker-580 [002] d..2 66.705141: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
irq/65-dwc3-522 [000] d..1 66.705309: dwc3_event: event (00006084): ep1out: Transfer In Progress [0] (SIm)
irq/65-dwc3-522 [000] d..1 66.705310: dwc3_complete_trb: ep1out: trb 0000000092deef41 (E28:D9) buf 00000000ba8f1000 size 4072 ctrl 0000001c (hlCS:sc:normal)
irq/65-dwc3-522 [000] d..1 66.705318: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
irq/65-dwc3-522 [000] d..1 66.705323: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
irq/65-dwc3-522 [000] d..1 66.705329: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
irq/65-dwc3-522 [000] d..1 66.705334: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
irq/65-dwc3-522 [000] d..1 66.705339: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
irq/65-dwc3-522 [000] d..1 66.705344: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
irq/65-dwc3-522 [000] d..1 66.705349: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
irq/65-dwc3-522 [000] d..1 66.705354: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful

One interesting thing is that TRB addresses are "odd". I can't find a
proper lifetime for these TRBs. Do you have IOMMU enabled? Can you run
without it? For example, nowhere in the log can I find the place where
trb 0000000092deef41 was first enqueue. I'm assuming the log to be
ordered, which means that trb is the same as 00000000f3db4076. But why
are the addresses different?

Another weird thing is that even though we CHN bit being set in
0000000092deef41, we don't see where the second trb (the one its chained
to) was prepared. It seems like it was *never* prepared, what gives?

--
balbi


Attachments:
signature.asc (847.00 B)

2020-04-16 20:45:09

by John Stultz

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb

On Thu, Apr 16, 2020 at 1:19 AM Felipe Balbi <[email protected]> wrote:
> John Stultz <[email protected]> writes:
> > Hey Felipe,
> > Last week or so, a change[1] in AOSP to adbd seemingly uncovered
> > another issue with dwc3 gadget scatter-gather support on HiKey960.
> >
> > Interestingly it doesn't seem to affect the Dragonboard 845c, which
> > uses the same dwc3 driver and has had its own issues in the past.
> >
> > The behavior is the same as we saw last time around with both devices.
> > After booting the device, running "adb logcat -d" (or really any adb
> > command that transfers more than a trivial amount of data) on the host
> > will result in the adb output seeming to stall. Any further adb
> > invocations to the device will hang indefinitely.
> >
> > I've captured trace events for before the change (works), after the
> > change (broken), and after the change with the sg_enabled flag turned
> > off (which works around the problem).
> >
> > Let me know if there is anything else useful for me to share.
>
> First the obvious questions: Which kernel version is this?

Apologies. v5.7-rc1, though the same behavior was seen with v5.6 and
v5.4 kernels.

> What does
> "before" and "after" refer to in our traces?

Before and after the userland changes to adb were made.


> What are first working and
> first failing versions? Can you run git bisect?

It's not directly a kernel regression. But it seems like an uncovered
issue by changes in userland.


> One thing I noticed is that we're missing a giveback on ep1out. Here's a
> working case:
>
> UsbFfs-worker-580 [002] d..1 66.704886: dwc3_alloc_request: ep1out: req 0000000011c55648 length 0/0 zsI ==> 0
> UsbFfs-worker-580 [002] d..2 66.704889: dwc3_ep_queue: ep1out: req 0000000011c55648 length 0/16384 zsI ==> -115
> UsbFfs-worker-580 [002] d..2 66.704892: dwc3_prepare_trb: ep1out: trb 000000003559c11c (E27:D7) buf 000000008843b000 size 16384 ctrl 00000819 (HlcS:sC:normal)
> UsbFfs-worker-580 [002] d..2 66.704897: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
> irq/65-dwc3-522 [000] d..1 66.705053: dwc3_event: event (00006084): ep1out: Transfer In Progress [0] (SIm)
> irq/65-dwc3-522 [000] d..1 66.705054: dwc3_complete_trb: ep1out: trb 000000008c350fb3 (E27:D8) buf 0000000089d6b000 size 16360 ctrl 00000818 (hlcS:sC:normal)
> irq/65-dwc3-522 [000] d..1 66.705058: dwc3_gadget_giveback: ep1out: req 0000000001b9ed3f length 24/16384 zsI ==> 0
> kworker/u16:2-260 [001] .... 66.705097: dwc3_free_request: ep1out: req 0000000001b9ed3f length 24/16384 zsI ==> 0
>
> and the failure point:
>
> UsbFfs-worker-580 [002] d..1 66.705129: dwc3_alloc_request: ep1out: req 0000000067a34de4 length 0/0 zsI ==> 0
> UsbFfs-worker-580 [002] d..2 66.705131: dwc3_ep_queue: ep1out: req 0000000067a34de4 length 0/16384 zsI ==> -115
> UsbFfs-worker-580 [002] d..2 66.705134: dwc3_prepare_trb: ep1out: trb 00000000f3db4076 (E28:D8) buf 000000008843f000 size 16384 ctrl 00000819 (HlcS:sC:normal)
> UsbFfs-worker-580 [002] d..2 66.705141: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
> irq/65-dwc3-522 [000] d..1 66.705309: dwc3_event: event (00006084): ep1out: Transfer In Progress [0] (SIm)
> irq/65-dwc3-522 [000] d..1 66.705310: dwc3_complete_trb: ep1out: trb 0000000092deef41 (E28:D9) buf 00000000ba8f1000 size 4072 ctrl 0000001c (hlCS:sc:normal)
> irq/65-dwc3-522 [000] d..1 66.705318: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
> irq/65-dwc3-522 [000] d..1 66.705323: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
> irq/65-dwc3-522 [000] d..1 66.705329: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
> irq/65-dwc3-522 [000] d..1 66.705334: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
> irq/65-dwc3-522 [000] d..1 66.705339: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
> irq/65-dwc3-522 [000] d..1 66.705344: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
> irq/65-dwc3-522 [000] d..1 66.705349: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
> irq/65-dwc3-522 [000] d..1 66.705354: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
>
> One interesting thing is that TRB addresses are "odd". I can't find a
> proper lifetime for these TRBs. Do you have IOMMU enabled? Can you run
> without it? For example, nowhere in the log can I find the place where
> trb 0000000092deef41 was first enqueue. I'm assuming the log to be
> ordered, which means that trb is the same as 00000000f3db4076. But why
> are the addresses different?
>
> Another weird thing is that even though we CHN bit being set in
> 0000000092deef41, we don't see where the second trb (the one its chained
> to) was prepared. It seems like it was *never* prepared, what gives?

I'll try to take a deeper look at these points and get back to you. I
really appreciate your eyes on this!

Thanks for the feedback!
-john

2020-04-22 04:41:11

by John Stultz

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb

On Thu, Apr 16, 2020 at 1:19 AM Felipe Balbi <[email protected]> wrote:
> One thing I noticed is that we're missing a giveback on ep1out. Here's a
> working case:
>

Hey Felipe,
So I found some time to dig around on this today and I started
trying to understand this issue that you've pointed out about missing
the giveback.

It seems part of the issue is we get to a point where we have some req
where pending_sgs is more than one.

We call dwc3_prepare_one_trb_sg() on it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n1068

And we process the sg list incrementing req->num_queued_sgs for each one.

then later, dwc3_gadget_ep_cleanup_completed_request() is called on the request:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2522

We call dwc3_gadget_ep_reclaim_trb_sg()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470

Where we iterate over the req->sg, ideally decrementing
num_pending_sgs each time and return.

But back in dwc3_gadget_ep_cleanup_completed_request() and there
we're hitting the:
if (!dwc3_gadget_ep_request_completed(req) ||
req->num_pending_sgs) {
case which causes us to skip the call to dwc3_gadget_giveback().

Looking as to why the num_pending_sgs is non zero, that's because in
dwc3_gadget_ep_reclaim_trb_sg we're hitting the case where the trb has
the DWC3_TRB_CTRL_HWO ctrl flag set, which breaks us out of the loop
early before we decrement num_pending_sgs.

For that trb, we're setting the HWO flag in __dwc3_prepare_one_trb()
(called from dwc3_prepare_one_trb_sg() back at the beginning):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n921

I added logic showing every time we set or clear that flag, and it
seems like we're always setting it but never clearing it. And often
that's not an issue as we only have one sg entry. But if its set on a
trb in a request with multiple sgs, that's where it seems to be
causing the issue.

I'll continue to dig around to try to understand where it might be
going awry (why we never clear the HWO flag). But figured I'd try to
explain this much in case it rings any bells to you.



> One interesting thing is that TRB addresses are "odd". I can't find a
> proper lifetime for these TRBs. Do you have IOMMU enabled? Can you run
> without it? For example, nowhere in the log can I find the place where
> trb 0000000092deef41 was first enqueue. I'm assuming the log to be
> ordered, which means that trb is the same as 00000000f3db4076. But why
> are the addresses different?
>
> Another weird thing is that even though we CHN bit being set in
> 0000000092deef41, we don't see where the second trb (the one its chained
> to) was prepared. It seems like it was *never* prepared, what gives?

I suspect these bits were due to the tracing happening after some
minor amount of initial adb traffic began at bootup? So the trace
isn't capturing all the events.

Let me know if that doesn't sound reasonable and I'll try to dig a bit
futher on those questions.

thanks
-john

2020-04-22 05:13:44

by John Stultz

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb

On Tue, Apr 21, 2020 at 9:38 PM John Stultz <[email protected]> wrote:
>
> On Thu, Apr 16, 2020 at 1:19 AM Felipe Balbi <[email protected]> wrote:
> > One thing I noticed is that we're missing a giveback on ep1out. Here's a
> > working case:
> >
>
> Hey Felipe,
> So I found some time to dig around on this today and I started
> trying to understand this issue that you've pointed out about missing
> the giveback.
>
> It seems part of the issue is we get to a point where we have some req
> where pending_sgs is more than one.
>
> We call dwc3_prepare_one_trb_sg() on it:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n1068
>
> And we process the sg list incrementing req->num_queued_sgs for each one.
>
> then later, dwc3_gadget_ep_cleanup_completed_request() is called on the request:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2522
>
> We call dwc3_gadget_ep_reclaim_trb_sg()
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470
>
> Where we iterate over the req->sg, ideally decrementing
> num_pending_sgs each time and return.
>
> But back in dwc3_gadget_ep_cleanup_completed_request() and there
> we're hitting the:
> if (!dwc3_gadget_ep_request_completed(req) ||
> req->num_pending_sgs) {
> case which causes us to skip the call to dwc3_gadget_giveback().
>
> Looking as to why the num_pending_sgs is non zero, that's because in
> dwc3_gadget_ep_reclaim_trb_sg we're hitting the case where the trb has
> the DWC3_TRB_CTRL_HWO ctrl flag set, which breaks us out of the loop
> early before we decrement num_pending_sgs.
>
> For that trb, we're setting the HWO flag in __dwc3_prepare_one_trb()
> (called from dwc3_prepare_one_trb_sg() back at the beginning):
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n921
>
> I added logic showing every time we set or clear that flag, and it
> seems like we're always setting it but never clearing it. And often
> that's not an issue as we only have one sg entry. But if its set on a
> trb in a request with multiple sgs, that's where it seems to be
> causing the issue.
>
> I'll continue to dig around to try to understand where it might be
> going awry (why we never clear the HWO flag). But figured I'd try to
> explain this much in case it rings any bells to you.

I was looking some more at this and it seems a little odd...

In dwc3_gadget_ep_reclaim_trb_sg():
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470

The check for (trb->ctrl & DWC3_TRB_CTRL_HWO) which breaks us out of
the loop happens before we call
dwc3_gadget_ep_reclaim_completed_trb():
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2406

Which is what clears the DWC3_TRB_CTRL_HWO flag (outside of
dwc3_gadget_ep_skip_trbs()).

So on a whim I dropped that check, and things go back to working on
HiKey960, no more adb stalls!

Does something like this make sense? It's not causing trouble on
db845c either so far in my testing.

thanks
-john

(sorry gmail will whitespace corrupt this code paste - just want to
communicate what I did clearly, not to apply)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 4d3c79d90a6e..2a26d33520ce 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -2457,9 +2457,6 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct
dwc3_ep *dep,
for_each_sg(sg, s, pending, i) {
trb = &dep->trb_pool[dep->trb_dequeue];

- if (trb->ctrl & DWC3_TRB_CTRL_HWO)
- break;
-
req->sg = sg_next(s);
req->num_pending_sgs--;

2020-04-22 06:02:24

by Felipe Balbi

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb


Hi,

John Stultz <[email protected]> writes:
> On Thu, Apr 16, 2020 at 1:19 AM Felipe Balbi <[email protected]> wrote:
>> One thing I noticed is that we're missing a giveback on ep1out. Here's a
>> working case:
>>
>
> Hey Felipe,
> So I found some time to dig around on this today and I started
> trying to understand this issue that you've pointed out about missing
> the giveback.
>
> It seems part of the issue is we get to a point where we have some req
> where pending_sgs is more than one.
>
> We call dwc3_prepare_one_trb_sg() on it:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n1068
>
> And we process the sg list incrementing req->num_queued_sgs for each one.
>
> then later, dwc3_gadget_ep_cleanup_completed_request() is called on the request:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2522
>
> We call dwc3_gadget_ep_reclaim_trb_sg()
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470
>
> Where we iterate over the req->sg, ideally decrementing
> num_pending_sgs each time and return.
>
> But back in dwc3_gadget_ep_cleanup_completed_request() and there
> we're hitting the:
> if (!dwc3_gadget_ep_request_completed(req) ||
> req->num_pending_sgs) {
> case which causes us to skip the call to dwc3_gadget_giveback().

This logic was modified recently. Can you check if today's linus/master
works for you?

> Looking as to why the num_pending_sgs is non zero, that's because in
> dwc3_gadget_ep_reclaim_trb_sg we're hitting the case where the trb has
> the DWC3_TRB_CTRL_HWO ctrl flag set, which breaks us out of the loop
> early before we decrement num_pending_sgs.
>
> For that trb, we're setting the HWO flag in __dwc3_prepare_one_trb()
> (called from dwc3_prepare_one_trb_sg() back at the beginning):
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n921
>
> I added logic showing every time we set or clear that flag, and it
> seems like we're always setting it but never clearing it. And often

HW clears HWO flag. We only have to manually clear in one or two cases.

> that's not an issue as we only have one sg entry. But if its set on a
> trb in a request with multiple sgs, that's where it seems to be
> causing the issue.

The issue is completing with HWO set, which should never happen. Can you
collect tracepoints with linus/master of this particular situation?

>> One interesting thing is that TRB addresses are "odd". I can't find a
>> proper lifetime for these TRBs. Do you have IOMMU enabled? Can you run
>> without it? For example, nowhere in the log can I find the place where
>> trb 0000000092deef41 was first enqueue. I'm assuming the log to be
>> ordered, which means that trb is the same as 00000000f3db4076. But why
>> are the addresses different?
>>
>> Another weird thing is that even though we CHN bit being set in
>> 0000000092deef41, we don't see where the second trb (the one its chained
>> to) was prepared. It seems like it was *never* prepared, what gives?
>
> I suspect these bits were due to the tracing happening after some
> minor amount of initial adb traffic began at bootup? So the trace
> isn't capturing all the events.

No, no. That's more likely to be IOMMU mucking up the addresses. ADB is
very sequential in its behavior and USB transfers requests in
order. Please run linus/master without IOMMU.

cheers

--
balbi


Attachments:
signature.asc (847.00 B)

2020-04-22 18:36:46

by John Stultz

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb

On Tue, Apr 21, 2020 at 11:00 PM Felipe Balbi <[email protected]> wrote:
> John Stultz <[email protected]> writes:
> > On Thu, Apr 16, 2020 at 1:19 AM Felipe Balbi <[email protected]> wrote:
> >> One thing I noticed is that we're missing a giveback on ep1out. Here's a
> >> working case:
> >>
> >
> > Hey Felipe,
> > So I found some time to dig around on this today and I started
> > trying to understand this issue that you've pointed out about missing
> > the giveback.
> >
> > It seems part of the issue is we get to a point where we have some req
> > where pending_sgs is more than one.
> >
> > We call dwc3_prepare_one_trb_sg() on it:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n1068
> >
> > And we process the sg list incrementing req->num_queued_sgs for each one.
> >
> > then later, dwc3_gadget_ep_cleanup_completed_request() is called on the request:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2522
> >
> > We call dwc3_gadget_ep_reclaim_trb_sg()
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470
> >
> > Where we iterate over the req->sg, ideally decrementing
> > num_pending_sgs each time and return.
> >
> > But back in dwc3_gadget_ep_cleanup_completed_request() and there
> > we're hitting the:
> > if (!dwc3_gadget_ep_request_completed(req) ||
> > req->num_pending_sgs) {
> > case which causes us to skip the call to dwc3_gadget_giveback().
>
> This logic was modified recently. Can you check if today's linus/master
> works for you?

I was testing this against 5.7-rc2, but I updated to linus/master and
I'm not seeing any change.

> > Looking as to why the num_pending_sgs is non zero, that's because in
> > dwc3_gadget_ep_reclaim_trb_sg we're hitting the case where the trb has
> > the DWC3_TRB_CTRL_HWO ctrl flag set, which breaks us out of the loop
> > early before we decrement num_pending_sgs.
> >
> > For that trb, we're setting the HWO flag in __dwc3_prepare_one_trb()
> > (called from dwc3_prepare_one_trb_sg() back at the beginning):
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n921
> >
> > I added logic showing every time we set or clear that flag, and it
> > seems like we're always setting it but never clearing it. And often
>
> HW clears HWO flag. We only have to manually clear in one or two cases.

I guess the bit that worries me with the current code is in
dwc3_gadget_ep_reclaim_completed_trb():
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2406

There is the logic:
if (chain && (trb->ctrl & DWC3_TRB_CTRL_HWO))
trb->ctrl &= ~DWC3_TRB_CTRL_HWO;

But that will never trip, as the only time we call
dwc3_gadget_ep_reclaim_completed_trb() with chain==true is from
dwc3_gadget_ep_reclaim_trb_sg():
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470

Where if the trb->ctrl has the HWO flag set, we break out before
calling dwc3_gadget_ep_reclaim_completed_trb().

So the logic quoted above seems to be effectively dead code (despite
the big comment as to why we need it)

That's why I suspected the HWO check in
dwc3_gadget_ep_reclaim_trb_sg() is problematic.


> > that's not an issue as we only have one sg entry. But if its set on a
> > trb in a request with multiple sgs, that's where it seems to be
> > causing the issue.
>
> The issue is completing with HWO set, which should never happen. Can you
> collect tracepoints with linus/master of this particular situation?

Will do, though again, its a little tough as often we hit the stall
state pretty quickly at bootup, before I can even try to enable
tracing, so it may take a few tries.

> >> One interesting thing is that TRB addresses are "odd". I can't find a
> >> proper lifetime for these TRBs. Do you have IOMMU enabled? Can you run
> >> without it? For example, nowhere in the log can I find the place where
> >> trb 0000000092deef41 was first enqueue. I'm assuming the log to be
> >> ordered, which means that trb is the same as 00000000f3db4076. But why
> >> are the addresses different?
> >>
> >> Another weird thing is that even though we CHN bit being set in
> >> 0000000092deef41, we don't see where the second trb (the one its chained
> >> to) was prepared. It seems like it was *never* prepared, what gives?
> >
> > I suspect these bits were due to the tracing happening after some
> > minor amount of initial adb traffic began at bootup? So the trace
> > isn't capturing all the events.
>
> No, no. That's more likely to be IOMMU mucking up the addresses. ADB is
> very sequential in its behavior and USB transfers requests in
> order. Please run linus/master without IOMMU.

So I didn't have any IOMMU support enabled in the config I was testing
with. I went through and added IOMMU options in my config with
linus/master as well and that didn't seem to change the behavior
either.

I'll get back to you with further trace logs.

thanks
-john

2020-04-23 00:54:52

by John Stultz

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb

On Wed, Apr 22, 2020 at 11:32 AM John Stultz <[email protected]> wrote:
> On Tue, Apr 21, 2020 at 11:00 PM Felipe Balbi <[email protected]> wrote:
> > John Stultz <[email protected]> writes:
> > > that's not an issue as we only have one sg entry. But if its set on a
> > > trb in a request with multiple sgs, that's where it seems to be
> > > causing the issue.
> >
> > The issue is completing with HWO set, which should never happen. Can you
> > collect tracepoints with linus/master of this particular situation?
>
> Will do, though again, its a little tough as often we hit the stall
> state pretty quickly at bootup, before I can even try to enable
> tracing, so it may take a few tries.
>
> > >> One interesting thing is that TRB addresses are "odd". I can't find a
> > >> proper lifetime for these TRBs. Do you have IOMMU enabled? Can you run
> > >> without it? For example, nowhere in the log can I find the place where
> > >> trb 0000000092deef41 was first enqueue. I'm assuming the log to be
> > >> ordered, which means that trb is the same as 00000000f3db4076. But why
> > >> are the addresses different?
> > >>
> > >> Another weird thing is that even though we CHN bit being set in
> > >> 0000000092deef41, we don't see where the second trb (the one its chained
> > >> to) was prepared. It seems like it was *never* prepared, what gives?
> > >
> > > I suspect these bits were due to the tracing happening after some
> > > minor amount of initial adb traffic began at bootup? So the trace
> > > isn't capturing all the events.
> >
> > No, no. That's more likely to be IOMMU mucking up the addresses. ADB is
> > very sequential in its behavior and USB transfers requests in
> > order. Please run linus/master without IOMMU.
>
> So I didn't have any IOMMU support enabled in the config I was testing
> with. I went through and added IOMMU options in my config with
> linus/master as well and that didn't seem to change the behavior
> either.
>
> I'll get back to you with further trace logs.

Ok. Attached are trace logs. Two bad cases, which are linus/master w/
all IOMMU configs disabled.

Then I have a good case where I've removed the
if (trb->ctrl & DWC3_TRB_CTRL_HWO)
break;
logic in dwc3_gadget_ep_reclaim_trb_sg().


thanks
-john


Attachments:
hikey960.tar.xz (84.58 kB)

2020-04-23 09:24:36

by Felipe Balbi

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb


Hi,

John Stultz <[email protected]> writes:
> On Tue, Apr 21, 2020 at 11:00 PM Felipe Balbi <[email protected]> wrote:
>> John Stultz <[email protected]> writes:
>> > On Thu, Apr 16, 2020 at 1:19 AM Felipe Balbi <[email protected]> wrote:
>> >> One thing I noticed is that we're missing a giveback on ep1out. Here's a
>> >> working case:
>> >>
>> >
>> > Hey Felipe,
>> > So I found some time to dig around on this today and I started
>> > trying to understand this issue that you've pointed out about missing
>> > the giveback.
>> >
>> > It seems part of the issue is we get to a point where we have some req
>> > where pending_sgs is more than one.
>> >
>> > We call dwc3_prepare_one_trb_sg() on it:
>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n1068
>> >
>> > And we process the sg list incrementing req->num_queued_sgs for each one.
>> >
>> > then later, dwc3_gadget_ep_cleanup_completed_request() is called on the request:
>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2522
>> >
>> > We call dwc3_gadget_ep_reclaim_trb_sg()
>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470
>> >
>> > Where we iterate over the req->sg, ideally decrementing
>> > num_pending_sgs each time and return.
>> >
>> > But back in dwc3_gadget_ep_cleanup_completed_request() and there
>> > we're hitting the:
>> > if (!dwc3_gadget_ep_request_completed(req) ||
>> > req->num_pending_sgs) {
>> > case which causes us to skip the call to dwc3_gadget_giveback().
>>
>> This logic was modified recently. Can you check if today's linus/master
>> works for you?
>
> I was testing this against 5.7-rc2, but I updated to linus/master and
> I'm not seeing any change.

good

>> > Looking as to why the num_pending_sgs is non zero, that's because in
>> > dwc3_gadget_ep_reclaim_trb_sg we're hitting the case where the trb has
>> > the DWC3_TRB_CTRL_HWO ctrl flag set, which breaks us out of the loop
>> > early before we decrement num_pending_sgs.
>> >
>> > For that trb, we're setting the HWO flag in __dwc3_prepare_one_trb()
>> > (called from dwc3_prepare_one_trb_sg() back at the beginning):
>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n921
>> >
>> > I added logic showing every time we set or clear that flag, and it
>> > seems like we're always setting it but never clearing it. And often
>>
>> HW clears HWO flag. We only have to manually clear in one or two cases.
>
> I guess the bit that worries me with the current code is in
> dwc3_gadget_ep_reclaim_completed_trb():
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2406
>
> There is the logic:
> if (chain && (trb->ctrl & DWC3_TRB_CTRL_HWO))
> trb->ctrl &= ~DWC3_TRB_CTRL_HWO;
>
> But that will never trip, as the only time we call
> dwc3_gadget_ep_reclaim_completed_trb() with chain==true is from
> dwc3_gadget_ep_reclaim_trb_sg():
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470
>
> Where if the trb->ctrl has the HWO flag set, we break out before
> calling dwc3_gadget_ep_reclaim_completed_trb().
>
> So the logic quoted above seems to be effectively dead code (despite
> the big comment as to why we need it)

There is another situation when we use CHN bit. If we're starting an OUT
transfer and the length is not aligned to wMaxPacketSize.

> That's why I suspected the HWO check in
> dwc3_gadget_ep_reclaim_trb_sg() is problematic.

okay

>> > that's not an issue as we only have one sg entry. But if its set on a
>> > trb in a request with multiple sgs, that's where it seems to be
>> > causing the issue.
>>
>> The issue is completing with HWO set, which should never happen. Can you
>> collect tracepoints with linus/master of this particular situation?
>
> Will do, though again, its a little tough as often we hit the stall
> state pretty quickly at bootup, before I can even try to enable
> tracing, so it may take a few tries.

IIRC there's a way to configure tracepoints from cmdline, I never tested
it though.

>> >> One interesting thing is that TRB addresses are "odd". I can't find a
>> >> proper lifetime for these TRBs. Do you have IOMMU enabled? Can you run
>> >> without it? For example, nowhere in the log can I find the place where
>> >> trb 0000000092deef41 was first enqueue. I'm assuming the log to be
>> >> ordered, which means that trb is the same as 00000000f3db4076. But why
>> >> are the addresses different?
>> >>
>> >> Another weird thing is that even though we CHN bit being set in
>> >> 0000000092deef41, we don't see where the second trb (the one its chained
>> >> to) was prepared. It seems like it was *never* prepared, what gives?
>> >
>> > I suspect these bits were due to the tracing happening after some
>> > minor amount of initial adb traffic began at bootup? So the trace
>> > isn't capturing all the events.
>>
>> No, no. That's more likely to be IOMMU mucking up the addresses. ADB is
>> very sequential in its behavior and USB transfers requests in
>> order. Please run linus/master without IOMMU.
>
> So I didn't have any IOMMU support enabled in the config I was testing
> with. I went through and added IOMMU options in my config with
> linus/master as well and that didn't seem to change the behavior
> either.

pretty odd that the TRB addresses change like that.

> I'll get back to you with further trace logs.

thanks

--
balbi


Attachments:
signature.asc (847.00 B)

2020-04-23 09:52:47

by Felipe Balbi

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb

John Stultz <[email protected]> writes:

> On Wed, Apr 22, 2020 at 11:32 AM John Stultz <[email protected]> wrote:
>> On Tue, Apr 21, 2020 at 11:00 PM Felipe Balbi <[email protected]> wrote:
>> > John Stultz <[email protected]> writes:
>> > > that's not an issue as we only have one sg entry. But if its set on a
>> > > trb in a request with multiple sgs, that's where it seems to be
>> > > causing the issue.
>> >
>> > The issue is completing with HWO set, which should never happen. Can you
>> > collect tracepoints with linus/master of this particular situation?
>>
>> Will do, though again, its a little tough as often we hit the stall
>> state pretty quickly at bootup, before I can even try to enable
>> tracing, so it may take a few tries.
>>
>> > >> One interesting thing is that TRB addresses are "odd". I can't find a
>> > >> proper lifetime for these TRBs. Do you have IOMMU enabled? Can you run
>> > >> without it? For example, nowhere in the log can I find the place where
>> > >> trb 0000000092deef41 was first enqueue. I'm assuming the log to be
>> > >> ordered, which means that trb is the same as 00000000f3db4076. But why
>> > >> are the addresses different?
>> > >>
>> > >> Another weird thing is that even though we CHN bit being set in
>> > >> 0000000092deef41, we don't see where the second trb (the one its chained
>> > >> to) was prepared. It seems like it was *never* prepared, what gives?
>> > >
>> > > I suspect these bits were due to the tracing happening after some
>> > > minor amount of initial adb traffic began at bootup? So the trace
>> > > isn't capturing all the events.
>> >
>> > No, no. That's more likely to be IOMMU mucking up the addresses. ADB is
>> > very sequential in its behavior and USB transfers requests in
>> > order. Please run linus/master without IOMMU.
>>
>> So I didn't have any IOMMU support enabled in the config I was testing
>> with. I went through and added IOMMU options in my config with
>> linus/master as well and that didn't seem to change the behavior
>> either.
>>
>> I'll get back to you with further trace logs.
>
> Ok. Attached are trace logs. Two bad cases, which are linus/master w/
> all IOMMU configs disabled.
>
> Then I have a good case where I've removed the
> if (trb->ctrl & DWC3_TRB_CTRL_HWO)
> break;
> logic in dwc3_gadget_ep_reclaim_trb_sg().

Okay, here's the failing case:

UsbFfs-worker-574 [001] d..2 261.788895: dwc3_ep_queue: ep1out: req 000000006efef4fb length 0/16384 zsI ==> -115
UsbFfs-worker-574 [001] d..2 261.788922: dwc3_prepare_trb: ep1out: trb 00000000fa0a991a (E11:D4) buf 00000000ab3d1000 size 4096 ctrl 0000001d (HlCS:sc:normal)
UsbFfs-worker-574 [001] d..2 261.788927: dwc3_prepare_trb: ep1out: trb 000000007f6b91bb (E12:D4) buf 00000000a9534000 size 4096 ctrl 0000001d (HlCS:sc:normal)
UsbFfs-worker-574 [001] d..2 261.788930: dwc3_prepare_trb: ep1out: trb 00000000a64ab194 (E13:D4) buf 00000000d6447000 size 4096 ctrl 0000001d (HlCS:sc:normal)
UsbFfs-worker-574 [001] d..2 261.788934: dwc3_prepare_trb: ep1out: trb 00000000571f893b (E14:D4) buf 00000000d666a000 size 4096 ctrl 00000819 (HlcS:sC:normal)
UsbFfs-worker-574 [001] d..2 261.788962: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful
irq/65-dwc3-515 [000] d..1 261.821056: dwc3_event: event (00006084): ep1out: Transfer In Progress [0] (SIm)
irq/65-dwc3-515 [000] d..1 261.821057: dwc3_complete_trb: ep1out: trb 00000000fa0a991a (E22:D11) buf 00000000ab3d1000 size 4072 ctrl 0000001c (hlCS:sc:normal)

So, ffs enqueued a request for 16kiB but only 24 bytes were sent by the
host. This is a short-packet case. It seems like that logic is, indeed,
incorrect. I added it for a reason, though I can't be sure based on the
commit log alone.

I think I was trying to cover the case where we didn't prepare all TRBs
for an SG-list because they didn't fit, but I don't think that's the
right way to achieve what I meant :-p

Could you prepare a patch and Cc stable?

--
balbi


Attachments:
signature.asc (847.00 B)

2020-04-24 17:18:05

by Jack Pham

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb

Hey John,

On Tue, Apr 21, 2020 at 10:09:27PM -0700, John Stultz wrote:
> On Tue, Apr 21, 2020 at 9:38 PM John Stultz <[email protected]> wrote:
> >
> > On Thu, Apr 16, 2020 at 1:19 AM Felipe Balbi <[email protected]> wrote:
> > > One thing I noticed is that we're missing a giveback on ep1out. Here's a
> > > working case:
> > >
> >
> > Hey Felipe,
> > So I found some time to dig around on this today and I started
> > trying to understand this issue that you've pointed out about missing
> > the giveback.
> >
> > It seems part of the issue is we get to a point where we have some req
> > where pending_sgs is more than one.
> >
> > We call dwc3_prepare_one_trb_sg() on it:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n1068
> >
> > And we process the sg list incrementing req->num_queued_sgs for each one.
> >
> > then later, dwc3_gadget_ep_cleanup_completed_request() is called on the request:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2522
> >
> > We call dwc3_gadget_ep_reclaim_trb_sg()
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470
> >
> > Where we iterate over the req->sg, ideally decrementing
> > num_pending_sgs each time and return.
> >
> > But back in dwc3_gadget_ep_cleanup_completed_request() and there
> > we're hitting the:
> > if (!dwc3_gadget_ep_request_completed(req) ||
> > req->num_pending_sgs) {
> > case which causes us to skip the call to dwc3_gadget_giveback().
> >
> > Looking as to why the num_pending_sgs is non zero, that's because in
> > dwc3_gadget_ep_reclaim_trb_sg we're hitting the case where the trb has
> > the DWC3_TRB_CTRL_HWO ctrl flag set, which breaks us out of the loop
> > early before we decrement num_pending_sgs.
> >
> > For that trb, we're setting the HWO flag in __dwc3_prepare_one_trb()
> > (called from dwc3_prepare_one_trb_sg() back at the beginning):
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n921
> >
> > I added logic showing every time we set or clear that flag, and it
> > seems like we're always setting it but never clearing it. And often
> > that's not an issue as we only have one sg entry. But if its set on a
> > trb in a request with multiple sgs, that's where it seems to be
> > causing the issue.
> >
> > I'll continue to dig around to try to understand where it might be
> > going awry (why we never clear the HWO flag). But figured I'd try to
> > explain this much in case it rings any bells to you.
>
> I was looking some more at this and it seems a little odd...
>
> In dwc3_gadget_ep_reclaim_trb_sg():
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470
>
> The check for (trb->ctrl & DWC3_TRB_CTRL_HWO) which breaks us out of
> the loop happens before we call
> dwc3_gadget_ep_reclaim_completed_trb():
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2406
>
> Which is what clears the DWC3_TRB_CTRL_HWO flag (outside of
> dwc3_gadget_ep_skip_trbs()).
>
> So on a whim I dropped that check, and things go back to working on
> HiKey960, no more adb stalls!
>
> Does something like this make sense? It's not causing trouble on
> db845c either so far in my testing.

Ok I'll bite...

I'm now curious why it hasn't been a problem with the Qualcomm HW. Do
you mind please capturing a similar trace log on the db845c? Would be
good to see a side-by-side comparison and see if, first of all, whether
the same S/G path is getting exercised (i.e. 16KiB OUT requests from ADB
userspace using AIO which then get broken up into 4K chunks by f_fs),
and what the behaviors of the reclaim_trb and giveback are when the
transfer is completed.

Preferably if you could get a trace without your patch applied that
would be great. And maybe also one after your patch just to see if the
traces are truly identical or not.

Thanks for digging into this!
Jack

> (sorry gmail will whitespace corrupt this code paste - just want to
> communicate what I did clearly, not to apply)
> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> index 4d3c79d90a6e..2a26d33520ce 100644
> --- a/drivers/usb/dwc3/gadget.c
> +++ b/drivers/usb/dwc3/gadget.c
> @@ -2457,9 +2457,6 @@ static int dwc3_gadget_ep_reclaim_trb_sg(struct
> dwc3_ep *dep,
> for_each_sg(sg, s, pending, i) {
> trb = &dep->trb_pool[dep->trb_dequeue];
>
> - if (trb->ctrl & DWC3_TRB_CTRL_HWO)
> - break;
> -
> req->sg = sg_next(s);
> req->num_pending_sgs--;

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2020-04-24 18:39:30

by John Stultz

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb

On Fri, Apr 24, 2020 at 10:12 AM Jack Pham <[email protected]> wrote:
> On Tue, Apr 21, 2020 at 10:09:27PM -0700, John Stultz wrote:
> > Does something like this make sense? It's not causing trouble on
> > db845c either so far in my testing.
>
> Ok I'll bite...
>
> I'm now curious why it hasn't been a problem with the Qualcomm HW. Do
> you mind please capturing a similar trace log on the db845c? Would be
> good to see a side-by-side comparison and see if, first of all, whether
> the same S/G path is getting exercised (i.e. 16KiB OUT requests from ADB
> userspace using AIO which then get broken up into 4K chunks by f_fs),
> and what the behaviors of the reclaim_trb and giveback are when the
> transfer is completed.
>
> Preferably if you could get a trace without your patch applied that
> would be great. And maybe also one after your patch just to see if the
> traces are truly identical or not.

Sure. I've captured logs in the same manner with and without on db845c
(against 5.7-rc2). See attached.

I suspect the difference is the db845c is using an iommu (I don't
think it will boot without it) where hikey960 isn't, but I'll let you
take a look.

thanks
-john


Attachments:
db845c.tar.xz (417.46 kB)

2020-04-24 20:06:38

by John Stultz

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb

On Fri, Apr 24, 2020 at 11:36 AM John Stultz <[email protected]> wrote:
> On Fri, Apr 24, 2020 at 10:12 AM Jack Pham <[email protected]> wrote:
> > On Tue, Apr 21, 2020 at 10:09:27PM -0700, John Stultz wrote:
> > > Does something like this make sense? It's not causing trouble on
> > > db845c either so far in my testing.
> >
> > Ok I'll bite...
> >
> > I'm now curious why it hasn't been a problem with the Qualcomm HW. Do
> > you mind please capturing a similar trace log on the db845c? Would be
> > good to see a side-by-side comparison and see if, first of all, whether
> > the same S/G path is getting exercised (i.e. 16KiB OUT requests from ADB
> > userspace using AIO which then get broken up into 4K chunks by f_fs),
> > and what the behaviors of the reclaim_trb and giveback are when the
> > transfer is completed.
> >
> > Preferably if you could get a trace without your patch applied that
> > would be great. And maybe also one after your patch just to see if the
> > traces are truly identical or not.
>
> Sure. I've captured logs in the same manner with and without on db845c
> (against 5.7-rc2). See attached.
>
> I suspect the difference is the db845c is using an iommu (I don't
> think it will boot without it) where hikey960 isn't, but I'll let you
> take a look.

And I've added my own printk debugging that I used to track the issue
down on HiKey960 and on the db845c we are never hitting the case where
the HWO flag is set when we call dwc3_gadget_ep_reclaim_trb_sg()

thanks
-john

2020-04-24 22:53:01

by Jack Pham

[permalink] [raw]
Subject: Re: More dwc3 gadget issues with adb

Hi John,

On Fri, Apr 24, 2020 at 11:36:24AM -0700, John Stultz wrote:
> On Fri, Apr 24, 2020 at 10:12 AM Jack Pham <[email protected]> wrote:
> > On Tue, Apr 21, 2020 at 10:09:27PM -0700, John Stultz wrote:
> > > Does something like this make sense? It's not causing trouble on
> > > db845c either so far in my testing.
> >
> > Ok I'll bite...
> >
> > I'm now curious why it hasn't been a problem with the Qualcomm HW. Do
> > you mind please capturing a similar trace log on the db845c? Would be
> > good to see a side-by-side comparison and see if, first of all, whether
> > the same S/G path is getting exercised (i.e. 16KiB OUT requests from ADB
> > userspace using AIO which then get broken up into 4K chunks by f_fs),
> > and what the behaviors of the reclaim_trb and giveback are when the
> > transfer is completed.
> >
> > Preferably if you could get a trace without your patch applied that
> > would be great. And maybe also one after your patch just to see if the
> > traces are truly identical or not.
>
> Sure. I've captured logs in the same manner with and without on db845c
> (against 5.7-rc2). See attached.

Thank you!

> I suspect the difference is the db845c is using an iommu (I don't
> think it will boot without it) where hikey960 isn't, but I'll let you
> take a look.

Yes I think that's exactly what's happening. Those 16KiB requests on
ep1out would normally be passed as 4x4KiB sglists from f_fs.c but after
the call to usb_gadget_map_request() the IOMMU is coalescing them back
into a single entry, so for each of those requests we end up preparing
just a single unchained TRB.

UsbFfs-worker-532 [007] d..1 96.025897: dwc3_alloc_request: ep1out: req 0000000075c0b6d7 length 0/0 zsI ==> 0
UsbFfs-worker-532 [007] d..2 96.025898: dwc3_ep_queue: ep1out: req 0000000075c0b6d7 length 0/16384 zsI ==> -115
UsbFfs-worker-532 [007] d..2 96.025908: dwc3_prepare_trb: ep1out: trb 00000000c0c9cf9f (E217:D209) buf 00000000ff930000 size 16384 ctrl 00000819 (HlcS:sC:normal)
^^^^^^^^^^^^^^^^
-> trb c0c9cf9f enqueued at position 216 in the ring (enqueue pointer 217)
We can see the pointer to the DMA address and it's 16KiB, and the chain
bit is off.

UsbFfs-worker-532 [007] d..2 96.025912: dwc3_readl: addr 00000000ab36a89f value 00002400
UsbFfs-worker-532 [007] d..2 96.025915: dwc3_writel: addr 00000000057ac193 value 00000000
UsbFfs-worker-532 [007] d..2 96.025917: dwc3_writel: addr 000000009c937859 value 00000000
UsbFfs-worker-532 [007] d..2 96.025919: dwc3_writel: addr 00000000a91887be value 00000000
UsbFfs-worker-532 [007] d..2 96.025922: dwc3_writel: addr 00000000679c8ad6 value 00020007
UsbFfs-worker-532 [007] d..2 96.025924: dwc3_readl: addr 00000000679c8ad6 value 00020007
UsbFfs-worker-532 [007] d..2 96.025925: dwc3_gadget_ep_cmd: ep1out: cmd 'Update Transfer' [20007] params 00000000 00000000 00000000 --> status: Successful

...

irq/142-dwc3-529 [000] d..1 96.027952: dwc3_event: event (00006084): ep1out: Transfer In Progress [0] (SIm)
irq/142-dwc3-529 [000] d..1 96.027955: dwc3_complete_trb: ep1out: trb 00000000c0c9cf9f (E224:D217) buf 00000000ff930000 size 16360 ctrl 00000818 (hlcS:sC:normal)
^^^^^^^^^^^^^^^^
irq/142-dwc3-529 [000] d..1 96.027965: dwc3_gadget_giveback: ep1out: req 0000000075c0b6d7 length 24/16384 zsI ==> 0

That same trb c0c9cf9f is completed, 24 (16384 - 16360) bytes were
transferred, and dwc3 gives back the request to the function driver.
TRB dequeue pointer advanced by one to position 217.

irq/142-dwc3-529 [000] d..1 96.027970: dwc3_readl: addr 0000000054b9cc02 value 80001000
irq/142-dwc3-529 [000] d..1 96.027971: dwc3_writel: addr 0000000054b9cc02 value 00001000
irq/142-dwc3-529 [000] d..1 96.027974: dwc3_writel: addr 00000000e4e556e6 value 80000000
irq/142-dwc3-529 [000] d..1 96.027976: dwc3_writel: addr 000000001139226c value 00000001

So it's interesting that with IOMMU always enabled on the db845c we
don't hit the bug as we avoid preparing an SG/chained TRB list.

Thanks,
Jack
--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project