2014-01-08 00:47:27

by Sarah Sharp

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> On 01/07/2014 01:21 PM, Sarah Sharp wrote:
>
> > Can you please try the attached patch, on top of the previous three
> > patches, and send me dmesg?
>
> Hi Sarah, I just now finished running 0001-More-debugging.patch for the
> first time. The previous dmesg didn't include that patch, but this one
> does.
>
> I read through this dmesg but I nodded off somewhere around line 500.
> I hope you can stay awake :)

Well, it has all the info I need, but the results don't make me too
happy. Everything I've checked seems consistent, and I don't know why
the host stopped. The link TRBs are intact, the dequeue pointer for the
endpoint was pointing to the transfer that timed out and it had the
cycle bit set correctly, etc. Perhaps the no-op TRBs are really the
issue.

I'll have to take a look at the log again tomorrow. I posted the dmesg
on pastebin if David wants to check it out as well:
http://pastebin.com/a4AUpsL1

Can you send me the output of `sudo lspci -vvv -n`? Maybe we can just
turn off scatter-gather for your host controller until we get a proper
fix in that uses link TRBs instead of no-op TRBs.

Sarah Sharp


2014-01-08 16:11:06

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

> From: Sarah Sharp
> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> > On 01/07/2014 01:21 PM, Sarah Sharp wrote:
> >
> > > Can you please try the attached patch, on top of the previous three
> > > patches, and send me dmesg?
> >
> > Hi Sarah, I just now finished running 0001-More-debugging.patch for the
> > first time. The previous dmesg didn't include that patch, but this one
> > does.
> >
> > I read through this dmesg but I nodded off somewhere around line 500.
> > I hope you can stay awake :)
>
> Well, it has all the info I need, but the results don't make me too
> happy. Everything I've checked seems consistent, and I don't know why
> the host stopped. The link TRBs are intact, the dequeue pointer for the
> endpoint was pointing to the transfer that timed out and it had the
> cycle bit set correctly, etc. Perhaps the no-op TRBs are really the
> issue.
>
> I'll have to take a look at the log again tomorrow. I posted the dmesg
> on pastebin if David wants to check it out as well:
> http://pastebin.com/a4AUpsL1

I can't see anything obvious either.
However there is no response to the 'stop endpoint' command.
Section 4.6.9 (page 107 of rev1.0) states that the controller will complete
any USB IN or OUT transaction before raising the command completion event.
Possibly it is too 'stuck' to complete the transaction?

The endpoint status is also still '1' (running).
This also means that the 'TR dequeue pointer' is undefined - so the
controller could easily be processing a later TRB.
This field might even still contain the ring base address written by
the driver much earlier.

This might mean that something 'catastrophic' has happened earlier.
Maybe the controller isn't actually seeing any doorbell writes at all.
Maybe the base addresses it has for the rings have all got corrupted.
At least this looks like amd64 - so there aren't memory coherency issues.

Some hacks that might help isolate the problem:
1) Request an interrupt from the last nop data TRB.
2) Put a command nop (decimal 23) TRB into the command ring before
the 'stop endpoint'.
3) Comment out the code that adds the nop data TRBs.
The first two might need code adding to handle the responses.

Do we know the actual xhci device?
I think it reports version 0x96.
(Sarah - it might be useful if that version were in one of the trace
messages that is output by default.)

David

2014-01-08 16:39:43

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

On Tue, 7 Jan 2014, Sarah Sharp wrote:

> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> > On 01/07/2014 01:21 PM, Sarah Sharp wrote:
> >
> > > Can you please try the attached patch, on top of the previous three
> > > patches, and send me dmesg?
> >
> > Hi Sarah, I just now finished running 0001-More-debugging.patch for the
> > first time. The previous dmesg didn't include that patch, but this one
> > does.
> >
> > I read through this dmesg but I nodded off somewhere around line 500.
> > I hope you can stay awake :)
>
> Well, it has all the info I need, but the results don't make me too
> happy. Everything I've checked seems consistent, and I don't know why
> the host stopped. The link TRBs are intact, the dequeue pointer for the
> endpoint was pointing to the transfer that timed out and it had the
> cycle bit set correctly, etc. Perhaps the no-op TRBs are really the
> issue.

This may be a foolish question, but why is xhci-hcd using no-op TRBs in
the first place?

It makes sense that they would be needed if you have to unlink an URB
that isn't the first one on the endpoint ring. But usb-storage never
does that; whenever it unlinks an URB, it always unlinks the earliest
entry in the endpoint's queue.

After unlinking the first URB on the ring, you don't need to fill in
its TRBs with no-ops. Instead, when you are ready to start the ring
againk, just tell the host controller to move the dequeue pointer up to
the start of the next surviving URB. (You'll also have to adjust the
CYCLE bits of the TRBs that get skipped over, but that's trivial.)

Alan Stern

2014-01-08 16:53:02

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

> From: Alan Stern
>
> This may be a foolish question, but why is xhci-hcd using no-op TRBs in
> the first place?

Because it can't write in a link TRB because other parts of the
code use link TRBs to detect the end of the ring.

The problem is that it can't put a link TRB in the middle of
a chain of data fragments unless it is at a 'suitable' offset
from the start of the data TD. Given arbitrary input fragmentation
this means that you can't put a link TRB in the middle of a TD.
(The documented alignment might be as high as 16kB.)

If the rest of the code used a 'ring end pointer' then a link TRB
could be used instead.

David


2014-01-08 17:14:30

by Alan Stern

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

On Wed, 8 Jan 2014, David Laight wrote:

> > From: Alan Stern
> >
> > This may be a foolish question, but why is xhci-hcd using no-op TRBs in
> > the first place?
>
> Because it can't write in a link TRB because other parts of the
> code use link TRBs to detect the end of the ring.
>
> The problem is that it can't put a link TRB in the middle of
> a chain of data fragments unless it is at a 'suitable' offset
> from the start of the data TD. Given arbitrary input fragmentation
> this means that you can't put a link TRB in the middle of a TD.
> (The documented alignment might be as high as 16kB.)
>
> If the rest of the code used a 'ring end pointer' then a link TRB
> could be used instead.

I see. Sounds like a poor design decision in hindsight. Can it be
changed?

Alan Stern

2014-01-08 17:26:31

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

> From: Alan Stern
> On Wed, 8 Jan 2014, David Laight wrote:
>
> > > From: Alan Stern
> > >
> > > This may be a foolish question, but why is xhci-hcd using no-op TRBs in
> > > the first place?
> >
> > Because it can't write in a link TRB because other parts of the
> > code use link TRBs to detect the end of the ring.
> >
> > The problem is that it can't put a link TRB in the middle of
> > a chain of data fragments unless it is at a 'suitable' offset
> > from the start of the data TD. Given arbitrary input fragmentation
> > this means that you can't put a link TRB in the middle of a TD.
> > (The documented alignment might be as high as 16kB.)
> >
> > If the rest of the code used a 'ring end pointer' then a link TRB
> > could be used instead.
>
> I see. Sounds like a poor design decision in hindsight. Can it be
> changed?

Anything can be changed :-)
But it is a bit pervasive.
Padding out with nops isolated the change.

David


2014-01-09 23:50:52

by Sarah Sharp

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

[Walt, please use reply-all to keep the list in the loop, thanks.]

On Wed, Jan 08, 2014 at 04:09:14PM +0000, David Laight wrote:
> > From: Sarah Sharp
> > On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> > > On 01/07/2014 01:21 PM, Sarah Sharp wrote:
> > >
> > > > Can you please try the attached patch, on top of the previous three
> > > > patches, and send me dmesg?
> > >
> > > Hi Sarah, I just now finished running 0001-More-debugging.patch for the
> > > first time. The previous dmesg didn't include that patch, but this one
> > > does.
> > >
> > > I read through this dmesg but I nodded off somewhere around line 500.
> > > I hope you can stay awake :)
> >
> > Well, it has all the info I need, but the results don't make me too
> > happy. Everything I've checked seems consistent, and I don't know why
> > the host stopped. The link TRBs are intact, the dequeue pointer for the
> > endpoint was pointing to the transfer that timed out and it had the
> > cycle bit set correctly, etc. Perhaps the no-op TRBs are really the
> > issue.
> >
> > I'll have to take a look at the log again tomorrow. I posted the dmesg
> > on pastebin if David wants to check it out as well:
> > http://pastebin.com/a4AUpsL1
>
> I can't see anything obvious either.
> However there is no response to the 'stop endpoint' command.
> Section 4.6.9 (page 107 of rev1.0) states that the controller will complete
> any USB IN or OUT transaction before raising the command completion event.
> Possibly it is too 'stuck' to complete the transaction?

The host has to stop processing the transaction, it can't "wait" for the
transaction to finish. "The Stop Endpoint Command is expected to stop
endpoint activity as soon as possible, which may mean that it stops in
the middle of a TRB."

Usually when hosts get into this kind of mode, something has seriously
gone wrong, like bus error when it issues a bad memory access.

> The endpoint status is also still '1' (running).
> This also means that the 'TR dequeue pointer' is undefined - so the
> controller could easily be processing a later TRB.
> This field might even still contain the ring base address written by
> the driver much earlier.
>
> This might mean that something 'catastrophic' has happened earlier.
> Maybe the controller isn't actually seeing any doorbell writes at all.
> Maybe the base addresses it has for the rings have all got corrupted.
> At least this looks like amd64 - so there aren't memory coherency issues.
>
> Some hacks that might help isolate the problem:
> 1) Request an interrupt from the last nop data TRB.
> 2) Put a command nop (decimal 23) TRB into the command ring before
> the 'stop endpoint'.
> 3) Comment out the code that adds the nop data TRBs.
> The first two might need code adding to handle the responses.
>
> Do we know the actual xhci device?
> I think it reports version 0x96.
> (Sarah - it might be useful if that version were in one of the trace
> messages that is output by default.)

You mean print the PCI device and vendor ID? Perhaps Subsystem vendor
as well?

On Tue, Jan 07, 2014 at 05:26:37PM -0800, walt wrote:
> On 01/07/2014 04:47 PM, Sarah Sharp wrote:
>
> > Can you send me the output of `sudo lspci -vvv -n`? Maybe we can just
> > turn off scatter-gather for your host controller until we get a proper
> > fix in that uses link TRBs instead of no-op TRBs.
>
> The aftermarket usb3 adapter card and the usb3 outboard hard-drive docking
> station are the only two usb3 devices I have.
>
> I've wondered if my xhci problems might be caused by hardware quirks, and
> wondering why I seem to be the only one who has this problem.
>
> Maybe I could "take one for the team" by buying new hardware toys that I
> don't really need but I could use to test the xhci driver? (I do enjoy
> buying new toys, necessary, or, um, maybe not :)

It would be appreciated if you could see if your device causes other
host controllers to fail. Who am I to keep a geek from new toys? ;)

In the meantime, try this patch, which is something of a long shot.

Sarah Sharp


Attachments:
(No filename) (3.98 kB)
0001-xhci-Enable-Link-TRB-quirk-for-0.96-ASMedia-host.patch (1.05 kB)
Download all attachments

2014-01-10 14:40:34

by walt

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

On 01/09/2014 03:50 PM, Sarah Sharp wrote:

>>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
>>
>> The aftermarket usb3 adapter card and the usb3 outboard hard-drive docking
>> station are the only two usb3 devices I have.
>>
>> I've wondered if my xhci problems might be caused by hardware quirks, and
>> wondering why I seem to be the only one who has this problem.
>>
>> Maybe I could "take one for the team" by buying new hardware toys that I
>> don't really need but I could use to test the xhci driver? (I do enjoy
>> buying new toys, necessary, or, um, maybe not :)
>
> It would be appreciated if you could see if your device causes other
> host controllers to fail. Who am I to keep a geek from new toys? ;)

Thanks :) Just to clarify, when you say 'your device' do you mean the
outboard disk docking station? So I need to buy other host controllers,
not new docking stations?

>
> In the meantime, try this patch, which is something of a long shot.

No difference. But I notice the code enables the TRB quirk only if the
xhci_version is specifically 0x95. My debug messages claim that "xHCI
doesn't need link TRB QUIRK" so I'm wondering if adding my asmedia device
to the quirks list really doesn't change anything unless it's xhci 0.95.

Does lspci provide that information? I'm not seeing anything obvious.

2014-01-10 14:59:59

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

From: walt
> > In the meantime, try this patch, which is something of a long shot.
>
> No difference. But I notice the code enables the TRB quirk only if the
> xhci_version is specifically 0x95. My debug messages claim that "xHCI
> doesn't need link TRB QUIRK" so I'm wondering if adding my asmedia device
> to the quirks list really doesn't change anything unless it's xhci 0.95.

I think the intention was that the quirk would be applied for your
hardware even though it claims to be version 0.96.
That was gust a hopeful guess that the same change would help.

> Does lspci provide that information? I'm not seeing anything obvious.
I've only seen it in the diagnostic prints (that aren't normally output).

David

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2014-01-10 15:12:43

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

On Thu, 9 Jan 2014, Sarah Sharp wrote:

> > I can't see anything obvious either.
> > However there is no response to the 'stop endpoint' command.
> > Section 4.6.9 (page 107 of rev1.0) states that the controller will complete
> > any USB IN or OUT transaction before raising the command completion event.
> > Possibly it is too 'stuck' to complete the transaction?
>
> The host has to stop processing the transaction, it can't "wait" for the
> transaction to finish. "The Stop Endpoint Command is expected to stop
> endpoint activity as soon as possible, which may mean that it stops in
> the middle of a TRB."

Just to clarify for Walt: There's a difference between a transaction
and a transfer. Transfers can take an indefinitely long time to
complete, because the device doesn't have to accept or send any data
until it is ready.

By contrast, transactions have sharply bounded lifetimes. A
transaction consists of some maximum number of packets (usually 3,
sometimes a little more) with upper limits on the time intervals
between them.

A TRB lies somewhere between a transfer and a transaction. A single
TRB can encompass a single transaction or multiple transactions, and a
single transfer can involve more than one TRB.

Alan Stern

2014-01-13 23:39:14

by walt

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

On 01/09/2014 03:50 PM, Sarah Sharp wrote:

>>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
>>
>> I've wondered if my xhci problems might be caused by hardware quirks, and
>> wondering why I seem to be the only one who has this problem.
>>
>> Maybe I could "take one for the team" by buying new hardware toys that I
>> don't really need but I could use to test the xhci driver? (I do enjoy
>> buying new toys, necessary, or, um, maybe not :)
>
> It would be appreciated if you could see if your device causes other
> host controllers to fail. Who am I to keep a geek from new toys? ;)

Sarah, I just fixed my xhci bug for US$19.99 :)

#lspci | tail -1
04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)

This new NEC usb3 controller does everything the ASMedia controller should have
done from the start. I can even power-up the outboard usb3 disk docking station
after the computer is running and still everything Just Works :)

I appreciate all the debugging work you've done to fix the ASMedia problem but
I think it's time to quit now. If hundreds of irate linux users complain about
the same bug then I'll be happy to test more patches.

Until then... :)

2014-01-14 09:46:07

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

From: walt
> On 01/09/2014 03:50 PM, Sarah Sharp wrote:
>
> >>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> >>
> >> I've wondered if my xhci problems might be caused by hardware quirks, and
> >> wondering why I seem to be the only one who has this problem.
> >>
> >> Maybe I could "take one for the team" by buying new hardware toys that I
> >> don't really need but I could use to test the xhci driver? (I do enjoy
> >> buying new toys, necessary, or, um, maybe not :)
> >
> > It would be appreciated if you could see if your device causes other
> > host controllers to fail. Who am I to keep a geek from new toys? ;)
>
> Sarah, I just fixed my xhci bug for US$19.99 :)
>
> #lspci | tail -1
> 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)

Do you know which version of the xhci spec it conforms to?
(Will probably be 0.96 or 1.00).

Of course, ASMedia will probably change the silicon they are using without
changing anything on the packaging.

David
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2014-01-14 17:21:17

by Sarah Sharp

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote:
> On 01/09/2014 03:50 PM, Sarah Sharp wrote:
>
> >>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> >>
> >> I've wondered if my xhci problems might be caused by hardware quirks, and
> >> wondering why I seem to be the only one who has this problem.
> >>
> >> Maybe I could "take one for the team" by buying new hardware toys that I
> >> don't really need but I could use to test the xhci driver? (I do enjoy
> >> buying new toys, necessary, or, um, maybe not :)
> >
> > It would be appreciated if you could see if your device causes other
> > host controllers to fail. Who am I to keep a geek from new toys? ;)
>
> Sarah, I just fixed my xhci bug for US$19.99 :)
>
> #lspci | tail -1
> 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)
>
> This new NEC usb3 controller does everything the ASMedia controller should have
> done from the start. I can even power-up the outboard usb3 disk docking station
> after the computer is running and still everything Just Works :)
>
> I appreciate all the debugging work you've done to fix the ASMedia problem but
> I think it's time to quit now. If hundreds of irate linux users complain about
> the same bug then I'll be happy to test more patches.

I just got a similar report from someone with a Fresco Logic host
controller, so I may need you to test a work-around patch for the
ASMedia host at some point.

I'm considering disabling the effect of David's patch for those two host
controllers. That will mean USB storage works fine, but USB ethernet
may fail.

I had considered just disabling scatter-gather for the hosts, but we can
still run into the ethernet issue if we need to break a TRB at a 64KB
boundary. So disabling scatter-gather would make USB ethernet work
_most of the time_, but fail intermittently, and USB storage performance
would be impacted. Since USB ethernet will fail in either case, I would
rather keep USB storage performance and sacrifice USB ethernet on those
particular hosts.

So please keep the ASMedia host around for testing, if possible.

Sarah Sharp

2014-01-14 21:27:30

by walt

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

On 01/14/2014 09:20 AM, Sarah Sharp wrote:
> On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote:

>> Sarah, I just fixed my xhci bug for US$19.99 :)
>>
>> #lspci | tail -1
>> 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)
>>
>> This new NEC usb3 controller does everything the ASMedia controller should have
>> done from the start.


> I just got a similar report from someone with a Fresco Logic host
> controller, so I may need you to test a work-around patch for the
> ASMedia host at some point.

Oy, Sarah! ;) I put the ASMedia adapter in my older amd64 machine, and, well,
the stupid thing Just Works(TM) with kernel 3.12.7! (Yes, with the same disk
docking station, too.)

I can't believe the adapter works perfectly in a different computer. Have you
seen this kind of thing before?

At the moment I have two machines using your xhci driver and both work perfectly,
so I thank you again :)

I'm not sure where to go with this next. I could put the adapter back in the
other machine again if you have more patches to test.

2014-01-16 20:46:21

by Sarah Sharp

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

On Tue, Jan 14, 2014 at 01:27:25PM -0800, walt wrote:
> On 01/14/2014 09:20 AM, Sarah Sharp wrote:
> > On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote:
>
> >> Sarah, I just fixed my xhci bug for US$19.99 :)
> >>
> >> #lspci | tail -1
> >> 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)
> >>
> >> This new NEC usb3 controller does everything the ASMedia controller should have
> >> done from the start.
>
>
> > I just got a similar report from someone with a Fresco Logic host
> > controller, so I may need you to test a work-around patch for the
> > ASMedia host at some point.

Hmm, the Fresco Logic host issue seems unrelated.

> Oy, Sarah! ;) I put the ASMedia adapter in my older amd64 machine, and, well,
> the stupid thing Just Works(TM) with kernel 3.12.7! (Yes, with the same disk
> docking station, too.)

Ugh. Well, I suppose we can chalk it up to hardware failure? I think
you're the only one to report a verified issue with the Link TRB patch.

You are sure you're running a vanilla 3.12.7 kernel, right?

> I can't believe the adapter works perfectly in a different computer. Have you
> seen this kind of thing before?

No, at least not this particular host-dying-only-on-one-machine failure
mode.

> At the moment I have two machines using your xhci driver and both work perfectly,
> so I thank you again :)
>
> I'm not sure where to go with this next. I could put the adapter back in the
> other machine again if you have more patches to test.

I think any patches I was going to send are moot with this new
information. The issue with your PCI add-in card only happens in
combination with a specific motherboard, so I don't think it makes sense
to disable the no-op TRBs for that host.

If lots of other people start reporting the same issue with the ASMedia
0.96 host, and reverting commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e
"usb: xhci: Link TRB must not occur within a USB payload burst" helps
them, then I'll reconsider that decision.

Thank you so much for your patience while debugging this issue, and
being willing to try all sorts of kernels and patches. I'm glad we
finally figured out what the issue was, and you have working xHCI hosts
now.

Sarah Sharp

2014-01-17 14:36:25

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

From: walt
> Oy, Sarah! ;) I put the ASMedia adapter in my older amd64 machine, and, well,
> the stupid thing Just Works(TM) with kernel 3.12.7! (Yes, with the same disk
> docking station, too.)
>
> I can't believe the adapter works perfectly in a different computer. Have you
> seen this kind of thing before?

Could be a horrid timing race between the cpu and xchi controller.
If the cpu manages to write a NOP or LINK TRB for a following transfer
before the controller polls the next entry (after raising the IRQ)
then the controller might process the LINK and then get confused
when it can't process the linked-to TRB.
This might not sound likely, but PCIe has significant latency.

> At the moment I have two machines using your xhci driver and both work perfectly,
> so I thank you again :)
>
> I'm not sure where to go with this next. I could put the adapter back in the
> other machine again if you have more patches to test.

Can you try the patch I posted that stops the ownership on LINK TRBs
being changed before that on the linked-to TRB?

I got a private mail from someone indicating that my earlier 'minimal'
patch helped an ASMedia controller talking to the asx189_178a ethernet
hardware.

David


????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2014-01-18 18:34:23

by walt

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

On 01/17/2014 06:34 AM, David Laight wrote:
> From: walt
>> Oy, Sarah! ;) I put the ASMedia adapter in my older amd64 machine, and, well,
>> the stupid thing Just Works(TM) with kernel 3.12.7! (Yes, with the same disk
>> docking station, too.)
>>
>> I can't believe the adapter works perfectly in a different computer. Have you
>> seen this kind of thing before?
>
> Could be a horrid timing race between the cpu and xchi controller.
> If the cpu manages to write a NOP or LINK TRB for a following transfer
> before the controller polls the next entry (after raising the IRQ)
> then the controller might process the LINK and then get confused
> when it can't process the linked-to TRB.
> This might not sound likely, but PCIe has significant latency.


> Can you try the patch I posted that stops the ownership on LINK TRBs
> being changed before that on the linked-to TRB?

David, the patch doesn't apply cleanly to any 3.12.x sources I can find,
including the latest git from Linus. Half of the hunks fail. Could you
create a fresh patch against vanilla 3.12.8 so I can test it? Or point me
to the correct kernel sources, perhaps.

Thanks.

2014-01-18 20:23:32

by walt

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

On 01/17/2014 06:34 AM, David Laight wrote:

> Can you try the patch I posted that stops the ownership on LINK TRBs
> being changed before that on the linked-to TRB?

Please disregard my earlier post about the patch not applying cleanly.
That was the usual html corruption, so I found the original on the usb
list and it was okay.

Sadly, the patch didn't fix the ASMedia lockup behavior, however :(

I did notice that the lockup occurred only when copying *to* the usb3
drive, and not when copying from it. I think that may be new behavior
but I can't swear to it.

Just to confirm, here are the first few lines of the patch I used.
Please let me know if it's the wrong patch:


diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 53c2e29..589d336 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2928,6 +2928,11 @@ static void queue_trb(struct xhci_hcd *xhci, struct xhci_ring *ring,
struct xhci_generic_trb *trb;

trb = &ring->enqueue->generic;
+
+ field4 = (field4 & ~TRB_CYCLE) | ring->cycle_state;
+ if (trb == &ring->enqueue_first->generic)
+ field4 ^= TRB_CYCLE;
+
trb->field[0] = cpu_to_le32(field1);

<big snip>

2014-01-20 10:41:38

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

From:
> On 01/17/2014 06:34 AM, David Laight wrote:
>
> > Can you try the patch I posted that stops the ownership on LINK TRBs
> > being changed before that on the linked-to TRB?
>
> Sadly, the patch didn't fix the ASMedia lockup behavior, however :(
>
> I did notice that the lockup occurred only when copying *to* the usb3
> drive, and not when copying from it. I think that may be new behavior
> but I can't swear to it.

If the behaviour has changed then the fix is on the right lines.
If reads used to lock up, and don't with the patch then that is
an improvement.
It might be that the tx issue is actually a different 'bug'.

> Just to confirm, here are the first few lines of the patch I used.
> Please let me know if it's the wrong patch:
> ...
> +
> + field4 = (field4 &amp; ~TRB_CYCLE) | ring-&gt;cycle_state;
> + if (trb == &amp;ring-&gt;enqueue_first-&gt;generic)
> + field4 ^= TRB_CYCLE;
> +
> trb-&gt;field[0] = cpu_to_le32(field1);

That looks like my earlier 'test' patch.
The fuller patch is http://article.gmane.org/gmane.linux.usb.general/101784

There shouldn't be any difference in the way the chip is driven, the
later patch just rips out a load of code that is no longer needed.
Mostly it simplifies the way the ownership of ring entries is passed
to the hardware.

David

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2014-01-20 11:22:43

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

From: walt
> On 01/17/2014 06:34 AM, David Laight wrote:
>
> > Can you try the patch I posted that stops the ownership on LINK TRBs
> > being changed before that on the linked-to TRB?
>
> Please disregard my earlier post about the patch not applying cleanly.
> That was the usual html corruption, so I found the original on the usb
> list and it was okay.
>
> Sadly, the patch didn't fix the ASMedia lockup behavior, however :(
>
> I did notice that the lockup occurred only when copying *to* the usb3
> drive, and not when copying from it. I think that may be new behavior
> but I can't swear to it.

Consistent with another report that says that ethernet worked provided
that TSO was disabled (ie no sg tx).
(Without the patch to delay he ownership change on link trbs it didn't
work at all.)

A guess...

In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
of xhci_td_remainder().

You can try that on top of either of my other patches.

David

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2014-01-20 18:15:00

by Sarah Sharp

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

On Mon, Jan 20, 2014 at 11:21:14AM +0000, David Laight wrote:
> From: walt
> > On 01/17/2014 06:34 AM, David Laight wrote:
> >
> > > Can you try the patch I posted that stops the ownership on LINK TRBs
> > > being changed before that on the linked-to TRB?
> >
> > Please disregard my earlier post about the patch not applying cleanly.
> > That was the usual html corruption, so I found the original on the usb
> > list and it was okay.
> >
> > Sadly, the patch didn't fix the ASMedia lockup behavior, however :(
> >
> > I did notice that the lockup occurred only when copying *to* the usb3
> > drive, and not when copying from it. I think that may be new behavior
> > but I can't swear to it.
>
> Consistent with another report that says that ethernet worked provided
> that TSO was disabled (ie no sg tx).
> (Without the patch to delay he ownership change on link trbs it didn't
> work at all.)

Please be more clear. What do you mean by these statements? That
someone privately reported that your earlier patch [1] did not help
them, but applying your new patch [2] on top of the old patch did?

[1] http://marc.info/?l=linux-usb&m=138418996717941&w=2
[2] http://marc.info/?l=linux-usb&m=138996538403468&w=2

In general, will you please Cc me and the USB list when replying to
privately reported bugs/confirmations that patches work? Or if the
confirmation was reported, please provide a link to the mailing list
discussion or bugzilla entry. We need to keep bug and fix confirmations
publicly archived. Please keep me on Cc since I filter mail based on
that.

> A guess...
>
> In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
> of xhci_td_remainder().

Why? Walt has a 0.96 xHCI host controller, and the format for how to
calculate the TD remainder changed between the 0.96 and the 1.0 spec.
That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder().

Sarah Sharp

2014-01-21 09:53:04

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

From: Sarah Sharp
> On Mon, Jan 20, 2014 at 11:21:14AM +0000, David Laight wrote:
...
> > A guess...
> >
> > In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
> > of xhci_td_remainder().
>
> Why? Walt has a 0.96 xHCI host controller, and the format for how to
> calculate the TD remainder changed between the 0.96 and the 1.0 spec.
> That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder().

I just wonder how many of those differences are just differences in the
specification, rather than differences in the hardware implementation.
In some cases it might be that the old hardware just ignored the value.

I know that the xhci hardware on my ivy bridge cpu does look at that
value (at least checking for zero), since things failed in subtle ways
when I got it wrong.

In this case it was just something easy to change that might be worth
trying. I didn't necessarily expect it to make a positive difference.

David


2014-01-21 22:07:11

by walt

[permalink] [raw]
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

On 01/21/2014 01:51 AM, David Laight wrote:
> From: Sarah Sharp
>> On Mon, Jan 20, 2014 at 11:21:14AM +0000, David Laight wrote:
> ...
>>> A guess...
>>>
>>> In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
>>> of xhci_td_remainder().
>>
>> Why? Walt has a 0.96 xHCI host controller, and the format for how to
>> calculate the TD remainder changed between the 0.96 and the 1.0 spec.
>> That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder().
>
> I just wonder how many of those differences are just differences in the
> specification, rather than differences in the hardware implementation.
> In some cases it might be that the old hardware just ignored the value.
>
> I know that the xhci hardware on my ivy bridge cpu does look at that
> value (at least checking for zero), since things failed in subtle ways
> when I got it wrong.
>
> In this case it was just something easy to change that might be worth
> trying. I didn't necessarily expect it to make a positive difference.

David, I tried the one-liner below, which changed nothing AFAICS, but
then I'm not sure it's the change you intended:

--- xhci-ring.c.orig 2014-01-21 13:28:36.396278813 -0800
+++ xhci-ring.c 2014-01-21 13:35:11.410312814 -0800
@@ -3335,7 +3335,7 @@
}

/* Set the TRB length, TD size, and interrupter fields. */
- if (xhci->hci_version < 0x100) {
+ if (xhci->hci_version > 0x100) {
remainder = xhci_td_remainder(
urb->transfer_buffer_length -
running_total);

2014-01-22 09:19:03

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

From: walt
> On 01/21/2014 01:51 AM, David Laight wrote:
> > From: Sarah Sharp
> >> On Mon, Jan 20, 2014 at 11:21:14AM +0000, David Laight wrote:
> > ...
> >>> A guess...
> >>>
> >>> In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
> >>> of xhci_td_remainder().
> >>
> David, I tried the one-liner below, which changed nothing AFAICS, but
> then I'm not sure it's the change you intended:
...
> /* Set the TRB length, TD size, and interrupter fields. */
> - if (xhci->hci_version < 0x100) {
> + if (xhci->hci_version > 0x100) {
> remainder = xhci_td_remainder(
> urb->transfer_buffer_length -
> running_total);

So my wild guess wasn't right.
Can't win them all.

David

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?