On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
> > 3.12-stable review patch. If anyone has any objections, please let me know.
> >
> > ------------------
> >
> > From: David Laight <[email protected]>
> >
> > commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
> >
> > Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
> > can only occur at a boundary between underlying USB frames (512 bytes for
> > high speed devices).
> >
> > If this isn't done the USB frames aren't formatted correctly and, for example,
> > the USB3 ethernet ax88179_178a card will stop sending...
>
>
> Unfortunately this patch causes a regression when copying large files to my
> outboard USB3 drive. (Nothing at all to do with networking.)
>
> When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
> about 7GB, the ext4 journal aborts and the drive is remounted read-only.
>
> This bug is 100% reproducible (always pretty close to 7GB) and reverting this
> patch completely fixes the problem.
Ok, I had feared that would be a consequence of this patch. I think the
problem is that the usb-storage driver submitted an URB with more
scatter-gather entries than would fit on the ring segment, the xHCI
driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
up on the SCSI command.
Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see
dmesg output from this statement shortly before your drive fails:
if (num_trbs >= TRBS_PER_SEGMENT) {
xhci_err(xhci, "Too many fragments %d, max %d\n",
num_trbs, TRBS_PER_SEGMENT - 1);
return -ENOMEM;
}
> (Note to Sarah: I recently emailed you about this problem, and I *wrongly*
> said that reverting the patch doesn't help. That was a mistake, sorry.)
>
> I'm happy to try any debugging suggestions/tricks.
Unfortunately a real fix for this is going to take a bit. I have a
couple different solutions to the bug the patch solved, but they're much
more invasive than the original patch and will take a couple weeks to
implement and thoroughly test.
If David's patch is just reverted, USB ethernet on 3.12 and later breaks
under xHCI. The networking folks added scatter-gather support in 3.12.
Those patches could be reverted, but I suspect David Miller will not be
happy with that solution, since the real problem is the xHCI driver
itself, and EHCI scatter-gather works fine.
I think the short term solution is to simply turn off scatter-gather all
together under xHCI until this gets fixed. It could mean a big
performance hit for USB storage devices, but that means we get correct
behavior for both USB ethernet and USB storage.
> BTW, please tell me if I've cc'd too many people.
Nope, you're fine. I've Cc'ed the USB and SCSI mailing lists as well.
Sarah Sharp
On 14-01-02 02:15 PM, Sarah Sharp wrote:
> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
..
>> Unfortunately this patch causes a regression when copying large files to my
>> outboard USB3 drive. (Nothing at all to do with networking.)
>>
>> When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
>> about 7GB, the ext4 journal aborts and the drive is remounted read-only.
>>
>> This bug is 100% reproducible (always pretty close to 7GB) and reverting this
>> patch completely fixes the problem.
>
> Ok, I had feared that would be a consequence of this patch. I think the
> problem is that the usb-storage driver submitted an URB with more
> scatter-gather entries than would fit on the ring segment, the xHCI
> driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
> up on the SCSI command.
Is there not a block layer / scheduler tunable for max sg entries or something?
--
Mark Lord
Real-Time Remedies Inc.
[email protected]
On Thu, 2014-01-02 at 16:01 -0500, Mark Lord wrote:
> On 14-01-02 02:15 PM, Sarah Sharp wrote:
> > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
> ..
> >> Unfortunately this patch causes a regression when copying large files to my
> >> outboard USB3 drive. (Nothing at all to do with networking.)
> >>
> >> When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
> >> about 7GB, the ext4 journal aborts and the drive is remounted read-only.
> >>
> >> This bug is 100% reproducible (always pretty close to 7GB) and reverting this
> >> patch completely fixes the problem.
> >
> > Ok, I had feared that would be a consequence of this patch. I think the
> > problem is that the usb-storage driver submitted an URB with more
> > scatter-gather entries than would fit on the ring segment, the xHCI
> > driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
> > up on the SCSI command.
>
>
> Is there not a block layer / scheduler tunable for max sg entries or something?
Yes, it's blk_queue_max_segments() You can also add an indirect limit
from userspace using the queue/max_sectors_kb parameter (it doesn't
limit the number of sg elements directly, but it does limit the overall
size of the transfer, and since each segment is 4k or larger except at
the beginning and end, that limits the total number of sg elements as
well).
James
On Thu, Jan 02, 2014 at 04:01:29PM -0500, Mark Lord wrote:
> On 14-01-02 02:15 PM, Sarah Sharp wrote:
> > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
> ..
> >> Unfortunately this patch causes a regression when copying large files to my
> >> outboard USB3 drive. (Nothing at all to do with networking.)
> >>
> >> When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
> >> about 7GB, the ext4 journal aborts and the drive is remounted read-only.
> >>
> >> This bug is 100% reproducible (always pretty close to 7GB) and reverting this
> >> patch completely fixes the problem.
> >
> > Ok, I had feared that would be a consequence of this patch. I think the
> > problem is that the usb-storage driver submitted an URB with more
> > scatter-gather entries than would fit on the ring segment, the xHCI
> > driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
> > up on the SCSI command.
>
> Is there not a block layer / scheduler tunable for max sg entries or something?
There is a USB host controller tunable for max number of sg entries
that's passed up to the SCSI or block layer. We discussed changing it,
but there wasn't a good consensus on what to change it to:
http://marc.info/?l=linux-usb&m=138496358223904&w=2
http://marc.info/?l=linux-netdev&m=138496706007262&w=2
In the end, we thought we didn't need to limit the sglist size because
Paul thought usb-storage limits the overall transfer size to 120K, which
should fit in 31 TRBs:
http://marc.info/?l=linux-usb&m=138498190419312&w=2
Walt, could you see if limiting the sglist size helps, by setting
hcd->self.sg_tablesize in xhci.c to 31?
Sarah Sharp
On 01/02/2014 11:15 AM, Sarah Sharp wrote:
> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
>> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
>>> 3.12-stable review patch. If anyone has any objections, please let me know.
>>>
>>> ------------------
>>>
>>> From: David Laight <[email protected]>
>>>
>>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
>>>
>>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
>>> can only occur at a boundary between underlying USB frames (512 bytes for
>>> high speed devices).
>>>
>>> If this isn't done the USB frames aren't formatted correctly and, for example,
>>> the USB3 ethernet ax88179_178a card will stop sending...
>>
>>
>> Unfortunately this patch causes a regression when copying large files to my
>> outboard USB3 drive. (Nothing at all to do with networking.)
> Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see
> dmesg output from this statement shortly before your drive fails:
>
> if (num_trbs >= TRBS_PER_SEGMENT) {
> xhci_err(xhci, "Too many fragments %d, max %d\n",
> num_trbs, TRBS_PER_SEGMENT - 1);
> return -ENOMEM;
> }
Well, the answers depend on whether the usb3 drive uses logical volumes or not
(lvm2), which I can't explain. What I've described so far is with lvm2.
When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the
console prints two or three lines stating that the ext4 journal has quit and
the drive is remounted ro. That particular drive stays wedged until the next
reboot, but no other ill effects to the system.
OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock,
(no logical volumes) the copy failure becomes catastrophic, with kernel panic
messages, leaving the system unresponsive and needing a hard reset to recover.
I also tried your other suggestion:
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 4265b48..1a6a43d 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks)
int retval;
/* Accept arbitrarily long scatter-gather lists */
- hcd->self.sg_tablesize = ~0;
+ hcd->self.sg_tablesize = 31;
/* support to build packet from discontinuous buffers */
hcd->self.no_sg_constraint = 1;
Sadly it didn't fix the problem. Did I get the patch right?
Thanks for your help, and I'm happy to try more ideas, as always.
On Fri, Jan 03, 2014 at 07:40:33AM -0800, walt wrote:
> On 01/02/2014 11:15 AM, Sarah Sharp wrote:
> > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
> >> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
> >>> 3.12-stable review patch. If anyone has any objections, please let me know.
> >>>
> >>> ------------------
> >>>
> >>> From: David Laight <[email protected]>
> >>>
> >>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
> >>>
> >>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
> >>> can only occur at a boundary between underlying USB frames (512 bytes for
> >>> high speed devices).
> >>>
> >>> If this isn't done the USB frames aren't formatted correctly and, for example,
> >>> the USB3 ethernet ax88179_178a card will stop sending...
> >>
> >>
> >> Unfortunately this patch causes a regression when copying large files to my
> >> outboard USB3 drive. (Nothing at all to do with networking.)
>
> > Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see
> > dmesg output from this statement shortly before your drive fails:
> >
> > if (num_trbs >= TRBS_PER_SEGMENT) {
> > xhci_err(xhci, "Too many fragments %d, max %d\n",
> > num_trbs, TRBS_PER_SEGMENT - 1);
> > return -ENOMEM;
> > }
>
> Well, the answers depend on whether the usb3 drive uses logical volumes or not
> (lvm2), which I can't explain. What I've described so far is with lvm2.
>
> When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the
> console prints two or three lines stating that the ext4 journal has quit and
> the drive is remounted ro. That particular drive stays wedged until the next
> reboot, but no other ill effects to the system.
Odd. In 3.12 xHCI has dynamic debugging, and turning on CONFIG_USB_DEBUG
should turn on debugging by default, so it's confusing that you didn't
see any messages.
Can I see your .config from /boot/? Also, did you try capturing dmesg
with `tail -f /var/log/kern.log` or just dmesg? Perhaps you need to run
`sudo dmesg -n 7`?
> OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock,
> (no logical volumes) the copy failure becomes catastrophic, with kernel panic
> messages, leaving the system unresponsive and needing a hard reset to recover.
>
> I also tried your other suggestion:
>
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index 4265b48..1a6a43d 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks)
> int retval;
>
> /* Accept arbitrarily long scatter-gather lists */
> - hcd->self.sg_tablesize = ~0;
> + hcd->self.sg_tablesize = 31;
>
> /* support to build packet from discontinuous buffers */
> hcd->self.no_sg_constraint = 1;
>
> Sadly it didn't fix the problem. Did I get the patch right?
Yes, you did. So perhaps the patch triggers a different bug. I can't
tell until I see xHCI debugging output.
> Thanks for your help, and I'm happy to try more ideas, as always.
Thanks for your patience. :)
Sarah Sharp
On 01/03/2014 11:54 AM, Sarah Sharp wrote:
> On Fri, Jan 03, 2014 at 07:40:33AM -0800, walt wrote:
>> On 01/02/2014 11:15 AM, Sarah Sharp wrote:
>>> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
>>>> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
>>>>> 3.12-stable review patch. If anyone has any objections, please let me know.
>>>>>
>>>>> ------------------
>>>>>
>>>>> From: David Laight <[email protected]>
>>>>>
>>>>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
>>>>>
>>>>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
>>>>> can only occur at a boundary between underlying USB frames (512 bytes for
>>>>> high speed devices).
>>>>>
>>>>> If this isn't done the USB frames aren't formatted correctly and, for example,
>>>>> the USB3 ethernet ax88179_178a card will stop sending...
>>>>
>>>>
>>>> Unfortunately this patch causes a regression when copying large files to my
>>>> outboard USB3 drive. (Nothing at all to do with networking.)
>>
>>> Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see
>>> dmesg output from this statement shortly before your drive fails:
>>>
>>> if (num_trbs >= TRBS_PER_SEGMENT) {
>>> xhci_err(xhci, "Too many fragments %d, max %d\n",
>>> num_trbs, TRBS_PER_SEGMENT - 1);
>>> return -ENOMEM;
>>> }
>>
>> Well, the answers depend on whether the usb3 drive uses logical volumes or not
>> (lvm2), which I can't explain. What I've described so far is with lvm2.
>>
>> When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect
I'm so sorry Sarah, that was another mistake. The mistake is so stupid I'm not
going to publish it here :(
Once I finally ran the kernel with debugging actually compiled in, dmesg contains
xhci debugging messages. Wow :)
It's a big file so I zipped and attached it, which I hope is acceptable in lkml.
BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said before
doesn't fix the problem. The cp stopped around 7GB just as before.
Sorry for the noise...
On Fri, Jan 03, 2014 at 01:21:18PM -0800, walt wrote:
> I'm so sorry Sarah, that was another mistake. The mistake is so stupid I'm not
> going to publish it here :(
>
> Once I finally ran the kernel with debugging actually compiled in, dmesg contains
> xhci debugging messages. Wow :)
>
> It's a big file so I zipped and attached it, which I hope is acceptable in lkml.
Yep, that's fine. Sticking it in pastebin (or up on your server) is
also fine, if it gets really big.
> BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said before
> doesn't fix the problem. The cp stopped around 7GB just as before.
>
> Sorry for the noise...
No worries! :) With the dmesg, I can finally see what happened:
[ 188.703059] xhci_hcd 0000:03:00.0: Cancel URB ffff8800b7d2e0c0, dev 1, ep 0x2, starting at offset 0xbb7b9000
[ 188.703072] xhci_hcd 0000:03:00.0: // Ding dong!
[ 193.711022] xhci_hcd 0000:03:00.0: xHCI host not responding to stop endpoint command.
[ 193.711029] xhci_hcd 0000:03:00.0: Assuming host is dying, halting host.
[ 193.711046] xhci_hcd 0000:03:00.0: // Halt the HC
[ 193.711060] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 0
[ 193.711066] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 2
[ 193.711078] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 3
[ 193.711096] xhci_hcd 0000:03:00.0: Calling usb_hc_died()
[ 193.711103] xhci_hcd 0000:03:00.0: HC died; cleaning up
[ 193.711116] xhci_hcd 0000:03:00.0: xHCI host controller is dead.
It seems that the xHCI driver tried to stop the endpoint ring in order
to cancel a SCSI transfer, and the driver never got a response for that.
The offset is rather suspicious (0xbb7b9000), and it probably means the
driver attempted to cancel a transfer that had been moved to the
beginning of the ring segment, with no-op TRBs before the link TRB.
I suspect David's patch triggers a bug in the command cancellation code.
There's also the unlikely possibility that the no-op TRBs did indeed
cause the host to hang. Either way, I'll have to look into it.
I'll let you know when I have some diagnostic patches ready.
Sarah Sharp
On 14-01-03 10:40 AM, walt wrote:
> On 01/02/2014 11:15 AM, Sarah Sharp wrote:
>> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
>>> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
>>>> 3.12-stable review patch. If anyone has any objections, please let me know.
>>>>
>>>> ------------------
>>>>
>>>> From: David Laight <[email protected]>
>>>>
>>>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
>>>>
>>>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
>>>> can only occur at a boundary between underlying USB frames (512 bytes for
>>>> high speed devices).
>>>>
>>>> If this isn't done the USB frames aren't formatted correctly and, for example,
>>>> the USB3 ethernet ax88179_178a card will stop sending...
>>>
>>>
>>> Unfortunately this patch causes a regression when copying large files to my
>>> outboard USB3 drive. (Nothing at all to do with networking.)
>
>> Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see
>> dmesg output from this statement shortly before your drive fails:
>>
>> if (num_trbs >= TRBS_PER_SEGMENT) {
>> xhci_err(xhci, "Too many fragments %d, max %d\n",
>> num_trbs, TRBS_PER_SEGMENT - 1);
>> return -ENOMEM;
>> }
>
> Well, the answers depend on whether the usb3 drive uses logical volumes or not
> (lvm2), which I can't explain. What I've described so far is with lvm2.
>
> When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the
> console prints two or three lines stating that the ext4 journal has quit and
> the drive is remounted ro. That particular drive stays wedged until the next
> reboot, but no other ill effects to the system.
>
> OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock,
> (no logical volumes) the copy failure becomes catastrophic, with kernel panic
> messages, leaving the system unresponsive and needing a hard reset to recover.
>
> I also tried your other suggestion:
>
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index 4265b48..1a6a43d 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks)
> int retval;
>
> /* Accept arbitrarily long scatter-gather lists */
> - hcd->self.sg_tablesize = ~0;
> + hcd->self.sg_tablesize = 31;
>
> /* support to build packet from discontinuous buffers */
> hcd->self.no_sg_constraint = 1;
>
> Sadly it didn't fix the problem. Did I get the patch right?
That sounds almost as if the old version is still being loaded/run,
possibly from the initramfs image?
--
Mark Lord
Real-Time Remedies Inc.
[email protected]
> From: walt
...
> /* Accept arbitrarily long scatter-gather lists */
> - hcd->self.sg_tablesize = ~0;
> + hcd->self.sg_tablesize = 31;
Even if that reduces the number of fragments passed to the xhci driver
it may not be enough to limit the actual number of fragments that
need to be placed in the ring itself.
The xhci driver has to split every fragment on any 64k address boundary.
One possibility is to scan long SG lists to see it they are actually
problematic. If all the fragments are suitably aligned let them through
(without any nops).
My gut feeling is that problems only arise when the ring end isn't at
a 1k boundary in the data.
So provided all the fragments are multiples of 1k (after splitting on 64k
boundaries) the transfer will be processed correctly.
Alternatively, if the fragments are all longer than 1k (after the 64k split),
the one that crosses the ring end can be split in two.
A quick 'fix' would be to assume that anything with too many fragments is
probably ok - maybe check the first fragment is suitably aligned.
That would recover the old behaviour for usb disk transfers with a lot
of fragments - yes it is a hack...
David
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
On Fri, Jan 03, 2014 at 03:29:29PM -0800, Sarah Sharp wrote:
> On Fri, Jan 03, 2014 at 01:21:18PM -0800, walt wrote:
> > I'm so sorry Sarah, that was another mistake. The mistake is so stupid I'm not
> > going to publish it here :(
> >
> > Once I finally ran the kernel with debugging actually compiled in, dmesg contains
> > xhci debugging messages. Wow :)
> >
> > It's a big file so I zipped and attached it, which I hope is acceptable in lkml.
>
> Yep, that's fine. Sticking it in pastebin (or up on your server) is
> also fine, if it gets really big.
>
> > BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said before
> > doesn't fix the problem. The cp stopped around 7GB just as before.
> >
> > Sorry for the noise...
>
> No worries! :) With the dmesg, I can finally see what happened:
>
> [ 188.703059] xhci_hcd 0000:03:00.0: Cancel URB ffff8800b7d2e0c0, dev 1, ep 0x2, starting at offset 0xbb7b9000
> [ 188.703072] xhci_hcd 0000:03:00.0: // Ding dong!
> [ 193.711022] xhci_hcd 0000:03:00.0: xHCI host not responding to stop endpoint command.
> [ 193.711029] xhci_hcd 0000:03:00.0: Assuming host is dying, halting host.
> [ 193.711046] xhci_hcd 0000:03:00.0: // Halt the HC
> [ 193.711060] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 0
> [ 193.711066] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 2
> [ 193.711078] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 3
> [ 193.711096] xhci_hcd 0000:03:00.0: Calling usb_hc_died()
> [ 193.711103] xhci_hcd 0000:03:00.0: HC died; cleaning up
> [ 193.711116] xhci_hcd 0000:03:00.0: xHCI host controller is dead.
>
> It seems that the xHCI driver tried to stop the endpoint ring in order
> to cancel a SCSI transfer, and the driver never got a response for that.
>
> The offset is rather suspicious (0xbb7b9000), and it probably means the
> driver attempted to cancel a transfer that had been moved to the
> beginning of the ring segment, with no-op TRBs before the link TRB.
>
> I suspect David's patch triggers a bug in the command cancellation code.
> There's also the unlikely possibility that the no-op TRBs did indeed
> cause the host to hang. Either way, I'll have to look into it.
>
> I'll let you know when I have some diagnostic patches ready.
Hi Walt,
I have a couple of patches for you to test. You can either apply the
attached three patches, or you can pull down a kernel with:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci.git -b 3.12-td-fragment-failure
Please only apply the first patch (which is diagnostic only), trigger
your issue, and send me the resulting dmesg. Then try applying the
other two patches, and see if the issue goes away. (I suspect it won't
but I can't be sure.)
Sarah Sharp
On 01/06/2014 04:31 PM, Sarah Sharp wrote:
> On Fri, Jan 03, 2014 at 03:29:29PM -0800, Sarah Sharp wrote:
>> With the dmesg, I can finally see what happened:
>>
>> [ 188.703059] xhci_hcd 0000:03:00.0: Cancel URB ffff8800b7d2e0c0, dev 1, ep 0x2, starting at offset 0xbb7b9000
>> [ 188.703072] xhci_hcd 0000:03:00.0: // Ding dong!
>> [ 193.711022] xhci_hcd 0000:03:00.0: xHCI host not responding to stop endpoint command.
>> [ 193.711029] xhci_hcd 0000:03:00.0: Assuming host is dying, halting host.
>> [ 193.711046] xhci_hcd 0000:03:00.0: // Halt the HC
>> [ 193.711060] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 0
>> [ 193.711066] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 2
>> [ 193.711078] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 3
>> [ 193.711096] xhci_hcd 0000:03:00.0: Calling usb_hc_died()
>> [ 193.711103] xhci_hcd 0000:03:00.0: HC died; cleaning up
>> [ 193.711116] xhci_hcd 0000:03:00.0: xHCI host controller is dead.
>>
>> It seems that the xHCI driver tried to stop the endpoint ring in order
>> to cancel a SCSI transfer, and the driver never got a response for that.
>>
>> The offset is rather suspicious (0xbb7b9000), and it probably means the
>> driver attempted to cancel a transfer that had been moved to the
>> beginning of the ring segment, with no-op TRBs before the link TRB.
>>
>> I suspect David's patch triggers a bug in the command cancellation code.
>> There's also the unlikely possibility that the no-op TRBs did indeed
>> cause the host to hang. Either way, I'll have to look into it.
>>
>> I'll let you know when I have some diagnostic patches ready.
>
> Hi Walt,
>
> I have a couple of patches for you to test.
> Please only apply the first patch (which is diagnostic only), trigger
> your issue, and send me the resulting dmesg. Then try applying the
> other two patches, and see if the issue goes away. (I suspect it won't
> but I can't be sure.)
Thanks Sarah. dmesg0 is from the diagnostic patch only. dmesg1 has all
three patches applied. Some of the messages in dmesg1 fell off the end of
the kernel buffer, so I may need to make the buffer larger next time but
I'll need a reminder of how to do it.
As you suspected, the patches didn't fix the problem, sorry.
I find that I can tell in advance whether the copy is going to succeed,
just by watching the light flicker on the usb3 drive. When the flicker
is absolutely regular, with no variation whatever, I can tell in 10 or
15 seconds that the copy will fail.
At the same time the light on the main drive goes dark after 10 seconds,
implying that the usb3 drive stops receiving any data from the main drive
after 10 seconds, yet the light on the usb3 drive continues to flicker as
if writing data -- even after the cp officially fails. The light on the
usb3 drive never stops flickering until I reboot the machine or unplug
the usb cable.
> From: walt
...
> Thanks Sarah. dmesg0 is from the diagnostic patch only. dmesg1 has all
> three patches applied. Some of the messages in dmesg1 fell off the end of
> the kernel buffer, so I may need to make the buffer larger next time but
> I'll need a reminder of how to do it.
>
> As you suspected, the patches didn't fix the problem, sorry.
I think Sarah will want the traces of the transfer being setup (ie just
before the error). Some 'normal' completing transfers are also useful.
You might also find the full trace output in one of the files in /var/log.
David
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
The dmesg contains:
[ 538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error writing to inode 23330865 (offset 0 size 8388608 starting block 812628)
An 8MB transfer will need at least 128 ring entries (TRB) even if the request
is a single contiguous memory block.
Are you using the patch that increases the ring size from 64 to 256?
David
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
On 01/07/2014 05:58 AM, David Laight wrote:
> The dmesg contains:
>
> [ 538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error writing to inode 23330865 (offset 0 size 8388608 starting block 812628)
>
> An 8MB transfer will need at least 128 ring entries (TRB) even if the request
> is a single contiguous memory block.
>
> Are you using the patch that increases the ring size from 64 to 256?
Yes, 256. I'll work on getting more debugging output.
Okay, I used log_buf_len to make dmesg bigger and now I think I have
the whole thing. It's attached.
On Tue, Jan 07, 2014 at 05:29:48AM -0800, walt wrote:
> On 01/06/2014 04:31 PM, Sarah Sharp wrote:
> > Hi Walt,
> >
> > I have a couple of patches for you to test.
>
> > Please only apply the first patch (which is diagnostic only), trigger
> > your issue, and send me the resulting dmesg. Then try applying the
> > other two patches, and see if the issue goes away. (I suspect it won't
> > but I can't be sure.)
>
> Thanks Sarah. dmesg0 is from the diagnostic patch only. dmesg1 has all
> three patches applied. Some of the messages in dmesg1 fell off the end of
> the kernel buffer, so I may need to make the buffer larger next time but
> I'll need a reminder of how to do it.
Set CONFIG_LOG_BUF_SHIFT to 21.
> As you suspected, the patches didn't fix the problem, sorry.
Yep, I thought so. I did glean one bit of information from the logs: it
seems that your host does handle no-op TRBs, at least for a while.
However, after a bigger chunk of TRBs, it goes off into la-la-land.
Assuming one of the rings is comprised of two segments:
0xbb711000 (start)
0xbb7113f0 (end)
0xbb711400 (start)
0xbb7117f0 (end)
The log show no-ops were inserted at:
0xbb7207d0
0xbb7206a0
0xbb720be0
0xbb720be0
0xbb720bd0
0xbb7207e0
0xbb711370 = 8 no-ops
0xbb7117c0 = 3
0xbb7113b0 = 4
0xbb7113a0 = 5
0xbb7117d0 = 2
0xbb711340 = 11
0xbb711770 = 8
0xbb711230 = 28
0xbb7117e0 = 1
0xbb7117b0 = 4
0xbb7113d0 = 2
0xbb7117b0 = 4
0xbb711340 = 11
0xbb711690 = 22
So the host was able to process 28 no-op TRBs, but failed on 22 no-ops
later. The event ring debugging shows the last event was for
0xbb711680, which is the last TRB before the first no-op inserted before
the host died. There's no Stop Endpoint Command completion, and it
looks like the command was correctly put on the command ring, so it
seems the host is actually hanging for some reason.
Unfortunately, I made a mistake in the debugging patch I sent
you, so it didn't print out the endpoint rings when the host died. I
need that info, to see whether the link TRB was still intact, or if we
over-wrote it and caused the host to go fetch some invalid memory.
Can you please try the attached patch, on top of the previous three
patches, and send me dmesg?
> I find that I can tell in advance whether the copy is going to succeed,
> just by watching the light flicker on the usb3 drive. When the flicker
> is absolutely regular, with no variation whatever, I can tell in 10 or
> 15 seconds that the copy will fail.
>
> At the same time the light on the main drive goes dark after 10 seconds,
> implying that the usb3 drive stops receiving any data from the main drive
> after 10 seconds, yet the light on the usb3 drive continues to flicker as
> if writing data -- even after the cp officially fails. The light on the
> usb3 drive never stops flickering until I reboot the machine or unplug
> the usb cable.
Interesting. Without a USB analyzer, we can't really tell what's
happening. However, one hypothesis could be that the blinking light is
triggered by an active SCSI command (read/write, etc).
There are three phases of the command: setup, data, and status. I think
your device is getting the setup phase, and the host is dying before it
sends the data phase. If the light blinks when it gets a setup phase,
and turns off when the devices sends a status phase, that would explain
its behavior.
But that's just a hypothesis, I have no idea whether it's correct.
Sarah Sharp
On Tue, Jan 07, 2014 at 01:58:32PM +0000, David Laight wrote:
> The dmesg contains:
>
> [ 538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error writing to inode 23330865 (offset 0 size 8388608 starting block 812628)
>
> An 8MB transfer will need at least 128 ring entries (TRB) even if the request
> is a single contiguous memory block.
>
> Are you using the patch that increases the ring size from 64 to 256?
It's likely that the block layer is breaking up the EXT4 write into
several transfers, since usb-storage limits overall transfer size to
120 KB. In any case, I added more debugging in the last patch to print
the number of TRBs necessary. That way we can verify the patch to limit
the number of scatter-gather list entries is working.
Sarah Sharp
On Tue, Jan 07, 2014 at 12:00:00PM -0800, walt wrote:
> Okay, I used log_buf_len to make dmesg bigger and now I think I have
> the whole thing. It's attached.
Walt, can you make sure the patch I sent you was applied? The output
doesn't look like it is.
Sarah Sharp
On 01/03/2014 03:29 PM, Sarah Sharp wrote:
> I'll let you know when I have some diagnostic patches ready.
Hi Sarah. I see today gregkh committed the patches you've already sent
me, so I assume someone (other than me) has tested those patches and
discovered some benefit from them?
I'm still wondering if I'm suffering from hardware quirks. From the
first day I installed my usb3 adapter card and the usb3 disk docking
station I've noticed some quirky behavior.
e.g. I boot the machine with the docking station powered-off, and then
later I power it on, the usb3 disk is not detected at all -- until I
reboot the machine with the docking station still powered on.
Minor stuff, yes, but maybe relevant? I dunno.
> From: walt
...
> I'm still wondering if I'm suffering from hardware quirks. From the
> first day I installed my usb3 adapter card and the usb3 disk docking
> station I've noticed some quirky behavior.
Ah - this isn't an 'on chip' usb3 adapter.
Some kind of PCIe card ?
> e.g. I boot the machine with the docking station powered-off, and then
> later I power it on, the usb3 disk is not detected at all -- until I
> reboot the machine with the docking station still powered on.
That will be something completely different to failures when running.
David
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
On 01/09/2014 02:05 AM, David Laight wrote:
>> From: walt
> ...
>> I'm still wondering if I'm suffering from hardware quirks. From the
>> first day I installed my usb3 adapter card and the usb3 disk docking
>> station I've noticed some quirky behavior.
>
> Ah - this isn't an 'on chip' usb3 adapter.
> Some kind of PCIe card ?
This one:
http://www.sabrent.com/category/controller-cards/PCIX-USB3/