2013-02-22 20:50:49

by Seth Forshee

[permalink] [raw]
Subject: carl9170 A-MPDU transmit problem

Hi Christian,

I was given a D-Link wireless dongle by a colleague who said he was
having trouble with the connection stalling. One thing I noticed right
away when running iperf is that the tx BA sessions seem to be getting
stopped and restarted pretty frequently. I've been doing some debugging,
and here's what I'm seeing.

On the air everything seems to go smoothly for a while, but then the
D-Link adapter stops transmitting data frames for a while. It still
sends CTS and BA frames in response to the AP, just no data frames.
Eventually it sends the action frame with the DELBA request, but
immediately before sending the action frame it sends a single data
frame. This pattern repeats each time this happens.

That final data frame is a bit curious, so I added some tracing to
carl9170. It looks like this frame gets passed down to the hardware as
the last in a batch of A-MPDU frames, but for some reason the hardware
stops transmitting without sending this frame. carl9170 doesn't pass any
more A-MPDU frames to the hardware while tx for that frame is pending,
and when the action frame for the DELBA request finally comes along it
seems to get things moving again.

Do you have any idea what might be causing the hardware to stop
transmitting with one frame left? Any suggestions on how to fix it?

Thanks,
Seth



2013-02-25 20:19:05

by Alan Stern

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Mon, 25 Feb 2013, Seth Forshee wrote:

> > > >ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 ... <-- DATA
> > > >
> > > > [... long period where the device receives commands on EP4 and sends
> > > > wifi data to the host via EP2 - so it is working!]
> > > >
> > > >ffff880146c8af00 1522200650 S Bo:3:003:1 -115 62 = 3e000000 ... <-- DELBA
> > > >ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 > <-- DATA urb completion
> > > >ffff880146c8af00 1522200756 C Bo:3:003:1 0 62 > <-- DELBA urb completion

> > Which kernel version are you testing under? Can you please recompile
> > with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and
> > send me dmesg? I should be able to see if there's an unhandled pending
> > event on the xHCI rings if the data URB stalls for longer than say, a
> > minute.
>
> I'll do this.

Bear in mind that the usbmon trace shows the URB stalling for 2 - 3
seconds. There may have been other examples that went on for longer,
but probably nothing anywhere near as long as a minute.

Alan Stern


2013-02-25 19:14:21

by Sarah Sharp

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Mon, Feb 25, 2013 at 05:03:40PM +0100, Christian Lamparter wrote:
> On Monday, February 25, 2013 04:29:55 PM Alan Stern wrote:
> > On Mon, 25 Feb 2013, Christian Lamparter wrote:
> One detail that might be important (to keep in mind):
>
> Original report <http://www.spinics.net/lists/linux-wireless/msg103880.html>:
> > On the air everything seems to go smoothly for a while, but
> > then the D-Link adapter stops transmitting *DATA* frames for a while. [...]
> > Eventually it sends the action frame with the *DELBA* request, but
> > immediately before sending the action frame it sends a single *DATA*
> > frame (1). This pattern repeats each time this happens.
>
> Now if we take this and apply it to the usbmon recording in:
> <http://www.spinics.net/lists/linux-wireless/msg103915.html>
>
> > Normally the time between submission and callback for a given urb
> > is short. However, some are much longer, e.g.:
> >
> >ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 ... <-- DATA
> >
> > [... long period where the device receives commands on EP4 and sends
> > wifi data to the host via EP2 - so it is working!]
> >
> >ffff880146c8af00 1522200650 S Bo:3:003:1 -115 62 = 3e000000 ... <-- DELBA
> >ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 > <-- DATA urb completion
> >ffff880146c8af00 1522200756 C Bo:3:003:1 0 62 > <-- DELBA urb completion
>
> It would mean that the (delayed) urb with the *DATA* frame urb was not
> sent (?or received?) by the usb dongle until the *DELBA* came along (1)
> and triggered the TX for both (in quick succession). So, I think we
> should be looking for lost/unhandled interrupts/events.
>
> One more thing: So far this issue only occurs with:
> 00:14.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller [8086:1e31] (rev 04) (prog-if 30 [XHCI])

Which kernel version are you testing under? Can you please recompile
with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and
send me dmesg? I should be able to see if there's an unhandled pending
event on the xHCI rings if the data URB stalls for longer than say, a
minute.

As for lost interrupts, the only issue we had with that particular xHCI
host had to do with lost isochronous transfers. We fixed that with a
quirk to avoid the BEI flag on that platform. But your data is being
sent via a bulk endpoint, so we shouldn't have any issues there.

> However, it not all xhci-hcd are affected. I have not seen this with
> the NEC Corporation uPD720200 I have in my sandy bridge laptop:
> 19:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev 04)

Hmm, yeah, that kind of points to an Intel xHCI hardware issue. It's
too bad you don't have a USB analyzer (the high speed ones are about
$480). Can you send me a link so I can purchase the device and test it
with my analyzer?

Sarah Sharp

2013-02-25 23:31:02

by Christian Lamparter

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Monday, February 25, 2013 09:19:03 PM Alan Stern wrote:
> On Mon, 25 Feb 2013, Seth Forshee wrote:
> > > > >ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 ... <-- DATA
> > > > >
> > > > > [... long period where the device receives commands on EP4 and sends
> > > > > wifi data to the host via EP2 - so it is working!]
> > > > >
> > > > >ffff880146c8af00 1522200650 S Bo:3:003:1 -115 62 = 3e000000 ... <-- DELBA
> > > > >ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 > <-- DATA urb completion
> > > > >ffff880146c8af00 1522200756 C Bo:3:003:1 0 62 > <-- DELBA urb completion
>
> > > Which kernel version are you testing under? Can you please recompile
> > > with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and
> > > send me dmesg? I should be able to see if there's an unhandled pending
> > > event on the xHCI rings if the data URB stalls for longer than say, a
> > > minute.
> >
> > I'll do this.
>
> Bear in mind that the usbmon trace shows the URB stalling for 2 - 3
> seconds. There may have been other examples that went on for longer,
> but probably nothing anywhere near as long as a minute.
>
---
This should open up the window for as long as we want it.
The driver will now queue, but not deliver any frames to
the device when there are unfinished urbs.

Note: Just in case someone is interested in the interval
values of the eps. I attached a lsusb dump of a typical
carl9170 device (So, apart from the vendor and product
id, it should be the same for DWA-160 A2).

Regards,
Christian
---
diff --git a/drivers/net/wireless/ath/carl9170/tx.c b/drivers/net/wireless/ath/carl9170/tx.c
index 9c0b150..3931a1e 100644
--- a/drivers/net/wireless/ath/carl9170/tx.c
+++ b/drivers/net/wireless/ath/carl9170/tx.c
@@ -598,7 +598,6 @@ next:
* - all bugs you can't...
* - ...
*/
- carl9170_restart(ar, CARL9170_RR_STUCK_TX);
}
}

@@ -1336,6 +1335,9 @@ static void carl9170_tx(struct ar9170 *ar)
if (unlikely(!IS_STARTED(ar)))
return;

+ if (atomic_read(&ar->tx_total_pending))
+ return;
+
carl9170_usb_handle_tx_err(ar);

for (i = 0; i < ar->hw->queues; i++) {
---

Bus 004 Device 005: ID 0846:9010 NetGear, Inc. WNDA3100v1 802.11abgn [Atheros AR9170+AR9104]
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 2.00
bDeviceClass 255 Vendor Specific Class
bDeviceSubClass 255 Vendor Specific Subclass
bDeviceProtocol 255 Vendor Specific Protocol
bMaxPacketSize0 64
idVendor 0x0846 NetGear, Inc.
idProduct 0x9010 WNDA3100v1 802.11abgn [Atheros AR9170+AR9104]
bcdDevice 1.06
iManufacturer 16 ATHER
iProduct 32 USB2.0 WLAN
iSerial 48 12345
bNumConfigurations 1
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 46
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0x80
(Bus Powered)
MaxPower 500mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 4
bInterfaceClass 255 Vendor Specific Class
bInterfaceSubClass 0
bInterfaceProtocol 0
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x01 EP 1 OUT
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x82 EP 2 IN
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x83 EP 3 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0040 1x 64 bytes
bInterval 1
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x04 EP 4 OUT
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0040 1x 64 bytes
bInterval 1
Device Qualifier (for other device speed):
bLength 10
bDescriptorType 6
bcdUSB 2.00
bDeviceClass 255 Vendor Specific Class
bDeviceSubClass 255 Vendor Specific Subclass
bDeviceProtocol 255 Vendor Specific Protocol
bMaxPacketSize0 64
bNumConfigurations 0
Device Status: 0x0000
(Bus Powered)

2013-02-23 14:07:15

by Christian Lamparter

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

CC'd linux-usb and Stephen.

If some folks there know how to debug these pesky usb transport
stalls/errors?... then please join!
(Start of the discussion was this mail:
<http://www.spinics.net/lists/linux-wireless/msg103880.html>)

Stephen, I'm sorry to bother you again ;). But can you
tell me how the download DMA engine and FUSB200 handle
and react to usb transfer errors? Is the error detection
and recovery done by the hardware or is there something
the driver/firmware should be aware of?

On Saturday 23 February 2013 07:46:00 Seth Forshee wrote:
> On Sat, Feb 23, 2013 at 12:48:51AM +0100, Christian Lamparter wrote:
> > So it looks like we need to ask whenever the USB transport
> > is reliable or not. Did you (by any change) also monitor
> > "usb_tx_anch_urbs" [And does it get stuck at some > 1 value
> > as well?]. I'm asking this because if the driver has more than
> > 8 (=AR9170_NUM_TX_URBS) concurrently outgoing URBs, the
> > overflow is queued in tx_wait. Of course, the urb completion
> > handler (carl9170_usb_tx_data_complete) takes care of
> > delivering the next frame in the tx_wait line and so on...
> > But according to your report, this doesn't seem to work!
> > What's a bit odd is that the device is able to recover. Because
> > normally if there is an USB error, the endpoint will halt and
> > no traffic will get through it anymore [So the DELBA should be
> > stuck as well!].
>
> The carl9170 usb code seems to be working properly. If tx_anch_urbs
> reaches 8 the overflow is queued in tx_wait as you said, and the next
> queued frame gets delivered from carl9170_usb_tx_data_complete(). The
> stuck frame does get passed to a successful usb_submit_urb() call before
> tx stops, but it still isn't transmitted until the DELBA comes along
> (and tx_anch_urbs decrements to 1 and then gets stuck there while tx is
> stalled, as would be expected).
usb_submit_urb() is async, so unless the URB data structure is
bogus, the device is in the middle of a reset/is removed or OOM
it won't return with an -ENUMBER.

Since neither of us has probably access to an USB analyzer, the
next best thing would be to enable ehci_hcd's debug facilities
and check if the stuck frame produced any DataBufferErr, XactErr
or something else.

Also, we should check what the device is doing. The hardware has
an (Faraday Tech) FUSB200 PHY. It's initialized and partially
controlled by the carl9170 firmware.
(fw source is available at <https://github.com/chunkeey/carl9170fw>).

The standard usb code (ep0 control, get/set_configuration/
interface, get_status, ...) is under carlfw/usb/usb.c and
carlfw/usb/main.c.

The code which deals with the I/O of WiFi-frames is located in
carlfw/src/hostif.c (TX is handled by handle_download and
handle_download_exception).

Note: If you want to add printf in the firmware:
INFO("Text %d %x %p", int, hex, pointer);
(And then watch dmesg)

You can also use debugfs's hw_ioread32 and hw_iowrite32 to
monitor and manipulate hardware registers (0x1c0000-0x1e2000)
and the firmware memory space (0x20000-0x203ffc).
[addresses have to be aligned on a 4-byte boundary]

For example:

To read the FUSB200 register base from 0x1e1000 - 0x1e1034:
# echo "0x1e1000 14" > /sys/kernel/debug/.../carl9170/hw_ioread32
# cat /sys/.../hw_ioread32
001e1000 = 00008464
001e1004 = 044c097b
...
001e1034 = 00000000

To write some 0xdeadbeef into 0x1e1000:
# echo "0x1e1000 0xdeadbeef" > /sys/kernel/debug/.../carl9170/hw_iowrite32

Regards,
Christian

2013-02-25 14:51:45

by Seth Forshee

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Sun, Feb 24, 2013 at 06:41:06PM -0500, Alan Stern wrote:
> > Or can you think of any other "interesting"
> > bits that could help to explain why the "Arrandale box [...] worked
> > perfectly" whereas (all his) Ivy Bridge ones have problems.
> > (Of course, I assume that it is always the same device, the
> > same firmware and the same kernel drivers in all tests, right)?
>
> No, not really. Unless one is using USB-2 and the other USB-3 -- the
> device might have a bug in its USB-3 firmware.

This appears to be a high-speed device; at least, dmesg says "new
high-speed USB device ..." when I plug it in. So for the firmware isn't
it USB 2.0 even when plugged in a SuperSpeed port?

> What happens if xhci-hcd is unloaded before the test?

I'll test this, but I need to rebuild first. xhci-hcd is currently
built-in to my kenel.

Seth


2013-02-25 22:42:20

by Seth Forshee

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Mon, Feb 25, 2013 at 01:46:23PM -0600, Seth Forshee wrote:
> > Which kernel version are you testing under? Can you please recompile
> > with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and
> > send me dmesg? I should be able to see if there's an unhandled pending
> > event on the xHCI rings if the data URB stalls for longer than say, a
> > minute.
>
> I'll do this.

dmesg turned out to be rather large here, so I put in on a server rather
than attaching it. Here's the link.

http://people.canonical.com/~sforshee/dmesg-carl9170-usb-debug.txt

Thanks,
Seth


2013-02-25 20:24:02

by Seth Forshee

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Mon, Feb 25, 2013 at 11:13:57AM -0800, Sarah Sharp wrote:
> Which kernel version are you testing under?

Forgot to answer this. I'm running 3.8 on these machines.

Seth


2013-02-23 21:26:29

by Alan Stern

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Sat, 23 Feb 2013, Christian Lamparter wrote:

> CC'd linux-usb and Stephen.
>
> If some folks there know how to debug these pesky usb transport
> stalls/errors?... then please join!
> (Start of the discussion was this mail:
> <http://www.spinics.net/lists/linux-wireless/msg103880.html>)

The easiest way to start is to get a usbmon trace. See
Documentation/usb/usbmon.txt.

Alan STern


2013-02-22 22:07:44

by Christian Lamparter

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

Hello Seth,

On Friday 22 February 2013 21:50:44 Seth Forshee wrote:
> I was given a D-Link wireless dongle by a colleague who said he was
> having trouble with the connection stalling. One thing I noticed right
> away when running iperf is that the tx BA sessions seem to be getting
> stopped and restarted pretty frequently. I've been doing some debugging,
> and here's what I'm seeing.
>
> On the air everything seems to go smoothly for a while, but then the
> D-Link adapter stops transmitting data frames for a while. It still
> sends CTS and BA frames in response to the AP, just no data frames.
> Eventually it sends the action frame with the DELBA request, but
> immediately before sending the action frame it sends a single data
> frame. This pattern repeats each time this happens.
>
> That final data frame is a bit curious, so I added some tracing to
> carl9170. It looks like this frame gets passed down to the hardware as
> the last in a batch of A-MPDU frames, but for some reason the hardware
> stops transmitting without sending this frame. carl9170 doesn't pass any
> more A-MPDU frames to the hardware while tx for that frame is pending,
> and when the action frame for the DELBA request finally comes along it
> seems to get things moving again.
>
> Do you have any idea what might be causing the hardware to stop
> transmitting with one frame left? Any suggestions on how to fix it?
Do you also see messages like "invalid plcp cck rate.",
"rx stream does not start with a first_mpdu frame tag." and/or
"plcp info/frame tail is clipped" or messages about "driver
reset"?

Can you tell me what version firmware are you using?
If your driver is up to date, you could try the
firmware attached to:
<https://bugzilla.redhat.com/show_bug.cgi?id=831953>

About the "frame gets passed down..." thing. Do you know
if this frame is received by the firmware or if it is
stuck in the usb-pipe? (if you have DEBUGFS support, you
can look at tx_ampdu_upload. If it's value stays the same
(that is >= 1. If it drops to 0, then it should be fine)
until the DELBA rolls around, then something is wrong with
the USB BUS).

Note: Unfortunately, Control frames (BA + CTS + ACK)
don't share the tx queues with DATA or MGMT frames.
These frames are generated within the bowls of the
hardware MAC and they take "short paths" through
the tx arbiter (this is all done in the silicon).
[However, if you filter the monitor dumps again, you
could look for nullfuncs or Probe Requests originating
from the device. These should be a good indication
whenever the hardware or the driver is on the fritz.]

Note2: It's too early to tell anything yet ;-)
But just in case you are planning to test the latest
wireless-testing.git or compat-driver: Here's a patch
which was just posted:
<http://www.spinics.net/lists/linux-wireless/msg103866.html>
It sorts out a disagreement between carl9170 and the
latest minstrel_ht changes. Without it, you'll definitely
get a very choppy connection (lots of WARNings and frame
drop).

Regards,
Christian

2013-02-22 23:48:57

by Christian Lamparter

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Friday 22 February 2013 23:54:54 Seth Forshee wrote:
> On Fri, Feb 22, 2013 at 11:07:37PM +0100, Christian Lamparter wrote:
> > About the "frame gets passed down..." thing. Do you know
> > if this frame is received by the firmware or if it is
> > stuck in the usb-pipe? (if you have DEBUGFS support, you
> > can look at tx_ampdu_upload. If it's value stays the same
> > (that is >= 1. If it drops to 0, then it should be fine)
> > until the DELBA rolls around, then something is wrong with
> > the USB BUS).
>
> tx_ampdu_upload is one of the things I was monitoring. In one example,
> frames with sequence numbers 1282 - 1286 are put in the tx_pending
> queue, increasing tx_ampdu_upload to 5. All 5 frames are passed to
> carl9170_usb_tx() (this is what I mean when I say the frame gets passed
> down), but only the first 4 are transmitted, and tx_ampdu_upload
> decrements down to 1. It stays at 1 until the DELBA comes, at which
> point it finally gets transmitted and tx_ampdu_upload decrements to 0.
> So it sounds like there's a USB bus problem. I've verified that this
> problem shows up on multiple machines.

Thanks.

So it looks like we need to ask whenever the USB transport
is reliable or not. Did you (by any change) also monitor
"usb_tx_anch_urbs" [And does it get stuck at some > 1 value
as well?]. I'm asking this because if the driver has more than
8 (=AR9170_NUM_TX_URBS) concurrently outgoing URBs, the
overflow is queued in tx_wait. Of course, the urb completion
handler (carl9170_usb_tx_data_complete) takes care of
delivering the next frame in the tx_wait line and so on...
But according to your report, this doesn't seem to work!
What's a bit odd is that the device is able to recover. Because
normally if there is an USB error, the endpoint will halt and
no traffic will get through it anymore [So the DELBA should be
stuck as well!].

When you were testing this at a different machine, did you
use the same cable/hub? Or did you plug it into the USB
port directly?

BTW: Is this a DWA-160 REV A1 or A2?

Regards,
Christian

2013-02-24 22:31:01

by Christian Lamparter

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Sunday, February 24, 2013 04:54:26 PM Alan Stern wrote:
> On Sat, 23 Feb 2013, Seth Forshee wrote:
> > On Sat, Feb 23, 2013 at 03:07:08PM +0100, Christian Lamparter wrote:
> > > usb_submit_urb() is async, so unless the URB data structure is
> > > bogus, the device is in the middle of a reset/is removed or OOM
> > > it won't return with an -ENUMBER.
> > >
> > > Since neither of us has probably access to an USB analyzer, the
> > > next best thing would be to enable ehci_hcd's debug facilities
> > > and check if the stuck frame produced any DataBufferErr, XactErr
> > > or something else.
> There aren't any occurrences of such errors in the usbmon trace.

Yes, I see no -EPROTO, -EPIPE (no stalled endpoints), -ECOMM, -EIMPORTANT...
either. Just the -EINPROGRESS, however this is perfectly OK!

> > I've found a couple of things here. First, I wasn't sure if I had tested
> > this on anything other than Ivy Bridge machines yet, so I tried it out
> > on an Arrandale box and it worked perfectly. lspci on the Ivy Bridge
> > boxes yields:
> >
> > 00:14.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller [8086:1e31] (rev 04) (prog-if 30 [XHCI])
> > 00:1a.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 [8086:1e2d] (rev 04) (prog-if 20 [EHCI])
> > 00:1d.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 [8086:1e26] (rev 04) (prog-if 20 [EHCI])
> >
> > And on the Arrandale box:
> >
> > 00:1a.0 USB controller [0c03]: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller [8086:3b3c] (rev 06) (prog-if 20 [EHCI])
> > 00:1d.0 USB controller [0c03]: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller [8086:3b34] (rev 06) (prog-if 20 [EHCI])
> >
> > Second, I collected the usbmon trace as suggested by Alan. I don't know
> > enough about USB to really read it, but normally the time between
> > submission and callback for a given urb is short. However, some are much
> > longer, e.g.:
> >
> > ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 190f0100 23232303 42b53600 82b11a00 01c02e00 6a00e846 c2ad3e00
> > ...
> > ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 >
> >
> > I checked the urb addresses for a couple of the stalled frames and in
> > each case found a matching urb in the usbmon trace with a similarly long
> > duration between submission and callback. I've uploaded the full trace
> > to:
> >
> > http://people.canonical.com/~sforshee/carl9170-usbmon.log

Seth,

Did you plug the DWA-160 device into a USB 3.0 port on your ivy bridge
system or does it react in the same way on the 'legacy' USB 2.0 ports as
well [so we are not comparing xhci (Ivy Bridge) to ehci (Arrandale)]?

> The usbmon trace indicates that the data gets delivered to the device
> as it should, but the device doesn't accept it for several seconds.

Looking at the logs, I find myself wondering how the "ffff88012fe19500"
urb-complete ninja'd right in between the ffff880146c8af00 xmit and
complete.

ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 190f0100 23232303 42b53600 82b11a00 01c02e00 6a00e846 c2ad3e00
...
ffff880146c8af00 1522200650 S Bo:3:003:1 -115 62 = 3e000000 01000500 03000000 00000000 00000000 00000000 22000$
ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 >
ffff880146c8af00 1522200756 C Bo:3:003:1 0 62 >

>From the device side, taking the data shouldn't be a problem. The
rx is handled by hardware dma. The data from the host is put into
packages of 320 bytes (The carl9170 firmware has about 180 to 190
of these 320 byte packages reserved for this purpose. So at no
point there should be any long delay because of lack of resources
or whatever). Also, it doesn't look the was any unusually high
load on the link. And the hardware can handle sustained wifi traffic
up to 160mbit/s (udp peaks at about 180mbit/s) without choking.

Alan,

Do you know if there a way to monitor whenever individual attempts
from the ehci/xhci hcd? Or can you think of any other "interesting"
bits that could help to explain why the "Arrandale box [...] worked
perfectly" whereas (all his) Ivy Bridge ones have problems.
(Of course, I assume that it is always the same device, the
same firmware and the same kernel drivers in all tests, right)?

BTW: My sandy-bridge laptop has the following usb controllers:
00:1a.0 USB controller [0c03]: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 [8086:1c2d] (rev 04)
00:1d.0 USB controller [0c03]: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 [8086:1c26] (rev 04)
19:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev 04)
And so far I haven't come across the described "gap-of-silence".
[my kernel: 3.8.0-wl+. Tested devices: WNDA3100v1 and TL-WN821Nv2].

Regards,
Christian

2013-02-22 22:54:59

by Seth Forshee

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Fri, Feb 22, 2013 at 11:07:37PM +0100, Christian Lamparter wrote:
> Hello Seth,
>
> On Friday 22 February 2013 21:50:44 Seth Forshee wrote:
> > I was given a D-Link wireless dongle by a colleague who said he was
> > having trouble with the connection stalling. One thing I noticed right
> > away when running iperf is that the tx BA sessions seem to be getting
> > stopped and restarted pretty frequently. I've been doing some debugging,
> > and here's what I'm seeing.
> >
> > On the air everything seems to go smoothly for a while, but then the
> > D-Link adapter stops transmitting data frames for a while. It still
> > sends CTS and BA frames in response to the AP, just no data frames.
> > Eventually it sends the action frame with the DELBA request, but
> > immediately before sending the action frame it sends a single data
> > frame. This pattern repeats each time this happens.
> >
> > That final data frame is a bit curious, so I added some tracing to
> > carl9170. It looks like this frame gets passed down to the hardware as
> > the last in a batch of A-MPDU frames, but for some reason the hardware
> > stops transmitting without sending this frame. carl9170 doesn't pass any
> > more A-MPDU frames to the hardware while tx for that frame is pending,
> > and when the action frame for the DELBA request finally comes along it
> > seems to get things moving again.
> >
> > Do you have any idea what might be causing the hardware to stop
> > transmitting with one frame left? Any suggestions on how to fix it?
> Do you also see messages like "invalid plcp cck rate.",
> "rx stream does not start with a first_mpdu frame tag." and/or
> "plcp info/frame tail is clipped" or messages about "driver
> reset"?

No, none of these messages.

> Can you tell me what version firmware are you using?

usb 3-3: firmware API: 1.9.6 2012-07-07

> If your driver is up to date, you could try the
> firmware attached to:
> <https://bugzilla.redhat.com/show_bug.cgi?id=831953>

This firmware exhibits the same behavior.

> About the "frame gets passed down..." thing. Do you know
> if this frame is received by the firmware or if it is
> stuck in the usb-pipe? (if you have DEBUGFS support, you
> can look at tx_ampdu_upload. If it's value stays the same
> (that is >= 1. If it drops to 0, then it should be fine)
> until the DELBA rolls around, then something is wrong with
> the USB BUS).

tx_ampdu_upload is one of the things I was monitoring. In one example,
frames with sequence numbers 1282 - 1286 are put in the tx_pending
queue, increasing tx_ampdu_upload to 5. All 5 frames are passed to
carl9170_usb_tx() (this is what I mean when I say the frame gets passed
down), but only the first 4 are transmitted, and tx_ampdu_upload
decrements down to 1. It stays at 1 until the DELBA comes, at which
point it finally gets transmitted and tx_ampdu_upload decrements to 0.

So it sounds like there's a USB bus problem. I've verified that this
problem shows up on multiple machines.

> Note: Unfortunately, Control frames (BA + CTS + ACK)
> don't share the tx queues with DATA or MGMT frames.
> These frames are generated within the bowls of the
> hardware MAC and they take "short paths" through
> the tx arbiter (this is all done in the silicon).
> [However, if you filter the monitor dumps again, you
> could look for nullfuncs or Probe Requests originating
> from the device. These should be a good indication
> whenever the hardware or the driver is on the fritz.]
>
> Note2: It's too early to tell anything yet ;-)
> But just in case you are planning to test the latest
> wireless-testing.git or compat-driver: Here's a patch
> which was just posted:
> <http://www.spinics.net/lists/linux-wireless/msg103866.html>
> It sorts out a disagreement between carl9170 and the
> latest minstrel_ht changes. Without it, you'll definitely
> get a very choppy connection (lots of WARNings and frame
> drop).

I'm currently testing with 3.8, so the minstrel_ht changes aren't an
issue.

Thanks,
Seth


2013-02-26 16:50:27

by Seth Forshee

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Tue, Feb 26, 2013 at 12:30:58AM +0100, Christian Lamparter wrote:
> On Monday, February 25, 2013 09:19:03 PM Alan Stern wrote:
> > On Mon, 25 Feb 2013, Seth Forshee wrote:
> > > > > >ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 ... <-- DATA
> > > > > >
> > > > > > [... long period where the device receives commands on EP4 and sends
> > > > > > wifi data to the host via EP2 - so it is working!]
> > > > > >
> > > > > >ffff880146c8af00 1522200650 S Bo:3:003:1 -115 62 = 3e000000 ... <-- DELBA
> > > > > >ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 > <-- DATA urb completion
> > > > > >ffff880146c8af00 1522200756 C Bo:3:003:1 0 62 > <-- DELBA urb completion
> >
> > > > Which kernel version are you testing under? Can you please recompile
> > > > with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and
> > > > send me dmesg? I should be able to see if there's an unhandled pending
> > > > event on the xHCI rings if the data URB stalls for longer than say, a
> > > > minute.
> > >
> > > I'll do this.
> >
> > Bear in mind that the usbmon trace shows the URB stalling for 2 - 3
> > seconds. There may have been other examples that went on for longer,
> > but probably nothing anywhere near as long as a minute.
> >
> ---
> This should open up the window for as long as we want it.
> The driver will now queue, but not deliver any frames to
> the device when there are unfinished urbs.

Here's dmesg with USB debug with the patch. I let it run for at least 5
minutes after the connection stalled.

http://people.canonical.com/~sforshee/dmesg-carl9170-usb-debug2.txt

I pasted lsusb for the device below as well.

Thanks,
Seth


Bus 003 Device 004: ID 07d1:3a09 D-Link System DWA-160 802.11abgn Xtreme
N Dual Band Adapter(rev.A2) [Atheros AR9170+AR9104]
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 2.00
bDeviceClass 255 Vendor Specific Class
bDeviceSubClass 255 Vendor Specific Subclass
bDeviceProtocol 255 Vendor Specific Protocol
bMaxPacketSize0 64
idVendor 0x07d1 D-Link System
idProduct 0x3a09 DWA-160 802.11abgn Xtreme N Dual Band Adapter(rev.A2) [Atheros AR9170+AR9104]
bcdDevice 1.07
iManufacturer 16 ATHER
iProduct 32 11n adapter
iSerial 48 12345
bNumConfigurations 1
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 46
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0x80
(Bus Powered)
MaxPower 500mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 4
bInterfaceClass 255 Vendor Specific Class
bInterfaceSubClass 0
bInterfaceProtocol 0
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x01 EP 1 OUT
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x82 EP 2 IN
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x83 EP 3 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0040 1x 64 bytes
bInterval 1
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x04 EP 4 OUT
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0040 1x 64 bytes
bInterval 1
Device Qualifier (for other device speed):
bLength 10
bDescriptorType 6
bcdUSB 2.00
bDeviceClass 255 Vendor Specific Class
bDeviceSubClass 255 Vendor Specific Subclass
bDeviceProtocol 255 Vendor Specific Protocol
bMaxPacketSize0 64
bNumConfigurations 0
Device Status: 0x0000
(Bus Powered)


2013-02-25 19:52:50

by Christian Lamparter

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Monday, February 25, 2013 08:46:23 PM Seth Forshee wrote:
> On Mon, Feb 25, 2013 at 11:13:57AM -0800, Sarah Sharp wrote:
> > Hmm, yeah, that kind of points to an Intel xHCI hardware issue. It's
> > too bad you don't have a USB analyzer (the high speed ones are about
> > $480). Can you send me a link so I can purchase the device and test it
> > with my analyzer?
>
> Hmm. This looks the same, but it sounds like there are multiple hardware
> versions sold under the same name so I'm not sure. The one I have is
> hardware version A2.
>
> http://www.amazon.com/D-Link-DWA-160-Extreme-N-Dual-Band-802-11n/dp/B00127OVHI
>
Or this one:
<http://www.ebay.com/itm/D-Link-DWA-160-300Mbps-Xtreme-N-Duo-Band-Wireless-USB-Adapter-Win7-Vista-XP-/180869622792>

In the description they say it's A2 (so it should be the same Seth is using)
(Note: The "current" DWA-160 have RaLink/MediaTek chips)

Regards,
Chr

2013-02-24 04:52:29

by Seth Forshee

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Sat, Feb 23, 2013 at 03:07:08PM +0100, Christian Lamparter wrote:
> usb_submit_urb() is async, so unless the URB data structure is
> bogus, the device is in the middle of a reset/is removed or OOM
> it won't return with an -ENUMBER.
>
> Since neither of us has probably access to an USB analyzer, the
> next best thing would be to enable ehci_hcd's debug facilities
> and check if the stuck frame produced any DataBufferErr, XactErr
> or something else.

I've found a couple of things here. First, I wasn't sure if I had tested
this on anything other than Ivy Bridge machines yet, so I tried it out
on an Arrandale box and it worked perfectly. lspci on the Ivy Bridge
boxes yields:

00:14.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller [8086:1e31] (rev 04) (prog-if 30 [XHCI])
00:1a.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 [8086:1e2d] (rev 04) (prog-if 20 [EHCI])
00:1d.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 [8086:1e26] (rev 04) (prog-if 20 [EHCI])

And on the Arrandale box:

00:1a.0 USB controller [0c03]: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller [8086:3b3c] (rev 06) (prog-if 20 [EHCI])
00:1d.0 USB controller [0c03]: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller [8086:3b34] (rev 06) (prog-if 20 [EHCI])

Second, I collected the usbmon trace as suggested by Alan. I don't know
enough about USB to really read it, but normally the time between
submission and callback for a given urb is short. However, some are much
longer, e.g.:

ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 190f0100 23232303 42b53600 82b11a00 01c02e00 6a00e846 c2ad3e00
...
ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 >

I checked the urb addresses for a couple of the stalled frames and in
each case found a matching urb in the usbmon trace with a similarly long
duration between submission and callback. I've uploaded the full trace
to:

http://people.canonical.com/~sforshee/carl9170-usbmon.log

> Also, we should check what the device is doing. The hardware has
> an (Faraday Tech) FUSB200 PHY. It's initialized and partially
> controlled by the carl9170 firmware.
> (fw source is available at <https://github.com/chunkeey/carl9170fw>).

I probably won't get a chance to play with the firmware until Monday.

Thanks,
Seth


2013-02-24 15:54:27

by Alan Stern

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Sat, 23 Feb 2013, Seth Forshee wrote:

> On Sat, Feb 23, 2013 at 03:07:08PM +0100, Christian Lamparter wrote:
> > usb_submit_urb() is async, so unless the URB data structure is
> > bogus, the device is in the middle of a reset/is removed or OOM
> > it won't return with an -ENUMBER.
> >
> > Since neither of us has probably access to an USB analyzer, the
> > next best thing would be to enable ehci_hcd's debug facilities
> > and check if the stuck frame produced any DataBufferErr, XactErr
> > or something else.

There aren't any occurrences of such errors in the usbmon trace.

> I've found a couple of things here. First, I wasn't sure if I had tested
> this on anything other than Ivy Bridge machines yet, so I tried it out
> on an Arrandale box and it worked perfectly. lspci on the Ivy Bridge
> boxes yields:
>
> 00:14.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller [8086:1e31] (rev 04) (prog-if 30 [XHCI])
> 00:1a.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 [8086:1e2d] (rev 04) (prog-if 20 [EHCI])
> 00:1d.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 [8086:1e26] (rev 04) (prog-if 20 [EHCI])
>
> And on the Arrandale box:
>
> 00:1a.0 USB controller [0c03]: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller [8086:3b3c] (rev 06) (prog-if 20 [EHCI])
> 00:1d.0 USB controller [0c03]: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller [8086:3b34] (rev 06) (prog-if 20 [EHCI])
>
> Second, I collected the usbmon trace as suggested by Alan. I don't know
> enough about USB to really read it, but normally the time between
> submission and callback for a given urb is short. However, some are much
> longer, e.g.:
>
> ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 190f0100 23232303 42b53600 82b11a00 01c02e00 6a00e846 c2ad3e00
> ...
> ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 >
>
> I checked the urb addresses for a couple of the stalled frames and in
> each case found a matching urb in the usbmon trace with a similarly long
> duration between submission and callback. I've uploaded the full trace
> to:
>
> http://people.canonical.com/~sforshee/carl9170-usbmon.log

The usbmon trace indicates that the data gets delivered to the device
as it should, but the device doesn't accept it for several seconds.

Alan Stern


2013-02-25 19:46:29

by Seth Forshee

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Mon, Feb 25, 2013 at 11:13:57AM -0800, Sarah Sharp wrote:
> On Mon, Feb 25, 2013 at 05:03:40PM +0100, Christian Lamparter wrote:
> > On Monday, February 25, 2013 04:29:55 PM Alan Stern wrote:
> > > On Mon, 25 Feb 2013, Christian Lamparter wrote:
> > One detail that might be important (to keep in mind):
> >
> > Original report <http://www.spinics.net/lists/linux-wireless/msg103880.html>:
> > > On the air everything seems to go smoothly for a while, but
> > > then the D-Link adapter stops transmitting *DATA* frames for a while. [...]
> > > Eventually it sends the action frame with the *DELBA* request, but
> > > immediately before sending the action frame it sends a single *DATA*
> > > frame (1). This pattern repeats each time this happens.
> >
> > Now if we take this and apply it to the usbmon recording in:
> > <http://www.spinics.net/lists/linux-wireless/msg103915.html>
> >
> > > Normally the time between submission and callback for a given urb
> > > is short. However, some are much longer, e.g.:
> > >
> > >ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 ... <-- DATA
> > >
> > > [... long period where the device receives commands on EP4 and sends
> > > wifi data to the host via EP2 - so it is working!]
> > >
> > >ffff880146c8af00 1522200650 S Bo:3:003:1 -115 62 = 3e000000 ... <-- DELBA
> > >ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 > <-- DATA urb completion
> > >ffff880146c8af00 1522200756 C Bo:3:003:1 0 62 > <-- DELBA urb completion
> >
> > It would mean that the (delayed) urb with the *DATA* frame urb was not
> > sent (?or received?) by the usb dongle until the *DELBA* came along (1)
> > and triggered the TX for both (in quick succession). So, I think we
> > should be looking for lost/unhandled interrupts/events.
> >
> > One more thing: So far this issue only occurs with:
> > 00:14.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller [8086:1e31] (rev 04) (prog-if 30 [XHCI])
>
> Which kernel version are you testing under? Can you please recompile
> with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and
> send me dmesg? I should be able to see if there's an unhandled pending
> event on the xHCI rings if the data URB stalls for longer than say, a
> minute.

I'll do this.

> As for lost interrupts, the only issue we had with that particular xHCI
> host had to do with lost isochronous transfers. We fixed that with a
> quirk to avoid the BEI flag on that platform. But your data is being
> sent via a bulk endpoint, so we shouldn't have any issues there.
>
> > However, it not all xhci-hcd are affected. I have not seen this with
> > the NEC Corporation uPD720200 I have in my sandy bridge laptop:
> > 19:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev 04)
>
> Hmm, yeah, that kind of points to an Intel xHCI hardware issue. It's
> too bad you don't have a USB analyzer (the high speed ones are about
> $480). Can you send me a link so I can purchase the device and test it
> with my analyzer?

Hmm. This looks the same, but it sounds like there are multiple hardware
versions sold under the same name so I'm not sure. The one I have is
hardware version A2.

http://www.amazon.com/D-Link-DWA-160-Extreme-N-Dual-Band-802-11n/dp/B00127OVHI

Seth


2013-02-23 06:46:05

by Seth Forshee

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Sat, Feb 23, 2013 at 12:48:51AM +0100, Christian Lamparter wrote:
> On Friday 22 February 2013 23:54:54 Seth Forshee wrote:
> > On Fri, Feb 22, 2013 at 11:07:37PM +0100, Christian Lamparter wrote:
> > > About the "frame gets passed down..." thing. Do you know
> > > if this frame is received by the firmware or if it is
> > > stuck in the usb-pipe? (if you have DEBUGFS support, you
> > > can look at tx_ampdu_upload. If it's value stays the same
> > > (that is >= 1. If it drops to 0, then it should be fine)
> > > until the DELBA rolls around, then something is wrong with
> > > the USB BUS).
> >
> > tx_ampdu_upload is one of the things I was monitoring. In one example,
> > frames with sequence numbers 1282 - 1286 are put in the tx_pending
> > queue, increasing tx_ampdu_upload to 5. All 5 frames are passed to
> > carl9170_usb_tx() (this is what I mean when I say the frame gets passed
> > down), but only the first 4 are transmitted, and tx_ampdu_upload
> > decrements down to 1. It stays at 1 until the DELBA comes, at which
> > point it finally gets transmitted and tx_ampdu_upload decrements to 0.
> > So it sounds like there's a USB bus problem. I've verified that this
> > problem shows up on multiple machines.
>
> Thanks.
>
> So it looks like we need to ask whenever the USB transport
> is reliable or not. Did you (by any change) also monitor
> "usb_tx_anch_urbs" [And does it get stuck at some > 1 value
> as well?]. I'm asking this because if the driver has more than
> 8 (=AR9170_NUM_TX_URBS) concurrently outgoing URBs, the
> overflow is queued in tx_wait. Of course, the urb completion
> handler (carl9170_usb_tx_data_complete) takes care of
> delivering the next frame in the tx_wait line and so on...
> But according to your report, this doesn't seem to work!
> What's a bit odd is that the device is able to recover. Because
> normally if there is an USB error, the endpoint will halt and
> no traffic will get through it anymore [So the DELBA should be
> stuck as well!].

The carl9170 usb code seems to be working properly. If tx_anch_urbs
reaches 8 the overflow is queued in tx_wait as you said, and the next
queued frame gets delivered from carl9170_usb_tx_data_complete(). The
stuck frame does get passed to a successful usb_submit_urb() call before
tx stops, but it still isn't transmitted until the DELBA comes along
(and tx_anch_urbs decrements to 1 and then gets stuck there while tx is
stalled, as would be expected).

> When you were testing this at a different machine, did you
> use the same cable/hub? Or did you plug it into the USB
> port directly?

In both cases the adapter was plugged directly into the machine's USB
ports.

> BTW: Is this a DWA-160 REV A1 or A2?

A2

Thanks,
Seth

2013-02-25 15:29:56

by Alan Stern

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Mon, 25 Feb 2013, Christian Lamparter wrote:

> > In fact this is both normal and required. Packets to any particular
> > endpoint must always be delivered in order. Therefore the URBs have
> > to complete in the same order as they were submitted.
> Yes, I know that ;). I guess I should have said: "It's strange that
> after such a long silence the urb tx trigger at 1519981417 seemd to
> unfreeze the pending urb. It's almost as if a urb completion event
> was lost and the urb_complete just had to wait until another tx urb
> on the same ep went by to free it.

That's quite possible. The new URB may trigger something in the
xhci-hcd driver or in the xHCI hardware. Sarah (CC'ed), the maintainer
for xhci-hcd, is the person who would know best.

> > > From the device side, taking the data shouldn't be a problem. The
> > > rx is handled by hardware dma. The data from the host is put into
> > > packages of 320 bytes (The carl9170 firmware has about 180 to 190
> > > of these 320 byte packages reserved for this purpose. So at no
> > > point there should be any long delay because of lack of resources
> > > or whatever). Also, it doesn't look the was any unusually high
> > > load on the link. And the hardware can handle sustained wifi traffic
> > > up to 160mbit/s (udp peaks at about 180mbit/s) without choking.
> > >
> > > Or can you think of any other "interesting"
> > > bits that could help to explain why the "Arrandale box [...] worked
> > > perfectly" whereas (all his) Ivy Bridge ones have problems.
> > > (Of course, I assume that it is always the same device, the
> > > same firmware and the same kernel drivers in all tests, right)?
> >
> > No, not really. Unless one is using USB-2 and the other USB-3 -- the
> > device might have a bug in its USB-3 firmware.
> Don't worry, the device was designed about 5 years ago. Hence, we don't
> need to care about any USB 3.0 features.

Since the device firmware is therefore unlikely to be the cause of the
problem, it now seems a lot more likely that the problem is in the xHCI
hardware or driver.

> > What happens if xhci-hcd is unloaded before the test?
> Isn't xhci-hcd needed to drive the usb 3.0 ports? I know about the hub
> concept with uhci/ohci and ehci, but I thought they did away with that.

Yes. Without xhci-hcd, nothing will happen when a device is plugged
into a USB-3 port. (Although on some systems, ports are wired up as
both USB-2 and USB-3, so when xhci-hcd isn't loaded the ports are
managed by ehci-hcd.)

Never mind; when I wrote the question it wasn't clear that the failures
occurred only when the device was attached to a USB-3 port.

Alan Stern


2013-02-24 23:41:07

by Alan Stern

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Sun, 24 Feb 2013, Christian Lamparter wrote:

> > The usbmon trace indicates that the data gets delivered to the device
> > as it should, but the device doesn't accept it for several seconds.
>
> Looking at the logs, I find myself wondering how the "ffff88012fe19500"
> urb-complete ninja'd right in between the ffff880146c8af00 xmit and
> complete.
>
> ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 190f0100 23232303 42b53600 82b11a00 01c02e00 6a00e846 c2ad3e00
> ...
> ffff880146c8af00 1522200650 S Bo:3:003:1 -115 62 = 3e000000 01000500 03000000 00000000 00000000 00000000 22000$
> ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 >
> ffff880146c8af00 1522200756 C Bo:3:003:1 0 62 >

In fact this is both normal and required. Packets to any particular
endpoint must always be delivered in order. Therefore the URBs have to
complete in the same order as they were submitted.

> From the device side, taking the data shouldn't be a problem. The
> rx is handled by hardware dma. The data from the host is put into
> packages of 320 bytes (The carl9170 firmware has about 180 to 190
> of these 320 byte packages reserved for this purpose. So at no
> point there should be any long delay because of lack of resources
> or whatever). Also, it doesn't look the was any unusually high
> load on the link. And the hardware can handle sustained wifi traffic
> up to 160mbit/s (udp peaks at about 180mbit/s) without choking.
>
> Alan,
>
> Do you know if there a way to monitor whenever individual attempts
> from the ehci/xhci hcd?

Only with a USB bus analyzer. Unfortuantely these things tend to be
expensive, especially if you want one that works at SuperSpeed.

> Or can you think of any other "interesting"
> bits that could help to explain why the "Arrandale box [...] worked
> perfectly" whereas (all his) Ivy Bridge ones have problems.
> (Of course, I assume that it is always the same device, the
> same firmware and the same kernel drivers in all tests, right)?

No, not really. Unless one is using USB-2 and the other USB-3 -- the
device might have a bug in its USB-3 firmware.

What happens if xhci-hcd is unloaded before the test?

Alan Stern


2013-02-25 15:04:20

by Christian Lamparter

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Monday, February 25, 2013 12:41:06 AM Alan Stern wrote:
> On Sun, 24 Feb 2013, Christian Lamparter wrote:
>
> > > The usbmon trace indicates that the data gets delivered to the device
> > > as it should, but the device doesn't accept it for several seconds.
> >
> > Looking at the logs, I find myself wondering how the "ffff88012fe19500"
> > urb-complete ninja'd right in between the ffff880146c8af00 xmit and
> > complete.
> >
> > ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 190f0100 23232303 42b53600 82b11a00 01c02e00 6a00e846 c2ad3e00
> > ...
> > ffff880146c8af00 1522200650 S Bo:3:003:1 -115 62 = 3e000000 01000500 03000000 00000000 00000000 00000000 22000$
> > ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 >
> > ffff880146c8af00 1522200756 C Bo:3:003:1 0 62 >
>
> In fact this is both normal and required. Packets to any particular
> endpoint must always be delivered in order. Therefore the URBs have
> to complete in the same order as they were submitted.
Yes, I know that ;). I guess I should have said: "It's strange that
after such a long silence the urb tx trigger at 1519981417 seemd to
unfreeze the pending urb. It's almost as if a urb completion event
was lost and the urb_complete just had to wait until another tx urb
on the same ep went by to free it.

> > From the device side, taking the data shouldn't be a problem. The
> > rx is handled by hardware dma. The data from the host is put into
> > packages of 320 bytes (The carl9170 firmware has about 180 to 190
> > of these 320 byte packages reserved for this purpose. So at no
> > point there should be any long delay because of lack of resources
> > or whatever). Also, it doesn't look the was any unusually high
> > load on the link. And the hardware can handle sustained wifi traffic
> > up to 160mbit/s (udp peaks at about 180mbit/s) without choking.
> >
> > Or can you think of any other "interesting"
> > bits that could help to explain why the "Arrandale box [...] worked
> > perfectly" whereas (all his) Ivy Bridge ones have problems.
> > (Of course, I assume that it is always the same device, the
> > same firmware and the same kernel drivers in all tests, right)?
>
> No, not really. Unless one is using USB-2 and the other USB-3 -- the
> device might have a bug in its USB-3 firmware.
Don't worry, the device was designed about 5 years ago. Hence, we don't
need to care about any USB 3.0 features.

> What happens if xhci-hcd is unloaded before the test?
Isn't xhci-hcd needed to drive the usb 3.0 ports? I know about the hub
concept with uhci/ohci and ehci, but I thought they did away with that.

Regards,
Chr

2013-02-25 14:44:30

by Seth Forshee

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Sun, Feb 24, 2013 at 11:30:56PM +0100, Christian Lamparter wrote:
> Did you plug the DWA-160 device into a USB 3.0 port on your ivy bridge
> system or does it react in the same way on the 'legacy' USB 2.0 ports as
> well [so we are not comparing xhci (Ivy Bridge) to ehci (Arrandale)]?

Those two machines only have USB 3.0 ports, but I did manage to find an
Ivy Bridge machine with both. On that machine I get a stable connection
on the USB 2.0 ports and a flaky connection on the USB 3.0 ports. I
didn't actually sniff the frames, but the behavior is consistent with
what I saw on the other machines.

Seth


2013-02-25 16:03:45

by Christian Lamparter

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Monday, February 25, 2013 04:29:55 PM Alan Stern wrote:
> On Mon, 25 Feb 2013, Christian Lamparter wrote:
> > > In fact this is both normal and required. Packets to any particular
> > > endpoint must always be delivered in order. Therefore the URBs have
> > > to complete in the same order as they were submitted.
> > Yes, I know that ;). I guess I should have said: "It's strange that
> > after such a long silence the urb tx trigger at 1519981417 seemd to
> > unfreeze the pending urb. It's almost as if a urb completion event
> > was lost and the urb_complete just had to wait until another tx urb
> > on the same ep went by to free it.
>
> That's quite possible. The new URB may trigger something in the
> xhci-hcd driver or in the xHCI hardware. Sarah (CC'ed), the maintainer
> for xhci-hcd, is the person who would know best.

Thanks (Hello Sarah).

One detail that might be important (to keep in mind):

Original report <http://www.spinics.net/lists/linux-wireless/msg103880.html>:
> On the air everything seems to go smoothly for a while, but
> then the D-Link adapter stops transmitting *DATA* frames for a while. [...]
> Eventually it sends the action frame with the *DELBA* request, but
> immediately before sending the action frame it sends a single *DATA*
> frame (1). This pattern repeats each time this happens.

Now if we take this and apply it to the usbmon recording in:
<http://www.spinics.net/lists/linux-wireless/msg103915.html>

> Normally the time between submission and callback for a given urb
> is short. However, some are much longer, e.g.:
>
>ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 ... <-- DATA
>
> [... long period where the device receives commands on EP4 and sends
> wifi data to the host via EP2 - so it is working!]
>
>ffff880146c8af00 1522200650 S Bo:3:003:1 -115 62 = 3e000000 ... <-- DELBA
>ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 > <-- DATA urb completion
>ffff880146c8af00 1522200756 C Bo:3:003:1 0 62 > <-- DELBA urb completion

It would mean that the (delayed) urb with the *DATA* frame urb was not
sent (?or received?) by the usb dongle until the *DELBA* came along (1)
and triggered the TX for both (in quick succession). So, I think we
should be looking for lost/unhandled interrupts/events.

One more thing: So far this issue only occurs with:
00:14.0 USB controller [0c03]: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller [8086:1e31] (rev 04) (prog-if 30 [XHCI])

However, it not all xhci-hcd are affected. I have not seen this with
the NEC Corporation uPD720200 I have in my sandy bridge laptop:
19:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev 04)

Regards,
Chr

2013-03-07 23:37:52

by Sarah Sharp

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Thu, Mar 07, 2013 at 11:46:23AM -0600, Seth Forshee wrote:
> Hi Sarah,
>
> Just wondering if you've had a chance to take a look at this yet.
>
> Thanks,
> Seth

I have the device sitting in my cube, but I haven't had a chance to hook
it up to the analyzer yet. I'm going to be out mid-next week as well,
for a conference, so it's not likely I will have an answer until two
weeks from now.

Sarah Sharp

2013-03-07 17:46:28

by Seth Forshee

[permalink] [raw]
Subject: Re: carl9170 A-MPDU transmit problem

On Tue, Feb 26, 2013 at 10:50:21AM -0600, Seth Forshee wrote:
> On Tue, Feb 26, 2013 at 12:30:58AM +0100, Christian Lamparter wrote:
> > On Monday, February 25, 2013 09:19:03 PM Alan Stern wrote:
> > > On Mon, 25 Feb 2013, Seth Forshee wrote:
> > > > > > >ffff88012fe19500 1519981417 S Bo:3:003:1 -115 126 = 7e000000 ... <-- DATA
> > > > > > >
> > > > > > > [... long period where the device receives commands on EP4 and sends
> > > > > > > wifi data to the host via EP2 - so it is working!]
> > > > > > >
> > > > > > >ffff880146c8af00 1522200650 S Bo:3:003:1 -115 62 = 3e000000 ... <-- DELBA
> > > > > > >ffff88012fe19500 1522200720 C Bo:3:003:1 0 126 > <-- DATA urb completion
> > > > > > >ffff880146c8af00 1522200756 C Bo:3:003:1 0 62 > <-- DELBA urb completion
> > >
> > > > > Which kernel version are you testing under? Can you please recompile
> > > > > with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and
> > > > > send me dmesg? I should be able to see if there's an unhandled pending
> > > > > event on the xHCI rings if the data URB stalls for longer than say, a
> > > > > minute.
> > > >
> > > > I'll do this.
> > >
> > > Bear in mind that the usbmon trace shows the URB stalling for 2 - 3
> > > seconds. There may have been other examples that went on for longer,
> > > but probably nothing anywhere near as long as a minute.
> > >
> > ---
> > This should open up the window for as long as we want it.
> > The driver will now queue, but not deliver any frames to
> > the device when there are unfinished urbs.
>
> Here's dmesg with USB debug with the patch. I let it run for at least 5
> minutes after the connection stalled.
>
> http://people.canonical.com/~sforshee/dmesg-carl9170-usb-debug2.txt

Hi Sarah,

Just wondering if you've had a chance to take a look at this yet.

Thanks,
Seth