2011-12-31 14:08:01

by Andreas Hartmann

[permalink] [raw]
Subject: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Hello,

I tested compat-wireless-3.2-rc6-3 with kernel 3.1 and a rt2860 device
running in AP mode (80211.n - 40 MHz). During

netperf -t TCP_STREAM -H ap

the data stream mostly stalls (-> the device receives data) after a few
seconds and never comes up any more. The connection is completely death.
I have to restart wpa_supplicant to get it working again.



I tried to isolate the problem the following way:

1. Running compat-wireless-3.2-rc6-3 with the rt2x00 level from
compat-wireless-3.1.1-1. The problem didn't gone.

2. Running compat-wireless-3.1.1-1 with the rt2x00 based changes in
compat-wireless-3.2-rc6-3. I can't see the problem any more.


That's why I think, that the problem must be outside the rt2x00 changes.
But there are many changes outside :-).


So far, I removed the following patches from compat-wireless-3.2-rc6-3:

mac80211: fix race between the AGG SM and the Tx data path
mac80211: don't stop a single aggregation session twice
mac80211: uAPSD - fix IEEE80211_FCTL_MOREDATA bit setting
mac80211: fix race condition caused by late addBA response

But they seem not to be the culprit.


I tried to remove these patches, too, but this wasn't possible:

mac80211: don't stop a single aggregation session twice
mac80211: fix another race in aggregation start



Does anybody has an idea, which patch could have raised this problem?



Kind regards,
Andreas Hartmann


2011-12-31 20:37:30

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Andreas Hartmann schrieb:
> Hello,
>
> I tested compat-wireless-3.2-rc6-3 with kernel 3.1 and a rt2860 device
> running in AP mode (80211.n - 40 MHz). During
>
> netperf -t TCP_STREAM -H ap
>
> the data stream mostly stalls (-> the device receives data) after a few
> seconds and never comes up any more. The connection is completely death.
> I have to restart wpa_supplicant to get it working again.
>
>
>
> I tried to isolate the problem the following way:
>
> 1. Running compat-wireless-3.2-rc6-3 with the rt2x00 level from
> compat-wireless-3.1.1-1. The problem didn't gone.
>
> 2. Running compat-wireless-3.1.1-1 with the rt2x00 based changes in
> compat-wireless-3.2-rc6-3. I can't see the problem any more.
>
>
> That's why I think, that the problem must be outside the rt2x00 changes.
> But there are many changes outside :-).


Removing this patch

mac80211: retry sending failed BAR frames later instead of tearing down
aggr (http://www.spinics.net/lists/linux-wireless/msg76379.html) makes
it working again.

Device is: RaLink RT2800 802.11n PCI


Kind regards,
Andreas Hartmann

2012-01-23 13:29:09

by Helmut Schaa

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Fri, Jan 20, 2012 at 8:29 AM, Andreas Hartmann
<[email protected]> wrote:
> All in one: Your workaround was a good idea, but unfortunately it
> doesn't "solve" / workaround the problem introduced with the patch
> "mac80211: retry sending failed BAR frames later instead of tearing
> down aggr" [1]. I don't know off any other solution at this time as to revert it.

Ok, so we've got at least one hw issue here: rt2800 is not able to report
correct ACK state of BARs :(

The interesting bit is that the legacy drivers tear the BA session down as
soon as the first AMPDU subframe failed. That's the same behavior as
mac80211 had before the mentioned patch.

So, let's assume the generic way Felix implemented works for other devices
(like ath9k) we could go with a workaround inside rt2x00 to tear down
the BA session as soon as the first AMPDU frame failed.

Mind to try that one (not even compile tested though :) ).

Signed-off-by: Helmut Schaa <[email protected]>
---
diff --git a/drivers/net/wireless/rt2x00/rt2x00dev.c
b/drivers/net/wireless/rt2x00/rt2x00dev.c
index c930ce0..9038934 100644
--- a/drivers/net/wireless/rt2x00/rt2x00dev.c
+++ b/drivers/net/wireless/rt2x00/rt2x00dev.c
@@ -420,8 +420,16 @@ void rt2x00lib_txdone(struct queue_entry *entry,
tx_info->status.ampdu_len = 1;
tx_info->status.ampdu_ack_len = success ? 1 : 0;

- if (!success)
- tx_info->flags |= IEEE80211_TX_STAT_AMPDU_NO_BACK;
+ if (!success && ieee80211_is_data_qos(hdr->frame_control)) {
+ /*
+ * Tear down BA session
+ */
+ struct ieee80211_sta *sta = tx_info->control.sta;
+ u8 *qc = ieee80211_get_qos_ctl(hdr);
+ int tid = *qc & IEEE80211_QOS_CTL_TID_MASK;
+
+ ieee80211_stop_tx_ba_session(sta, tid);
+ }
}

if (rate_flags & IEEE80211_TX_RC_USE_RTS_CTS) {

2012-01-04 12:31:44

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Helmut Schaa schrieb:

[...]

> What's the client device connected to the rt2x00 AP? Mind to send a
> 802.11 capture wen this stall happens?

I sent Helmut a trace, which hopefully contains some hints.


Regards,
Andreas

2012-01-24 14:39:36

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Tue, Jan 24, 2012 at 09:07:04AM +0100, Helmut Schaa wrote:
> > but the machine does crash anyway. Could hdr be null, too?
>
> Ok, I'll do some testing as soon as time allows and get back to you then.

Not sure how good idea is is, but perhaps we could introduce mac80211 flag
that will disable behaviour of f0425bed commit for drivers, which will
set that flag.

Stanislaw

2012-01-23 16:06:12

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Helmut Schaa schrieb:
> On Fri, Jan 20, 2012 at 8:29 AM, Andreas Hartmann
> <[email protected]> wrote:
>> All in one: Your workaround was a good idea, but unfortunately it
>> doesn't "solve" / workaround the problem introduced with the patch
>> "mac80211: retry sending failed BAR frames later instead of tearing
>> down aggr" [1]. I don't know off any other solution at this time as to revert it.
>
> Ok, so we've got at least one hw issue here: rt2800 is not able to report
> correct ACK state of BARs :(
>
> The interesting bit is that the legacy drivers tear the BA session down as
> soon as the first AMPDU subframe failed. That's the same behavior as
> mac80211 had before the mentioned patch.
>
> So, let's assume the generic way Felix implemented works for other devices
> (like ath9k) we could go with a workaround inside rt2x00 to tear down
> the BA session as soon as the first AMPDU frame failed.
>
> Mind to try that one (not even compile tested though :) ).

hdr is missing. I tried to derive it like this:

struct ieee80211_hdr *hdr=(void*)entry->skb->data;

Unfortunately, it crashes the machine after short time of operation.
Therefore I think that it isn't derived correctly!


Regards,
Andreas

>
> Signed-off-by: Helmut Schaa <[email protected]>
> ---
> diff --git a/drivers/net/wireless/rt2x00/rt2x00dev.c
> b/drivers/net/wireless/rt2x00/rt2x00dev.c
> index c930ce0..9038934 100644
> --- a/drivers/net/wireless/rt2x00/rt2x00dev.c
> +++ b/drivers/net/wireless/rt2x00/rt2x00dev.c
> @@ -420,8 +420,16 @@ void rt2x00lib_txdone(struct queue_entry *entry,
> tx_info->status.ampdu_len = 1;
> tx_info->status.ampdu_ack_len = success ? 1 : 0;
>
> - if (!success)
> - tx_info->flags |= IEEE80211_TX_STAT_AMPDU_NO_BACK;
> + if (!success && ieee80211_is_data_qos(hdr->frame_control)) {
> + /*
> + * Tear down BA session
> + */
> + struct ieee80211_sta *sta = tx_info->control.sta;
> + u8 *qc = ieee80211_get_qos_ctl(hdr);
> + int tid = *qc & IEEE80211_QOS_CTL_TID_MASK;
> +
> + ieee80211_stop_tx_ba_session(sta, tid);
> + }
> }
>
> if (rate_flags & IEEE80211_TX_RC_USE_RTS_CTS) {


2012-01-10 11:59:26

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Helmut Schaa schrieb:
> On Tue, Jan 10, 2012 at 9:03 AM, Andreas Hartmann
> <[email protected]> wrote:
>> Meanwhile, I took a look at the tx_path with Felix patch applied. I'm
>> getting this output:
>>
>>
>> -> netperf start
>>
>> [38139.839028] Open BA session requested for 00:25:9c:de:4e:30 tid 0
>> [38139.848151] IEEE80211_AMPDU_TX_START
>> [38139.848157] activated addBA response timer on tid 0
>> [38139.849723] switched off addBA timer for tid 0
>> [38139.849728] Aggregation is on for tid 0
>>
>> -> stall
>>
>> [38161.686305] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 1
>> [38161.698957] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 2
>>
>> [38161.698963] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
>> [38161.699116] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
>> [38161.710238] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 3
>> [38161.710240] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
>> [38161.710394] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
>> [38161.724512] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 4
>> [38161.724519] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
>> [38161.724659] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
>> [38161.735416] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 5
>> [38161.735423] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
>> [38161.735486] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
>> [38161.747815] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 6
>> [38161.747822] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
>> [38161.747888] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
>> [38161.755268] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 7
>> [38161.755271] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
>> [38161.758823] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
>> [38161.772898] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 8
>> [38161.772901] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
>> [38161.773049] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
>> [38161.789222] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 9
>> ...
>>
>> Looks as if sending of any package is reported as broken at some
>> point of time and this circle cannot be left anymore.
>
> Not necessarily since mac80211 will only retry the BAR if a data frame
> transmission was successful.

Ok - this can't be seen in the trace as I just traced channel 1 - there
could be some data transfer outside of channel 1.

But if I look at the debug output, I fear, that at some point the tx is
broken completely as I can see at the beginning of each stall solely
tx failures.

> Hence, it seems as if only the BARs TX
> status is reported incorrectly ...

Seems so. Because each BAR can be seen in the trace and is responded by
the STA.

> I see two issues here:
> 1) rt2800pci seems to have problems delivering the BAR _or_ doesn't
> report the tx status correctly

I think, that there is a problem with the tx status, as if I can see the
BAR in the trace and even the response of the STA (so the BAR was
probably ok), there really seems to be a problem with the tx status.

> 2) If the same BAR fails consecutively we should maybe really tear down
> the BA session as it was done before
>
> 2 is done in the below untested patch and will also work around 1.
> Mind to give it a try?

Yes :-)

The workaround patch works fine for me (at a first glance - I didn't
check for any side effects yet). The stall doesn't happen any more. But
I'm not really sure, why it works:

At the beginning, you just can see 9 tx failures. After a few tx
failures, you can see 7 BAR's, which are pending, but never passed to
ieee80211_send_bar().

- Why are they suddenly ok (are they really ok)? Or is it a logging
problem (too much input at the same time)?


At session 38176, you can see your patch at work :-).


[27229.720767] rt2800_txdone_entry - tx failed
[27229.720791] rt2800_txdone_entry - tx failed
[27229.720800] rt2800_txdone_entry - tx failed
[27229.720807] rt2800_txdone_entry - tx failed
[27229.720814] rt2800_txdone_entry - tx failed
[27229.720821] rt2800_txdone_entry - tx failed
[27229.720828] rt2800_txdone_entry - tx failed
[27229.720835] rt2800_txdone_entry - tx failed
[27229.727130] rt2800_txdone_entry - tx failed
[27229.727133] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38064>
[27229.734419] rt2800_txdone_entry - tx failed
[27229.734431] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38080>
[27229.741731] rt2800_txdone_entry - tx failed
[27229.741741] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38096>
[27229.749097] rt2800_txdone_entry - tx failed
[27229.749107] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38112>
[27229.758136] rt2800_txdone_entry - tx failed
[27229.758143] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38128>
[27229.765453] rt2800_txdone_entry - tx failed
[27229.765461] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38144>
[27229.771836] rt2800_txdone_entry - tx failed
[27229.771843] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38160>
[27229.779281] rt2800_txdone_entry - tx failed
[27229.779290] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38176>

[27229.779315] ieee80211_check_pending_bar -> ieee80211_send_bar() retry cnt: 2 tid: 0 failed_bar_ssn: 38176
[27229.786705] rt2800_txdone_entry - tx failed
[27229.786714] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38176>
[27229.793439] ieee80211_check_pending_bar -> ieee80211_send_bar() retry cnt: 1 tid: 0 failed_bar_ssn: 38176

[27229.793452] ieee80211_check_pending_bar -> ieee80211_stop_tx_ba_session() retry cnt: 0 tid: 0 failed_bar_ssn: 38176

[27229.793476] Tx BA session stop requested for 00:25:9c:11:22:33 tid 0
[27229.802082] Stopping Tx BA session for 00:25:9c:11:22:33 tid 0

[27229.807455] rt2800_txdone_entry - tx failed
[27229.807465] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38176>

[27229.816023] Open BA session requested for 00:25:9c:11:22:33 tid 0
[27229.834077] activated addBA response timer on tid 0
[27229.835517] switched off addBA timer for tid 0
[27229.835522] Aggregation is on for tid 0



The following test is done with reduced debugging (without the tx
failed entries). But I can see the set_bar_pending calls without a
send_bar, too. This block is typical and comes up about 6 times a
minute (not regularly).

[29007.375460] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4000>
[29007.384394] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4016>
[29007.391329] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4032>
[29007.398646] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4048>
[29007.409470] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4064>
[29007.415295] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4080>
[29007.425943] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4096>
[29007.439950] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4112>
[29007.440041] ieee80211_check_pending_bar -> ieee80211_send_bar() retry cnt: 2 tid: 0 failed_bar_ssn: 4112
[29007.440061] ieee80211_check_pending_bar -> ieee80211_send_bar() retry cnt: 1 tid: 0 failed_bar_ssn: 4112
[29007.440071] ieee80211_check_pending_bar -> ieee80211_stop_tx_ba_session() retry cnt: 0 tid: 0 failed_bar_ssn: 4112
[29007.440100] Tx BA session stop requested for 00:25:9c:11:22:33 tid 0
[29007.447797] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4112>
[29007.449024] Stopping Tx BA session for 00:25:9c:11:22:33 tid 0
[29007.461327] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4112>
[29007.490023] Open BA session requested for 00:25:9c:11:22:33 tid 0
[29007.499048] activated addBA response timer on tid 0
[29007.504198] switched off addBA timer for tid 0
[29007.504203] Aggregation is on for tid 0



Thank you for your time - I really appreciate your work!
Kind regards,
Andreas

2012-01-09 15:37:34

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Helmut Schaa schrieb:
> On Thu, Jan 5, 2012 at 9:44 AM, Andreas Hartmann
> <[email protected]> wrote:
>> Helmut Schaa schrieb:
>>> Hi,
>>>
>>> Am Mittwoch, 4. Januar 2012, 09:38:02 schrieb Andreas Hartmann:
>>>> Helmut Schaa schrieb:
>>>>>> Removing this patch
>>>>>>
>>>>>> mac80211: retry sending failed BAR frames later instead of tearing down
>>>>>> aggr (http://www.spinics.net/lists/linux-wireless/msg76379.html) makes
>>>>>> it working again.
>>>>>>
>>>>>> Device is: RaLink RT2800 802.11n PCI
>>>>>
>>>>> What's the client device connected to the rt2x00 AP?
>>>>
>>>> It's a rt3572 usb chip (Linksys WUSB600N v2), driven with the rt3572sta
>>>> module.
>>>>
>>>>> Mind to send a
>>>>> 802.11 capture wen this stall happens?
>>>
>>> Andreas provided me with a capture now. The interesting part is that
>>> the rt2x00 AP sends bursts of BARs with the same sequence number when
>>> the stall happens. I only had a quick look at the code in question but
>>> couldn't see anything wrong at a first glance.
>>
>>
>> Some additional information:
>>
>> - The AP initiates BARs for 6 different, successive sequence numbers.
>> - The first 5 sequence numbers are finished after sending of about 8 (!)
>> BAR's.
>> Important: each BAR is responded *instantly* by the STA.
>> - The BAR for the next sequence number is sent "endless" though the STA
>> responses to each BAR instantly again.
>>
>>
>> Another thing: it is striking, that the performance from STA -> AP is at
>> max 8,5 MBit/s, whereas the performance from AP -> STA can be 15 MBit/s
>> (all measured with netperf).
>>
>>
>> Could it be, that somewhere in the stack of the AP the response packet
>> from STA -> AP get's lost or ignored or blocked?
>
> The BlockAck is solely handled in hardware. But I need to check, maybe
> the BAR tx status is reported as unacknowledged when the BA contains
> a BA bitmap of only zeros ...
>
> Don't know when I can find time to do this but if you like to test something
> you could try to add some printks in the rt2x00 tx status path to see if any
> BAR frames are reported as acknowledged (and if these have a bitmap != 0).

I tried the following with reverted patch "mac80211: retry sending
failed BAR frames later instead of tearing down" to see what's done to
not break down the data stream:

-> enabling CONFIG_MAC80211_HT_DEBUG
-> some additional output


Start of netperf:

[18836.107487] Open BA session requested for 00:25:9c:aa:bb:cc tid 0
[18836.116149] IEEE80211_AMPDU_TX_START
[18836.116156] activated addBA response timer on tid 0
[18836.116209] Rx A-MPDU request on tid 0 result 0
[18836.118846] switched off addBA timer for tid 0
[18836.118851] Aggregation is on for tid 0

(that's the same as with the patch, which I reverted)


Later on, I can see some sequences like this:

[18868.875495] ieee80211_ba_session_work() called
___ieee80211_stop_tx_ba_session
[18868.875503] Tx BA session stop requested for 00:25:9c:aa:bb:cc tid 0
[18868.884143] IEEE80211_AMPDU_TX_STOP
[18868.884155] Stopping Tx BA session for 00:25:9c:aa:bb:cc tid 0
[18868.963066] Open BA session requested for 00:25:9c:aa:bb:cc tid 0
[18868.972047] IEEE80211_AMPDU_TX_START
[18868.972054] activated addBA response timer on tid 0
[18868.979154] switched off addBA timer for tid 0
[18868.979156] Aggregation is on for tid 0


With the reverted patch applied again, I can't see any entries in the
moment of the stall. It just stalls silently.
How is the situation I can see with the reverted patch handled now?
Could it be, that it isn't handled at all?


Here you can see the triggering of the teardown of the BA session
(hopefully), if the patch is reverted:

"ieee80211_ba_session_work() called ___ieee80211_stop_tx_ba_session"
triggers the dropping of BA session. The code at this point is
(net/mac80211/ht.c):


--------------------------------------------------------------------------
void ieee80211_ba_session_work(struct work_struct *work) {
...
for (tid = 0; tid < STA_TID_NUM; tid++) {

...

tid_tx = rcu_dereference_protected_tid_tx(sta, tid);

if (tid_tx && test_and_clear_bit(HT_AGG_STATE_WANT_STOP,
&tid_tx->state)) {

printk("ieee80211_ba_session_work called
___ieee80211_stop_tx_ba_session\n"); // debug

___ieee80211_stop_tx_ba_session(sta,tid,
WLAN_BACK_INITIATOR,
true);
}
}
...
}
----------------------------------------------------------------------------

Does this help you to get the reason for the problem? I could do some
more tests - but I'm not sure where to get the bitmap you desire to see :-).


Kind regards,
Andreas

2012-01-17 09:18:06

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Hello Helmut,

Am Tue, 10 Jan 2012 10:50:16 +0100
schrieb Helmut Schaa <[email protected]>:

> On Tue, Jan 10, 2012 at 9:03 AM, Andreas Hartmann
> <[email protected]> wrote:
[...]
> > Looks as if sending of any package is reported as broken at some
> > point of time and this circle cannot be left anymore.
>
> Not necessarily since mac80211 will only retry the BAR if a data frame
> transmission was successful. Hence, it seems as if only the BARs TX
> status is reported incorrectly ...
>
> I see two issues here:
> 1) rt2800pci seems to have problems delivering the BAR _or_ doesn't
> report the tx status correctly
> 2) If the same BAR fails consecutively we should maybe really tear down
> the BA session as it was done before
>
> 2 is done in the below untested patch and will also work around 1.
> Mind to give it a try?

I encountered some temporary, short stalls during low bandwith
(running X terminal sessions tunneled through ssh). Therefore I set

MAX_BAR_RETRIES=0

This means,

ieee80211_send_bar(&sta->sdata->vif, addr, tid,
tid_tx->failed_bar_ssn);

isn't called any more. But that's not all: the BA session isn't torn
down any more, too - but it's working fine (high bandwith, too) anyway!?
Why? (If I remove your patch again, the session completely stalls as
ever).

Behavior looks like this now:

Jan 17 08:13:06 ap kernel: [ 4247.178448] rt2800_txdone_entry - tx failed
Jan 17 08:13:06 ap kernel: [ 4247.178456] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 113
Jan 17 08:13:06 ap kernel: [ 4247.178467] __set_bit ENTRY_TXD_HT_AMPDU
Jan 17 08:13:06 ap kernel: [ 4247.178478] rt2800_txdone_entry - tx failed
Jan 17 08:13:06 ap kernel: [ 4247.178481] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 114
Jan 17 08:13:06 ap kernel: [ 4247.178487] __set_bit ENTRY_TXD_HT_AMPDU
Jan 17 08:13:06 ap kernel: [ 4247.178493] rt2800_txdone_entry - tx failed
Jan 17 08:13:06 ap kernel: [ 4247.178495] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 115
Jan 17 08:13:06 ap kernel: [ 4247.178500] __set_bit ENTRY_TXD_HT_AMPDU
Jan 17 08:13:06 ap kernel: [ 4247.178506] rt2800_txdone_entry - tx failed
Jan 17 08:13:06 ap kernel: [ 4247.178508] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 116
Jan 17 08:13:06 ap kernel: [ 4247.178513] __set_bit ENTRY_TXD_HT_AMPDU
Jan 17 08:13:06 ap kernel: [ 4247.178518] rt2800_txdone_entry - tx failed
Jan 17 08:13:06 ap kernel: [ 4247.178521] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 117
Jan 17 08:13:06 ap kernel: [ 4247.178525] __set_bit ENTRY_TXD_HT_AMPDU
Jan 17 08:13:06 ap kernel: [ 4247.178530] rt2800_txdone_entry - tx failed
Jan 17 08:13:06 ap kernel: [ 4247.178533] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 118
Jan 17 08:13:06 ap kernel: [ 4247.178537] __set_bit ENTRY_TXD_HT_AMPDU
Jan 17 08:13:06 ap kernel: [ 4247.178542] rt2800_txdone_entry - tx failed
Jan 17 08:13:06 ap kernel: [ 4247.178545] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 119
Jan 17 08:13:06 ap kernel: [ 4247.178550] __set_bit ENTRY_TXD_HT_AMPDU
Jan 17 08:13:06 ap kernel: [ 4247.185886] rt2800_txdone_entry - tx failed
Jan 17 08:13:07 ap kernel: [ 4247.185894] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 120
Jan 17 08:13:07 ap kernel: [ 4247.185900] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4320>
Jan 17 08:13:07 ap kernel: [ 4247.196303] rt2800_txdone_entry - tx failed
Jan 17 08:13:07 ap kernel: [ 4247.196310] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 121
Jan 17 08:13:07 ap kernel: [ 4247.196315] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4336>
Jan 17 08:13:07 ap kernel: [ 4247.211337] rt2800_txdone_entry - tx failed
Jan 17 08:13:07 ap kernel: [ 4247.211346] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 122
Jan 17 08:13:07 ap kernel: [ 4247.211352] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4352>
Jan 17 08:13:07 ap kernel: [ 4247.217669] rt2800_txdone_entry - tx failed
Jan 17 08:13:07 ap kernel: [ 4247.217673] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 123
Jan 17 08:13:07 ap kernel: [ 4247.217675] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4368>
Jan 17 08:13:07 ap kernel: [ 4247.226812] rt2800_txdone_entry - tx failed
Jan 17 08:13:07 ap kernel: [ 4247.226821] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 124
Jan 17 08:13:07 ap kernel: [ 4247.226826] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4384>
Jan 17 08:13:07 ap kernel: [ 4247.234151] rt2800_txdone_entry - tx failed
Jan 17 08:13:07 ap kernel: [ 4247.234159] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 125
Jan 17 08:13:07 ap kernel: [ 4247.234165] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4400>
Jan 17 08:13:07 ap kernel: [ 4247.244042] rt2800_txdone_entry - tx failed
Jan 17 08:13:07 ap kernel: [ 4247.244050] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 126
Jan 17 08:13:07 ap kernel: [ 4247.244056] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4416>




BTW:
I can see, that RX A-MPDU is switched on - why not for TX, too? Where
can I see if a hardware is capable to use A-MPDU?

iw list prints for the AP device (2,4 GHz):

Wiphy phy0
Band 1:
HT capabilities: 0x02f2
* 20/40 MHz operation
* static SM PS
* HT-greenfield
* 20 MHz short GI
* 40 MHz short GI
* TX STBC
* RX STBC 2 streams
* max A-MSDU len 3839
HT A-MPDU factor: 0x0003 (65535 bytes)
HT A-MPDU density: 0x0004 (2 usec)
HT MCS set: ff ff 00 00 01 00 00 00 00 00 00 00 07 00 00 00
HT RX MCS rate indexes supported:
MCS index 0
MCS index 1
MCS index 2
MCS index 3
MCS index 4
MCS index 5
MCS index 6
MCS index 7
MCS index 8
MCS index 9
MCS index 10
MCS index 11
MCS index 12
MCS index 13
MCS index 14
MCS index 15
MCS index 32
TX unequal modulation not supported
HT TX Max spatiel streams: 5
HT TX MCS rate indexes supported may differ
Frequencies:
* 2412 MHz [1] (20.0 dBm)
* 2417 MHz [2] (20.0 dBm)
* 2422 MHz [3] (20.0 dBm)
* 2427 MHz [4] (20.0 dBm)
* 2432 MHz [5] (20.0 dBm)
* 2437 MHz [6] (20.0 dBm)
* 2442 MHz [7] (20.0 dBm)
* 2447 MHz [8] (20.0 dBm)
* 2452 MHz [9] (20.0 dBm)
* 2457 MHz [10] (20.0 dBm)
* 2462 MHz [11] (20.0 dBm)
* 2467 MHz [12] (20.0 dBm)
* 2472 MHz [13] (20.0 dBm)
* 2484 MHz [14] (disabled)
Bitrates (non-HT):
* 1.0 Mbps
* 2.0 Mbps (short preamble supported)
* 5.5 Mbps (short preamble supported)
* 11.0 Mbps (short preamble supported)
* 6.0 Mbps
* 9.0 Mbps
* 12.0 Mbps
* 18.0 Mbps
* 24.0 Mbps
* 36.0 Mbps
* 48.0 Mbps
* 54.0 Mbps

Kind regards,
Andreas


> Signed-off-by: Helmut Schaa <[email protected]>
> ---
> diff --git a/net/mac80211/sta_info.h b/net/mac80211/sta_info.h
> index a18f524..983994b 100644
> --- a/net/mac80211/sta_info.h
> +++ b/net/mac80211/sta_info.h
> @@ -122,7 +122,7 @@ struct tid_ampdu_tx {
> u8 buf_size;
>
> u16 failed_bar_ssn;
> - bool bar_pending;
> + unsigned int bar_pending;
> };
>
> /**
> diff --git a/net/mac80211/status.c b/net/mac80211/status.c
> index 30c265c..ea782f1 100644
> --- a/net/mac80211/status.c
> +++ b/net/mac80211/status.c
> @@ -17,6 +17,7 @@
> #include "led.h"
> #include "wme.h"
>
> +#define MAX_BAR_RETRIES 3
>
> void ieee80211_tx_status_irqsafe(struct ieee80211_hw *hw,
> struct sk_buff *skb)
> @@ -171,8 +172,17 @@ static void ieee80211_check_pending_bar(struct
> sta_info *sta, u8 *addr, u8 tid)
> if (!tid_tx || !tid_tx->bar_pending)
> return;
>
> - tid_tx->bar_pending = false;
> - ieee80211_send_bar(&sta->sdata->vif, addr, tid, tid_tx->failed_bar_ssn);
> + if (--tid_tx->bar_pending) {
> + ieee80211_send_bar(&sta->sdata->vif, addr, tid,
> + tid_tx->failed_bar_ssn);
> + return;
> + }
> +
> + /*
> + * The same BAR failed multiple times, something is clearly wrong
> + * -> Stop the BA session.
> + */
> + ieee80211_stop_tx_ba_session(&sta->sta, tid);
> }
>
> static void ieee80211_frame_acked(struct sta_info *sta, struct sk_buff *skb)
> @@ -225,8 +235,16 @@ static void ieee80211_set_bar_pending(struct
> sta_info *sta, u8 tid, u16 ssn)
> if (!tid_tx)
> return;
>
> + /*
> + * A BAR for the same SSN is still pending, don't
> + * update the pending count.
> + */
> + if (tid_tx->failed_bar_ssn == ssn &&
> + tid_tx->bar_pending)
> + return;
> +
> tid_tx->failed_bar_ssn = ssn;
> - tid_tx->bar_pending = true;
> + tid_tx->bar_pending = MAX_BAR_RETRIES;
> }
>
> static int ieee80211_tx_radiotap_len(struct ieee80211_tx_info *info)


2012-01-23 17:47:07

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Am Mon, 23 Jan 2012 17:36:57 +0100
schrieb Helmut Schaa <[email protected]>:

> On Mon, Jan 23, 2012 at 5:03 PM, Andreas Hartmann
> <[email protected]> wrote:
> > Helmut Schaa schrieb:
> >> On Fri, Jan 20, 2012 at 8:29 AM, Andreas Hartmann
> >> <[email protected]> wrote:
> >>> All in one: Your workaround was a good idea, but unfortunately it
> >>> doesn't "solve" / workaround the problem introduced with the patch
> >>> "mac80211: retry sending failed BAR frames later instead of tearing
> >>> down aggr" [1]. I don't know off any other solution at this time as to revert it.
> >>
> >> Ok, so we've got at least one hw issue here: rt2800 is not able to report
> >> correct ACK state of BARs :(
> >>
> >> The interesting bit is that the legacy drivers tear the BA session down as
> >> soon as the first AMPDU subframe failed. That's the same behavior as
> >> mac80211 had before the mentioned patch.
> >>
> >> So, let's assume the generic way Felix implemented works for other devices
> >> (like ath9k) we could go with a workaround inside rt2x00 to tear down
> >> the BA session as soon as the first AMPDU frame failed.
> >>
> >> Mind to try that one (not even compile tested though :) ).
> >
> > hdr is missing. I tried to derive it like this:
> >
> > struct ieee80211_hdr *hdr=(void*)entry->skb->data;
>
> Looks correct but you should put it right before the ieee80211_is_data_qos check
> not at the start of the function ...

I did it like this:

--- drivers/net/wireless/rt2x00/rt2x00dev.c.orig 2012-01-23 16:50:03.188718671 +0100
+++ drivers/net/wireless/rt2x00/rt2x00dev.c 2012-01-23 18:11:56.248553764 +0100
@@ -388,8 +388,17 @@
tx_info->status.ampdu_len = 1;
tx_info->status.ampdu_ack_len = success ? 1 : 0;

- if (!success)
- tx_info->flags |= IEEE80211_TX_STAT_AMPDU_NO_BACK;
+ struct ieee80211_hdr *hdr = (void*) entry->skb->data;
+ if (!success && tx_info->control.sta != NULL && ieee80211_is_data_qos(hdr->frame_control)) {
+ /*
+ * Tear down BA session
+ */
+ struct ieee80211_sta *sta = tx_info->control.sta;
+ u8 *qc = ieee80211_get_qos_ctl(hdr);
+ int tid = *qc & IEEE80211_QOS_CTL_TID_MASK;
+
+ ieee80211_stop_tx_ba_session(sta, tid);
+ }
}

if (rate_flags & IEEE80211_TX_RC_USE_RTS_CTS) {


but the machine does crash anyway. Could hdr be null, too?

Andreas

2012-01-10 14:44:39

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Am Tue, 10 Jan 2012 14:43:21 +0100
schrieb Helmut Schaa <[email protected]>:

> Hi,
>
> On Tue, Jan 10, 2012 at 2:01 PM, Andreas Hartmann
> <[email protected]> wrote:
> > Helmut Schaa schrieb:
> > [...]
> >> So this looks good to me although the broken tx status for BAR frames
> >> is for sure a rt2800pci problem (or even a hw issue).
> >
> > Sorry - What do you mean with "hw issue"? Is my hw broken?
>
> Hehe, no :)
>
> > Or is it a specific "behavior" of exactly this device:
>
> I think most rt2800pci will behave the same and just interpret a
> BlockAck as response to a BAR in a different manner as a BlockAck
> in response to a Data frame.
>
> When we send the BAR we tell the hw that the frame needs to be
> acknowledged and according to the spec the hw will only count
> a normal ACK as successful transmission, the peer will however
> answer with a BlockAck as long as a BA session is active.
>
> It might be possible to advise the hw to allow the BAR to be acked
> by a BlockAck by specifying AMDPU=1 in the TXWI. Not sure
> if that breaks anything else.
>
> However, a BAR seems to be "ack-able" by both an ACK and a
> BlockAck according to 802.11-2007. If the recipient answers with
> an ACK the BA session might already be terminated and we would
> stop the BA session as soon as 3 BARs failed with the previous
> patch. This should be fine I guess.
>
> So you might just give the following a try (also fully untested and
> maybe with some of your debugging output enabled).
>
> Signed-off-by: Helmut Schaa <[email protected]>
> ---
>
> diff --git a/drivers/net/wireless/rt2x00/rt2x00queue.c
> b/drivers/net/wireless/rt2x00/rt2x00queue.c
> index e4641da..5d0a3d3 100644
> --- a/drivers/net/wireless/rt2x00/rt2x00queue.c
> +++ b/drivers/net/wireless/rt2x00/rt2x00queue.c
> @@ -393,6 +393,9 @@ static void
> rt2x00queue_create_tx_descriptor_ht(struct rt2x00_dev *rt2x00dev,
> txdesc->u.ht.txop = TXOP_SIFS;
> else
> txdesc->u.ht.txop = TXOP_HTTXOP;
> +
> + if (ieee80211_is_back_req(hdr->frame_control))
> + __set_bit(ENTRY_TXD_HT_AMPDU, &txdesc->flags);
> }
>
> static void rt2x00queue_create_tx_descriptor(struct rt2x00_dev *rt2x00dev,

Unfortunately, I don't think it fixes anything (an equal looking
sequence could be seen just 10s before - ok other ssn numbers :-)):

[38517.363964] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 19
[38517.363977] __set_bit ENTRY_TXD_HT_AMPDU
[38517.363989] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 20
[38517.363995] __set_bit ENTRY_TXD_HT_AMPDU
[38517.364005] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 21
[38517.364011] __set_bit ENTRY_TXD_HT_AMPDU
[38517.364020] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 22
[38517.364026] __set_bit ENTRY_TXD_HT_AMPDU
[38517.364034] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 23
[38517.364042] __set_bit ENTRY_TXD_HT_AMPDU
[38517.375021] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 24
[38517.375028] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <27248>
[38517.389129] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 25
[38517.389132] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <27264>
[38517.397279] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 26
[38517.397283] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <27280>
[38517.407062] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 27
[38517.407069] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <27296>
[38517.417684] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 28
[38517.417691] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <27312>
[38517.417742] ieee80211_check_pending_bar -> ieee80211_send_bar() retry cnt: 2 tid: 0 failed_bar_ssn: 27312
[38517.417751] __set_bit ENTRY_TXD_HT_AMPDU
[38517.417763] ieee80211_check_pending_bar -> ieee80211_send_bar() retry cnt: 1 tid: 0 failed_bar_ssn: 27312
[38517.417768] __set_bit ENTRY_TXD_HT_AMPDU
[38517.417775] ieee80211_check_pending_bar -> ieee80211_stop_tx_ba_session() retry cnt: 0 tid: 0 failed_bar_ssn: 27312
[38517.417840] Tx BA session stop requested for 00:25:9c:11:22:33 tid 0
[38517.426147] Stopping Tx BA session for 00:25:9c:11:22:33 tid 0
[38517.432511] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 29
[38517.432518] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <27312>
[38517.439356] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 30
[38517.439362] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <27312>
[38517.491026] Open BA session requested for 00:25:9c:11:22:33 tid 0
[38517.499048] activated addBA response timer on tid 0
[38517.507633] switched off addBA timer for tid 0
[38517.507637] Aggregation is on for tid 0


Some additional information might be interesting:
I'm getting these tx errors nearly always (I haven't seen them in the
other direction until now) while sending data from STA -> AP (not from
AP -> STA). The throughput from STA -> AP is mostly about 8 MBit/s
whereas the other direction reaches up to 15 MBit/s.



Kind regards,
Andreas

2012-01-24 07:29:53

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Mon, Jan 23, 2012 at 02:29:08PM +0100, Helmut Schaa wrote:
> On Fri, Jan 20, 2012 at 8:29 AM, Andreas Hartmann
> <[email protected]> wrote:
> > All in one: Your workaround was a good idea, but unfortunately it
> > doesn't "solve" / workaround the problem introduced with the patch
> > "mac80211: retry sending failed BAR frames later instead of tearing
> > down aggr" [1]. I don't know off any other solution at this time as to revert it.
>
> Ok, so we've got at least one hw issue here: rt2800 is not able to report
> correct ACK state of BARs :(

Whould it help, if rt2800 driver will not filter out BAR frames?

diff --git a/drivers/net/wireless/rt2x00/rt2800lib.c b/drivers/net/wireless/rt2x00/rt2800lib.c
index 22a1a8f..20418be 100644
--- a/drivers/net/wireless/rt2x00/rt2800lib.c
+++ b/drivers/net/wireless/rt2x00/rt2800lib.c
@@ -1205,8 +1205,7 @@ void rt2800_config_filter(struct rt2x00_dev *rt2x00dev,
!(filter_flags & FIF_PSPOLL));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_BA,
!(filter_flags & FIF_CONTROL));
- rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_BAR,
- !(filter_flags & FIF_CONTROL));
+ rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_BAR, 0);
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_CNTL,
!(filter_flags & FIF_CONTROL));
rt2800_register_write(rt2x00dev, RX_FILTER_CFG, reg);



2012-01-10 13:43:21

by Helmut Schaa

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Hi,

On Tue, Jan 10, 2012 at 2:01 PM, Andreas Hartmann
<[email protected]> wrote:
> Helmut Schaa schrieb:
> [...]
>> So this looks good to me although the broken tx status for BAR frames
>> is for sure a rt2800pci problem (or even a hw issue).
>
> Sorry - What do you mean with "hw issue"? Is my hw broken?

Hehe, no :)

> Or is it a specific "behavior" of exactly this device:

I think most rt2800pci will behave the same and just interpret a
BlockAck as response to a BAR in a different manner as a BlockAck
in response to a Data frame.

When we send the BAR we tell the hw that the frame needs to be
acknowledged and according to the spec the hw will only count
a normal ACK as successful transmission, the peer will however
answer with a BlockAck as long as a BA session is active.

It might be possible to advise the hw to allow the BAR to be acked
by a BlockAck by specifying AMDPU=1 in the TXWI. Not sure
if that breaks anything else.

However, a BAR seems to be "ack-able" by both an ACK and a
BlockAck according to 802.11-2007. If the recipient answers with
an ACK the BA session might already be terminated and we would
stop the BA session as soon as 3 BARs failed with the previous
patch. This should be fine I guess.

So you might just give the following a try (also fully untested and
maybe with some of your debugging output enabled).

Signed-off-by: Helmut Schaa <[email protected]>
---

diff --git a/drivers/net/wireless/rt2x00/rt2x00queue.c
b/drivers/net/wireless/rt2x00/rt2x00queue.c
index e4641da..5d0a3d3 100644
--- a/drivers/net/wireless/rt2x00/rt2x00queue.c
+++ b/drivers/net/wireless/rt2x00/rt2x00queue.c
@@ -393,6 +393,9 @@ static void
rt2x00queue_create_tx_descriptor_ht(struct rt2x00_dev *rt2x00dev,
txdesc->u.ht.txop = TXOP_SIFS;
else
txdesc->u.ht.txop = TXOP_HTTXOP;
+
+ if (ieee80211_is_back_req(hdr->frame_control))
+ __set_bit(ENTRY_TXD_HT_AMPDU, &txdesc->flags);
}

static void rt2x00queue_create_tx_descriptor(struct rt2x00_dev *rt2x00dev,

2012-01-10 13:04:24

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Helmut Schaa schrieb:
[...]
> So this looks good to me although the broken tx status for BAR frames
> is for sure a rt2800pci problem (or even a hw issue).

Sorry - What do you mean with "hw issue"? Is my hw broken? Or is
it a specific "behavior" of exactly this device:


05:07.0 Network controller: RaLink RT2800 802.11n PCI
Subsystem: Linksys Device 0067
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 32 (500ns min, 1000ns max), Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 21
Region 0: Memory at fdae0000 (32-bit, non-prefetchable) [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Kernel driver in use: rt2800pci


Kind regards,
Andreas

2012-01-10 08:06:38

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Am Mon, 9 Jan 2012 14:45:06 +0100
schrieb Helmut Schaa <[email protected]>:

> On Mon, Jan 9, 2012 at 2:29 PM, Andreas Hartmann
> <[email protected]> wrote:
> > Helmut Schaa schrieb:
> >> On Thu, Jan 5, 2012 at 9:44 AM, Andreas Hartmann
> >> <[email protected]> wrote:
> >>> Helmut Schaa schrieb:
> >>>> Hi,
> >>>>
> >>>> Am Mittwoch, 4. Januar 2012, 09:38:02 schrieb Andreas Hartmann:
> >>>>> Helmut Schaa schrieb:
> >>>>>>> Removing this patch
> >>>>>>>
> >>>>>>> mac80211: retry sending failed BAR frames later instead of tearing down
> >>>>>>> aggr (http://www.spinics.net/lists/linux-wireless/msg76379.html) makes
> >>>>>>> it working again.
> >>>>>>>
> >>>>>>> Device is: RaLink RT2800 802.11n PCI
> >>>>>>
> >>>>>> What's the client device connected to the rt2x00 AP?
> >>>>>
> >>>>> It's a rt3572 usb chip (Linksys WUSB600N v2), driven with the rt3572sta
> >>>>> module.
> >>>>>
> >>>>>> Mind to send a
> >>>>>> 802.11 capture wen this stall happens?
> >>>>
> >>>> Andreas provided me with a capture now. The interesting part is that
> >>>> the rt2x00 AP sends bursts of BARs with the same sequence number when
> >>>> the stall happens. I only had a quick look at the code in question but
> >>>> couldn't see anything wrong at a first glance.
> >>>
> >>>
> >>> Some additional information:
> >>>
> >>> - The AP initiates BARs for 6 different, successive sequence numbers.
> >>> - The first 5 sequence numbers are finished after sending of about 8 (!)
> >>> BAR's.
> >>> Important: each BAR is responded *instantly* by the STA.
> >>> - The BAR for the next sequence number is sent "endless" though the STA
> >>> responses to each BAR instantly again.
> >>>
> >>>
> >>> Another thing: it is striking, that the performance from STA -> AP is at
> >>> max 8,5 MBit/s, whereas the performance from AP -> STA can be 15 MBit/s
> >>> (all measured with netperf).
> >>>
> >>>
> >>> Could it be, that somewhere in the stack of the AP the response packet
> >>> from STA -> AP get's lost or ignored or blocked?
> >>
> >> The BlockAck is solely handled in hardware. But I need to check, maybe
> >> the BAR tx status is reported as unacknowledged when the BA contains
> >> a BA bitmap of only zeros ...
> >>
> >> Don't know when I can find time to do this but if you like to test something
> >> you could try to add some printks in the rt2x00 tx status path to see if any
> >> BAR frames are reported as acknowledged (and if these have a bitmap != 0).
> >
> > I tried the following with reverted patch "mac80211: retry sending
> > failed BAR frames later instead of tearing down" to see what's done to
> > not break down the data stream:
> >
> > -> enabling CONFIG_MAC80211_HT_DEBUG
> > -> some additional output
> >
> >
> > Start of netperf:
> >
> > [18836.107487] Open BA session requested for 00:25:9c:aa:bb:cc tid 0
> > [18836.116149] IEEE80211_AMPDU_TX_START
> > [18836.116156] activated addBA response timer on tid 0
> > [18836.116209] Rx A-MPDU request on tid 0 result 0
> > [18836.118846] switched off addBA timer for tid 0
> > [18836.118851] Aggregation is on for tid 0
> >
> > (that's the same as with the patch, which I reverted)
> >
> >
> > Later on, I can see some sequences like this:
> >
> > [18868.875495] ieee80211_ba_session_work() called
> > ___ieee80211_stop_tx_ba_session
> > [18868.875503] Tx BA session stop requested for 00:25:9c:aa:bb:cc tid 0
> > [18868.884143] IEEE80211_AMPDU_TX_STOP
> > [18868.884155] Stopping Tx BA session for 00:25:9c:aa:bb:cc tid 0
> > [18868.963066] Open BA session requested for 00:25:9c:aa:bb:cc tid 0
> > [18868.972047] IEEE80211_AMPDU_TX_START
> > [18868.972054] activated addBA response timer on tid 0
> > [18868.979154] switched off addBA timer for tid 0
> > [18868.979156] Aggregation is on for tid 0
> >
> >
> > With the reverted patch applied again, I can't see any entries in the
> > moment of the stall. It just stalls silently.
> > How is the situation I can see with the reverted patch handled now?
> > Could it be, that it isn't handled at all?
>
> Felix patch intended to not restart the BA session in this case as it
> introduces some unnecessary processing if it would be possible to
> just flush the reordering buffer of the peer by resending the BAR
> later. However, there is no message printed in this case ...

Thank you for your really fast reply!
Unfortunately, freenet mailing was broken yesterday. Hope it's fine again today.

Meanwhile, I took a look at the tx_path with Felix patch applied. I'm
getting this output:


-> netperf start

[38139.839028] Open BA session requested for 00:25:9c:de:4e:30 tid 0
[38139.848151] IEEE80211_AMPDU_TX_START
[38139.848157] activated addBA response timer on tid 0
[38139.849723] switched off addBA timer for tid 0
[38139.849728] Aggregation is on for tid 0

-> stall

[38161.686305] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 1
[38161.698957] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 2

[38161.698963] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
[38161.699116] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
[38161.710238] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 3
[38161.710240] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
[38161.710394] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
[38161.724512] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 4
[38161.724519] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
[38161.724659] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
[38161.735416] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 5
[38161.735423] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
[38161.735486] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
[38161.747815] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 6
[38161.747822] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
[38161.747888] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
[38161.755268] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 7
[38161.755271] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
[38161.758823] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
[38161.772898] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 8
[38161.772901] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
[38161.773049] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
[38161.789222] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 9
...

Looks as if sending of any package is reported as broken at some
point of time and this circle cannot be left anymore.


The additional debug output in rt2x00dev.c is:

void rt2x00lib_txdone(struct queue_entry *entry,
struct txdone_entry_desc *txdesc)
{
...

if (!(tx_info->flags & IEEE80211_TX_CTL_NO_ACK)) {
if (success)
tx_info->flags |= IEEE80211_TX_STAT_ACK;
else {
rt2x00dev->low_level_stats.dot11ACKFailureCount++;
printk("rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: %d\n",
rt2x00dev->low_level_stats.dot11ACKFailureCount); // debug
}
}
...
}


In status.c:

void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
{
...
if (!acked && ieee80211_is_back_req(fc)) {
u16 control;

/*
* BAR failed, store the last SSN and retry sending
* the BAR when the next unicast transmission on the
* same TID succeeds.
*/
bar = (struct ieee80211_bar *) skb->data;
control = le16_to_cpu(bar->control);
if (!(control & IEEE80211_BAR_CTRL_MULTI_TID)) {
u16 ssn = le16_to_cpu(bar->start_seq_num);

tid = (control &
IEEE80211_BAR_CTRL_TID_INFO_MASK) >>
IEEE80211_BAR_CTRL_TID_INFO_SHIFT;

printk("status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <%d> control: <%d>\n",
tid,control); // debug
ieee80211_set_bar_pending(sta, tid, ssn);
}
}
...
}



static void ieee80211_check_pending_bar(struct sta_info *sta, u8 *addr, u8 tid)
{
struct tid_ampdu_tx *tid_tx;

tid_tx = rcu_dereference(sta->ampdu_mlme.tid_tx[tid]);
if (!tid_tx || !tid_tx->bar_pending)
return;

printk("ieee80211_check_pending_bar -> ieee80211_send_bar() %s tid: %d failed_bar_ssn: %d\n",
tid_tx->bar_pending?"true":"false",tid, tid_tx->failed_bar_ssn); // debug

tid_tx->bar_pending = false;
ieee80211_send_bar(&sta->sdata->vif, addr, tid, tid_tx->failed_bar_ssn);
}



Kind regards,
Andreas



2012-01-04 15:04:28

by Helmut Schaa

[permalink] [raw]
Subject: Re: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Hi,

Am Mittwoch, 4. Januar 2012, 09:38:02 schrieb Andreas Hartmann:
> Helmut Schaa schrieb:
> >> Removing this patch
> >>
> >> mac80211: retry sending failed BAR frames later instead of tearing down
> >> aggr (http://www.spinics.net/lists/linux-wireless/msg76379.html) makes
> >> it working again.
> >>
> >> Device is: RaLink RT2800 802.11n PCI
> >
> > What's the client device connected to the rt2x00 AP?
>
> It's a rt3572 usb chip (Linksys WUSB600N v2), driven with the rt3572sta
> module.
>
> > Mind to send a
> > 802.11 capture wen this stall happens?

Andreas provided me with a capture now. The interesting part is that
the rt2x00 AP sends bursts of BARs with the same sequence number when
the stall happens. I only had a quick look at the code in question but
couldn't see anything wrong at a first glance.

Helmut

2012-01-24 08:04:54

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Stanislaw Gruszka schrieb:
> On Mon, Jan 23, 2012 at 02:29:08PM +0100, Helmut Schaa wrote:
>> On Fri, Jan 20, 2012 at 8:29 AM, Andreas Hartmann
>> <[email protected]> wrote:
>>> All in one: Your workaround was a good idea, but unfortunately it
>>> doesn't "solve" / workaround the problem introduced with the patch
>>> "mac80211: retry sending failed BAR frames later instead of tearing
>>> down aggr" [1]. I don't know off any other solution at this time as to revert it.
>>
>> Ok, so we've got at least one hw issue here: rt2800 is not able to report
>> correct ACK state of BARs :(
>
> Whould it help, if rt2800 driver will not filter out BAR frames?
>
> diff --git a/drivers/net/wireless/rt2x00/rt2800lib.c b/drivers/net/wireless/rt2x00/rt2800lib.c
> index 22a1a8f..20418be 100644
> --- a/drivers/net/wireless/rt2x00/rt2800lib.c
> +++ b/drivers/net/wireless/rt2x00/rt2800lib.c
> @@ -1205,8 +1205,7 @@ void rt2800_config_filter(struct rt2x00_dev *rt2x00dev,
> !(filter_flags & FIF_PSPOLL));
> rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_BA,
> !(filter_flags & FIF_CONTROL));
> - rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_BAR,
> - !(filter_flags & FIF_CONTROL));
> + rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_BAR, 0);
> rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_CNTL,
> !(filter_flags & FIF_CONTROL));
> rt2800_register_write(rt2x00dev, RX_FILTER_CFG, reg);
>
>

This can't be applied to compat-wireless-3.2-1-s (same in
compat-wireless-3.2-rc6-3) - it seems to be already existing. The
relevant part looks like this:


void rt2800_config_filter(struct rt2x00_dev *rt2x00dev,
const unsigned int filter_flags)
{
u32 reg;

/*
* Start configuration steps.
* Note that the version error will always be dropped
* and broadcast frames will always be accepted since
* there is no filter for it at this time.
*/
rt2800_register_read(rt2x00dev, RX_FILTER_CFG, &reg);
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_CRC_ERROR,
!(filter_flags & FIF_FCSFAIL));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_PHY_ERROR,
!(filter_flags & FIF_PLCPFAIL));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_NOT_TO_ME,
!(filter_flags & FIF_PROMISC_IN_BSS));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_NOT_MY_BSSD, 0);
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_VER_ERROR, 1);
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_MULTICAST,
!(filter_flags & FIF_ALLMULTI));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_BROADCAST, 0);
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_DUPLICATE, 1);
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_CF_END_ACK,
!(filter_flags & FIF_CONTROL));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_CF_END,
!(filter_flags & FIF_CONTROL));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_ACK,
!(filter_flags & FIF_CONTROL));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_CTS,
!(filter_flags & FIF_CONTROL));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_RTS,
!(filter_flags & FIF_CONTROL));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_PSPOLL,
!(filter_flags & FIF_PSPOLL));
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_BA, 1);
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_BAR, 0);
rt2x00_set_field32(&reg, RX_FILTER_CFG_DROP_CNTL,
!(filter_flags & FIF_CONTROL));
rt2800_register_write(rt2x00dev, RX_FILTER_CFG, reg);
}
EXPORT_SYMBOL_GPL(rt2800_config_filter);


Thanks,
Andreas

2012-01-04 08:40:39

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Hi Helmut,

happy new year for you, too!

Helmut Schaa schrieb:
> Hi Andreas,
>
> Am Samstag, 31. Dezember 2011, 21:35:04 schrieb Andreas Hartmann:
>> Andreas Hartmann schrieb:
>>> Hello,
>>>
>>> I tested compat-wireless-3.2-rc6-3 with kernel 3.1 and a rt2860 device
>>> running in AP mode (80211.n - 40 MHz). During
>>>
>>> netperf -t TCP_STREAM -H ap
>>>
>>> the data stream mostly stalls (-> the device receives data) after a few
>>> seconds and never comes up any more. The connection is completely death.
>>> I have to restart wpa_supplicant to get it working again.
>>>
>>>
>>>
>>> I tried to isolate the problem the following way:
>>>
>>> 1. Running compat-wireless-3.2-rc6-3 with the rt2x00 level from
>>> compat-wireless-3.1.1-1. The problem didn't gone.
>>>
>>> 2. Running compat-wireless-3.1.1-1 with the rt2x00 based changes in
>>> compat-wireless-3.2-rc6-3. I can't see the problem any more.
>>>
>>>
>>> That's why I think, that the problem must be outside the rt2x00 changes.
>>> But there are many changes outside :-).
>>
>>
>> Removing this patch
>>
>> mac80211: retry sending failed BAR frames later instead of tearing down
>> aggr (http://www.spinics.net/lists/linux-wireless/msg76379.html) makes
>> it working again.
>>
>> Device is: RaLink RT2800 802.11n PCI
>
> What's the client device connected to the rt2x00 AP?

It's a rt3572 usb chip (Linksys WUSB600N v2), driven with the rt3572sta
module.

> Mind to send a
> 802.11 capture wen this stall happens?

I would try to scan the traffic with a third device in monitor mode
(airmon-ng). But: 802.11n uses more than one channel. How can I get
them? airmon-ng can't be told to listen to more than one channel at a
time (or better: I probably don't know how to do it :-)).

Do you have an idea, how to completely trace 802.11n?


Thanks,
Andreas

2012-01-09 13:45:07

by Helmut Schaa

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Mon, Jan 9, 2012 at 2:29 PM, Andreas Hartmann
<[email protected]> wrote:
> Helmut Schaa schrieb:
>> On Thu, Jan 5, 2012 at 9:44 AM, Andreas Hartmann
>> <[email protected]> wrote:
>>> Helmut Schaa schrieb:
>>>> Hi,
>>>>
>>>> Am Mittwoch, 4. Januar 2012, 09:38:02 schrieb Andreas Hartmann:
>>>>> Helmut Schaa schrieb:
>>>>>>> Removing this patch
>>>>>>>
>>>>>>> mac80211: retry sending failed BAR frames later instead of tearing down
>>>>>>> aggr (http://www.spinics.net/lists/linux-wireless/msg76379.html) makes
>>>>>>> it working again.
>>>>>>>
>>>>>>> Device is: RaLink RT2800 802.11n PCI
>>>>>>
>>>>>> What's the client device connected to the rt2x00 AP?
>>>>>
>>>>> It's a rt3572 usb chip (Linksys WUSB600N v2), driven with the rt3572sta
>>>>> module.
>>>>>
>>>>>> Mind to send a
>>>>>> 802.11 capture wen this stall happens?
>>>>
>>>> Andreas provided me with a capture now. The interesting part is that
>>>> the rt2x00 AP sends bursts of BARs with the same sequence number when
>>>> the stall happens. I only had a quick look at the code in question but
>>>> couldn't see anything wrong at a first glance.
>>>
>>>
>>> Some additional information:
>>>
>>> - The AP initiates BARs for 6 different, successive sequence numbers.
>>> - The first 5 sequence numbers are finished after sending of about 8 (!)
>>> BAR's.
>>> Important: each BAR is responded *instantly* by the STA.
>>> - The BAR for the next sequence number is sent "endless" though the STA
>>> responses to each BAR instantly again.
>>>
>>>
>>> Another thing: it is striking, that the performance from STA -> AP is at
>>> max 8,5 MBit/s, whereas the performance from AP -> STA can be 15 MBit/s
>>> (all measured with netperf).
>>>
>>>
>>> Could it be, that somewhere in the stack of the AP the response packet
>>> from STA -> AP get's lost or ignored or blocked?
>>
>> The BlockAck is solely handled in hardware. But I need to check, maybe
>> the BAR tx status is reported as unacknowledged when the BA contains
>> a BA bitmap of only zeros ...
>>
>> Don't know when I can find time to do this but if you like to test something
>> you could try to add some printks in the rt2x00 tx status path to see if any
>> BAR frames are reported as acknowledged (and if these have a bitmap != 0).
>
> I tried the following with reverted patch "mac80211: retry sending
> failed BAR frames later instead of tearing down" to see what's done to
> not break down the data stream:
>
> -> enabling CONFIG_MAC80211_HT_DEBUG
> -> some additional output
>
>
> Start of netperf:
>
> [18836.107487] Open BA session requested for 00:25:9c:aa:bb:cc tid 0
> [18836.116149] IEEE80211_AMPDU_TX_START
> [18836.116156] activated addBA response timer on tid 0
> [18836.116209] Rx A-MPDU request on tid 0 result 0
> [18836.118846] switched off addBA timer for tid 0
> [18836.118851] Aggregation is on for tid 0
>
> (that's the same as with the patch, which I reverted)
>
>
> Later on, I can see some sequences like this:
>
> [18868.875495] ieee80211_ba_session_work() called
> ___ieee80211_stop_tx_ba_session
> [18868.875503] Tx BA session stop requested for 00:25:9c:aa:bb:cc tid 0
> [18868.884143] IEEE80211_AMPDU_TX_STOP
> [18868.884155] Stopping Tx BA session for 00:25:9c:aa:bb:cc tid 0
> [18868.963066] Open BA session requested for 00:25:9c:aa:bb:cc tid 0
> [18868.972047] IEEE80211_AMPDU_TX_START
> [18868.972054] activated addBA response timer on tid 0
> [18868.979154] switched off addBA timer for tid 0
> [18868.979156] Aggregation is on for tid 0
>
>
> With the reverted patch applied again, I can't see any entries in the
> moment of the stall. It just stalls silently.
> How is the situation I can see with the reverted patch handled now?
> Could it be, that it isn't handled at all?

Felix patch intended to not restart the BA session in this case as it
introduces some unnecessary processing if it would be possible to
just flush the reordering buffer of the peer by resending the BAR
later. However, there is no message printed in this case ...

Helmut

2012-01-03 21:51:21

by Helmut Schaa

[permalink] [raw]
Subject: Re: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Hi Andreas,

Am Samstag, 31. Dezember 2011, 21:35:04 schrieb Andreas Hartmann:
> Andreas Hartmann schrieb:
> > Hello,
> >
> > I tested compat-wireless-3.2-rc6-3 with kernel 3.1 and a rt2860 device
> > running in AP mode (80211.n - 40 MHz). During
> >
> > netperf -t TCP_STREAM -H ap
> >
> > the data stream mostly stalls (-> the device receives data) after a few
> > seconds and never comes up any more. The connection is completely death.
> > I have to restart wpa_supplicant to get it working again.
> >
> >
> >
> > I tried to isolate the problem the following way:
> >
> > 1. Running compat-wireless-3.2-rc6-3 with the rt2x00 level from
> > compat-wireless-3.1.1-1. The problem didn't gone.
> >
> > 2. Running compat-wireless-3.1.1-1 with the rt2x00 based changes in
> > compat-wireless-3.2-rc6-3. I can't see the problem any more.
> >
> >
> > That's why I think, that the problem must be outside the rt2x00 changes.
> > But there are many changes outside :-).
>
>
> Removing this patch
>
> mac80211: retry sending failed BAR frames later instead of tearing down
> aggr (http://www.spinics.net/lists/linux-wireless/msg76379.html) makes
> it working again.
>
> Device is: RaLink RT2800 802.11n PCI

What's the client device connected to the rt2x00 AP? Mind to send a
802.11 capture wen this stall happens?

Thanks and a happy new year,
Helmut

2012-01-10 12:23:01

by Helmut Schaa

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Tue, Jan 10, 2012 at 12:56 PM, Andreas Hartmann
<[email protected]> wrote:
> I think, that there is a problem with the tx status, as if I can see the
> BAR in the trace and even the response of the STA (so the BAR was
> probably ok), there really seems to be a problem with the tx status.
>
>> 2) If the same BAR fails consecutively we should maybe really tear down
>> ? ?the BA session as it was done before
>>
>> 2 is done in the below untested patch and will also work around 1.
>> Mind to give it a try?
>
> Yes :-)
>
> The workaround patch works fine for me (at a first glance - I didn't
> check for any side effects yet). The stall doesn't happen any more. But
> I'm not really sure, why it works:
>
> At the beginning, you just can see 9 tx failures.

This is most likely a complete AMPDU that failed.

> After a few tx
> failures, you can see 7 BAR's, which are pending, but never passed to
> ieee80211_send_bar().

The BAR for each failed AMPDU subframe is sent by mac80211.

> - Why are they suddenly ok (are they really ok)? Or is it a logging
> ?problem (too much input at the same time)?
>
>
> At session 38176, you can see your patch at work :-).
>
>
> [27229.720767] rt2800_txdone_entry - tx failed
> [27229.720791] rt2800_txdone_entry - tx failed
> [27229.720800] rt2800_txdone_entry - tx failed
> [27229.720807] rt2800_txdone_entry - tx failed
> [27229.720814] rt2800_txdone_entry - tx failed
> [27229.720821] rt2800_txdone_entry - tx failed
> [27229.720828] rt2800_txdone_entry - tx failed
> [27229.720835] rt2800_txdone_entry - tx failed
> [27229.727130] rt2800_txdone_entry - tx failed
> [27229.727133] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38064>
> [27229.734419] rt2800_txdone_entry - tx failed
> [27229.734431] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38080>
> [27229.741731] rt2800_txdone_entry - tx failed
> [27229.741741] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38096>
> [27229.749097] rt2800_txdone_entry - tx failed
> [27229.749107] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38112>
> [27229.758136] rt2800_txdone_entry - tx failed
> [27229.758143] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38128>
> [27229.765453] rt2800_txdone_entry - tx failed
> [27229.765461] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38144>
> [27229.771836] rt2800_txdone_entry - tx failed
> [27229.771843] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38160>
> [27229.779281] rt2800_txdone_entry - tx failed
> [27229.779290] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38176>
>
> [27229.779315] ieee80211_check_pending_bar -> ieee80211_send_bar() retry cnt: 2 tid: 0 failed_bar_ssn: 38176
> [27229.786705] rt2800_txdone_entry - tx failed
> [27229.786714] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38176>
> [27229.793439] ieee80211_check_pending_bar -> ieee80211_send_bar() retry cnt: 1 tid: 0 failed_bar_ssn: 38176
>
> [27229.793452] ieee80211_check_pending_bar -> ieee80211_stop_tx_ba_session() retry cnt: 0 tid: 0 failed_bar_ssn: 38176
>
> [27229.793476] Tx BA session stop requested for 00:25:9c:11:22:33 tid 0
> [27229.802082] Stopping Tx BA session for 00:25:9c:11:22:33 tid 0
>
> [27229.807455] rt2800_txdone_entry - tx failed
> [27229.807465] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <38176>
>
> [27229.816023] Open BA session requested for 00:25:9c:11:22:33 tid 0
> [27229.834077] activated addBA response timer on tid 0
> [27229.835517] switched off addBA timer for tid 0
> [27229.835522] Aggregation is on for tid 0
>
>
>
> The following test is done with reduced debugging (without the tx
> failed entries). But I can see the set_bar_pending calls without a
> send_bar, too. This block is typical and comes up about 6 times a
> minute (not regularly).
>
> [29007.375460] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4000>
> [29007.384394] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4016>
> [29007.391329] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4032>
> [29007.398646] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4048>
> [29007.409470] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4064>
> [29007.415295] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4080>
> [29007.425943] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4096>
> [29007.439950] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4112>
> [29007.440041] ieee80211_check_pending_bar -> ieee80211_send_bar() retry cnt: 2 tid: 0 failed_bar_ssn: 4112
> [29007.440061] ieee80211_check_pending_bar -> ieee80211_send_bar() retry cnt: 1 tid: 0 failed_bar_ssn: 4112
> [29007.440071] ieee80211_check_pending_bar -> ieee80211_stop_tx_ba_session() retry cnt: 0 tid: 0 failed_bar_ssn: 4112
> [29007.440100] Tx BA session stop requested for 00:25:9c:11:22:33 tid 0
> [29007.447797] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4112>
> [29007.449024] Stopping Tx BA session for 00:25:9c:11:22:33 tid 0
> [29007.461327] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4> ssn: <4112>
> [29007.490023] Open BA session requested for 00:25:9c:11:22:33 tid 0
> [29007.499048] activated addBA response timer on tid 0
> [29007.504198] switched off addBA timer for tid 0
> [29007.504203] Aggregation is on for tid 0

The BAR will only be resent after the next successful AMPDU transmission. And if
this BAR transmission fails 3 times the session will be shut down (and restarted
later).

So this looks good to me although the broken tx status for BAR frames
is for sure
a rt2800pci problem (or even a hw issue).

Helmut

2012-01-09 10:23:22

by Helmut Schaa

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Thu, Jan 5, 2012 at 9:44 AM, Andreas Hartmann
<[email protected]> wrote:
> Helmut Schaa schrieb:
>> Hi,
>>
>> Am Mittwoch, 4. Januar 2012, 09:38:02 schrieb Andreas Hartmann:
>>> Helmut Schaa schrieb:
>>>>> Removing this patch
>>>>>
>>>>> mac80211: retry sending failed BAR frames later instead of tearing down
>>>>> aggr (http://www.spinics.net/lists/linux-wireless/msg76379.html) makes
>>>>> it working again.
>>>>>
>>>>> Device is: RaLink RT2800 802.11n PCI
>>>>
>>>> What's the client device connected to the rt2x00 AP?
>>>
>>> It's a rt3572 usb chip (Linksys WUSB600N v2), driven with the rt3572sta
>>> module.
>>>
>>>> Mind to send a
>>>> 802.11 capture wen this stall happens?
>>
>> Andreas provided me with a capture now. The interesting part is that
>> the rt2x00 AP sends bursts of BARs with the same sequence number when
>> the stall happens. I only had a quick look at the code in question but
>> couldn't see anything wrong at a first glance.
>
>
> Some additional information:
>
> - The AP initiates BARs for 6 different, successive sequence numbers.
> - The first 5 sequence numbers are finished after sending of about 8 (!)
> BAR's.
> Important: each BAR is responded *instantly* by the STA.
> - The BAR for the next sequence number is sent "endless" though the STA
> responses to each BAR instantly again.
>
>
> Another thing: it is striking, that the performance from STA -> AP is at
> max 8,5 MBit/s, whereas the performance from AP -> STA can be 15 MBit/s
> (all measured with netperf).
>
>
> Could it be, that somewhere in the stack of the AP the response packet
> from STA -> AP get's lost or ignored or blocked?

The BlockAck is solely handled in hardware. But I need to check, maybe
the BAR tx status is reported as unacknowledged when the BA contains
a BA bitmap of only zeros ...

Don't know when I can find time to do this but if you like to test something
you could try to add some printks in the rt2x00 tx status path to see if any
BAR frames are reported as acknowledged (and if these have a bitmap != 0).

Helmut

2012-01-24 15:19:28

by Helmut Schaa

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Tue, Jan 24, 2012 at 3:39 PM, Stanislaw Gruszka <[email protected]> wrote:
> On Tue, Jan 24, 2012 at 09:07:04AM +0100, Helmut Schaa wrote:
>> > but the machine does crash anyway. Could hdr be null, too?
>>
>> Ok, I'll do some testing as soon as time allows and get back to you then.
>
> Not sure how good idea is is, but perhaps we could introduce mac80211 flag
> that will disable behaviour of f0425bed commit for drivers, which will
> set that flag.

I don't think this is a good idea, since if it is a device workaround
it should be
implemented in the driver. That's what I tried before but according to Andreas
the patch causes a crash. But in general it should be possible to bring a
BA session down by calling ieee80211_stop_tx_ba_session from the driver.
Not sure why that'll crash :(

Helmut

2012-01-23 16:36:58

by Helmut Schaa

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Mon, Jan 23, 2012 at 5:03 PM, Andreas Hartmann
<[email protected]> wrote:
> Helmut Schaa schrieb:
>> On Fri, Jan 20, 2012 at 8:29 AM, Andreas Hartmann
>> <[email protected]> wrote:
>>> All in one: Your workaround was a good idea, but unfortunately it
>>> doesn't "solve" / workaround the problem introduced with the patch
>>> "mac80211: retry sending failed BAR frames later instead of tearing
>>> down aggr" [1]. I don't know off any other solution at this time as to revert it.
>>
>> Ok, so we've got at least one hw issue here: rt2800 is not able to report
>> correct ACK state of BARs :(
>>
>> The interesting bit is that the legacy drivers tear the BA session down as
>> soon as the first AMPDU subframe failed. That's the same behavior as
>> mac80211 had before the mentioned patch.
>>
>> So, let's assume the generic way Felix implemented works for other devices
>> (like ath9k) we could go with a workaround inside rt2x00 to tear down
>> the BA session as soon as the first AMPDU frame failed.
>>
>> Mind to try that one (not even compile tested though :) ).
>
> hdr is missing. I tried to derive it like this:
>
> struct ieee80211_hdr *hdr=(void*)entry->skb->data;

Looks correct but you should put it right before the ieee80211_is_data_qos check
not at the start of the function ...

Also you might want to add " tx_info->control.sta != NULL" check in
the same if clause.

Helmut

2012-01-10 09:50:18

by Helmut Schaa

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Tue, Jan 10, 2012 at 9:03 AM, Andreas Hartmann
<[email protected]> wrote:
> Meanwhile, I took a look at the tx_path with Felix patch applied. I'm
> getting this output:
>
>
> -> netperf start
>
> [38139.839028] Open BA session requested for 00:25:9c:de:4e:30 tid 0
> [38139.848151] IEEE80211_AMPDU_TX_START
> [38139.848157] activated addBA response timer on tid 0
> [38139.849723] switched off addBA timer for tid 0
> [38139.849728] Aggregation is on for tid 0
>
> -> stall
>
> [38161.686305] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 1
> [38161.698957] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 2
>
> [38161.698963] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
> [38161.699116] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
> [38161.710238] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 3
> [38161.710240] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
> [38161.710394] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
> [38161.724512] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 4
> [38161.724519] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
> [38161.724659] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
> [38161.735416] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 5
> [38161.735423] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
> [38161.735486] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
> [38161.747815] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 6
> [38161.747822] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
> [38161.747888] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
> [38161.755268] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 7
> [38161.755271] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
> [38161.758823] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
> [38161.772898] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 8
> [38161.772901] status.c ieee80211_tx_status() calls ieee80211_set_bar_pending. tid: <0> control: <4>
> [38161.773049] ieee80211_check_pending_bar() -> ieee80211_send_bar() true tid: 0 failed_bar_ssn: 65040
> [38161.789222] rt2x00lib_txdone - no success - low_level_stats.dot11ACKFailureCount: 9
> ...
>
> Looks as if sending of any package is reported as broken at some
> point of time and this circle cannot be left anymore.

Not necessarily since mac80211 will only retry the BAR if a data frame
transmission was successful. Hence, it seems as if only the BARs TX
status is reported incorrectly ...

I see two issues here:
1) rt2800pci seems to have problems delivering the BAR _or_ doesn't
report the tx status correctly
2) If the same BAR fails consecutively we should maybe really tear down
the BA session as it was done before

2 is done in the below untested patch and will also work around 1.
Mind to give it a try?

Thanks,
Helmut

Signed-off-by: Helmut Schaa <[email protected]>
---
diff --git a/net/mac80211/sta_info.h b/net/mac80211/sta_info.h
index a18f524..983994b 100644
--- a/net/mac80211/sta_info.h
+++ b/net/mac80211/sta_info.h
@@ -122,7 +122,7 @@ struct tid_ampdu_tx {
u8 buf_size;

u16 failed_bar_ssn;
- bool bar_pending;
+ unsigned int bar_pending;
};

/**
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index 30c265c..ea782f1 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -17,6 +17,7 @@
#include "led.h"
#include "wme.h"

+#define MAX_BAR_RETRIES 3

void ieee80211_tx_status_irqsafe(struct ieee80211_hw *hw,
struct sk_buff *skb)
@@ -171,8 +172,17 @@ static void ieee80211_check_pending_bar(struct
sta_info *sta, u8 *addr, u8 tid)
if (!tid_tx || !tid_tx->bar_pending)
return;

- tid_tx->bar_pending = false;
- ieee80211_send_bar(&sta->sdata->vif, addr, tid, tid_tx->failed_bar_ssn);
+ if (--tid_tx->bar_pending) {
+ ieee80211_send_bar(&sta->sdata->vif, addr, tid,
+ tid_tx->failed_bar_ssn);
+ return;
+ }
+
+ /*
+ * The same BAR failed multiple times, something is clearly wrong
+ * -> Stop the BA session.
+ */
+ ieee80211_stop_tx_ba_session(&sta->sta, tid);
}

static void ieee80211_frame_acked(struct sta_info *sta, struct sk_buff *skb)
@@ -225,8 +235,16 @@ static void ieee80211_set_bar_pending(struct
sta_info *sta, u8 tid, u16 ssn)
if (!tid_tx)
return;

+ /*
+ * A BAR for the same SSN is still pending, don't
+ * update the pending count.
+ */
+ if (tid_tx->failed_bar_ssn == ssn &&
+ tid_tx->bar_pending)
+ return;
+
tid_tx->failed_bar_ssn = ssn;
- tid_tx->bar_pending = true;
+ tid_tx->bar_pending = MAX_BAR_RETRIES;
}

static int ieee80211_tx_radiotap_len(struct ieee80211_tx_info *info)

2012-01-20 07:32:15

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Andreas Hartmann schrieb:
> Hello Helmut,
>
> Am Tue, 10 Jan 2012 10:50:16 +0100
> schrieb Helmut Schaa <[email protected]>:
>
>> On Tue, Jan 10, 2012 at 9:03 AM, Andreas Hartmann
>> <[email protected]> wrote:
> [...]
>>> Looks as if sending of any package is reported as broken at some
>>> point of time and this circle cannot be left anymore.
>>
>> Not necessarily since mac80211 will only retry the BAR if a data frame
>> transmission was successful. Hence, it seems as if only the BARs TX
>> status is reported incorrectly ...
>>
>> I see two issues here:
>> 1) rt2800pci seems to have problems delivering the BAR _or_ doesn't
>> report the tx status correctly
>> 2) If the same BAR fails consecutively we should maybe really tear down
>> the BA session as it was done before
>>
>> 2 is done in the below untested patch and will also work around 1.
>> Mind to give it a try?
>
> I encountered some temporary, short stalls during low bandwith
> (running X terminal sessions tunneled through ssh). Therefore I set
>
> MAX_BAR_RETRIES=0
>
> This means,
>
> ieee80211_send_bar(&sta->sdata->vif, addr, tid,
> tid_tx->failed_bar_ssn);
>
> isn't called any more. But that's not all: the BA session isn't torn
> down any more, too - but it's working fine (high bandwith, too) anyway!?
> Why? (If I remove your patch again, the session completely stalls as
> ever).

Cool, this behaves completely other with ath9k-based STA. In the latter,
I'm "sometimes" getting entries like these in messages of the AP:

Jan 19 13:51:06 pc kernel: [17742.408040] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:51:19 pc kernel: [17755.342010] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:51:19 pc kernel: [17755.342013] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:51:19 pc kernel: [17755.342015] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:51:19 pc kernel: [17755.342016] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:14 pc kernel: [17810.271100] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:14 pc kernel: [17810.271106] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:14 pc kernel: [17810.271111] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:14 pc kernel: [17810.271114] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:14 pc kernel: [17810.271118] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:14 pc kernel: [17810.399106] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:14 pc kernel: [17810.399113] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:14 pc kernel: [17810.399117] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:14 pc kernel: [17810.399120] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:14 pc kernel: [17810.399124] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:59 pc kernel: [17855.581007] net_ratelimit: 3 callbacks suppressed
Jan 19 13:52:59 pc kernel: [17855.581011] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:59 pc kernel: [17855.581013] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:59 pc kernel: [17855.581016] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:59 pc kernel: [17855.581018] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:59 pc kernel: [17855.581020] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:59 pc kernel: [17855.582015] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:59 pc kernel: [17855.582018] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:59 pc kernel: [17855.582020] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:59 pc kernel: [17855.582022] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames
Jan 19 13:52:59 pc kernel: [17855.582023] ieee80211 phy0: release an RX reorder frame due to timeout on earlier frames


nfs or streaming protocols aren't that happy with this situation. Iaw:
it's unusable.


All in one: Your workaround was a good idea, but unfortunately it
doesn't "solve" / workaround the problem introduced with the patch
"mac80211: retry sending failed BAR frames later instead of tearing
down aggr" [1]. I don't know off any other solution at this time as to revert it.

Do you have any idea how to fix the problem introduced with [1]?


Kind regards and thank you!
Andreas


[1] http://www.spinics.net/lists/linux-wireless/msg76379.html

2012-01-05 08:46:37

by Andreas Hartmann

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

Helmut Schaa schrieb:
> Hi,
>
> Am Mittwoch, 4. Januar 2012, 09:38:02 schrieb Andreas Hartmann:
>> Helmut Schaa schrieb:
>>>> Removing this patch
>>>>
>>>> mac80211: retry sending failed BAR frames later instead of tearing down
>>>> aggr (http://www.spinics.net/lists/linux-wireless/msg76379.html) makes
>>>> it working again.
>>>>
>>>> Device is: RaLink RT2800 802.11n PCI
>>>
>>> What's the client device connected to the rt2x00 AP?
>>
>> It's a rt3572 usb chip (Linksys WUSB600N v2), driven with the rt3572sta
>> module.
>>
>>> Mind to send a
>>> 802.11 capture wen this stall happens?
>
> Andreas provided me with a capture now. The interesting part is that
> the rt2x00 AP sends bursts of BARs with the same sequence number when
> the stall happens. I only had a quick look at the code in question but
> couldn't see anything wrong at a first glance.


Some additional information:

- The AP initiates BARs for 6 different, successive sequence numbers.
- The first 5 sequence numbers are finished after sending of about 8 (!)
BAR's.
Important: each BAR is responded *instantly* by the STA.
- The BAR for the next sequence number is sent "endless" though the STA
responses to each BAR instantly again.


Another thing: it is striking, that the performance from STA -> AP is at
max 8,5 MBit/s, whereas the performance from AP -> STA can be 15 MBit/s
(all measured with netperf).


Could it be, that somewhere in the stack of the AP the response packet
from STA -> AP get's lost or ignored or blocked?



Kind regards,
Andreas

2012-01-24 08:07:05

by Helmut Schaa

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Mon, Jan 23, 2012 at 6:44 PM, Andreas Hartmann
<[email protected]> wrote:
> Am Mon, 23 Jan 2012 17:36:57 +0100
> schrieb Helmut Schaa <[email protected]>:
>
>> On Mon, Jan 23, 2012 at 5:03 PM, Andreas Hartmann
>> <[email protected]> wrote:
>> > Helmut Schaa schrieb:
>> >> On Fri, Jan 20, 2012 at 8:29 AM, Andreas Hartmann
>> >> <[email protected]> wrote:
>> >>> All in one: Your workaround was a good idea, but unfortunately it
>> >>> doesn't "solve" / workaround the problem introduced with the patch
>> >>> "mac80211: retry sending failed BAR frames later instead of tearing
>> >>> down aggr" [1]. I don't know off any other solution at this time as to revert it.
>> >>
>> >> Ok, so we've got at least one hw issue here: rt2800 is not able to report
>> >> correct ACK state of BARs :(
>> >>
>> >> The interesting bit is that the legacy drivers tear the BA session down as
>> >> soon as the first AMPDU subframe failed. That's the same behavior as
>> >> mac80211 had before the mentioned patch.
>> >>
>> >> So, let's assume the generic way Felix implemented works for other devices
>> >> (like ath9k) we could go with a workaround inside rt2x00 to tear down
>> >> the BA session as soon as the first AMPDU frame failed.
>> >>
>> >> Mind to try that one (not even compile tested though :) ).
>> >
>> > hdr is missing. I tried to derive it like this:
>> >
>> > struct ieee80211_hdr *hdr=(void*)entry->skb->data;
>>
>> Looks correct but you should put it right before the ieee80211_is_data_qos check
>> not at the start of the function ...
>
> I did it like this:
>
> --- drivers/net/wireless/rt2x00/rt2x00dev.c.orig ? ? ? ?2012-01-23 16:50:03.188718671 +0100
> +++ drivers/net/wireless/rt2x00/rt2x00dev.c ? ? 2012-01-23 18:11:56.248553764 +0100
> @@ -388,8 +388,17 @@
> ? ? ? ? ? ? ? ?tx_info->status.ampdu_len = 1;
> ? ? ? ? ? ? ? ?tx_info->status.ampdu_ack_len = success ? 1 : 0;
>
> - ? ? ? ? ? ? ? if (!success)
> - ? ? ? ? ? ? ? ? ? ? ? tx_info->flags |= IEEE80211_TX_STAT_AMPDU_NO_BACK;
> + ? ? ? ? ? ? ? struct ieee80211_hdr *hdr = (void*) entry->skb->data;
> + ? ? ? ? ? ? ? if (!success && tx_info->control.sta != NULL && ieee80211_is_data_qos(hdr->frame_control)) {
> + ? ? ? ? ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ? ? ? ? ?* Tear down BA session
> + ? ? ? ? ? ? ? ? ? ? ? ?*/
> + ? ? ? ? ? ? ? ? ? ? ? struct ieee80211_sta *sta = tx_info->control.sta;
> + ? ? ? ? ? ? ? ? ? ? ? u8 *qc = ieee80211_get_qos_ctl(hdr);
> + ? ? ? ? ? ? ? ? ? ? ? int tid = *qc & IEEE80211_QOS_CTL_TID_MASK;
> +
> + ? ? ? ? ? ? ? ? ? ? ? ieee80211_stop_tx_ba_session(sta, tid);
> + ? ? ? ? ? ? ? }
> ? ? ? ?}
>
> ? ? ? ?if (rate_flags & IEEE80211_TX_RC_USE_RTS_CTS) {
>
>
> but the machine does crash anyway. Could hdr be null, too?

Ok, I'll do some testing as soon as time allows and get back to you then.

Helmut

2012-01-24 08:06:29

by Helmut Schaa

[permalink] [raw]
Subject: Re: Compat-wireless-3.2-rc6-3 is broken for rt2860 device

On Tue, Jan 24, 2012 at 8:29 AM, Stanislaw Gruszka <[email protected]> wrote:
> On Mon, Jan 23, 2012 at 02:29:08PM +0100, Helmut Schaa wrote:
>> On Fri, Jan 20, 2012 at 8:29 AM, Andreas Hartmann
>> <[email protected]> wrote:
>> > All in one: Your workaround was a good idea, but unfortunately it
>> > doesn't "solve" / workaround the problem introduced with the patch
>> > "mac80211: retry sending failed BAR frames later instead of tearing
>> > down aggr" [1]. I don't know off any other solution at this time as to revert it.
>>
>> Ok, so we've got at least one hw issue here: rt2800 is not able to report
>> correct ACK state of BARs :(
>
> Whould it help, if rt2800 driver will not filter out BAR frames?

I don't think so since we're having problems with tx aggregation and not rx.
Also, BARs are not handled at all in mac80211 AFAIK.

Helmut