2012-06-07 06:53:32

by Sean Patrick Santos

[permalink] [raw]
Subject:

Hello all,

I have had a couple of problems with an adapter using carl9170 (the
device is an X.USB from AirLive). This is with the in-kernel version
from 3.4, and having build the git version of the firmware (although I
switched to an older binary at some point and saw no difference; I
can't date it though).

The first "problem", which is actually mild enough that I wouldn't
bother writing in about it if it was the only issue, is that I get a
lot of messages like these:

[13215.559590] ieee80211 phy2: channel change: 2432 -> 2437 failed (2).
[13215.844041] ieee80211 phy2: channel change: -1 -> 2437 failed (2).
[13215.844044] usb 3-1.2: restart device (7)
[13216.983593] usb 3-1.2: device restarted successfully.
[13216.988414] ieee80211 phy2: Hardware restart was requested

Also occasional blocks like these:

[13548.136457] ieee80211 phy2: invalid plcp cck rate (0).
[13597.224429] ieee80211 phy2: invalid plcp cck rate (0).
[13601.512838] ieee80211 phy2: invalid plcp cck rate (0).

I gather from a previous post on this mailing list that these are
signs of interference in the area, which doesn't surprise me. I have a
draft-N router that can only use the 2.4Ghz range, and there are three
cordless phones, a wall, a microwave, and several other wireless
devices between the adapter and the router. This doesn't bother me
that much, because when the above messages are being printed
performance is usually still OK, and when a restart does happen the
device recovers rapidly. Plus, I'm somewhat stuck with the situation,
since I don't have much control over how things are arranged in this
space, and because the other adapter I have on hand is even worse off,
both in terms of hardware and drivers.

The second, more troubling problem is that I seem to get a "silent"
failure (at least I can't find any errors) if I start a large download
or set of downloads that take more than 10 seconds to a minute (in
particular, trying to clone large directories using mercurial is
impossible, because it will always trigger this problem, though for
some reason git and subversion work most, but not all, of the time).
What I mean by "failure" is that one of these two things will happen:

1. The device will simply fail to receive anything (or trickle out to
a rate of 500 bytes/min), at which point it will remain in that state
for hours, occasionally registering minuscule amounts of activity,
unless it is dis/reconnected to the wireless network (toggling power
to the device or reloading the module also work, but do neither better
nor worse than just reassociating). Upon reconnecting it immediately
works fine, as long as I don't trigger the same problem again.

2. Less often, the device will fail as above, but then suddenly start
working again another minute or so later, allowing the process that
had been overworking it (mercurial, wget, firefox, whatever) to
continue what it was doing for another 10-50 seconds, at which point
the connection trickles out again. This cycle keeps happening until
the process either completes successfully, errors and dies, hangs, or
is killed, at which point everything seems fine again. (Killing the
process does not solve the problem, it just prevents the same process
from causing the problem *again* in the event that the device
spontaneously recovers within a minute or two).

I know that it's physically possible to get a stable connection here,
because the Windows installation on the same machine can almost always
get fairly good speeds with the same device in the same place on the
same network at the same time of day (~25Mbps, lose connection maybe
once per 40 hours of use). I also know this because I can always fix
the problem manually on Linux by reconnecting the device. What I'm not
sure about is what the problem is with carl9170, or how to convince it
to be more fault-tolerant. (Is this behavior an overreaction to the
noise level? Is it hanging while waiting for some event that may never
happen?) I'm afraid I'm not even sure how to diagnose the problem
further; wireless adapters are not familiar territory for me.

-Sean


2012-06-10 12:20:38

by Christian Lamparter

[permalink] [raw]
Subject: Re: [RFC v3] mac80211: follow 802.11-2007 11.5.3 Error recovery upon HT BA failure

On Sunday 10 June 2012 07:13:25 Emmanuel Grumbach wrote:
> > > But that notification trap would also have to call some function
> > > in the stack, and the drivers that don't have it (or don't
> > > implement it yet) need something more like this patch?
> >
> > (see v3 attached to this mail).
> >
> > Yes, drivers that don't pass BAs to the stack and won't call
> > ieee80211_ba_notification will have their BA sessions "removed" in v3.
> > Of course, we can also retain the old behavoir and reset the
> > tx ba session timeout until we know we don't break anything.
>
> But if the peer acked a packet sent in a BA session, then it must send
> a BA (spec AFAIK), so when the driver calls ieee80211_tx_status at the
> originator side, mac80211 can assume that a BA was received, right?
Not necessarily, some wifi devices (especially USB) just don't provide
a status report. I think rtlwifi is a example for this case, since it
just sticks "STAT_ACK" on every package.

(Also, when the signal is bad, some devices fall back to legacy (retry)
rates and abandon aggregation at least for the time... so no, we
have to look elsewhere. Furthermore, I haven't found any
references in the spec that explicitly says that you can't
do that (e.g. A-MSDU vs A-MPDU mixed, etc...), so do you know
the paragraph that says BAs and normal acks can't be mixed?)
> Even if you still want a specific notification from the driver to
> mac80211, what about adding it to ieee80211_tx_info.status ?
> We have a trade-off here. We can have a new notification which is a
> new API (which is a disadvantage in terms of runtime and design) but
> can be called once per batch: meaning that we will reset the the timer
> once every 15 packets or so. Or we can have a bit in the
> ieee80211_tx_info.status which would be in every packet, but then, no
> need for new API.
> Maybe the best way is to have the driver set the new bit in
> ieee80211_tx_info.status only for the last packet of the batch (or the
> first, whatever).
If you look in ieee80211_tx_prep_agg you'll see that in the current
design the timer is already updated for every frame as well. So if
we just move this to a different place so the "cost" should remain
the same, right?

Regards,
Christian

2012-06-08 07:57:27

by Sean Patrick Santos

[permalink] [raw]
Subject: Re: carl9170 issue

I believe that I have found the commit that introduced this problem,
which was a change in mac80211. However, I'm out of my depth in
figuring out what a really "correct" solution is; all I've done is a
trial-and-error bisection. The commit in question:

commit f0425beda4d404a6e751439b562100b902ba9c98
Author: Felix Fietkau <[email protected]>
Date: Sun Aug 28 21:11:01 2011 +0200

mac80211: retry sending failed BAR frames later instead of tearing down aggr

Unfortunately failed BAR tx attempts happen more frequently than I
expected, and the resulting aggregation teardowns cause performance
issues, as the aggregation session does not always get re-established
properly.
Instead of tearing down the entire aggr session, we can simply store the
SSN of the last failed BAR tx attempt, wait for the first successful
tx status event, and then send another BAR with the same SSN.

Signed-off-by: Felix Fietkau <[email protected]>
Cc: Helmut Schaa <[email protected]>
Signed-off-by: John W. Linville <[email protected]>

This looks relevant. As a matter of personal convenience, I might try
backing out the change tomorrow if it seems that it'll help.

On Thu, Jun 7, 2012 at 1:31 PM, Sean Patrick Santos
<[email protected]> wrote:
> Thanks for the quick response, Christian,
>
>>> Plus, I'm somewhat stuck with the situation,
>>> since I don't have much control over how things are arranged in this
>>> space, and because the other adapter I have on hand is even worse off,
>>> both in terms of hardware and drivers.
>> Fair enough. But what's the other adapter?
>
> It's a Belkin N150 (http://www.wikidevi.com/wiki/Belkin_F9L1001),
> which didn't seem to work for more that 30 seconds at a time, and had
> low throughput even then. The driver may have improved, or be easier
> to fix, since the last time I tried it, but the actual antenna on the
> device is not suitable for this environment and so I haven't been
> motivated to make it work. The "official" driver on Windows gave
> halfway serviceable performance, emphasis firmly on the "halfway". I
> may get around to re-testing/troubleshooting it if I want to use it
> elsewhere or if I'm bored some time, but that's really a different
> question.
>
>> Your problems sound somewhat familiar to
>> "Re: carl9170 driver - network connection breaks"
>> <http://permalink.gmane.org/gmane.linux.kernel.wireless.general/88223>
>>
>> So far no-one has been able to bisect the bug (last good was 3.1, so
>> to breakage must have occurred between 3.1 and 3.2). I would have
>> looked into it long ago, but I can't reproduce.
>
> I may see if I can bisect it, since evidently I can reproduce the bug
> quite easily. At the least I can find out if this really is the same
> problem and not present in kernel 3.1. I'll let you know if I find
> anything interesting.
>
>> PS: If your kernel was compiled with CONFIG_MAC80211_DEBUGFS you can
>> "restart" BA/aggregation sessions by
>> echo "tx stop 0" > /sys/kernel/debug/ieee80211/phyX/netdev:wlanXY/stations/AP-MAC/agg_status
>> echo "rx stop 0" > /sys/kernel/debug/ieee80211/phyX/netdev:wlanXY/stations/AP-MAC/agg_status
>> alternatively: you can disable ht by loading the module with 'noht=1' parameter
>
> Thanks for the suggestion; I'll try this. I should note that, although
> I can't change the physical locations of anything here (at least for
> another month or two), I do manage both the machine and the wireless
> router.

2012-06-09 12:20:49

by Helmut Schaa

[permalink] [raw]
Subject: Re: BA session issue due to old BARs?

On Sat, Jun 9, 2012 at 2:18 AM, Christian Lamparter
<[email protected]> wrote:
> On Friday 08 June 2012 09:57:26 Sean Patrick Santos wrote:
>> I believe that I have found the commit that introduced this problem,
>> which was a change in mac80211. However, I'm out of my depth in
>> figuring out what a really "correct" solution is; all I've done is a
>> trial-and-error bisection. The commit in question:
>>
>> commit f0425beda4d404a6e751439b562100b902ba9c98
>> Author: Felix Fietkau <[email protected]>
>> Date: ? Sun Aug 28 21:11:01 2011 +0200
>>
>> ? ? mac80211: retry sending failed BAR frames later instead of tearing down aggr
>>
>> ? ? Unfortunately failed BAR tx attempts happen more frequently than I
>> ? ? expected, and the resulting aggregation teardowns cause performance
>> ? ? issues, as the aggregation session does not always get re-established
>> ? ? properly.
>> ? ? Instead of tearing down the entire aggr session, we can simply store the
>> ? ? SSN of the last failed BAR tx attempt, wait for the first successful
>> ? ? tx status event, and then send another BAR with the same SSN.
>>
>> ? ? Signed-off-by: Felix Fietkau <[email protected]>
>> ? ? Cc: Helmut Schaa <[email protected]>
>> ? ? Signed-off-by: John W. Linville <[email protected]>
>>
>> This looks relevant. As a matter of personal convenience, I might try
>> backing out the change tomorrow if it seems that it'll help.
> Felix,
>
> is there any way we can restore the old behavior of tearing
> down BAs due to BAR transmission failures without breaking
> ath9k (or rt2x00)?

No problem with rt2x00 since it has problems reporting the tx status of
BARs :(

> Or am I misinterpreting the commit and
> this patch was just a temporary fix since back then mac80211
> had problems with setting up BA session (and they might
> have been fixed in the meantime?!).
>
> Quote from the first paragraph of the commit:
>> ... the resulting aggregation teardowns cause performance
>> issues, as the aggregation session does not always get
>> re-established properly.

As far as I understood tearing down the BA session and then starting it
again has a much higher impact onto throughput then just resending the
BAR after the next successfully sent AMPDU.

Helmut

2012-06-10 18:55:51

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: [RFC v3] mac80211: follow 802.11-2007 11.5.3 Error recovery upon HT BA failure

>> > > But that notification trap would also have to call some function
>> > > in the stack, and the drivers that don't have it (or don't
>> > > implement it yet) need something more like this patch?
>> >
>> > (see v3 attached to this mail).
>> >
>> > Yes, drivers that don't pass BAs to the stack and won't call
>> > ieee80211_ba_notification will have their BA sessions "removed" in v3.
>> > Of course, we can also retain the old behavoir and reset the
>> > tx ba session timeout until we know we don't break anything.
>>
>> But if the peer acked a packet sent in a BA session, then it must send
>> a BA (spec AFAIK), so when the driver calls ieee80211_tx_status at the
>> originator side, mac80211 can assume that a BA was received, right?
> Not necessarily, some wifi devices (especially USB) just don't provide
> a status report. I think rtlwifi is a example for this case, since it
> just sticks "STAT_ACK" on every package.
>
> (Also, when the signal is bad, some devices fall back to legacy (retry)
> rates and abandon aggregation at least for the time... so no, we
> have to look elsewhere. Furthermore, I haven't found any
> references in the spec that explicitly says that you can't
> do that (e.g. A-MSDU vs A-MPDU mixed, etc...), so do you know
> the paragraph that says BAs and normal acks can't be mixed?)

Frankly I am not an expert at browsing the spec :-)
I just know that Cisco sends BA all the time, but it is Cisco and I
don't remember what other APs do.

>> Even if you still want a specific notification from the driver to
>> mac80211, what about adding it to ieee80211_tx_info.status ?
>> We have a trade-off here. We can have a new notification which is a
>> new API (which is a disadvantage in terms of runtime and design) but
>> can be called once per batch: meaning that we will reset the the timer
>> once every 15 packets or so. Or we can have a bit in the
>> ieee80211_tx_info.status which would be in every packet, but then, no
>> need for new API.
>> Maybe the best way is to have the driver set the new bit in
>> ieee80211_tx_info.status only for the last packet of the batch (or the
>> first, whatever).
> If you look in ieee80211_tx_prep_agg you'll see that in the current
> design the timer is already updated for every frame as well. So if
> we just move this to a different place so the "cost" should remain
> the same, right?
>

That sounds right

2012-06-07 19:31:22

by Sean Patrick Santos

[permalink] [raw]
Subject: Re: carl9170 issue

Thanks for the quick response, Christian,

>> Plus, I'm somewhat stuck with the situation,
>> since I don't have much control over how things are arranged in this
>> space, and because the other adapter I have on hand is even worse off,
>> both in terms of hardware and drivers.
> Fair enough. But what's the other adapter?

It's a Belkin N150 (http://www.wikidevi.com/wiki/Belkin_F9L1001),
which didn't seem to work for more that 30 seconds at a time, and had
low throughput even then. The driver may have improved, or be easier
to fix, since the last time I tried it, but the actual antenna on the
device is not suitable for this environment and so I haven't been
motivated to make it work. The "official" driver on Windows gave
halfway serviceable performance, emphasis firmly on the "halfway". I
may get around to re-testing/troubleshooting it if I want to use it
elsewhere or if I'm bored some time, but that's really a different
question.

> Your problems sound somewhat familiar to
> "Re: carl9170 driver - network connection breaks"
> <http://permalink.gmane.org/gmane.linux.kernel.wireless.general/88223>
>
> So far no-one has been able to bisect the bug (last good was 3.1, so
> to breakage must have occurred between 3.1 and 3.2). I would have
> looked into it long ago, but I can't reproduce.

I may see if I can bisect it, since evidently I can reproduce the bug
quite easily. At the least I can find out if this really is the same
problem and not present in kernel 3.1. I'll let you know if I find
anything interesting.

> PS: If your kernel was compiled with CONFIG_MAC80211_DEBUGFS you can
> "restart" BA/aggregation sessions by
> echo "tx stop 0" > /sys/kernel/debug/ieee80211/phyX/netdev:wlanXY/stations/AP-MAC/agg_status
> echo "rx stop 0" > /sys/kernel/debug/ieee80211/phyX/netdev:wlanXY/stations/AP-MAC/agg_status
> alternatively: you can disable ht by loading the module with 'noht=1' parameter

Thanks for the suggestion; I'll try this. I should note that, although
I can't change the physical locations of anything here (at least for
another month or two), I do manage both the machine and the wireless
router.

2012-06-09 07:59:31

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] mac80211: follow 802.11-2007 11.5.3 Error recovery upon HT BA failure


> introduced a TX BA session timeout timer. However the
> timer was reset for each outgoing transmission instead
> what 802.11-2007 11.5.3 specifies:

You should probably reference 802.11-2012 now since some things were
renumbered.

> "The inactivity timer at the originator is reset when a
> BlockAck frame corresponding to the TID for which the
> Block Ack policy is set is received."

> This patch was only compile-tested and the design has
> some rather big problems:

Maybe then we should just revert Nikolay's patch?

> - The spec says we should test for BlockAcks, however
> that is not feasible because it is a control frame
> (so FIF_CONTROL might filter it on some hardware).

In some way, all devices are going to have to report these frames. Maybe
not as the frame itself, but as some other notification, to allow
cleaning up the TX queues accordingly, I think? There's a BA
notification in iwlwifi for example.

> +++ b/net/mac80211/status.c
> @@ -439,6 +439,29 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
> }
> }
>
> + if (ieee80211_is_data_qos(fc) && acked &&
> + info->flags & IEEE80211_TX_CTL_AMPDU &&
> + local->hw.flags & IEEE80211_HW_HAS_RATE_CONTROL) {

I'm confused -- what does this have to do with HW_HAS_RATE_CONTROL?

johannes


2012-06-09 14:01:41

by Christian Lamparter

[permalink] [raw]
Subject: Re: [RFC v3] mac80211: follow 802.11-2007 11.5.3 Error recovery upon HT BA failure

On Saturday 09 June 2012 14:28:12 Johannes Berg wrote:
> On Sat, 2012-06-09 at 14:14 +0200, Christian Lamparter wrote:
> > yeah, I should. But 802.11-2012 is not yet available for
> > free like the old 802.11-2007 through IEEE 802 Get.
>
> Hah, ok. I guess I can fix the references when I apply it.
Yeah that would be nice.
The path is
11. -> MLME
5. -> Block Ack operation
3. -> Error Recovery upon a peer failure

> > > > - The spec says we should test for BlockAcks, however
> > > > that is not feasible because it is a control frame
> > > > (so FIF_CONTROL might filter it on some hardware).
> > >
> > > In some way, all devices are going to have to report these
> > > frames. Maybe not as the frame itself, but as some other
> > > notification, to allow cleaning up the TX queues
> > > accordingly, I think? There's a BA notification in iwlwifi
> > > for example.
> > Alright, I guess this means that we probably need a new HW
> > feature flag like:
> > IEEE80211_HW_REPORTS_BA_NOTIFICATION
> > for hardware/firmware which filter BA frames, but have a BA
> > notification trap. And every other driver need to pass BA
> > to the stack where a new rx handler will update the last_tx
> > accordingly.
>
> But that notification trap would also have to call some function
> in the stack, and the drivers that don't have it (or don't
> implement it yet) need something more like this patch?

(see v3 attached to this mail).

Yes, drivers that don't pass BAs to the stack and won't call
ieee80211_ba_notification will have their BA sessions "removed" in v3.
Of course, we can also retain the old behavoir and reset the
tx ba session timeout until we know we don't break anything.

> I haven't actually looked into the specifics yet at all, will do
> that on Monday.
No problem (after all - I could never reproduce this issue). In
the mean time, I'll try to come up with better names for
ieee80211_ba_notification and a proper docbook text.

Regards,
Chr
---
diff --git a/include/net/mac80211.h b/include/net/mac80211.h
index d152f54..eb66bda 100644
--- a/include/net/mac80211.h
+++ b/include/net/mac80211.h
@@ -3531,6 +3531,21 @@ void ieee80211_resume_disconnect(struct ieee80211_vif *vif);
void ieee80211_disable_dyn_ps(struct ieee80211_vif *vif);

/**
+ * ieee80211_ba_notification - update block ack timeout timer
+ *
+ * @sta: &struct ieee80211_sta pointer of the received BA
+ * @tid: the TID of the received BA.
+ *
+ * Some hardware filters BlockAck and therefore the stack
+ * an alternative way of notifying the stack that a specific
+ * BA session was active, otherwise the session will time
+ * out.
+ *
+ * This function must be called under RCU lock.
+ */
+void ieee80211_ba_notification(struct ieee80211_sta *sta, const u16 tid);
+
+/**
* ieee80211_enable_dyn_ps - restore dynamic psm after being disabled
*
* @vif: &struct ieee80211_vif pointer from the add_interface callback.
diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
index 6fd2cb0..e14f314 100644
--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -2055,6 +2055,23 @@ ieee80211_rx_h_data(struct ieee80211_rx_data *rx)
return RX_QUEUED;
}

+/* needs to be called under rcu_read_lock */
+static void _ieee80211_ba_notification(struct sta_info *sta, u16 tid)
+{
+ struct tid_ampdu_tx *tid_tx;
+ tid_tx = rcu_dereference(sta->ampdu_mlme.tid_tx[tid]);
+ if (tid_tx) {
+ /*
+ * 802.11-2007 11.5.3 Error recovery upon a peer failure
+ *
+ * The inactivity timer at the originator is reset when
+ * a BlockAck frame corresponding to the TID for which
+ * the Block Ack policy is set is received.
+ */
+ tid_tx->last_tx = jiffies;
+ }
+}
+
static ieee80211_rx_result debug_noinline
ieee80211_rx_h_ctrl(struct ieee80211_rx_data *rx)
{
@@ -2069,6 +2086,21 @@ ieee80211_rx_h_ctrl(struct ieee80211_rx_data *rx)
if (likely(!ieee80211_is_ctl(bar->frame_control)))
return RX_CONTINUE;

+ if (ieee80211_is_back(bar->frame_control) && rx->sta) {
+ __le16 ba_control;
+
+ if (!skb_copy_bits(skb,
+ offsetof(struct ieee80211_bar, control),
+ &ba_control, sizeof(ba_control))) {
+ u16 control = le16_to_cpu(ba_control);
+ if (!(control & IEEE80211_BAR_CTRL_MULTI_TID)) {
+ u16 tid = (control & IEEE80211_BAR_CTRL_TID_INFO_MASK)
+ >> IEEE80211_BAR_CTRL_TID_INFO_SHIFT;
+ _ieee80211_ba_notification(rx->sta, tid);
+ }
+ }
+ }
+
if (ieee80211_is_back_req(bar->frame_control)) {
struct {
__le16 control, start_seq_num;
@@ -3101,6 +3133,13 @@ void ieee80211_rx(struct ieee80211_hw *hw, struct sk_buff *skb)
}
EXPORT_SYMBOL(ieee80211_rx);

+void ieee80211_ba_notification(struct ieee80211_sta *pubsta, const u16 tid)
+{
+ struct sta_info *sta = container_of(pubsta, struct sta_info, sta);
+ _ieee80211_ba_notification(sta, tid);
+}
+EXPORT_SYMBOL(ieee80211_ba_notification);
+
/* This is a version of the rx handler that can be called from hard irq
* context. Post the skb on the queue and schedule the tasklet */
void ieee80211_rx_irqsafe(struct ieee80211_hw *hw, struct sk_buff *skb)
diff --git a/net/mac80211/sta_info.h b/net/mac80211/sta_info.h
index 3bb24a1..fbe84a7 100644
--- a/net/mac80211/sta_info.h
+++ b/net/mac80211/sta_info.h
@@ -103,7 +103,7 @@ enum ieee80211_sta_info_flags {
* @dialog_token: dialog token for aggregation session
* @timeout: session timeout value to be filled in ADDBA requests
* @state: session state (see above)
- * @last_tx: jiffies of last tx activity
+ * @last_tx: jiffies of last successful tx activity
* @stop_initiator: initiator of a session stop
* @tx_stop: TX DelBA frame when stopping
* @buf_size: reorder buffer size at receiver
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index af25c4e..dd02a1f 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1061,12 +1061,10 @@ static bool ieee80211_tx_prep_agg(struct ieee80211_tx_data *tx,
int tid)
{
bool queued = false;
- bool reset_agg_timer = false;
struct sk_buff *purge_skb = NULL;

if (test_bit(HT_AGG_STATE_OPERATIONAL, &tid_tx->state)) {
info->flags |= IEEE80211_TX_CTL_AMPDU;
- reset_agg_timer = true;
} else if (test_bit(HT_AGG_STATE_WANT_START, &tid_tx->state)) {
/*
* nothing -- this aggregation session is being started
@@ -1098,7 +1096,6 @@ static bool ieee80211_tx_prep_agg(struct ieee80211_tx_data *tx,
/* do nothing, let packet pass through */
} else if (test_bit(HT_AGG_STATE_OPERATIONAL, &tid_tx->state)) {
info->flags |= IEEE80211_TX_CTL_AMPDU;
- reset_agg_timer = true;
} else {
queued = true;
info->control.vif = &tx->sdata->vif;
@@ -1112,11 +1109,6 @@ static bool ieee80211_tx_prep_agg(struct ieee80211_tx_data *tx,
if (purge_skb)
dev_kfree_skb(purge_skb);
}
-
- /* reset session timer */
- if (reset_agg_timer && tid_tx->timeout)
- tid_tx->last_tx = jiffies;
-
return queued;
}


2012-06-09 01:11:26

by Christian Lamparter

[permalink] [raw]
Subject: [RFC] mac80211: follow 802.11-2007 11.5.3 Error recovery upon HT BA failure

The commit:

commit 285fa6958c1d56469ec8a0e879ae7487a4e62840
Author: Nikolay Martynov <[email protected]>
Date: Tue Nov 22 21:50:28 2011 -0500

mac80211: timeout tx agg sessions in way similar to rx agg sessions

introduced a TX BA session timeout timer. However the
timer was reset for each outgoing transmission instead
what 802.11-2007 11.5.3 specifies:
"The inactivity timer at the originator is reset when a
BlockAck frame corresponding to the TID for which the
Block Ack policy is set is received."

Reported-by: Sean Patrick Santos <[email protected]>
Reported-by: MikoĊ‚aj Kuligowski <[email protected]>
Reported-by: Per-Erik Westerberg <[email protected]>
Cc: [email protected]
---
Note:

This patch was only compile-tested and the design has
some rather big problems:

- The spec says we should test for BlockAcks, however
that is not feasible because it is a control frame
(so FIF_CONTROL might filter it on some hardware).

- The IEEE80211_HW_REPORTS_TX_ACK_STATUS feature flag
does not necessairly extends to BlockAck as well, so
this would need to be changed as well.

(- The session timeout adds a delay.)

Regards,
Chr
---
diff --git a/net/mac80211/sta_info.h b/net/mac80211/sta_info.h
index 3bb24a1..fbe84a7 100644
--- a/net/mac80211/sta_info.h
+++ b/net/mac80211/sta_info.h
@@ -103,7 +103,7 @@ enum ieee80211_sta_info_flags {
* @dialog_token: dialog token for aggregation session
* @timeout: session timeout value to be filled in ADDBA requests
* @state: session state (see above)
- * @last_tx: jiffies of last tx activity
+ * @last_tx: jiffies of last successful tx activity
* @stop_initiator: initiator of a session stop
* @tx_stop: TX DelBA frame when stopping
* @buf_size: reorder buffer size at receiver
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index 6b4f425..96eebb6 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -439,6 +439,29 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
}
}

+ if (ieee80211_is_data_qos(fc) && acked &&
+ info->flags & IEEE80211_TX_CTL_AMPDU &&
+ local->hw.flags & IEEE80211_HW_HAS_RATE_CONTROL) {
+ struct tid_ampdu_tx *tid_tx;
+ u8 *qc;
+ u16 tid;
+
+ qc = ieee80211_get_qos_ctl(hdr);
+ tid = qc[0] & 0xf;
+ tid_tx = rcu_dereference(sta->ampdu_mlme.tid_tx[tid]);
+ if (!tid_tx)
+ return;
+
+ /*
+ * 802.11-2007 11.5.3 Error recovery upon a peer failure
+ *
+ * The inactivity timer at the originator is reset when
+ * a BlockAck frame corresponding to the TID for which
+ * the Block Ack policy is set is received.
+ */
+ tid_tx->last_tx = jiffies;
+ }
+
if (info->flags & IEEE80211_TX_STAT_TX_FILTERED) {
ieee80211_handle_filtered_frame(local, sta, skb);
rcu_read_unlock();
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index af25c4e..d3abee9 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1098,7 +1098,8 @@ static bool ieee80211_tx_prep_agg(struct ieee80211_tx_data *tx,
/* do nothing, let packet pass through */
} else if (test_bit(HT_AGG_STATE_OPERATIONAL, &tid_tx->state)) {
info->flags |= IEEE80211_TX_CTL_AMPDU;
- reset_agg_timer = true;
+ if (!(tx->local->hw.flags & IEEE80211_HW_HAS_RATE_CONTROL))
+ reset_agg_timer = true;
} else {
queued = true;
info->control.vif = &tx->sdata->vif;


2012-06-09 00:18:34

by Christian Lamparter

[permalink] [raw]
Subject: Re: BA session issue due to old BARs?

On Friday 08 June 2012 09:57:26 Sean Patrick Santos wrote:
> I believe that I have found the commit that introduced this problem,
> which was a change in mac80211. However, I'm out of my depth in
> figuring out what a really "correct" solution is; all I've done is a
> trial-and-error bisection. The commit in question:
>
> commit f0425beda4d404a6e751439b562100b902ba9c98
> Author: Felix Fietkau <[email protected]>
> Date: Sun Aug 28 21:11:01 2011 +0200
>
> mac80211: retry sending failed BAR frames later instead of tearing down aggr
>
> Unfortunately failed BAR tx attempts happen more frequently than I
> expected, and the resulting aggregation teardowns cause performance
> issues, as the aggregation session does not always get re-established
> properly.
> Instead of tearing down the entire aggr session, we can simply store the
> SSN of the last failed BAR tx attempt, wait for the first successful
> tx status event, and then send another BAR with the same SSN.
>
> Signed-off-by: Felix Fietkau <[email protected]>
> Cc: Helmut Schaa <[email protected]>
> Signed-off-by: John W. Linville <[email protected]>
>
> This looks relevant. As a matter of personal convenience, I might try
> backing out the change tomorrow if it seems that it'll help.
Felix,

is there any way we can restore the old behavior of tearing
down BAs due to BAR transmission failures without breaking
ath9k (or rt2x00)? Or am I misinterpreting the commit and
this patch was just a temporary fix since back then mac80211
had problems with setting up BA session (and they might
have been fixed in the meantime?!).

Quote from the first paragraph of the commit:
> ... the resulting aggregation teardowns cause performance
> issues, as the aggregation session does not always get
> re-established properly.

Regards,
Christian

2012-06-09 12:14:47

by Christian Lamparter

[permalink] [raw]
Subject: Re: [RFC v2] mac80211: follow 802.11-2007 11.5.3 Error recovery upon HT BA failure

On Saturday 09 June 2012 09:59:27 Johannes Berg wrote:
> > introduced a TX BA session timeout timer. However the
> > timer was reset for each outgoing transmission instead
> > what 802.11-2007 11.5.3 specifies:
>
> You should probably reference 802.11-2012 now since some
> things were renumbered.
yeah, I should. But 802.11-2012 is not yet available for
free like the old 802.11-2007 through IEEE 802 Get.

> > "The inactivity timer at the originator is reset when a
> > BlockAck frame corresponding to the TID for which the
> > Block Ack policy is set is received."
>
> > This patch was only compile-tested and the design has
> > some rather big problems:
>
> Maybe then we should just revert Nikolay's patch?
(I was referring to this RFC patch and not his patch)
I don't think we can/should revert it, just change it
a bit so the timeout reset is triggered by a received
BA from the peer.

> > - The spec says we should test for BlockAcks, however
> > that is not feasible because it is a control frame
> > (so FIF_CONTROL might filter it on some hardware).
>
> In some way, all devices are going to have to report these
> frames. Maybe not as the frame itself, but as some other
> notification, to allow cleaning up the TX queues
> accordingly, I think? There's a BA notification in iwlwifi
> for example.
Alright, I guess this means that we probably need a new HW
feature flag like:
IEEE80211_HW_REPORTS_BA_NOTIFICATION
for hardware/firmware which filter BA frames, but have a BA
notification trap. And every other driver need to pass BA
to the stack where a new rx handler will update the last_tx
accordingly.

(yeah, that IEEE80211_HW_HAS_RATE_CONTROL vs. ACK_REPORT mix-up
was a stupid error - anyway here's the last updated version
of this approach - v3 will use IEEE80211_HW_REPORTS_BA_NOTIFICATION)
---
diff --git a/net/mac80211/sta_info.h b/net/mac80211/sta_info.h
index 3bb24a1..fbe84a7 100644
--- a/net/mac80211/sta_info.h
+++ b/net/mac80211/sta_info.h
@@ -103,7 +103,7 @@ enum ieee80211_sta_info_flags {
* @dialog_token: dialog token for aggregation session
* @timeout: session timeout value to be filled in ADDBA requests
* @state: session state (see above)
- * @last_tx: jiffies of last tx activity
+ * @last_tx: jiffies of last successful tx activity
* @stop_initiator: initiator of a session stop
* @tx_stop: TX DelBA frame when stopping
* @buf_size: reorder buffer size at receiver
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index 6b4f425..dfc61de 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -439,6 +439,28 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
}
}

+ if (ieee80211_is_data_qos(fc) && acked &&
+ info->flags & IEEE80211_TX_CTL_AMPDU &&
+ local->hw.flags & IEEE80211_HW_REPORTS_TX_ACK_STATUS) {
+ struct tid_ampdu_tx *tid_tx;
+ u8 *qc;
+ u16 tid;
+
+ qc = ieee80211_get_qos_ctl(hdr);
+ tid = qc[0] & 0xf;
+ tid_tx = rcu_dereference(sta->ampdu_mlme.tid_tx[tid]);
+ if (tid_tx) {
+ /*
+ * 802.11-2007 11.5.3 Error recovery upon a peer failure
+ *
+ * The inactivity timer at the originator is reset when
+ * a BlockAck frame corresponding to the TID for which
+ * the Block Ack policy is set is received.
+ */
+ tid_tx->last_tx = jiffies;
+ }
+ }
+
if (info->flags & IEEE80211_TX_STAT_TX_FILTERED) {
ieee80211_handle_filtered_frame(local, sta, skb);
rcu_read_unlock();
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index af25c4e..67887c5 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1098,7 +1098,9 @@ static bool ieee80211_tx_prep_agg(struct ieee80211_tx_data *tx,
/* do nothing, let packet pass through */
} else if (test_bit(HT_AGG_STATE_OPERATIONAL, &tid_tx->state)) {
info->flags |= IEEE80211_TX_CTL_AMPDU;
- reset_agg_timer = true;
+ if (!(tx->local->hw.flags &
+ IEEE80211_HW_REPORTS_TX_ACK_STATUS))
+ reset_agg_timer = true;
} else {
queued = true;
info->control.vif = &tx->sdata->vif;

2012-06-09 14:55:39

by Felix Fietkau

[permalink] [raw]
Subject: Re: BA session issue due to old BARs?

On 2012-06-09 4:23 PM, Christian Lamparter wrote:
> On Saturday 09 June 2012 14:20:48 Helmut Schaa wrote:
>> On Sat, Jun 9, 2012 at 2:18 AM, Christian Lamparter
>> <[email protected]> wrote:
>> > On Friday 08 June 2012 09:57:26 Sean Patrick Santos wrote:
>> >> I believe that I have found the commit that introduced this problem,
>> >> which was a change in mac80211. However, I'm out of my depth in
>> >> figuring out what a really "correct" solution is; all I've done is a
>> >> trial-and-error bisection. The commit in question:
>> >>
>> >> commit f0425beda4d404a6e751439b562100b902ba9c98
>> >> Author: Felix Fietkau <[email protected]>
>> >> Date: Sun Aug 28 21:11:01 2011 +0200
>> >>
>> >> mac80211: retry sending failed BAR frames later instead of tearing down aggr
>> >>
>> >> Unfortunately failed BAR tx attempts happen more frequently than I
>> >> expected, and the resulting aggregation teardowns cause performance
>> >> issues, as the aggregation session does not always get re-established
>> >> properly.
>> >> Instead of tearing down the entire aggr session, we can simply store the
>> >> SSN of the last failed BAR tx attempt, wait for the first successful
>> >> tx status event, and then send another BAR with the same SSN.
>> >>
>> >> Signed-off-by: Felix Fietkau <[email protected]>
>> >> Cc: Helmut Schaa <[email protected]>
>> >> Signed-off-by: John W. Linville <[email protected]>
>> >>
>> >> This looks relevant. As a matter of personal convenience, I might try
>> >> backing out the change tomorrow if it seems that it'll help.
>> > Felix,
>> >
>> > is there any way we can restore the old behavior of tearing
>> > down BAs due to BAR transmission failures without breaking
>> > ath9k (or rt2x00)?
>>
>> No problem with rt2x00 since it has problems reporting the
>> tx status of BARs :(
> hm, does the hardware pass BA to the driver? Because, the
> BA which results from a BAR usually contains the ssn from
> the BAR (just the ra / ta is switched). Come to think of it
> the ssn and the bitmap might also be useful to map tx_status
> to a specific skb.
>
>> > Or am I misinterpreting the commit and
>> > this patch was just a temporary fix since back then mac80211
>> > had problems with setting up BA session (and they might
>> > have been fixed in the meantime?!).
>> >
>> > Quote from the first paragraph of the commit:
>> >> ... the resulting aggregation teardowns cause performance
>> >> issues, as the aggregation session does not always get
>> >> re-established properly.
>>
>> As far as I understood tearing down the BA session and then
>> starting it again has a much higher impact onto throughput
>> then just resending the BAR after the next successfully sent
>> AMPDU.
> Well, at least in case of carl9170 it looks like the resending
> BARs are confusing the receiver reorder buffer on the other
> HT peer. Since it looks like the HT peer dump hardware is still
> ACKs incoming frames from a carl9170 station, but the software
> reorder buffer on the HT peer is silently dropping the data frames.
> [Of course, data frames from other TIDs, MGMT (probes) or null-
> frames are not reordered... so the connection polling with
> null-frames does not detect the bad state and won't try to
> recover).
I think we need to figure out what exactly confuses the receiver reorder
buffer of a HT peer. It sounds to me like something is causing a gap in
sequence number allocation, but I don't know what would do that. Where
exactly is the connection between BAR retransmission and those reorder
issues?

- Felix

2012-06-09 12:28:15

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC v2] mac80211: follow 802.11-2007 11.5.3 Error recovery upon HT BA failure

On Sat, 2012-06-09 at 14:14 +0200, Christian Lamparter wrote:

> yeah, I should. But 802.11-2012 is not yet available for
> free like the old 802.11-2007 through IEEE 802 Get.

Hah, ok. I guess I can fix the references when I apply it.

> > > This patch was only compile-tested and the design has
> > > some rather big problems:
> >
> > Maybe then we should just revert Nikolay's patch?
> (I was referring to this RFC patch and not his patch)

Oh.

> I don't think we can/should revert it, just change it
> a bit so the timeout reset is triggered by a received
> BA from the peer.

Makes sense I guess.

> > > - The spec says we should test for BlockAcks, however
> > > that is not feasible because it is a control frame
> > > (so FIF_CONTROL might filter it on some hardware).
> >
> > In some way, all devices are going to have to report these
> > frames. Maybe not as the frame itself, but as some other
> > notification, to allow cleaning up the TX queues
> > accordingly, I think? There's a BA notification in iwlwifi
> > for example.
> Alright, I guess this means that we probably need a new HW
> feature flag like:
> IEEE80211_HW_REPORTS_BA_NOTIFICATION
> for hardware/firmware which filter BA frames, but have a BA
> notification trap. And every other driver need to pass BA
> to the stack where a new rx handler will update the last_tx
> accordingly.

But that notification trap would also have to call some function in the
stack, and the drivers that don't have it (or don't implement it yet)
need something more like this patch?

I haven't actually looked into the specifics yet at all, will do that on
Monday.

johannes


2012-06-09 17:40:09

by Christian Lamparter

[permalink] [raw]
Subject: Re: BA session issue due to old BARs?

On Saturday 09 June 2012 16:55:36 Felix Fietkau wrote:
> On 2012-06-09 4:23 PM, Christian Lamparter wrote:
>> On Saturday 09 June 2012 14:20:48 Helmut Schaa wrote:
>>> On Sat, Jun 9, 2012 at 2:18 AM, Christian Lamparter wrote:
>>>> On Friday 08 June 2012 09:57:26 Sean Patrick Santos wrote:
>>>>> commit f0425beda4d404a6e751439b562100b902ba9c98
>>>>> Author: Felix Fietkau <[email protected]>
>>>>> Date: Sun Aug 28 21:11:01 2011 +0200
>>>>>
>>>>> mac80211: retry sending failed BAR frames later instead of tearing down aggr
>>>>>
>>>> Or am I misinterpreting the commit and
>>>> this patch was just a temporary fix since back then mac80211
>>>> had problems with setting up BA session (and they might
>>>> have been fixed in the meantime?!).
>>> As far as I understood tearing down the BA session and then
>>> starting it again has a much higher impact onto throughput
>>> then just resending the BAR after the next successfully sent
>>> AMPDU.
>> Well, at least in case of carl9170 it looks like the resending
>> BARs are confusing the receiver reorder buffer on the other
>> HT peer. Since it looks like the HT peer dump hardware is still
>> ACKs incoming frames from a carl9170 station, but the software
>> reorder buffer on the HT peer is silently dropping the data frames.
>> [Of course, data frames from other TIDs, MGMT (probes) or null-
>> frames are not reordered... so the connection polling with
>> null-frames does not detect the bad state and won't try to
>> recover).
> I think we need to figure out what exactly confuses the receiver reorder
> buffer of a HT peer. It sounds to me like something is causing a gap in
> sequence number allocation, but I don't know what would do that. Where
> exactly is the connection between BAR retransmission and those reorder
> issues?
The connection is that before the patch the BA session
was torn down when a BAR was not delivered successfully.
Now the code tries to revive the stalled BA sessions by
sending BARs (which at least according to the spec
should work fine as long as the BA session is still active?!).

I guess part of the problem is the implementation of
"11.5.3 Error recovery upon a peer failure" in both peers.
In order to debug this, it would be great to know how the
peers handle a injected frame that has the ACK policy
set to Block ACK but without an active session.
802.11-2007 11.5.3 says (under figure 11-14) that the
frame should be discarded and the receiver has to send
a DELBA. However in mac80211 case (if I didn't miss any
code), the frame is not discarded (in fact it will go
though the rx path unreordered - code is in mac80211/rx.c,
func: ieee80211_rx_reorder_ampdu) and no delba is sent
either... so there's some potential from similar
"undefinied" behavior in other stacks as well.

Regards,
Christian

2012-06-09 14:23:30

by Christian Lamparter

[permalink] [raw]
Subject: Re: BA session issue due to old BARs?

On Saturday 09 June 2012 14:20:48 Helmut Schaa wrote:
> On Sat, Jun 9, 2012 at 2:18 AM, Christian Lamparter
> <[email protected]> wrote:
> > On Friday 08 June 2012 09:57:26 Sean Patrick Santos wrote:
> >> I believe that I have found the commit that introduced this problem,
> >> which was a change in mac80211. However, I'm out of my depth in
> >> figuring out what a really "correct" solution is; all I've done is a
> >> trial-and-error bisection. The commit in question:
> >>
> >> commit f0425beda4d404a6e751439b562100b902ba9c98
> >> Author: Felix Fietkau <[email protected]>
> >> Date: Sun Aug 28 21:11:01 2011 +0200
> >>
> >> mac80211: retry sending failed BAR frames later instead of tearing down aggr
> >>
> >> Unfortunately failed BAR tx attempts happen more frequently than I
> >> expected, and the resulting aggregation teardowns cause performance
> >> issues, as the aggregation session does not always get re-established
> >> properly.
> >> Instead of tearing down the entire aggr session, we can simply store the
> >> SSN of the last failed BAR tx attempt, wait for the first successful
> >> tx status event, and then send another BAR with the same SSN.
> >>
> >> Signed-off-by: Felix Fietkau <[email protected]>
> >> Cc: Helmut Schaa <[email protected]>
> >> Signed-off-by: John W. Linville <[email protected]>
> >>
> >> This looks relevant. As a matter of personal convenience, I might try
> >> backing out the change tomorrow if it seems that it'll help.
> > Felix,
> >
> > is there any way we can restore the old behavior of tearing
> > down BAs due to BAR transmission failures without breaking
> > ath9k (or rt2x00)?
>
> No problem with rt2x00 since it has problems reporting the
> tx status of BARs :(
hm, does the hardware pass BA to the driver? Because, the
BA which results from a BAR usually contains the ssn from
the BAR (just the ra / ta is switched). Come to think of it
the ssn and the bitmap might also be useful to map tx_status
to a specific skb.

> > Or am I misinterpreting the commit and
> > this patch was just a temporary fix since back then mac80211
> > had problems with setting up BA session (and they might
> > have been fixed in the meantime?!).
> >
> > Quote from the first paragraph of the commit:
> >> ... the resulting aggregation teardowns cause performance
> >> issues, as the aggregation session does not always get
> >> re-established properly.
>
> As far as I understood tearing down the BA session and then
> starting it again has a much higher impact onto throughput
> then just resending the BAR after the next successfully sent
> AMPDU.
Well, at least in case of carl9170 it looks like the resending
BARs are confusing the receiver reorder buffer on the other
HT peer. Since it looks like the HT peer dump hardware is still
ACKs incoming frames from a carl9170 station, but the software
reorder buffer on the HT peer is silently dropping the data frames.
[Of course, data frames from other TIDs, MGMT (probes) or null-
frames are not reordered... so the connection polling with
null-frames does not detect the bad state and won't try to
recover).

Regards,
Christian

2012-06-10 05:13:28

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: [RFC v3] mac80211: follow 802.11-2007 11.5.3 Error recovery upon HT BA failure

> The path is
> 11. -> MLME
> ? ? ? ?5. -> Block Ack operation
> ? ? ? ? ? ? ? ?3. -> Error Recovery upon a peer failure
>
> > > > > ?- The spec says we should test for BlockAcks, however
> > > > > ? ?that is not feasible because it is a control frame
> > > > > ? ?(so FIF_CONTROL might filter it on some hardware).
> > > >
> > > > In some way, all devices are going to have to report these
> > > > frames. Maybe not as the frame itself, but as some other
> > > > notification, to allow cleaning up the TX queues
> > > > accordingly, I think? There's a BA notification in iwlwifi
> > > > for example.
> > > Alright, I guess this means that we probably need a new HW
> > > feature flag like:
> > > ? ? IEEE80211_HW_REPORTS_BA_NOTIFICATION
> > > for hardware/firmware which filter BA frames, but have a BA
> > > notification trap. And every other driver need to pass BA
> > > to the stack where a new rx handler will update the last_tx
> > > accordingly.
> >
> > But that notification trap would also have to call some function
> > in the stack, and the drivers that don't have it (or don't
> > implement it yet) need something more like this patch?
>
> (see v3 attached to this mail).
>
> Yes, drivers that don't pass BAs to the stack and won't call
> ieee80211_ba_notification will have their BA sessions "removed" in v3.
> Of course, we can also retain the old behavoir and reset the
> tx ba session timeout until we know we don't break anything.

But if the peer acked a packet sent in a BA session, then it must send
a BA (spec AFAIK), so when the driver calls ieee80211_tx_status at the
originator side, mac80211 can assume that a BA was received, right ?

Even if you still want a specific notification from the driver to
mac80211, what about adding it to ieee80211_tx_info.status ?
We have a trade-off here. We can have a new notification which is a
new API (which is a disadvantage in terms of runtime and design) but
can be called once per batch: meaning that we will reset the the timer
once every 15 packets or so. Or we can have a bit in the
ieee80211_tx_info.status which would be in every packet, but then, no
need for new API.
Maybe the best way is to have the driver set the new bit in
ieee80211_tx_info.status only for the last packet of the batch (or the
first, whatever).

But I don't see why we cannot rely on the assumption I made in the
beginning of this mail to start with.

2012-06-07 13:18:27

by Christian Lamparter

[permalink] [raw]
Subject: Re: carl9170 issue

On Thursday 07 June 2012 08:53:32 Sean Patrick Santos wrote:
> The first "problem", which is actually mild enough that I wouldn't
> bother writing in about it if it was the only issue, is that I get a
> lot of messages like these:
>
> [13215.559590] ieee80211 phy2: channel change: 2432 -> 2437 failed (2).
> [13215.844041] ieee80211 phy2: channel change: -1 -> 2437 failed (2).
> [13215.844044] usb 3-1.2: restart device (7)
> [13216.983593] usb 3-1.2: device restarted successfully.
> [13216.988414] ieee80211 phy2: Hardware restart was requested
>
> Also occasional blocks like these:
>
> [13548.136457] ieee80211 phy2: invalid plcp cck rate (0).
> [13597.224429] ieee80211 phy2: invalid plcp cck rate (0).
> [13601.512838] ieee80211 phy2: invalid plcp cck rate (0).
>
> I gather from a previous post on this mailing list that these are
> signs of interference in the area, which doesn't surprise me. I have a
> draft-N router that can only use the 2.4Ghz range, and there are three
> cordless phones, a wall, a microwave, and several other wireless
> devices between the adapter and the router. This doesn't bother me
> that much, because when the above messages are being printed
> performance is usually still OK, and when a restart does happen the
> device recovers rapidly. Plus, I'm somewhat stuck with the situation,
> since I don't have much control over how things are arranged in this
> space, and because the other adapter I have on hand is even worse off,
> both in terms of hardware and drivers.
Fair enough. But what's the other adapter?

> The second, more troubling problem is that I seem to get a "silent"
> failure (at least I can't find any errors) if I start a large download
> or set of downloads that take more than 10 seconds to a minute (in
> particular, trying to clone large directories using mercurial is
> impossible, because it will always trigger this problem, though for
> some reason git and subversion work most, but not all, of the time).
> What I mean by "failure" is that one of these two things will happen:
>
> 1. The device will simply fail to receive anything (or trickle out to
> a rate of 500 bytes/min), at which point it will remain in that state
> for hours, occasionally registering minuscule amounts of activity,
> unless it is dis/reconnected to the wireless network (toggling power
> to the device or reloading the module also work, but do neither better
> nor worse than just reassociating). Upon reconnecting it immediately
> works fine, as long as I don't trigger the same problem again.
>
> 2. Less often, the device will fail as above, but then suddenly start
> working again another minute or so later, allowing the process that
> had been overworking it (mercurial, wget, firefox, whatever) to
> continue what it was doing for another 10-50 seconds, at which point
> the connection trickles out again. This cycle keeps happening until
> the process either completes successfully, errors and dies, hangs, or
> is killed, at which point everything seems fine again. (Killing the
> process does not solve the problem, it just prevents the same process
> from causing the problem *again* in the event that the device
> spontaneously recovers within a minute or two).
>
> I know that it's physically possible to get a stable connection here,
> because the Windows installation on the same machine can almost always
> get fairly good speeds with the same device in the same place on the
> same network at the same time of day (~25Mbps, lose connection maybe
> once per 40 hours of use). I also know this because I can always fix
> the problem manually on Linux by reconnecting the device. What I'm not
> sure about is what the problem is with carl9170, or how to convince it
> to be more fault-tolerant. (Is this behavior an overreaction to the
> noise level? Is it hanging while waiting for some event that may never
> happen?) I'm afraid I'm not even sure how to diagnose the problem
> further; wireless adapters are not familiar territory for me.
Thanks for your extensive report on this.

Your problems sound somewhat familiar to
"Re: carl9170 driver - network connection breaks"
<http://permalink.gmane.org/gmane.linux.kernel.wireless.general/88223>

So far no-one has been able to bisect the bug (last good was 3.1, so
to breakage must have occurred between 3.1 and 3.2). I would have
looked into it long ago, but I can't reproduce.

Regards,
Christian

PS: If your kernel was compiled with CONFIG_MAC80211_DEBUGFS you can
"restart" BA/aggregation sessions by
echo "tx stop 0" > /sys/kernel/debug/ieee80211/phyX/netdev:wlanXY/stations/AP-MAC/agg_status
echo "rx stop 0" > /sys/kernel/debug/ieee80211/phyX/netdev:wlanXY/stations/AP-MAC/agg_status
alternatively: you can disable ht by loading the module with 'noht=1' parameter