2013-05-05 20:38:06

by Jake Edge

[permalink] [raw]
Subject: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3


After building 3.9 for my HP/Compaq 2510p laptop, the wireless (iwl4965)
would seemingly no longer connect. Actually, it seems that it *will*
eventually, where that is somewhere between 4 and 21+ retries (from KDE
network manager). Going back to 3.8 (and earlier) and the problem goes
away. (All on Fedora 18, fwiw, but I don't think that's significant).

I bisected the problem to:

commit 1672c0e31917f49d31d30d79067103432bc20cc7
Author: Johannes Berg <[email protected]>
Date: Tue Jan 29 15:02:27 2013 +0100

mac80211: start auth/assoc timeout on frame status

but I can't (easily) revert that in 3.9 (maybe I can in -rc1 or
something? haven't tried that)

What more information is needed from me? I may still mess around with
trying to revert that patch just to nail it down for sure, but two
separate bisection exercises ended up at the same place.

There are a few traps for the unwary if anyone else ends up bisecting
in through here -- the first bisection (3.8 - 3.9-rc1) ran into an IPv6
build error along the way, which I "fixed" by configuring it out --
also, the amount of retries goes from *many* (15-21+) to relatively few
(4) somewhere in there, which confused me and led to the second
bisection (3.8-3.9) ...

jake

--
Jake Edge - LWN - [email protected] - http://lwn.net


2013-05-06 15:31:25

by Johannes Berg

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Mon, 2013-05-06 at 17:30 +0200, Stanislaw Gruszka wrote:

> I thought this happen because 4965 firmware does not mark tx status
> ack for PROBE_REQUEST frame, because it is not acked by standard
> ACK frame, but by PROBE_RESPONSE frame.

Actually, it is ACKed normally, but also gets a probe response :-)

(Or do we send it as multicast? I don't think we do)

> But if so, I would also see
> the breakage on my setup, but I don't - it works quite well here.

Are you testing on a passive channel? Try with a large beacon interval.

johannes


2013-05-07 15:33:58

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Tue, May 07, 2013 at 03:53:30PM +0200, Johannes Berg wrote:
> > > I think the best way to solve this would be to do such a thing in
> > > iwlegacy as well, but until then and for stable maybe we should
> > > introduce another HW flag to restore the previous mac80211 behaviour?
> >
> > I'm not sure if I like to add passive_no_rx to iwlegacy. Stopping queues
> > and waiting for beacon looks sticky, what happen if beacon will not be
> > received?
>
> Good question, do we get stuck? I was assuming we'd time out, but maybe
> that's not the case?

AFICT, we wake queues only if beacon arrives or mac80211 call drv_config
with BSS_CHANGED_IDLE. I'm not sure if the latter prevent stuck.

> > Perhaps I will just remove IEEE80211_HW_REPORTS_TX_ACK_STATUS from 4965,
> > it's simpler workaround ?
>
> Sure, but maybe that loses other semantics that you want?
>
> And anyway it's not complete. If you have a very long beacon interval
> (say 1 second) then this could still lead to all probe/auth retries
> going out inbetween two beacons since the timeout is just 3*100ms.

Let's make that change as temporary regression workaround, I'll add
passive_no_rx workaround latter. I'll also think if it can stuck or
not.

Stanislaw

2013-05-06 15:44:15

by Johannes Berg

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Mon, 2013-05-06 at 17:31 +0200, Johannes Berg wrote:

> > But if so, I would also see
> > the breakage on my setup, but I don't - it works quite well here.
>
> Are you testing on a passive channel? Try with a large beacon interval.

I think most likely what happens is that it's on a passive channel, and
the firmware drops the TX packet with a bad status. Before the patch,
we'd just wait sitting on the channel for HZ/5 (200ms) before trying
again, with the patch we immediately retransmit the packet, which will
fail again and again until the firmware received a beacon.

If you look at iwlwifi/dvm/, it has some passive_no_rx workaround for
this, which I don't see in iwlegacy.

I think the best way to solve this would be to do such a thing in
iwlegacy as well, but until then and for stable maybe we should
introduce another HW flag to restore the previous mac80211 behaviour?

johannes


2013-05-07 09:42:28

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

>> > > But if so, I would also see
>> > > the breakage on my setup, but I don't - it works quite well here.
>> >
>> > Are you testing on a passive channel? Try with a large beacon interval.
>>
>> I think most likely what happens is that it's on a passive channel, and
>> the firmware drops the TX packet with a bad status. Before the patch,
>> we'd just wait sitting on the channel for HZ/5 (200ms) before trying
>> again, with the patch we immediately retransmit the packet, which will
>> fail again and again until the firmware received a beacon.
>>
>> If you look at iwlwifi/dvm/, it has some passive_no_rx workaround for
>> this, which I don't see in iwlegacy.
>
> Can you explain why it is named passive_no_rx instead passive_no_tx ?

Well, it is basically - passive channel with no rx :-) This means we can't tx.

>> I think the best way to solve this would be to do such a thing in
>> iwlegacy as well, but until then and for stable maybe we should
>> introduce another HW flag to restore the previous mac80211 behaviour?
>
> I'm not sure if I like to add passive_no_rx to iwlegacy. Stopping queues
> and waiting for beacon looks sticky, what happen if beacon will not be
> received?
>
> Perhaps I will just remove IEEE80211_HW_REPORTS_TX_ACK_STATUS from 4965,
> it's simpler workaround ?

2013-05-07 08:41:16

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Mon, May 06, 2013 at 05:44:06PM +0200, Johannes Berg wrote:
> On Mon, 2013-05-06 at 17:31 +0200, Johannes Berg wrote:
>
> > > But if so, I would also see
> > > the breakage on my setup, but I don't - it works quite well here.
> >
> > Are you testing on a passive channel? Try with a large beacon interval.
>
> I think most likely what happens is that it's on a passive channel, and
> the firmware drops the TX packet with a bad status. Before the patch,
> we'd just wait sitting on the channel for HZ/5 (200ms) before trying
> again, with the patch we immediately retransmit the packet, which will
> fail again and again until the firmware received a beacon.
>
> If you look at iwlwifi/dvm/, it has some passive_no_rx workaround for
> this, which I don't see in iwlegacy.

Can you explain why it is named passive_no_rx instead passive_no_tx ?

> I think the best way to solve this would be to do such a thing in
> iwlegacy as well, but until then and for stable maybe we should
> introduce another HW flag to restore the previous mac80211 behaviour?

I'm not sure if I like to add passive_no_rx to iwlegacy. Stopping queues
and waiting for beacon looks sticky, what happen if beacon will not be
received?

Perhaps I will just remove IEEE80211_HW_REPORTS_TX_ACK_STATUS from 4965,
it's simpler workaround ?

Stanislaw

2013-05-06 14:38:01

by Jake Edge

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Mon, 6 May 2013 14:38:06 +0200 Stanislaw Gruszka wrote:

> > I bisected the problem to:
> >
> > commit 1672c0e31917f49d31d30d79067103432bc20cc7

> > What more information is needed from me? I may still mess around
> > with trying to revert that patch just to nail it down for sure, but
> > two separate bisection exercises ended up at the same place.
>
> Below patch should restore old mac80211 behaviour, by stop
> telling mac that 4965 supports TX ACK status. Does it help?

Yup, that fixed the problem, thanks. So did reverting Johannes's
commit above, fwiw, but just turning off that flag in 3.9 as your patch
does fixes it.

seems like that needs to head to Linus and the stable folks ...

jake

--
Jake Edge - LWN - [email protected] - http://lwn.net

2013-05-06 12:36:38

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Sun, May 05, 2013 at 02:38:03PM -0600, Jake Edge wrote:
>
> After building 3.9 for my HP/Compaq 2510p laptop, the wireless (iwl4965)
> would seemingly no longer connect. Actually, it seems that it *will*
> eventually, where that is somewhere between 4 and 21+ retries (from KDE
> network manager). Going back to 3.8 (and earlier) and the problem goes
> away. (All on Fedora 18, fwiw, but I don't think that's significant).
>
> I bisected the problem to:
>
> commit 1672c0e31917f49d31d30d79067103432bc20cc7
> Author: Johannes Berg <[email protected]>
> Date: Tue Jan 29 15:02:27 2013 +0100
>
> mac80211: start auth/assoc timeout on frame status
>
> but I can't (easily) revert that in 3.9 (maybe I can in -rc1 or
> something? haven't tried that)
>
> What more information is needed from me? I may still mess around with
> trying to revert that patch just to nail it down for sure, but two
> separate bisection exercises ended up at the same place.

Below patch should restore old mac80211 behaviour, by stop
telling mac that 4965 supports TX ACK status. Does it help?

diff --git a/drivers/net/wireless/iwlegacy/4965-mac.c b/drivers/net/wireless/iwlegacy/4965-mac.c
index b8f82e6..eaa756d 100644
--- a/drivers/net/wireless/iwlegacy/4965-mac.c
+++ b/drivers/net/wireless/iwlegacy/4965-mac.c
@@ -5741,7 +5741,7 @@ il4965_mac_setup_register(struct il_priv *il, u32 max_probe_length)
hw->flags =
IEEE80211_HW_SIGNAL_DBM | IEEE80211_HW_AMPDU_AGGREGATION |
IEEE80211_HW_NEED_DTIM_BEFORE_ASSOC | IEEE80211_HW_SPECTRUM_MGMT |
- IEEE80211_HW_REPORTS_TX_ACK_STATUS | IEEE80211_HW_SUPPORTS_PS |
+ IEEE80211_HW_SUPPORTS_PS |
IEEE80211_HW_SUPPORTS_DYNAMIC_PS;
if (il->cfg->sku & IL_SKU_N)
hw->flags |=


2013-05-07 16:05:46

by Stanislaw Gruszka

[permalink] [raw]
Subject: [PATCH 3.10] iwl4965: workaround connection regression on passive channel

Jake reported that since commit 1672c0e31917f49d31d30d79067103432bc20cc7
"mac80211: start auth/assoc timeout on frame status", he is unable to
connect to his AP, which is configured to use passive channel.

After switch to passive channel 4965 firmware drops any TX packet until
it receives beacon. Before commit 1672c0e3 we waited on channel and
retransmit packet after 200ms, that makes we receive beacon on the
meantime and association process succeed. New mac80211 behaviour cause
that any ASSOC frame fail immediately on iwl4965 and we can not
associate.

This patch restore old mac80211 behaviour for iwl4965, by removing
IEEE80211_HW_REPORTS_TX_ACK_STATUS feature. This feature will be
added again to iwl4965 driver, when different, more complex
workaround for this firmware issue, will be added to the driver.

Cc: [email protected] # 3.9
Bisected-by: Jake Edge <[email protected]>
Reported-and-tested-by: Jake Edge <[email protected]>
Signed-off-by: Stanislaw Gruszka <[email protected]>
---
drivers/net/wireless/iwlegacy/4965-mac.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/wireless/iwlegacy/4965-mac.c b/drivers/net/wireless/iwlegacy/4965-mac.c
index b8f82e6..9a95045 100644
--- a/drivers/net/wireless/iwlegacy/4965-mac.c
+++ b/drivers/net/wireless/iwlegacy/4965-mac.c
@@ -5741,8 +5741,7 @@ il4965_mac_setup_register(struct il_priv *il, u32 max_probe_length)
hw->flags =
IEEE80211_HW_SIGNAL_DBM | IEEE80211_HW_AMPDU_AGGREGATION |
IEEE80211_HW_NEED_DTIM_BEFORE_ASSOC | IEEE80211_HW_SPECTRUM_MGMT |
- IEEE80211_HW_REPORTS_TX_ACK_STATUS | IEEE80211_HW_SUPPORTS_PS |
- IEEE80211_HW_SUPPORTS_DYNAMIC_PS;
+ IEEE80211_HW_SUPPORTS_PS | IEEE80211_HW_SUPPORTS_DYNAMIC_PS;
if (il->cfg->sku & IL_SKU_N)
hw->flags |=
IEEE80211_HW_SUPPORTS_DYNAMIC_SMPS |
--
1.7.11.7


2013-05-06 15:29:12

by Jake Edge

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Mon, 06 May 2013 17:24:19 +0200 Johannes Berg wrote:
> On Mon, 2013-05-06 at 09:21 -0600, Jake Edge wrote:
>
> > > Thanks for the report. Is the AP on a passive channel by any
> > > chance (5 GHz, or channels 12/13)?
> >
> > No, from what I can see on my router, it is using channel 36 (5.180
> > GHz) for the SSID in question
>
> So yes, that is a passive channel for the 4965 device :-)

oops, sorry :) (obviously I don't know what a passive channel is :)

> > but Stanislaw seems to have pinpointed the issue.
>
> Yeah that was a workaround, but I'm not sure we'd really want to do
> that. I'd rather see what really caused this issue. I have a feeling
> it's the passive-no-RX workaround, or lack thereof maybe? I need to
> look a bit closer.

Ok, I'm happy to test things out at this end ... change channels on the
router, build new kernels, etc.

thanks,

jake

--
Jake Edge - LWN - [email protected] - http://lwn.net

2013-05-06 15:12:00

by Johannes Berg

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

Hi Jake,

> commit 1672c0e31917f49d31d30d79067103432bc20cc7
> Author: Johannes Berg <[email protected]>
> Date: Tue Jan 29 15:02:27 2013 +0100
>
> mac80211: start auth/assoc timeout on frame status
>
> but I can't (easily) revert that in 3.9 (maybe I can in -rc1 or
> something? haven't tried that)
>
> What more information is needed from me? I may still mess around with
> trying to revert that patch just to nail it down for sure, but two
> separate bisection exercises ended up at the same place.

Thanks for the report. Is the AP on a passive channel by any chance (5
GHz, or channels 12/13)?

johannes


2013-05-06 15:21:45

by Jake Edge

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Mon, 06 May 2013 17:11:52 +0200 Johannes Berg wrote:
> Hi Jake,
>
> > commit 1672c0e31917f49d31d30d79067103432bc20cc7
> > Author: Johannes Berg <[email protected]>
> > Date: Tue Jan 29 15:02:27 2013 +0100
> >
> > mac80211: start auth/assoc timeout on frame status
> >
> > but I can't (easily) revert that in 3.9 (maybe I can in -rc1 or
> > something? haven't tried that)
> >
> > What more information is needed from me? I may still mess around
> > with trying to revert that patch just to nail it down for sure, but
> > two separate bisection exercises ended up at the same place.
>
> Thanks for the report. Is the AP on a passive channel by any chance (5
> GHz, or channels 12/13)?

No, from what I can see on my router, it is using channel 36 (5.180
GHz) for the SSID in question (and channel 11 2.462 GHz for the other
SSID) ... but Stanislaw seems to have pinpointed the issue.

thanks,

jake

--
Jake Edge - LWN - [email protected] - http://lwn.net

2013-05-22 11:57:29

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Tue, May 07, 2013 at 05:35:25PM +0200, Stanislaw Gruszka wrote:
> On Tue, May 07, 2013 at 03:53:30PM +0200, Johannes Berg wrote:
> > > I'm not sure if I like to add passive_no_rx to iwlegacy. Stopping queues
> > > and waiting for beacon looks sticky, what happen if beacon will not be
> > > received?
> >
> > Good question, do we get stuck? I was assuming we'd time out, but maybe
> > that's not the case?
>
> AFICT, we wake queues only if beacon arrives or mac80211 call drv_config
> with BSS_CHANGED_IDLE. I'm not sure if the latter prevent stuck.

It should prevent stuck. When we fail to auth, drv_config() with BSS_CHANGED_IDLE
is called via:

ieee80211_destroy_auth_data ->
ieee80211_vif_release_channel ->
__ieee80211_vif_release_channel ->
ieee80211_unassign_vif_chanctx ->
ieee80211_bss_info_change_notify

But there is need to have ->vif.chanctx_conf valid in
__ieee80211_vif_release_channel(), where is below condition:

conf = rcu_dereference_protected(sdata->vif.chanctx_conf,
lockdep_is_held(&local->chanctx_mtx));
if (!conf)
return;

I'm not sure if that always happen. Perhaps would be better to change
BSS_CHANGED_IDLE to BSS_CHANGED_BSSID, which is called directly from
ieee80211_destroy_auth_data() ?

Stanislaw


2013-05-24 20:28:48

by Johannes Berg

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Wed, 2013-05-22 at 13:59 +0200, Stanislaw Gruszka wrote:

> > AFICT, we wake queues only if beacon arrives or mac80211 call drv_config
> > with BSS_CHANGED_IDLE. I'm not sure if the latter prevent stuck.
>
> It should prevent stuck. When we fail to auth, drv_config() with BSS_CHANGED_IDLE
> is called via:
>
> ieee80211_destroy_auth_data ->
> ieee80211_vif_release_channel ->
> __ieee80211_vif_release_channel ->
> ieee80211_unassign_vif_chanctx ->
> ieee80211_bss_info_change_notify
>
> But there is need to have ->vif.chanctx_conf valid in
> __ieee80211_vif_release_channel(), where is below condition:
>
> conf = rcu_dereference_protected(sdata->vif.chanctx_conf,
> lockdep_is_held(&local->chanctx_mtx));
> if (!conf)
> return;
>
> I'm not sure if that always happen. Perhaps would be better to change
> BSS_CHANGED_IDLE to BSS_CHANGED_BSSID, which is called directly from
> ieee80211_destroy_auth_data() ?

I don't think the "!conf" can hit in this case, since to even try to
associate you have to have a channel context assigned.

johannes


2013-05-07 13:53:37

by Johannes Berg

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Tue, 2013-05-07 at 10:42 +0200, Stanislaw Gruszka wrote:

> Can you explain why it is named passive_no_rx instead passive_no_tx ?

Emmanuel already commented on this, basically the error codes are all
for "I couldn't transmit this frame", so here we have "I couldn't
transmit this frame because it was on a _passive_ channel and there was
_no rx_ yet."

> > I think the best way to solve this would be to do such a thing in
> > iwlegacy as well, but until then and for stable maybe we should
> > introduce another HW flag to restore the previous mac80211 behaviour?
>
> I'm not sure if I like to add passive_no_rx to iwlegacy. Stopping queues
> and waiting for beacon looks sticky, what happen if beacon will not be
> received?

Good question, do we get stuck? I was assuming we'd time out, but maybe
that's not the case?

> Perhaps I will just remove IEEE80211_HW_REPORTS_TX_ACK_STATUS from 4965,
> it's simpler workaround ?

Sure, but maybe that loses other semantics that you want?

And anyway it's not complete. If you have a very long beacon interval
(say 1 second) then this could still lead to all probe/auth retries
going out inbetween two beacons since the timeout is just 3*100ms.

johannes


2013-05-06 15:24:24

by Johannes Berg

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Mon, 2013-05-06 at 09:21 -0600, Jake Edge wrote:

> > Thanks for the report. Is the AP on a passive channel by any chance (5
> > GHz, or channels 12/13)?
>
> No, from what I can see on my router, it is using channel 36 (5.180
> GHz) for the SSID in question

So yes, that is a passive channel for the 4965 device :-)

> but Stanislaw seems to have pinpointed the issue.

Yeah that was a workaround, but I'm not sure we'd really want to do
that. I'd rather see what really caused this issue. I have a feeling
it's the passive-no-RX workaround, or lack thereof maybe? I need to look
a bit closer.

johannes


2013-05-06 15:29:17

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: Bisected 3.9 regression for iwl4965 connection problem to 1672c0e3

On Mon, May 06, 2013 at 08:37:59AM -0600, Jake Edge wrote:
> On Mon, 6 May 2013 14:38:06 +0200 Stanislaw Gruszka wrote:
>
> > > I bisected the problem to:
> > >
> > > commit 1672c0e31917f49d31d30d79067103432bc20cc7
>
> > > What more information is needed from me? I may still mess around
> > > with trying to revert that patch just to nail it down for sure, but
> > > two separate bisection exercises ended up at the same place.
> >
> > Below patch should restore old mac80211 behaviour, by stop
> > telling mac that 4965 supports TX ACK status. Does it help?
>
> Yup, that fixed the problem, thanks. So did reverting Johannes's
> commit above, fwiw, but just turning off that flag in 3.9 as your patch
> does fixes it.
>
> seems like that needs to head to Linus and the stable folks ...

We can fix the problem that way, but I'm not sure if that is the best
way. IEEE80211_HW_REPORTS_TX_ACK_STATUS is nice to have as long hardware
really supports that feature.

I thought this happen because 4965 firmware does not mark tx status
ack for PROBE_REQUEST frame, because it is not acked by standard
ACK frame, but by PROBE_RESPONSE frame. But if so, I would also see
the breakage on my setup, but I don't - it works quite well here. So
perhaps something different is broken on commit
1672c0e31917f49d31d30d79067103432bc20cc7 , that make we can not
associate with some APs.

Let me think a bit longer about this issue. Also maybe Johannes will
have some thoughts.

Stanislaw