2012-06-11 03:15:50

by Jonathan Nieder

Subject: [3.2.y] Re: Brcmsmac driver woes, possible regression?

Hi Arend et al,

Quick puzzle for you. You like puzzles, right? ;-)

As discussed at [1], Camaleón has been experiencing unwanted random
wireless reconnects with various 3.2.y kernels up to and including
3.2.19:

Camaleón wrote[3]:

> What I get is the Network Manager window requesting for the
> password confirm, randomly. If I delay the password confirmation, the
> wireless connection drops.

Newer kernels seemed to work better than old, so I asked her to apply
the following patches against the 3.2.y tree (the exact patches used
are at [2]):

6b1a89afbf97 brcm80211: smac: drop "40MHz intolerant" flag from HT
capability info
c261bdf8acad brcm80211: smac: indicate severe problems to Mac80211
0bf1f883fd0a brcm80211: smac: removed MPC related code
4412953061de brcm80211: smac: removed MPC related variables
28237002e726 brcm80211: smac: removed down-on-watchdog MPC functionality
43ac09722f8e brcm80211: smac: removed down-on-rf-kill functionality
a8bc4917ed6b brcm80211: smac: bugfix for tx mute in brcms_b_init()
c6c44893c864 brcm80211: smac: fixed inconsistency in transmit mute
2646c46d5679 brcm80211: smac: modified Mac80211 callback interface
dc460127898c brcm80211: smac: mute transmit on ops_start
1525662ac280 brcm80211: smac: changed check to confirm STA only support
b7eec4233c34 brcm80211: smac: replace own access category definitions with
mac80211 enum
e9ca530a7b18 brcm80211: smac: don't modify sta parameters when adding sta
8906c43cb160 brcm80211: smac: fix channel frequency
02a588a2e3b9 brcm80211: smac: combine promiscuous mode functionality
be667669ec01 brcm80211: smac: added support for mac80211 filter flags
[d3f311349add brcm80211: fix usage of set tx power] --- unrelated, my mistake
aa1f2f0a3218 brcm80211: smac: precendence bug in wlc_phy_attach()
1570e53c14ff brcm80211: smac: fix unintended fallthru in
wlc_phy_radio_init_2057()
137dabed34a1 brcm80211: smac: remove smatch warnings from brcmsmac code
2b0a53d51b5f brcm80211: smac: only print block-ack timeout message at
trace level
6b8da423315b brcm80211: smac: do not use US as fallback regulatory hint
94a2ca311cf4 brcm80211: smac: only provide valid regulatory hint

The result was very nice --- the random reconnects went away.

Here comes the puzzle --- which of those patches are responsible for
the improvement?

If it is not too invasive, we would like to test the responsible patch
separately and submit it for inclusion in stable trees, hence the
question. Incidentally, if any unrelated patches also seem like good
stable candidates, that would be interesting to hear, too. (In other
words, a brief description of the symptoms addressed by _any_ of the
listed patches would be welcome.)

Example log of a reconnect at [3].

Thanks,
Jonathan

[1] http://thread.gmane.org/gmane.linux.kernel.wireless.general/87873
[2] http://bugs.debian.org/664767#99
[3] http://bugs.debian.org/664767#196


2012-06-20 10:02:27

by Jonathan Nieder

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

Arend van Spriel wrote:
> On 06/19/2012 09:28 PM, Jonathan Nieder wrote:

>> I had been hoping that was mostly orthogonal until Camaleón mentioned
>> that 3.2.2 doesn't seem to trigger the random reconnects.
>
> I missed this piece of info. So the following statements are true?
>
> 1. v3.2.2 and earlier did not show the issue.
> 2. v3.2.9 until 3.2.17 have random reconnects.
> 3. v3.2.18 does not have random reconnects (or less).

(1) and (2) are true. I don't think (3) is.

2012-06-19 19:28:27

by Jonathan Nieder

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

Arend van Spriel wrote:
> On 06/19/2012 08:15 PM, Jonathan Nieder wrote:

>> This was first reproduced on a kernel closely based on 3.2.9. It
>> would typically happen pretty reliably once a day or so. Four days of
>> testing a kernel close to 3.2.2 haven't triggered it again[1].
>>
>> The only brcm80211 change in that range is
>>
>> f96b08a7e6f6 brcmsmac: fix tx queue flush infinite loop
>
> The WARN_ONCE added by the commit above still triggers sometimes. Two
> recent commits I did regarding this are in 3.4-stable. Not sure if they
> have been ported to 3.2 as well.
>
> 85091fc brcm80211: smac: fix endless retry of A-MPDU transmissions
> badc4f0 brcm80211: smac: resume transmit fifo upon receiving frames

Yep, both are in 3.2-stable (added in 3.2.18 and 3.2.17, respectively).
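
A rough sketch of how such a check can be scripted, assuming the stable
backport cites its mainline commit in the message body using the usual
"commit <sha> upstream." line. The helper name cites_upstream and the
sample message are made up for illustration; only the abbreviated hashes
come from this thread.

```python
import re

# Stable backports conventionally name the mainline commit in the body,
# e.g. "commit <sha> upstream."  (assumption: the backport follows that form)
UPSTREAM_RE = re.compile(r"\bcommit ([0-9a-f]{7,40}) upstream\b")


def cites_upstream(message, abbrev):
    """Return True if a commit message cites a mainline commit whose
    hash starts with the given abbreviation."""
    found = UPSTREAM_RE.findall(message.lower())
    return any(sha.startswith(abbrev.lower()) for sha in found)


# Hypothetical backport message, written out by hand for this example.
SAMPLE = """brcm80211: smac: resume transmit fifo upon receiving frames

commit badc4f0 upstream.
"""

print(cites_upstream(SAMPLE, "badc4f0"))
print(cites_upstream(SAMPLE, "85091fc"))
```

In a real tree the message text would come from something like
"git log v3.2..v3.2.19 -- drivers/net/wireless/brcm80211", or the search
could be done directly with "git log --grep".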

[...]
> However, I still observe the warning, so I am looking at what other events
> trigger this issue.

Feel free to contact Touko Korpela <[email protected]> and Camaleón
if you need recent logs or other information about that[1].

I had been hoping that was mostly orthogonal until Camaleón mentioned
that 3.2.2 doesn't seem to trigger the random reconnects.

Thanks,
Jonathan

[1] tracked here: http://bugs.debian.org/672891

2012-06-19 18:52:05

by Arend van Spriel

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

On 06/11/2012 05:15 AM, Jonathan Nieder wrote:

Decided to look up this message and reply. I noticed it earlier but forgot
to follow up.

> Hi Arend et al,
>
> Quick puzzle for you. You like puzzles, right? ;-)

From time to time, I do. When I get enough sleep ;-)

> As discussed at [1], Camaleón has been experiencing unwanted random
> wireless reconnects with various 3.2.y kernels up to and including
> 3.2.19:
>
> Camaleón wrote[3]:
>
>> What I get is the Network Manager window requesting for the
>> password confirm, randomly. If I delay the password confirmation, the
>> wireless connection drops.
>
> Newer kernels seemed to work better than old, so I asked her to apply
> the following patches against the 3.2.y tree (the exact patches used
> are at [2]):
>
> 6b1a89afbf97 brcm80211: smac: drop "40MHz intolerant" flag from HT
> capability info
> c261bdf8acad brcm80211: smac: indicate severe problems to Mac80211
> 0bf1f883fd0a brcm80211: smac: removed MPC related code
> 4412953061de brcm80211: smac: removed MPC related variables
> 28237002e726 brcm80211: smac: removed down-on-watchdog MPC functionality
> 43ac09722f8e brcm80211: smac: removed down-on-rf-kill functionality
> a8bc4917ed6b brcm80211: smac: bugfix for tx mute in brcms_b_init()
> c6c44893c864 brcm80211: smac: fixed inconsistency in transmit mute
> 2646c46d5679 brcm80211: smac: modified Mac80211 callback interface
> dc460127898c brcm80211: smac: mute transmit on ops_start
> 1525662ac280 brcm80211: smac: changed check to confirm STA only support
> b7eec4233c34 brcm80211: smac: replace own access category definitions with
> mac80211 enum
> e9ca530a7b18 brcm80211: smac: don't modify sta parameters when adding sta
> 8906c43cb160 brcm80211: smac: fix channel frequency
> 02a588a2e3b9 brcm80211: smac: combine promiscuous mode functionality
> be667669ec01 brcm80211: smac: added support for mac80211 filter flags
> [d3f311349add brcm80211: fix usage of set tx power] --- unrelated, my mistake
> aa1f2f0a3218 brcm80211: smac: precendence bug in wlc_phy_attach()
> 1570e53c14ff brcm80211: smac: fix unintended fallthru in
> wlc_phy_radio_init_2057()
> 137dabed34a1 brcm80211: smac: remove smatch warnings from brcmsmac code
> 2b0a53d51b5f brcm80211: smac: only print block-ack timeout message at
> trace level
> 6b8da423315b brcm80211: smac: do not use US as fallback regulatory hint
> 94a2ca311cf4 brcm80211: smac: only provide valid regulatory hint
>
> The result was very nice --- the random reconnects went away.

If I remember correctly, the random reconnects were caused by the mac80211
flush callback.

> Here comes the puzzle --- which of those patches are responsible for
> the improvement?

My hunch based on what I found so far is that the patches mentioning the
word 'mute' could be key here.

> If it is not too invasive, we would like to test the responsible patch
> separately and submit it for inclusion in stable trees, hence the
> question. Incidentally, if any unrelated patches also seem like good
> stable candidates, that would be interesting to hear, too. (In other
> words, a brief description of the symptoms addressed by _any_ of the
> listed patches would be welcome.)

I really need to dive into the patches individually so it may take some
time.

Gr. AvS


2012-06-20 07:12:13

by Arend van Spriel

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

On 06/19/2012 09:28 PM, Jonathan Nieder wrote:
> Arend van Spriel wrote:
>> However, I still observe the warning, so I am looking at what other events
>> trigger this issue.
>
> Feel free to contact Touko Korpela <[email protected]> and Camaleón
> if you need recent logs or other information about that[1].
>
> I had been hoping that was mostly orthogonal until Camaleón mentioned
> that 3.2.2 doesn't seem to trigger the random reconnects.

I missed this piece of info. So the following statements are true?

1. v3.2.2 and earlier did not show the issue.
2. v3.2.9 until 3.2.17 have random reconnects.
3. v3.2.18 does not have random reconnects (or less).

Unfortunately, between v3.2.2 and v3.2.9 the only brcm80211 commit was the
infinite loop fix from Stanislaw, so that does not solve the puzzle.
That fix only bails out of the loop with a warning, so the underlying
problem is probably present in v3.2.2 as well.

Gr. AvS


2012-06-19 19:15:46

by Arend van Spriel

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

On 06/19/2012 08:15 PM, Jonathan Nieder wrote:
> Hi again,
>
> Jonathan Nieder wrote:
>
>> As discussed at [1], Camaleón has been experiencing unwanted random
>> wireless reconnects with various 3.2.y kernels up to and including
>> 3.2.19:
>
> This was first reproduced on a kernel closely based on 3.2.9. It
> would typically happen pretty reliably once a day or so. Four days of
> testing a kernel close to 3.2.2 haven't triggered it again[1].
>
> The only brcm80211 change in that range is
>
> f96b08a7e6f6 brcmsmac: fix tx queue flush infinite loop
>

The WARN_ONCE added by the commit above still triggers sometimes. Two
recent commits I did regarding this are in 3.4-stable. Not sure if they
have been ported to 3.2 as well.

85091fc brcm80211: smac: fix endless retry of A-MPDU transmissions
badc4f0 brcm80211: smac: resume transmit fifo upon receiving frames

However, I still observe the warning, so I am looking at what other events
trigger this issue.

> So maybe the timeout is too short and this safety is tripping when it
> shouldn't. I've asked Camaleón to try a recent 3.2.y kernel with and
> without that commit reverted to test this guess.
>
> That leaves another mystery: which of the 22 changes listed at [2] was
> providing relief in earlier tests? E.g., does
>
>> c261bdf8acad brcm80211: smac: indicate severe problems to Mac80211
>
> make it easier to recover from this kind of error? Are there commands
> we should run or diagnostics to try to get a better sense of what is
> going on?

No commands yet. We want to add debugfs support. The commit above only
notifies mac80211 that we have a problem. However, the recovery scenario
that mac80211 initiates upon this notification turns out to be fatal
for brcmsmac.

Gr. AvS


2012-06-19 18:15:28

by Jonathan Nieder

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

Hi again,

Jonathan Nieder wrote:

> As discussed at [1], Camaleón has been experiencing unwanted random
> wireless reconnects with various 3.2.y kernels up to and including
> 3.2.19:

This was first reproduced on a kernel closely based on 3.2.9. It
would typically happen pretty reliably once a day or so. Four days of
testing a kernel close to 3.2.2 haven't triggered it again[1].

The only brcm80211 change in that range is

f96b08a7e6f6 brcmsmac: fix tx queue flush infinite loop

So maybe the timeout is too short and this safety is tripping when it
shouldn't. I've asked Camaleón to try a recent 3.2.y kernel with and
without that commit reverted to test this guess.

That leaves another mystery: which of the 22 changes listed at [2] was
providing relief in earlier tests? E.g., does

> c261bdf8acad brcm80211: smac: indicate severe problems to Mac80211

make it easier to recover from this kind of error? Are there commands
we should run or diagnostics to try to get a better sense of what is
going on?

Grasping at straws,
Jonathan

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=220;bug=664767
[2] http://thread.gmane.org/gmane.linux.kernel.wireless.general/92452

2012-06-20 12:13:26

by Arend van Spriel

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

On 06/20/2012 12:02 PM, Jonathan Nieder wrote:
> Arend van Spriel wrote:
>> On 06/19/2012 09:28 PM, Jonathan Nieder wrote:
>
>>> I had been hoping that was mostly orthogonal until Camaleón mentioned
>>> that 3.2.2 doesn't seem to trigger the random reconnects.
>>
>> I missed this piece of info. So the following statements are true?
>>
>> 1. v3.2.2 and earlier did not show the issue.
>> 2. v3.2.9 until 3.2.17 have random reconnects.
>> 3. v3.2.18 does not have random reconnects (or less).
>
> (1) and (2) are true. I don't think (3) is.
>

I have my doubts on (3) as well, but I think the likelihood of the
random reconnects has been reduced by the earlier mentioned patches. I will
work with Camaleón and/or Touko Korpela on investigating this (and keep you
posted).

Gr. AvS


2012-07-23 00:28:18

by Jonathan Nieder

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

Arend van Spriel wrote:
> On 07/16/2012 11:31 PM, Jonathan Nieder wrote:

>> With all the above patches applied on top of 3.2.21, Camaleón quickly
>> gets the reconnects and gnome-shell segfaults if she does not supply
>> the password to network-manager in time[1]. That means the patch that
>> prevents trouble is presumably one of the four listed below.
>>
>>>> 137dabed34a1 brcm80211: smac: remove smatch warnings from brcmsmac code
>>>> 2b0a53d51b5f brcm80211: smac: only print block-ack timeout message at
>>>> trace level
>>>> 6b8da423315b brcm80211: smac: do not use US as fallback regulatory hint
>>>> 94a2ca311cf4 brcm80211: smac: only provide valid regulatory hint
>>
>> Thanks,
>> Jonathan
>>
>> [1] http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=295;bug=664767
>
> Thanks, Jonathan
>
> I guess the behaviour improved due to the last two patches listed above,
> but I also suspect that it is only an improvement and the root cause
> still exists.

I think you're right. Camaleón tried with all patches except the last
one and still got the unwanted reconnects.

Would 6b8da423315b and 94a2ca311cf4 be good candidates for stable@ in
the meantime (for symptom relief and saner regulatory code) until the
root cause is found?

Thanks,
Jonathan

2012-07-07 04:56:29

by Jonathan Nieder

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

Hi,

Arend van Spriel wrote:
> On 06/11/2012 05:15 AM, Jonathan Nieder wrote:

>> Newer kernels seemed to work better than old, so I asked her to apply
>> the following patches against the 3.2.y tree (the exact patches used
>> are at [2]):

I asked Camaleón to test the collection again to rule out a fluke.
Still looks ok[1]:

| I've been running this test (kernel 3.2.21 with
| all the patches applied) since last Saturday which has been running
| quite stable (only a couple reconnects in 7 days [...]

Phew.

[...]
> My hunch based on what I found so far is that the patches mentioning the
> word 'mute' could be key here.

Unfortunately a collection including all of those (patches 1-10)
produces lots of reconnects and gnome-shell segfaults:

| Applied the ten first patches, compiled 3.2.21-1 from Debian sources
| and... well, I'm attaching log.

[...]
> I really need to dive into the patches individually so it may take some
> time.

Thanks.

Camaleón also tried changing the regulatory domain:

| >> I can try Arend's suggestion of setting "US" for the CRDA instead ES :-?
| >
| > This would be interesting, too (as an independent test).
|
| I tried but got no successful results so I restored the ES setting.

If you have questions for Camaleón or tests that it would be useful to
run, I'm all ears. Unless we hear from you, the next test will
probably be to try patches 1-17, to cut the list of patches that might
have fixed it in half.
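
A minimal sketch of that halving strategy, assuming every test run gives a
reliable answer and that a fixed prefix stays fixed when more patches are
added on top. Neither is guaranteed for a symptom that shows up about once
a day. The names first_fixing_prefix and is_fixed, and the pretend culprit
in the usage lines, are hypothetical. The starting bounds (patches 1-10
bad, all 23 apparently good) are the ones reported above.

```python
def first_fixing_prefix(known_bad, known_good, is_fixed):
    """Binary-search an ordered patch series.

    known_bad:  largest prefix length already observed to show the bug
    known_good: smallest prefix length already observed to be fine
    is_fixed:   callable taking a prefix length n (patches 1..n applied)
                and returning True if the bug does not show up

    Returns (n, tested): n is the first prefix length that is fixed, so
    patch number n is the suspect; tested lists the prefixes tried, in order.
    """
    lo, hi = known_bad, known_good
    tested = []
    while hi - lo > 1:
        mid = (lo + hi + 1) // 2   # round up, so 10 and 23 give 17
        tested.append(mid)
        if is_fixed(mid):
            hi = mid               # the fix is at or before patch mid
        else:
            lo = mid               # the fix comes after patch mid
    return hi, tested


# Usage with a pretend culprit (patch 22), purely for illustration.
culprit, tested = first_fixing_prefix(10, 23, lambda n: n >= 22)
print(culprit, tested)
```

With 13 candidates left this needs at most four builds. A single flaky
result can send the search down the wrong half, so each step wants a test
period long enough to trust.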

Ciao,
Jonathan

[1] http://bugs.debian.org/664767

2012-07-29 22:49:06

by Jonathan Nieder

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

Jonathan Nieder wrote:
> Arend van Spriel wrote:

>> I guess the behaviour improved due to the last two patches listed above,
>> but I also suspect that it is only an improvement and the root cause
>> still exists.
>
> I think you're right. Camaleón tried with all patches except the last
> one and still got the unwanted reconnects.
>
> Would 6b8da423315b and 94a2ca311cf4 be good candidates for stable@ in
> the meantime (for symptom relief and saner regulatory code) until the
> root cause is found?

To answer my own question: I guess not, at least not on the basis of
Camaleón's experience.

Results so far:

* mainline >= 3.4.y: seemed to be ok for a few days, never showed the
problem.
* 3.2.19 + the 23 patches discussed in this thread: worked fine for about
a week, never showed the problem.
* 3.2.18: failed quickly.
* 3.2.19 + first patch: failed quickly.
* 3.2.2: worked ok for a while, then acted up again. Affected.
* 3.2.21 + 10 patches: failed quickly.
* 3.2.21 + all 23 patches: worked ok-ish (?) for a week ("only a
couple reconnects in 7 days, but when that happened gnome-shell was
not segfaulting at least..."). I should have paid more attention:
that meant it was affected!
* 3.2.21 + 17 patches: failed quickly.
* 3.2.21 + 19 patches: failed quickly.
* 3.2.22 + 22 patches: failed quickly.
* 3.2.21 + patch #23 alone: failed quickly.
* 3.2.21 + all 23 patches: failed quickly.
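
One way to see why these results do not single out a patch is to tabulate
them and look for a patch set that was reported both ok and failing. The
sketch below transcribes the list above, leaving out the mainline entry
because it is not a 3.2.y base plus patches. The helper names and the
"ok"/"fail" labels are my own shorthand, and the "ok-ish" 3.2.21 run is
counted as a failure, as concluded above.

```python
from collections import defaultdict


def prefix(n):
    """Patches 1..n of the series, as a hashable set."""
    return frozenset(range(1, n + 1))


# (base kernel, patches applied, outcome) as reported in this thread.
RESULTS = [
    ("3.2.19", prefix(23), "ok"),
    ("3.2.18", prefix(0), "fail"),
    ("3.2.19", prefix(1), "fail"),
    ("3.2.2", prefix(0), "fail"),        # ok for a while, then acted up
    ("3.2.21", prefix(10), "fail"),
    ("3.2.21", prefix(23), "fail"),      # "ok-ish": a couple of reconnects
    ("3.2.21", prefix(17), "fail"),
    ("3.2.21", prefix(19), "fail"),
    ("3.2.22", prefix(22), "fail"),
    ("3.2.21", frozenset({23}), "fail"),
    ("3.2.21", prefix(23), "fail"),
]


def conflicting_patch_sets(results):
    """Return the patch sets that were seen with more than one outcome."""
    outcomes = defaultdict(set)
    for _base, patches, outcome in results:
        outcomes[patches].add(outcome)
    return [p for p, seen in outcomes.items() if len(seen) > 1]


conflicts = conflicting_patch_sets(RESULTS)
print([sorted(p) for p in conflicts])
```

The only conflicting entry is the full 23-patch series, good on 3.2.19 and
bad on 3.2.21, which fits the reading that the one good run was luck
rather than a fix.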

Arend, you're very good at this guessing game. ;-)

I imagine the problem is still present in mainline. Camaleón, the
next useful test would probably be mainline or wireless-testing, and
we should probably stop being so lazy and try to figure out _what_ is
happening when it reconnects and whether it is even a kernel bug. It
might be something normal (like rekeying) not being handled well by
userspace.

Whatever it is, it seems that the kernel can help, since the
proprietary Broadcom wl driver works better if I have understood
Camaleón correctly.

Thanks,
Jonathan

2012-07-16 21:31:30

by Jonathan Nieder

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

Hi again,

Quick update.

Arend van Spriel wrote:
> On 06/11/2012 05:15 AM, Jonathan Nieder wrote:

>> Newer kernels seemed to work better than old, so I asked her to apply
>> the following patches against the 3.2.y tree (the exact patches used
>> are at [2]):
>>
>> 6b1a89afbf97 brcm80211: smac: drop "40MHz intolerant" flag from HT
>> capability info
>> c261bdf8acad brcm80211: smac: indicate severe problems to Mac80211
>> 0bf1f883fd0a brcm80211: smac: removed MPC related code
>> 4412953061de brcm80211: smac: removed MPC related variables
>> 28237002e726 brcm80211: smac: removed down-on-watchdog MPC functionality
>> 43ac09722f8e brcm80211: smac: removed down-on-rf-kill functionality
>> a8bc4917ed6b brcm80211: smac: bugfix for tx mute in brcms_b_init()
>> c6c44893c864 brcm80211: smac: fixed inconsistency in transmit mute
>> 2646c46d5679 brcm80211: smac: modified Mac80211 callback interface
>> dc460127898c brcm80211: smac: mute transmit on ops_start
>> 1525662ac280 brcm80211: smac: changed check to confirm STA only support
>> b7eec4233c34 brcm80211: smac: replace own access category definitions with
>> mac80211 enum
>> e9ca530a7b18 brcm80211: smac: don't modify sta parameters when adding sta
>> 8906c43cb160 brcm80211: smac: fix channel frequency
>> 02a588a2e3b9 brcm80211: smac: combine promiscuous mode functionality
>> be667669ec01 brcm80211: smac: added support for mac80211 filter flags
>> [d3f311349add brcm80211: fix usage of set tx power] --- unrelated, my mistake
>> aa1f2f0a3218 brcm80211: smac: precendence bug in wlc_phy_attach()
>> 1570e53c14ff brcm80211: smac: fix unintended fallthru in
>> wlc_phy_radio_init_2057()

With all the above patches applied on top of 3.2.21, Camaleón quickly
gets the reconnects and gnome-shell segfaults if she does not supply
the password to network-manager in time[1]. That means the patch that
prevents trouble is presumably one of the four listed below.

>> 137dabed34a1 brcm80211: smac: remove smatch warnings from brcmsmac code
>> 2b0a53d51b5f brcm80211: smac: only print block-ack timeout message at
>> trace level
>> 6b8da423315b brcm80211: smac: do not use US as fallback regulatory hint
>> 94a2ca311cf4 brcm80211: smac: only provide valid regulatory hint

Thanks,
Jonathan

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=295;bug=664767

2012-07-17 07:54:38

by Arend van Spriel

Subject: Re: [3.2.y] Re: Brcmsmac driver woes, possible regression?

On 07/16/2012 11:31 PM, Jonathan Nieder wrote:
>
> With all the above patches applied on top of 3.2.21, Camaleón quickly
> gets the reconnects and gnome-shell segfaults if she does not supply
> the password to network-manager in time[1]. That means the patch that
> prevents trouble is presumably one of the four listed below.
>
>>> 137dabed34a1 brcm80211: smac: remove smatch warnings from brcmsmac code
>>> 2b0a53d51b5f brcm80211: smac: only print block-ack timeout message at
>>> trace level
>>> 6b8da423315b brcm80211: smac: do not use US as fallback regulatory hint
>>> 94a2ca311cf4 brcm80211: smac: only provide valid regulatory hint
>
> Thanks,
> Jonathan
>
> [1] http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=295;bug=664767
>

Thanks, Jonathan

I guess the behaviour improved due to the last two patches listed above,
but I also suspect that it is only an improvement and the root cause
still exists.

Gr. AvS