2011-11-28 04:23:49

by Norbert Preining

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

On Mo, 28 Nov 2011, Emmanuel Grumbach wrote:
> If the problem is solved, then no :-)

I will report back. I have now reverted all the patches and additions
by stashing them away, and applied only the last patch send. THe only
thing out of default I see is:
[ 1074.038324] wlan0: associated
[ 1076.157989] Rx A-MPDU request on tid 0 result 0
[ 1098.215008] Open BA session requested for 00:24:c4:ab:bd:ef tid 0
[ 1098.232249] activated addBA response timer on tid 0
[ 1098.232861] switched off addBA timer for tid 0
[ 1098.232868] Aggregation is on for tid 0
[ 1098.233027] iwlwifi 0000:06:00.0: Tx aggregation enabled on ra = 00:24:c4:ab:bd:ef tid = 0
but it seems to work smoothly.

I will report more after a few suspend/resume iteations.

Best wishes

Norbert
------------------------------------------------------------------------
Norbert Preining preining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan TeX Live & Debian Developer
DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
------------------------------------------------------------------------
HUBY (n.)
A half-erection large enough to be a publicly embarrassing bulge in
the trousers, not large enough to be of any use to anybody.
--- Douglas Adams, The Meaning of Liff


2011-11-29 06:59:24

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

>
> Ok, here we go. First of all, it seems that the problem is solved
> inthe sense that the connection stays alive all the time and does not
> disappear. OTOH, I now get quite a lot of messages, that might point
> either to a broken AP here, or something else:

The code that actually fixes the problem you were seeing deals with a
case were the AP is a bit "late". We request something from the AP,
give up and only then the AP replies to our request. This flow wasn't
handled properly. So yes, your AP is still replying late, and debug
message will get printed (especially if you have enabled the
MAC80211_HT_DEBUG because I asked you to), but at least you will be
able to continue working.
When yesterday I told you that your log looks too good, I meant that I
didn't see that your AP was late and basically I didn't see you
entered the flows that we weren't handling properly before the fix.

>
> After resume I see:
> [ 3995.266230] iwlwifi 0000:06:00.0: L1 Enabled; Disabling L0S
> [ 3995.270239] iwlwifi 0000:06:00.0: Radio type=0x1-0x2-0x0
> [ 4002.943037] wlan0: authenticate with 00:0a:79:eb:56:10 (try 1)
> [ 4002.945672] wlan0: authenticated
> [ 4002.951806] wlan0: associate with 00:0a:79:eb:56:10 (try 1)
> [ 4002.964652] wlan0: RX AssocResp from 00:0a:79:eb:56:10 (capab=0x411 status=0 aid=2)
> [ 4002.964662] wlan0: associated
> [ 4007.088934] Rx A-MPDU request on tid 0 result 0
>
> then it starts with these, and there are *many* of them, literally
> several hundreds
> [ 4019.340036] ieee80211 phy1: release an RX reorder frame due to timeout on earlier frames
>
> Intersperesed I see some other messages that are new to me:
> [ 4019.443129] Open BA session requested for 00:0a:79:eb:56:10 tid 0
> [ 4019.500149] activated addBA response timer on tid 0
> [ 4020.500033] addBA response timer expired on tid 0

Here we are giving up because the AP is late.

> [ 4020.501626] Tx BA session stop requested for 00:0a:79:eb:56:10 tid 0
> [ 4023.740570] switched off addBA timer for tid 0
> [ 4023.740578] got addBA resp for tid 0 but we already gave up

Here is the AP is finally replying

> [ 4023.740619] Stopping Tx BA session for 00:0a:79:eb:56:10 tid 0
> [ 4023.768544] Open BA session requested for 00:0a:79:eb:56:10 tid 0

Here we are trying again

> [ 4023.784292] activated addBA response timer on tid 0
> [ 4023.786294] switched off addBA timer for tid 0
> [ 4023.786301] Aggregation is on for tid 0
> [ 4023.786480] iwlwifi 0000:06:00.0: Tx aggregation enabled on ra = 00:0a:79:eb:56:10 tid = 0

Successfully this time.

>
>
> [ 4107.143615] Open BA session requested for 00:0a:79:eb:56:10 tid 6

Are you using VoIP ?

> [ 4107.180359] activated addBA response timer on tid 6
> [ 4107.182333] switched off addBA timer for tid 6
> [ 4107.182338] Aggregation is on for tid 6
> [ 4107.182567] iwlwifi 0000:06:00.0: Tx aggregation enabled on ra = 00:0a:79:eb:56:10 tid = 6
> [ 4124.093846] delba from 00:0a:79:eb:56:10 (initiator) tid 0 reason code 39
> [ 4124.093856] Rx BA session stop requested for 00:0a:79:eb:56:10 tid 0
> [ 4124.094270] Rx A-MPDU request on tid 0 result 0
>
>
> At the end there are mostly two lines, the
> [ 5983.321525] net_ratelimit: 1 callbacks suppressed
> [ 5983.321537] ieee80211 phy1: release an RX reorder frame due to timeout on earlier frames
>
> Where for every 10-20 of the second one there is one of the first one.

Yes, your AP doesn't seem to resend lost frames, or we don't send BAR
properly ?

>
> =============================
> Anyway, the connection might stumble now and then a bit, but it does
> not break down, so I don't complain ;-)
>
> BIG THANKS to everyone!!!
>

Thanks for you testing and patience.

2011-11-28 06:33:25

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

On Mon, Nov 28, 2011 at 06:23, Norbert Preining <[email protected]> wrote:
> On Mo, 28 Nov 2011, Emmanuel Grumbach wrote:
>> If the problem is solved, then no :-)
>
> I will report back. I have now reverted all the patches and additions
> by stashing them away, and applied only the last patch send. THe only
> thing out of default I see is:
> [ 1074.038324] wlan0: associated
> [ 1076.157989] Rx A-MPDU request on tid 0 result 0
> [ 1098.215008] Open BA session requested for 00:24:c4:ab:bd:ef tid 0
> [ 1098.232249] activated addBA response timer on tid 0
> [ 1098.232861] switched off addBA timer for tid 0
> [ 1098.232868] Aggregation is on for tid 0
> [ 1098.233027] iwlwifi 0000:06:00.0: Tx aggregation enabled on ra = 00:24:c4:ab:bd:ef tid = 0
> but it seems to work smoothly.
>
> I will report more after a few suspend/resume iteations.
>

Ok
It is a bit odd though since these prints seem to point to a
successful flow, while you were likely to experience a problem in the
handshake between the AP and your machine. I am basically saying that
this log is too good, you didn't get into the corner case that we
didn't handle properly.

2011-11-29 09:19:02

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

On Tue, Nov 29, 2011 at 10:31, Johannes Berg <[email protected]> wrote:
> I noticed that the logs are a bit odd wrt. timing.
>
>> > Intersperesed I see some other messages that are new to me:
>> > [ 4019.443129] Open BA session requested for 00:0a:79:eb:56:10 tid 0
>> > [ 4019.500149] activated addBA response timer on tid 0
>> > [ 4020.500033] addBA response timer expired on tid 0
>
> I guess the delay here is due to the synchronize_net()? That can take a
> while, 57ms seems a lot but I suppose it's possible.
>
>> > [ 4020.501626] Tx BA session stop requested for 00:0a:79:eb:56:10 tid 0
>> > [ 4023.740570] switched off addBA timer for tid 0
>> > [ 4023.740578] got addBA resp for tid 0 but we already gave up
>>
>> Here is the AP is finally replying
>
> It's kinda hard to believe that the AP took 4 seconds (!) to reply to
> the frame. Where could the frame get stuck? I don't see any other work
> processing happening etc. either. It's also curious that in those 3
> seconds between these messages, we didn't actually get around to
> stopping the session, that only happens just after:

Yeah you are right, didn't look at the timestamps. Not sure you would
see work being processed though.

>
>> > [ 4023.740619] Stopping Tx BA session for 00:0a:79:eb:56:10 tid 0
>
> (here)
>
>> > [ 4023.768544] Open BA session requested for 00:0a:79:eb:56:10 tid 0
>>
>> Here we are trying again
>>
>> > [ 4023.784292] activated addBA response timer on tid 0
>> > [ 4023.786294] switched off addBA timer for tid 0
>
> 20ms response time here, that's much more reasonable.
>
>
> Could something be hogging the workqueues?
>

Frankly, I am seeing issues that seem to point to workqueues too.
Sometimes mac80211 seems just not responsive.
Sometimes I come back to mac80211 for the AGG callback (start or
stop), and it takes ages (5 seconds !) until it actually move to
operationnal / stopped state.

It might that we are holding the mac80211 workqueue in the driver too...
I guess we could try to enable MAC80211 debug flag with timestamps to check.

> johannes
>
>

2011-11-29 08:34:27

by Johannes Berg

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

I noticed that the logs are a bit odd wrt. timing.

> > Intersperesed I see some other messages that are new to me:
> > [ 4019.443129] Open BA session requested for 00:0a:79:eb:56:10 tid 0
> > [ 4019.500149] activated addBA response timer on tid 0
> > [ 4020.500033] addBA response timer expired on tid 0

I guess the delay here is due to the synchronize_net()? That can take a
while, 57ms seems a lot but I suppose it's possible.

> > [ 4020.501626] Tx BA session stop requested for 00:0a:79:eb:56:10 tid 0
> > [ 4023.740570] switched off addBA timer for tid 0
> > [ 4023.740578] got addBA resp for tid 0 but we already gave up
>
> Here is the AP is finally replying

It's kinda hard to believe that the AP took 4 seconds (!) to reply to
the frame. Where could the frame get stuck? I don't see any other work
processing happening etc. either. It's also curious that in those 3
seconds between these messages, we didn't actually get around to
stopping the session, that only happens just after:

> > [ 4023.740619] Stopping Tx BA session for 00:0a:79:eb:56:10 tid 0

(here)

> > [ 4023.768544] Open BA session requested for 00:0a:79:eb:56:10 tid 0
>
> Here we are trying again
>
> > [ 4023.784292] activated addBA response timer on tid 0
> > [ 4023.786294] switched off addBA timer for tid 0

20ms response time here, that's much more reasonable.


Could something be hogging the workqueues?

johannes


2011-11-29 07:48:27

by Jeff Chua

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

On Tue, Nov 29, 2011 at 2:59 PM, Emmanuel Grumbach <[email protected]> wrote:
>>
>> Ok, here we go. First of all, it seems that the problem is solved
>> inthe sense that the connection stays alive all the time and does not
>> disappear. OTOH, I now get quite a lot of messages, that might point
>> either to a broken AP here, or something else:
>> =============================
>> Anyway, the connection might stumble now and then a bit, but it does
>> not break down, so I don't complain ;-)
>>
>> BIG THANKS to everyone!!!
>>
>
> Thanks for you testing and patience.

Does this have anything to do with another subject "MAC is in deep
sleep!" ... Not trying to interrupt the discussion here, but my iwlagn
is unstable at 3.2.0-rc3. I'm seeing a lot of "MAC is in deep sleep!"

The WEP connection will work for a while, then gets stuck and goes to
deep sleep.


011-11-27T16:24:23.791863+08:00 boston kernel: wlan0: associated
2011-11-27T17:12:46.371856+08:00 boston kernel: iwlwifi 0000:02:00.0:
Queue 0 stuck for 2000 ms.
2011-11-27T17:12:46.371887+08:00 boston kernel: iwlwifi 0000:02:00.0:
Current read_ptr 46 write_ptr 47
2011-11-27T17:12:46.371895+08:00 boston kernel: iwlwifi 0000:02:00.0:
On demand firmware reload
2011-11-27T17:12:48.371858+08:00 boston kernel: iwlwifi 0000:02:00.0:
Error sending REPLY_RXON: time out after 2000ms.
2011-11-27T17:12:48.371870+08:00 boston kernel: iwlwifi 0000:02:00.0:
Current CMD queue read_ptr 133 write_ptr 137
2011-11-27T17:12:48.371874+08:00 boston kernel: iwlwifi 0000:02:00.0:
Error clearing ASSOC_MSK on BSS (-110)
2011-11-27T17:12:48.681210+08:00 boston kernel: iwlwifi 0000:02:00.0:
MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF


Thanks,
Jeff

2011-11-29 07:50:30

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

>>>
>>> Ok, here we go. First of all, it seems that the problem is solved
>>> inthe sense that the connection stays alive all the time and does not
>>> disappear. OTOH, I now get quite a lot of messages, that might point
>>> either to a broken AP here, or something else:
>>> =============================
>>> Anyway, the connection might stumble now and then a bit, but it does
>>> not break down, so I don't complain ;-)
>>>
>>> BIG THANKS to everyone!!!
>>>
>>
>> Thanks for you testing and patience.
>
> Does this have anything to do with another subject "MAC is in deep
> sleep!" ... Not trying to interrupt the discussion here, but my iwlagn
> is unstable at 3.2.0-rc3. I'm seeing a lot of "MAC is in deep sleep!"
>

Probably not.

2011-11-29 09:32:34

by Norbert Preining

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

Hi all,

> It might that we are holding the mac80211 workqueue in the driver too...
> I guess we could try to enable MAC80211 debug flag with timestamps to check.

Actually I got some strange behaviour today at work, but that might be a university temporary problem.

Let me know what I can send you?
I am currently using stock git plus that one patch.

If you want more patches send me them, and which module parameters.

All the best

Norbert


2011-11-28 23:25:43

by Norbert Preining

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

Hi Emmanuel, hi all,

On Mo, 28 Nov 2011, Norbert Preining wrote:
> I will report more after a few suspend/resume iteations.

Ok, here we go. First of all, it seems that the problem is solved
inthe sense that the connection stays alive all the time and does not
disappear. OTOH, I now get quite a lot of messages, that might point
either to a broken AP here, or something else:

After resume I see:
[ 3995.266230] iwlwifi 0000:06:00.0: L1 Enabled; Disabling L0S
[ 3995.270239] iwlwifi 0000:06:00.0: Radio type=0x1-0x2-0x0
[ 4002.943037] wlan0: authenticate with 00:0a:79:eb:56:10 (try 1)
[ 4002.945672] wlan0: authenticated
[ 4002.951806] wlan0: associate with 00:0a:79:eb:56:10 (try 1)
[ 4002.964652] wlan0: RX AssocResp from 00:0a:79:eb:56:10 (capab=0x411 status=0 aid=2)
[ 4002.964662] wlan0: associated
[ 4007.088934] Rx A-MPDU request on tid 0 result 0

then it starts with these, and there are *many* of them, literally
several hundreds
[ 4019.340036] ieee80211 phy1: release an RX reorder frame due to timeout on earlier frames

Intersperesed I see some other messages that are new to me:
[ 4019.443129] Open BA session requested for 00:0a:79:eb:56:10 tid 0
[ 4019.500149] activated addBA response timer on tid 0
[ 4020.500033] addBA response timer expired on tid 0
[ 4020.501626] Tx BA session stop requested for 00:0a:79:eb:56:10 tid 0
[ 4023.740570] switched off addBA timer for tid 0
[ 4023.740578] got addBA resp for tid 0 but we already gave up
[ 4023.740619] Stopping Tx BA session for 00:0a:79:eb:56:10 tid 0
[ 4023.768544] Open BA session requested for 00:0a:79:eb:56:10 tid 0
[ 4023.784292] activated addBA response timer on tid 0
[ 4023.786294] switched off addBA timer for tid 0
[ 4023.786301] Aggregation is on for tid 0
[ 4023.786480] iwlwifi 0000:06:00.0: Tx aggregation enabled on ra = 00:0a:79:eb:56:10 tid = 0


[ 4107.143615] Open BA session requested for 00:0a:79:eb:56:10 tid 6
[ 4107.180359] activated addBA response timer on tid 6
[ 4107.182333] switched off addBA timer for tid 6
[ 4107.182338] Aggregation is on for tid 6
[ 4107.182567] iwlwifi 0000:06:00.0: Tx aggregation enabled on ra = 00:0a:79:eb:56:10 tid = 6
[ 4124.093846] delba from 00:0a:79:eb:56:10 (initiator) tid 0 reason code 39
[ 4124.093856] Rx BA session stop requested for 00:0a:79:eb:56:10 tid 0
[ 4124.094270] Rx A-MPDU request on tid 0 result 0
[ 4210.670921] wlan0: deauthenticated from 00:0a:79:eb:56:10 (Reason: 6)
[ 4210.670941] Tx BA session stop requested for 00:0a:79:eb:56:10 tid 0
[ 4210.671083] Rx BA session stop requested for 00:0a:79:eb:56:10 tid 0
[ 4210.671258] Tx BA session stop requested for 00:0a:79:eb:56:10 tid 6
[ 4210.685078] Tx BA session stop requested for 00:0a:79:eb:56:10 tid 0
[ 4210.685091] iwlwifi 0000:06:00.0: Stopping AGG while state not ON or starting for 0 on 0 (0)
[ 4210.685105] Tx BA session stop requested for 00:0a:79:eb:56:10 tid 6
[ 4210.685114] iwlwifi 0000:06:00.0: Stopping AGG while state not ON or starting for 0 on 6 (0)
[ 4210.808184] Stopping Tx BA session for 00:0a:79:eb:56:10 tid 0
[ 4210.808192] Could not find station: 00:0a:79:eb:56:10
[ 4210.808200] Stopping Tx BA session for 00:0a:79:eb:56:10 tid 6
[ 4210.808205] Could not find station: 00:0a:79:eb:56:10
[ 4210.808257] cfg80211: Calling CRDA for country: JP
[ 4214.051615] wlan0: authenticate with 00:0a:79:eb:56:10 (try 1)
[ 4214.054149] wlan0: authenticated
[ 4214.057681] wlan0: associate with 00:0a:79:eb:56:10 (try 1)
[ 4214.063430] wlan0: RX ReassocResp from 00:0a:79:eb:56:10 (capab=0x411 status=0 aid=1)
[ 4214.063434] wlan0: associated
[ 4218.780329] Rx A-MPDU request on tid 0 result 0
[ 4219.800527] Open BA session requested for 00:0a:79:eb:56:10 tid 0
[ 4219.816203] activated addBA response timer on tid 0
[ 4219.818148] switched off addBA timer for tid 0
[ 4219.818156] Aggregation is on for tid 0
[ 4219.818365] iwlwifi 0000:06:00.0: Tx aggregation enabled on ra = 00:0a:79:eb:56:10 tid = 0


[ 4329.425781] delba from 00:0a:79:eb:56:10 (initiator) tid 0 reason code 39
[ 4329.425791] Rx BA session stop requested for 00:0a:79:eb:56:10 tid 0
[ 4334.433017] Rx A-MPDU request on tid 0 result 0
[ 4438.143959] Open BA session requested for 00:0a:79:eb:56:10 tid 6
[ 4438.176193] activated addBA response timer on tid 6
[ 4438.178140] switched off addBA timer for tid 6
[ 4438.178146] Aggregation is on for tid 6
[ 4438.178343] iwlwifi 0000:06:00.0: Tx aggregation enabled on ra = 00:0a:79:eb:56:10 tid = 6



[ 4966.073860] delba from 00:0a:79:eb:56:10 (initiator) tid 0 reason code 39
[ 4966.073870] Rx BA session stop requested for 00:0a:79:eb:56:10 tid 0
[ 4966.116071] Rx A-MPDU request on tid 0 result 0


At the end there are mostly two lines, the
[ 5983.321525] net_ratelimit: 1 callbacks suppressed
[ 5983.321537] ieee80211 phy1: release an RX reorder frame due to timeout on earlier frames

Where for every 10-20 of the second one there is one of the first one.

=============================
Anyway, the connection might stumble now and then a bit, but it does
not break down, so I don't complain ;-)

BIG THANKS to everyone!!!


Best wishes

Norbert
------------------------------------------------------------------------
Norbert Preining preining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan TeX Live & Debian Developer
DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
------------------------------------------------------------------------
Prostetnic Vogon Jeltz smiled very slowly. This was done
not so much for effect as because he was trying to remember
the sequence of muscle movements.
--- Douglas Adams, The Hitchhikers Guide to the Galaxy

2011-12-11 19:56:49

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

>> Could something be hogging the workqueues?
>>
>
>
> So I tried to understand what is going on with the workqueue and ended
> up to see that if we are lucky, we can need the workqueue for the BA
> handshake (could be AddBA / DelBA handling, or driver callback) while
> we are scanning. Which basically means that we will need to wait until
> the scan is over to handle these frames / callbacks. I got these
> measurements while stopping the BA session:
>
> * scanning working for roughly 3 seconds (pardon me not being precise,
> but with this order of magnitude I don't care much about the single
> millisecond..)
> * when scanning is over, the while loop in ieee80211_iface_work
> consumes 73 mgmt for about 34ms.
> ( how come we have so many beacons during those 3 seconds..., or maybe
> all the BCAST probe request ?, my network is quite busy...)
> * then the finally my stop_tx_ba_cb was served which took 10ms (time
> takes by the driver).
> * another series of beacons (10ms).

What about flushing the workqueue before we scan ?
This is not a bullet proof solution of course, we will still encounter
bad races, but at least we would flush what we can before the
workqueue becomes unable for 4 seconds (!).
We can also delay the scan if we are in the middle of {add,del}BA
handshake, which is the only flow I can think about that needs
responsiveness. The other frame exchanges are MLME ones and involve
the wpa_supplicant (unless we are using the late WEXT). Hopefully the
wpa_supplicant won't request to scan in the middle of association or
so. There might be other features (mesh or whatever), that may be
hidden from the wpa_supplicant and require good responsiveness from
the wq too.

2011-12-01 03:17:46

by Norbert Preining

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

On Di, 29 Nov 2011, Norbert Preining wrote:
> > It might that we are holding the mac80211 workqueue in the driver too...
> > I guess we could try to enable MAC80211 debug flag with timestamps to check.
>
> Actually I got some strange behaviour today at work, but that might be a university temporary problem.

Yeah, I have to say that the situation is much better, but not solved.
Today I got a completely stuch driver again. I will reactivate debugging
and see what I get.

Best wishes

Norbert
------------------------------------------------------------------------
Norbert Preining preining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan TeX Live & Debian Developer
DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
------------------------------------------------------------------------
...[Arthur] leapt to his feet like an author hearing the
phone ring...
--- Who says that the character of Arthur isn't
--- autobiographical?
--- Douglas Adams, The Hitchhikers Guide to the Galaxy

2011-12-11 09:24:08

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

>> > Intersperesed I see some other messages that are new to me:
>> > [ 4019.443129] Open BA session requested for 00:0a:79:eb:56:10 tid 0
>> > [ 4019.500149] activated addBA response timer on tid 0
>> > [ 4020.500033] addBA response timer expired on tid 0
>
> I guess the delay here is due to the synchronize_net()? That can take a
> while, 57ms seems a lot but I suppose it's possible.
>
>> > [ 4020.501626] Tx BA session stop requested for 00:0a:79:eb:56:10 tid 0
>> > [ 4023.740570] switched off addBA timer for tid 0
>> > [ 4023.740578] got addBA resp for tid 0 but we already gave up
>>
>> Here is the AP is finally replying
>
> It's kinda hard to believe that the AP took 4 seconds (!) to reply to
> the frame. Where could the frame get stuck? I don't see any other work
> processing happening etc. either. It's also curious that in those 3
> seconds between these messages, we didn't actually get around to
> stopping the session, that only happens just after:
>

<snip>

>
>
> Could something be hogging the workqueues?
>


So I tried to understand what is going on with the workqueue and ended
up to see that if we are lucky, we can need the workqueue for the BA
handshake (could be AddBA / DelBA handling, or driver callback) while
we are scanning. Which basically means that we will need to wait until
the scan is over to handle these frames / callbacks. I got these
measurements while stopping the BA session:

* scanning working for roughly 3 seconds (pardon me not being precise,
but with this order of magnitude I don't care much about the single
millisecond..)
* when scanning is over, the while loop in ieee80211_iface_work
consumes 73 mgmt for about 34ms.
( how come we have so many beacons during those 3 seconds..., or maybe
all the BCAST probe request ?, my network is quite busy...)
* then the finally my stop_tx_ba_cb was served which took 10ms (time
takes by the driver).
* another series of beacons (10ms).

2011-12-15 14:07:43

by Johannes Berg

[permalink] [raw]
Subject: Re: iwlagn is getting very shaky

So I finally looked at this ...

On Sun, 2011-12-11 at 21:56 +0200, Emmanuel Grumbach wrote:
> >> Could something be hogging the workqueues?

> > So I tried to understand what is going on with the workqueue and ended
> > up to see that if we are lucky, we can need the workqueue for the BA
> > handshake (could be AddBA / DelBA handling, or driver callback) while
> > we are scanning. Which basically means that we will need to wait until
> > the scan is over to handle these frames / callbacks. I got these
> > measurements while stopping the BA session:
> >
> > * scanning working for roughly 3 seconds (pardon me not being precise,
> > but with this order of magnitude I don't care much about the single
> > millisecond..)

Oh. I see, while scanning we won't process the work queue.

> > * when scanning is over, the while loop in ieee80211_iface_work
> > consumes 73 mgmt for about 34ms.
> > ( how come we have so many beacons during those 3 seconds..., or maybe
> > all the BCAST probe request ?, my network is quite busy...)
> > * then the finally my stop_tx_ba_cb was served which took 10ms (time
> > takes by the driver).
> > * another series of beacons (10ms).
>
> What about flushing the workqueue before we scan ?
> This is not a bullet proof solution of course, we will still encounter
> bad races, but at least we would flush what we can before the
> workqueue becomes unable for 4 seconds (!).

Yeah, that seems like a good thing. Actually I had an idea about this
before -- drain & stop the workqueue for any functions in mac80211/cfg.c
so that mac80211 essentially becomes single-threaded.

> We can also delay the scan if we are in the middle of {add,del}BA
> handshake, which is the only flow I can think about that needs
> responsiveness. The other frame exchanges are MLME ones and involve
> the wpa_supplicant (unless we are using the late WEXT). Hopefully the
> wpa_supplicant won't request to scan in the middle of association or
> so. There might be other features (mesh or whatever), that may be
> hidden from the wpa_supplicant and require good responsiveness from
> the wq too.

Hm, yeah, that would be an idea too, but I'm not sure it's easy to do
right now.

johannes