2013-02-11 18:18:27

by Ben Greear

[permalink] [raw]
Subject: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

We have a test where we have 30 or so stations each passing slow
speed UDP traffic. Between AP and station is an attenuator.

We automatically take the attenuation from 0 to 95 (actual rx signal
goes to around -80 due to RF leakage).

When the test is complete, attenuation is set back to 20, which is near
perfect for throughput. However, none of the station interfaces can transmit
anything. It seems the ath9k is hung with packets waiting to
be transmitted but no progress.

Kernel is moderately hacked 3.7.6+.

Bouncing the driver (resetting all interfaces on it, which causes
driver reset of one type or another), 'fixes' the problem.


[root@ct521-5359-F17x64 lanforge]# cat /debug/ieee80211/wiphy0/ath9k/xmit
Num-Tx-Queues: 10 tx-queues-setup: 0x10f poll-work-seen: 351348
BE BK VI VO

MPDUs Queued: 169897 0 0 322175
MPDUs Completed: 168456 0 0 317321
MPDUs XRetried: 1441 0 0 4854
Aggregates: 309356 0 0 0
AMPDUs Queued HW: 419263 0 0 0
AMPDUs Queued SW: 2438246 0 0 0
AMPDUs Completed: 2850615 0 0 0
AMPDUs Retried: 151074 0 0 0
AMPDUs XRetried: 6060 0 0 0
TXERR Filtered: 0 0 0 0
FIFO Underrun: 5 0 0 0
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 4 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 5 0 0 0
TX-Pkts-All: 3026572 0 0 322175
TX-Bytes-All: 2199375974 0 0 14207045
hw-put-tx-buf: 1268553 0 0 306333
hw-tx-start: 0 0 0 0
hw-tx-proc-desc: 0 0 0 0
TX-Failed: 0 0 0 0
txq-memory-address:ffff88021583b018ffff88021583b128ffff88021583af08ffff88021583adf8
axq-qnum: 2 3 1 0
axq-depth: 0 0 0 0
axq-ampdu_depth: 0 0 0 0
axq-stopped 1 0 0 0
tx-in-progress 0 0 0 0
pending-frames 125 0 0 0
txq_headidx: 1 0 0 5
txq_tailidx: 1 0 0 5
axq_q empty: 1 1 1 1
axq_acq empty: 1 1 1 1
txq_fifo[0] empty: 1 1 1 1
txq_fifo[1] empty: 1 1 1 1
txq_fifo[2] empty: 1 1 1 1
txq_fifo[3] empty: 1 1 1 1
txq_fifo[4] empty: 1 1 1 1
txq_fifo[5] empty: 1 1 1 1
txq_fifo[6] empty: 1 1 1 1
txq_fifo[7] empty: 1 1 1 1
[root@ct521-5359-F17x64 lanforge]#


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com



2013-02-21 01:39:51

by Sujith Manoharan

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

Ben Greear wrote:
> For instance, in this case, why do we have pending frames, the axq-stopped,
> and no axq depth? Is that an invalid state to begin with? Once
> it gets in the hung state, those numbers never change. I'd assume
> something should be poking more packets out of the pending frames
> down into the axq logic?

Something is broken in the xmit path, definitely.

Sujith

2013-02-21 23:11:21

by Ben Greear

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

On 02/21/2013 09:34 AM, Ben Greear wrote:
> On 02/20/2013 09:41 PM, Ben Greear wrote:
>> On 02/20/2013 05:38 PM, Sujith Manoharan wrote:
>>> Ben Greear wrote:
>>>> For instance, in this case, why do we have pending frames, the axq-stopped,
>>>> and no axq depth? Is that an invalid state to begin with? Once
>>>> it gets in the hung state, those numbers never change. I'd assume
>>>> something should be poking more packets out of the pending frames
>>>> down into the axq logic?
>>>
>>> Something is broken in the xmit path, definitely.
>>
>> Ok, so here's a question: In the ath_tx_complete method,
>> the pending_frames counter is only decremented if txq == c->tx.txq_map[q].
>>
>> Maybe it should always be decremented?
>>
>> What kinds of things could cause txq to not equal the txq-map[q]?
>
> I put in debugging code to check for that...and I can still reproduce
> the hang without ever failing the txq == ct->tx.txq_map[q]
> test..so problem is elsewhere it seems...

Ok, I think I see the problem, or at least some of it.

When the attenuation goes very high (signal of -80 or lower),
all transmit basically stops, at least for a bit (possibly while
rate-control algorithms adjust).

During this time, the ath_tx_complete_poll_work logic can hit,
causing a reset of the NIC.

I am seeing at the end of ath_draintxq that axq->pending_frames
reports 53 (in one example).

Shouldn't pending_frames be zero after finishing the ath_draintxq?


I added some logic to force pending_frames to be zero at the end
of that method (and also added some extra logic to reset when I
detect pending_frames type hang, and now my system appears to
recover when attenuation goes back to normal levels....

I'll post my hackings as RFC shortly.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-02-12 00:12:43

by Sujith Manoharan

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

Ben Greear wrote:
> axq-qnum: 2 3 1 0
> axq-depth: 0 0 0 0
> axq-ampdu_depth: 0 0 0 0
> axq-stopped 1 0 0 0
> tx-in-progress 0 0 0 0
> pending-frames 125 0 0 0

The frames appear to be stuck in the pending queue.

Sujith

2013-02-19 02:15:42

by Sujith Manoharan

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

Ben Greear wrote:
> On 02/11/2013 04:11 PM, Sujith Manoharan wrote:
> > Ben Greear wrote:
> >> axq-qnum: 2 3 1 0
> >> axq-depth: 0 0 0 0
> >> axq-ampdu_depth: 0 0 0 0
> >> axq-stopped 1 0 0 0
> >> tx-in-progress 0 0 0 0
> >> pending-frames 125 0 0 0
> >
> > The frames appear to be stuck in the pending queue.
>
> Any ideas on what code to go poking at?

xmit.c would be a good place. Can you also post the kernel log with some
debugging information ? debug=0x8f49 would be good.

Sujith

2013-02-19 05:18:15

by Ben Greear

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

On 02/18/2013 06:36 PM, Sujith Manoharan wrote:
> Sujith Manoharan wrote:
>> xmit.c would be a good place. Can you also post the kernel log with some
>> debugging information ? debug=0x8f49 would be good.
>
> We came across an issue where slowly reducing attenuation would eventually
> cause transmission to fail. This was seen only with the ath9k rate control module
> and was fixed by the commit: "ath9k: Fix TX filter usage". With minstrel_ht,
> things were smooth. Can you also try changing the RC and see if the issue is still seen ?

That tx-filter patch is already in my kernel. I believe this test was using ath9k
RC, so will try changing to minstrel_ht and see if that fixes anything.

My first concern is fixing the kernel crash related to removing
interfaces, so might be a day or two before I can get back to this
particular issue...

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-02-21 02:02:14

by Ben Greear

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

On 02/20/2013 05:36 PM, Sujith Manoharan wrote:
> Ben Greear wrote:
>> From the AP lockup case, dmesg is full of stuff like this:
>>
>> [ 1819.538577] vap1: delba from 00:3f:d1:e4:32:51 (initiator) tid 0 reason code 37
>> [ 1819.640602] vap1: Rx A-MPDU request on 00:6f:f9:68:b1:31 tid 0 result 0
>> [ 1819.690601] vap1: Rx A-MPDU request on 00:01:ab:ba:da:d3 tid 0 result 0
>> [ 1819.778013] vap1: RX session timer expired on 00:59:82:d1:d8:8c tid 0
>> [ 1819.778040] vap1: Rx BA session stop requested for 00:59:82:d1:d8:8c tid 0 recipient reason: 39
>> [ 1819.814600] vap1: delba from 00:59:82:d1:d8:8c (initiator) tid 0 reason code 37
>> [ 1819.820010] vap1: RX session timer expired on 00:eb:69:fc:4b:db tid 0
>> [ 1819.820034] vap1: Rx BA session stop requested for 00:eb:69:fc:4b:db tid 0 recipient reason: 39
>> [ 1819.863631] vap1: delba from 00:eb:69:fc:4b:db (initiator) tid 0 reason code 37
>> [ 1819.978012] vap1: RX session timer expired on 00:39:9a:d1:0e:10 tid 0
>> [ 1819.978035] vap1: Rx BA session stop requested for 00:39:9a:d1:0e:10 tid 0 recipient reason: 39
>> [ 1820.014580] vap1: delba from 00:39:9a:d1:0e:10 (initiator) tid 0 reason code 37
>
> The ADDBA sessions are being torn down aggressively.
>
>> TXERR Filtered: 224 0 0 0
>
> The recipient has gone out of range and frames are being filtered here.
> As a quick test, can you disable PowerSave on the client and check if the issue
> still happens ?

Powersave is always disabled in our tests...can't use it with multiple VIFS
as far as I know...

System doesn't recover if we drop the attenuation back to optimal levels, btw.

Ben

>
> Sujith
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-02-21 01:06:16

by Ben Greear

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

On 02/20/2013 02:51 PM, Ben Greear wrote:
> On 02/18/2013 06:14 PM, Sujith Manoharan wrote:
>> Ben Greear wrote:
>>> On 02/11/2013 04:11 PM, Sujith Manoharan wrote:
>>>> Ben Greear wrote:
>>>>> axq-qnum: 2 3 1 0
>>>>> axq-depth: 0 0 0 0
>>>>> axq-ampdu_depth: 0 0 0 0
>>>>> axq-stopped 1 0 0 0
>>>>> tx-in-progress 0 0 0 0
>>>>> pending-frames 125 0 0 0
>>>>
>>>> The frames appear to be stuck in the pending queue.
>>>
>>> Any ideas on what code to go poking at?
>>
>> xmit.c would be a good place. Can you also post the kernel log with some
>> debugging information ? debug=0x8f49 would be good.

I was 2/2 hitting the problem without debug set, but
had two clean runs with debugging enabled. Either I'm
unlucky, or enabling debugging messes with things enough
to make the problem go away :P

Maybe looking at the xmit number can give some info?

> axq-qnum: 2 3 1 0
> axq-depth: 0 0 0 0
> axq-ampdu_depth: 0 0 0 0
> axq-stopped 1 0 0 0
> tx-in-progress 0 0 0 0
> pending-frames 123 0 0 0
> txq_headidx: 7 0 0 6
> txq_tailidx: 7 0 0 6

For instance, in this case, why do we have pending frames, the axq-stopped,
and no axq depth? Is that an invalid state to begin with? Once
it gets in the hung state, those numbers never change. I'd assume
something should be poking more packets out of the pending frames
down into the axq logic?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-02-21 05:42:00

by Ben Greear

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

On 02/20/2013 05:38 PM, Sujith Manoharan wrote:
> Ben Greear wrote:
>> For instance, in this case, why do we have pending frames, the axq-stopped,
>> and no axq depth? Is that an invalid state to begin with? Once
>> it gets in the hung state, those numbers never change. I'd assume
>> something should be poking more packets out of the pending frames
>> down into the axq logic?
>
> Something is broken in the xmit path, definitely.

Ok, so here's a question: In the ath_tx_complete method,
the pending_frames counter is only decremented if txq == c->tx.txq_map[q].

Maybe it should always be decremented?

What kinds of things could cause txq to not equal the txq-map[q]?

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-02-22 04:24:44

by Sujith Manoharan

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

Ben Greear wrote:
> Ok, I think I see the problem, or at least some of it.
>
> When the attenuation goes very high (signal of -80 or lower),
> all transmit basically stops, at least for a bit (possibly while
> rate-control algorithms adjust).
>
> During this time, the ath_tx_complete_poll_work logic can hit,
> causing a reset of the NIC.

Doing a chip reset because a station goes out of range doesn't seem correct.

Sujith

2013-02-21 23:49:36

by Adrian Chadd

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

.. so the current code correctly soft fails all of those frames and
puts them back onto the relevant software queues, right?


Adrian

2013-02-20 22:51:17

by Ben Greear

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

On 02/18/2013 06:14 PM, Sujith Manoharan wrote:
> Ben Greear wrote:
>> On 02/11/2013 04:11 PM, Sujith Manoharan wrote:
>>> Ben Greear wrote:
>>>> axq-qnum: 2 3 1 0
>>>> axq-depth: 0 0 0 0
>>>> axq-ampdu_depth: 0 0 0 0
>>>> axq-stopped 1 0 0 0
>>>> tx-in-progress 0 0 0 0
>>>> pending-frames 125 0 0 0
>>>
>>> The frames appear to be stuck in the pending queue.
>>
>> Any ideas on what code to go poking at?
>
> xmit.c would be a good place. Can you also post the kernel log with some
> debugging information ? debug=0x8f49 would be good.

I started working on this again. Here's a better description of the test
case:

I have 32 stations connected to an AP, both ath9k, 3.7.9+ kernel.

Each station is configured to pass 64kbps traffic bi-directional
UDP traffic, 1500 byte packets. The traffic is generated externally
(the AP is bridged to an ethernet device, the stations are bridged
in software using AF_PACKET sockets.

At attenuator sits between the two systems.

I ramp up the attenuation in 10db increments from 0 to 95.

At some point, the AP wedges it's xmit queue. It appears it still
receives traffic fine (I see lots of packets coming in the AP
interface and going out the ethernet port).

On a second run, it was the station machine that locked up.
One thing that may be interesting: queue V0 appears to be
passing traffic fine (at least as far as the ath9k xmit debugfs
file shows), but queue BE is wedged.

From the AP lockup case, dmesg is full of stuff like this:

[ 1819.538577] vap1: delba from 00:3f:d1:e4:32:51 (initiator) tid 0 reason code 37
[ 1819.640602] vap1: Rx A-MPDU request on 00:6f:f9:68:b1:31 tid 0 result 0
[ 1819.690601] vap1: Rx A-MPDU request on 00:01:ab:ba:da:d3 tid 0 result 0
[ 1819.778013] vap1: RX session timer expired on 00:59:82:d1:d8:8c tid 0
[ 1819.778040] vap1: Rx BA session stop requested for 00:59:82:d1:d8:8c tid 0 recipient reason: 39
[ 1819.814600] vap1: delba from 00:59:82:d1:d8:8c (initiator) tid 0 reason code 37
[ 1819.820010] vap1: RX session timer expired on 00:eb:69:fc:4b:db tid 0
[ 1819.820034] vap1: Rx BA session stop requested for 00:eb:69:fc:4b:db tid 0 recipient reason: 39
[ 1819.863631] vap1: delba from 00:eb:69:fc:4b:db (initiator) tid 0 reason code 37
[ 1819.978012] vap1: RX session timer expired on 00:39:9a:d1:0e:10 tid 0
[ 1819.978035] vap1: Rx BA session stop requested for 00:39:9a:d1:0e:10 tid 0 recipient reason: 39
[ 1820.014580] vap1: delba from 00:39:9a:d1:0e:10 (initiator) tid 0 reason code 37
[ 1820.184596] vap1: Rx A-MPDU request on 00:3f:d1:e4:32:51 tid 0 result 0
[ 1820.192554] vap1: Rx A-MPDU request on 00:00:c5:89:c5:93 tid 0 result 0
[ 1820.578013] vap1: RX session timer expired on 00:dc:23:64:e2:f5 tid 0
[ 1820.578041] vap1: Rx BA session stop requested for 00:dc:23:64:e2:f5 tid 0 recipient reason: 39
[ 1820.610580] vap1: delba from 00:dc:23:64:e2:f5 (initiator) tid 0 reason code 37
[ 1820.635016] vap1: RX session timer expired on 00:fb:a4:41:62:1a tid 0
[ 1820.635045] vap1: Rx BA session stop requested for 00:fb:a4:41:62:1a tid 0 recipient reason: 39
[ 1820.664631] vap1: Rx BA session stop requested for 00:22:b6:3c:26:13 tid 0 inititator reason: 0
[ 1820.763599] vap1: Rx A-MPDU request on 00:22:b6:3c:26:13 tid 0 result 0
[ 1820.857649] vap1: Rx A-MPDU request on 00:e3:11:12:3a:14 tid 0 result 0


# From the AP lockup case:

[root@lf0301-demo1 ~]# cat /debug/ieee80211/wiphy0/ath9k/xmit ;sleep 10; cat /debug/ieee80211/wiphy0/ath9k/xmit ;
Num-Tx-Queues: 10 tx-queues-setup: 0x10f poll-work-seen: 1823
BE BK VI VO

MPDUs Queued: 500 0 0 4102
MPDUs Completed: 473 0 0 4102
MPDUs XRetried: 27 0 0 0
Aggregates: 200 0 0 0
AMPDUs Queued HW: 97039 0 0 0
AMPDUs Queued SW: 11299 0 0 0
AMPDUs Completed: 107982 0 0 0
AMPDUs Retried: 410 0 0 0
AMPDUs XRetried: 116 0 0 0
TXERR Filtered: 224 0 0 0
FIFO Underrun: 0 0 0 0
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 0 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 0 0 0 0
TX-Pkts-All: 108598 0 0 4102
TX-Bytes-All: 164436206 0 0 304838
hw-put-tx-buf: 107863 0 0 4096
hw-tx-start: 0 0 0 0
hw-tx-proc-desc: 0 0 0 0
TX-Failed: 0 0 0 0
txq-memory-address: f2b724e8 f2b725a4 f2b7242c f2b72370
axq-qnum: 2 3 1 0
axq-depth: 0 0 0 0
axq-ampdu_depth: 0 0 0 0
axq-stopped 1 0 0 0
tx-in-progress 0 0 0 0
pending-frames 123 0 0 0
txq_headidx: 7 0 0 0
txq_tailidx: 7 0 0 0
axq_q empty: 1 1 1 1
axq_acq empty: 1 1 1 1
txq_fifo[0] empty: 1 1 1 1
txq_fifo[1] empty: 1 1 1 1
txq_fifo[2] empty: 1 1 1 1
txq_fifo[3] empty: 1 1 1 1
txq_fifo[4] empty: 1 1 1 1
txq_fifo[5] empty: 1 1 1 1
txq_fifo[6] empty: 1 1 1 1
txq_fifo[7] empty: 1 1 1 1
Num-Tx-Queues: 10 tx-queues-setup: 0x10f poll-work-seen: 1833
BE BK VI VO

MPDUs Queued: 500 0 0 4140
MPDUs Completed: 473 0 0 4140
MPDUs XRetried: 27 0 0 0
Aggregates: 200 0 0 0
AMPDUs Queued HW: 97039 0 0 0
AMPDUs Queued SW: 11299 0 0 0
AMPDUs Completed: 107982 0 0 0
AMPDUs Retried: 410 0 0 0
AMPDUs XRetried: 116 0 0 0
TXERR Filtered: 224 0 0 0
FIFO Underrun: 0 0 0 0
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 0 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 0 0 0 0
TX-Pkts-All: 108598 0 0 4140
TX-Bytes-All: 164436206 0 0 306788
hw-put-tx-buf: 107863 0 0 4134
hw-tx-start: 0 0 0 0
hw-tx-proc-desc: 0 0 0 0
TX-Failed: 0 0 0 0
txq-memory-address: f2b724e8 f2b725a4 f2b7242c f2b72370
axq-qnum: 2 3 1 0
axq-depth: 0 0 0 0
axq-ampdu_depth: 0 0 0 0
axq-stopped 1 0 0 0
tx-in-progress 0 0 0 0
pending-frames 123 0 0 0
txq_headidx: 7 0 0 6
txq_tailidx: 7 0 0 6
axq_q empty: 1 1 1 1
axq_acq empty: 1 1 1 1
txq_fifo[0] empty: 1 1 1 1
txq_fifo[1] empty: 1 1 1 1
txq_fifo[2] empty: 1 1 1 1
txq_fifo[3] empty: 1 1 1 1
txq_fifo[4] empty: 1 1 1 1
txq_fifo[5] empty: 1 1 1 1
txq_fifo[6] empty: 1 1 1 1
txq_fifo[7] empty: 1 1 1 1

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-02-13 16:41:49

by Ben Greear

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

On 02/11/2013 04:11 PM, Sujith Manoharan wrote:
> Ben Greear wrote:
>> axq-qnum: 2 3 1 0
>> axq-depth: 0 0 0 0
>> axq-ampdu_depth: 0 0 0 0
>> axq-stopped 1 0 0 0
>> tx-in-progress 0 0 0 0
>> pending-frames 125 0 0 0
>
> The frames appear to be stuck in the pending queue.

Any ideas on what code to go poking at?

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com



2013-02-21 02:06:19

by Sujith Manoharan

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

Ben Greear wrote:
> Powersave is always disabled in our tests...can't use it with multiple VIFS
> as far as I know...

Ah, yes.

> System doesn't recover if we drop the attenuation back to optimal levels, btw.

Ok.

Sujith

2013-02-21 23:53:21

by Ben Greear

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

On 02/21/2013 03:49 PM, Adrian Chadd wrote:
> .. so the current code correctly soft fails all of those frames and
> puts them back onto the relevant software queues, right?

I might have been mistaken. Further testing makes me think the
pending-frames are properly dealt with. I see them show back up
in the tx-complete logic about 30 second after the first chip reset logic hits.

That seems quite long to me, but then again, it appears that without a chip
reset, they never show back up in the tx-complete logic at all (and thus
the hang).

I'm testing now with just the pending-frames lockup check with a plain
chip reset as work-around (and no fiddling with pending-frames counter).

Thanks,
Ben

>
>
> Adrian
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-02-19 02:37:52

by Sujith Manoharan

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

Sujith Manoharan wrote:
> xmit.c would be a good place. Can you also post the kernel log with some
> debugging information ? debug=0x8f49 would be good.

We came across an issue where slowly reducing attenuation would eventually
cause transmission to fail. This was seen only with the ath9k rate control module
and was fixed by the commit: "ath9k: Fix TX filter usage". With minstrel_ht,
things were smooth. Can you also try changing the RC and see if the issue is still seen ?

Sujith

2013-02-21 01:38:09

by Sujith Manoharan

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

Ben Greear wrote:
> From the AP lockup case, dmesg is full of stuff like this:
>
> [ 1819.538577] vap1: delba from 00:3f:d1:e4:32:51 (initiator) tid 0 reason code 37
> [ 1819.640602] vap1: Rx A-MPDU request on 00:6f:f9:68:b1:31 tid 0 result 0
> [ 1819.690601] vap1: Rx A-MPDU request on 00:01:ab:ba:da:d3 tid 0 result 0
> [ 1819.778013] vap1: RX session timer expired on 00:59:82:d1:d8:8c tid 0
> [ 1819.778040] vap1: Rx BA session stop requested for 00:59:82:d1:d8:8c tid 0 recipient reason: 39
> [ 1819.814600] vap1: delba from 00:59:82:d1:d8:8c (initiator) tid 0 reason code 37
> [ 1819.820010] vap1: RX session timer expired on 00:eb:69:fc:4b:db tid 0
> [ 1819.820034] vap1: Rx BA session stop requested for 00:eb:69:fc:4b:db tid 0 recipient reason: 39
> [ 1819.863631] vap1: delba from 00:eb:69:fc:4b:db (initiator) tid 0 reason code 37
> [ 1819.978012] vap1: RX session timer expired on 00:39:9a:d1:0e:10 tid 0
> [ 1819.978035] vap1: Rx BA session stop requested for 00:39:9a:d1:0e:10 tid 0 recipient reason: 39
> [ 1820.014580] vap1: delba from 00:39:9a:d1:0e:10 (initiator) tid 0 reason code 37

The ADDBA sessions are being torn down aggressively.

> TXERR Filtered: 224 0 0 0

The recipient has gone out of range and frames are being filtered here.
As a quick test, can you disable PowerSave on the client and check if the issue
still happens ?

Sujith

2013-02-22 00:11:18

by Ben Greear

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

On 02/21/2013 03:56 PM, Adrian Chadd wrote:
> Right. As you increase attenuation, two things happen:
>
> * You're going to see lots more filtered frames pop up; which may
> involve pause/unpause of the queue;
> * You're going to see BAR frames go out, which will involve
> pausing/unpausing the queue.
>
> So you may have found some very nice way to reproduce pause/unpause
> related problems that are brought on by lots of TX failures leading to
> BARs and filtered frames.
>
> (Having your BAR frames be filtered .. god I hope not.)
>
> So see whether there's some odd scenario where a queue is unpaused and
> things aren't drained from the pending queue?

The hang I see is where the stats sit like this (below) and never change (for the first queue)


To me, the interesting part is that the BE queue is stopped while
axq-depth and axq-ampdu depth are zero. I don't fully understand
this code..but I sort of expected the depths to be non-zero when
packets were being processed by the hardware..maybe that's totally
incorrect...

It seems to hang forever once it gets stuck in this case unless
I reset the chip.

I'm not sure if it's related, but the tx rates do not drop as low
as I believe they should. These systems are using ath9k rate control,
and I plan to change that to minstrel_ht once I can get the current
config to at least recover when attenuation gets better....


[root@lf0301-demo1 ~]# cat /debug/ieee80211/wiphy0/ath9k/xmit ;sleep 10; cat /debug/ieee80211/wiphy0/ath9k/xmit ;
Num-Tx-Queues: 10 tx-queues-setup: 0x10f poll-work-seen: 1823
BE BK VI VO

MPDUs Queued: 500 0 0 4102
MPDUs Completed: 473 0 0 4102
MPDUs XRetried: 27 0 0 0
Aggregates: 200 0 0 0
AMPDUs Queued HW: 97039 0 0 0
AMPDUs Queued SW: 11299 0 0 0
AMPDUs Completed: 107982 0 0 0
AMPDUs Retried: 410 0 0 0
AMPDUs XRetried: 116 0 0 0
TXERR Filtered: 224 0 0 0
FIFO Underrun: 0 0 0 0
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 0 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 0 0 0 0
TX-Pkts-All: 108598 0 0 4102
TX-Bytes-All: 164436206 0 0 304838
hw-put-tx-buf: 107863 0 0 4096
hw-tx-start: 0 0 0 0
hw-tx-proc-desc: 0 0 0 0
TX-Failed: 0 0 0 0
txq-memory-address: f2b724e8 f2b725a4 f2b7242c f2b72370
axq-qnum: 2 3 1 0
axq-depth: 0 0 0 0
axq-ampdu_depth: 0 0 0 0
axq-stopped 1 0 0 0
tx-in-progress 0 0 0 0
pending-frames 123 0 0 0
txq_headidx: 7 0 0 0
txq_tailidx: 7 0 0 0
axq_q empty: 1 1 1 1
axq_acq empty: 1 1 1 1
txq_fifo[0] empty: 1 1 1 1
txq_fifo[1] empty: 1 1 1 1
txq_fifo[2] empty: 1 1 1 1
txq_fifo[3] empty: 1 1 1 1
txq_fifo[4] empty: 1 1 1 1
txq_fifo[5] empty: 1 1 1 1
txq_fifo[6] empty: 1 1 1 1
txq_fifo[7] empty: 1 1 1 1
Num-Tx-Queues: 10 tx-queues-setup: 0x10f poll-work-seen: 1833
BE BK VI VO

MPDUs Queued: 500 0 0 4140
MPDUs Completed: 473 0 0 4140
MPDUs XRetried: 27 0 0 0
Aggregates: 200 0 0 0
AMPDUs Queued HW: 97039 0 0 0
AMPDUs Queued SW: 11299 0 0 0
AMPDUs Completed: 107982 0 0 0
AMPDUs Retried: 410 0 0 0
AMPDUs XRetried: 116 0 0 0
TXERR Filtered: 224 0 0 0
FIFO Underrun: 0 0 0 0
TXOP Exceeded: 0 0 0 0
TXTIMER Expiry: 0 0 0 0
DESC CFG Error: 0 0 0 0
DATA Underrun: 0 0 0 0
DELIM Underrun: 0 0 0 0
TX-Pkts-All: 108598 0 0 4140
TX-Bytes-All: 164436206 0 0 306788
hw-put-tx-buf: 107863 0 0 4134
hw-tx-start: 0 0 0 0
hw-tx-proc-desc: 0 0 0 0
TX-Failed: 0 0 0 0
txq-memory-address: f2b724e8 f2b725a4 f2b7242c f2b72370
axq-qnum: 2 3 1 0
axq-depth: 0 0 0 0
axq-ampdu_depth: 0 0 0 0
axq-stopped 1 0 0 0
tx-in-progress 0 0 0 0
pending-frames 123 0 0 0
txq_headidx: 7 0 0 6
txq_tailidx: 7 0 0 6
axq_q empty: 1 1 1 1
axq_acq empty: 1 1 1 1
txq_fifo[0] empty: 1 1 1 1
txq_fifo[1] empty: 1 1 1 1
txq_fifo[2] empty: 1 1 1 1
txq_fifo[3] empty: 1 1 1 1
txq_fifo[4] empty: 1 1 1 1
txq_fifo[5] empty: 1 1 1 1
txq_fifo[6] empty: 1 1 1 1
txq_fifo[7] empty: 1 1 1 1

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-02-21 17:35:02

by Ben Greear

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

On 02/20/2013 09:41 PM, Ben Greear wrote:
> On 02/20/2013 05:38 PM, Sujith Manoharan wrote:
>> Ben Greear wrote:
>>> For instance, in this case, why do we have pending frames, the axq-stopped,
>>> and no axq depth? Is that an invalid state to begin with? Once
>>> it gets in the hung state, those numbers never change. I'd assume
>>> something should be poking more packets out of the pending frames
>>> down into the axq logic?
>>
>> Something is broken in the xmit path, definitely.
>
> Ok, so here's a question: In the ath_tx_complete method,
> the pending_frames counter is only decremented if txq == c->tx.txq_map[q].
>
> Maybe it should always be decremented?
>
> What kinds of things could cause txq to not equal the txq-map[q]?

I put in debugging code to check for that...and I can still reproduce
the hang without ever failing the txq == ct->tx.txq_map[q]
test..so problem is elsewhere it seems...

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-02-21 23:56:07

by Adrian Chadd

[permalink] [raw]
Subject: Re: 3.7.6+: ath9k: tx logic locks up after taking attenuation very high.

Right. As you increase attenuation, two things happen:

* You're going to see lots more filtered frames pop up; which may
involve pause/unpause of the queue;
* You're going to see BAR frames go out, which will involve
pausing/unpausing the queue.

So you may have found some very nice way to reproduce pause/unpause
related problems that are brought on by lots of TX failures leading to
BARs and filtered frames.

(Having your BAR frames be filtered .. god I hope not.)

So see whether there's some odd scenario where a queue is unpaused and
things aren't drained from the pending queue?



Adrian