2010-11-01 15:17:24

by Björn Smedman

Subject: ath9k: race conditions in dma

Hi all,

I have an application that creates and destroys a lot of ap vifs and
does a lot of monitor frame injection. The recent ath9k rx locking
fixes have helped with stability in this use-case but there still
seems to be some tx/beacon related race condition(s). These manifest
themselves as follows on an AR913x based router running
compat-wireless-2010-10-19 (with locking fixes etc from openwrt):

1. TX DMA hangs under simultaneous high RX and TX load

This can happen within minutes but only seems to happen if there is
high load on both RX and TX. These hangs take several seconds to fully
recover from and seem more severe than the usual ones we used to see
before the pcu locking fixes. Log output looks like this:

Jan 1 00:08:47 user.debug kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Jan 1 00:08:47 user.debug kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Jan 1 00:08:47 user.debug kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Jan 1 00:08:47 user.debug kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Jan 1 00:08:47 user.debug kernel: ath: Failed to stop TX DMA. Resetting hardware!
Jan 1 00:08:47 user.debug kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Jan 1 00:08:47 user.debug kernel: ath: ah->misc_mode 0xc
Jan 1 00:08:47 user.debug kernel: ath: Setting CFG 0x10a
Jan 1 00:08:47 user.debug kernel: ath: ah->misc_mode 0xc
Jan 1 00:08:47 user.debug kernel: ath: Setting CFG 0x10a

Also note that in my use-case more processing is done in
ieee80211_rx() and ieee80211_tx_status() than is perhaps normal.

2. TX is completely hung but chip is never reset

Another similar failure mode under the same conditions as above (when
TX and RX load is high) is that the TX pipeline is somehow hung
(nothing coming out on radio) but there is no log output to suggest
that anything is seriously wrong. My guess here is that the tx queue
might be stopped but I have not been able to verify that.

3. Interrupts completely stop coming

The last failure mode happens when the driver is not RX/TX loaded but
instead left running for a longer period of time (about 12 hours is
enough in most cases but 48 hours basically always does the trick).
The system is fine but it seems ath9k is not receiving any interrupts
("cat /sys/kernel/debug/ath9k/phy0/interrupts" produces the same
result over and over again). If full debug is enabled ("echo 0xffff >
/sys/kernel/debug/ath9k/phy0/debug") only shortcal and longcal related
prints appear (driven by a timer). Bringing the interface down and
then up again with ifconfig does not bring it back to life, but
restarting hostapd does.
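
The "same result over and over again" check can be automated. Below is a minimal userspace sketch in C, assuming the debugfs path quoted above; the sketch_* helpers are hypothetical names made up for illustration and are not part of ath9k or any library.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Read up to len - 1 bytes of the file at path into buf and
 * NUL-terminate it. Returns the number of bytes read, or -1 if
 * the file cannot be opened or len is 0.
 */
long sketch_snapshot(const char *path, char *buf, size_t len)
{
	FILE *f;
	size_t n;

	if (len == 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, len - 1, f);
	buf[n] = '\0';
	fclose(f);
	return (long)n;
}

/*
 * Two snapshots taken some time apart that are byte-for-byte
 * identical suggest that no interrupt counter has advanced.
 */
bool sketch_interrupts_stalled(const char *before, const char *after)
{
	return strcmp(before, after) == 0;
}
```

A caller would take one snapshot of /sys/kernel/debug/ath9k/phy0/interrupts, sleep for a few seconds, take a second one, and log a warning when sketch_interrupts_stalled() returns true.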

Help in tracking these down would be much appreciated. I will follow
up below with some thoughts on contributing factors.

/Björn


2010-11-01 15:43:49

by Ben Gamari

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

On Mon, 1 Nov 2010 16:17:23 +0100, Björn Smedman <[email protected]> wrote:
> Hi all,
>
> I have an application that creates and destroys a lot of ap vifs and
> does a lot of monitor frame injection. The recent ath9k rx locking
> fixes have helped with stability in this use-case but there still
> seems to be some tx/beacon related race condition(s). These manifest
> themselves as follows on an AR913x based router running
> compat-wireless-2010-10-19 (with locking fixes etc from openwrt):
>
> 1. TX DMA hangs under simultaneous high RX and TX load
> 2. TX is completely hung but chip is never reset

I have also observed both of these behaviors with just a standard
hostapd single VIF configuration. Quite annoying. It seems to be better
with recent wireless-testing trees.

- Ben

2010-11-03 17:47:45

by Ben Gamari

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

On Tue, 2 Nov 2010 17:55:22 +0100, Björn Smedman <[email protected]> wrote:
> Ben, if you can easily trigger these problems on wireless-testing,
> could you test with my patch and see if it helps? I'm especially
> interested to see if it really fixes problem 1.
>
The only time I've been able to reproduce the issue with
wireless-testing is when using my work laptop. I'll bring it home
tonight and see if your patch makes any difference. Thanks,

- Ben


2010-11-01 15:50:50

by Björn Smedman

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

On Mon, Nov 1, 2010 at 4:43 PM, Ben Gamari <[email protected]> wrote:
> On Mon, 1 Nov 2010 16:17:23 +0100, Björn Smedman <[email protected]> wrote:
>> Hi all,
>>
>> I have an application that creates and destroys a lot of ap vifs and
>> does a lot of monitor frame injection. The recent ath9k rx locking
>> fixes have helped with stability in this use-case but there still
>> seems to be some tx/beacon related race condition(s). These manifest
>> themselves as follows on an AR913x based router running
>> compat-wireless-2010-10-19 (with locking fixes etc from openwrt):
>>
>> 1. TX DMA hangs under simultaneous high RX and TX load
>> 2. TX is completely hung but chip is never reset
>
> I have also observed both of these behaviors with just a standard
> hostapd single VIF configuration. Quite annoying. It seems to be better
> with recent wireless-testing trees.
>
> - Ben

Thanx Ben, it's a relief to know I'm not the only one suffering from this.

Unfortunately I can't run wireless-testing (built-in system with
out-of-tree arch). Could this be fixed in later compat-wireless
snapshots? Can you recommend a specific snapshot?

/Björn

2010-11-01 17:12:36

by Björn Smedman

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

2010/11/1 Felix Fietkau <[email protected]>
> > My TX PCU patches for ath9k are not merged yet, try those or wait
> > until John merges them.
> They are merged in OpenWrt. Björn, which OpenWrt revision did you use in
> your tests?
>
> - Felix

I'm based on openwrt/trunk@23720 when I run code. But when I read code
I'm looking at the latest wireless-testing (and trying to keep track
of pending patches on linux-wireless). I will apply the TX PCU patch
and see if that changes my bad fuzzy feeling.

/Björn

2010-11-03 16:41:06

by Björn Smedman

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

2010/11/2 Björn Smedman <[email protected]>:
> On Mon, Nov 1, 2010 at 4:43 PM, Ben Gamari <[email protected]> wrote:
>> On Mon, 1 Nov 2010 16:17:23 +0100, Björn Smedman <[email protected]> wrote:
>>> Hi all,
>>>
>>> I have an application that creates and destroys a lot of ap vifs and
>>> does a lot of monitor frame injection. The recent ath9k rx locking
>>> fixes have helped with stability in this use-case but there still
>>> seems to be some tx/beacon related race condition(s). These manifest
>>> themselves as follows on an AR913x based router running
>>> compat-wireless-2010-10-19 (with locking fixes etc from openwrt):
>>>
>>> 1. TX DMA hangs under simultaneous high RX and TX load
>>> 2. TX is completely hung but chip is never reset
>>
>> I have also observed both of these behaviors with just a standard
>> hostapd single VIF configuration. Quite annoying. It seems to be better
>> with recent wireless-testing trees.
>>
>> - Ben
>
> I just posted "[RFC] ath9k: fix tx queue selection" with a patch that
> fixes (or at least reduces) these two for me. I'm not sure it is the
> whole story but at least in theory 1 could be caused by locking one tx
> queue and actually transmitting on another. 2 is probably caused by
> stopping one mac80211 queue and then starting another.

Problem 1 is still there. After 5-15 hours of varying rx/tx frame
injection load something like this happens and the chip goes
deaf/mute:

Jan 1 00:18:33 user.debug kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x40000020
Jan 1 00:18:33 user.debug kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Jan 1 00:18:33 user.debug kernel: ath: ah->misc_mode 0xc
Jan 1 00:18:33 user.debug kernel: ath: Setting CFG 0x10a
Jan 1 00:18:43 user.debug kernel: ath: Timeout while waiting for nf to load: AR_PHY_AGC_CONTROL=0x40d22
Jan 1 00:18:44 user.debug kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x40000020
Jan 1 00:18:44 user.debug kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Jan 1 00:18:44 user.debug kernel: ath: ah->misc_mode 0xc
Jan 1 00:18:44 user.debug kernel: ath: Setting CFG 0x10a

Problem 2 seems gone though.

/Björn

2010-11-01 16:28:35

by Björn Smedman

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

On Mon, Nov 1, 2010 at 4:43 PM, Ben Gamari <[email protected]> wrote:
> On Mon, 1 Nov 2010 16:17:23 +0100, Björn Smedman <[email protected]> wrote:
>> Hi all,
>>
>> I have an application that creates and destroys a lot of ap vifs and
>> does a lot of monitor frame injection. The recent ath9k rx locking
>> fixes have helped with stability in this use-case but there still
>> seems to be some tx/beacon related race condition(s). These manifest
>> themselves as follows on an AR913x based router running
>> compat-wireless-2010-10-19 (with locking fixes etc from openwrt):
>>
>> 1. TX DMA hangs under simultaneous high RX and TX load
>> 2. TX is completely hung but chip is never reset
>
> I have also observed both of these behaviors with just a standard
> hostapd single VIF configuration. Quite annoying. It seems to be better
> with recent wireless-testing trees.
>
> - Ben

Looking at the code here is the first passage that triggers a bad
fuzzy feeling for me (beacon.c):

	skb = ieee80211_get_buffered_bc(hw, vif);

	/*
	 * if the CABQ traffic from previous DTIM is pending and the current
	 * beacon is also a DTIM.
	 * 1) if there is only one vif let the cab traffic continue.
	 * 2) if there are more than one vif and we are using staggered
	 *    beacons, then drain the cabq by dropping all the frames in
	 *    the cabq so that the current vifs cab traffic can be scheduled.
	 */
	spin_lock_bh(&cabq->axq_lock);
	cabq_depth = cabq->axq_depth;
	spin_unlock_bh(&cabq->axq_lock);

	if (skb && cabq_depth) {
		if (sc->nvifs > 1) {
			ath_print(common, ATH_DBG_BEACON,
				  "Flushing previous cabq traffic\n");
			ath_draintxq(sc, cabq, false);
		}
	}

	ath_beacon_setup(sc, avp, bf, info->control.rates[0].idx);

	while (skb) {
		ath_tx_cabq(hw, skb);
		skb = ieee80211_get_buffered_bc(hw, vif);
	}

From what I can tell there is no guarantee that CABQ TX DMA is stopped
when ath_draintxq() is called. From ath_draintxq()'s point of view that
looks like a bad idea (race between CPU and DMA).

Also, the locking around "cabq_depth = cabq->axq_depth;" looks very
peculiar. I believe it's correct (because nobody else puts anything
into this queue and we don't care if it's shorter later on when we
drain it) but I think it deserves a comment.
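
A minimal userspace sketch of the two concerns above, assuming a pthread mutex as a stand-in for axq_lock and a plain flag for "TX DMA may still be running". All sketch_* names are hypothetical and not part of ath9k; this only illustrates the ordering, it is not the driver's code or a proposed patch.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a tx queue; not the real struct ath_txq. */
struct sketch_txq {
	pthread_mutex_t lock;	/* plays the role of axq_lock */
	unsigned int depth;	/* plays the role of axq_depth */
	bool dma_active;	/* assumption: true while hardware may touch the queue */
};

void sketch_txq_init(struct sketch_txq *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->depth = 0;
	q->dma_active = false;
}

/*
 * Read the depth under the lock. The value may be stale as soon as
 * the lock is dropped; that is only acceptable if nobody else adds
 * frames to this queue and a shorter queue at drain time is harmless.
 */
unsigned int sketch_txq_depth_snapshot(struct sketch_txq *q)
{
	unsigned int depth;

	pthread_mutex_lock(&q->lock);
	depth = q->depth;
	pthread_mutex_unlock(&q->lock);
	return depth;
}

/*
 * Drain only once DMA is known to be stopped. Returns 0 on success,
 * or -1 if DMA is still active, in which case the queue is untouched.
 */
int sketch_txq_drain(struct sketch_txq *q)
{
	int ret = -1;

	pthread_mutex_lock(&q->lock);
	if (!q->dma_active) {
		q->depth = 0;
		ret = 0;
	}
	pthread_mutex_unlock(&q->lock);
	return ret;
}
```

In this model a caller would stop DMA first, clear dma_active, and only then drain, so the CPU never walks a queue that the hardware may still be processing.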

Any thoughts? I can whip up and test a patch if there are no objections.

/Björn

2010-11-02 16:55:23

by Björn Smedman

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

On Mon, Nov 1, 2010 at 4:43 PM, Ben Gamari <[email protected]> wrote:
> On Mon, 1 Nov 2010 16:17:23 +0100, Björn Smedman <[email protected]> wrote:
>> Hi all,
>>
>> I have an application that creates and destroys a lot of ap vifs and
>> does a lot of monitor frame injection. The recent ath9k rx locking
>> fixes have helped with stability in this use-case but there still
>> seems to be some tx/beacon related race condition(s). These manifest
>> themselves as follows on an AR913x based router running
>> compat-wireless-2010-10-19 (with locking fixes etc from openwrt):
>>
>> 1. TX DMA hangs under simultaneous high RX and TX load
>> 2. TX is completely hung but chip is never reset
>
> I have also observed both of these behaviors with just a standard
> hostapd single VIF configuration. Quite annoying. It seems to be better
> with recent wireless-testing trees.
>
> - Ben

I just posted "[RFC] ath9k: fix tx queue selection" with a patch that
fixes (or at least reduces) these two for me. I'm not sure it is the
whole story but at least in theory 1 could be caused by locking one tx
queue and actually transmitting on another. 2 is probably caused by
stopping one mac80211 queue and then starting another.
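
The suspected mechanism behind problem 2 can be sketched in plain C, assuming a simplified model in which each queue has a pending-frame count and a stopped flag. The sketch_* names and the depth limit are hypothetical; this is a model of the mismatch, not the mac80211 or ath9k API.

```c
#include <stdbool.h>

#define SKETCH_NUM_QUEUES	4
#define SKETCH_MAX_PENDING	2	/* hypothetical depth limit */

struct sketch_queues {
	unsigned int pending[SKETCH_NUM_QUEUES];
	bool stopped[SKETCH_NUM_QUEUES];
};

/* Queue a frame on queue q and stop q once it is full. */
void sketch_tx(struct sketch_queues *s, unsigned int q)
{
	s->pending[q]++;
	if (s->pending[q] >= SKETCH_MAX_PENDING)
		s->stopped[q] = true;
}

/*
 * Complete a frame that was sent on sent_q and wake wake_q if sent_q
 * has room again. Correct callers pass the same index twice; passing
 * different indices models "stopping one queue and starting another".
 */
void sketch_tx_complete(struct sketch_queues *s, unsigned int sent_q,
			unsigned int wake_q)
{
	if (s->pending[sent_q] > 0)
		s->pending[sent_q]--;
	if (s->pending[sent_q] < SKETCH_MAX_PENDING)
		s->stopped[wake_q] = false;
}
```

With mismatched indices the full queue is never woken even though it has drained, which from the outside would look like a TX path that is hung without any error being logged.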

Ben, if you can easily trigger these problems on wireless-testing,
could you test with my patch and see if it helps? I'm especially
interested to see if it really fixes problem 1.

/Björn

2010-11-01 16:44:37

by Luis R. Rodriguez

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

2010/11/1 Björn Smedman <[email protected]>:
> On Mon, Nov 1, 2010 at 4:43 PM, Ben Gamari <[email protected]> wrote:
>> On Mon, 1 Nov 2010 16:17:23 +0100, Björn Smedman <[email protected]> wrote:
>>> Hi all,
>>>
>>> I have an application that creates and destroys a lot of ap vifs and
>>> does a lot of monitor frame injection. The recent ath9k rx locking
>>> fixes have helped with stability in this use-case but there still
>>> seems to be some tx/beacon related race condition(s). These manifest
>>> themselves as follows on an AR913x based router running
>>> compat-wireless-2010-10-19 (with locking fixes etc from openwrt):
>>>
>>> 1. TX DMA hangs under simultaneous high RX and TX load
>>> 2. TX is completely hung but chip is never reset
>>
>> I have also observed both of these behaviors with just a standard
>> hostapd single VIF configuration. Quite annoying. It seems to be better
>> with recent wireless-testing trees.
>>
>> - Ben
>
> The next thing that looks racy to me is ath_beacon_alloc() vs
> ath_beacon_tasklet() in beacon.c. Beacon queue TX DMA is always
> stopped in main.c before calling ath_beacon_alloc() but
> ath_beacon_tasklet() is scheduled when we get an SWBA interrupt. My
> guess is that these keep coming even if we stop TX DMA on the beacon
> queue, no?

My TX PCU patches for ath9k are not merged yet, try those or wait
until John merges them.

Luis

2010-11-01 23:12:22

by Peter Stuge

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

Björn Smedman wrote:
> >> 1. TX DMA hangs under simultaneous high RX and TX load
> >> 2. TX is completely hung but chip is never reset
> >
> > I have also observed both of these behaviors with just a standard
> > hostapd single VIF configuration.
>
> Thanx Ben, it's a relief to know I'm not the only one suffering
> from this.

Just a note to confirm that I have also seen many different failures
related to this. The lasting impression is that it's a big mess.

I bought my first ath9k hardware roughly a year ago. It was totally
useless as a STA until the kernels of a few months ago. I am now using
an AR9280 card, and for the very first time the ath9k hardware and
driver are actually working at all for me, but there are still issues,
as I noted in the other email.

Unfortunately they're the kind of issues that can't be debugged much
without strong knowledge of device internals.


//Peter

2010-11-01 16:53:06

by Felix Fietkau

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

On 2010-11-01 5:44 PM, Luis R. Rodriguez wrote:
> 2010/11/1 Björn Smedman <[email protected]>:
>> On Mon, Nov 1, 2010 at 4:43 PM, Ben Gamari <[email protected]> wrote:
>>> On Mon, 1 Nov 2010 16:17:23 +0100, Björn Smedman <[email protected]> wrote:
>>>> Hi all,
>>>>
>>>> I have an application that creates and destroys a lot of ap vifs and
>>>> does a lot of monitor frame injection. The recent ath9k rx locking
>>>> fixes have helped with stability in this use-case but there still
>>>> seems to be some tx/beacon related race condition(s). These manifest
>>>> themselves as follows on an AR913x based router running
>>>> compat-wireless-2010-10-19 (with locking fixes etc from openwrt):
>>>>
>>>> 1. TX DMA hangs under simultaneous high RX and TX load
>>>> 2. TX is completely hung but chip is never reset
>>>
>>> I have also observed both of these behaviors with just a standard
>>> hostapd single VIF configuration. Quite annoying. It seems to be better
>>> with recent wireless-testing trees.
>>>
>>> - Ben
>>
>> The next thing that looks racy to me is ath_beacon_alloc() vs
>> ath_beacon_tasklet() in beacon.c. Beacon queue TX DMA is always
>> stopped in main.c before calling ath_beacon_alloc() but
>> ath_beacon_tasklet() is scheduled when we get an SWBA interrupt. My
>> guess is that these keep coming even if we stop TX DMA on the beacon
>> queue, no?
>
> My TX PCU patches for ath9k are not merged yet, try those or wait
> until John merges them.
They are merged in OpenWrt. Björn, which OpenWrt revision did you use in
your tests?

- Felix

2010-11-01 16:39:50

by Björn Smedman

Subject: Re: [ath9k-devel] ath9k: race conditions in dma

On Mon, Nov 1, 2010 at 4:43 PM, Ben Gamari <[email protected]> wrote:
> On Mon, 1 Nov 2010 16:17:23 +0100, Björn Smedman <[email protected]> wrote:
>> Hi all,
>>
>> I have an application that creates and destroys a lot of ap vifs and
>> does a lot of monitor frame injection. The recent ath9k rx locking
>> fixes have helped with stability in this use-case but there still
>> seems to be some tx/beacon related race condition(s). These manifest
>> themselves as follows on an AR913x based router running
>> compat-wireless-2010-10-19 (with locking fixes etc from openwrt):
>>
>> 1. TX DMA hangs under simultaneous high RX and TX load
>> 2. TX is completely hung but chip is never reset
>
> I have also observed both of these behaviors with just a standard
> hostapd single VIF configuration. Quite annoying. It seems to be better
> with recent wireless-testing trees.
>
> - Ben

The next thing that looks racy to me is ath_beacon_alloc() vs
ath_beacon_tasklet() in beacon.c. Beacon queue TX DMA is always
stopped in main.c before calling ath_beacon_alloc() but
ath_beacon_tasklet() is scheduled when we get an SWBA interrupt. My
guess is that these keep coming even if we stop TX DMA on the beacon
queue, no?
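
If SWBA interrupts do keep arriving, one way to make the two paths safe against each other is to have both take the same lock around the beacon buffer. The following is a minimal userspace sketch under that assumption, with a pthread mutex standing in for a kernel lock. The sketch_* names are hypothetical, and this is not how ath9k is known to handle it.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical beacon slot; not a real ath9k structure. */
struct sketch_beacon_slot {
	pthread_mutex_t lock;
	const char *frame;	/* NULL while no beacon is installed */
	unsigned int sent;	/* beacons handed to "hardware" so far */
};

void sketch_beacon_slot_init(struct sketch_beacon_slot *slot)
{
	pthread_mutex_init(&slot->lock, NULL);
	slot->frame = NULL;
	slot->sent = 0;
}

/* Models the allocation path: swap the beacon frame under the lock. */
void sketch_beacon_alloc(struct sketch_beacon_slot *slot,
			 const char *new_frame)
{
	pthread_mutex_lock(&slot->lock);
	slot->frame = NULL;		/* the old frame would be freed here */
	slot->frame = new_frame;
	pthread_mutex_unlock(&slot->lock);
}

/*
 * Models the tasklet path: runs on every beacon alert, whether or not
 * DMA was stopped, and skips the beacon if no frame is installed.
 * Returns true if a beacon was sent.
 */
bool sketch_beacon_tasklet(struct sketch_beacon_slot *slot)
{
	bool sent = false;

	pthread_mutex_lock(&slot->lock);
	if (slot->frame != NULL) {
		slot->sent++;
		sent = true;
	}
	pthread_mutex_unlock(&slot->lock);
	return sent;
}
```

The point of the sketch is that the tasklet never observes a half-replaced frame, regardless of whether the interrupt source was quiesced beforehand.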

/Björn