Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 11/29/2010 04:44 PM, Luis R. Rodriguez wrote:
> On Mon, Nov 29, 2010 at 04:28:51PM -0800, Ben Greear wrote:

>> BUG: unable to handle kernel NULL pointer dereference at 00000040
>> IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
>> *pde = 00000000
>> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>> last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
>> Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
>>
>> Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
>> EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
>> EIP is at ath_tx_start+0x461/0x5ef [ath9k]
>
> Please use
>
> gdb drivers/net/wireless/ath/ath9k/
> l *(ath_tx_start+0x461)
>
> Luis

I managed to hit that ath_tx_start crash again, and this time there were no obvious
DMA or irq errors immediately preceding it. So, it might be a real bug
after all. I'll add some extra checks to see if tid->ac is NULL.
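
For illustration, a fault address of 00000040 is consistent with reading a
struct member at offset 0x40 through a NULL (or near-NULL) pointer. Below is a
minimal user-space sketch of that arithmetic and of the kind of check
described above. The struct layouts and the demo_* names are hypothetical and
are not the real ath9k definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; NOT the real ath9k structs. */
struct demo_ac {
	uint8_t  pad[0x40];	/* filler so that 'sched' lands at offset 0x40 */
	uint32_t sched;
};

struct demo_tid {
	struct demo_ac *ac;
};

/* Address the CPU would touch when reading ((struct demo_ac *)base)->sched. */
uintptr_t demo_fault_address(uintptr_t base)
{
	return base + offsetof(struct demo_ac, sched);
}

/* Sketch of an extra check: returns 1 only if both tid and tid->ac are set. */
int demo_tid_usable(const struct demo_tid *tid)
{
	return tid != NULL && tid->ac != NULL;
}
```

With a NULL base, demo_fault_address(0) evaluates to 0x40, matching the shape
of the oops. Note that a check like demo_tid_usable() only catches pointers
that are exactly NULL; a bogus non-NULL pointer would pass it.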

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2010-12-05 02:41:45

by Felix Fietkau

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 2010-12-03 9:14 AM, Ben Greear wrote:
> On 12/01/2010 03:22 PM, Ben Greear wrote:
>> On 11/29/2010 04:44 PM, Luis R. Rodriguez wrote:
>>> On Mon, Nov 29, 2010 at 04:28:51PM -0800, Ben Greear wrote:
>>
>>>> BUG: unable to handle kernel NULL pointer dereference at 00000040
>>>> IP: [<f933470a>] ath_tx_start+0x461/0x5ef [ath9k]
>>>> *pde = 00000000
>>>> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>>> last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:08:01.0/irq
>>>> Modules linked in: aes_i586 aes_generic fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb ath9k mac80211 ath9k_common ath9k_hw mi]
>>>>
>>>> Pid: 38, comm: kworker/u:1 Tainted: G W 2.6.37-rc3-wl+ #53 PDSBM/PDSBM
>>>> EIP: 0060:[<f933470a>] EFLAGS: 00010246 CPU: 1
>>>> EIP is at ath_tx_start+0x461/0x5ef [ath9k]
>>>
>>> Please use
>>>
>>> gdb drivers/net/wireless/ath/ath9k/
>>> l *(ath_tx_start+0x461)
>>>
>>> Luis
>>
>> I managed to hit that ath_tx_start crash again, and this time there were no obvious
>> DMA or irq errors immediately preceding it. So, it might be a real bug
>> after all. I'll add some extra checks to see if tid->ac is NULL.
>
> I've made some small progress on this general issue.
>
> First, I added all sorts of debugging to try to figure out the ath_tx_start crash.
> As best as I can tell, 'tid' is not NULL, but also is not a valid pointer,
> and probably something close to 0x0. I've added yet more debugging, but haven't
> hit the problem again.
>
> I also tried stopping DMA in a loop up to 5 times if it failed to stop
> previously in the loop. This did not appear to help at all.
>
> I also managed to make both the ath_tx_start crash and the DMA errors very hard to reproduce
> (I dare not say fixed, yet).
>
> It appears that this small patch (and possibly, the fact that I set debugging to 0x600
> instead of 0x400) makes the problems go away. This makes me wonder if a root cause is
> something to do with repeatedly resetting the hardware too fast, as setting channels rapidly
> would tend to do that, and channels are set on association by supplicant, it appears.
Please try this patch while leaving the unnecessary resets in place.
I found that when ath_drain_all_txq finds tx dma not stopped, it will
issue a reset at a point in time where it is both useless (since it's
right before a reset anyway) and dangerous (since the rx dma engine
isn't even disabled yet), so IMHO the right thing to do is to drop
this extra reset.

--- a/drivers/net/wireless/ath/ath9k/xmit.c
+++ b/drivers/net/wireless/ath/ath9k/xmit.c
@@ -1194,18 +1194,8 @@ void ath_drain_all_txq(struct ath_softc
 		}
 	}
 
-	if (npend) {
-		int r;
-
-		ath_print(common, ATH_DBG_FATAL,
-			  "Failed to stop TX DMA. Resetting hardware!\n");
-
-		r = ath9k_hw_reset(ah, sc->sc_ah->curchan, ah->caldata, false);
-		if (r)
-			ath_print(common, ATH_DBG_FATAL,
-				  "Unable to reset hardware; reset status %d\n",
-				  r);
-	}
+	if (npend)
+		ath_print(common, ATH_DBG_FATAL, "Failed to stop TX DMA!\n");
 
 	for (i = 0; i < ATH9K_NUM_TX_QUEUES; i++) {
 		if (ATH_TXQ_SETUP(sc, i))
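
The reasoning behind the patch can be sketched as a small user-space model.
This is only an illustration under stated assumptions: the model_* names and
struct hw_model are hypothetical, and the caller order (drain TX, then stop
RX, then reset) is assumed from the description above rather than copied from
the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the hardware state; not the real struct ath_hw. */
struct hw_model {
	bool rx_dma_enabled;
	int resets;		/* total resets issued */
	int resets_with_rx;	/* resets issued while RX DMA was still running */
};

void model_hw_reset(struct hw_model *hw)
{
	hw->resets++;
	if (hw->rx_dma_enabled)
		hw->resets_with_rx++;
}

/* Drain step: npend is the number of TX queues that failed to stop. */
void model_drain_all_txq(struct hw_model *hw, int npend, bool reset_on_failure)
{
	if (npend && reset_on_failure)
		model_hw_reset(hw);	/* old behaviour: RX DMA is still on here */
}

/* Assumed caller order: drain TX, stop RX, then do the planned reset. */
struct hw_model model_channel_change(int npend, bool reset_on_failure)
{
	struct hw_model hw = { true, 0, 0 };

	model_drain_all_txq(&hw, npend, reset_on_failure);
	hw.rx_dma_enabled = false;
	model_hw_reset(&hw);
	hw.rx_dma_enabled = true;
	return hw;
}
```

In this model, model_channel_change(1, true) performs two resets, one of them
while RX DMA is still enabled, whereas model_channel_change(1, false) performs
only the single planned reset with RX DMA stopped.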

2010-12-06 20:22:33

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 12/06/2010 12:11 PM, Björn Smedman wrote:
> On Mon, Dec 6, 2010 at 8:47 PM, Ben Greear <[email protected]> wrote:
>> With 16 properly configured non-encrypted stations, running with
>> wpa-supplicant
>> with netlink driver & sharing scan results, the interfaces quickly
>> associate.
>>
>> However, I do continue to see DMA warnings such as these (I had picked up my
>> portable phone, and it knocked all the interfaces offline ..here
>> they are coming back up after I hung up the phone).
>
> Is there some theory as to why using multiple interfaces causes so many
> problems with DMA?

Seems pretty directly related to channel changes and/or resets, and exacerbated
by other interfaces sending data while another is scanning, for instance.

Other issues we've found in the past have been various races that you wouldn't
normally see with a single VIF.

Thanks,
Ben

>
> /Björn

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2010-12-06 20:11:38

by Björn Smedman

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On Mon, Dec 6, 2010 at 8:47 PM, Ben Greear <[email protected]> wrote:
> With 16 properly configured non-encrypted stations, running with
> wpa-supplicant
> with netlink driver & sharing scan results, the interfaces quickly
> associate.
>
> However, I do continue to see DMA warnings such as these (I had picked up my
> portable phone, and it knocked all the interfaces offline ..here
> they are coming back up after I hung up the phone).

Is there some theory as to why using multiple interfaces causes so many
problems with DMA?

/Björn

2010-12-06 21:00:15

by Ben Greear

Subject: Re: [ath9k-devel] Script to crash ath9k with DMA errors.

On 12/06/2010 12:42 PM, Luis R. Rodriguez wrote:
> On Mon, Dec 06, 2010 at 12:22:26PM -0800, Ben Greear wrote:
>> On 12/06/2010 12:11 PM, Björn Smedman wrote:
>>> On Mon, Dec 6, 2010 at 8:47 PM, Ben Greear <[email protected]> wrote:
>>>> With 16 properly configured non-encrypted stations, running with
>>>> wpa-supplicant
>>>> with netlink driver & sharing scan results, the interfaces quickly
>>>> associate.
>>>>
>>>> However, I do continue to see DMA warnings such as these (I had picked up my
>>>> portable phone, and it knocked all the interfaces offline ..here
>>>> they are coming back up after I hung up the phone).
>>>
>>> Is there some theory as to why using multiple interfaces causes so many
>>> problems with DMA?
>>
>> Seems pretty directly related to channel changes and/or resets, and exacerbated
>> by other interfaces sending data while another is scanning, for instance.
>>
>> Other issues we've found in the past have been various races that you wouldn't
>> normally see with a single VIF.
>
> Right, there might be some other hot path we need to lock around.
> Not sure what it could be, though; we should already be holding the
> lock while stopping RX across resets. These should all be atomic, and
> starting TX too IIRC, hence the renaming of the lock to be
> specific to the PCU. There may be other PCU changes
> we need to contend with.

Maybe the hardware/firmware guys could give us some clues as to what
types of things can cause stopping DMA to fail? Maybe that could
point us to what might be racing with the attempts to stop DMA?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com