Return-path:
Received: from mail.candelatech.com ([208.74.158.172]:46956 "EHLO
	ns3.lanforge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755691Ab0JNVba (ORCPT );
	Thu, 14 Oct 2010 17:31:30 -0400
Message-ID: <4CB776AF.6090504@candelatech.com>
Date: Thu, 14 Oct 2010 14:31:27 -0700
From: Ben Greear
MIME-Version: 1.0
To: "Luis R. Rodriguez"
CC: "linux-wireless@vger.kernel.org"
Subject: Re: memory clobber in rx path, maybe related to ath9k.
References: <4CAB59B2.5050106@candelatech.com>
	<4CAE13F6.2010003@candelatech.com>
	<4CAE1DFB.303@candelatech.com>
	<1286479642.20974.32.camel@jlt3.sipsolutions.net>
	<4CB378CD.1080800@candelatech.com>
	<4CB3D598.7050904@candelatech.com>
	<4CB4AA89.1070009@candelatech.com>
	<4CB5E885.7090305@candelatech.com>
	<4CB5F0D4.4020907@candelatech.com>
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed
Sender: linux-wireless-owner@vger.kernel.org
List-ID:

On 10/14/2010 02:25 PM, Luis R. Rodriguez wrote:
> On Wed, Oct 13, 2010 at 10:48 AM, Ben Greear wrote:
>> On 10/13/2010 10:29 AM, Luis R. Rodriguez wrote:
>>>
>>> On Wed, Oct 13, 2010 at 10:12 AM, Ben Greear wrote:
>>>>
>>>> On 10/12/2010 11:40 AM, Luis R. Rodriguez wrote:
>>>>>
>>>>> On Tue, Oct 12, 2010 at 11:35 AM, Ben Greear wrote:
>>>>>>
>>>>>> On 10/11/2010 11:10 PM, Luis R. Rodriguez wrote:
>>>>>>>
>>>>>>> On Mon, Oct 11, 2010 at 8:27 PM, Ben Greear wrote:
>>>>>>
>>>>>>>> Another thing I was thinking about: Maybe the queue of skbs and
>>>>>>>> dma addresses in ath9k is getting corrupted by multiple VIFs
>>>>>>>> trying to write at once? Maybe some locking is needed in the
>>>>>>>> xmit path?
>>>>>>>
>>>>>>> That was my second hunch. My first shot was to use spin_lock_irqsave()
>>>>>>> over the uses of the rxbuf list and that seemed to help but I
>>>>>>> still managed to get a poison eventually. My next item to check for is
>>>>>>> of the permissibility of creating too much pressure to the point we
>>>>>>> end up looping over the rxbuf list and race against mac80211 free'ing
>>>>>>> a buffer. Will test that tomorrow if nothing else comes up creeping my
>>>>>>> priority queue.
>>>>>>
>>>>>> This code looks weird to me. One of the paprd branches
>>>>>> deletes the skb, the other doesn't appear to. Neither
>>>>>> null out bf->bf_mpdu, which would appear to leave a dangling
>>>>>> pointer in at least the dev_kfree_skb_any() branch.
>>>>>>
>>>>>> ath_tx_complete frees its skb in all cases, so another
>>>>>> bf->bf_mpdu dangling pointer issue.
>>>>>>
>>>>>> Maybe at the least we should null out bf->bf_mpdu when
>>>>>> skb is consumed?
>>>>>
>>>>> You're reading my mind, that was what I was going to test today. Still
>>>>> doing e-mail sweep though.
>>>>
>>>> At least in the xmit path, it seems cards that have EDMA support do
>>>> things a bit different. Out of curiosity, on the system(s), you
>>>> reproduce this, are any of yours supporting EDMA? Mine appear to
>>>> not support EDMA.
>>>
>>> EDMA is used on >= AR9003 families by Atheros. And no, I am not
>>> testing with an EDMA card, I am testing with an AR9002 family card,
>>> the AR9280 card. I am going to disregard the TX stuff as the bug is an
>>> RX issue :) I was able to more easily reproduce by doing an skb_copy()
>>> and free'ing the buffer right afterwards on the ath_send_to_mac80211()
>>> thingy. So it does appear that the poison check just happens more
>>> often when we do an skb_copy(). One reason this is easy to reproduce
>>> with multiple STAs is mac80211 uses skb_copy() to process each
>>> received skb for each STA.
>>>
>>> In my tests so far, protecting the rxbuf list with spin_lock_irqsave()
>>> did not help, and the wmb(); didn't either, something else is going on
>>> here. It would be nice to hack slab to keep an entire trace of the
>>> place the buffer was last free'd at instead of just the caller that
>>> freed it.
>>
>> I instrumented slub a while back and got the backtrace. It
>> was always in the same place for my testing.
>>
>> Here's the slub patch if you are interested in using it yourself:
>> https://patchwork.kernel.org/patch/236921/
>
> when compiling this patch I get:
>
> arch/x86/built-in.o: In function `store_stack':
> /home/mcgrof/wireless-testing/arch/x86/kernel/dumpstack.c:259:
> undefined reference to `store_trace'

You are compiling on a 32-bit system? I see the method in the patch, but
probably only for 32-bit x86...

Ben

>
> Luis

-- 
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
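
For readers following the thread, the "null out bf->bf_mpdu when skb is consumed" suggestion quoted above can be illustrated with a small user-space sketch. This is not ath9k code: struct fake_buf and the fake_buf_* helpers are hypothetical names invented here, and free() only stands in for dev_kfree_skb_any(). The sketch assumes nothing about the real driver beyond what the thread says; it shows only that clearing the owner's pointer at the point of release makes a later stale reference show up as NULL instead of as a pointer into freed memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a driver buffer descriptor that owns a packet. */
struct fake_buf {
    unsigned char *mpdu;   /* plays the role of bf->bf_mpdu */
    size_t len;
};

/* Allocate a payload and attach it; returns 0 on success, -1 on failure. */
static int fake_buf_attach(struct fake_buf *bf, size_t len)
{
    assert(bf != NULL);
    bf->mpdu = malloc(len);
    if (bf->mpdu == NULL)
        return -1;
    memset(bf->mpdu, 0xa5, len);   /* arbitrary fill pattern */
    bf->len = len;
    return 0;
}

/* Release the payload and clear the descriptor so no dangling pointer remains. */
static void fake_buf_consume(struct fake_buf *bf)
{
    assert(bf != NULL);
    free(bf->mpdu);      /* stand-in for dev_kfree_skb_any() */
    bf->mpdu = NULL;     /* the step the thread proposes adding */
    bf->len = 0;
}

/* Returns 1 while the descriptor still owns a payload, 0 otherwise. */
static int fake_buf_in_use(const struct fake_buf *bf)
{
    return bf->mpdu != NULL;
}
```

With the pointer cleared, calling fake_buf_consume() a second time is harmless because free(NULL) is a no-op, and any code path that checks fake_buf_in_use() before touching the payload fails safely. Whether this has any bearing on the RX poison discussed in the thread is not established here.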