2012-07-06 04:37:27

by Andrew Chant

Subject: v3.4.4 ath9k: kernel NULL pointer dereference in skb_dequeue during heavy udp xmit

Hello linux-wireless,
While testing ath9k -> ath9k performance on 3.4.4, I got
a nasty kernel panic. My performance testing involved filling the air
with 1410-byte UDP packets between the machines, and switching the
frequencies of the two cards to see how frequency affected
performance. I had switched between channels 36, 40, 44, and 48.
Oops was on the transmitting machine, which was acting as the AP.
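
For anyone trying to reproduce the load described above, here is a minimal
sketch of a back-to-back UDP sender. It is not the tool used in the report
(the report does not name one); the function names and the 4-byte sequence
number at the start of each payload are assumptions made for illustration.
Only the 1410-byte payload size comes from the report.

```python
import struct
import time

PAYLOAD_SIZE = 1410  # bytes of UDP payload, the size given in the report


def make_payload(seq, size=PAYLOAD_SIZE):
    """Return `size` bytes: a 4-byte big-endian sequence number, then zero padding."""
    if size < 4:
        raise ValueError("payload must be able to hold a 4-byte sequence number")
    return struct.pack("!I", seq & 0xFFFFFFFF) + bytes(size - 4)


def send_burst(sock, dest, count=None, duration=None, size=PAYLOAD_SIZE):
    """Send datagrams back to back and return how many were handed to the kernel.

    Stops after `count` datagrams or `duration` seconds, whichever comes first.
    With neither limit set, it keeps sending until interrupted.
    """
    sent = 0
    deadline = None if duration is None else time.monotonic() + duration
    while count is None or sent < count:
        if deadline is not None and time.monotonic() >= deadline:
            break
        sock.sendto(make_payload(sent, size), dest)
        sent += 1
    return sent
```

To use it, create a socket.socket(socket.AF_INET, socket.SOCK_DGRAM) on the
AP side and call send_burst(sock, (station_ip, port), duration=3600), where
station_ip and port are placeholders for the receiving machine. A single
Python sender may not saturate the link, so a dedicated traffic generator
may be needed to match the rate in the report.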

Very clear screen image of the oops is at
https://picasaweb.google.com/lh/photo/CjBdHLZH0up5PrnmCySJidMTjNZETYmyPJy0liipFm0?feat=directlink

Rough transcription:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff8125103a>] skb_dequeue+0x3a/0x58
PGD 0
Oops: 0002 [#1] SMP
CPU 4
Modules linked in: vfat fat usb_storage loop hid_microsoft usbhid
snd_hda_codec_hdmi snd_hda_codec_via i915 cfbimgblt arc4 cfbcopyarea
cfbfillarea ath9k i2c_algo_bit drm_kms_helper ath9k_common ath9k_hw
snd_hda_intel mac80211 ath snd_hda_codec snd_hwdep snd_pcm snd_timer
xhci_hcd cfg80211 drm ehci_hcd usbcore snd psmouse intel_agp atl1c
usb_common video intel_gtt i2c_core evdev crc32c_intel microcode
snd_page_alloc agpgart

Pid: 0, comm: swapper/4 Not tainted 3.4.4 #37 Gigabyte Technology Co.,
Ltd. To be filled by O.E.M./Z77-D3H
RIP: 0010:[<ffffffff8125103a>] [<ffffffff8125103a>] skb_dequeue+0x3a/0x58
RSP: blah... look at image if you care
RAX: 0000...00012 ... RCX: 0
blah blah blah

Call Trace:
test_and_clear_sta_flag+0x33/0x33 [mac80211]
ieee80211_add_pending_skbs_fn+0x81/0xf7 [mac80211]
ieee80211_sta_ps_deliver_wakeup+0x170/0x18a [mac80211]
ieee80211_rx_handlers+0x5b3/0x1685 [mac80211]
get_pageblock_migratetype+0xc/0xd
ieee80211_prepare_and_rx_handle+0x634/0x6c6 [mac80211]
ieee80211_rx+0x492/0x5a1 [mac80211]
ath_rx_tasklet+0x135/0x15a1 [ath9k]
ath9k_tasklet+0xce/0x10b [ath9k]
...blah blah blah

Code: 32 a8 07 00 48 8b 5d 00 48 39 eb 74 27 48 85 db 74 24 ff 4d 10
48 8b 0b 48 c7 03 00 00 00 00 48 8b ...
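
As a reading aid for the transcription above, the following sketch decodes
the two most informative parts of it: the hex code on the "Oops:" line
(x86 page-fault error-code bits) and the symbol+0xoffset/0xsize form used
on the IP and call-trace lines. The helper names are hypothetical and the
parser only handles the line shapes that appear in this report.

```python
import re

# x86 page-fault error-code bits, as printed in hex on the "Oops:" line.
# Each entry: (mask, meaning when set, meaning when clear or None).
PF_BITS = (
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page tables", None),
    (0x10, "instruction fetch", None),
)


def decode_oops_code(line):
    """Decode the error code from a line such as 'Oops: 0002 [#1] SMP'."""
    code = int(line.split()[1], 16)
    labels = []
    for mask, if_set, if_clear in PF_BITS:
        label = if_set if code & mask else if_clear
        if label:
            labels.append(label)
    return labels


FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"\s*(?:\[(?P<module>\w+)\])?"
)


def parse_frame(line):
    """Split 'symbol+0xoff/0xsize [module]' into its parts, or return None."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),  # bytes into the function
        "size": int(m.group("size"), 16),      # total length of the function
        "module": m.group("module"),           # None when no [module] tag is present
    }
```

Applied to this report, decode_oops_code("Oops: 0002 [#1] SMP") gives
"page not present", "write access", "kernel mode", and parse_frame on the
IP line gives symbol skb_dequeue, offset 58 and size 88 with no module.
In other words, the fault is a kernel-mode write to address 0x8, 0x3a
bytes into skb_dequeue.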


2012-07-17 15:05:58

by Andrew Chant

Subject: Re: v3.4.4 ath9k: kernel NULL pointer dereference in skb_dequeue during heavy udp xmit

It ran fine overnight. Is it worth trying to get this patch into 3.4.6, if there is one?

On Mon, Jul 16, 2012 at 8:18 PM, Mohammed Shafi
<[email protected]> wrote:
> Hi Andrew,
>
> On Tue, Jul 17, 2012 at 8:36 AM, Andrew Chant <[email protected]> wrote:
>> Thanks. That patch seems good against 3.4.4 after the first few
>> minutes - I'll leave it to run overnight.
>
> thanks, sure!
>
> --
> thanks,
> shafi

2012-07-12 06:36:15

by Andrew Chant

Subject: Re: v3.4.4 ath9k: kernel NULL pointer dereference in skb_dequeue during heavy udp xmit

Have any QCA people had a chance to take a look? This is completely
reproducible for me on 3.4.4, sometimes within a few minutes, though
occasionally it takes up to an hour. Do you QCA folks have any tests
where you continuously transmit as many UDP packets as you possibly
can to another host?

On Fri, Jul 6, 2012 at 12:46 AM, Andrew Chant <[email protected]> wrote:
> I was able to reproduce this on a boot shortly afterwards without
> changing the frequencies.
> Exact same stack trace w/ exception of slightly different values for
> RBX & R15, and R10 had 0x7f instead of 0x80. I have not been able to
> reproduce since despite trying quite hard :) I have a picture of the
> second oops if that helps.
> PCI ID is 168c:0030 (AR9300 Wireless LAN adaptor (rev 01))
> -Andrew
>
> On Fri, Jul 6, 2012 at 12:15 AM, Johannes Berg
> <[email protected]> wrote:
>> -John
>> +QCA folks
>>
>> On Thu, 2012-07-05 at 21:36 -0700, Andrew Chant wrote:
>>
>>> while performance testing ath9k -> ath9k performance in 3.4.4, I got
>>> a nasty kernel panic. My performance testing involved filling the air
>>> with 1410-byte UDP packets between the machines, and switching the
>>> frequencies of the two cards to see how frequency affected
>>> performance. I had switched between channels 36, 40, 44, and 48.
>>> Oops was on the transmitting machine, which was acting as the AP.
>>>
>>> Very clear screen image of the oops is at
>>> https://picasaweb.google.com/lh/photo/CjBdHLZH0up5PrnmCySJidMTjNZETYmyPJy0liipFm0?feat=directlink
>>
>> I briefly looked at this, but I don't see a bug in mac80211. It seems
>> likely that ath9k hands back a corrupted SKB, or frees one it no longer
>> owns, or such. The skb->next/prev pointers seem corrupted (rcx is NULL)
>> in one of the SKBs on the list, but mac80211 can't do that afaict.
>>
>> johannes
>>

2012-07-17 03:07:39

by Andrew Chant

Subject: Re: v3.4.4 ath9k: kernel NULL pointer dereference in skb_dequeue during heavy udp xmit

Thanks. That patch seems good against 3.4.4 after the first few
minutes - I'll leave it to run overnight.

On Sun, Jul 15, 2012 at 10:19 PM, Mohammed Shafi
<[email protected]> wrote:
> On Thu, Jul 12, 2012 at 12:05 PM, Andrew Chant <[email protected]> wrote:
>> Any QCA people get a chance to take a look? This is completely
>> reproducible for me on 3.4.4, sometimes within a few minutes but
>> occasionally requires up to an hour. Do you qca folks have any tests
>> where you continuously transmit as many UDP packets as you possibly
>> can to another host?
>
> please check whether the following patch helps.
> http://comments.gmane.org/gmane.linux.kernel.wireless.general/93723
> Could please help whether it happens with wireless-testing tree ?
> http://linuxwireless.org/en/developers/Documentation/git-guide#Cloning_latest_wireless-testing
>
>>
>> On Fri, Jul 6, 2012 at 12:46 AM, Andrew Chant <[email protected]> wrote:
>>> I was able to reproduce this on a boot shortly afterwards without
>>> changing the frequencies.
>>> Exact same stack trace w/ exception of slightly different values for
>>> RBX & R15, and R10 had 0x7f instead of 0x80. I have not been able to
>>> reproduce since despite trying quite hard :) I have a picture of the
>>> second oops if that helps.
>>> PCI ID is 168c:0030 (AR9300 Wireless LAN adaptor (rev 01))
>>> -Andrew
>>>
>>> On Fri, Jul 6, 2012 at 12:15 AM, Johannes Berg
>>> <[email protected]> wrote:
>>>> -John
>>>> +QCA folks
>>>>
>>>> On Thu, 2012-07-05 at 21:36 -0700, Andrew Chant wrote:
>>>>
>>>>> while performance testing ath9k -> ath9k performance in 3.4.4, I got
>>>>> a nasty kernel panic. My performance testing involved filling the air
>>>>> with 1410-byte UDP packets between the machines, and switching the
>>>>> frequencies of the two cards to see how frequency affected
>>>>> performance. I had switched between channels 36, 40, 44, and 48.
>>>>> Oops was on the transmitting machine, which was acting as the AP.
>>>>>
>>>>> Very clear screen image of the oops is at
>>>>> https://picasaweb.google.com/lh/photo/CjBdHLZH0up5PrnmCySJidMTjNZETYmyPJy0liipFm0?feat=directlink
>>>>
>>>> I briefly looked at this, but I don't see a bug in mac80211. It seems
>>>> likely that ath9k hands back a corrupted SKB, or frees one it no longer
>>>> owns, or such. The skb->next/prev pointers seem corrupted (rcx is NULL)
>>>> in one of the SKBs on the list, but mac80211 can't do that afaict.
>>>>
>>>> johannes
>>>>
>
>
>
> --
> thanks,
> shafi

2012-07-16 05:19:58

by Mohammed Shafi

Subject: Re: v3.4.4 ath9k: kernel NULL pointer dereference in skb_dequeue during heavy udp xmit

On Thu, Jul 12, 2012 at 12:05 PM, Andrew Chant <[email protected]> wrote:
> Any QCA people get a chance to take a look? This is completely
> reproducible for me on 3.4.4, sometimes within a few minutes but
> occasionally requires up to an hour. Do you qca folks have any tests
> where you continuously transmit as many UDP packets as you possibly
> can to another host?

Please check whether the following patch helps:
http://comments.gmane.org/gmane.linux.kernel.wireless.general/93723
Could you also please check whether it happens with the wireless-testing tree?
http://linuxwireless.org/en/developers/Documentation/git-guide#Cloning_latest_wireless-testing

>
> On Fri, Jul 6, 2012 at 12:46 AM, Andrew Chant <[email protected]> wrote:
>> I was able to reproduce this on a boot shortly afterwards without
>> changing the frequencies.
>> Exact same stack trace w/ exception of slightly different values for
>> RBX & R15, and R10 had 0x7f instead of 0x80. I have not been able to
>> reproduce since despite trying quite hard :) I have a picture of the
>> second oops if that helps.
>> PCI ID is 168c:0030 (AR9300 Wireless LAN adaptor (rev 01))
>> -Andrew
>>
>> On Fri, Jul 6, 2012 at 12:15 AM, Johannes Berg
>> <[email protected]> wrote:
>>> -John
>>> +QCA folks
>>>
>>> On Thu, 2012-07-05 at 21:36 -0700, Andrew Chant wrote:
>>>
>>>> while performance testing ath9k -> ath9k performance in 3.4.4, I got
>>>> a nasty kernel panic. My performance testing involved filling the air
>>>> with 1410-byte UDP packets between the machines, and switching the
>>>> frequencies of the two cards to see how frequency affected
>>>> performance. I had switched between channels 36, 40, 44, and 48.
>>>> Oops was on the transmitting machine, which was acting as the AP.
>>>>
>>>> Very clear screen image of the oops is at
>>>> https://picasaweb.google.com/lh/photo/CjBdHLZH0up5PrnmCySJidMTjNZETYmyPJy0liipFm0?feat=directlink
>>>
>>> I briefly looked at this, but I don't see a bug in mac80211. It seems
>>> likely that ath9k hands back a corrupted SKB, or frees one it no longer
>>> owns, or such. The skb->next/prev pointers seem corrupted (rcx is NULL)
>>> in one of the SKBs on the list, but mac80211 can't do that afaict.
>>>
>>> johannes
>>>



--
thanks,
shafi

2012-07-06 07:47:24

by Andrew Chant

Subject: Re: v3.4.4 ath9k: kernel NULL pointer dereference in skb_dequeue during heavy udp xmit

I was able to reproduce this on a boot shortly afterwards without
changing the frequencies.
Exact same stack trace w/ exception of slightly different values for
RBX & R15, and R10 had 0x7f instead of 0x80. I have not been able to
reproduce since despite trying quite hard :) I have a picture of the
second oops if that helps.
PCI ID is 168c:0030 (AR9300 Wireless LAN adaptor (rev 01))
-Andrew

On Fri, Jul 6, 2012 at 12:15 AM, Johannes Berg
<[email protected]> wrote:
> -John
> +QCA folks
>
> On Thu, 2012-07-05 at 21:36 -0700, Andrew Chant wrote:
>
>> while performance testing ath9k -> ath9k performance in 3.4.4, I got
>> a nasty kernel panic. My performance testing involved filling the air
>> with 1410-byte UDP packets between the machines, and switching the
>> frequencies of the two cards to see how frequency affected
>> performance. I had switched between channels 36, 40, 44, and 48.
>> Oops was on the transmitting machine, which was acting as the AP.
>>
>> Very clear screen image of the oops is at
>> https://picasaweb.google.com/lh/photo/CjBdHLZH0up5PrnmCySJidMTjNZETYmyPJy0liipFm0?feat=directlink
>
> I briefly looked at this, but I don't see a bug in mac80211. It seems
> likely that ath9k hands back a corrupted SKB, or frees one it no longer
> owns, or such. The skb->next/prev pointers seem corrupted (rcx is NULL)
> in one of the SKBs on the list, but mac80211 can't do that afaict.
>
> johannes
>

2012-07-17 03:18:13

by Mohammed Shafi

Subject: Re: v3.4.4 ath9k: kernel NULL pointer dereference in skb_dequeue during heavy udp xmit

Hi Andrew,

On Tue, Jul 17, 2012 at 8:36 AM, Andrew Chant <[email protected]> wrote:
> Thanks. That patch seems good against 3.4.4 after the first few
> minutes - I'll leave it to run overnight.

thanks, sure!

--
thanks,
shafi

2012-07-06 07:15:49

by Johannes Berg

Subject: Re: v3.4.4 ath9k: kernel NULL pointer dereference in skb_dequeue during heavy udp xmit

-John
+QCA folks

On Thu, 2012-07-05 at 21:36 -0700, Andrew Chant wrote:

> while performance testing ath9k -> ath9k performance in 3.4.4, I got
> a nasty kernel panic. My performance testing involved filling the air
> with 1410-byte UDP packets between the machines, and switching the
> frequencies of the two cards to see how frequency affected
> performance. I had switched between channels 36, 40, 44, and 48.
> Oops was on the transmitting machine, which was acting as the AP.
>
> Very clear screen image of the oops is at
> https://picasaweb.google.com/lh/photo/CjBdHLZH0up5PrnmCySJidMTjNZETYmyPJy0liipFm0?feat=directlink

I briefly looked at this, but I don't see a bug in mac80211. It seems
likely that ath9k hands back a corrupted SKB, or frees one it no longer
owns, or such. The skb->next/prev pointers seem corrupted (rcx is NULL)
in one of the SKBs on the list, but mac80211 can't do that afaict.

johannes
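
To illustrate the reasoning above, here is a small user-space model of a
circular doubly linked queue whose dequeue follows the same order of
operations as the kernel's skb unlink (read next/prev, clear them on the
removed buffer, then write next->prev and prev->next). It is only a sketch:
the class and function names are invented, the "second owner" corruption is
a hypothetical scenario and not the confirmed ath9k bug, and the offsets
assume next and prev are the first two pointers of the buffer on a 64-bit
machine.

```python
class FaultAtAddress(Exception):
    """Stands in for the kernel oops; records the address that would be written."""

    def __init__(self, address):
        super().__init__(f"NULL pointer dereference at {address:#018x}")
        self.address = address


NEXT_OFFSET = 0  # assumed layout: `next` is the first pointer in the buffer
PREV_OFFSET = 8  # assumed layout: `prev` follows it on a 64-bit machine


class Node:
    """List node; a fresh node points at itself, like an empty queue head."""

    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self


def queue_tail(head, node):
    """Link `node` in just before `head`, i.e. at the tail of the queue."""
    node.prev = head.prev
    node.next = head
    head.prev.next = node
    head.prev = node


def dequeue(head):
    """Unlink and return the first node, or None if the queue is empty."""
    node = head.next
    if node is head:
        return None
    nxt, prv = node.next, node.prev
    node.next = node.prev = None
    if nxt is None:
        # The kernel would now execute `next->prev = prev` with next == NULL,
        # which is a write to address 0 + offset of `prev`.
        raise FaultAtAddress(0 + PREV_OFFSET)
    nxt.prev = prv
    prv.next = nxt
    return node


def corrupt_like_second_owner(node):
    """Hypothetical bug: something clears the links while the node is still queued."""
    node.next = node.prev = None
```

Queueing two nodes, calling corrupt_like_second_owner on the first, and
then calling dequeue raises FaultAtAddress with address 0x8. That matches
the faulting address in the oops and is consistent with a NULL next
pointer in a buffer that is still on the list.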