2009-01-08 13:49:50

by Hugh Dickins

[permalink] [raw]
Subject: ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

Thank you for the ath5k wireless driver,
which I've just started using on an Aspire One.

When running swapping load tests, wireless alive but not in active use,
I've now twice hit the BUG_ON(bf->skb == NULL) in ath5k_tasklet_rx().

First time was with 2.6.28 plus kdb patch,
kernel BUG at drivers/net/wireless/ath5k/base.c:1675!
and poking around the messages I could see before that an
ath5k phy0: can't alloc skbuff of size 2673, after a
swapper: page allocation failure, order:0 mode:0x4020
on a call from dev_alloc_skb(), with memory summary.

Second time was with yesterday's 2.6.28-git, no kdb,
kernel BUG at drivers/net/wireless/ath5k/base.c:1683!
running ath5k_tasklet_rx() from under do_softirq(),
Kernel panic - not syncing: Fatal exception in interrupt
so I couldn't see more; but at the top of the screen, the
last three lines of a page allocation failure memory summary.

So, that BUG_ON(bf->skb == NULL) appears to be unsafe under
memory pressure; but the fix wasn't obvious to me, so over
to you!

I'd be glad to try patches, of course, but it's not happening
often enough for me to report back success quickly - unless I
stumble on a quicker way to reproduce it, it'll need a week or
two to grow confident of a fix.

Thanks,
Hugh


2009-01-08 18:41:04

by Bob Copeland

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Thu, Jan 8, 2009 at 12:55 PM, Hugh Dickins <[email protected]> wrote:
> Let me see if I can reproduce the ath5k BUG with a straightforward
> memhog, repeatedly dirtying more anon memory than RAM can provide.
> Is there something suitable I could run to exercise that wireless
> path concurrently? It was just idling when I hit the BUGs before.

Receiving any packets will do it, in your case it was probably
beacons from nearby APs if you weren't using the wifi. So iperf
or similar will work.

--
Bob Copeland %% http://www.bobcopeland.com

2009-01-13 17:45:50

by Luis R. Rodriguez

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Tue, Jan 13, 2009 at 08:40:21AM -0800, Hugh Dickins wrote:
> On Tue, 13 Jan 2009, Bob Copeland wrote:
> > On Tue, 13 Jan 2009 15:35:29 +0000 (GMT), Hugh Dickins wrote
> >
> > I'll go ahead and push this upstream then today or tomorrow;
>
> Great, thanks - should include a Cc: [email protected] I think.
>
> > let me know if you run into any problems with more testing.
>
> Sure, will do.
>
> > > > Changes-licensed-under: 3-Clause-BSD
> > >
> > > Hmm, I haven't noticed anyone doing that before: hope you're not
> > > starting a trend! I think you'll find (Documentation/SubmittingPatches)
> > > that your Signed-off-by agrees to the Developer's Certificate of Origin
> > > 1.1, which would put your patch under the same open source licence(s) as
> > > drivers/net/wireless/ath5k/base.c already contains - that's the usual
> > > case.
> >
> > I agree with all of the above, but the SFLC suggests we do it anyway.

Lets not open a can a worms here, but I just want to clarify a few things.

To be fair, as stated before, SFLC suggested we just ensure people are aware of
the licenses involved and in order to keep files under permissive licenses we
just need a way to ensure the contributors are actually submitting their changes
under the file's specified license.

We did review Documentation/SubmittingPatches but this was deemed not
sufficient as from what I recall you cannot gaurantee all contributors
would have read that and understood clearly what is implied under this
discussion.

The Changes-licensed-under tag was just my own suggestion to the problem
to help the BSD family and that did suffice the legal criteria so we went
with it.

> > http://kerneltrap.org/Linux/Wireless_Project_Suggests_Changes-licensed-under_Tag
>
> Interesting, thanks a lot for the pointer.
> I agree very much with Steve and Krzysztof.
>
> I'd be inclined to think that adding such a line in some patches
> would only tend towards making those which omit such a line more
> questionable for use under the BSD licence i.e. would weaken the
> BSD position rather than strengthening it as intended - though
> the Signed-off-by should override even that tendency.

You just need to ensure your contributors have read and understood
Documentation/SubmittingPatches.

> But it's certainly not an issue for my attention!

For ath9k we've taken a bit different approach, and I should note SFLC was
not involved with this, we tend to ensure wireless contributors and
developers have read the "Developer's Certificate of Origin 1.1" and any
new contributors get an e-mail from us asking them to read it [1]. This
also proves a bit difficult as some seasoned developers take offense to
such e-mails while some new developers speculate to our intentions and
may develop crazy conspiracy theories.

http://lkml.org/lkml/2008/11/25/289

Luis

2009-01-10 20:16:06

by Bob Copeland

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Sat, Jan 10, 2009 at 11:47:05AM -0500, Bob Copeland wrote:
> Well I got a lockup doing that, I'll try again later but anyway I see
> the bug already, read on if interested. I should have a patch shortly.

Err, of course I would get a lockup, it's a BUG_ON. This patch seems to
fix it for me.

From: Bob Copeland <[email protected]>
Date: Sat, 10 Jan 2009 14:42:54 -0500
Subject: [PATCH] ath5k: fix bf->skb==NULL panic in ath5k_tasklet_rx

Under memory pressure, we may not be able to allocate a new skb for
new packets. If the allocation fails, ath5k_tasklet_rx will exit but
will leave a buffer in the list with a NULL skb, eventually triggering
a BUG_ON.

Extract the skb allocation from ath5k_rxbuf_setup() and change the
tasklet to allocate the next skb before accepting a packet.

Changes-licensed-under: 3-Clause-BSD

Signed-off-by: Bob Copeland <[email protected]>
---
drivers/net/wireless/ath5k/base.c | 85 +++++++++++++++++++++++--------------
1 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/drivers/net/wireless/ath5k/base.c b/drivers/net/wireless/ath5k/base.c
index 2b7b7f6..1c0061a 100644
--- a/drivers/net/wireless/ath5k/base.c
+++ b/drivers/net/wireless/ath5k/base.c
@@ -1095,6 +1095,42 @@ ath5k_hw_to_driver_rix(struct ath5k_softc *sc, int hw_rix)
* Buffers setup *
\***************/

+static
+struct sk_buff *ath5k_rx_skb_alloc(struct ath5k_softc *sc, dma_addr_t *skb_addr)
+{
+ struct sk_buff *skb;
+ unsigned int off;
+
+ /*
+ * Allocate buffer with headroom_needed space for the
+ * fake physical layer header at the start.
+ */
+ skb = dev_alloc_skb(sc->rxbufsize + sc->cachelsz - 1);
+
+ if (!skb) {
+ ATH5K_ERR(sc, "can't alloc skbuff of size %u\n",
+ sc->rxbufsize + sc->cachelsz - 1);
+ return NULL;
+ }
+ /*
+ * Cache-line-align. This is important (for the
+ * 5210 at least) as not doing so causes bogus data
+ * in rx'd frames.
+ */
+ off = ((unsigned long)skb->data) % sc->cachelsz;
+ if (off != 0)
+ skb_reserve(skb, sc->cachelsz - off);
+
+ *skb_addr = pci_map_single(sc->pdev,
+ skb->data, sc->rxbufsize, PCI_DMA_FROMDEVICE);
+ if (unlikely(pci_dma_mapping_error(sc->pdev, *skb_addr))) {
+ ATH5K_ERR(sc, "%s: DMA mapping failed\n", __func__);
+ dev_kfree_skb(skb);
+ return NULL;
+ }
+ return skb;
+}
+
static int
ath5k_rxbuf_setup(struct ath5k_softc *sc, struct ath5k_buf *bf)
{
@@ -1102,37 +1138,11 @@ ath5k_rxbuf_setup(struct ath5k_softc *sc, struct ath5k_buf *bf)
struct sk_buff *skb = bf->skb;
struct ath5k_desc *ds;

- if (likely(skb == NULL)) {
- unsigned int off;
-
- /*
- * Allocate buffer with headroom_needed space for the
- * fake physical layer header at the start.
- */
- skb = dev_alloc_skb(sc->rxbufsize + sc->cachelsz - 1);
- if (unlikely(skb == NULL)) {
- ATH5K_ERR(sc, "can't alloc skbuff of size %u\n",
- sc->rxbufsize + sc->cachelsz - 1);
+ if (!skb) {
+ skb = ath5k_rx_skb_alloc(sc, &bf->skbaddr);
+ if (!skb)
return -ENOMEM;
- }
- /*
- * Cache-line-align. This is important (for the
- * 5210 at least) as not doing so causes bogus data
- * in rx'd frames.
- */
- off = ((unsigned long)skb->data) % sc->cachelsz;
- if (off != 0)
- skb_reserve(skb, sc->cachelsz - off);
-
bf->skb = skb;
- bf->skbaddr = pci_map_single(sc->pdev,
- skb->data, sc->rxbufsize, PCI_DMA_FROMDEVICE);
- if (unlikely(pci_dma_mapping_error(sc->pdev, bf->skbaddr))) {
- ATH5K_ERR(sc, "%s: DMA mapping failed\n", __func__);
- dev_kfree_skb(skb);
- bf->skb = NULL;
- return -ENOMEM;
- }
}

/*
@@ -1661,7 +1671,8 @@ ath5k_tasklet_rx(unsigned long data)
{
struct ieee80211_rx_status rxs = {};
struct ath5k_rx_status rs = {};
- struct sk_buff *skb;
+ struct sk_buff *skb, *next_skb;
+ dma_addr_t next_skb_addr;
struct ath5k_softc *sc = (void *)data;
struct ath5k_buf *bf, *bf_last;
struct ath5k_desc *ds;
@@ -1746,10 +1757,17 @@ ath5k_tasklet_rx(unsigned long data)
goto next;
}
accept:
+ next_skb = ath5k_rx_skb_alloc(sc, &next_skb_addr);
+
+ /*
+ * If we can't replace bf->skb with a new skb under memory
+ * pressure, just skip this packet
+ */
+ if (!next_skb)
+ goto next;
+
pci_unmap_single(sc->pdev, bf->skbaddr, sc->rxbufsize,
PCI_DMA_FROMDEVICE);
- bf->skb = NULL;
-
skb_put(skb, rs.rs_datalen);

/* The MAC header is padded to have 32-bit boundary if the
@@ -1822,6 +1840,9 @@ accept:
ath5k_check_ibss_tsf(sc, skb, &rxs);

__ieee80211_rx(sc->hw, skb, &rxs);
+
+ bf->skb = next_skb;
+ bf->skbaddr = next_skb_addr;
next:
list_move_tail(&bf->list, &sc->rxbuf);
} while (ath5k_rxbuf_setup(sc, bf) == 0);
--
1.6.0.6


--
Bob Copeland %% http://www.bobcopeland.com


2009-01-08 16:19:04

by Hugh Dickins

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Thu, 8 Jan 2009, Maxim Levitsky wrote:
> On Thu, 2009-01-08 at 13:49 +0000, Hugh Dickins wrote:
> >
> > When running swapping load tests, wireless alive but not in active use,
> > I've now twice hit the BUG_ON(bf->skb == NULL) in ath5k_tasklet_rx().
> >
> > So, that BUG_ON(bf->skb == NULL) appears to be unsafe under
> > memory pressure; but the fix wasn't obvious to me, so over
> > to you!
> >
> > I'd be glad to try patches, of course, but it's not happening
> > often enough for me to report back success quickly - unless I
> > stumble on a quicker way to reproduce it, it'll need a week or
> > two to grow confident of a fix.
>
> I use aspire one, and I don't see that, did this happen several times?

A couple of times, when running memory swapping loads: I doubt it
would be any problem when there's easily freeable memory around.

>
> Could you try compat-wireless:
>
> http://wireless.kernel.org/en/users/Download

Thanks, I've downloaded the tarball, but the difference between the
drivers/net/wireless/ath5k/base.c in there and in Linus's git is
very little: I'd be be very surprised if it fixes this bug.

As I said, it'll take a week or two to be confident that the issue
has been fixed, so I'd rather embark on that once a likely fix has
gone in.

(Of course, just removing the BUG_ON, and making sure there's
no oops on the NULL pointer, would fix my immediate issue:
but I doubt the right fix will be as simple as that.)

>
> It has many fixes that made my wireless on that notebook more or less
> usable.

I find 2.6.28's wireless very usable on that machine: I bet you're
a wireless specialist who stresses the wireless side of things in
ways I wouldn't begin to imagine: whereas I just browse and download
and scp a little, but as a memory specialist try swapping loads
which perhaps that driver had not been subjected to before.

Hugh

2009-01-08 17:10:12

by Bob Copeland

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Thu, Jan 8, 2009 at 11:18 AM, Hugh Dickins <[email protected]> wrote:
>> > So, that BUG_ON(bf->skb == NULL) appears to be unsafe under
>> > memory pressure; but the fix wasn't obvious to me, so over
>> > to you!

Thanks for the report... I guess the error paths haven't been tested
much when rx buf setup fails.

> (Of course, just removing the BUG_ON, and making sure there's
> no oops on the NULL pointer, would fix my immediate issue:
> but I doubt the right fix will be as simple as that.)

Are your memory load testing scripts available somewhere?

--
Bob Copeland %% http://www.bobcopeland.com

2009-01-13 16:42:05

by Hugh Dickins

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Tue, 13 Jan 2009, Bob Copeland wrote:
> On Tue, 13 Jan 2009 15:35:29 +0000 (GMT), Hugh Dickins wrote
>
> I'll go ahead and push this upstream then today or tomorrow;

Great, thanks - should include a Cc: [email protected] I think.

> let me know if you run into any problems with more testing.

Sure, will do.

> > > Changes-licensed-under: 3-Clause-BSD
> >
> > Hmm, I haven't noticed anyone doing that before: hope you're not
> > starting a trend! I think you'll find (Documentation/SubmittingPatches)
> > that your Signed-off-by agrees to the Developer's Certificate of Origin
> > 1.1, which would put your patch under the same open source licence(s) as
> > drivers/net/wireless/ath5k/base.c already contains - that's the usual
> > case.
>
> I agree with all of the above, but the SFLC suggests we do it anyway.
>
> http://kerneltrap.org/Linux/Wireless_Project_Suggests_Changes-licensed-under_Tag

Interesting, thanks a lot for the pointer.
I agree very much with Steve and Krzysztof.

I'd be inclined to think that adding such a line in some patches
would only tend towards making those which omit such a line more
questionable for use under the BSD licence i.e. would weaken the
BSD position rather than strengthening it as intended - though
the Signed-off-by should override even that tendency.

But it's certainly not an issue for my attention!

Hugh

2009-01-13 15:36:41

by Hugh Dickins

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Sat, 10 Jan 2009, Bob Copeland wrote:
> On Sat, Jan 10, 2009 at 11:47:05AM -0500, Bob Copeland wrote:
> > Well I got a lockup doing that, I'll try again later but anyway I see
> > the bug already, read on if interested. I should have a patch shortly.
>
> Err, of course I would get a lockup, it's a BUG_ON. This patch seems to
> fix it for me.
>
> From: Bob Copeland <[email protected]>
> Date: Sat, 10 Jan 2009 14:42:54 -0500
> Subject: [PATCH] ath5k: fix bf->skb==NULL panic in ath5k_tasklet_rx
>
> Under memory pressure, we may not be able to allocate a new skb for
> new packets. If the allocation fails, ath5k_tasklet_rx will exit but
> will leave a buffer in the list with a NULL skb, eventually triggering
> a BUG_ON.
>
> Extract the skb allocation from ath5k_rxbuf_setup() and change the
> tasklet to allocate the next skb before accepting a packet.

Thanks a lot, Bob, this looks like it should work.

But I haven't seen any allocation failure for several days.
I didn't get any while trying to simplify the reproduction,
and so once your patch arrived I switched back to trying my
original load with your patch applied.

That's so far had three runs (one on 2.6.28, two on 2.6.29-rc1) of
about 16 hours each, without even hitting the origin of the problem.
I did say originally that it would take a week or two to confirm,
so I'll just keep on trying.

I guess we can see from the nature of your patch that it has to
fix it; but it still would be nice to see such an allocation
failure logged, and the system and its wireless continuing okay.

>
> Changes-licensed-under: 3-Clause-BSD

Hmm, I haven't noticed anyone doing that before: hope you're not
starting a trend! I think you'll find (Documentation/SubmittingPatches)
that your Signed-off-by agrees to the Developer's Certificate of Origin
1.1, which would put your patch under the same open source licence(s) as
drivers/net/wireless/ath5k/base.c already contains - that's the usual
case.

But IANAL, and I may be missing an important nuance you'd hoped to convey:
that "(unless I am permitted to submit under a different licence)" clause?

Hugh

>
> Signed-off-by: Bob Copeland <[email protected]>
> ---
> drivers/net/wireless/ath5k/base.c | 85 +++++++++++++++++++++++--------------
> 1 files changed, 53 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/net/wireless/ath5k/base.c b/drivers/net/wireless/ath5k/base.c
> index 2b7b7f6..1c0061a 100644
> --- a/drivers/net/wireless/ath5k/base.c
> +++ b/drivers/net/wireless/ath5k/base.c
> @@ -1095,6 +1095,42 @@ ath5k_hw_to_driver_rix(struct ath5k_softc *sc, int hw_rix)
> * Buffers setup *
> \***************/
>
> +static
> +struct sk_buff *ath5k_rx_skb_alloc(struct ath5k_softc *sc, dma_addr_t *skb_addr)
> +{
> + struct sk_buff *skb;
> + unsigned int off;
> +
> + /*
> + * Allocate buffer with headroom_needed space for the
> + * fake physical layer header at the start.
> + */
> + skb = dev_alloc_skb(sc->rxbufsize + sc->cachelsz - 1);
> +
> + if (!skb) {
> + ATH5K_ERR(sc, "can't alloc skbuff of size %u\n",
> + sc->rxbufsize + sc->cachelsz - 1);
> + return NULL;
> + }
> + /*
> + * Cache-line-align. This is important (for the
> + * 5210 at least) as not doing so causes bogus data
> + * in rx'd frames.
> + */
> + off = ((unsigned long)skb->data) % sc->cachelsz;
> + if (off != 0)
> + skb_reserve(skb, sc->cachelsz - off);
> +
> + *skb_addr = pci_map_single(sc->pdev,
> + skb->data, sc->rxbufsize, PCI_DMA_FROMDEVICE);
> + if (unlikely(pci_dma_mapping_error(sc->pdev, *skb_addr))) {
> + ATH5K_ERR(sc, "%s: DMA mapping failed\n", __func__);
> + dev_kfree_skb(skb);
> + return NULL;
> + }
> + return skb;
> +}
> +
> static int
> ath5k_rxbuf_setup(struct ath5k_softc *sc, struct ath5k_buf *bf)
> {
> @@ -1102,37 +1138,11 @@ ath5k_rxbuf_setup(struct ath5k_softc *sc, struct ath5k_buf *bf)
> struct sk_buff *skb = bf->skb;
> struct ath5k_desc *ds;
>
> - if (likely(skb == NULL)) {
> - unsigned int off;
> -
> - /*
> - * Allocate buffer with headroom_needed space for the
> - * fake physical layer header at the start.
> - */
> - skb = dev_alloc_skb(sc->rxbufsize + sc->cachelsz - 1);
> - if (unlikely(skb == NULL)) {
> - ATH5K_ERR(sc, "can't alloc skbuff of size %u\n",
> - sc->rxbufsize + sc->cachelsz - 1);
> + if (!skb) {
> + skb = ath5k_rx_skb_alloc(sc, &bf->skbaddr);
> + if (!skb)
> return -ENOMEM;
> - }
> - /*
> - * Cache-line-align. This is important (for the
> - * 5210 at least) as not doing so causes bogus data
> - * in rx'd frames.
> - */
> - off = ((unsigned long)skb->data) % sc->cachelsz;
> - if (off != 0)
> - skb_reserve(skb, sc->cachelsz - off);
> -
> bf->skb = skb;
> - bf->skbaddr = pci_map_single(sc->pdev,
> - skb->data, sc->rxbufsize, PCI_DMA_FROMDEVICE);
> - if (unlikely(pci_dma_mapping_error(sc->pdev, bf->skbaddr))) {
> - ATH5K_ERR(sc, "%s: DMA mapping failed\n", __func__);
> - dev_kfree_skb(skb);
> - bf->skb = NULL;
> - return -ENOMEM;
> - }
> }
>
> /*
> @@ -1661,7 +1671,8 @@ ath5k_tasklet_rx(unsigned long data)
> {
> struct ieee80211_rx_status rxs = {};
> struct ath5k_rx_status rs = {};
> - struct sk_buff *skb;
> + struct sk_buff *skb, *next_skb;
> + dma_addr_t next_skb_addr;
> struct ath5k_softc *sc = (void *)data;
> struct ath5k_buf *bf, *bf_last;
> struct ath5k_desc *ds;
> @@ -1746,10 +1757,17 @@ ath5k_tasklet_rx(unsigned long data)
> goto next;
> }
> accept:
> + next_skb = ath5k_rx_skb_alloc(sc, &next_skb_addr);
> +
> + /*
> + * If we can't replace bf->skb with a new skb under memory
> + * pressure, just skip this packet
> + */
> + if (!next_skb)
> + goto next;
> +
> pci_unmap_single(sc->pdev, bf->skbaddr, sc->rxbufsize,
> PCI_DMA_FROMDEVICE);
> - bf->skb = NULL;
> -
> skb_put(skb, rs.rs_datalen);
>
> /* The MAC header is padded to have 32-bit boundary if the
> @@ -1822,6 +1840,9 @@ accept:
> ath5k_check_ibss_tsf(sc, skb, &rxs);
>
> __ieee80211_rx(sc->hw, skb, &rxs);
> +
> + bf->skb = next_skb;
> + bf->skbaddr = next_skb_addr;
> next:
> list_move_tail(&bf->list, &sc->rxbuf);
> } while (ath5k_rxbuf_setup(sc, bf) == 0);
> --
> 1.6.0.6
>
>
> --
> Bob Copeland %% http://www.bobcopeland.com

2009-01-09 13:41:56

by Bob Copeland

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Thu, Jan 8, 2009 at 1:41 PM, Bob Copeland <[email protected]> wrote:
> On Thu, Jan 8, 2009 at 12:55 PM, Hugh Dickins <[email protected]> wrote:
>> Let me see if I can reproduce the ath5k BUG with a straightforward
>> memhog, repeatedly dirtying more anon memory than RAM can provide.
>> Is there something suitable I could run to exercise that wireless
>> path concurrently? It was just idling when I hit the BUGs before.
>
> Receiving any packets will do it, in your case it was probably
> beacons from nearby APs if you weren't using the wifi. So iperf
> or similar will work.

Actually it's probably enough to just intentionally fail one
of the allocations in the driver. I'll do that and let you know
if I can easily break it.

--
Bob Copeland %% http://www.bobcopeland.com

2009-01-13 15:56:57

by Bob Copeland

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Tue, 13 Jan 2009 15:35:29 +0000 (GMT), Hugh Dickins wrote
> But I haven't seen any allocation failure for several days.
> I didn't get any while trying to simplify the reproduction,

Ok, well here's how I reproduced it manually. Without the patch, you
should see "can't alloc skbuff of size ...." followed shortly by the
BUG. With the patch, you see the same error but things proceed as
normal.

(I chose 60 because anything < 40 will fail in probe).

I'll go ahead and push this upstream then today or tomorrow; let me
know if you run into any problems with more testing.

> > +static
> > +struct sk_buff *ath5k_rx_skb_alloc(struct ath5k_softc *sc, dma_addr_t
*skb_addr)
> > +{
> > + struct sk_buff *skb;
> > + unsigned int off;

+ static int foo = 0;

> > + /*
> > + * Allocate buffer with headroom_needed space for the
> > + * fake physical layer header at the start.
> > + */
> > + skb = dev_alloc_skb(sc->rxbufsize + sc->cachelsz - 1);

+ if (++foo == 60)
+ skb = NULL;

> > +
> > + if (!skb) {
> > + ATH5K_ERR(sc, "can't alloc skbuff of size %u\n",
> > + sc->rxbufsize + sc->cachelsz - 1);
> > + return NULL;
> > + }


> > Changes-licensed-under: 3-Clause-BSD
>
> Hmm, I haven't noticed anyone doing that before: hope you're not
> starting a trend! I think you'll find (Documentation/SubmittingPatches)
> that your Signed-off-by agrees to the Developer's Certificate of Origin
> 1.1, which would put your patch under the same open source licence(s) as
> drivers/net/wireless/ath5k/base.c already contains - that's the usual
> case.

I agree with all of the above, but the SFLC suggests we do it anyway.

http://kerneltrap.org/Linux/Wireless_Project_Suggests_Changes-licensed-under_Tag

--
Bob Copeland %% http://www.bobcopeland.com



2009-01-09 14:11:08

by Hugh Dickins

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Fri, 9 Jan 2009, Bob Copeland wrote:
> On Thu, Jan 8, 2009 at 1:41 PM, Bob Copeland <[email protected]> wrote:
> > On Thu, Jan 8, 2009 at 12:55 PM, Hugh Dickins <[email protected]> wrote:
> >> Let me see if I can reproduce the ath5k BUG with a straightforward
> >> memhog, repeatedly dirtying more anon memory than RAM can provide.
> >> Is there something suitable I could run to exercise that wireless
> >> path concurrently? It was just idling when I hit the BUGs before.
> >
> > Receiving any packets will do it, in your case it was probably
> > beacons from nearby APs if you weren't using the wifi. So iperf
> > or similar will work.
>
> Actually it's probably enough to just intentionally fail one
> of the allocations in the driver. I'll do that and let you know
> if I can easily break it.

Good, that should be a lot quicker.

My overnight try with simple memhog, and iperf -s receiving from
another machine doing iperf -c at intervals, didn't hit it (but my
more involved test wouldn't necessarily have hit it in that time either).

Currently trying just a 2.6.28 defconfig kernel build in tmpfs
(other filesystems should get their dirty pages pdflushed sooner)
along with the iperf, but haven't got the sizes suitably tuned yet.

Hugh

2009-01-08 14:47:03

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Thu, 2009-01-08 at 13:49 +0000, Hugh Dickins wrote:
> Thank you for the ath5k wireless driver,
> which I've just started using on an Aspire One.
>
> When running swapping load tests, wireless alive but not in active use,
> I've now twice hit the BUG_ON(bf->skb == NULL) in ath5k_tasklet_rx().
>
> First time was with 2.6.28 plus kdb patch,
> kernel BUG at drivers/net/wireless/ath5k/base.c:1675!
> and poking around the messages I could see before that an
> ath5k phy0: can't alloc skbuff of size 2673, after a
> swapper: page allocation failure, order:0 mode:0x4020
> on a call from dev_alloc_skb(), with memory summary.
>
> Second time was with yesterday's 2.6.28-git, no kdb,
> kernel BUG at drivers/net/wireless/ath5k/base.c:1683!
> running ath5k_tasklet_rx() from under do_softirq(),
> Kernel panic - not syncing: Fatal exception in interrupt
> so I couldn't see more; but at the top of the screen, the
> last three lines of a page allocation failure memory summary.
>
> So, that BUG_ON(bf->skb == NULL) appears to be unsafe under
> memory pressure; but the fix wasn't obvious to me, so over
> to you!
>
> I'd be glad to try patches, of course, but it's not happening
> often enough for me to report back success quickly - unless I
> stumble on a quicker way to reproduce it, it'll need a week or
> two to grow confident of a fix.
>
> Thanks,
> Hugh


I use aspire one, and I don't see that, did this happen several times?


Could you try compat-wireless:

http://wireless.kernel.org/en/users/Download


It has many fixes that made my wireless on that notebook more or less
usable.

Best regards,
Maxim Levitsky


2009-01-10 16:47:33

by Bob Copeland

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Fri, Jan 09, 2009 at 02:10:50PM +0000, Hugh Dickins wrote:
> Good, that should be a lot quicker.

Well I got a lockup doing that, I'll try again later but anyway I see
the bug already, read on if interested. I should have a patch shortly.

ath5k_tasklet_rx creates a new skb for every packet that is handed to
the upper layers, and the buffer is moved to the end of the rxbuf list.
But ath5k_rxbuf_setup fails skb_alloc so the buffer is left in the rxbuf
list with a null skb. Once another 40 packets are processed that buffer
comes up again and triggers the BUG_ON. Disabling the BUG_ON is bad
because we could end up with no available buffers.

The fix would be to create a new skb when we accept a packet, and just
drop the packet if that skb creation fails. Then swap in the new skb
after ieee80211_rx. ath9k already does something like this.

--
Bob Copeland %% http://www.bobcopeland.com


2009-01-08 17:55:51

by Hugh Dickins

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Thu, 8 Jan 2009, Bob Copeland wrote:
> On Thu, Jan 8, 2009 at 11:18 AM, Hugh Dickins <[email protected]> wrote:
> >> > So, that BUG_ON(bf->skb == NULL) appears to be unsafe under
> >> > memory pressure; but the fix wasn't obvious to me, so over
> >> > to you!
>
> Thanks for the report... I guess the error paths haven't been tested
> much when rx buf setup fails.
>
> > (Of course, just removing the BUG_ON, and making sure there's
> > no oops on the NULL pointer, would fix my immediate issue:
> > but I doubt the right fix will be as simple as that.)
>
> Are your memory load testing scripts available somewhere?

No. They're little more than repeatedly running a "make -j20"
kernel build in a tmpfs, and another in an ext2 on /dev/loop0
backed by large tmpfs file. But most of that will be irrelevant
to the ath5k issue, and it can be tricky when setting up to get
memory and swap and tmpfs sizes right to do plenty of swapping,
without the test just collapsing in out-of-memory kills.

Let me see if I can reproduce the ath5k BUG with a straightforward
memhog, repeatedly dirtying more anon memory than RAM can provide.
Is there something suitable I could run to exercise that wireless
path concurrently? It was just idling when I hit the BUGs before.

Hugh

2009-02-06 18:38:21

by Bob Copeland

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Fri, 6 Feb 2009 13:12:34 +0000 (GMT), Hugh Dickins wrote
> Hi Bob,
>
> On Tue, 13 Jan 2009, Bob Copeland wrote:
> > On Tue, 13 Jan 2009 15:35:29 +0000 (GMT), Hugh Dickins wrote
> > > But I haven't seen any allocation failure for several days.
> > > I didn't get any while trying to simplify the reproduction,
> >
> > I'll go ahead and push this upstream then today or tomorrow; let me
> > know if you run into any problems with more testing.
>
> Just a prod: I've not seen your fix go into the 2.6.29-rc tree yet.

Good point, it is queued for 2.6.30. John, would there be any problem
with including a8e3cd514, "ath5k: fix bf->skb==NULL panic in
ath5k_tasklet_rx" for 2.6.29?

--
Bob Copeland %% http://www.bobcopeland.com



2009-02-09 02:32:21

by Bob Copeland

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Fri, Feb 06, 2009 at 01:44:46PM -0500, John W. Linville wrote:
> Hmmm...well it is a little big, but I guess it is OK. Do you have
> an example of the stack trace? Or a link to a bugzilla or somesuch
> would be even better...

Here's an oops I captured by forcing an skb alloc to fail (transcribed,
my usb-to-serial adapter tends to drop lots of stuff):

------------[ cut here ]------------
kernel BUG at drivers/net/wireless/ath5k/base.c:1689!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/class/misc/fuse/dev

Pid: 0, comm: swapper Not tainted (2.6.29-rc3 #46) MacBook1,1
EIP: 0060:[f82222ba>] EFLAGS: 00010246 CPU: 1
EIP is at ath5k_tasklet_rx+0xb4/0x5a6 [ath5k]
EAX: f40923b8 EBX: 00000000 ECX: f409239c EDX: 00000000
ESI: 00000080 EDI: 00000082 RBP: 00000000 ESP: f73d5e70
DS: 007b ES: 007b FS: 00d8 FS: 0000 SS: 0068
Process swapper (pid: 0, ti=f73d4000 task=f73ce750 task.ti=f73d4000)
Stack:
00000082 f4068f60 f406f0f8 f406f0f0 00000000 f40923b8 f4092380 f406c090
c0460404 f73ae000 2b74f34f 0000001f 00000000 0000099e ffffffe6 ffffffa0
00000064 00000001 00000000 00000082 7e4f014e ff460000 0000011b f406c0ec
Call Trace:
[<c0126bec>] tasklet_action+0x8a/0xeb
[<c0127000>] __do_softirq+0x8c/0x13c
[<c01270e6>] do_softirq+0x36/0x51
[<c012727b>] irq_exit+0x43/0x79
[<c01044f3>] do_IRQ+0x8e/0xa4
[<c010346c>] common_interrupt+0x2c/0x34
[<c013007b>] __create_workqueue_key+0xe1/0x184
[<f85f009f>] acpi_idle_enter_bm+0x271/0x2f0 [processor]
[<c02a0483>] menu_select+0x39/0x96
[<c029fad0>] cpuidle_idle_call+0x62/0x92
[<c0101f91>] cpu_idle+0x74/0xa7
Code: 64 31 00 00 81 c1 30 31 00 00 89 4c 24 1c 89 54 24 18 8b 44 24 04 8b 80 60 31 00 00 89 44 24 14 8b 50 14 85 d2 89 54 24 10 75 04 <0f> 0b eb fe 8b 4c 24 14 39 4c 24 18 8b 51 0c 75 08 8b 44 24 18
EIP: [<f82222ba>] ath5k_tasklet_rx+0xb4/0x5a6 [ath5k] SS:ESP 0068:f73d5e70
---[ end trace 67aef94060f00cf7 ]---
Kernel panic - not syncing: Fatal exception in interrupt

--
Bob Copeland %% http://www.bobcopeland.com


2009-02-06 20:59:10

by Bob Copeland

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Fri, 6 Feb 2009 19:01:28 +0000 (GMT), Hugh Dickins wrote
> On Fri, 6 Feb 2009, John W. Linville wrote:
> > Hmmm...well it is a little big, but I guess it is OK. Do you have
> > an example of the stack trace? Or a link to a bugzilla or somesuch
> > would be even better...
>
> Below is my original report: I didn't copy down the stacktrace,
> and the kernel paniced with "Fatal exception in interrupt", so
> nothing I can send you from the logs I'm afraid.

I can generate the panic if you like, but it will need to wait a few
hours until I am back home to the machine with the serial console.
I just forced one of the skb allocations to fail to test it.

--
Bob Copeland %% http://www.bobcopeland.com



2009-02-06 19:02:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Fri, 6 Feb 2009, John W. Linville wrote:
> On Fri, Feb 06, 2009 at 01:37:49PM -0500, Bob Copeland wrote:
> > On Fri, 6 Feb 2009 13:12:34 +0000 (GMT), Hugh Dickins wrote
> > > Hi Bob,
> > >
> > > On Tue, 13 Jan 2009, Bob Copeland wrote:
> > > > On Tue, 13 Jan 2009 15:35:29 +0000 (GMT), Hugh Dickins wrote
> > > > > But I haven't seen any allocation failure for several days.
> > > > > I didn't get any while trying to simplify the reproduction,
> > > >
> > > > I'll go ahead and push this upstream then today or tomorrow; let me
> > > > know if you run into any problems with more testing.
> > >
> > > Just a prod: I've not seen your fix go into the 2.6.29-rc tree yet.
> >
> > Good point, it is queued for 2.6.30. John, would there be any problem
> > with including a8e3cd514, "ath5k: fix bf->skb==NULL panic in
> > ath5k_tasklet_rx" for 2.6.29?
>
> Hmmm...well it is a little big, but I guess it is OK. Do you have
> an example of the stack trace? Or a link to a bugzilla or somesuch
> would be even better...

Below is my original report: I didn't copy down the stacktrace,
and the kernel paniced with "Fatal exception in interrupt", so
nothing I can send you from the logs I'm afraid.

Hugh

>From [email protected] Thu Jan 8 13:49:47 2009
Date: Thu, 8 Jan 2009 13:49:44 +0000 (GMT)
From: Hugh Dickins <[email protected]>
To: Jiri Slaby <[email protected]>
Cc: <[email protected]>, <[email protected]>
Subject: ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

Thank you for the ath5k wireless driver,
which I've just started using on an Aspire One.

When running swapping load tests, wireless alive but not in active use,
I've now twice hit the BUG_ON(bf->skb == NULL) in ath5k_tasklet_rx().

First time was with 2.6.28 plus kdb patch,
kernel BUG at drivers/net/wireless/ath5k/base.c:1675!
and poking around the messages I could see before that an
ath5k phy0: can't alloc skbuff of size 2673, after a
swapper: page allocation failure, order:0 mode:0x4020
on a call from dev_alloc_skb(), with memory summary.

Second time was with yesterday's 2.6.28-git, no kdb,
kernel BUG at drivers/net/wireless/ath5k/base.c:1683!
running ath5k_tasklet_rx() from under do_softirq(),
Kernel panic - not syncing: Fatal exception in interrupt
so I couldn't see more; but at the top of the screen, the
last three lines of a page allocation failure memory summary.

So, that BUG_ON(bf->skb == NULL) appears to be unsafe under
memory pressure; but the fix wasn't obvious to me, so over
to you!

I'd be glad to try patches, of course, but it's not happening
often enough for me to report back success quickly - unless I
stumble on a quicker way to reproduce it, it'll need a week or
two to grow confident of a fix.

Thanks,
Hugh

2009-02-06 18:45:53

by John W. Linville

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

On Fri, Feb 06, 2009 at 01:37:49PM -0500, Bob Copeland wrote:
> On Fri, 6 Feb 2009 13:12:34 +0000 (GMT), Hugh Dickins wrote
> > Hi Bob,
> >
> > On Tue, 13 Jan 2009, Bob Copeland wrote:
> > > On Tue, 13 Jan 2009 15:35:29 +0000 (GMT), Hugh Dickins wrote
> > > > But I haven't seen any allocation failure for several days.
> > > > I didn't get any while trying to simplify the reproduction,
> > >
> > > I'll go ahead and push this upstream then today or tomorrow; let me
> > > know if you run into any problems with more testing.
> >
> > Just a prod: I've not seen your fix go into the 2.6.29-rc tree yet.
>
> Good point, it is queued for 2.6.30. John, would there be any problem
> with including a8e3cd514, "ath5k: fix bf->skb==NULL panic in
> ath5k_tasklet_rx" for 2.6.29?

Hmmm...well it is a little big, but I guess it is OK. Do you have
an example of the stack trace? Or a link to a bugzilla or somesuch
would be even better...

John
--
John W. Linville Someday the world will need a hero, and you
[email protected] might be all we have. Be ready.

2009-02-06 13:13:47

by Hugh Dickins

[permalink] [raw]
Subject: Re: [ath5k-devel] ath5k_tasklet_rx BUG_ON(bf->skb == NULL)

Hi Bob,

On Tue, 13 Jan 2009, Bob Copeland wrote:
> On Tue, 13 Jan 2009 15:35:29 +0000 (GMT), Hugh Dickins wrote
> > But I haven't seen any allocation failure for several days.
> > I didn't get any while trying to simplify the reproduction,
>
> I'll go ahead and push this upstream then today or tomorrow; let me
> know if you run into any problems with more testing.

Just a prod: I've not seen your fix go into the 2.6.29-rc tree yet.

I confess that not even once since then have I seen the page allocation
failure which led to the BUG you've fixed: though most of my testing
has been on 2.6.29-rc kernels, so my guess would be that something
else has changed to make a failure there less likely.

But I do still carry your patch, and would be reassured to see it in.

Thanks,
Hugh