2010-12-03 12:49:08

by Andrej Ota

[permalink] [raw]
Subject: unable to handle kernel NULL pointer dereference in skb_dequeue

I have also hit on Bug 20292 (https://bugzilla.kernel.org/show_bug.cgi?id=20292) in final 2.6.36. After investigating changes made between 2.6.35.4, which worked, and 2.6.36 which started oopsing, I think the problem was in double freeing of skb caused by change of return value for __pppoe_xmit in case of errors.

As it turned out this might be the cause of random BUG reports throught the kernel, whenever something touched skb. Most common BUG with my use case happened at skb_dequeue:

00000060 <skb_dequeue>:
60: 53 push %ebx
61: 89 c2 mov %eax,%edx
63: 9c pushf
64: 59 pop %ecx
65: fa cli
66: 8b 00 mov (%eax),%eax
68: 39 c2 cmp %eax,%edx
6a: 74 24 je 90 <skb_dequeue+0x30>
6c: 85 c0 test %eax,%eax
6e: 74 1a je 8a <skb_dequeue+0x2a>
70: ff 4a 08 decl 0x8(%edx)
73: 8b 18 mov (%eax),%ebx
75: c7 00 00 00 00 00 movl $0x0,(%eax)
7b: 8b 50 04 mov 0x4(%eax),%edx
7e: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
85: 89 53 04 mov %edx,0x4(%ebx)
88:* 89 1a mov %ebx,(%edx)
8a: 51 push %ecx
8b: 9d popf
8c: 5b pop %ebx
8d: c3 ret
8e: 66 90 xchg %ax,%ax
90: b8 00 00 00 00 mov $0x0,%eax
95: eb f3 jmp 8a <skb_dequeue+0x2a>
97: 89 f6 mov %esi,%esi
99: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi


This location corresponds to line "next = next->next" from inlined __skb_dequeue(). The only reason BUG could happen here is something overwriting or otherwise corrupting skb list.

Patch that works for me is below. Now I only hope I haven't (re)introduced a memory leak...

I am not subscribed to LKML, so please reply-to-all if you need to contact me.

-----------------------------------------------------------------------------
--- linux-2.6.36/drivers/net/pppoe.c 2010-10-20 22:30:22.000000000 +0200
+++ linux-2.6.36.toshio/drivers/net/pppoe.c 2010-12-03 13:11:56.000000000 +0100
@@ -924,8 +924,10 @@
/* Copy the data if there is no space for the header or if it's
* read-only.
*/
- if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len))
+ if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len)) {
+ kfree_skb(skb);
goto abort;
+ }

__skb_push(skb, sizeof(*ph));
skb_reset_network_header(skb);
@@ -947,7 +949,6 @@
return 1;

abort:
- kfree_skb(skb);
return 0;
}

-----------------------------------------------------------------------------

Andrej Ota.


2010-12-03 13:09:59

by Eric Dumazet

[permalink] [raw]
Subject: Re: unable to handle kernel NULL pointer dereference in skb_dequeue

Le vendredi 03 décembre 2010 à 13:42 +0100, Toshio a écrit :
> I have also hit on Bug 20292 (https://bugzilla.kernel.org/show_bug.cgi?id=20292) in final 2.6.36. After investigating changes made between 2.6.35.4, which worked, and 2.6.36 which started oopsing, I think the problem was in double freeing of skb caused by change of return value for __pppoe_xmit in case of errors.
>
> As it turned out this might be the cause of random BUG reports throught the kernel, whenever something touched skb. Most common BUG with my use case happened at skb_dequeue:
>

CC netdev and Rami Rosen

> 00000060 <skb_dequeue>:
> 60: 53 push %ebx
> 61: 89 c2 mov %eax,%edx
> 63: 9c pushf
> 64: 59 pop %ecx
> 65: fa cli
> 66: 8b 00 mov (%eax),%eax
> 68: 39 c2 cmp %eax,%edx
> 6a: 74 24 je 90 <skb_dequeue+0x30>
> 6c: 85 c0 test %eax,%eax
> 6e: 74 1a je 8a <skb_dequeue+0x2a>
> 70: ff 4a 08 decl 0x8(%edx)
> 73: 8b 18 mov (%eax),%ebx
> 75: c7 00 00 00 00 00 movl $0x0,(%eax)
> 7b: 8b 50 04 mov 0x4(%eax),%edx
> 7e: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
> 85: 89 53 04 mov %edx,0x4(%ebx)
> 88:* 89 1a mov %ebx,(%edx)
> 8a: 51 push %ecx
> 8b: 9d popf
> 8c: 5b pop %ebx
> 8d: c3 ret
> 8e: 66 90 xchg %ax,%ax
> 90: b8 00 00 00 00 mov $0x0,%eax
> 95: eb f3 jmp 8a <skb_dequeue+0x2a>
> 97: 89 f6 mov %esi,%esi
> 99: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi
>
>
> This location corresponds to line "next = next->next" from inlined __skb_dequeue(). The only reason BUG could happen here is something overwriting or otherwise corrupting skb list.
>
> Patch that works for me is below. Now I only hope I haven't (re)introduced a memory leak...
>
> I am not subscribed to LKML, so please reply-to-all if you need to contact me.
>
> -----------------------------------------------------------------------------
> --- linux-2.6.36/drivers/net/pppoe.c 2010-10-20 22:30:22.000000000 +0200
> +++ linux-2.6.36.toshio/drivers/net/pppoe.c 2010-12-03 13:11:56.000000000 +0100
> @@ -924,8 +924,10 @@
> /* Copy the data if there is no space for the header or if it's
> * read-only.
> */
> - if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len))
> + if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len)) {
> + kfree_skb(skb);
> goto abort;
> + }
>
> __skb_push(skb, sizeof(*ph));
> skb_reset_network_header(skb);
> @@ -947,7 +949,6 @@
> return 1;
>
> abort:
> - kfree_skb(skb);
> return 0;
> }
>

Problem comes from commit 55c95e738da85 (fix return value of
__pppoe_xmit() method)

I am not sure patch is OK



2010-12-03 14:43:41

by Andrej Ota

[permalink] [raw]
Subject: Re: unable to handle kernel NULL pointer dereference in skb_dequeue

>> Patch that works for me is below. Now I only hope I haven't
>> (re)introduced a memory leak...

> Problem comes from commit 55c95e738da85 (fix return value of
> __pppoe_xmit() method)
>
> I am not sure patch is OK


Me neither. That's why I wrote "works for me". All I dare say is that it
works better than current code and is probably no worse than it was before
above mentioned commit. Apart from that, there is no point in having return
value for __pppoe_xmit if return value isn't needed.

Easiest way of triggering this BUG is by terminating PPPoE on the server
side, which then hits "if (!dev) { goto abort; }". This in turn calls
"kfree_skb(skb); return 0;" which returns to pppoe_rcv_core which then
goto-s to "abort_put" which again calls "kfree_skb(skb)". Voila the bug.

I don't know how to trigger "if (skb_cow_head(skb, ..." to see if I have
just caused another BUG. However, if I read file comments at the top, I see
a comment from 19/07/01 stating that I have to delete original skb if code
succeeds and never delete it on failure. About the skb copy mentioned in
the same comment, I don't know. 2001 was many commits ago.

Andrej Ota.

2010-12-03 14:46:41

by Eric Dumazet

[permalink] [raw]
Subject: Re: unable to handle kernel NULL pointer dereference in skb_dequeue

Le vendredi 03 décembre 2010 à 15:37 +0100, Andrej Ota a écrit :
> >> Patch that works for me is below. Now I only hope I haven't
> >> (re)introduced a memory leak...
>
> > Problem comes from commit 55c95e738da85 (fix return value of
> > __pppoe_xmit() method)
> >
> > I am not sure patch is OK
>
>
> Me neither. That's why I wrote "works for me". All I dare say is that it
> works better than current code and is probably no worse than it was before
> above mentioned commit. Apart from that, there is no point in having return
> value for __pppoe_xmit if return value isn't needed.
>
> Easiest way of triggering this BUG is by terminating PPPoE on the server
> side, which then hits "if (!dev) { goto abort; }". This in turn calls
> "kfree_skb(skb); return 0;" which returns to pppoe_rcv_core which then
> goto-s to "abort_put" which again calls "kfree_skb(skb)". Voila the bug.
>
> I don't know how to trigger "if (skb_cow_head(skb, ..." to see if I have
> just caused another BUG. However, if I read file comments at the top, I see
> a comment from 19/07/01 stating that I have to delete original skb if code
> succeeds and never delete it on failure. About the skb copy mentioned in
> the same comment, I don't know. 2001 was many commits ago.

Well, all I wanted to say was that _I_ was not sure, but probably other
network guys have a better diagnostic.

Rami, could you re-explain the rationale of your patch ?

Thanks

2010-12-03 22:07:09

by Jarek Poplawski

[permalink] [raw]
Subject: Re: unable to handle kernel NULL pointer dereference in skb_dequeue

Eric Dumazet wrote:
> Le vendredi 03 décembre 2010 à 15:37 +0100, Andrej Ota a écrit :
>>>> Patch that works for me is below. Now I only hope I haven't
>>>> (re)introduced a memory leak...
>>
>>> Problem comes from commit 55c95e738da85 (fix return value of
>>> __pppoe_xmit() method)
...
> Rami, could you re-explain the rationale of your patch ?

I guess we could revert it in the meantime: it seems there might be
at least three threads (including bugzilla) with this problem.

Jarek P.

2010-12-03 22:24:54

by Denys Fedoryshchenko

[permalink] [raw]
Subject: Re: unable to handle kernel NULL pointer dereference in skb_dequeue

On Friday 03 December 2010 16:46:35 Eric Dumazet wrote:
> Le vendredi 03 décembre 2010 à 15:37 +0100, Andrej Ota a écrit :
> > >> Patch that works for me is below. Now I only hope I haven't
> > >> (re)introduced a memory leak...
> > >
> > > Problem comes from commit 55c95e738da85 (fix return value of
> > > __pppoe_xmit() method)
> > >
> > > I am not sure patch is OK
> >
> > Me neither. That's why I wrote "works for me". All I dare say is that it
> > works better than current code and is probably no worse than it was
> > before above mentioned commit. Apart from that, there is no point in
> > having return value for __pppoe_xmit if return value isn't needed.
> >
> > Easiest way of triggering this BUG is by terminating PPPoE on the server
> > side, which then hits "if (!dev) { goto abort; }". This in turn calls
> > "kfree_skb(skb); return 0;" which returns to pppoe_rcv_core which then
> > goto-s to "abort_put" which again calls "kfree_skb(skb)". Voila the bug.
> >
> > I don't know how to trigger "if (skb_cow_head(skb, ..." to see if I have
> > just caused another BUG. However, if I read file comments at the top, I
> > see a comment from 19/07/01 stating that I have to delete original skb
> > if code succeeds and never delete it on failure. About the skb copy
> > mentioned in the same comment, I don't know. 2001 was many commits ago.
>
> Well, all I wanted to say was that _I_ was not sure, but probably other
> network guys have a better diagnostic.
>
> Rami, could you re-explain the rationale of your patch ?
>
> Thanks
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
This patch seems fixed my issue (Re: 2.6.35->2.6.36 regression, vanilla kernel
panic, ppp or hrtimers crashing), tested now.

2010-12-10 20:00:48

by Denys Fedoryshchenko

[permalink] [raw]
Subject: Re: unable to handle kernel NULL pointer dereference in skb_dequeue

On Friday 03 December 2010 16:46:35 Eric Dumazet wrote:
> Le vendredi 03 décembre 2010 à 15:37 +0100, Andrej Ota a écrit :
> > >> Patch that works for me is below. Now I only hope I haven't
> > >> (re)introduced a memory leak...
> > >
> > > Problem comes from commit 55c95e738da85 (fix return value of
> > > __pppoe_xmit() method)
> > >
> > > I am not sure patch is OK
> >
> > Me neither. That's why I wrote "works for me". All I dare say is that it
> > works better than current code and is probably no worse than it was
> > before above mentioned commit. Apart from that, there is no point in
> > having return value for __pppoe_xmit if return value isn't needed.
> >
> > Easiest way of triggering this BUG is by terminating PPPoE on the server
> > side, which then hits "if (!dev) { goto abort; }". This in turn calls
> > "kfree_skb(skb); return 0;" which returns to pppoe_rcv_core which then
> > goto-s to "abort_put" which again calls "kfree_skb(skb)". Voila the bug.
> >
> > I don't know how to trigger "if (skb_cow_head(skb, ..." to see if I have
> > just caused another BUG. However, if I read file comments at the top, I
> > see a comment from 19/07/01 stating that I have to delete original skb
> > if code succeeds and never delete it on failure. About the skb copy
> > mentioned in the same comment, I don't know. 2001 was many commits ago.
>
> Well, all I wanted to say was that _I_ was not sure, but probably other
> network guys have a better diagnostic.
>
> Rami, could you re-explain the rationale of your patch ?
>
> Thanks
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Is there any plans to queue any patch to stable?
pppoe is almost dead in 2.6.36.*

2010-12-10 20:18:18

by David Miller

[permalink] [raw]
Subject: Re: unable to handle kernel NULL pointer dereference in skb_dequeue

From: Denys Fedoryshchenko <[email protected]>
Date: Fri, 10 Dec 2010 21:51:04 +0200

> On Friday 03 December 2010 16:46:35 Eric Dumazet wrote:
>> Le vendredi 03 d?cembre 2010 ? 15:37 +0100, Andrej Ota a ?crit :
>> > >> Patch that works for me is below. Now I only hope I haven't
>> > >> (re)introduced a memory leak...
>> > >
>> > > Problem comes from commit 55c95e738da85 (fix return value of
>> > > __pppoe_xmit() method)
>> > >
>> > > I am not sure patch is OK
>> >
>> > Me neither. That's why I wrote "works for me". All I dare say is that it
>> > works better than current code and is probably no worse than it was
>> > before above mentioned commit. Apart from that, there is no point in
>> > having return value for __pppoe_xmit if return value isn't needed.
>> >
>> > Easiest way of triggering this BUG is by terminating PPPoE on the server
>> > side, which then hits "if (!dev) { goto abort; }". This in turn calls
>> > "kfree_skb(skb); return 0;" which returns to pppoe_rcv_core which then
>> > goto-s to "abort_put" which again calls "kfree_skb(skb)". Voila the bug.
>> >
>> > I don't know how to trigger "if (skb_cow_head(skb, ..." to see if I have
>> > just caused another BUG. However, if I read file comments at the top, I
>> > see a comment from 19/07/01 stating that I have to delete original skb
>> > if code succeeds and never delete it on failure. About the skb copy
>> > mentioned in the same comment, I don't know. 2001 was many commits ago.
>>
>> Well, all I wanted to say was that _I_ was not sure, but probably other
>> network guys have a better diagnostic.
>>
>> Rami, could you re-explain the rationale of your patch ?
>>
>> Thanks
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Is there any plans to queue any patch to stable?
> pppoe is almost dead in 2.6.36.*

I'll deal with it for -stable once I evaluate this patch for upstream,
which I haven't even gotten to yet.

When people bark about -stable this and -stable that, it just takes
more time away from me actually getting through all the patches. If
it causes a crash, I know it should go to stable and I'll take care of
it. So there is no need to make an explicit note or query about it.

Thanks.

2010-12-10 21:30:34

by Jarek Poplawski

[permalink] [raw]
Subject: Re: unable to handle kernel NULL pointer dereference in skb_dequeue

David Miller wrote:
> From: Denys Fedoryshchenko <[email protected]>
> Date: Fri, 10 Dec 2010 21:51:04 +0200
...
>> Is there any plans to queue any patch to stable?
>> pppoe is almost dead in 2.6.36.*
>
> I'll deal with it for -stable once I evaluate this patch for upstream,
> which I haven't even gotten to yet.

This patch is here:
http://patchwork.ozlabs.org/patch/75095/

But IMHO it's buggy for the reason I mentioned. Since Andrej
didn't respond yet I'd suggest reverting of the offending
commit 55c95e738da85. If you agree with that but e.g. too busy,
let me know.

Jarek P.