2009-10-17 17:57:19

by Anirban Sinha

[permalink] [raw]
Subject: Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled

On Thu, 8 Oct 2009, David Miller wrote:

> >> > We are noticing a kernel OOPS on 2.6.26 kernel when we issue the command
> >> > "clear ip bgp <bgp-peer-ip>" on Quagga BGP routing software.
> >>
> >> You will need to update your kernel, there have been many TCP
> >> MD5 bug fixes since 2.6.26
> >>
> >
> > Sigh ... wish that were that easy!
>
> Contact your vendor for support :-)

but we *are* the vendors for our distribution and even though I may not be
a networking guru, I have little bit knowledge in working my way through the
kernel code. I have traced down the cause of the BUG() and when I did a git
pull against Linus' tree, I see the same issue in teh git tip as well.

The BUG() is triggered from kernel/timer.c, line 1037 within the
function __run_timers(). I am reporting these line numbers from the latest git
tip.

What happens is that before and after the callback, the code grabs the preempt
count and catches unbalanced preempt_enable() and preempt_disable() calls from
within the callback function. In this case, the callback function is
inet_twdr_hangman() as can be seen from this instrumented log:

[02:15:15.941981] Kernel panic - not syncing: <3>huh, entered ffffffff803fbd60
(inet_twdr_hangman+0x0/0xe0)with preempt_count 00000102, exited with 00000101?

Clearly there is an extra unbalanced preempt_enable() somewhere within the
callback function.

When I looked at the function, I see that in net/ipv4/inet_timewait_sock.c
line 215, the function calls schedule_work(). schedule_work() calls
queue_work() which in turn calls put_cpu() that ultimately does a
preempt_enable(). It is this unbalanced preempt_enable() that decriments the
preempt_count by one as can be seen from the above trace.

I suspect that workqueue related operatios are illegal from a timer callback
function. In that case, the above mentioned callback function needs to be
fixed.

Yes, I can't explain why others are also not seeing the same bug crash. We
don't have the luxury to pull in the latest and greatest kernel from the git
tree everytime an update is made and try it out. So I am unable to repo the
issue with the latest kernel. But if that means that this issue should be
ignored, then that is fine by me. We will fix our private kenrel with an
appropriate patch as we continue to investigate more.

Cheers,

Ani


2009-10-18 02:41:00

by Anirban Sinha

[permalink] [raw]
Subject: Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled



Once upon a time, like on 09-10-17 10:57 AM, Anirban Sinha wrote:
> On Thu, 8 Oct 2009, David Miller wrote:
>
>>>>> We are noticing a kernel OOPS on 2.6.26 kernel when we issue the command
>>>>> "clear ip bgp <bgp-peer-ip>" on Quagga BGP routing software.

and btw, this is the crash (on mips) we are talking about:

# [23:10:35.108808] Kernel bug detected[#1]:
[23:10:35.112527] Cpu 0
[23:10:35.114676] $ 0 : 0000000000000000 0000000014001fe0
0000000000000066 0000000000000004
[23:10:35.122845] $ 4 : ffffffff80516c10 0000000014001fe0
ffffffff8050c010 0000000000000004
[23:10:35.131015] $ 8 : 0000000000000000 0000000000000041
ffffffff805142e8 0000000000000001
[23:10:35.139184] $12 : ffffffff80600000 ffffffff805f0000
0000000000000064 0000000000000190
[23:10:35.147354] $16 : 0000000000000102 ffffffff803afdf0
ffffffff80539040 ffffffff80600780
[23:10:35.155526] $20 : ffffffff80540000 0000000000200200
ffffffff804c0000 000000000000000a
[23:10:35.163695] $24 : a3d70a3d70a3d70b 8000000000000003
[23:10:35.171865] $28 : ffffffff8050c000 ffffffff8050fd90
9000000010030000 ffffffff801487a8
[23:10:35.180035] Hi : 0000000000000000
[23:10:35.183819] Lo : 0000000000000000
[23:10:35.187603] epc : ffffffff801487a8 run_timer_softirq+0x198/0x258
Tainted: P
[23:10:35.196032] ra : ffffffff801487a8 run_timer_softirq+0x198/0x258
[23:10:35.202395] Status: 14001fe3 KX SX UX KERNEL EXL IE
[23:10:35.207814] Cause : 00808024
[23:10:35.210911] PrId : 01041100 (SiByte SB1A)
[23:10:35.215209] Modules linked in: xt_state ipt_REJECT iptable_filter
nf_conntrack_ftp ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4
ip_tables ebtable_filter ebtables bridge llc zeug_ipmcdrv(P) irqdisp(P)
zvirt(P) zeugmod(P) softdog
[23:10:35.236024] Process swapper (pid: 0, threadinfo=ffffffff8050c000,
task=ffffffff805142e8, tls=0000000000000000)
[23:10:35.246169] Stack : ffffffff8050fd90 ffffffff8050fd90
0000000014001fe0 ffffffff805ff3e0
[23:10:35.254166] ffffffff806003c4 0000000000000001
ffffffff8053f650 ffffffff805706d0
[23:10:35.262337] ffffffff80572020 ffffffff80142280
ffffffff806003c0 0000000000000000
[23:10:35.270507] 0000000014001fe0 000000000000c5b0
ffffffff8fefc520 ffffffff8feea52c
[23:10:35.278676] 0000000000000015 0000000000004460
0000000000000940 ffffffff8fe1bf00
[23:10:35.286846] ffffffff8fffdab0 ffffffff80142410
0000000000000000 ffffffff80142778
[23:10:35.295017] ffffffff80103d20 ffffffff80103d20
0000000000000000 0000000014001fe1
[23:10:35.303187] 0000000000040000 ffffffff8050c010
0000000000000000 a80000017f87c138
[23:10:35.311357] 0000000014001fe0 ffffffffffff00fe
0000000000000004 a80000017e7e0680
[23:10:35.319528] 0000000000000000 000000000000001d
ffffffff8050ffe0 0000000000001f00
[23:10:35.327696] ...
[23:10:35.330536] Call Trace:
[23:10:35.333201] [<ffffffff801487a8>] run_timer_softirq+0x198/0x258
[23:10:35.339224] [<ffffffff80142280>] __do_softirq+0x198/0x288
[23:10:35.344812] [<ffffffff80142410>] do_softirq+0xa0/0xa8
[23:10:35.350057] [<ffffffff80142778>] irq_exit+0x70/0x88
[23:10:35.355131] [<ffffffff80103d20>] ret_from_irq+0x0/0x4
[23:10:35.360377] [<ffffffff801063f4>] cpu_idle+0x1c/0x88
[23:10:35.365455]
[23:10:35.367171]
[23:10:35.367174] Code: 0040382d 0c04ef4c 00000000 <0200000d> 0c10ee9c
0260202d dfa60000 17a6ffe5 00000000
[23:10:35.378822] Kernel panic - not syncing: Fatal exception in
interrupt

2009-10-18 20:19:33

by Anirban Sinha

[permalink] [raw]
Subject: Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled

Hi Oleg:

I have a question for you. The queue_work() routine which is called from schedule_work() does a put_cpu() which in turn does a enable_preempt(). Is this an attempt to trigger the scheduler? One of the side affects of this enable_preempt() is the crash that we see below. What is happening is that a timer callback routine, in this case inet_twdr_hangman(), tries a bunch of cleanup until a threshold is reached. If further cleanups needs to be done beyond the threshold, it queues a work function. Now when the timer callback is run in __run_timers(), the routine grabs the value of preempt_count before and after the callback function call. If the two counts do not match, it calls BUG() (line 1037 in kernel/timer.c). Is is it illegal to schedule a work function from within a timer callback? What would be a good solution? I have already posted in netdev but since workqueues and timers are general kernel infrastructure, I thought I might as well post the question in the main linux m
ailing list and to you.

Here's the output from my instrumented BUG() call:

[02:15:15.941981] Kernel panic - not syncing: <3>huh, entered ffffffff803fbd60
(inet_twdr_hangman+0x0/0xe0)with preempt_count 00000102, exited with 00000101?

I was thinking of a hacky solution, to replace schedule_work() with schedule_delayed_work() just to get around the issue. But I am sure this is just too hacky and probably not the ideal solution ...

Cheers,

Ani


Once upon a time, like on 09-10-17 7:35 PM, Anirban Sinha wrote:
>
>
> Once upon a time, like on 09-10-17 10:57 AM, Anirban Sinha wrote:
>> On Thu, 8 Oct 2009, David Miller wrote:
>>
>>>>>> We are noticing a kernel OOPS on 2.6.26 kernel when we issue the command
>>>>>> "clear ip bgp <bgp-peer-ip>" on Quagga BGP routing software.
>
> and btw, this is the crash (on mips) we are talking about:
>
> # [23:10:35.108808] Kernel bug detected[#1]:
> [23:10:35.112527] Cpu 0
> [23:10:35.114676] $ 0 : 0000000000000000 0000000014001fe0
> 0000000000000066 0000000000000004
> [23:10:35.122845] $ 4 : ffffffff80516c10 0000000014001fe0
> ffffffff8050c010 0000000000000004
> [23:10:35.131015] $ 8 : 0000000000000000 0000000000000041
> ffffffff805142e8 0000000000000001
> [23:10:35.139184] $12 : ffffffff80600000 ffffffff805f0000
> 0000000000000064 0000000000000190
> [23:10:35.147354] $16 : 0000000000000102 ffffffff803afdf0
> ffffffff80539040 ffffffff80600780
> [23:10:35.155526] $20 : ffffffff80540000 0000000000200200
> ffffffff804c0000 000000000000000a
> [23:10:35.163695] $24 : a3d70a3d70a3d70b 8000000000000003
> [23:10:35.171865] $28 : ffffffff8050c000 ffffffff8050fd90
> 9000000010030000 ffffffff801487a8
> [23:10:35.180035] Hi : 0000000000000000
> [23:10:35.183819] Lo : 0000000000000000
> [23:10:35.187603] epc : ffffffff801487a8 run_timer_softirq+0x198/0x258
> Tainted: P
> [23:10:35.196032] ra : ffffffff801487a8 run_timer_softirq+0x198/0x258
> [23:10:35.202395] Status: 14001fe3 KX SX UX KERNEL EXL IE
> [23:10:35.207814] Cause : 00808024
> [23:10:35.210911] PrId : 01041100 (SiByte SB1A)
> [23:10:35.215209] Modules linked in: xt_state ipt_REJECT iptable_filter
> nf_conntrack_ftp ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4
> ip_tables ebtable_filter ebtables bridge llc zeug_ipmcdrv(P) irqdisp(P)
> zvirt(P) zeugmod(P) softdog
> [23:10:35.236024] Process swapper (pid: 0, threadinfo=ffffffff8050c000,
> task=ffffffff805142e8, tls=0000000000000000)
> [23:10:35.246169] Stack : ffffffff8050fd90 ffffffff8050fd90
> 0000000014001fe0 ffffffff805ff3e0
> [23:10:35.254166] ffffffff806003c4 0000000000000001
> ffffffff8053f650 ffffffff805706d0
> [23:10:35.262337] ffffffff80572020 ffffffff80142280
> ffffffff806003c0 0000000000000000
> [23:10:35.270507] 0000000014001fe0 000000000000c5b0
> ffffffff8fefc520 ffffffff8feea52c
> [23:10:35.278676] 0000000000000015 0000000000004460
> 0000000000000940 ffffffff8fe1bf00
> [23:10:35.286846] ffffffff8fffdab0 ffffffff80142410
> 0000000000000000 ffffffff80142778
> [23:10:35.295017] ffffffff80103d20 ffffffff80103d20
> 0000000000000000 0000000014001fe1
> [23:10:35.303187] 0000000000040000 ffffffff8050c010
> 0000000000000000 a80000017f87c138
> [23:10:35.311357] 0000000014001fe0 ffffffffffff00fe
> 0000000000000004 a80000017e7e0680
> [23:10:35.319528] 0000000000000000 000000000000001d
> ffffffff8050ffe0 0000000000001f00
> [23:10:35.327696] ...
> [23:10:35.330536] Call Trace:
> [23:10:35.333201] [<ffffffff801487a8>] run_timer_softirq+0x198/0x258
> [23:10:35.339224] [<ffffffff80142280>] __do_softirq+0x198/0x288
> [23:10:35.344812] [<ffffffff80142410>] do_softirq+0xa0/0xa8
> [23:10:35.350057] [<ffffffff80142778>] irq_exit+0x70/0x88
> [23:10:35.355131] [<ffffffff80103d20>] ret_from_irq+0x0/0x4
> [23:10:35.360377] [<ffffffff801063f4>] cpu_idle+0x1c/0x88
> [23:10:35.365455]
> [23:10:35.367171]
> [23:10:35.367174] Code: 0040382d 0c04ef4c 00000000 <0200000d> 0c10ee9c
> 0260202d dfa60000 17a6ffe5 00000000
> [23:10:35.378822] Kernel panic - not syncing: Fatal exception in
> interrupt
>

2009-10-19 12:18:02

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled

Hi Anirban,

On 10/18, Anirban Sinha wrote:
>
> I have a question for you. The queue_work() routine which is called from
> schedule_work() does a put_cpu() which in turn does a enable_preempt(). Is
> this an attempt to trigger the scheduler?

No. please note that queue_work() does get_cpu() + put_cpu() to protect
against cpu_down() in between.

This can trigger the scheduler of course, but everything should be OK.

> One of the side affects of
> this enable_preempt() is the crash that we see below. What is happening
> is that a timer callback routine, in this case inet_twdr_hangman(),
> tries a bunch of cleanup until a threshold is reached. If further cleanups
> needs to be done beyond the threshold, it queues a work function. Now when
> the timer callback is run in __run_timers(), the routine grabs the value
> of preempt_count before and after the callback function call. If the two
> counts do not match, it calls BUG() (line 1037 in kernel/timer.c).

Yes, but I can't see how queue_work() can be involved, it doesn't change
->preempt_count. Note again it does put after get.

> Is is
> it illegal to schedule a work function from within a timer callback?

Yes sure.

I'd suppose that this unbalance comes from inet_twdr_hangman() pathes.

Could you verify this?

Oleg.

2009-10-19 15:32:07

by Anirban Sinha

[permalink] [raw]
Subject: Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled



Once upon a time, like on 09-10-19 5:13 AM, Oleg Nesterov wrote:
> Hi Anirban,
>
> On 10/18, Anirban Sinha wrote:
>>
>> I have a question for you. The queue_work() routine which is called from
>> schedule_work() does a put_cpu() which in turn does a enable_preempt(). Is
>> this an attempt to trigger the scheduler?
>
> No. please note that queue_work() does get_cpu() + put_cpu() to protect
> against cpu_down() in between.

grrr! Ah yes, my eyes failed me (or it saw what I wanted it to see :)). You do have a get_cpu() and put_cpu() together in the same code path. I guess I will have to keep looking at inet_twdr_hangman().

>> Is is
>> it illegal to schedule a work function from within a timer callback?
>
> Yes sure.

hmm. may be in that case, that function needs to be re-written.

> I'd suppose that this unbalance comes from inet_twdr_hangman() pathes.
>
> Could you verify this?

I'll keep looking. Thanks for the help Oleg.

Ani

2009-10-19 15:40:52

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled

On 10/19, Anirban Sinha wrote:
>
> Once upon a time, like on 09-10-19 5:13 AM, Oleg Nesterov wrote:
>
> >> Is is
> >> it illegal to schedule a work function from within a timer callback?
> >
> > Yes sure.
>
> hmm. may be in that case, that function needs to be re-written.

OOPS!!!! I misread your question, didn't notice "il" above...

I meant: yes sure it _is legal_ to schedule a work from within a timer
callback (in fact it is legal from any context).

Sorry for confusion.

Oleg.

2009-10-19 16:06:20

by Anirban Sinha

[permalink] [raw]
Subject: Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled


> I meant: yes sure it _is legal_ to schedule a work from within a timer
> callback (in fact it is legal from any context).

ok, then that part of the function looks fine.

Ani

2009-10-20 00:56:05

by Anirban Sinha

[permalink] [raw]
Subject: Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled


> I'd suppose that this unbalance comes from inet_twdr_hangman() pathes.
>
> Could you verify this?


Yes, I have now verified this. There is indeed an issue with one of the
functions called by inet_twdr_hangman(). The call sequence is:

inet_twdr_hangman() -> inet_twdr_do_twkill_work() -> inet_twsk_put() ->
twsk_destructor().

In this case, the destructor callback is tcp_twsk_destructor() (installed
from line 1208 in net/ipv4/tcp_ipv4.c and line 906 in net/ipv6/tcp_ipv6.c) .
Without the TCP_MD5SUM compiled in, the function is a no-op. However, with the MD5SUM
compiled in, it calls tcp_put_md5_sig_pool() (when keylen is non zero) which
does an unbalanced put_cpu(). I did a grep across the entire tree.
tcp_put_md5_sig_pool() is a matching function for tcp_get_md5_sig_pool() and
in all other TCP IPV4 cases, it is called from net/ipv4/tcp_ipv4.c from
functions tcp_v4_md5_hash_hdr() and

2009-10-20 01:08:19

by Anirban Sinha

[permalink] [raw]
Subject: [PATCH] Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled



> I'd suppose that this unbalance comes from inet_twdr_hangman() pathes.
>
> Could you verify this?

Yes, I have now verified this. There is indeed an issue with one of the
functions called by inet_twdr_hangman(). The call sequence is:

inet_twdr_hangman() -> inet_twdr_do_twkill_work() -> inet_twsk_put() ->
twsk_destructor().

In this case, the destructor callback is tcp_twsk_destructor() (installed
from line 1208 in net/ipv4/tcp_ipv4.c and line 906 in net/ipv6/tcp_ipv6.c) .
Without the TCP_MD5SUM compiled in, the function is a no-op. However, with the MD5SUM
compiled in, it calls tcp_put_md5_sig_pool() (when keylen is non zero) which
does an unbalanced put_cpu(). I did a grep across the entire tree.
tcp_put_md5_sig_pool() is a matching function for tcp_get_md5_sig_pool() and
in all other TCP IPV4 cases, it is called from net/ipv4/tcp_ipv4.c from
functions tcp_v4_md5_hash_hdr() and tcp_v4_hash_skb() along with the matching
get()
function. So I would think that in tcp_twsk_destructor(), the call should be
replaced by tcp_free_md5_sig_pool() instead.

Signed-of-by: Anirban Sinha <[email protected]>
---

net/ipv4/tcp_minisocks.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index e48c37d..dccc01e 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -363,7 +363,7 @@ void tcp_twsk_destructor(struct sock *sk)
#ifdef CONFIG_TCP_MD5SIG
struct tcp_timewait_sock *twsk = tcp_twsk(sk);
if (twsk->tw_md5_keylen)
- tcp_put_md5sig_pool();
+ tcp_free_md5sig_pool();
#endif
}






2009-10-20 01:13:20

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled

From: Anirban Sinha <[email protected]>
Date: Mon, 19 Oct 2009 18:08:21 -0700 (PDT)

> @@ -363,7 +363,7 @@ void tcp_twsk_destructor(struct sock *sk)
> #ifdef CONFIG_TCP_MD5SIG
> struct tcp_timewait_sock *twsk = tcp_twsk(sk);
> if (twsk->tw_md5_keylen)
> - tcp_put_md5sig_pool();
> + tcp_free_md5sig_pool();
> #endif
> }

This has been fixed in the tree for a month of so:

commit 657e9649e745b06675aa5063c84430986cdc3afa
Author: Robert Varga <[email protected]>
Date: Tue Sep 15 23:49:21 2009 -0700

tcp: fix CONFIG_TCP_MD5SIG + CONFIG_PREEMPT timer BUG()

I have recently came across a preemption imbalance detected by:

<4>huh, entered ffffffff80644630 with preempt_count 00000102, exited with 00000101?
<0>------------[ cut here ]------------
<2>kernel BUG at /usr/src/linux/kernel/timer.c:664!
<0>invalid opcode: 0000 [1] PREEMPT SMP

with ffffffff80644630 being inet_twdr_hangman().

This appeared after I enabled CONFIG_TCP_MD5SIG and played with it a
bit, so I looked at what might have caused it.

One thing that struck me as strange is tcp_twsk_destructor(), as it
calls tcp_put_md5sig_pool() -- which entails a put_cpu(), causing the
detected imbalance. Found on 2.6.23.9, but 2.6.31 is affected as well,
as far as I can tell.

Signed-off-by: Robert Varga <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 045bcfd..624c3c9 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -363,7 +363,7 @@ void tcp_twsk_destructor(struct sock *sk)
#ifdef CONFIG_TCP_MD5SIG
struct tcp_timewait_sock *twsk = tcp_twsk(sk);
if (twsk->tw_md5_keylen)
- tcp_put_md5sig_pool();
+ tcp_free_md5sig_pool();
#endif
}

2009-10-20 01:17:51

by Anirban Sinha

[permalink] [raw]
Subject: Re: [PATCH] Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled


> This has been fixed in the tree for a month of so:

Grrrr! Time for me to do a git pull again. The kernel source tree in my work
machine must be out of date by about the same time.