2006-03-20 21:47:54

by Jesper Dangaard Brouer

Subject: Kernel panic: Route cache, RCU, possibly FIB trie.


Kernel panic report.

I have experienced some kernel panics on a production Linux box acting
as a router for a large number of customers.

I have tried to track down the problem, and I think I have narrowed it
down a bit. My theory is that it is related to the route cache
(ip_dst_cache) or FIB, which cannot deallocate route cache slab
elements (maybe RCU related). (I have seen my route cache increase to
around 520k entries, using rtstat, before dying.)

I'm using the FIB trie system/algorithm (CONFIG_IP_FIB_TRIE). I think
the error might be caused by the "fib_trie" code. See the syslog
output below.

Below are some kernel panic outputs from the console and some
interesting errors found in syslog.

Happy bug hunting...

Greetings
Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science, Cand. scient datalog
Dept. of Computer Science, University of Copenhagen
-------------------------------------------------------------------



Kernel version:
2.6.15.1 SMP PREEMPT

Patched with:
Netfilter patch-o-matic, layer7 filter and the kernel debugger (KDB).

System:
HP DL380 server with two Intel Xeon CPUs


Syslog#1 (indicating a problem with the fib trie)
--------
Mar 20 18:00:04 hostname kernel: Debug: sleeping function called from invalid context at mm/slab.c:2472
Mar 20 18:00:04 hostname kernel: in_atomic():1, irqs_disabled():0
Mar 20 18:00:04 hostname kernel: [<c0103d9f>] dump_stack+0x1e/0x22
Mar 20 18:00:04 hostname kernel: [<c011cbe1>] __might_sleep+0xa6/0xae
Mar 20 18:00:04 hostname kernel: [<c014f3e9>] __kmalloc+0xd9/0xf3
Mar 20 18:00:04 hostname kernel: [<c014f5a4>] kzalloc+0x23/0x50
Mar 20 18:00:04 hostname kernel: [<c030ecd1>] tnode_alloc+0x3c/0x82
Mar 20 18:00:04 hostname kernel: [<c030edf6>] tnode_new+0x26/0x91
Mar 20 18:00:04 hostname kernel: [<c030f757>] halve+0x43/0x31d
Mar 20 18:00:04 hostname kernel: [<c030f090>] resize+0x118/0x27e
Mar 20 18:00:04 hostname kernel: [<c030fca1>] trie_rebalance+0x90/0x115
Mar 20 18:00:04 hostname kernel: [<c0310a58>] trie_leaf_remove+0x123/0x1c7
Mar 20 18:00:04 hostname kernel: [<c031101d>] fn_trie_flush+0x5c/0x84
Mar 20 18:00:04 hostname kernel: [<c030bb06>] fib_flush+0x2a/0x4c
Mar 20 18:00:04 hostname kernel: [<c030c905>] fib_disable_ip+0x47/0x49
Mar 20 18:00:04 hostname kernel: [<c030c9fb>] fib_netdev_event+0x87/0x94
Mar 20 18:00:04 hostname kernel: [<c012d15a>] notifier_call_chain+0x27/0x3f
Mar 20 18:00:04 hostname kernel: [<c02c02f2>] dev_close+0x6f/0x8d
Mar 20 18:00:04 hostname kernel: [<c02c19e0>] dev_change_flags+0x56/0x127
Mar 20 18:00:04 hostname kernel: [<f8b60806>] vlan_device_event+0x16b/0x16d [8021q]
Mar 20 18:00:04 hostname kernel: [<c012d15a>] notifier_call_chain+0x27/0x3f
Mar 20 18:00:04 hostname kernel: [<c02c02f2>] dev_close+0x6f/0x8d
Mar 20 18:00:04 hostname kernel: [<c02c19e0>] dev_change_flags+0x56/0x127
Mar 20 18:00:04 hostname kernel: [<c0304b95>] devinet_ioctl+0x232/0x5a2
Mar 20 18:00:04 hostname kernel: [<c0306e75>] inet_ioctl+0x5a/0x97
Mar 20 18:00:04 hostname kernel: [<c02b6c4f>] sock_ioctl+0xb0/0x236
Mar 20 18:00:04 hostname kernel: [<c0178473>] do_ioctl+0x7b/0x8b
Mar 20 18:00:04 hostname kernel: [<c017861d>] vfs_ioctl+0x62/0x1c1
Mar 20 18:00:04 hostname kernel: [<c01787bd>] sys_ioctl+0x41/0x64
Mar 20 18:00:04 hostname kernel: [<c0102ef9>] syscall_call+0x7/0xb



Kernel panic#1
--------------
EIP is at _stext+0x3feffd68/0x49
eax: d9d07f00 ebx: 00000000 ecx: 00000100 edx: 00000000
esi: c190a1a0 edi: 00000010 ebp: f6d15f10 esp: f6d15efc
ds: 007b es: 007b ss: 0068
Process named (pid: 2122, threadinfo=f6d14000 task=f7e61570)
Stack: c012fc85 d9d07f00 c190a1a0 c036db80 c036dc80 f6d15f30 c012fe7b c190a1a0
c036dc80 c190a1a0 c190a1a0 c03fd020 c03f9130 f6d15f4c c012feff c036db80
c036dc80 c190a1a0 c190a1d8 00000000 f6d15f68 c0124578 00000000 c03f7380
Call Trace:
[<c0103cc7>] show_stack+0x80/0x96
[<c0103e60>] show_registers+0x161/0x1c5
[<c0104057>] die+0x107/0x186
[<c0116c5f>] do_page_fault+0x2c6/0x57d
[<c0103997>] error_code+0x4f/0x54
[<c012fe7b>] __rcu_process_callbacks+0xaa/0xd3
[<c012feff>] rcu_process_callbacks+0x5b/0x65
[<c0124578>] tasklet_action+0x77/0xc9
[<c01241f1>] __do_softirq+0xc1/0xd6
[<c0124251>] do_softirq+0x4b/0x4d
[<c012433b>] irq_exit+0x47/0x49
[<c010533b>] do_IRQ+0x2b/0x3b
[<c010383e>] common_interrupt+0x1a/0x20
Code: Bad EIP value.
<0>Kernel panic - not syncing: Fatal exception in interrupt


Kernel panic#2 with KDB
--------------
[<c0103f04>] show_registers+0x161/0x1c5
[<c010410b>] die+0x117/0x1b3
[<c01171eb>] do_page_fault+0x300/0x59c
[<c0103a1f>] error_code+0x4f/0x54
[<c0129344>] run_timer_softirq+0xd4/0x1c2
[<c0125115>] __do_softirq+0xc1/0xd6
[<c0125175>] do_softirq+0x4b/0x4d
[<c012525f>] irq_exit+0x47/0x49
[<c01139cc>] smp_apic_timer_interrupt+0x6f/0xee
[<c0103954>] apic_timer_interrupt+0x1c/0x24
[<c0100e10>] cpu_idle+0x89/0x9f
[<c01126fa>] start_secondary+0x17f/0x2ef
[<00000000>] 0x0
[<c1dbffb4>] 0xc1dbffb4
Code: 24 04 e8 6e 02 e5 ff 83 c4 28 5b 5e 5f 5d c3 8b 45 ec 25 ff 03 00 00 6b c0
14 03 05 28 3b 49 c0 e8 2c de 04 00 8b 1e 85 db 74 1b <8b> 43 20 85 c0 0f 84 a6
00 00 00 3b 45 e4 78 34 d1 6d e0 89 de

Entering kdb (current=0xc1db0aa0, pid 0) on processor 1 Oops: Oops
due to oops @ 0xc02d8a85
eax = 0x00000000 ebx = 0xccacef04 ecx = 0x00000000 edx = 0xccacff04
esi = 0xccacff04 edi = 0x0001dd08 esp = 0xc1dbfe7c eip = 0xc02d8a85
ebp = 0xc1dbfeb0 xss = 0xc02b0068 xcs = 0x00000060 eflags = 0x00010282
xds = 0xccac007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xc1dbfe48
[1]kdb>


Kernel panic#3 with KDB
--------------
(tried to increase the route cache flush interval...)

[<c01171eb>] do_page_fault+0x300/0x59c
[<c0103a1f>] error_code+0x4f/0x54
[<c02d8d07>] rt_cache_flush+0xf5/0xfe
[<c02d8d2b>] rt_secret_rebuild+0x1b/0x38
[<c0129344>] run_timer_softirq+0xd4/0x1c2
[<c0125115>] __do_softirq+0xc1/0xd6
[<c0125175>] do_softirq+0x4b/0x4d
[<c012525f>] irq_exit+0x47/0x49
[<c01139cc>] smp_apic_timer_interrupt+0x6f/0xee
[<c0103954>] apic_timer_interrupt+0x1c/0x24
[<c0100e10>] cpu_idle+0x89/0x9f
[<c01002dd>] _stext+0x45/0x49
[<c041aa3c>] start_kernel+0x1b5/0x1f6
[<c0100210>] 0xc0100210
Code: 01 f0 e8 40 de 04 00 a1 2c 3b 49 c0 8b 1c b8 85 db 74 07 c7 04 b8 00 00 00
00 03 35 28 3b 49 c0 89 f0 e8 d8 e1 04 00 85 db 74 37 <0f> b7 83 92 00 00 00 8b
33 8b 04 85 50 58 49 c0 85 c0 74 0d 8b

Entering kdb (current=0xc0392ba0, pid 0) on processor 0 Oops: Oops
due to oops @ 0xc02d8bc6
eax = 0xd8b62f84 ebx = 0xd9d75f04 ecx = 0x00000100 edx = 0xc02dda2e
esi = 0xd9d75f04 edi = 0x000077fc esp = 0xc0419e94 eip = 0xc02d8bc6
ebp = 0xc0419ea8 xss = 0xc02b0068 xcs = 0x00000060 eflags = 0x00010282
xds = 0xc02d007b xes = 0xc010007b origeax = 0xffffffff &regs = 0xc0419e60
[0]kdb>


[0]kdb> bt

Stack traceback for pid 0
0xc0392ba0 0 0 1 0 R 0xc0392d80 *swapper
EBP EIP Function (args)
0xc0419ea8 0xc02d8bc6 rt_run_flush+0x6b (0x0, 0xc0394ce4, 0x0, 0xc42ee, 0x102)
0xc0419ec8 0xc02d8d07 rt_cache_flush+0xf5 (0x0, 0xc0326d81, 0x0)
0xc0419edc 0xc02d8d2b rt_secret_rebuild+0x1b (0x0, 0xc1c0c520, 0x0, 0xc42ee, 0xc
0419f00)
0xc0419f18 0xc0129344 run_timer_softirq+0xd4 (0xc040f308, 0x0, 0xa, 0xc1c0b2e0,
0x46)
0xc0419f38 0xc0125115 __do_softirq+0xc1 (0xc1c0b2e0, 0x0)
0xc0419f48 0xc0125175 do_softirq+0x4b (0xc044d98c)
0xc0419f54 0xc012525f irq_exit+0x47 (0x0, 0xc0419f7c, 0xc044d988, 0xc0100d19, 0x
c0418000)
0xc0419f74 0xc01139cc smp_apic_timer_interrupt+0x6f (0xc0100d19, 0x1, 0xc0418000
, 0xc0418000, 0xc044d380)
0xc0419fb0 0xc0103954 apic_timer_interrupt+0x1c (0x602080b, 0x9f300, 0xc040c800)
0xc0100e10 cpu_idle+0x89
0xc0419fd8 0xc01002dd _stext+0x45


2006-03-20 22:11:45

by Dipankar Sarma

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.

On Mon, Mar 20, 2006 at 10:44:21PM +0100, Jesper Dangaard Brouer wrote:
>
> Kernel panic report.
>
> Have experienced some kernel panic's on a production Linux box acting
> as a router for a large number of customers.
>
> I have tried to track down the problem, and I think I have narrowed it
> a bit down. My theory is that it is related to the route cache
> (ip_dst_cache) or FIB, which cannot dealloacate route cache slab
> elements (maybe RCU related). (I have seen my route cache increase to
> around 520k entries using rtstat, before dying).
>
> I'm using the FIB trie system/algorithm (CONFIG_IP_FIB_TRIE). Think
> that the error might be cause by the "fib_trie" code. See the syslog,
> output below.
>
> Below are some kernel panic outputs from the console and some
> interesting errors found in syslog.
>
> Kernel panic#1
> --------------
> EIP is at _stext+0x3feffd68/0x49
> c03f7380
> Call Trace:
> [<c0103cc7>] show_stack+0x80/0x96
> [<c0103e60>] show_registers+0x161/0x1c5
> [<c0104057>] die+0x107/0x186
> [<c0116c5f>] do_page_fault+0x2c6/0x57d
> [<c0103997>] error_code+0x4f/0x54
> [<c012fe7b>] __rcu_process_callbacks+0xaa/0xd3
> [<c012feff>] rcu_process_callbacks+0x5b/0x65
> [<c0124578>] tasklet_action+0x77/0xc9
> [<c01241f1>] __do_softirq+0xc1/0xd6
> [<c0124251>] do_softirq+0x4b/0x4d
> [<c012433b>] irq_exit+0x47/0x49
> [<c010533b>] do_IRQ+0x2b/0x3b
> [<c010383e>] common_interrupt+0x1a/0x20
> Code: Bad EIP value.
> <0>Kernel panic - not syncing: Fatal exception in interrupt

A bad EIP while processing an RCU callback often indicates that the object
that embeds the rcu_head has already been freed. Can you enable
slab debugging and see if this can be detected there in a different
path?

Thanks
Dipankar

2006-03-21 10:32:36

by Jesper Dangaard Brouer

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.


On Tue, 21 Mar 2006, Dipankar Sarma wrote:

> On Mon, Mar 20, 2006 at 10:44:21PM +0100, Jesper Dangaard Brouer wrote:
>>
>> Kernel panic report.
>>
>> Have experienced some kernel panic's on a production Linux box acting
>> as a router for a large number of customers.
>>
>> I have tried to track down the problem, and I think I have narrowed it
>> a bit down. My theory is that it is related to the route cache
>> (ip_dst_cache) or FIB, which cannot dealloacate route cache slab
>> elements (maybe RCU related). (I have seen my route cache increase to
>> around 520k entries using rtstat, before dying).
>>
>> I'm using the FIB trie system/algorithm (CONFIG_IP_FIB_TRIE). Think
>> that the error might be cause by the "fib_trie" code. See the syslog,
>> output below.
>>
>> Below are some kernel panic outputs from the console and some
>> interesting errors found in syslog.
>>
>> Kernel panic#1
>> --------------
>> EIP is at _stext+0x3feffd68/0x49
>> c03f7380
>> Call Trace:
>> [<c0103cc7>] show_stack+0x80/0x96
>> [<c0103e60>] show_registers+0x161/0x1c5
>> [<c0104057>] die+0x107/0x186
>> [<c0116c5f>] do_page_fault+0x2c6/0x57d
>> [<c0103997>] error_code+0x4f/0x54
>> [<c012fe7b>] __rcu_process_callbacks+0xaa/0xd3
>> [<c012feff>] rcu_process_callbacks+0x5b/0x65
>> [<c0124578>] tasklet_action+0x77/0xc9
>> [<c01241f1>] __do_softirq+0xc1/0xd6
>> [<c0124251>] do_softirq+0x4b/0x4d
>> [<c012433b>] irq_exit+0x47/0x49
>> [<c010533b>] do_IRQ+0x2b/0x3b
>> [<c010383e>] common_interrupt+0x1a/0x20
>> Code: Bad EIP value.
>> <0>Kernel panic - not syncing: Fatal exception in interrupt
>
> Bad eip in processing rcu callback often indicates that the object
> that embeds the rcu_head has already been freed. Can you enable
> slab debugging and see if this can be detected there in a different
> path ?

I have enabled slab debugging (CONFIG_DEBUG_SLAB=y) on my kernel with the
KDB patch applied (and all of the other debugging options I could find).

If you look (in my original mail and below) at the output from "Kernel
panic#3", there is a different code path which also fails, related to
route cache flushing.


Do you have an explanation of the syslog listing, showing yet another
code path sleeping/failing, in the fib_trie code? (I have recorded 12
of these.)

Regards
Jesper Brouer

--
-------------------------------------------------------------------
Cand. scient datalog
Dept. of Computer Science, University of Copenhagen
-------------------------------------------------------------------


Syslog#1 (indicating a problem with the fib trie)
--------
Mar 20 18:00:04 hostname kernel: Debug: sleeping function called from invalid context at mm/slab.c:2472
Mar 20 18:00:04 hostname kernel: in_atomic():1, irqs_disabled():0
Mar 20 18:00:04 hostname kernel: [<c0103d9f>] dump_stack+0x1e/0x22
Mar 20 18:00:04 hostname kernel: [<c011cbe1>] __might_sleep+0xa6/0xae
Mar 20 18:00:04 hostname kernel: [<c014f3e9>] __kmalloc+0xd9/0xf3
Mar 20 18:00:04 hostname kernel: [<c014f5a4>] kzalloc+0x23/0x50
Mar 20 18:00:04 hostname kernel: [<c030ecd1>] tnode_alloc+0x3c/0x82
Mar 20 18:00:04 hostname kernel: [<c030edf6>] tnode_new+0x26/0x91
Mar 20 18:00:04 hostname kernel: [<c030f757>] halve+0x43/0x31d
Mar 20 18:00:04 hostname kernel: [<c030f090>] resize+0x118/0x27e
Mar 20 18:00:04 hostname kernel: [<c030fca1>] trie_rebalance+0x90/0x115
Mar 20 18:00:04 hostname kernel: [<c0310a58>] trie_leaf_remove+0x123/0x1c7
Mar 20 18:00:04 hostname kernel: [<c031101d>] fn_trie_flush+0x5c/0x84
Mar 20 18:00:04 hostname kernel: [<c030bb06>] fib_flush+0x2a/0x4c
Mar 20 18:00:04 hostname kernel: [<c030c905>] fib_disable_ip+0x47/0x49
Mar 20 18:00:04 hostname kernel: [<c030c9fb>] fib_netdev_event+0x87/0x94
Mar 20 18:00:04 hostname kernel: [<c012d15a>] notifier_call_chain+0x27/0x3f
Mar 20 18:00:04 hostname kernel: [<c02c02f2>] dev_close+0x6f/0x8d
Mar 20 18:00:04 hostname kernel: [<c02c19e0>] dev_change_flags+0x56/0x127
Mar 20 18:00:04 hostname kernel: [<f8b60806>] vlan_device_event+0x16b/0x16d [8021q]
Mar 20 18:00:04 hostname kernel: [<c012d15a>] notifier_call_chain+0x27/0x3f
Mar 20 18:00:04 hostname kernel: [<c02c02f2>] dev_close+0x6f/0x8d
Mar 20 18:00:04 hostname kernel: [<c02c19e0>] dev_change_flags+0x56/0x127
Mar 20 18:00:04 hostname kernel: [<c0304b95>] devinet_ioctl+0x232/0x5a2
Mar 20 18:00:04 hostname kernel: [<c0306e75>] inet_ioctl+0x5a/0x97
Mar 20 18:00:04 hostname kernel: [<c02b6c4f>] sock_ioctl+0xb0/0x236
Mar 20 18:00:04 hostname kernel: [<c0178473>] do_ioctl+0x7b/0x8b
Mar 20 18:00:04 hostname kernel: [<c017861d>] vfs_ioctl+0x62/0x1c1
Mar 20 18:00:04 hostname kernel: [<c01787bd>] sys_ioctl+0x41/0x64
Mar 20 18:00:04 hostname kernel: [<c0102ef9>] syscall_call+0x7/0xb



Kernel panic#3 with KDB
--------------
[<c01171eb>] do_page_fault+0x300/0x59c
[<c0103a1f>] error_code+0x4f/0x54
[<c02d8d07>] rt_cache_flush+0xf5/0xfe
[<c02d8d2b>] rt_secret_rebuild+0x1b/0x38
[<c0129344>] run_timer_softirq+0xd4/0x1c2
[<c0125115>] __do_softirq+0xc1/0xd6
[<c0125175>] do_softirq+0x4b/0x4d
[<c012525f>] irq_exit+0x47/0x49
[<c01139cc>] smp_apic_timer_interrupt+0x6f/0xee
[<c0103954>] apic_timer_interrupt+0x1c/0x24
[<c0100e10>] cpu_idle+0x89/0x9f
[<c01002dd>] _stext+0x45/0x49
[<c041aa3c>] start_kernel+0x1b5/0x1f6
[<c0100210>] 0xc0100210
Code: 01 f0 e8 40 de 04 00 a1 2c 3b 49 c0 8b 1c b8 85 db 74 07 c7 04 b8 00 00 00
00 03 35 28 3b 49 c0 89 f0 e8 d8 e1 04 00 85 db 74 37 <0f> b7 83 92 00 00 00 8b
33 8b 04 85 50 58 49 c0 85 c0 74 0d 8b

Entering kdb (current=0xc0392ba0, pid 0) on processor 0 Oops: Oops
due to oops @ 0xc02d8bc6
eax = 0xd8b62f84 ebx = 0xd9d75f04 ecx = 0x00000100 edx = 0xc02dda2e
esi = 0xd9d75f04 edi = 0x000077fc esp = 0xc0419e94 eip = 0xc02d8bc6
ebp = 0xc0419ea8 xss = 0xc02b0068 xcs = 0x00000060 eflags = 0x00010282
xds = 0xc02d007b xes = 0xc010007b origeax = 0xffffffff &regs = 0xc0419e60
[0]kdb>


[0]kdb> bt

Stack traceback for pid 0
0xc0392ba0 0 0 1 0 R 0xc0392d80 *swapper
EBP EIP Function (args)
0xc0419ea8 0xc02d8bc6 rt_run_flush+0x6b (0x0, 0xc0394ce4, 0x0, 0xc42ee, 0x102)
0xc0419ec8 0xc02d8d07 rt_cache_flush+0xf5 (0x0, 0xc0326d81, 0x0)
0xc0419edc 0xc02d8d2b rt_secret_rebuild+0x1b (0x0, 0xc1c0c520, 0x0, 0xc42ee, 0xc0419f00)
0xc0419f18 0xc0129344 run_timer_softirq+0xd4 (0xc040f308, 0x0, 0xa, 0xc1c0b2e0,0x46)
0xc0419f38 0xc0125115 __do_softirq+0xc1 (0xc1c0b2e0, 0x0)
0xc0419f48 0xc0125175 do_softirq+0x4b (0xc044d98c)
0xc0419f54 0xc012525f irq_exit+0x47 (0x0, 0xc0419f7c, 0xc044d988, 0xc0100d19, 0xc0418000)
0xc0419f74 0xc01139cc smp_apic_timer_interrupt+0x6f (0xc0100d19, 0x1, 0xc0418000, 0xc0418000, 0xc044d380)
0xc0419fb0 0xc0103954 apic_timer_interrupt+0x1c (0x602080b, 0x9f300, 0xc040c800)
0xc0100e10 cpu_idle+0x89
0xc0419fd8 0xc01002dd _stext+0x45


2006-03-21 10:37:17

by David Miller

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.

From: Jesper Dangaard Brouer <[email protected]>
Date: Tue, 21 Mar 2006 11:29:16 +0100 (CET)

> Do you have an explaination of the syslog listing, showing yet another
> code path sleeping/failing??. In the fib_trie code. (I have recorded 12
> of these).

If you happen to have the IP_ROUTE_MULTIPATH_CACHED config option
enabled in your kernels, please turn it off and retest.

2006-03-21 13:45:54

by Robert Olsson

Subject: Kernel panic: Route cache, RCU, possibly FIB trie.


Jesper Dangaard Brouer writes:

> I have tried to track down the problem, and I think I have narrowed it
> a bit down. My theory is that it is related to the route cache
> (ip_dst_cache) or FIB, which cannot dealloacate route cache slab
> elements (maybe RCU related). (I have seen my route cache increase to
> around 520k entries using rtstat, before dying).
>
> I'm using the FIB trie system/algorithm (CONFIG_IP_FIB_TRIE). Think
> that the error might be cause by the "fib_trie" code. See the syslog,
> output below.

> Syslog#1 (indicating a problem with the fib trie)
> --------
> Mar 20 18:00:04 hostname kernel: Debug: sleeping function called from invalid context at mm/slab.c:2472
> Mar 20 18:00:04 hostname kernel: in_atomic():1, irqs_disabled():0
> Mar 20 18:00:04 hostname kernel: [<c0103d9f>] dump_stack+0x1e/0x22
> Mar 20 18:00:04 hostname kernel: [<c011cbe1>] __might_sleep+0xa6/0xae
> Mar 20 18:00:04 hostname kernel: [<c014f3e9>] __kmalloc+0xd9/0xf3
> Mar 20 18:00:04 hostname kernel: [<c014f5a4>] kzalloc+0x23/0x50
> Mar 20 18:00:04 hostname kernel: [<c030ecd1>] tnode_alloc+0x3c/0x82
> Mar 20 18:00:04 hostname kernel: [<c030edf6>] tnode_new+0x26/0x91
> Mar 20 18:00:04 hostname kernel: [<c030f757>] halve+0x43/0x31d
> Mar 20 18:00:04 hostname kernel: [<c030f090>] resize+0x118/0x27e

Hello!

Out of memory? Running BGP with full routing? And a large number of flows?
What's your normal number of route cache entries? And how much memory do
you have?

From your report the problems seem to be related to flushing, either rt_cache_flush
or fib_flush (before there was dev_close()?), so all associated entries should
be freed. All the entries are freed via RCU, which due to the deferred delete
can give very high transient memory pressure. If we believe it's a memory problem
we can try something out...

Cheers.
--ro


2006-03-21 15:28:53

by Jesper Dangaard Brouer

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.


On Tue, 21 Mar 2006, Robert Olsson wrote:

> Jesper Dangaard Brouer writes:
>
> > I have tried to track down the problem, and I think I have narrowed it
> > a bit down. My theory is that it is related to the route cache
> > (ip_dst_cache) or FIB, which cannot dealloacate route cache slab
> > elements (maybe RCU related). (I have seen my route cache increase to
> > around 520k entries using rtstat, before dying).
> >
> > I'm using the FIB trie system/algorithm (CONFIG_IP_FIB_TRIE). Think
> > that the error might be cause by the "fib_trie" code. See the syslog,
> > output below.
>
> > Syslog#1 (indicating a problem with the fib trie)
> > --------
> > Mar 20 18:00:04 hostname kernel: Debug: sleeping function called from invalid context at mm/slab.c:2472
> > Mar 20 18:00:04 hostname kernel: in_atomic():1, irqs_disabled():0
> > Mar 20 18:00:04 hostname kernel: [<c0103d9f>] dump_stack+0x1e/0x22
> > Mar 20 18:00:04 hostname kernel: [<c011cbe1>] __might_sleep+0xa6/0xae
> > Mar 20 18:00:04 hostname kernel: [<c014f3e9>] __kmalloc+0xd9/0xf3
> > Mar 20 18:00:04 hostname kernel: [<c014f5a4>] kzalloc+0x23/0x50
> > Mar 20 18:00:04 hostname kernel: [<c030ecd1>] tnode_alloc+0x3c/0x82
> > Mar 20 18:00:04 hostname kernel: [<c030edf6>] tnode_new+0x26/0x91
> > Mar 20 18:00:04 hostname kernel: [<c030f757>] halve+0x43/0x31d
> > Mar 20 18:00:04 hostname kernel: [<c030f090>] resize+0x118/0x27e
>
> Hello!
>
> Out of memory?
One of the crashes was caused by out of memory, but all the memory was
allocated through slab, more specifically to the ip_dst_cache.

> Running BGP with full routing?
No, running OSPF with around 760 subnets.

> And large number of flows.
Yes, very large number of flows.

> Whats your normal number of entries route cache?
On this machine, right now, between 14000 and 60000 entries in the route
cache. On other machines, right now, I have a max of 151560 entries.

> And how much memory do you have?
This machine (and 4 others) has 1GB of memory; most of the machines have 2GB.


> From your report problems seems to related to flushing either rt_cache_flush
> or fib_flush (before there was dev_close()?) so all associated entries should
> freed. All the entries are freed via RCU which due to the deferred delete
> can give a very high transient memory pressure. If we believe it's memory problem
> we can try something out...

There is definitely high memory pressure on this machine!
Slab memory usage ranges from 39MB to 205MB (at the moment on the production servers).

Regards
Jesper Brouer

--
-------------------------------------------------------------------
Cand. scient datalog
Dept. of Computer Science, University of Copenhagen
-------------------------------------------------------------------

2006-03-21 15:37:38

by Eric Dumazet

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.

Jesper Dangaard Brouer wrote:
>
> On Tue, 21 Mar 2006, Robert Olsson wrote:
>
>> Jesper Dangaard Brouer writes:
>>
>> > I have tried to track down the problem, and I think I have narrowed it
>> > a bit down. My theory is that it is related to the route cache
>> > (ip_dst_cache) or FIB, which cannot dealloacate route cache slab
>> > elements (maybe RCU related). (I have seen my route cache increase to
>> > around 520k entries using rtstat, before dying).
>> >
>> > I'm using the FIB trie system/algorithm (CONFIG_IP_FIB_TRIE). Think
>> > that the error might be cause by the "fib_trie" code. See the syslog,
>> > output below.
>>
>> > Syslog#1 (indicating a problem with the fib trie)
>> > --------
>> > Mar 20 18:00:04 hostname kernel: Debug: sleeping function called
>> from invalid context at mm/slab.c:2472
>> > Mar 20 18:00:04 hostname kernel: in_atomic():1, irqs_disabled():0
>> > Mar 20 18:00:04 hostname kernel: [<c0103d9f>] dump_stack+0x1e/0x22
>> > Mar 20 18:00:04 hostname kernel: [<c011cbe1>] __might_sleep+0xa6/0xae
>> > Mar 20 18:00:04 hostname kernel: [<c014f3e9>] __kmalloc+0xd9/0xf3
>> > Mar 20 18:00:04 hostname kernel: [<c014f5a4>] kzalloc+0x23/0x50
>> > Mar 20 18:00:04 hostname kernel: [<c030ecd1>] tnode_alloc+0x3c/0x82
>> > Mar 20 18:00:04 hostname kernel: [<c030edf6>] tnode_new+0x26/0x91
>> > Mar 20 18:00:04 hostname kernel: [<c030f757>] halve+0x43/0x31d
>> > Mar 20 18:00:04 hostname kernel: [<c030f090>] resize+0x118/0x27e
>>
>> Hello!
>>
>> Out of memory?
> One of the crashed was caused by out of memory, but all the memory was
> allocated through slab. More specifically to ip_dst_cache.
>
>> Running BGP with full routing?
> No, running OSPF with around 760 subnets.
>
>> And large number of flows.
> Yes, very large number of flows.
>
>> Whats your normal number of entries route cache?
> On this machine, rigth now, between 14000 to 60000 entries in the route
> cache. On other machines, rigth now, I have a max of 151560 entries.
>
>> And how much memory do you have?
> On this machine 1Gb memory (and 4 others), most of the machines have 2Gb.
>
>
>> From your report problems seems to related to flushing either
>> rt_cache_flush
>> or fib_flush (before there was dev_close()?) so all associated entries
>> should
>> freed. All the entries are freed via RCU which due to the deferred delete
>> can give a very high transient memory pressure. If we believe it's
>> memory problem
>> we can try something out...
>
> There is definitly high memory pressure on this machine!
> Slab memory usage, range from 39Mb to 205Mb (at the moment on the
> production servers).
>

Have you tried 2.6.16?

It contains changes in kernel/rcupdate.c so that not too many RCU elems are
queued (force_quiescent_state()). So when a rt_cache_flush is done, you
have the guarantee that not all entries are pushed into RCU at once.

Eric

2006-03-21 15:45:44

by Jesper Dangaard Brouer

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.


On Tue, 21 Mar 2006, David S. Miller wrote:

> From: Jesper Dangaard Brouer <[email protected]>
> Date: Tue, 21 Mar 2006 11:29:16 +0100 (CET)
>
>> Do you have an explaination of the syslog listing, showing yet another
>> code path sleeping/failing??. In the fib_trie code. (I have recorded 12
>> of these).
>
> If you happen to have the IP_ROUTE_MULTIPATH_CACHED config option
> enabled in your kernels, please turn it off and retest.

You guessed right... I did enable IP_ROUTE_MULTIPATH_CACHED; I have now
disabled it, and equal cost multipath routing in general (CONFIG_IP_ROUTE_MULTIPATH).

As it is a production server, I'll need to schedule a reboot. The server
has now been running a 2.4.26 kernel for 16 hours without any kernel
panics, and the ip_dst_cache can grow and shrink.

cat /proc/slabinfo | grep ip_dst_cache
ip_dst_cache 15300 56145 256 1022 3743 1

Regards
Jesper Brouer

--
-------------------------------------------------------------------
Cand. scient datalog
Dept. of Computer Science, University of Copenhagen
-------------------------------------------------------------------

2006-03-21 15:45:04

by Dipankar Sarma

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.

On Tue, Mar 21, 2006 at 04:37:19PM +0100, Eric Dumazet wrote:
> Jesper Dangaard Brouer wrote:
> >There is definitly high memory pressure on this machine!
> >Slab memory usage, range from 39Mb to 205Mb (at the moment on the
> >production servers).
> >
>
> Did you tried 2.6.16 ?
>
> It contains changes in kernel/rcupdate.c so that not too many RCU elems are
> queued (force_quiescent_state()). So in the case a rt_cache_flush is done,
> you have the guarantee all entries are not pushed into rcu at once.

Well, memory pressure or not, the oopses shouldn't be happening :)
Perhaps we should look at them before we work around memory
pressure through the RCU batch tuning stuff in 2.6.16?

One of the oopses looked like the RCU callback function pointer
getting corrupted, indicating that the object was double-freed or
that there is a problem with RCU itself.

Thanks
Dipankar

2006-03-21 16:31:51

by Eric Dumazet

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.

Dipankar Sarma wrote:
> On Tue, Mar 21, 2006 at 04:37:19PM +0100, Eric Dumazet wrote:
>> Jesper Dangaard Brouer wrote:
>>> There is definitly high memory pressure on this machine!
>>> Slab memory usage, range from 39Mb to 205Mb (at the moment on the
>>> production servers).
>>>
>> Did you tried 2.6.16 ?
>>
>> It contains changes in kernel/rcupdate.c so that not too many RCU elems are
>> queued (force_quiescent_state()). So in the case a rt_cache_flush is done,
>> you have the guarantee all entries are not pushed into rcu at once.
>
> Well, memory pressure or not, the oopses shouldn't be happening :)
> Perhaps we should look at them before we work around memory
> pressure through the rcu batch tuning stuff in 2.6.16 ?
>
> One of the oopses looked like the rcu callback function pointer
> getting corrupted indicating that it was double freed or
> problem with RCU itself.
>

Yep, but as this is a production server, I believe its owner might want a fast
way to bring it back to life :)

And the RCU change in 2.6.16 is definitely a big improvement when a dump of
a million entries is done :)

If the RCU quiescent state is forced, maybe the bug (in the route cache code)
will trigger faster?
Eric

2006-03-21 21:25:21

by David Miller

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.

From: Jesper Dangaard Brouer <[email protected]>
Date: Tue, 21 Mar 2006 15:51:34 +0100 (CET)

> You guessed right... I did enable IP_ROUTE_MULTIPATH_CACHED, I have
> now disabled it and equal multi path routing in general
> (CONFIG_IP_ROUTE_MULTIPATH).

It is almost certainly the cause of your crashes, that code
is still extremely raw and that's why it is listed as "EXPERIMENTAL".

2006-03-23 15:41:39

by Jesper Dangaard Brouer

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.


On Tue, 21 Mar 2006, David S. Miller wrote:

> From: Jesper Dangaard Brouer <[email protected]>
> Date: Tue, 21 Mar 2006 15:51:34 +0100 (CET)
>
>> You guessed right... I did enable IP_ROUTE_MULTIPATH_CACHED, I have
>> now disabled it and equal multi path routing in general
>> (CONFIG_IP_ROUTE_MULTIPATH).
>
> It is almost certainly the cause of your crashes, that code
> is still extremely raw and that's why it is listed as "EXPERIMENTAL".

It seems you are right :-) (and I'll be more careful about using experimental
code in production again). The machine has now been running for 34 hours
without crashing. The strange thing is that I'm running the same kernel
on 30 other (similar) machines, which have not crashed. (I suspect the
specific traffic load pattern influences this.)


BUT, I think I have noticed another problem in the garbage collection
code (route.c) that causes the garbage collector to (almost) never
garbage collect.

This is caused by the value "ip_rt_max_size" (/proc/sys/net/ipv4/route/max_size)
being set too large. It is set to 16 times the gc_thresh value (which
depends on the memory size). In the garbage collection function
(rt_garbage_collect), garbage collection is skipped (counted as gc_ignored)
if the number of entries is below "ip_rt_max_size".

With 1GB of memory, gc_thresh=65536, and 65536 times 16 is 1048576. Which
means that we only start to garbage collect when there are more than 1
million entries. This seems wrong... (the reason the cache does not grow
this large is the 600 second periodic flushes).


Regards
Jesper Brouer

--
-------------------------------------------------------------------
Cand. scient datalog
Dept. of Computer Science, University of Copenhagen
-------------------------------------------------------------------


grep . /proc/sys/net/ipv4/route/*
/proc/sys/net/ipv4/route/error_burst:5000
/proc/sys/net/ipv4/route/error_cost:1000
grep: /proc/sys/net/ipv4/route/flush: Operation not permitted
/proc/sys/net/ipv4/route/gc_elasticity:8
/proc/sys/net/ipv4/route/gc_interval:60
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:65536
/proc/sys/net/ipv4/route/gc_timeout:300
/proc/sys/net/ipv4/route/max_delay:10
/proc/sys/net/ipv4/route/max_size:1048576
/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_delay:2
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600
/proc/sys/net/ipv4/route/redirect_load:20
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:20480
/proc/sys/net/ipv4/route/secret_interval:600

2006-03-23 15:51:29

by Jesper Dangaard Brouer

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.


On Thu, 23 Mar 2006, Jesper Dangaard Brouer wrote:

> On Tue, 21 Mar 2006, David S. Miller wrote:
>
>> From: Jesper Dangaard Brouer <[email protected]>
>> Date: Tue, 21 Mar 2006 15:51:34 +0100 (CET)
>>
>>> You guessed right... I did enable IP_ROUTE_MULTIPATH_CACHED, I have
>>> now disabled it and equal multi path routing in general
>>> (CONFIG_IP_ROUTE_MULTIPATH).
>>
>> It is almost certainly the cause of your crashes, that code
>> is still extremely raw and that's why it is listed as "EXPERIMENTAL".
>
> It seems you are right :-) (and I'll be more careful about using
> experimental code in production again). The machine has now been running
> for 34 hours without crashing. The strange thing is that I'm running the
> same kernel on 30 other (similar) machines, which have not crashed. (I do
> suspect the specific traffic load pattern influences this.)

Argh!! -- nemesis!!! The machine just died again...
The machine did not crash, it just ran out of memory and killed too many
important processes. I had to power cycle it... :-((( I could still ping
it...

I can see that the traffic pattern has changed and the route cache is
growing rapidly...


> BUT, I do think I have noticed another problem in the garbage collection code
> (route.c), which causes the garbage collector to (almost) never garbage
> collect.
>
> This is caused by the value "ip_rt_max_size"
> (/proc/sys/net/ipv4/route/max_size)
> being set too large. It is set to 16 times the gc_thresh value (this size
> depends on the memory size). In the garbage collection function
> (rt_garbage_collect), garbage collection requests are ignored (gc_ignored) if
> the number of entries is below "ip_rt_max_size".
>
> With 1 GB of memory, gc_thresh=65536 times 16 is 1048576, which means that we
> only start to garbage collect when there are more than 1 million entries. This
> seems wrong... (the reason the cache does not grow this large is the 600
> second periodic flushes).
>
>
> Hilsen
> Jesper Brouer
>
> --
> -------------------------------------------------------------------
> Cand. scient datalog
> Dept. of Computer Science, University of Copenhagen
> -------------------------------------------------------------------
>
>
> grep . /proc/sys/net/ipv4/route/*
> /proc/sys/net/ipv4/route/error_burst:5000
> /proc/sys/net/ipv4/route/error_cost:1000
> grep: /proc/sys/net/ipv4/route/flush: Operation not permitted
> /proc/sys/net/ipv4/route/gc_elasticity:8
> /proc/sys/net/ipv4/route/gc_interval:60
> /proc/sys/net/ipv4/route/gc_min_interval:0
> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
> /proc/sys/net/ipv4/route/gc_thresh:65536
> /proc/sys/net/ipv4/route/gc_timeout:300
> /proc/sys/net/ipv4/route/max_delay:10
> /proc/sys/net/ipv4/route/max_size:1048576
> /proc/sys/net/ipv4/route/min_adv_mss:256
> /proc/sys/net/ipv4/route/min_delay:2
> /proc/sys/net/ipv4/route/min_pmtu:552
> /proc/sys/net/ipv4/route/mtu_expires:600
> /proc/sys/net/ipv4/route/redirect_load:20
> /proc/sys/net/ipv4/route/redirect_number:9
> /proc/sys/net/ipv4/route/redirect_silence:20480
> /proc/sys/net/ipv4/route/secret_interval:600
>
>

Hilsen
Jesper Brouer

--
-------------------------------------------------------------------
Cand. scient datalog
Dept. of Computer Science, University of Copenhagen
-------------------------------------------------------------------

2006-03-23 16:16:19

by Eric Dumazet

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.

Jesper Dangaard Brouer a écrit :

>> grep . /proc/sys/net/ipv4/route/*
>> /proc/sys/net/ipv4/route/error_burst:5000
>> /proc/sys/net/ipv4/route/error_cost:1000
>> grep: /proc/sys/net/ipv4/route/flush: Operation not permitted
>> /proc/sys/net/ipv4/route/gc_elasticity:8
>> /proc/sys/net/ipv4/route/gc_interval:60
>> /proc/sys/net/ipv4/route/gc_min_interval:0
>> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
>> /proc/sys/net/ipv4/route/gc_thresh:65536
>> /proc/sys/net/ipv4/route/gc_timeout:300
>> /proc/sys/net/ipv4/route/max_delay:10
>> /proc/sys/net/ipv4/route/max_size:1048576
>> /proc/sys/net/ipv4/route/min_adv_mss:256
>> /proc/sys/net/ipv4/route/min_delay:2
>> /proc/sys/net/ipv4/route/min_pmtu:552
>> /proc/sys/net/ipv4/route/mtu_expires:600
>> /proc/sys/net/ipv4/route/redirect_load:20
>> /proc/sys/net/ipv4/route/redirect_number:9
>> /proc/sys/net/ipv4/route/redirect_silence:20480
>> /proc/sys/net/ipv4/route/secret_interval:600

I would say : Change the settings :)

echo 2 > /proc/sys/net/ipv4/route/gc_elasticity
echo 1 > /proc/sys/net/ipv4/route/gc_interval
echo 131072 > /proc/sys/net/ipv4/route/gc_thresh

and watch the output of :

rtstat -c 100 -i 1

(you might have to recompile lnstat/rtstat from iproute2 package from

http://developer.osdl.org/dev/iproute2/download/

Eric

2006-03-23 21:33:04

by Robert Olsson

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.


Jesper Dangaard Brouer writes:

> > It is almost certainly the cause of your crashes, that code
> > is still extremely raw and that's why it is listed as "EXPERIMENTAL".
>
> It seems you are right :-) (and I'll be more careful about using experimental
> code in production again). The machine has now been running for 34 hours
> without crashing. The strange thing is that I'm running the same kernel
> on 30 other (similar) machines, which have not crashed. (I do suspect the
> specific traffic load pattern influences this.)

Sounds good... seems like Dave did the correct analysis.

> BUT, I do think I have noticed another problem in the garbage collection
> code (route.c), which causes the garbage collector to (almost) never
> garbage collect.

We're trying to avoid most/all periodic GC in high-flow systems and instead
do GC as new entries are created. We use the patch below to create yet
another (unfortunately) /proc entry, ip_rt_gc_max_chain_length, to better
control this. It also decreases the threshold from 8 to 4.

Cheers.
--ro


--- linux-2.6.14.5/net/ipv4/route.c.orig 2006-01-02 14:24:00.000000000 +0100
+++ linux-2.6.14.5/net/ipv4/route.c 2006-01-02 15:26:29.000000000 +0100
@@ -126,6 +126,7 @@
static int ip_rt_error_cost = HZ;
static int ip_rt_error_burst = 5 * HZ;
static int ip_rt_gc_elasticity = 8;
+static int ip_rt_gc_max_chain_length = 4;
static int ip_rt_mtu_expires = 10 * 60 * HZ;
static int ip_rt_min_pmtu = 512 + 20 + 20;
static int ip_rt_min_advmss = 256;
@@ -977,7 +978,7 @@
* The second limit is less certain. At the moment it allows
* only 2 entries per bucket. We will see.
*/
- if (chain_length > ip_rt_gc_elasticity) {
+ if (chain_length > ip_rt_gc_max_chain_length) {
*candp = cand->u.rt_next;
rt_free(cand);
}
@@ -3017,6 +3018,14 @@
.proc_handler = &proc_dointvec,
},
{
+ .ctl_name = NET_IPV4_ROUTE_GC_MAX_CHAIN_LENGTH,
+ .procname = "gc_max_chain_length",
+ .data = &ip_rt_gc_max_chain_length,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
.ctl_name = NET_IPV4_ROUTE_MTU_EXPIRES,
.procname = "mtu_expires",
.data = &ip_rt_mtu_expires,
--- linux-2.6.14.5/include/linux/sysctl.h.orig 2006-01-02 14:46:55.000000000 +0100
+++ linux-2.6.14.5/include/linux/sysctl.h 2006-01-02 14:47:01.000000000 +0100
@@ -375,6 +375,7 @@
NET_IPV4_ROUTE_MIN_ADVMSS=17,
NET_IPV4_ROUTE_SECRET_INTERVAL=18,
NET_IPV4_ROUTE_GC_MIN_INTERVAL_MS=19,
+ NET_IPV4_ROUTE_GC_MAX_CHAIN_LENGTH=20,
};

enum


2006-03-23 21:47:36

by Jesper Dangaard Brouer

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.


On Thu, 23 Mar 2006, Eric Dumazet wrote:

> Jesper Dangaard Brouer a écrit :
>
>>> grep . /proc/sys/net/ipv4/route/*
>>> /proc/sys/net/ipv4/route/error_burst:5000
>>> /proc/sys/net/ipv4/route/error_cost:1000
>>> grep: /proc/sys/net/ipv4/route/flush: Operation not permitted
>>> /proc/sys/net/ipv4/route/gc_elasticity:8
>>> /proc/sys/net/ipv4/route/gc_interval:60
>>> /proc/sys/net/ipv4/route/gc_min_interval:0
>>> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
>>> /proc/sys/net/ipv4/route/gc_thresh:65536
>>> /proc/sys/net/ipv4/route/gc_timeout:300
>>> /proc/sys/net/ipv4/route/max_delay:10
>>> /proc/sys/net/ipv4/route/max_size:1048576
>>> /proc/sys/net/ipv4/route/min_adv_mss:256
>>> /proc/sys/net/ipv4/route/min_delay:2
>>> /proc/sys/net/ipv4/route/min_pmtu:552
>>> /proc/sys/net/ipv4/route/mtu_expires:600
>>> /proc/sys/net/ipv4/route/redirect_load:20
>>> /proc/sys/net/ipv4/route/redirect_number:9
>>> /proc/sys/net/ipv4/route/redirect_silence:20480
>>> /proc/sys/net/ipv4/route/secret_interval:600
>
> I would say : Change the settings :)
>
> echo 2 > /proc/sys/net/ipv4/route/gc_elasticity
> echo 1 > /proc/sys/net/ipv4/route/gc_interval
> echo 131072 > /proc/sys/net/ipv4/route/gc_thresh

These parameters do not solve the problem. I think you missed my previous
point. The parameter that needs adjustment is:

/proc/sys/net/ipv4/route/max_size

The garbage collector will not be activated before the number of
entries is above "max_size" (see the function rt_garbage_collect).

I have set:
/proc/sys/net/ipv4/route/gc_thresh:30000
/proc/sys/net/ipv4/route/max_size:30000

Which solves the problem of the route cache growing too large too fast.


I have read the route.c code again, to see if I missed something. Are you
trying to make the function "rt_check_expire" do the cleanup?

I have tried your parameters, and they do not have the desired effect.


> and watch the output of :
>
> rtstat -c 100 -i 1
> (you might have to recompile lnstat/rtstat from iproute2 package from
> http://developer.osdl.org/dev/iproute2/download/

I prefer to use Robert's version of rtstat ;-)

Hilsen
Jesper Brouer

--
-------------------------------------------------------------------
Cand. scient datalog
Dept. of Computer Science, University of Copenhagen
-------------------------------------------------------------------

2006-03-24 06:12:43

by Eric Dumazet

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.

Jesper Dangaard Brouer a écrit :
>
> On Thu, 23 Mar 2006, Eric Dumazet wrote:
>
>> Jesper Dangaard Brouer a écrit :
>>
>>>> grep . /proc/sys/net/ipv4/route/*
>>>> /proc/sys/net/ipv4/route/error_burst:5000
>>>> /proc/sys/net/ipv4/route/error_cost:1000
>>>> grep: /proc/sys/net/ipv4/route/flush: Operation not permitted
>>>> /proc/sys/net/ipv4/route/gc_elasticity:8
>>>> /proc/sys/net/ipv4/route/gc_interval:60
>>>> /proc/sys/net/ipv4/route/gc_min_interval:0
>>>> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
>>>> /proc/sys/net/ipv4/route/gc_thresh:65536
>>>> /proc/sys/net/ipv4/route/gc_timeout:300
>>>> /proc/sys/net/ipv4/route/max_delay:10
>>>> /proc/sys/net/ipv4/route/max_size:1048576
>>>> /proc/sys/net/ipv4/route/min_adv_mss:256
>>>> /proc/sys/net/ipv4/route/min_delay:2
>>>> /proc/sys/net/ipv4/route/min_pmtu:552
>>>> /proc/sys/net/ipv4/route/mtu_expires:600
>>>> /proc/sys/net/ipv4/route/redirect_load:20
>>>> /proc/sys/net/ipv4/route/redirect_number:9
>>>> /proc/sys/net/ipv4/route/redirect_silence:20480
>>>> /proc/sys/net/ipv4/route/secret_interval:600
>>
>> I would say : Change the settings :)
>>
>> echo 2 > /proc/sys/net/ipv4/route/gc_elasticity
>> echo 1 > /proc/sys/net/ipv4/route/gc_interval
>> echo 131072 > /proc/sys/net/ipv4/route/gc_thresh
>
> These parameters do not solve the problem. I think you missed my
> previous point. The parameter that needs adjustment is:
>
> /proc/sys/net/ipv4/route/max_size
>
> The garbage collector will not be activated before the number of entries
> is above "max_size" (see the function rt_garbage_collect).
>
> I have set:
> /proc/sys/net/ipv4/route/gc_thresh:30000
> /proc/sys/net/ipv4/route/max_size:30000
>
> Which solves the problem of the route cache growing too large too fast.
>
>
> I have read the route.c code again, to see if I missed something. Are
> you trying to make the function "rt_check_expire" do the cleanup?
>
> I have tried your parameters, and they do not have the desired effect.
>

I was just guessing, since you didn't give us your rtstat output.

If you start to hit max_size then you really have a problem.

And no, I was not trying to make the garbage collector start. That costs way
too much CPU, and the router drops packets.

On my routers, I try to size the rt hash table appropriately (boot param:
rhash_entries=1048575 for example), and keep the chains small, to avoid
spending too much CPU time (and too many memory cache lines touched) in
{soft}irq processing.

But yes, it uses some memory.

# rtstat -c10 -i1
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
        |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|

 1672276|   32062|    4688|       0|       0|       0|       0|       0|    2124|    2176|       0|       0|       0|       0|       0|   26020|    4385|
 1671142|   31826|    4626|       0|       0|       0|       0|       0|    2055|    1906|       0|       0|       0|       0|       0|   25617|    4124|
 1670424|   31473|    4702|       0|       0|       0|       0|       0|    2221|    2144|       0|       0|       0|       0|       0|   25348|    4340|
 1670928|   31671|    7600|       0|       0|       0|       0|       0|    2037|    2152|       0|       0|       0|       0|       0|   30354|    4245|
 1670444|   31704|    4818|       0|       0|       0|       0|       0|    2037|    1927|       0|       0|       0|       0|       0|   25424|    3871|
 1670133|   31785|    4598|       0|       0|       0|       0|       0|    1988|    2184|       0|       0|       0|       0|       0|   24946|    4302|
 1669990|   31544|    4700|       0|       0|       0|       0|       0|    2022|    2188|       0|       0|       0|       0|       0|   25092|    4357|
 1669880|   31930|    4750|       0|       0|       0|       0|       0|    2054|    2002|       0|       0|       0|       0|       0|   25703|    4214|


Eric

2006-03-24 10:43:08

by Jesper Dangaard Brouer

Subject: Re: Kernel panic: Route cache, RCU, possibly FIB trie.



On Fri, 24 Mar 2006, Eric Dumazet wrote:

> Jesper Dangaard Brouer a écrit :
>>
>> On Thu, 23 Mar 2006, Eric Dumazet wrote:
>>
>>> Jesper Dangaard Brouer a écrit :
>>>
>>
>> I have read the route.c code again, to see if I missed something. Are you
>> trying to make the function "rt_check_expire" do the cleanup?
>>
>> I have tried your parameters, and they do not have the desired effect.
>>
>
> I was just guessing, since you didn't give us your rtstat output.
>
> If you start to hit max_size then you really have a problem.
>
> And no, I was not trying to make the garbage collector start. That costs way
> too much CPU, and the router drops packets.
>
> On my routers, I try to size the rt hash table appropriately (boot param:
> rhash_entries=1048575 for example), and keep the chains small, to avoid
> spending too much CPU time (and too many memory cache lines touched) in
> {soft}irq processing.
>
> But yes, it uses some memory.

How much memory does your machine have?

What kernel version are you using?



> # rtstat -c10 -i1
> rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
> entries| in_hit|in_slow_|in_slow_|in_no_ro| in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
> tot| mc| ute| | an_dst| an_src| _tot| _mc| | ed| miss| verflow| _search|t_search|
>
> 1672276| 32062| 4688| 0| 0| 0| 0| 0| 2124| 2176| 0| 0| 0| 0| 0| 26020| 4385|
> 1671142| 31826| 4626| 0| 0| 0| 0| 0| 2055| 1906| 0| 0| 0| 0| 0| 25617| 4124|
> 1670424| 31473| 4702| 0| 0| 0| 0| 0| 2221| 2144| 0| 0| 0| 0| 0| 25348| 4340|
> 1670928| 31671| 7600| 0| 0| 0| 0| 0| 2037| 2152| 0| 0| 0| 0| 0| 30354| 4245|
> 1670444| 31704| 4818| 0| 0| 0| 0| 0| 2037| 1927| 0| 0| 0| 0| 0| 25424| 3871|
> 1670133| 31785| 4598| 0| 0| 0| 0| 0| 1988| 2184| 0| 0| 0| 0| 0| 24946| 4302|
> 1669990| 31544| 4700| 0| 0| 0| 0| 0| 2022| 2188| 0| 0| 0| 0| 0| 25092| 4357|
> 1669880| 31930| 4750| 0| 0| 0| 0| 0| 2054| 2002| 0| 0| 0| 0| 0| 25703| 4214|


You definitely have more entries than me!

After disabling equal-cost multipath routing, the machine does not kernel
panic, BUT it still dies because it runs out of memory!



An rtstat example (2.6.15.1):

size IN: hit tot mc no_rt bcast madst masrc OUT: hit tot mc GC: tot ignored goal_miss ovrf HASH: in_search out_search
377172 9431 153 0 0 2 0 0 119 5 0 159 157 0 0 1683 17
377350 9244 171 0 0 1 0 0 129 3 0 172 170 0 0 1741 20
377490 9822 143 0 0 2 0 0 101 6 0 149 147 0 0 1864 26
377657 9582 149 0 0 1 0 0 106 4 0 153 151 0 0 2192 23
377787 9105 136 0 0 1 0 0 106 6 0 141 139 0 0 1881 21
377930 9216 149 0 0 2 0 0 114 5 0 154 152 0 0 2033 19
377891 9157 159 0 0 4 0 0 117 6 0 165 163 0 0 2117 28
378126 9052 185 0 0 1 0 0 128 6 0 191 189 0 0 1908 24
378313 8862 185 0 0 2 0 0 152 5 0 189 187 0 0 1987 43
378504 8771 194 0 0 2 0 0 111 4 0 197 195 0 0 1904 20
378735 9069 219 0 0 4 0 0 153 4 0 223 221 0 0 1973 33
378948 8836 217 0 0 0 0 0 145 6 0 222 220 0 0 1807 18
379139 8952 190 0 0 5 0 0 120 5 0 195 193 0 0 1840 25


The strange thing is the kernel 2.4.26 rtstat output on the same machine,
where the cache does not grow out of proportion...

Kernel 2.4.26 output.
(Showing where the cache is periodically flushed)

size IN: hit tot mc no_rt bcast madst masrc OUT: hit tot mc GC: tot ignored goal_miss ovrf HASH: in_search out_search
28556 8705 180 0 0 1 0 0 155 2 0 182 180 0 0 26229 514
28742 8630 184 0 0 2 0 0 92 6 0 191 189 0 0 24708 330
25955 8530 203 0 0 2 0 0 73 3 0 206 204 0 0 23575 237
26143 9015 228 0 0 2 0 0 93 2 0 230 228 0 0 23017 292
26345 9012 196 0 0 1 0 0 160 4 0 199 197 0 0 23527 560
26546 8535 216 0 0 0 0 0 65 3 0 216 214 0 0 24245 206
26715 8415 170 0 0 0 0 0 66 2 0 171 169 0 0 23079 203
26894 8376 180 0 0 0 0 0 77 5 0 185 183 0 0 22002 245
27073 8407 173 0 0 1 0 0 69 2 0 175 173 0 0 23926 228
27224 8590 160 0 0 2 0 0 68 3 0 163 161 0 0 24457 211
27381 8304 153 0 0 2 0 0 79 6 0 159 157 0 0 23955 313
27555 7993 174 0 0 1 0 0 52 2 0 176 174 0 0 22940 209
27766 8710 215 0 0 0 0 0 44 4 0 220 218 0 0 25631 174
27958 8073 192 0 0 1 0 0 46 3 0 195 193 0 0 24186 178
28082 7877 138 0 0 0 0 0 53 2 0 140 138 0 0 23524 205
28171 7519 91 0 0 1 0 0 47 4 0 95 93 0 0 22122 209
28254 7953 81 0 0 0 0 0 38 4 0 85 83 0 0 22011 175
28353 8062 103 0 0 1 0 0 53 3 0 107 105 0 0 22077 220
28450 8129 89 0 0 1 0 0 41 5 0 94 92 0 0 23018 154
28578 7767 129 0 0 0 0 0 53 3 0 132 130 0 0 21775 198
size IN: hit tot mc no_rt bcast madst masrc OUT: hit tot mc GC: tot ignored goal_miss ovrf HASH: in_search out_search
28684 8291 112 0 0 0 0 0 58 5 0 117 115 0 0 22819 229
28774 8426 93 0 0 0 0 0 56 2 0 95 93 0 0 23719 213
28851 7632 85 0 0 0 0 0 51 3 0 88 86 0 0 21774 172
485 7766 282 0 0 1 0 0 42 11 0 84 82 0 0 21271 161
1385 7572 443 0 0 1 0 0 32 14 0 0 0 0 0 824 7
1840 7538 232 1 0 1 0 0 38 12 0 0 0 0 0 1222 9
2283 7429 216 0 0 2 0 0 60 10 0 0 0 0 0 1358 15
2646 7209 172 0 0 1 0 0 43 8 0 0 0 0 0 1406 16
2950 7934 146 0 0 0 0 0 55 5 0 0 0 0 0 1777 16
3273 7784 156 0 0 2 0 0 43 5 0 0 0 0 0 2158 18
3603 7159 158 0 0 0 0 0 48 6 0 0 0 0 0 2418 31
3892 7398 140 0 0 0 0 0 49 4 0 0 0 0 0 2665 36
4142 7847 158 0 0 1 0 0 46 5 0 0 0 0 0 2936 23
4389 7789 116 0 0 0 0 0 43 9 0 0 0 0 0 3216 28
4702 8082 151 0 0 0 0 0 30 5 0 0 0 0 0 3333 23
4953 8011 120 0 0 0 0 0 42 5 0 0 0 0 0 3548 39



Hilsen
Jesper Brouer

--
-------------------------------------------------------------------
Cand. scient datalog
Dept. of Computer Science, University of Copenhagen
-------------------------------------------------------------------