2011-02-02 09:02:52

by Yann Dupont

Subject: kernel 2.6.37 : oops in cleanup_once

Hello.
We recently upgraded one machine to vanilla 2.6.37, and have experienced 2
kernel oopses since. Each oops came after ~1 week of uptime.
The last oops was last night, but we didn't get any trace.

Here is the previous oops :

Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316042]
BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316096]
IP: [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316135] PGD 0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316157]
Oops: 0002 [#1] SMP
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316188]
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316234] CPU 1
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316240]
Modules linked in: xt_physdev ip6t_LOG nf_conntrack_ipv6 nf_defrag_ipv6
ipt_LOG xt_multiport xt_limit nf_conntrack_tftp nf_conntrack_ftp tun
ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT
xt_tcpudp iptable_filter ip_tables x_tables kvm_intel kvm ipv6 8021q
bridge stp ext2 mbcache fuse snd_pcm snd_timer snd soundcore
snd_page_alloc i5000_edac edac_core psmouse evdev i5k_amb tpm_tis tpm
joydev dcdbas tpm_bios pcspkr rng_core ghes shpchp serio_raw pci_hotplug
processor hed button thermal_sys xfs exportfs dm_mod sg sr_mod sd_mod
cdrom usbhid hid usb_storage qla2xxx scsi_transport_fc scsi_tgt uhci_hcd
mptsas mptscsih ehci_hcd mptbase bnx2 scsi_transport_sas scsi_mod [last
unloaded: scsi_wait_scan]
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316694]
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316715]
Pid: 0, comm: kworker/0:0 Not tainted 2.6.37-dsiun-110105 #17
0MY736/PowerEdge M600
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316761]
RIP: 0010:[<ffffffff8130e6bf>] [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316808]
RSP: 0018:ffff8800cfc43e20 EFLAGS: 00010202
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316834]
RAX: ffff8803d3158018 RBX: ffff8803d3158000 RCX: 0000000000000005
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316878]
RDX: 0b000209f1beadde RSI: 00000000000000ac RDI: ffffffff8152a970
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318512]
RBP: 00000000000248f6 R08: 00000000003d0900 R09: 0000000000000000
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318560]
R10: dead000000200200 R11: 0000000000000000 R12: ffff8800cfc43ea0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318604]
R13: 0000000000000100 R14: ffff88040fc99fd8 R15: 0000000000000000
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318652]
FS: 0000000000000000(0000) GS:ffff8800cfc40000(0000) knlGS:0000000000000000
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318698]
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318725]
CR2: 000000000000000d CR3: 00000000014f1000 CR4: 00000000000026e0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318768]
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318812]
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318855]
Process kworker/0:0 (pid: 0, threadinfo ffff88040fc98000, task
ffff88040fc6c2e0)
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318901]
Stack:
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318921]
0000000000000082 00000001029221c1 00000000000248f6 ffffffff8130e988
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318971]
ffff88040fc90000 ffff88040fc90000 ffffffff8152a9a0 ffffffff8105e95f
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319021]
ffff8800cfc43e58 ffff88040fc91020 ffffffff8130e950 ffff88040fc99fd8
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319072]
Call Trace:
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319093] <IRQ>
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319116]
[<ffffffff8130e988>] ? peer_check_expire+0x38/0x110
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319146]
[<ffffffff8105e95f>] ? run_timer_softirq+0x16f/0x350
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319175]
[<ffffffff8130e950>] ? peer_check_expire+0x0/0x110
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319204]
[<ffffffff81079c6b>] ? ktime_get+0x5b/0xe0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319232]
[<ffffffff8105685a>] ? __do_softirq+0xaa/0x1e0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319260]
[<ffffffff81003ddc>] ? call_softirq+0x1c/0x30
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319288]
[<ffffffff81005f75>] ? do_softirq+0x65/0xa0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319315]
[<ffffffff81056745>] ? irq_exit+0x85/0x90
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319343]
[<ffffffff8102137a>] ? smp_apic_timer_interrupt+0x6a/0xa0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319373]
[<ffffffff81003893>] ? apic_timer_interrupt+0x13/0x20
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319401] <EOI>
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319427]
[<ffffffffa032218c>] ? acpi_idle_enter_bm+0x243/0x27b [processor]
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319473]
[<ffffffffa0322185>] ? acpi_idle_enter_bm+0x23c/0x27b [processor]
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319519]
[<ffffffff812c0deb>] ? cpuidle_idle_call+0x8b/0x140
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319547]
[<ffffffff8100208a>] ? cpu_idle+0x6a/0xf0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319573]
Code: 00 48 8b 05 c4 c2 21 00 48 3d 60 a9 52 81 74 5c 48 8d 58 e8 48 8b
15 11 02 24 00 2b 53 28 48 39 ea 72 49 48 8b 4b 18 48 8b 53 20 <48> 89
51 08 48 89 0a 48 89 43 18 48 89 43 20 f0 ff 40 14 48 c7
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319768]
RIP [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319797]
RSP <ffff8800cfc43e20>
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319820]
CR2: 000000000000000d
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320187]
---[ end trace eaf3ed2d46c78768 ]---
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320257]
Kernel panic - not syncing: Fatal exception in interrupt
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320329]
Pid: 0, comm: kworker/0:0 Tainted: G D 2.6.37-dsiun-110105 #17
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320418]
Call Trace:
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320481]
<IRQ> [<ffffffff8137c75e>] ? panic+0x92/0x1a2
Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320601]
[<ffffffff81007357>] ? oops_end+0xe7/0xf0


Any ideas ??

This machine is running lots of KVM guests. I can provide the .config if
needed.

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : [email protected]


2011-02-02 10:53:05

by Eric Dumazet

Subject: Re: kernel 2.6.37 : oops in cleanup_once

On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> Hello.
> We recently upgraded one machine with vanilla 2.6.37, and experienced 2
> kernel oops since. Each oops is after ~1 week of uptime.
> The last oops was last night but we didn't had any trace.
>
> Here is the previous oops :
>
> [...]
>
>
> Any ideas ??


Hi Yann

Yes, this is a known problem.

Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
(inetpeer: Use correct AVL tree base pointer in inet_getpeer())

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492

I believe David will send it to stable team shortly, if not already
done :)

Thanks

2011-02-02 11:25:17

by Eric Dumazet

Subject: Re: kernel 2.6.37 : oops in cleanup_once

On Wednesday, 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
> On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> > Hello.
> > We recently upgraded one machine with vanilla 2.6.37, and experienced 2
> > kernel oops since. Each oops is after ~1 week of uptime.
> > The last oops was last night but we didn't had any trace.

oops, 2.6.37 "only"

> Yes this is a known problem.
>
> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>
> I believe David will send it to stable team shortly, if not already
> done :)

Please ignore: this patch was for the linux-2.6 tree; 2.6.37 was not
affected by that problem.

So it's another problem... Is there anything in particular you do on this
machine?



2011-02-02 13:08:56

by Yann Dupont

Subject: Re: kernel 2.6.37 : oops in cleanup_once

On 02/02/2011 12:24, Eric Dumazet wrote:
> On Wednesday, 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
>> On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
>>> Hello.
>>> We recently upgraded one machine with vanilla 2.6.37, and experienced 2
>>> kernel oops since. Each oops is after ~1 week of uptime.
>>> The last oops was last night but we didn't had any trace.
> oops, 2.6.37 "only"
>
>> Yes this is a known problem.
>>
>> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>>
>> I believe David will send it to stable team shortly, if not already
>> done :)
> Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
> affected by the problem.
>
> So its another problem... Is there anything particular you do on this
> machine ?
>
>
>
>
Nothing really special there: we run a lot (20) of KVM guests (mainly
Linux firewalls for lots of different VLANs), so we have a lot of
bridges, VLANs & tun/tap devices.
Oh, and CONFIG_BRIDGE_IGMP_SNOOPING is set to n (because of the other
bug already sent to netdev - more to come in the next mail).

Hard to say if this BUG is new in 2.6.37. This host had been running fine
with 2.6.34.2 since August 2010.
Bisecting will be hard due to the time to trigger the bug (and the fact
that this machine is a production machine)

Anyway, I can test with a specific kernel version if you suspect something.

Regards,


--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : [email protected]

2011-02-02 14:53:40

by Eric Dumazet

Subject: Re: kernel 2.6.37 : oops in cleanup_once

On Wednesday, 02 February 2011 at 14:08 +0100, Yann Dupont wrote:
> On 02/02/2011 12:24, Eric Dumazet wrote:
> > On Wednesday, 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
> >> On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> >>> Hello.
> >>> We recently upgraded one machine with vanilla 2.6.37, and experienced 2
> >>> kernel oops since. Each oops is after ~1 week of uptime.
> >>> The last oops was last night but we didn't had any trace.
> > oops, 2.6.37 "only"
> >
> >> Yes this is a known problem.
> >>
> >> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> >> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
> >>
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> >>
> >> I believe David will send it to stable team shortly, if not already
> >> done :)
> > Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
> > affected by the problem.
> >
> > So its another problem... Is there anything particular you do on this
> > machine ?
> >
> >
> >
> >
> Nothing really special there, we run a lot (20) of KVM guest (mainly
> linux firewalls for lots of differents vlan), so we have a lot of
> bridges vlan & tun/tap.
> Oh, and CONFIG_BRIDGE_IGMP_SNOOPING is set to n (because of the other
> bug already sent to netdev - more to come on next mail)
>
> Hard to say if this BUG is new in 2.6.37. This host was running fine
> with 2.6.34.2 since August 2010.
> Bisecting will be hard due to the time to trigger the bug (and the fact
> that this machine is a production machine)
>
> Anyway, I can test with a specific kernel version if you suspect something.
>

I suspect a memory corruption from another layer (not inetpeer).

Unfortunately many kmem caches share the "64 bytes" cache.

Could you please add "slub_nomerge" to your kernel boot command line?


This way, we can separate corruptions on each cache.
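
For anyone who wants to confirm from a script that the option was actually passed at boot, here is a minimal sketch. It assumes the kernel command line is readable from /proc/cmdline; the helper names `has_boot_param` and `read_cmdline` and the sample command line are hypothetical, made up for illustration only.

```python
import os


def has_boot_param(cmdline: str, param: str) -> bool:
    """Return True if param appears as a standalone token or as param=value."""
    for token in cmdline.split():
        if token == param or token.startswith(param + "="):
            return True
    return False


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line of the running system."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical command line, used only to show the matching rule.
    sample = "root=/dev/sda1 ro slub_nomerge quiet"
    print("sample has slub_nomerge:", has_boot_param(sample, "slub_nomerge"))

    # On a Linux machine, also check the real command line.
    if os.path.exists("/proc/cmdline"):
        print("running kernel has slub_nomerge:",
              has_boot_param(read_cmdline(), "slub_nomerge"))
```

Matching whole tokens rather than substrings avoids a false positive on a longer parameter that merely starts with the same text.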


In your crash, one inetpeer contains garbage in its unused-list next/prev
pointers:

RCX: 0000000000000005
RDX: 0b000209f1beadde

Definitely something overwrote these values with non-pointer values.
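
The reasoning above can be checked by hand. The following is only an illustrative sketch, not part of the original exchange: it assumes x86_64 with 48-bit virtual addresses, where kernel addresses sit in the upper canonical half, and it relies on my own reading of the `Code:` line, in which the faulting bytes `48 89 51 08` decode as `mov %rdx,0x8(%rcx)`. The helper name `looks_like_kernel_pointer` is hypothetical; the register values are copied from the first oops.

```python
# Assumption: x86_64 with 48-bit virtual addresses, so kernel addresses
# are 0xffff800000000000 and above (upper canonical half).
KERNEL_SPACE_START = 0xFFFF800000000000
MAX_U64 = 0xFFFFFFFFFFFFFFFF


def looks_like_kernel_pointer(value: int) -> bool:
    """Return True if value falls in the upper canonical half."""
    return KERNEL_SPACE_START <= value <= MAX_U64


# Register values from the first oops (Jan 21).
registers = {
    "RAX": 0xFFFF8803D3158018,
    "RBX": 0xFFFF8803D3158000,
    "RCX": 0x0000000000000005,  # my reading: loaded from 0x18(%rbx), list "next"
    "RDX": 0x0B000209F1BEADDE,  # my reading: loaded from 0x20(%rbx), list "prev"
}

for name, value in registers.items():
    if looks_like_kernel_pointer(value):
        verdict = "plausible kernel pointer"
    else:
        verdict = "not a kernel pointer"
    print(f"{name} = {value:#018x}: {verdict}")

# The faulting store writes RDX to RCX + 8 (the next->prev = prev step of a
# list removal), so the address touched is 5 + 8 = 0xd, matching CR2.
fault_address = registers["RCX"] + 8
print(f"fault address = {fault_address:#x}")
```

Under those assumptions, RAX and RBX look like ordinary kernel addresses while RCX and RDX do not, and the computed store address agrees with the CR2 value reported in the oops.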


2011-02-02 15:04:15

by Yann Dupont

Subject: Re: kernel 2.6.37 : oops in cleanup_once

On 02/02/2011 15:53, Eric Dumazet wrote:
> On Wednesday, 02 February 2011 at 14:08 +0100, Yann Dupont wrote:
>> On 02/02/2011 12:24, Eric Dumazet wrote:
>>> On Wednesday, 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
>>>> On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
>>>>> Hello.
>>>>> We recently upgraded one machine with vanilla 2.6.37, and experienced 2
>>>>> kernel oops since. Each oops is after ~1 week of uptime.
>>>>> The last oops was last night but we didn't had any trace.
>>> oops, 2.6.37 "only"
>>>
>>>> Yes this is a known problem.
>>>>
>>>> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>>>> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
>>>>
>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>>>>
>>>> I believe David will send it to stable team shortly, if not already
>>>> done :)
>>> Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
>>> affected by the problem.
>>>
>>> So its another problem... Is there anything particular you do on this
>>> machine ?
>>>
>>>
>>>
>>>
>> Nothing really special there, we run a lot (20) of KVM guest (mainly
>> linux firewalls for lots of differents vlan), so we have a lot of
>> bridges vlan& tun/tap.
>> Oh, and CONFIG_BRIDGE_IGMP_SNOOPING is set to n (because of the other
>> bug already sent to netdev - more to come on next mail)
>>
>> Hard to say if this BUG is new in 2.6.37. This host was running fine
>> with 2.6.34.2 since August 2010.
>> Bisecting will be hard due to the time to trigger the bug (and the fact
>> that this machine is a production machine)
>>
>> Anyway, I can test with a specific kernel version if you suspect something.
>>
> I suspect a mem corruption from another layer (not inetpeer)
>
> Unfortunately many kmem caches share the "64 bytes" cache.
>
> Could you please add "slub_nomerge" on your boot command ?
>
OK, I will do it at 18:30 CET (to minimize impact).
Is the suspected bug SLUB related?

The 2.6.34.2 kernel previously used on that server used SLAB.


2 questions :
-How can I be sure slub_nomerge is active ? Boot message ?
-Is there a very severe impact on performance ?

Regards,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : [email protected]

2011-02-02 15:09:08

by Eric Dumazet

Subject: Re: kernel 2.6.37 : oops in cleanup_once

On Wednesday, 02 February 2011 at 16:04 +0100, Yann Dupont wrote:
> >
> Ok, will do it at 18:30 CET (to minimize impact)
> It the suspected bug SLUB related ?
>

No: it can be a corruption from another part of the kernel.

> The 2.6.34.2 kernel previously used on that server used SLAB.
>
>
> 2 questions :
> -How can I be sure slub_nomerge is active ? Boot message ?


# ls -l /sys/kernel/slab/

If you have symlinks : merge is on (default)

If you don't have symlinks : nomerge is in action
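
The same check can be scripted. This is a minimal sketch, assuming a SLUB kernel that exposes /sys/kernel/slab; the function names `count_slab_symlinks` and `merging_appears_enabled` are hypothetical, and the directory is a parameter so the logic can be tried on any directory.

```python
import os
from typing import Tuple


def count_slab_symlinks(slab_dir: str = "/sys/kernel/slab") -> Tuple[int, int]:
    """Return (number of symlinks, total entries) found in slab_dir."""
    names = os.listdir(slab_dir)
    links = sum(1 for name in names
                if os.path.islink(os.path.join(slab_dir, name)))
    return links, len(names)


def merging_appears_enabled(slab_dir: str = "/sys/kernel/slab") -> bool:
    """Symlinks present suggests caches were merged; none suggests nomerge."""
    links, _total = count_slab_symlinks(slab_dir)
    return links > 0


if __name__ == "__main__":
    # Only meaningful on a Linux machine running a SLUB kernel.
    if os.path.isdir("/sys/kernel/slab"):
        links, total = count_slab_symlinks()
        print(f"{links} symlinks out of {total} entries")
        print("merge appears", "on" if merging_appears_enabled() else "off")
    else:
        print("/sys/kernel/slab not found (not a SLUB kernel?)")
```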

> -Is there a very severe impact on performance ?
>

not at all

> Regards,
>

2011-02-02 17:59:50

by Yann Dupont

Subject: Re: kernel 2.6.37 : oops in cleanup_once

On 02/02/2011 16:08, Eric Dumazet wrote:
> On Wednesday, 02 February 2011 at 16:04 +0100, Yann Dupont wrote:
>> Ok, will do it at 18:30 CET (to minimize impact)
>> It the suspected bug SLUB related ?
>>
> no : It can be a corruption from another part of kernel.
>
>> The 2.6.34.2 kernel previously used on that server used SLAB.
>>
>>
>> 2 questions :
>> -How can I be sure slub_nomerge is active ? Boot message ?
>
> # ls -l /sys/kernel/slab/
>
> If you have symlinks : merge is on (default)
>
> If you dont have symlinks : nomerge is in action
>
>> -Is there a very severe impact on performance ?
>>
> not at all
>
>> Regards,
>>
>
Well, the server had the good taste to oops at 18:05, 25 minutes before
the planned reboot :)

Here is the oops (I think it's much the same):


Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128042]
BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128097]
IP: [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128146] PGD 0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128173]
Oops: 0002 [#1] SMP
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128200]
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128250] CPU 7
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128260]
Modules linked in: dell_rbu acpi_cpufreq freq_table mperf nls_utf8
nls_cp437 btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix
ntfs vfat msdos fat jfs reiserfs ext4 jbd2 crc16 ext3 jbd tun
ipt_MASQUERADE iptable_nat nf_nat
ipt_REJECT kvm_intel kvm xt_physdev ip6t_LOG nf_conntrack_ipv6
nf_defrag_ipv6 ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit
xt_tcpudp xt_state iptable_filter
ip_tables x_tables nf_conntrack_tftp nf_conntrack_ftp
nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ipv6 8021q bridge stp ext2
mbcache fuse snd_pcm snd_timer ghes hed button snd soundcore i5000_edac
edac_core processor shpchp tpm_tis pci_hotplug tpm rng_core
snd_page_alloc i5k_amb dcdbas tpm_bios joydev
evdev psmouse pcspkr serio_raw thermal_sys xfs exportfs dm_mod sg sr_mod
cdrom sd_mod usbhid hid usb_storage qla2xxx scsi_transport_fc scsi_tgt
uhci_hcd mptsas mptscsih
mptbase bnx2 scsi_transport_sas scsi_mod ehci_hcd [last unloaded:
scsi_wait_scan]
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128834]
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128855]
Pid: 0, comm: kworker/0:1 Not tainted 2.6.37-dsiun-110105 #17
0MY736/PowerEdge M600
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128901]
RIP: 0010:[<ffffffff8130e6bf>] [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128948]
RSP: 0018:ffff8800cfdc3e20 EFLAGS: 00010206
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128974]
RAX: ffff8803a7e0ea18 RBX: ffff8803a7e0ea00 RCX: 0000000000000005
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129003]
RDX: adde806c0d860b00 RSI: 0000000000000096 RDI: ffffffff8152a970
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129032]
RBP: 00000000000248f6 R08: 00000000003d0900 R09: 0000000000000000
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129062]
R10: dead000000200200 R11: 0000000000000000 R12: ffff8800cfdc3ea0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129091]
R13: 0000000000000100 R14: ffff88040fd29fd8 R15: 0000000000000000
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129121]
FS: 0000000000000000(0000) GS:ffff8800cfdc0000(0000) knlGS:0000000000000000
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129166]
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129193]
CR2: 000000000000000d CR3: 00000000014f1000 CR4: 00000000000026e0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129223]
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129252]
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129282]
Process kworker/0:1 (pid: 0, threadinfo ffff88040fd28000, task
ffff88040fce6450)
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129327] Stack:
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129347]
0000000000000082 00000001008d3b66 00000000000248f6 ffffffff8130e988
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129397]
ffff88040fd24000 ffff88040fd24000 ffffffff8152a9a0 ffffffff8105e95f
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129446]
ffff8800cfdc3e58 ffff88040fd25020 ffffffff8130e950 ffff88040fd29fd8
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129496]
Call Trace:
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129523] <IRQ>
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129551]
[<ffffffff8130e988>] ? peer_check_expire+0x38/0x110
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129581]
[<ffffffff8105e95f>] ? run_timer_softirq+0x16f/0x350
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129609]
[<ffffffff8130e950>] ? peer_check_expire+0x0/0x110
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129638]
[<ffffffff81079c6b>] ? ktime_get+0x5b/0xe0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129666]
[<ffffffff8105685a>] ? __do_softirq+0xaa/0x1e0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129694]
[<ffffffff81003ddc>] ? call_softirq+0x1c/0x30
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129722]
[<ffffffff81005f75>] ? do_softirq+0x65/0xa0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129748]
[<ffffffff81056745>] ? irq_exit+0x85/0x90
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129776]
[<ffffffff8102137a>] ? smp_apic_timer_interrupt+0x6a/0xa0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129806]
[<ffffffff81003893>] ? apic_timer_interrupt+0x13/0x20
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129833] <EOI>
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129857]
[<ffffffff8123f5ce>] ? acpi_hw_register_read+0x54/0xe2
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129890]
[<ffffffffa01c52b8>] ? acpi_idle_enter_simple+0xf4/0x126 [processor]
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129936]
[<ffffffffa01c52b1>] ? acpi_idle_enter_simple+0xed/0x126 [processor]
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131555]
[<ffffffffa01c5034>] ? acpi_idle_enter_bm+0xeb/0x27b [processor]
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131591]
[<ffffffff812c0deb>] ? cpuidle_idle_call+0x8b/0x140
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131619]
[<ffffffff8100208a>] ? cpu_idle+0x6a/0xf0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131645]
Code: 00 48 8b 05 c4 c2 21 00 48 3d 60 a9 52 81 74 5c 48 8d 58 e8 48 8b
15 11 02 24 00 2b 53 28 48 39 ea 72 49 48 8b 4b 18 48 8b 53 20 <48> 89
51 08 48 89 0a 48 89 43 18 48 89 43 20 f0 ff 40 14 48 c7
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131847]
RIP [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131876]
RSP <ffff8800cfdc3e20>
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131898]
CR2: 000000000000000d
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132280]
---[ end trace a9f45436c3b7c143 ]---
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132350]
Kernel panic - not syncing: Fatal exception in interrupt
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132422]
Pid: 0, comm: kworker/0:1 Tainted: G D 2.6.37-dsiun-110105 #17
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132510]
Call Trace:
Feb 2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132574]
<IRQ> [<ffffffff8137c75e>] ? panic+0x92/0x1a2

I also have a screenshot with more details. I'll send it in a
private message.



Since 18:30, the server has been running with slub_nomerge.

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : [email protected]

2011-03-14 10:52:52

by Yann Dupont

Subject: Re: kernel 2.6.37 : oops in cleanup_once

On 02/02/2011 16:08, Eric Dumazet wrote:


> I suspect a mem corruption from another layer (not inetpeer)
>
> Unfortunately many kmem caches share the "64 bytes" cache.
>
> Could you please add "slub_nomerge" on your boot command ?
>
...

>
>> -Is there a very severe impact on performance ?
>>
> not at all
>
Maybe there is an impact after all: since then, we haven't had any
problems!

linkwood:~# uptime
11:42:03 up 39 days, 17:08, 3 users, load average: 0.01, 0.03, 0.05

So... could slub_nomerge hide or simply avoid the problem ?
Or are we just lucky this time ?


--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : [email protected]

2011-03-14 13:14:30

by Eric Dumazet

Subject: Re: kernel 2.6.37 : oops in cleanup_once

On Monday, 14 March 2011 at 11:44 +0100, Yann Dupont wrote:
> On 02/02/2011 16:08, Eric Dumazet wrote:
>
>
> > I suspect a mem corruption from another layer (not inetpeer)
> >
> > Unfortunately many kmem caches share the "64 bytes" cache.
> >
> > Could you please add "slub_nomerge" on your boot command ?
> >
> ...
>
> >
> >> -Is there a very severe impact on performance ?
> >>
> > not at all
> >
> Maybe there is an impact after all : since then, we don't have problems
> anymore !
>
> linkwood:~# uptime
> 11:42:03 up 39 days, 17:08, 3 users, load average: 0.01, 0.03, 0.05
>
> So... could slub_nomerge hide or simply avoid the problem ?
> Or are we just lucky this time ?
>
>

I would say you are lucky ;)

Not all memory corruptions are noticed. Sometimes they touch unused parts
of memory, or parts with no critical content.