2010-07-11 18:55:30

by Torsten Kaiser

Subject: Regression 2.6.33->2.6.34: OOPS at boot, kmalloc corruption?

Trying to upgrade my system from 2.6.33 to 2.6.34, I can't get it to boot.

All tries used CONFIG_SLUB=y

The gentoo version of 2.6.34 generated an OOPS during network
initialization and then came to a stop. (It seemed that all processes
got stuck waiting on some locks.)
As in this instance the system was able to start the syslog, I was
able to capture the complete OOPS:
Jul 3 05:51:43 ariolc kernel: [ 32.674367] BUG: unable to handle
kernel NULL pointer dereference at 0000000000000003
Jul 3 05:51:43 ariolc kernel: [ 32.675674] IP: [<ffffffff810aab89>]
__kmalloc_track_caller+0x69/0x110
Jul 3 05:51:43 ariolc kernel: [ 32.676951] PGD 11e7e5067 PUD 11fd3d067 PMD 0
Jul 3 05:51:43 ariolc kernel: [ 32.678224] Oops: 0000 [#1] SMP
Jul 3 05:51:43 ariolc kernel: [ 32.679477] last sysfs file:
/sys/devices/virtual/block/md0/md/metadata_version
Jul 3 05:51:43 ariolc kernel: [ 32.680745] CPU 1
Jul 3 05:51:43 ariolc kernel: [ 32.680761] Modules linked in:
aes_x86_64(+) aes_generic sg
Jul 3 05:51:43 ariolc kernel: [ 32.682764]
Jul 3 05:51:43 ariolc kernel: [ 32.682764] Pid: 4652, comm:
modprobe Not tainted 2.6.34-gentoo-r1 #1 MS-7368/MS-7368
Jul 3 05:51:43 ariolc kernel: [ 32.682764] RIP:
0010:[<ffffffff810aab89>] [<ffffffff810aab89>]
__kmalloc_track_caller+0x69/0x110
Jul 3 05:51:43 ariolc kernel: [ 32.682764] RSP:
0018:ffff88011e75fe08 EFLAGS: 00010006
Jul 3 05:51:43 ariolc kernel: [ 32.687268] RAX: ffff880001b0f088
RBX: ffffffff8170d4d0 RCX: ffff88011e574b80
Jul 3 05:51:43 ariolc kernel: [ 32.688564] RDX: 0000000000000000
RSI: 00000000000000d0 RDI: 00000000000002d0
Jul 3 05:51:43 ariolc kernel: [ 32.688564] RBP: 0000000000000296
R08: 0000000000000014 R09: ffff88011e574800
Jul 3 05:51:43 ariolc kernel: [ 32.691414] R10: 0000000000000001
R11: ffff880001a12008 R12: 00000000000000d0
Jul 3 05:51:43 ariolc kernel: [ 32.691414] R13: 0000000000000003
R14: ffffffff81064abb R15: ffffc90010729d68
Jul 3 05:51:43 ariolc kernel: [ 32.691414] FS:
00007f0a9acb8700(0000) GS:ffff880001b00000(0000)
knlGS:0000000000000000
Jul 3 05:51:43 ariolc kernel: [ 32.691414] CS: 0010 DS: 0000 ES:
0000 CR0: 0000000080050033
Jul 3 05:51:43 ariolc kernel: [ 32.697212] CR2: 0000000000000003
CR3: 000000011d03e000 CR4: 00000000000006e0
Jul 3 05:51:43 ariolc kernel: [ 32.698792] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
Jul 3 05:51:43 ariolc kernel: [ 32.698792] DR3: 0000000000000000
DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 3 05:51:43 ariolc kernel: [ 32.698792] Process modprobe (pid:
4652, threadinfo ffff88011e75e000, task ffff88011d114150)
Jul 3 05:51:43 ariolc kernel: [ 32.698792] Stack:
Jul 3 05:51:43 ariolc kernel: [ 32.698792] 0000000000000000
ffffc90010729c97 0000000000000008 ffff88011e574800
Jul 3 05:51:43 ariolc kernel: [ 32.698792] <0> ffff88011e574aa0
ffffffff8108c27b ffffffffa0018920 ffffc900000000d0
Jul 3 05:51:43 ariolc kernel: [ 32.698792] <0> ffffffffa0018920
ffffc90010728000 ffffc90010729d68 ffffffff81064abb
Jul 3 05:51:43 ariolc kernel: [ 32.708636] Call Trace:
Jul 3 05:51:43 ariolc kernel: [ 32.708636] [<ffffffff8108c27b>] ?
kstrdup+0x3b/0x70
Jul 3 05:51:43 ariolc kernel: [ 32.711488] [<ffffffff81064abb>] ?
load_module+0x13eb/0x1730
Jul 3 05:51:43 ariolc kernel: [ 32.711488] [<ffffffff81064e7b>] ?
sys_init_module+0x7b/0x260
Jul 3 05:51:43 ariolc kernel: [ 32.711488] [<ffffffff810024ab>] ?
system_call_fastpath+0x16/0x1b
Jul 3 05:51:43 ariolc kernel: [ 32.716465] Code: 23 25 dc 47 6f 00
41 f6 c4 10 75 66 9c 5d fa 65 48 8b 14 25 a8 d1 00 00 48 8b 03 48 8d
04 02 4c 8b 28 4d 85 ed 74 55 48 63 53 18 <49> 8b 54 15 00 48 89 10 55
9d 4d 85 ed 74 06 66 45 85 e4 78 22
Jul 3 05:51:43 ariolc kernel: [ 32.718865] RIP
[<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110
Jul 3 05:51:43 ariolc kernel: [ 32.718865] RSP <ffff88011e75fe08>
Jul 3 05:51:43 ariolc kernel: [ 32.718865] CR2: 0000000000000003
Jul 3 05:51:43 ariolc kernel: [ 32.718865] ---[ end trace
692101747f991cfb ]---

Two other OOPSen in __kmalloc() followed this one.

I also tried unsetting CONFIG_NO_BOOTMEM instead of using CONFIG_NO_BOOTMEM=y.
That kernel froze before userspace was started, and I did not see any
OOPS output.

Today I tried the vanilla 2.6.34.1 (again with CONFIG_NO_BOOTMEM=y).
The vanilla kernel also crashed before userspace, again in
__kmalloc(), but with a visible OOPS.
I wrote the following information down:
The OOPS was: BUG: unable to handle kernel NULL pointer dereference at
0000000000000003
The call chain started with:
ffffffff810aab39 : __kmalloc_track_caller+0x69/0x110
ffffffff8108c23b : kstrdup+0x3b/0x70
called from sysfs_new_dirent
There were no modules loaded at this time; the faulting process was
Pid: 1, comm: swapper

From System.map:
ffffffff810aa910 t get_slab
ffffffff810aa980 T __kmalloc_node_track_caller
ffffffff810aaad0 T __kmalloc_track_caller
ffffffff810aabe0 T __kmalloc
Dump of assembler code from 0xffffffff810aaad0 to 0xffffffff810aabe0:
0xffffffff810aaad0: sub $0x28,%rsp
0xffffffff810aaad4: cmp $0x2000,%rdi
0xffffffff810aaadb: mov %r12,0x10(%rsp)
0xffffffff810aaae0: mov %r14,0x20(%rsp)
0xffffffff810aaae5: mov %esi,%r12d
0xffffffff810aaae8: mov %rbx,(%rsp)
0xffffffff810aaaec: mov %rbp,0x8(%rsp)
0xffffffff810aaaf1: mov %rdx,%r14
0xffffffff810aaaf4: mov %r13,0x18(%rsp)
0xffffffff810aaaf9: ja 0xffffffff810aaba3
0xffffffff810aaaff: callq 0xffffffff810aa910
0xffffffff810aab04: cmp $0x10,%rax
0xffffffff810aab08: mov %rax,%rbx
0xffffffff810aab0b: jbe 0xffffffff810aab51
0xffffffff810aab0d: and 0x6f48ac(%rip),%r12d # 0xffffffff8179f3c0
0xffffffff810aab14: test $0x10,%r12b
0xffffffff810aab18: jne 0xffffffff810aab80
0xffffffff810aab1a: pushfq
0xffffffff810aab1b: pop %rbp
0xffffffff810aab1c: cli
0xffffffff810aab1d: mov %gs:0xd1a8,%rdx
0xffffffff810aab26: mov (%rbx),%rax
0xffffffff810aab29: lea (%rdx,%rax,1),%rax
0xffffffff810aab2d: mov (%rax),%r13
0xffffffff810aab30: test %r13,%r13
0xffffffff810aab33: je 0xffffffff810aab8a
0xffffffff810aab35: movslq 0x18(%rbx),%rdx
0xffffffff810aab39: mov 0x0(%r13,%rdx,1),%rdx
0xffffffff810aab3e: mov %rdx,(%rax)
0xffffffff810aab41: push %rbp
0xffffffff810aab42: popfq
0xffffffff810aab43: test %r13,%r13
0xffffffff810aab46: je 0xffffffff810aab4e
0xffffffff810aab48: test %r12w,%r12w
0xffffffff810aab4c: js 0xffffffff810aab70
0xffffffff810aab4e: mov %r13,%rax
0xffffffff810aab51: mov (%rsp),%rbx
0xffffffff810aab55: mov 0x8(%rsp),%rbp
0xffffffff810aab5a: mov 0x10(%rsp),%r12
0xffffffff810aab5f: mov 0x18(%rsp),%r13
0xffffffff810aab64: mov 0x20(%rsp),%r14
0xffffffff810aab69: add $0x28,%rsp
0xffffffff810aab6d: retq
0xffffffff810aab6e: xchg %ax,%ax
0xffffffff810aab70: movslq 0x14(%rbx),%rdx
0xffffffff810aab74: xor %esi,%esi
0xffffffff810aab76: mov %r13,%rdi
0xffffffff810aab79: callq 0xffffffff811f51e0
0xffffffff810aab7e: jmp 0xffffffff810aab4e
0xffffffff810aab80: callq 0xffffffff814cd640
0xffffffff810aab85: nopl (%rax)
0xffffffff810aab88: jmp 0xffffffff810aab1a
0xffffffff810aab8a: mov %rax,%r8
0xffffffff810aab8d: mov %r14,%rcx
0xffffffff810aab90: or $0xffffffffffffffff,%edx
0xffffffff810aab93: mov %r12d,%esi
0xffffffff810aab96: mov %rbx,%rdi
0xffffffff810aab99: callq 0xffffffff810a9ae0
0xffffffff810aab9e: mov %rax,%r13
0xffffffff810aaba1: jmp 0xffffffff810aab41
0xffffffff810aaba3: dec %rdi
0xffffffff810aaba6: or $0xffffffffffffffff,%esi
0xffffffff810aaba9: shr $0xb,%rdi
0xffffffff810aabad: inc %esi
0xffffffff810aabaf: shr %rdi
0xffffffff810aabb2: jne 0xffffffff810aabad
0xffffffff810aabb4: mov %r12d,%edi
0xffffffff810aabb7: mov (%rsp),%rbx
0xffffffff810aabbb: mov 0x8(%rsp),%rbp
0xffffffff810aabc0: mov 0x10(%rsp),%r12
0xffffffff810aabc5: mov 0x18(%rsp),%r13
0xffffffff810aabca: or $0x4000,%edi
0xffffffff810aabd0: mov 0x20(%rsp),%r14
0xffffffff810aabd5: add $0x28,%rsp
0xffffffff810aabd9: jmpq 0xffffffff81080920
0xffffffff810aabde: xchg %ax,%ax

From this assembly, I would guess it's this line in slub.c / slab_alloc():
c->freelist = get_freepointer(s, object);
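
To make that guess concrete, here is a minimal user-space sketch of the address calculation behind the faulting load. It assumes that get_freepointer() simply reads a pointer stored at object + s->offset, which is what `mov 0x0(%r13,%rdx,1),%rdx` in the dump appears to do. The struct and function names are simplified stand-ins invented for this illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down model of the one field the load depends on. */
struct cache_model {
	int offset;	/* byte offset of the free pointer inside an object */
};

/*
 * Address the fast path would read the next-free pointer from:
 * object + s->offset (R13 + RDX in the register dump).
 */
static uintptr_t freepointer_address(const struct cache_model *s,
				     uintptr_t object)
{
	assert(s != NULL);
	return object + (uintptr_t)s->offset;
}
```

Calling freepointer_address() with object = 0x3 (R13 in the OOPS) and offset = 0 (RDX) gives 0x3, the same value as CR2. No valid object pointer would be that small, which would be consistent with the freelist head itself having been overwritten before this allocation.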

A short test with 2.6.35-rc4 suggests that this problem has been fixed
on master, although 2.6.35-rc4 only boots with radeon.modeset=0. With
KMS enabled the display turns off and the system does not even respond
to SysRq+B.
(I will report this KMS issue in another mail.)

The system is an AMD RS690 with an Athlon X2 BE-2400.
Under 2.6.33 the system is perfectly stable, KMS is working and enabled.

Any guesses what might cause this?

Thanks for looking at this,
Torsten


2010-07-20 20:13:13

by Pekka Enberg

Subject: Re: Regression 2.6.33->2.6.34: OOPS at boot, kmalloc corruption?

Hi Torsten,

On Sun, Jul 11, 2010 at 9:55 PM, Torsten Kaiser wrote:
> Trying to upgrade my system from 2.6.33 to 2.6.34, I can't get it to boot.
>
> All tries used CONFIG_SLUB=y
>
> The gentoo version of 2.6.34 generated an OOPS during network
> initialization and then came to a stop. (It seemed that all processes
> got stuck waiting on some locks.)
> As in this instance the system was able to start the syslog, I was
> able to capture the complete OOPS:
> Jul 3 05:51:43 ariolc kernel: [ 32.674367] BUG: unable to handle
> kernel NULL pointer dereference at 0000000000000003
> Jul 3 05:51:43 ariolc kernel: [ 32.675674] IP: [<ffffffff810aab89>]
> __kmalloc_track_caller+0x69/0x110
> Jul 3 05:51:43 ariolc kernel: [ 32.676951] PGD 11e7e5067 PUD 11fd3d067 PMD 0
> Jul 3 05:51:43 ariolc kernel: [ 32.678224] Oops: 0000 [#1] SMP
> Jul 3 05:51:43 ariolc kernel: [ 32.679477] last sysfs file:
> /sys/devices/virtual/block/md0/md/metadata_version
> Jul 3 05:51:43 ariolc kernel: [ 32.680745] CPU 1
> Jul 3 05:51:43 ariolc kernel: [ 32.680761] Modules linked in:
> aes_x86_64(+) aes_generic sg
> Jul 3 05:51:43 ariolc kernel: [ 32.682764]
> Jul 3 05:51:43 ariolc kernel: [ 32.682764] Pid: 4652, comm:
> modprobe Not tainted 2.6.34-gentoo-r1 #1 MS-7368/MS-7368
> Jul 3 05:51:43 ariolc kernel: [ 32.682764] RIP:
> 0010:[<ffffffff810aab89>] [<ffffffff810aab89>]
> __kmalloc_track_caller+0x69/0x110
> Jul 3 05:51:43 ariolc kernel: [ 32.682764] RSP:
> 0018:ffff88011e75fe08 EFLAGS: 00010006
> Jul 3 05:51:43 ariolc kernel: [ 32.687268] RAX: ffff880001b0f088
> RBX: ffffffff8170d4d0 RCX: ffff88011e574b80
> Jul 3 05:51:43 ariolc kernel: [ 32.688564] RDX: 0000000000000000
> RSI: 00000000000000d0 RDI: 00000000000002d0
> Jul 3 05:51:43 ariolc kernel: [ 32.688564] RBP: 0000000000000296
> R08: 0000000000000014 R09: ffff88011e574800
> Jul 3 05:51:43 ariolc kernel: [ 32.691414] R10: 0000000000000001
> R11: ffff880001a12008 R12: 00000000000000d0
> Jul 3 05:51:43 ariolc kernel: [ 32.691414] R13: 0000000000000003
> R14: ffffffff81064abb R15: ffffc90010729d68
> Jul 3 05:51:43 ariolc kernel: [ 32.691414] FS:
> 00007f0a9acb8700(0000) GS:ffff880001b00000(0000)
> knlGS:0000000000000000
> Jul 3 05:51:43 ariolc kernel: [ 32.691414] CS: 0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
> Jul 3 05:51:43 ariolc kernel: [ 32.697212] CR2: 0000000000000003
> CR3: 000000011d03e000 CR4: 00000000000006e0
> Jul 3 05:51:43 ariolc kernel: [ 32.698792] DR0: 0000000000000000
> DR1: 0000000000000000 DR2: 0000000000000000
> Jul 3 05:51:43 ariolc kernel: [ 32.698792] DR3: 0000000000000000
> DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Jul 3 05:51:43 ariolc kernel: [ 32.698792] Process modprobe (pid:
> 4652, threadinfo ffff88011e75e000, task ffff88011d114150)
> Jul 3 05:51:43 ariolc kernel: [ 32.698792] Stack:
> Jul 3 05:51:43 ariolc kernel: [ 32.698792] 0000000000000000
> ffffc90010729c97 0000000000000008 ffff88011e574800
> Jul 3 05:51:43 ariolc kernel: [ 32.698792] <0> ffff88011e574aa0
> ffffffff8108c27b ffffffffa0018920 ffffc900000000d0
> Jul 3 05:51:43 ariolc kernel: [ 32.698792] <0> ffffffffa0018920
> ffffc90010728000 ffffc90010729d68 ffffffff81064abb
> Jul 3 05:51:43 ariolc kernel: [ 32.708636] Call Trace:
> Jul 3 05:51:43 ariolc kernel: [ 32.708636] [<ffffffff8108c27b>] ?
> kstrdup+0x3b/0x70
> Jul 3 05:51:43 ariolc kernel: [ 32.711488] [<ffffffff81064abb>] ?
> load_module+0x13eb/0x1730
> Jul 3 05:51:43 ariolc kernel: [ 32.711488] [<ffffffff81064e7b>] ?
> sys_init_module+0x7b/0x260
> Jul 3 05:51:43 ariolc kernel: [ 32.711488] [<ffffffff810024ab>] ?
> system_call_fastpath+0x16/0x1b
> Jul 3 05:51:43 ariolc kernel: [ 32.716465] Code: 23 25 dc 47 6f 00
> 41 f6 c4 10 75 66 9c 5d fa 65 48 8b 14 25 a8 d1 00 00 48 8b 03 48 8d
> 04 02 4c 8b 28 4d 85 ed 74 55 48 63 53 18 <49> 8b 54 15 00 48 89 10 55
> 9d 4d 85 ed 74 06 66 45 85 e4 78 22
> Jul 3 05:51:43 ariolc kernel: [ 32.718865] RIP
> [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110
> Jul 3 05:51:43 ariolc kernel: [ 32.718865] RSP <ffff88011e75fe08>
> Jul 3 05:51:43 ariolc kernel: [ 32.718865] CR2: 0000000000000003
> Jul 3 05:51:43 ariolc kernel: [ 32.718865] ---[ end trace
> 692101747f991cfb ]---
>
> Two other OOPSen in __kmalloc() followed this one.
>
> I also tried unsetting CONFIG_NO_BOOTMEM instead of using CONFIG_NO_BOOTMEM=y.
> That kernel froze before userspace was started, and I did not see any
> OOPS output.
>
> Today I tried the vanilla 2.6.34.1 (again with CONFIG_NO_BOOTMEM=y).
> The vanilla kernel also crashed before userspace, again in
> __kmalloc(), but with a visible OOPS.
> I wrote the following information down:
> The OOPS was: BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000003
> The call chain started with:
> ffffffff810aab39 : __kmalloc_track_caller+0x69/0x110
> ffffffff8108c23b : kstrdup+0x3b/0x70
> called from sysfs_new_dirent
> There were no modules loaded at this time; the faulting process was
> Pid: 1, comm: swapper

[snip]

> From this assembly, I would guess it's this line in slub.c / slab_alloc():
> c->freelist = get_freepointer(s, object);
>
> A short test with 2.6.35-rc4 suggests that this problem has been fixed
> on master, although 2.6.35-rc4 only boots with radeon.modeset=0. With
> KMS enabled the display turns off and the system does not even respond
> to SysRq+B.
> (I will report this KMS issue in another mail.)
>
> The system is an AMD RS690 with an Athlon X2 BE-2400.
> Under 2.6.33 the system is perfectly stable, KMS is working and enabled.
>
> Any guesses what might cause this?

It's slab corruption that can be caused by many things. Can you please
try to reproduce with CONFIG_SLUB_DEBUG_ON=y?

2010-07-20 20:23:45

by Christoph Lameter

Subject: Re: Regression 2.6.33->2.6.34: OOPS at boot, kmalloc corruption?

On Tue, 20 Jul 2010, Pekka Enberg wrote:

> It's slab corruption that can be caused by many things. Can you please
> try to reproduce with CONFIG_SLUB_DEBUG_ON=y?

Or simply reboot and add a parameter slub_debug to the other parameters.

2010-07-31 10:06:25

by Torsten Kaiser

Subject: Re: Regression 2.6.33->2.6.34: OOPS at boot, kmalloc corruption?

On Tue, Jul 20, 2010 at 10:19 PM, Christoph Lameter wrote:
> On Tue, 20 Jul 2010, Pekka Enberg wrote:
>
>> It's slab corruption that can be caused by many things. Can you please
>> try to reproduce with CONFIG_SLUB_DEBUG_ON=y?
>
> Or simply reboot and add a parameter slub_debug to the other parameters.

I finally had the opportunity to reboot this system again.

CONFIG_SLUB_DEBUG=y was set, so I tried adding slub_debug to the command line.

With slub_debug added the system boots normally, and I could not see any
errors in the syslog. When I removed slub_debug it crashed again
before reaching userspace.
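
One possible reason the crash disappears with slub_debug, offered only as an assumption that nothing in this thread confirms: with poisoning enabled, SLUB keeps the free pointer after the object payload instead of at offset 0, so a stray write to the start of a freed object no longer damages the freelist. The user-space sketch below models only that layout difference; all names and sizes in it are invented for the illustration and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OBJ_SIZE 32	/* payload size, arbitrary for this sketch */

/* A freed object plus room for an out-of-line free pointer. */
struct slot_model {
	unsigned char bytes[OBJ_SIZE + sizeof(void *)];
};

/* Store the "next free" pointer at the given offset inside the slot. */
static void set_freepointer(struct slot_model *slot, size_t offset, void *next)
{
	assert(offset + sizeof(void *) <= sizeof(slot->bytes));
	memcpy(slot->bytes + offset, &next, sizeof(next));
}

/* Read the "next free" pointer back from the given offset. */
static void *get_freepointer(const struct slot_model *slot, size_t offset)
{
	void *next;

	assert(offset + sizeof(void *) <= sizeof(slot->bytes));
	memcpy(&next, slot->bytes + offset, sizeof(next));
	return next;
}

/* Model a stray write that flips every bit of a freed object's first byte. */
static void stray_write(struct slot_model *slot)
{
	slot->bytes[0] ^= 0xFF;
}

/* Returns 1 if the free pointer survives the stray write, else 0. */
static int freepointer_survives(size_t offset)
{
	static int dummy_next;
	struct slot_model slot;

	memset(&slot, 0, sizeof(slot));
	set_freepointer(&slot, offset, &dummy_next);
	stray_write(&slot);
	return get_freepointer(&slot, offset) == (void *)&dummy_next;
}
```

In this model freepointer_survives(0) returns 0 (in-object free pointer, corrupted) and freepointer_survives(OBJ_SIZE) returns 1 (free pointer stored after the payload, intact). If something like this is what happens here, slub_debug would be hiding the corruption rather than preventing it.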

After the KMS fixes from Alex Deucher, vanilla kernel 2.6.35-rc6 works
for me. So I would think my problems with earlier 2.6.35-rcs were
just these KMS errors, and this kmalloc problem has already been fixed
in mainline.

So I have switched this system to 2.6.35-rc6 and will stay with this kernel.

Thanks, Torsten