2006-01-16 18:16:19

by Diego Calleja

Subject: Oops with current linus' git tree

I'm having two noticeable problems with Linus' current tree:

1) An oops while watching a DVD with kaffeine (a KDE-based video player);
the oops is pasted below.

2) This is a dual p3 machine, but only one CPU is being used to
run processes on it. CPU #1 is detected etc, but processes will
be scheduled only on CPU #0. /proc/interrupts shows that CPU #1 is
still used to service interrupts. I'm able to force processes to run
on that CPU with taskset but it won't happen automatically like it
usually does. dmesg here: http://terra.es/personal/diegocg/dmesg
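
For reference, forcing a process onto a CPU the way taskset does comes down to the sched_setaffinity() system call. The following is a minimal sketch, assuming Linux with glibc; the helper names pin_self_to_cpu and first_allowed_cpu are made up for illustration and do not come from the report.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to one CPU.
 * Returns 0 on success, -1 on failure (errno is set by the kernel). */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t mask;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(mask), &mask);
}

/* Hypothetical helper: lowest-numbered CPU the calling thread is
 * currently allowed to run on, or -1 if the mask cannot be read. */
static int first_allowed_cpu(void)
{
    cpu_set_t mask;

    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    return -1;
}
```

Calling pin_self_to_cpu(1) from a busy process would be one way to confirm that CPU #1 can run tasks when asked to, which is the behaviour described above for taskset.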


Jan 16 18:04:07 estel kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000040
Jan 16 18:04:07 estel kernel: printing eip:
Jan 16 18:04:07 estel kernel: c0147a2e
Jan 16 18:04:07 estel kernel: *pde = 00000000
Jan 16 18:04:07 estel kernel: Oops: 0000 [#1]
Jan 16 18:04:07 estel kernel: PREEMPT SMP
Jan 16 18:04:07 estel kernel: Modules linked in: radeon ipt_REJECT xt_tcpudp lp ipt_MASQUERADE iptable_nat ip_nat ip_conntrack iptable_filter
ip_tables x_tables usbhid ohci_hcd usbcore parport_pc parport floppy pcspkr ide_cd cdrom unix
Jan 16 18:04:07 estel kernel: CPU: 0
Jan 16 18:04:07 estel kernel: EIP: 0060:[find_get_page+46/96] Not tainted VLI
Jan 16 18:04:07 estel kernel: EFLAGS: 00010002 (2.6.15)
Jan 16 18:04:07 estel kernel: EIP is at find_get_page+0x2e/0x60
Jan 16 18:04:07 estel kernel: eax: 00000040 ebx: 00000040 ecx: 00000000 edx: 00000003
Jan 16 18:04:07 estel kernel: esi: 0003352c edi: c1a2b178 ebp: c1a2b168 esp: c2b09e20
Jan 16 18:04:07 estel kernel: ds: 007b es: 007b ss: 0068
Jan 16 18:04:07 estel kernel: Process kaffeine (pid: 2164, threadinfo=c2b09000 task=e3ffc050)
Jan 16 18:04:07 estel kernel: Stack: <0>00001000 0003352c 0003352c c01491b9 00001000 0000002c f40a3cc0 f40a3d0c
Jan 16 18:04:07 estel kernel: c1a2b0b4 0003352c 000be343 00000000 0003354a 0003353e 0003352b be344000
Jan 16 18:04:07 estel kernel: 00000000 00000000 00001000 00033521 00000020 00000000 00000000 0003353d
Jan 16 18:04:07 estel kernel: Call Trace:
Jan 16 18:04:07 estel kernel: [do_generic_mapping_read+409/1200] do_generic_mapping_read+0x199/0x4b0
Jan 16 18:04:07 estel kernel: [file_read_actor+0/240] file_read_actor+0x0/0xf0
Jan 16 18:04:07 estel kernel: [__generic_file_aio_read+367/576] __generic_file_aio_read+0x16f/0x240
Jan 16 18:04:07 estel kernel: [file_read_actor+0/240] file_read_actor+0x0/0xf0
Jan 16 18:04:07 estel kernel: [unqueue_me+106/176] unqueue_me+0x6a/0xb0
Jan 16 18:04:07 estel kernel: [generic_file_read+152/192] generic_file_read+0x98/0xc0
Jan 16 18:04:07 estel kernel: [autoremove_wake_function+0/80] autoremove_wake_function+0x0/0x50
Jan 16 18:04:07 estel kernel: [default_wake_function+0/16] default_wake_function+0x0/0x10
Jan 16 18:04:07 estel kernel: [vfs_read+161/352] vfs_read+0xa1/0x160
Jan 16 18:04:07 estel kernel: [generic_file_read+0/192] generic_file_read+0x0/0xc0
Jan 16 18:04:07 estel kernel: [sys_read+65/112] sys_read+0x41/0x70
Jan 16 18:04:07 estel kernel: [sysenter_past_esp+84/117] sysenter_past_esp+0x54/0x75
Jan 16 18:04:07 estel kernel: Code: 89 7c 24 08 8d 78 10 89 1c 24 89 c3 89 f8 89 74 24 04 89 d6 83 c3 04 e8 81 a1 1d 00 89 d8 89 f2 e8 68 83
08 00 85 c0 89 c3 74 0d <8b> 00 89 da f6 c4 40 75 1c f0 ff 42 04 89 f8 e8 be a5 1d 00 89
Jan 16 18:04:07 estel kernel: <6>note: kaffeine[2164] exited with preempt_count 1


2006-01-17 04:21:03

by Nick Piggin

Subject: Re: Oops with current linus' git tree

Diego Calleja wrote:
> I'm having two noticeable problems with Linus' current tree:
>
> 1) An oops while watching a DVD with kaffeine (a KDE-based video player);
> the oops is pasted below.
>

From your oops it looks as though the radix_tree_lookup in find_get_page
has returned 0x40. It could be a flipped bit - is your memory OK?
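
The single-bit hypothesis can be checked mechanically. In the oops, the faulting address and eax are both 00000040, and the marked instruction bytes <8b> 00 decode to a load through eax, so the bad value is the pointer being dereferenced. A minimal sketch in plain C (the function name single_flipped_bit is hypothetical) that reports which bit, if exactly one, separates an observed value from the expected one:

```c
#include <stdint.h>

/* Hypothetical helper: if `observed` differs from `expected` in exactly
 * one bit, return that bit's index (0 = least significant).
 * Return -1 if the values are equal or differ in more than one bit. */
static int single_flipped_bit(uint32_t expected, uint32_t observed)
{
    uint32_t diff = expected ^ observed;
    int bit = 0;

    /* A power of two has exactly one bit set: diff & (diff - 1) == 0. */
    if (diff == 0 || (diff & (diff - 1)) != 0)
        return -1;
    while ((diff >>= 1) != 0)
        bit++;
    return bit;
}
```

With expected = 0 (NULL) and observed = 0x40 this reports bit 6, which is why a flipped bit in an otherwise empty slot is a plausible reading of the oops. It does not rule out a software bug that stored the value.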

Can you apply the attached patch and try to reproduce the oops?

> 2) This is a dual p3 machine, but only one CPU is being used to
> run processes on it. CPU #1 is detected etc, but processes will
> be scheduled only on CPU #0. /proc/interrupts shows that CPU #1 is
> still used to service interrupts. I'm able to force processes to run
> on that CPU with taskset but it won't happen automatically like it
> usually does. dmesg here: http://terra.es/personal/diegocg/dmesg
>

What happens if you run several infinite loops to increase the load?
Does everything still stay on CPU0?

>
> Jan 16 18:04:07 estel kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000040

Thanks,
Nick

--
SUSE Labs, Novell Inc.


2006-01-17 04:24:28

by Nick Piggin

Subject: Re: Oops with current linus' git tree

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c	2006-01-03 19:05:57.000000000 +1100
+++ linux-2.6/lib/radix-tree.c	2006-01-17 15:17:36.000000000 +1100
@@ -233,6 +233,8 @@ int radix_tree_insert(struct radix_tree_
 	int offset;
 	int error;

+	BUG_ON((unsigned long)item < PAGE_OFFSET);
+
 	/* Make sure the tree is high enough. */
 	if ((!index && !root->rnode) ||
 			index > radix_tree_maxindex(root->height)) {
@@ -334,6 +336,8 @@ void *radix_tree_lookup(struct radix_tre
 	void **slot;

 	slot = __lookup_slot(root, index);
+	if (slot && *slot)
+		BUG_ON((unsigned long)(*slot) < PAGE_OFFSET);
 	return slot != NULL ? *slot : NULL;
 }
 EXPORT_SYMBOL(radix_tree_lookup);


Attachments:
radix-tree-debug.patch (766.00 B)
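
The idea behind the two BUG_ON() checks in the patch above can be shown in userspace. This is only a sketch under a stated assumption: 32-bit x86 with the default 3G/1G split, where PAGE_OFFSET is 0xC0000000 and any valid kernel pointer stored in the tree must be at or above it. The names ASSUMED_PAGE_OFFSET and looks_like_kernel_pointer are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: i386 default split. The real kernel value depends on
 * the configuration of the kernel being debugged. */
#define ASSUMED_PAGE_OFFSET 0xC0000000u

/* Hypothetical helper: true if `value` lies in the kernel part of the
 * address space, i.e. it would pass the patch's BUG_ON() condition. */
static bool looks_like_kernel_pointer(uint32_t value)
{
    return value >= ASSUMED_PAGE_OFFSET;
}
```

Under that assumption the value 0x40 from the oops fails the check, while register values such as c1a2b178 pass. That is the point of the patch: a bad item is caught at insert or lookup time rather than at a later dereference.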

2006-01-17 13:17:44

by Diego Calleja

Subject: Re: Oops with current linus' git tree

On Tue, 17 Jan 2006 15:20:36 +1100,
Nick Piggin <[email protected]> wrote:

> From your oops it looks as though the radix_tree_lookup in find_get_page
> has returned 0x40. It could be a flipped bit - is your memory OK?

It's ECC memory, so I doubt it.


> Can you apply the attached patch and try to reproduce the oops?

You're saying that I'll have to spend the whole afternoon watching
DVDs? Well, if the Linux kernel needs it!


> What happens if you run several infinite loops to increase the load?
> Does everything still stay on CPU0?

Yes, I ran several "cat /dev/zero > /dev/null &" and they all stayed on
CPU #0.

I did a bisection search and couldn't find the culprit; apparently
it is caused by a config option. It now works fine after switching off
CONFIG_HOTPLUG_CPU and some ACPI options. Also, when it didn't work,
the CPU that got all the processes could be CPU #0 or #1; it
changed randomly from boot to boot.

2006-01-18 00:20:50

by Diego Calleja

Subject: Re: Oops with current linus' git tree

On Tue, 17 Jan 2006 14:17:25 +0100,
Diego Calleja <[email protected]> wrote:

> > Can you apply the attached patch and try to reproduce the oops?
>
> You're saying that I'll have to spend the whole afternoon watching
> DVDs? Well, if the Linux kernel needs it!


I've been running kaffeine for hours and didn't trigger it; it's
hard to reproduce :/

I'll keep trying to hit it; even if it was a hardware error,
it should happen again!

2006-01-18 03:25:56

by Nick Piggin

Subject: Re: Oops with current linus' git tree

Diego Calleja wrote:
> On Tue, 17 Jan 2006 15:20:36 +1100,
> Nick Piggin <[email protected]> wrote:
>

>>What happens if you run several infinite loops to increase the load?
>>Does everything still stay on CPU0?
>
>
> Yes, I ran several "cat /dev/zero > /dev/null &" and they all stayed on
> CPU #0.
>
> I did a bisection search and couldn't find the culprit; apparently
> it is caused by a config option. It now works fine after switching off
> CONFIG_HOTPLUG_CPU and some ACPI options. Also, when it didn't work,
> the CPU that got all the processes could be CPU #0 or #1; it
> changed randomly from boot to boot.
>

If you can report those configuration options and the symptoms in a
new thread to lkml that would be helpful. Also if you can work out
when it started happening, that helps too.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2006-01-18 03:48:34

by Nick Piggin

Subject: Re: Oops with current linus' git tree

Diego Calleja wrote:
> On Tue, 17 Jan 2006 14:17:25 +0100,
> Diego Calleja <[email protected]> wrote:
>
>
>>>Can you apply the attached patch and try to reproduce the oops?
>>
>>You're saying that I'll have to spend the whole afternoon watching
>>DVDs? Well, if the Linux kernel needs it!
>
>
>
> I've been running kaffeine for hours and didn't trigger it; it's
> hard to reproduce :/
>

That's what I feared. Thanks for trying though.

> I'll keep trying to hit it; even if it was a hardware error,
> it should happen again!
>

Yeah, if it is a hardware error it is unlikely to hit the same place,
but if it is a rare bug then hopefully that check will catch it.

--
SUSE Labs, Novell Inc.

2006-01-18 14:03:14

by Diego Calleja

Subject: Re: Oops with current linus' git tree

On Wed, 18 Jan 2006 14:25:30 +1100,
Nick Piggin <[email protected]> wrote:


> If you can report those configuration options and the symptoms in a
> new thread to lkml that would be helpful. Also if you can work out
> when it started happening, that helps too.


It's CONFIG_ACPI_PROCESSOR that triggers it: when compiled as a module
everything works, but when built into the kernel one of the two
CPUs doesn't get any processes scheduled. I'll open a new bug report.

2006-01-19 19:32:42

by Diego Calleja

Subject: Re: Oops with current linus' git tree

On Wed, 18 Jan 2006 14:23:09 +1100,
Nick Piggin <[email protected]> wrote:

> > I've been running kaffeine for hours and didn't trigger it; it's
> > hard to reproduce :/
> >
>
> That's what I feared. Thanks for trying though.

OK, I've got another oops, this time when closing amarok. It doesn't seem
to hit your debug checks, but I thought it could be related. After the first
oops I enabled ECC event logging in the BIOS and it hasn't recorded
anything, so I doubt the problem is faulty RAM (this is plain 2.6.16-rc1).



Eeek! page_mapcount(page) went negative! (-1)
page->flags = 400
page->count = 1
page->mapping = 00000000
------------[ cut here ]------------
kernel BUG at mm/rmap.c:524!
invalid opcode: 0000 [#1]
PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: ipt_REJECT xt_tcpudp radeon lp thermal fan button processor ac ipt_MASQUERADE iptable_nat ip_nat ip_conntrack iptable_filter ip_tables x_tables usbhid ohci_hcd parport_pc parport usbcore pcspkr floppy ide_cd e100 cdrom unix
CPU: 0
EIP: 0060:[<c014ac11>] Not tainted VLI
EFLAGS: 00010286 (2.6.16-rc1)
EIP is at page_remove_rmap+0x67/0x81
eax: ffffffff ebx: c1000000 ecx: c03214f8 edx: 00000001
esi: 00000000 edi: b6208000 ebp: e7172ee8 esp: e7172ee4
ds: 007b es: 007b ss: 0068
Process amarokapp (pid: 5475, threadinfo=e7172000 task=e7197ac0)
Stack: <0>c1000000 e7172f44 c0145bcd b60ff000 de9832b0 e7172f64 00005ef9 00000000
00000000 b62cc000 ee834b60 ee834b60 ee834b60 ee818e54 ffffffff ffffffff
debdf820 c170a680 ee818e04 b62cc000 00000000 c170a680 ee818e04 e44bf440
Call Trace:
[<c0103e58>] show_stack_log_lvl+0xaa/0xb5
[<c0103f95>] show_registers+0x132/0x19e
[<c01042ca>] die+0x168/0x1ed
[<c010454e>] do_trap+0x7c/0x96
[<c01047ad>] do_invalid_op+0x89/0x93
[<c01038e3>] error_code+0x4f/0x54
[<c0145bcd>] unmap_vmas+0x22d/0x487
[<c0148900>] unmap_region+0x92/0x116
[<c0148e97>] do_munmap+0x144/0x19a
[<c0148f3b>] sys_munmap+0x4e/0x67
[<c0102d87>] sysenter_past_esp+0x54/0x75
Code: 40 74 03 8b 53 0c 8b 42 04 40 50 68 0f 02 2e c0 e8 52 34 fd ff ff 73 10 68 26 02 2e c0 e8 45 34 fd ff 83 c4 10 8b 43 08 40 79 08 <0f> 0b 0c 02 bb 01 2e c0 83 ca ff b8 10 00 00 00 e8 e7 45 ff ff
<3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():0
[<c010401d>] show_trace+0xd/0xf
[<c0104034>] dump_stack+0x15/0x17
[<c0116ce5>] __might_sleep+0x86/0x90
[<c011e6f3>] profile_task_exit+0x1b/0x47
[<c011f96a>] do_exit+0x1c/0x72e
[<c010434f>] do_simd_coprocessor_error+0x0/0x183
[<c010454e>] do_trap+0x7c/0x96
[<c01047ad>] do_invalid_op+0x89/0x93
[<c01038e3>] error_code+0x4f/0x54
[<c0145bcd>] unmap_vmas+0x22d/0x487
[<c0148900>] unmap_region+0x92/0x116
[<c0148e97>] do_munmap+0x144/0x19a
[<c0148f3b>] sys_munmap+0x4e/0x67
[<c0102d87>] sysenter_past_esp+0x54/0x75
note: amarokapp[5475] exited with preempt_count 2
scheduling while atomic: amarokapp/0x00000002/5475
[<c010401d>] show_trace+0xd/0xf
[<c0104034>] dump_stack+0x15/0x17
[<c02c1b31>] schedule+0x43/0x7d1
[<c02c3708>] rwsem_down_read_failed+0x166/0x185
[<c0132cad>] .text.lock.futex+0x73/0xe6
[<c0132c2b>] sys_futex+0xa2/0xb1
[<c011b277>] mm_release+0x5a/0x65
[<c011eeca>] exit_mm+0x16/0x139
[<c011face>] do_exit+0x180/0x72e
[<c010434f>] do_simd_coprocessor_error+0x0/0x183
[<c010454e>] do_trap+0x7c/0x96
[<c01047ad>] do_invalid_op+0x89/0x93
[<c01038e3>] error_code+0x4f/0x54
[<c0145bcd>] unmap_vmas+0x22d/0x487
[<c0148900>] unmap_region+0x92/0x116
[<c0148e97>] do_munmap+0x144/0x19a
[<c0148f3b>] sys_munmap+0x4e/0x67
[<c0102d87>] sysenter_past_esp+0x54/0x75
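
As background to the "page_mapcount(page) went negative! (-1)" line in the report above, here is a simplified userspace model of the counting convention. It assumes the kernels of that era stored the per-page map count biased by -1, so that a stored -1 means "no mappings" and the reported count is the stored value plus one. The struct and function names are hypothetical. This is not the kernel's code, only a sketch of how one unmap too many produces a reported count of -1.

```c
#include <stdbool.h>

/* Hypothetical model of a page: raw_mapcount is -1 when unmapped. */
struct model_page {
    int raw_mapcount;
};

/* Reported mapping count: stored value plus one. */
static int model_page_mapcount(const struct model_page *page)
{
    return page->raw_mapcount + 1;
}

/* Record one more mapping of the page. */
static void model_add_rmap(struct model_page *page)
{
    page->raw_mapcount++;
}

/* Drop one mapping. Returns false when the count has gone negative,
 * which is the condition the kernel treats as a bug. */
static bool model_remove_rmap(struct model_page *page)
{
    page->raw_mapcount--;
    return model_page_mapcount(page) >= 0;
}
```

In this model, mapping a page once and unmapping it twice leaves a reported count of -1, the same value as in the log. That is consistent with a page being unmapped more often than it was mapped, or with its counter being corrupted. The model cannot distinguish between the two.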