2006-05-13 13:49:15

by Steinar H. Gunderson

[permalink] [raw]
Subject: Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

On Wed, Apr 26, 2006 at 01:58:09PM -0700, Andrew Morton wrote:
>> I tried this patch against 2.6.17-rc2 (I hoped that it might be fixing my
>> kswapd oopses too, as they seem related; see
>> http://lkml.org/lkml/2006/4/26/124 and followups), and it simply makes my
>> machine hang on bootup -- it seems to make modprobe hang forever on some lock
>> or something right after it loads raid6.ko (pulled in by evms_activate) in
>> initramfs. Without the patch, the machine boots just fine.
> It had a silly bug. Fixed version:

That worked fine for a long time (>14 days), but I had to switch motherboards
(from a cheap Epox to a Tyan server-board) due to external factors. Since
then, stuff has started panicing again wildly -- 2.6.17-rc2 doesn't even
boot, 2.6.17-rc3 and 2.6.17-rc4 lives for an hour or so and then gives up:

[ 1127.842645] Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
[ 1127.848117] <ffffffff803a1ae8>{__lock_text_start+0}
[ 1127.855474] PGD 5e38a067 PUD 5e39d067 PMD 0
[ 1127.859770] Oops: 0002 [1] SMP
[ 1127.862931] CPU 1
[ 1127.864957] Modules linked in: w83627hf_wdt eeprom ide_generic ide_disk ide_cd cdrom ipv6 psmouse i2c_nforce2 serio_raw pcspkr i2c_core parport_pc parport rtc e
xt3 jbd mbcahe raid6 raid5 xor raid10 raid1 raid0 linear mdd dm_mod sd_mod sata_nv sata_sil libata scsi_mod amd74xx generic forcedeth tg3 ide_core ohci_hcd ehci_hc
d thermal processor fan unix
[ 1127.896622] Pid: 0, comm: swapper Not tainted 2.6.17-rc4 #1
[ 1127.902191] RIP: 0010:[<ffffffff803a1ae8>] <ffffffff803a1ae8>{__lock_text_start+0}
[ 1127.909589] RSP: 0018:ffff81000245bd70 EFLAGS: 00010002
[ 1127.915094] RAX: 0000000000005d09 RBX: 0000000000005d09 RCX: ffff8100020446a8
[ 1127.922221] RDX: ffff81007e2ee800 RSI: ffff810002044648 RDI: 0000000000000040
[ 1127.929346] RBP: 0000000000005d09 R08: 0000000000000000 R09: 0000000000000000
[ 1127.936472] R10: 0000000000000001 R11: ffffffff8024c868 R12: ffff81007ddb41c0
[ 1127.943599] R13: 0000000000000296 R14: ffff81007caaf650 R15: 0000000000000000
[ 1127.950726] FS: 0000000000000000(0000) GS:ffff81007f827840(0000) knlGS:00000000f7d666c0
[ 1127.958819] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 1127.964559] CR2: 0000000000000040 CR3: 0000000062aec000 CR4: 00000000000006e0
[ 1127.971686] Process swapper (pid: 0, threadinfo ffff810002452000, task ffff810002444080)
[ 1127.979777] Stack: ffffffff802668fb ffff81007caaf650 ffffffff880d3a29 ffff81007caaf650
[ 1127.987634] 0000000000000000 ffff81007cf689f0 ffff81003d41d240 0000000000000000
[ 1127.995684] ffffffff880d3b5f ffff81007e3530e8
[ 1128.000757] Call Trace: <IRQ> <ffffffff802668fb>{kmem_cache_free+253}
[ 1128.007231] <ffffffff880d3a29>{:dm_mod:dec_pending+169} <ffffffff880d3b5f>{:dm_mod:clone_endio+127}
[ 1128.016932] <ffffffff802c9372>{__end_that_request_first+420} <ffffffff8808e8a6>{:scsi_mod:scsi_end_request+40}
[ 1128.027589] <ffffffff8808eb51>{:scsi_mod:scsi_io_completion+522}
[ 1128.034222] <ffffffff880cc4a1>{:sd_mod:sd_rw_intr+623} <ffffffff8808f5d6>{:scsi_mod:scsi_device_unbusy+85}
[ 1128.044527] <ffffffff802c86cb>{blk_done_softirq+113} <ffffffff8022c41b>{__do_softirq+86}
[ 1128.053265] <ffffffff8020a742>{call_softirq+30} <ffffffff8020b902>{do_softirq+44}
[ 1128.061396] <ffffffff8020b947>{do_IRQ+65} <ffffffff8020827b>{default_idle+0}
[ 1128.069091] <ffffffff80209aa0>{ret_from_intr+0} <EOI> <ffffffff803a02a0>{thread_return+0}
[ 1128.077926] <ffffffff802082a8>{default_idle+45} <ffffffff80208335>{cpu_idle+98}
[ 1128.085883] <ffffffff804d5c22>{start_secondary+1127}
[ 1128.091670]
[ 1128.091671] Code: f0 ff 0f 0f 88 c8 01 00 00 c3 f0 ff 0f 8b 07 ba 01 00 00 00
[ 1128.100588] RIP <ffffffff803a1ae8>{__lock_text_start+0} RSP <ffff81000245bd70>
[ 1128.107832] CR2: 0000000000000040
[ 1128.111378] <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
[ 1128.118498] <0>Rebooting in 60 seconds..

This is with the remove-softlockup-from-invalidate_mapping_pages patch, but
it looks like it's crashing somewhere else. Any good ideas? Is this related
to the memory management at all, or is is better left to the dm people?

/* Steinar */
--
Homepage: http://www.sesse.net/


2006-05-13 14:36:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

"Steinar H. Gunderson" <[email protected]> wrote:
>
> On Wed, Apr 26, 2006 at 01:58:09PM -0700, Andrew Morton wrote:
> >> I tried this patch against 2.6.17-rc2 (I hoped that it might be fixing my
> >> kswapd oopses too, as they seem related; see
> >> http://lkml.org/lkml/2006/4/26/124 and followups), and it simply makes my
> >> machine hang on bootup -- it seems to make modprobe hang forever on some lock
> >> or something right after it loads raid6.ko (pulled in by evms_activate) in
> >> initramfs. Without the patch, the machine boots just fine.
> > It had a silly bug. Fixed version:
>
> That worked fine for a long time (>14 days), but I had to switch motherboards
> (from a cheap Epox to a Tyan server-board) due to external factors. Since
> then, stuff has started panicing again wildly -- 2.6.17-rc2 doesn't even
> boot, 2.6.17-rc3 and 2.6.17-rc4 lives for an hour or so and then gives up:
>
> [ 1127.842645] Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
> [ 1127.848117] <ffffffff803a1ae8>{__lock_text_start+0}
> [ 1127.855474] PGD 5e38a067 PUD 5e39d067 PMD 0
> [ 1127.859770] Oops: 0002 [1] SMP
> [ 1127.862931] CPU 1
> [ 1127.864957] Modules linked in: w83627hf_wdt eeprom ide_generic ide_disk ide_cd cdrom ipv6 psmouse i2c_nforce2 serio_raw pcspkr i2c_core parport_pc parport rtc e
> xt3 jbd mbcahe raid6 raid5 xor raid10 raid1 raid0 linear mdd dm_mod sd_mod sata_nv sata_sil libata scsi_mod amd74xx generic forcedeth tg3 ide_core ohci_hcd ehci_hc
> d thermal processor fan unix
> [ 1127.896622] Pid: 0, comm: swapper Not tainted 2.6.17-rc4 #1
> [ 1127.902191] RIP: 0010:[<ffffffff803a1ae8>] <ffffffff803a1ae8>{__lock_text_start+0}
> [ 1127.909589] RSP: 0018:ffff81000245bd70 EFLAGS: 00010002
> [ 1127.915094] RAX: 0000000000005d09 RBX: 0000000000005d09 RCX: ffff8100020446a8
> [ 1127.922221] RDX: ffff81007e2ee800 RSI: ffff810002044648 RDI: 0000000000000040
> [ 1127.929346] RBP: 0000000000005d09 R08: 0000000000000000 R09: 0000000000000000
> [ 1127.936472] R10: 0000000000000001 R11: ffffffff8024c868 R12: ffff81007ddb41c0
> [ 1127.943599] R13: 0000000000000296 R14: ffff81007caaf650 R15: 0000000000000000
> [ 1127.950726] FS: 0000000000000000(0000) GS:ffff81007f827840(0000) knlGS:00000000f7d666c0
> [ 1127.958819] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 1127.964559] CR2: 0000000000000040 CR3: 0000000062aec000 CR4: 00000000000006e0
> [ 1127.971686] Process swapper (pid: 0, threadinfo ffff810002452000, task ffff810002444080)
> [ 1127.979777] Stack: ffffffff802668fb ffff81007caaf650 ffffffff880d3a29 ffff81007caaf650
> [ 1127.987634] 0000000000000000 ffff81007cf689f0 ffff81003d41d240 0000000000000000
> [ 1127.995684] ffffffff880d3b5f ffff81007e3530e8
> [ 1128.000757] Call Trace: <IRQ> <ffffffff802668fb>{kmem_cache_free+253}
> [ 1128.007231] <ffffffff880d3a29>{:dm_mod:dec_pending+169} <ffffffff880d3b5f>{:dm_mod:clone_endio+127}
> [ 1128.016932] <ffffffff802c9372>{__end_that_request_first+420} <ffffffff8808e8a6>{:scsi_mod:scsi_end_request+40}
> [ 1128.027589] <ffffffff8808eb51>{:scsi_mod:scsi_io_completion+522}
> [ 1128.034222] <ffffffff880cc4a1>{:sd_mod:sd_rw_intr+623} <ffffffff8808f5d6>{:scsi_mod:scsi_device_unbusy+85}
> [ 1128.044527] <ffffffff802c86cb>{blk_done_softirq+113} <ffffffff8022c41b>{__do_softirq+86}
> [ 1128.053265] <ffffffff8020a742>{call_softirq+30} <ffffffff8020b902>{do_softirq+44}
> [ 1128.061396] <ffffffff8020b947>{do_IRQ+65} <ffffffff8020827b>{default_idle+0}
> [ 1128.069091] <ffffffff80209aa0>{ret_from_intr+0} <EOI> <ffffffff803a02a0>{thread_return+0}
> [ 1128.077926] <ffffffff802082a8>{default_idle+45} <ffffffff80208335>{cpu_idle+98}
> [ 1128.085883] <ffffffff804d5c22>{start_secondary+1127}
> [ 1128.091670]
> [ 1128.091671] Code: f0 ff 0f 0f 88 c8 01 00 00 c3 f0 ff 0f 8b 07 ba 01 00 00 00
> [ 1128.100588] RIP <ffffffff803a1ae8>{__lock_text_start+0} RSP <ffff81000245bd70>
> [ 1128.107832] CR2: 0000000000000040
> [ 1128.111378] <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
> [ 1128.118498] <0>Rebooting in 60 seconds..
>
> This is with the remove-softlockup-from-invalidate_mapping_pages patch, but
> it looks like it's crashing somewhere else. Any good ideas? Is this related
> to the memory management at all, or is is better left to the dm people?
>

Well if it's the same software lineup on new hardware, one would also
suspect that hardware. Is it new?

Does it run other kernels OK?

Does it always crash in the same manner?

2006-05-13 14:43:40

by Steinar H. Gunderson

[permalink] [raw]
Subject: Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

On Sat, May 13, 2006 at 07:33:44AM -0700, Andrew Morton wrote:
> Well if it's the same software lineup on new hardware, one would also
> suspect that hardware. Is it new?

Yes, it's new. The differences aren't that big, though: The motherboard has
been changed, and there's an extra sil3114 controller.

> Does it run other kernels OK?

2.6.15.4 appears to be fine, but I haven't tested it enough to make sure.
(I'm running 2.6.17-rc4 without swap now, so we'll see.)

> Does it always crash in the same manner?

Yes; consistently and in the same place after about the same amount of time.

/* Steinar */
--
Homepage: http://www.sesse.net/

2006-05-13 14:54:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

"Steinar H. Gunderson" <[email protected]> wrote:
>
> On Sat, May 13, 2006 at 07:33:44AM -0700, Andrew Morton wrote:
> > Well if it's the same software lineup on new hardware, one would also
> > suspect that hardware. Is it new?
>
> Yes, it's new. The differences aren't that big, though: The motherboard has
> been changed, and there's an extra sil3114 controller.
>
> > Does it run other kernels OK?
>
> 2.6.15.4 appears to be fine, but I haven't tested it enough to make sure.
> (I'm running 2.6.17-rc4 without swap now, so we'll see.)
>
> > Does it always crash in the same manner?
>
> Yes; consistently and in the same place after about the same amount of time.

ho-hum. Please see if there's anything else you can do to rule out a
hardware failure, then copy [email protected] on the next oops report.

The stack backtrace you have there is a little surprising. Enabling
CONFIG_FRAME_POINTER might help clear it up. Also it'd be worth seeing if
CONFIG_DEBUG_SLAB turns up anything.

Thanks.

2006-05-13 15:00:09

by Steinar H. Gunderson

[permalink] [raw]
Subject: Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

On Sat, May 13, 2006 at 07:51:47AM -0700, Andrew Morton wrote:
> ho-hum. Please see if there's anything else you can do to rule out a
> hardware failure, then copy [email protected] on the next oops report.

It's not swap related. It crashes even without swap enabled; still with lots of
dm stuff in the backtrace, though:

[ 3192.568880] general protection fault: 0000 [1] SMP
[ 3192.573779] CPU 1
[ 3192.575804] Modules linked in: w83627hf_wdt eeprom ide_generic ide_disk ide_cd cdrom ipv6 psmouse i2c_nforce2 serio_raw pcspkr i2c_core parport_pc parport rtc ext3 jbd mbcache raid6 raid5 xor raid10 raid1 raid0 linear md_mod dm_mod sd_mod sata_nv tg3 sata_sil libata scsi_mod forcedeth generic amd74xx ehci_hcd ide_core ohc i_hcd thermal processor fan unix
[ 3192.607472] Pid: 3432, comm: md1_raid5 Not tainted 2.6.17-rc4 #1
[ 3192.613471] RIP: 0010:[<ffffffff803a1ae8>] <ffffffff803a1ae8>{__lock_text_start+0}
[ 3192.620870] RSP: 0018:ffff81000245bd70 EFLAGS: 00210086
[ 3192.626374] RAX: 000000000000fa40 RBX: aaaa8b5ad269b80f RCX: ffff81000151ff50
[ 3192.633501] RDX: ffff81007febd600 RSI: ffff81000151fef0 RDI: aaaa8b5ad269b81f
[ 3192.640628] RBP: 000000000000fa40 R08: ffff81007db6d2c0 R09: ffff81007db6d2c0
[ 3192.647756] R10: 0000000000000007 R11: ffffffff8024c868 R12: ffff81007febf040
[ 3192.654882] R13: 0000000000200296 R14: ffff810004a97fb0 R15: 0000000000000000
[ 3192.662009] FS: 0000000000000000(0000) GS:ffff81007f827840(0000) knlGS:00000000f7ad9ae0
[ 3192.670101] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 3192.675843] CR2: 00000000080850e0 CR3: 000000007c11e000 CR4: 00000000000006e0
[ 3192.682970] Process md1_raid5 (pid: 3432, threadinfo ffff81007ddb6000, task ffff81007f997080)
[ 3192.691495] Stack: ffffffff802668b8 ffff81007f27c880 ffff810049bb23c0 ffff810049bb23c0
[ 3192.699353] 0000000000000000 ffff81007d0788b8 ffff810049bb23c0 0000000000000000
[ 3192.707404] ffffffff880d3b67 ffff81007ed0cba8
[ 3192.712476] Call Trace: <IRQ> <ffffffff802668b8>{kmem_cache_free+186}
[ 3192.718950] <ffffffff880d3b67>{:dm_mod:clone_endio+135} <ffffffff802c9372>{__end_that_request_first+420}
[ 3192.729081] <ffffffff802c7d1b>{blk_run_queue+62} <ffffffff8806f8a6>{:scsi_mod:scsi_end_request+40}
[ 3192.738700] <ffffffff8806fb51>{:scsi_mod:scsi_io_completion+522}
[ 3192.745334] <ffffffff880cc4a1>{:sd_mod:sd_rw_intr+623} <ffffffff880705d6>{:scsi_mod:scsi_device_unbusy+85}
[ 3192.755641] <ffffffff802c86cb>{blk_done_softirq+113} <ffffffff8022c41b>{__do_softirq+86}
[ 3192.764377] <ffffffff8020a742>{call_softirq+30} <ffffffff8020b902>{do_softirq+44}
[ 3192.772509] <ffffffff8020b947>{do_IRQ+65} <ffffffff80209aa0>{ret_from_intr+0} <EOI>
[ 3192.780822] <ffffffff881129a7>{:raid5:compute_parity+880} <ffffffff802d5e2f>{memcmp+11}
[ 3192.789476] <ffffffff8811467a>{:raid5:handle_stripe+3022} <ffffffff80238a7c>{keventd_create_kthread+0}
[ 3192.799424] <ffffffff80238a7c>{keventd_create_kthread+0} <ffffffff881153e9>{:raid5:raid5d+333}
[ 3192.808682] <ffffffff880e464f>{:md_mod:md_thread+0} <ffffffff880e4751>{:md_mod:md_thread+258}
[ 3192.817864] <ffffffff80238e78>{autoremove_wake_function+0} <ffffffff880e464f>{:md_mod:md_thread+0}
[ 3192.827471] <ffffffff80238cc4>{kthread+203} <ffffffff8020a3f2>{child_rip+8}
[ 3192.835081] <ffffffff80238a7c>{keventd_create_kthread+0} <ffffffff80238bf9>{kthread+0}
[ 3192.843646] <ffffffff8020a3ea>{child_rip+0}
[ 3192.848646]
[ 3192.848647] Code: f0 ff 0f 0f 88 c8 01 00 00 c3 f0 ff 0f 8b 07 ba 01 00 00 00
[ 3192.857563] RIP <ffffffff803a1ae8>{__lock_text_start+0} RSP <ffff81000245bd70>
[ 3192.865039] <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
[ 3192.872164] <0>Rebooting in 60 seconds..

(Thank goodness for serial console; I couldn't possibly write all these oopses
by hand. :-) )

> The stack backtrace you have there is a little surprising. Enabling
> CONFIG_FRAME_POINTER might help clear it up. Also it'd be worth seeing if
> CONFIG_DEBUG_SLAB turns up anything.

I'm recompiling 2.6.17-rc4 now with those two added in. I'll let you know in a
few hours when it crashes again, I'd guess :-)

Would it be a good idea to revert your mm patch and test again, just in case?

/* Steinar */
--
Homepage: http://www.sesse.net/

2006-05-13 15:27:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

"Steinar H. Gunderson" <[email protected]> wrote:
>
> On Sat, May 13, 2006 at 07:51:47AM -0700, Andrew Morton wrote:
> > ho-hum. Please see if there's anything else you can do to rule out a
> > hardware failure, then copy [email protected] on the next oops report.
>
> It's not swap related. It crashes even without swap enabled; still with lots of
> dm stuff in the backtrace, though:

(for dm-devel: 2.6.15.4 runs OK on this machine with the same config) (yes?)

> [ 3192.568880] general protection fault: 0000 [1] SMP
> [ 3192.573779] CPU 1
> [ 3192.575804] Modules linked in: w83627hf_wdt eeprom ide_generic ide_disk ide_cd cdrom ipv6 psmouse i2c_nforce2 serio_raw pcspkr i2c_core parport_pc parport rtc ext3 jbd mbcache raid6 raid5 xor raid10 raid1 raid0 linear md_mod dm_mod sd_mod sata_nv tg3 sata_sil libata scsi_mod forcedeth generic amd74xx ehci_hcd ide_core ohc i_hcd thermal processor fan unix
> [ 3192.607472] Pid: 3432, comm: md1_raid5 Not tainted 2.6.17-rc4 #1
> [ 3192.613471] RIP: 0010:[<ffffffff803a1ae8>] <ffffffff803a1ae8>{__lock_text_start+0}
> [ 3192.620870] RSP: 0018:ffff81000245bd70 EFLAGS: 00210086
> [ 3192.626374] RAX: 000000000000fa40 RBX: aaaa8b5ad269b80f RCX: ffff81000151ff50
> [ 3192.633501] RDX: ffff81007febd600 RSI: ffff81000151fef0 RDI: aaaa8b5ad269b81f
> [ 3192.640628] RBP: 000000000000fa40 R08: ffff81007db6d2c0 R09: ffff81007db6d2c0
> [ 3192.647756] R10: 0000000000000007 R11: ffffffff8024c868 R12: ffff81007febf040
> [ 3192.654882] R13: 0000000000200296 R14: ffff810004a97fb0 R15: 0000000000000000
> [ 3192.662009] FS: 0000000000000000(0000) GS:ffff81007f827840(0000) knlGS:00000000f7ad9ae0
> [ 3192.670101] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 3192.675843] CR2: 00000000080850e0 CR3: 000000007c11e000 CR4: 00000000000006e0
> [ 3192.682970] Process md1_raid5 (pid: 3432, threadinfo ffff81007ddb6000, task ffff81007f997080)
> [ 3192.691495] Stack: ffffffff802668b8 ffff81007f27c880 ffff810049bb23c0 ffff810049bb23c0
> [ 3192.699353] 0000000000000000 ffff81007d0788b8 ffff810049bb23c0 0000000000000000
> [ 3192.707404] ffffffff880d3b67 ffff81007ed0cba8
> [ 3192.712476] Call Trace: <IRQ> <ffffffff802668b8>{kmem_cache_free+186}
> [ 3192.718950] <ffffffff880d3b67>{:dm_mod:clone_endio+135} <ffffffff802c9372>{__end_that_request_first+420}
> [ 3192.729081] <ffffffff802c7d1b>{blk_run_queue+62} <ffffffff8806f8a6>{:scsi_mod:scsi_end_request+40}
> [ 3192.738700] <ffffffff8806fb51>{:scsi_mod:scsi_io_completion+522}
> [ 3192.745334] <ffffffff880cc4a1>{:sd_mod:sd_rw_intr+623} <ffffffff880705d6>{:scsi_mod:scsi_device_unbusy+85}
> [ 3192.755641] <ffffffff802c86cb>{blk_done_softirq+113} <ffffffff8022c41b>{__do_softirq+86}
> [ 3192.764377] <ffffffff8020a742>{call_softirq+30} <ffffffff8020b902>{do_softirq+44}
> [ 3192.772509] <ffffffff8020b947>{do_IRQ+65} <ffffffff80209aa0>{ret_from_intr+0} <EOI>
> [ 3192.780822] <ffffffff881129a7>{:raid5:compute_parity+880} <ffffffff802d5e2f>{memcmp+11}
> [ 3192.789476] <ffffffff8811467a>{:raid5:handle_stripe+3022} <ffffffff80238a7c>{keventd_create_kthread+0}
> [ 3192.799424] <ffffffff80238a7c>{keventd_create_kthread+0} <ffffffff881153e9>{:raid5:raid5d+333}
> [ 3192.808682] <ffffffff880e464f>{:md_mod:md_thread+0} <ffffffff880e4751>{:md_mod:md_thread+258}
> [ 3192.817864] <ffffffff80238e78>{autoremove_wake_function+0} <ffffffff880e464f>{:md_mod:md_thread+0}
> [ 3192.827471] <ffffffff80238cc4>{kthread+203} <ffffffff8020a3f2>{child_rip+8}
> [ 3192.835081] <ffffffff80238a7c>{keventd_create_kthread+0} <ffffffff80238bf9>{kthread+0}
> [ 3192.843646] <ffffffff8020a3ea>{child_rip+0}
> [ 3192.848646]
> [ 3192.848647] Code: f0 ff 0f 0f 88 c8 01 00 00 c3 f0 ff 0f 8b 07 ba 01 00 00 00
> [ 3192.857563] RIP <ffffffff803a1ae8>{__lock_text_start+0} RSP <ffff81000245bd70>
> [ 3192.865039] <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
> [ 3192.872164] <0>Rebooting in 60 seconds..
>
> (Thank goodness for serial console; I couldn't possibly write all these oopses
> by hand. :-) )
>
> > The stack backtrace you have there is a little surprising. Enabling
> > CONFIG_FRAME_POINTER might help clear it up. Also it'd be worth seeing if
> > CONFIG_DEBUG_SLAB turns up anything.
>
> I'm recompiling 2.6.17-rc4 now with those two added in. I'll let you know in a
> few hours when it crashes again, I'd guess :-)
>
> Would it be a good idea to revert your mm patch and test again, just in case?

Which patch? remove-softlockup-from-invalidate_mapping_pages.patch? No,
that won't have caused this. But then, it's not really obvious what this
crash is.

Now your earlier trace had this important info:

[ 1127.842645] Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
[ 1127.848117] <ffffffff803a1ae8>{__lock_text_start+0}
[ 1127.855474] PGD 5e38a067 PUD 5e39d067 PMD 0
[ 1127.859770] Oops: 0002 [1] SMP
[ 1127.862931] CPU 1


Which kind of implies that we passed a null (or very small small) `struct
kmem_cache' pointer into kmem_cache_free(). But that doesn't seem like the
sort of thing which will take hours to reproduce.

Do you have CONFIG_NUMA set?

2006-05-13 15:32:35

by Steinar H. Gunderson

[permalink] [raw]
Subject: Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

On Sat, May 13, 2006 at 08:24:09AM -0700, Andrew Morton wrote:
> (for dm-devel: 2.6.15.4 runs OK on this machine with the same config) (yes?)

I can't really assert that; it doesn't _seem_ to be bothered the same way,
but I've only run it for an hour or so to compile the other kernels.

Just to be 100% sure it's not overheating or such, I've turned off all fan
control in the BIOS. But then again, overheating sounds really far off; the
machine is in a cooled server room (20 degrees Celsius), and has lots of fans
making sure the air flow is good.

> Which patch? remove-softlockup-from-invalidate_mapping_pages.patch? No,
> that won't have caused this. But then, it's not really obvious what this
> crash is.

OK.

> Which kind of implies that we passed a null (or very small small) `struct
> kmem_cache' pointer into kmem_cache_free(). But that doesn't seem like the
> sort of thing which will take hours to reproduce.
>
> Do you have CONFIG_NUMA set?

Hm, yes, for some reason CONFIG_NUMA, CONFIG_K8_NUMA and
CONFIG_x86_64_ACPI_NUMA are all set; I guess they're left from the stock
config I use at a base. (For those tuning in; this is a dual-core amd64. That
might be relevant somehow.)

/* Steinar */
--
Homepage: http://www.sesse.net/

2006-05-13 15:46:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

"Steinar H. Gunderson" <[email protected]> wrote:
>
> > Which kind of implies that we passed a null (or very small small) `struct
> > kmem_cache' pointer into kmem_cache_free(). But that doesn't seem like the
> > sort of thing which will take hours to reproduce.
> >
> > Do you have CONFIG_NUMA set?
>
> Hm, yes, for some reason CONFIG_NUMA, CONFIG_K8_NUMA and
> CONFIG_x86_64_ACPI_NUMA are all set; I guess they're left from the stock
> config I use at a base. (For those tuning in; this is a dual-core amd64. That
> might be relevant somehow.)

Please try it without NUMA - we've had a few problems there of late. If
that fixes it then I have patches for you to test ;)

2006-05-13 17:14:18

by Steinar H. Gunderson

[permalink] [raw]
Subject: Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

On Sat, May 13, 2006 at 08:43:17AM -0700, Andrew Morton wrote:
> Please try it without NUMA - we've had a few problems there of late. If
> that fixes it then I have patches for you to test ;)

No go:

[ 2364.512764] Unable to handle kernel paging request at ffffffff403ae552 RIP:
[ 2364.517369] [<ffffffff403ae552>]
[ 2364.526396] PGD 203027 PUD 0
[ 2364.529385] Oops: 0010 [1] SMP
[ 2364.532546] CPU 0
[ 2364.534571] Modules linked in: w83627hf_wdt eeprom ide_generic ide_disk ide_cd cdrom ipv6 psmouse i2c_nforce2 serio_raw pcspkr evdev parport_pc parport i2c_core rtc ext3 jbd mbcache raid6 raid5 xor rai
d10 raid1 raid0 linear md_mod dm_mod sd_mod sata_nv sata_sil libata tg3 scsi_mod forcedeth generic amd74xx ide_core ehci_hcd ohci_hcd thermal processor fan unix
[ 2364.566790] Pid: 3433, comm: md1_resync Not tainted 2.6.17-rc4 #1
[ 2364.572877] RIP: 0010:[<ffffffff403ae552>] [<ffffffff403ae552>]
[ 2364.578612] RSP: 0018:ffff81007d5d3c40 EFLAGS: 00010287
[ 2364.584118] RAX: 0000000000000010 RBX: ffff8100109cce90 RCX: 0000000000000290
[ 2364.591246] RDX: 000000000000000a RSI: 0000000000000282 RDI: ffff81007df604d8
[ 2364.598372] RBP: ffff81007d5d3c68 R08: 0000000000001d4b R09: ffff810076aa9e78
[ 2364.605499] R10: ffff810076aa9e78 R11: 0000000000001000 R12: ffff81007dca7448
[ 2364.612627] R13: ffff81007df60488 R14: ffff81007df73528 R15: ffff81007dca7448
[ 2364.619755] FS: 0000000000000000(0000) GS:ffffffff804c3000(0000) knlGS:00000000f7e968e0
[ 2364.627848] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 2364.633588] CR2: ffffffff403ae552 CR3: 000000007b91c000 CR4: 00000000000006e0
[ 2364.640715] Process md1_resync (pid: 3433, threadinfo ffff81007d5d2000, task ffff81007fcb47d0)
[ 2364.649325] Stack: ffffffff8806a93b ffff8100109cce90 ffff81007dc304c8 ffff81007df60488
[ 2364.657182] ffff81007df73528 ffff81007d5d3ca8 ffffffff8807070e 0000000000000001
[ 2364.665238] ffff81007df73528 ffff81007dd8b9a8
[ 2364.670310] Call Trace: <ffffffff8806a93b>{:scsi_mod:scsi_dispatch_cmd+381}
[ 2364.677308] <ffffffff8807070e>{:scsi_mod:scsi_request_fn+735} <ffffffff802cbdad>{__generic_unplug_device+36}
[ 2364.687790] <ffffffff802cbed9>{generic_unplug_device+0} <ffffffff802cbef7>{generic_unplug_device+30}
[ 2364.697570] <ffffffff880d753c>{:dm_mod:dm_table_unplug_all+52} <ffffffff880d51cf>{:dm_mod:dm_unplug_all+35}
[ 2364.707964] <ffffffff8811272e>{:raid5:unplug_slaves+105} <ffffffff88112a7d>{:raid5:raid5_unplug_device+107}
[ 2364.718350] <ffffffff880ec66a>{:md_mod:md_do_sync+1300} <ffffffff803acb5d>{thread_return+0}
[ 2364.727361] <ffffffff880e67e8>{:md_mod:md_thread+0} <ffffffff880e68ed>{:md_mod:md_thread+261}
[ 2364.736542] <ffffffff880e67e8>{:md_mod:md_thread+0} <ffffffff80239e67>{kthread+207}
[ 2364.744858] <ffffffff8020a522>{child_rip+8} <ffffffff80239d98>{kthread+0}
[ 2364.752300] <ffffffff8020a51a>{child_rip+0}
[ 2364.757303]
[ 2364.757304] Code: Bad RIP value.
[ 2364.762134] RIP [<ffffffff403ae552>] RSP <ffff81007d5d3c40>
[ 2364.767722] CR2: ffffffff403ae552

This is about the same crash as I used to see with -rc2.

I guess it's time for memtest86, just in case I botched the RAM while moving it...

/* Steinar */
--
Homepage: http://www.sesse.net/

2006-05-13 22:37:43

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [dm-devel] Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

What's the device configuration here:
e.g. linear dm logical volume over md raid5?

[e.g. post output of dmsetup info -c' and 'dmsetup table'
and 'dmsetup status' remembering to blank out the
long hex string if you have dm-crypt]

Alasdair
--
[email protected]

2006-05-13 23:28:47

by Steinar H. Gunderson

[permalink] [raw]
Subject: Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (might be dm related)

On Sat, May 13, 2006 at 07:14:14PM +0200, Steinar H. Gunderson wrote:
> I guess it's time for memtest86, just in case I botched the RAM while
> moving it...

Good news! :-) memtest86+ refused to boot, since I have too much memory (2GB)
and this somehow interferes with grub, but mprime (my other favourite tool
for this kind of testing) quite quickly bombed out. (This in itself usually
means bad RAM, bad motherboard or some interrupt handler messing up badly.)

I moved the RAM from DIMM slot 3+4 to 1+2, and suddenly things became a _lot_
better. I definitely can't say it's oops-free yet, since it's only been
running a couple of hours, but this certainly points towards that the larger
part of my issues were some odd sort of motherboard hiccup.

I take it this also frees the dm, md and NUMA code of most suspicion --
consider the issues fixed for now, and I'll be sure to bug you again if I
start to get oopses again. Sorry for wasting your time :-)

/* Steinar */
--
Homepage: http://www.sesse.net/