-tip testing found the following boot crash on a 64-bit x86 system:
[ 0.405247] initcall spawn_softlockup_task+0x0/0xa5 returned 0 after 0 usecs
[ 0.410004] calling relay_init+0x0/0x40 @ 1
[ 0.420005] initcall relay_init+0x0/0x40 returned 0 after 0 usecs
[ 0.426355] lockdep: fixing up alternatives.
[ 0.430110] Booting processor 1 APIC 0x1 ip 0x6000
[ 0.030000] Initializing CPU#1
[ 0.030000] masked ExtINT on CPU#1
[ 0.520060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 0.530000] IP: [<ffffffff81071317>] queue_work_on+0x27/0x70
[ 0.530000] PGD 0
[ 0.530000] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 0.530000] last sysfs file:
[ 0.530000] CPU 0
[ 0.530000] Modules linked in:
[ 0.530000] Pid: 1, comm: swapper Not tainted 2.6.31-rc8-tip #1613
[ 0.530000] RIP: 0010:[<ffffffff81071317>] [<ffffffff81071317>] queue_work_on+0x27/0x70
[ 0.530000] RSP: 0018:ffff880009007d40 EFLAGS: 00010246
[ 0.530000] RAX: 0000000000000000 RBX: ffffffff81e1d0c0 RCX: 0000000000000000
[ 0.530000] RDX: ffffffff824a1fa0 RSI: 0000000000000000 RDI: 0000000000000000
[ 0.530000] RBP: ffff880009007d50 R08: 0000000000000000 R09: 0000000000000000
[ 0.530000] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000025cb39a8
[ 0.530000] R13: ffffffff824a0c40 R14: ffff880009007e50 R15: 0000000000000100
[ 0.530000] FS: 0000000000000000(0000) GS:ffff880009004000(0000) knlGS:0000000000000000
[ 0.530000] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 0.530000] CR2: 0000000000000020 CR3: 0000000001001000 CR4: 00000000000006b0
[ 0.530000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.530000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.530000] Process swapper (pid: 1, threadinfo ffff88003f0da000, task ffff88003f0e0000)
[ 0.530000] Stack:
[ 0.530000] ffffffff824a0c40 00000000b4b7426a ffff880009007d70 ffffffff8107157d
[ 0.530000] <0> ffffffff81083846 00000000b4b7426a ffff880009007d90 ffffffff810715c9
[ 0.530000] <0> 0000000000000000 00000000b4b7426a ffff880009007dc0 ffffffff81083927
[ 0.530000] Call Trace:
[ 0.530000] <IRQ>
[ 0.530000] [<ffffffff8107157d>] queue_work+0x2d/0x50
[ 0.530000] [<ffffffff81083846>] ? clocksource_watchdog+0x26/0x240
[ 0.530000] [<ffffffff810715c9>] schedule_work+0x29/0x50
[ 0.530000] [<ffffffff81083927>] clocksource_watchdog+0x107/0x240
[ 0.530000] [<ffffffff81065eee>] run_timer_softirq+0x21e/0x380
[ 0.530000] [<ffffffff81065e29>] ? run_timer_softirq+0x159/0x380
[ 0.530000] [<ffffffff81083820>] ? clocksource_watchdog+0x0/0x240
[ 0.530000] [<ffffffff810600aa>] __do_softirq+0x10a/0x200
[ 0.530000] [<ffffffff8100cb1c>] call_softirq+0x1c/0x90
[ 0.530000] [<ffffffff8100e955>] do_softirq+0x95/0xc0
[ 0.530000] [<ffffffff8105f4c5>] irq_exit+0x75/0x90
[ 0.530000] [<ffffffff81029c20>] smp_apic_timer_interrupt+0x80/0xd0
[ 0.530000] [<ffffffff8100c4f3>] apic_timer_interrupt+0x13/0x20
[ 0.530000] <EOI>
[ 0.530000] [<ffffffff812ff997>] ? delay_tsc+0x47/0x80
[ 0.530000] [<ffffffff812ffab4>] ? __const_udelay+0x64/0x80
[ 0.530000] [<ffffffff818b3fff>] ? do_boot_cpu+0x498/0x6b3
[ 0.530000] [<ffffffff818b4567>] ? do_fork_idle+0x0/0x56
[ 0.530000] [<ffffffff810468d2>] ? complete+0x32/0x80
[ 0.530000] [<ffffffff818b4375>] ? native_cpu_up+0x15b/0x208
[ 0.530000] [<ffffffff818b69d8>] ? _cpu_up+0xf3/0x1a4
[ 0.530000] [<ffffffff818b6b4d>] ? cpu_up+0xc4/0xd7
[ 0.530000] [<ffffffff821d3f97>] ? smp_init+0x118/0x126
[ 0.530000] [<ffffffff821d40c7>] ? kernel_init+0x82/0xec
[ 0.530000] [<ffffffff8100ca1a>] ? child_rip+0xa/0x20
[ 0.530000] [<ffffffff8100c3bc>] ? restore_args+0x0/0x30
[ 0.530000] [<ffffffff821d4045>] ? kernel_init+0x0/0xec
[ 0.530000] [<ffffffff8100ca10>] ? child_rip+0x0/0x20
[ 0.530000] Code: ff 0f 1f 00 55 41 89 f8 48 89 e5 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 f0 0f ba 2a 00 19 c0 31 c9 85 c0 75 2a <44> 8b 4e 20 48 8b 3e 48 89 d6 45 85 c9 44 0f 45 05 b4 f4 f7 00
[ 0.530000] RIP [<ffffffff81071317>] queue_work_on+0x27/0x70
[ 0.530000] RSP <ffff880009007d40>
[ 0.530000] CR2: 0000000000000020
[ 0.530006] ---[ end trace a7919e7f17c0a725 ]---
[ 0.540001] Kernel panic - not syncing: Fatal exception in interrupt
Config and full bootlog attached.
Note, i have other systems that do not trigger this crash, so it's
something specific to this one. It has an hpet for example:
[ 0.000000] ACPI: HPET 000000003fef7e90 00038 (v01 INTEL D975XBX 000004B9 MSFT 01000013)
[ 0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[ 0.000000] hpet clockevent registered
i tried to bisect it but it's inconclusive:
# bad: [32beef9c] Merge branch 'perfcounters/core'
# bad: [b6413360] manual merge of x86/platform
# bad: [d9e5f39a] Merge branch 'auto-oprofile-next' into auto-latest
# bad: [cbaff272] Merge branch 'auto-timers-next' into auto-latest
as the bisection comes up with that merge commit. Perhaps the
combination of the x86/platform changes and the clocksource changes
triggered it?
Ingo
* Ingo Molnar <[email protected]> wrote:
> -tip testing found the following boot crash on a 64-bit x86 system:
>
> [ 0.405247] initcall spawn_softlockup_task+0x0/0xa5 returned 0 after 0 usecs
> [ 0.410004] calling relay_init+0x0/0x40 @ 1
> [ 0.420005] initcall relay_init+0x0/0x40 returned 0 after 0 usecs
> [ 0.426355] lockdep: fixing up alternatives.
> [ 0.430110] Booting processor 1 APIC 0x1 ip 0x6000
> [ 0.030000] Initializing CPU#1
> [ 0.030000] masked ExtINT on CPU#1
> [ 0.520060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [ 0.530000] IP: [<ffffffff81071317>] queue_work_on+0x27/0x70
> [ 0.530000] PGD 0
> [ 0.530000] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 0.530000] last sysfs file:
> [ 0.530000] CPU 0
> [ 0.530000] Modules linked in:
> [ 0.530000] Pid: 1, comm: swapper Not tainted 2.6.31-rc8-tip #1613
> [ 0.530000] RIP: 0010:[<ffffffff81071317>] [<ffffffff81071317>] queue_work_on+0x27/0x70
> [ 0.530000] RSP: 0018:ffff880009007d40 EFLAGS: 00010246
> [ 0.530000] RAX: 0000000000000000 RBX: ffffffff81e1d0c0 RCX: 0000000000000000
> [ 0.530000] RDX: ffffffff824a1fa0 RSI: 0000000000000000 RDI: 0000000000000000
> [ 0.530000] RBP: ffff880009007d50 R08: 0000000000000000 R09: 0000000000000000
> [ 0.530000] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000025cb39a8
> [ 0.530000] R13: ffffffff824a0c40 R14: ffff880009007e50 R15: 0000000000000100
> [ 0.530000] FS: 0000000000000000(0000) GS:ffff880009004000(0000) knlGS:0000000000000000
> [ 0.530000] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 0.530000] CR2: 0000000000000020 CR3: 0000000001001000 CR4: 00000000000006b0
> [ 0.530000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 0.530000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 0.530000] Process swapper (pid: 1, threadinfo ffff88003f0da000, task ffff88003f0e0000)
> [ 0.530000] Stack:
> [ 0.530000] ffffffff824a0c40 00000000b4b7426a ffff880009007d70 ffffffff8107157d
> [ 0.530000] <0> ffffffff81083846 00000000b4b7426a ffff880009007d90 ffffffff810715c9
> [ 0.530000] <0> 0000000000000000 00000000b4b7426a ffff880009007dc0 ffffffff81083927
> [ 0.530000] Call Trace:
> [ 0.530000] <IRQ>
> [ 0.530000] [<ffffffff8107157d>] queue_work+0x2d/0x50
> [ 0.530000] [<ffffffff81083846>] ? clocksource_watchdog+0x26/0x240
> [ 0.530000] [<ffffffff810715c9>] schedule_work+0x29/0x50
> [ 0.530000] [<ffffffff81083927>] clocksource_watchdog+0x107/0x240
> [ 0.530000] [<ffffffff81065eee>] run_timer_softirq+0x21e/0x380
> [ 0.530000] [<ffffffff81065e29>] ? run_timer_softirq+0x159/0x380
> [ 0.530000] [<ffffffff81083820>] ? clocksource_watchdog+0x0/0x240
> [ 0.530000] [<ffffffff810600aa>] __do_softirq+0x10a/0x200
> [ 0.530000] [<ffffffff8100cb1c>] call_softirq+0x1c/0x90
> [ 0.530000] [<ffffffff8100e955>] do_softirq+0x95/0xc0
> [ 0.530000] [<ffffffff8105f4c5>] irq_exit+0x75/0x90
> [ 0.530000] [<ffffffff81029c20>] smp_apic_timer_interrupt+0x80/0xd0
> [ 0.530000] [<ffffffff8100c4f3>] apic_timer_interrupt+0x13/0x20
> [ 0.530000] <EOI>
> [ 0.530000] [<ffffffff812ff997>] ? delay_tsc+0x47/0x80
> [ 0.530000] [<ffffffff812ffab4>] ? __const_udelay+0x64/0x80
> [ 0.530000] [<ffffffff818b3fff>] ? do_boot_cpu+0x498/0x6b3
> [ 0.530000] [<ffffffff818b4567>] ? do_fork_idle+0x0/0x56
> [ 0.530000] [<ffffffff810468d2>] ? complete+0x32/0x80
> [ 0.530000] [<ffffffff818b4375>] ? native_cpu_up+0x15b/0x208
> [ 0.530000] [<ffffffff818b69d8>] ? _cpu_up+0xf3/0x1a4
> [ 0.530000] [<ffffffff818b6b4d>] ? cpu_up+0xc4/0xd7
> [ 0.530000] [<ffffffff821d3f97>] ? smp_init+0x118/0x126
> [ 0.530000] [<ffffffff821d40c7>] ? kernel_init+0x82/0xec
> [ 0.530000] [<ffffffff8100ca1a>] ? child_rip+0xa/0x20
btw., the crash itself seems to happen because we got a timer IRQ on
CPU#0, which tries to queue work to CPU#1 but CPU#1 is not fully
initialized yet?
Ingo
* Ingo Molnar <[email protected]> wrote:
> i tried to bisect it but it's inconclusive:
>
> # bad: [32beef9c] Merge branch 'perfcounters/core'
> # bad: [b6413360] manual merge of x86/platform
> # bad: [d9e5f39a] Merge branch 'auto-oprofile-next' into auto-latest
> # bad: [cbaff272] Merge branch 'auto-timers-next' into auto-latest
>
> as the bisection comes up with that merge commit. Perhaps the
> combination of the x86/platform changes and the clocksource
> changes triggered it?
That seems to be the case - i just tested a combination merge of
tip:auto-timers-next and tip:auto-x86-next and the result crashed in
a similar way too.
Since normal bisection cannot find such breakages, i did a topical
bisection (merging the finegrained tip:x86/* topics into the timer
tree gradually and testing each merge).
That way i could exclude: x86/platform, x86/pat, x86/asm, x86/apic,
x86/percpu, x86/cpu, x86/mm and arrived to x86/tsc - which contains
a single commit:
d3b8f88: x86: Make tsc=reliable override boot time stability checks
Reverting that commit from tip:master gives me a non-crashing
bootup.
Ingo
On Thu, 2009-09-03 at 20:58 +0200, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
> > i tried to bisect it but it's inconclusive:
> >
> > # bad: [32beef9c] Merge branch 'perfcounters/core'
> > # bad: [b6413360] manual merge of x86/platform
> > # bad: [d9e5f39a] Merge branch 'auto-oprofile-next' into auto-latest
> > # bad: [cbaff272] Merge branch 'auto-timers-next' into auto-latest
> >
> > as the bisection comes up with that merge commit. Perhaps the
> > combination of the x86/platform changes and the clocksource
> > changes triggered it?
>
> That seems to be the case - i just tested a combination merge of
> tip:auto-timers-next and tip:auto-x86-next and the result crashed in
> a similar way too.
>
> Since normal bisection cannot find such breakages, i did a topical
> bisection (merging the finegrained tip:x86/* topics into the timer
> tree gradually and testing each merge).
>
> That way i could exclude: x86/platform, x86/pat, x86/asm, x86/apic,
> x86/percpu, x86/cpu, x86/mm and arrived to x86/tsc - which contains
> a single commit:
>
> d3b8f88: x86: Make tsc=reliable override boot time stability checks
>
> Reverting that commit from tip:master gives me a non-crashing
> bootup.
Huh. Does dropping the last chunk of the patch make the issue go away?
I'm suspecting fixing the bug Thomas noticed in the tsc_unstable
assignment (we set tsc_unstable before calling mark_tsc_unstable,
causing the TSC rating to not change) that I included in this patch is
colliding with the clocksource rework from Martin.
Although I'm not sure I see that in the backtrace, so I'm likely wrong.
Hrmm..
-john