Hi Venkatesh,
I have an IBM IntelliStation Z30 with two Dempsey CPUs. When I try to
boot 2.6.20-rc4 on it, the system prints messages about NMI watchdog
lockups. git-bisect determined that the patch "[PATCH] x86-64: Fix
interrupt race in idle callback (3rd try)" was the source of these
problems, and I can work around the problem either by passing
"idle=poll" to get avoid mwait_idle or by reverting the patch.
Other non-Dempsey Xeon machines with mwait support do not exhibit these
symptoms. I will try to determine if this is a bug specific to Dempsey
CPUs or this particular type of machine. I suspect the latter, but I
don't know enough about monitor/mwait to pursue this much further.
What else can I do to diagnose this?
--D
-----------------
[ 81.794792] Parsing all Control Methods:
[ 81.798710] Table [SSDT](id 002F) - 5 Objects with 0 Devices 2 Methods 0 Regions
[ 81.806410] ACPI (exconfig-0455): Dynamic SSDT Load - OemId [ PmRef] OemTableId [ Cpu0Ist] [20060707]
[ 81.815967] ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
[ 81.821837] ACPI: Processor [CPU0] (supports 8 throttling states)
[ 81.831290] Parsing all Control Methods:
[ 81.835283] Table [SSDT](id 0032) - 3 Objects with 0 Devices 2 Methods 0 Regions
[ 81.842988] ACPI (exconfig-0455): Dynamic SSDT Load - OemId [ PmRef] OemTableId [ Cpu1Ist] [20060707]
[ 87.276183] NMI Watchdog detected LOCKUP on CPU 3
[ 87.280944] CPU 3
[ 87.283081] Modules linked in: processor fan unix
[ 87.288109] Pid: 0, comm: swapper Not tainted 2.6.20-rc4-dic64 #0
[ 87.294253] RIP: 0010:[<ffffffff80149c75>] [<ffffffff80149c75>] cpu_idle+0x61/0xc7
[ 87.302039] RSP: 0018:ffff8100059ebed8 EFLAGS: 00000086
[ 87.307398] RAX: 00000000ffffffff RBX: ffffffff88015daf RCX: ffffffff80156564
[ 87.314574] RDX: ffff8100059ebec8 RSI: 0000000000000002 RDI: 0000000000000001
[ 87.321760] RBP: ffff8100059ebee8 R08: 0000000000000001 R09: 0000000000000001
[ 87.328942] R10: ffffffff80160cff R11: 0000000000000246 R12: fffffffffffffff7
[ 87.336125] R13: 0000000000000040 R14: 0000000000000246 R15: 0000000000000000
[ 87.343309] FS: 0000000000000000(0000) GS:ffff8100059970a0(0000) knlGS:0000000000000000
[ 87.351453] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 87.357249] CR2: 0000000000602148 CR3: 0000000000101000 CR4: 00000000000006e0
[ 87.364435] Process swapper (pid: 0, threadinfo ffff8100059ea000, task ffff8100059a7040)
[ 87.372577] Stack: ffff8100059ebee8 ffff8100052a5600 ffff8100059ebf48 ffffffff80174dc2
[ 87.380966] 0000000000000000 ffffffff80502e38 ffffffff80503038 000000000000037d
[ 87.388684] 00000000000006e8 0000000000000000 0000000000000000 0000000000000000
[ 87.396148] Call Trace:
[ 87.398896] [<ffffffff80174dc2>] start_secondary+0x46f/0x47e
[ 87.404695]
[ 87.406245]
[ 87.406247] Code: 75 25 e8 dc 06 04 00 0f 09 0f ae f0 65 48 8b 14 25 08 00 00
[ 87.416355] <3>BUG: sleeping function called from invalid context at /home/djwong/linux-2.6.20-rc4-dic94xx/kernel/rwsem.c:20
[ 87.427809] in_atomic():1, irqs_disabled():0
[ 87.432130] no locks held by swapper/0.
[ 87.436015]
[ 87.436017] Call Trace:
[ 87.440072] <NMI> [<ffffffff8019f84e>] debug_show_held_locks+0x9/0xb
[ 87.446718] [<ffffffff8010ba32>] __might_sleep+0xc6/0xc8
[ 87.452170] [<ffffffff8019da22>] down_read+0x1d/0x45
[ 87.457276] [<ffffffff8019556d>] blocking_notifier_call_chain+0x1b/0x41
[ 87.464029] [<ffffffff8018cabc>] profile_task_exit+0x15/0x17
[ 87.469824] [<ffffffff80114b87>] do_exit+0x25/0x870
[ 87.474844] [<ffffffff801646ff>] oops_end+0x42/0x62
[ 87.479863] [<ffffffff80164a05>] sync_regs+0x0/0x71
[ 87.484883] [<ffffffff801650ac>] nmi_watchdog_tick+0x156/0x240
[ 87.490856] [<ffffffff80164c07>] default_do_nmi+0x81/0x1c6
[ 87.496480] [<ffffffff801651c2>] do_nmi+0x2c/0x40
[ 87.501326] [<ffffffff801645ef>] nmi+0x7f/0x90
[ 87.505920] [<ffffffff88015daf>] :processor:acpi_processor_idle+0x0/0x4ad
[ 87.512852] [<ffffffff80160cff>] __sched_text_start+0xa7f/0xaba
[ 87.518905] [<ffffffff80156564>] mwait_idle+0x47/0x4c
[ 87.524097] [<ffffffff80149c75>] cpu_idle+0x61/0xc7
[ 87.529116] <<EOE>> [<ffffffff80174dc2>] start_secondary+0x46f/0x47e
[ 87.535770]
[ 87.537315] Kernel panic - not syncing: Attempted to kill the idle task!
[ 87.537315] Kernel panic - not syncing: Attempted to kill the idle task!
[ 87.544064] NMI Watchdog detected LOCKUP on CPU 2
[ 92.609986] CPU 2
[ 92.612110] Modules linked in: processor fan unix
[ 92.617096] Pid: 0, comm: swapper Not tainted 2.6.20-rc4-dic64 #0
[ 92.623230] RIP: 0010:[<ffffffff80149ca9>] [<ffffffff80149ca9>] cpu_idle+0x95/0xc7
[ 92.630998] RSP: 0018:ffff8100059b9ed8 EFLAGS: 00000046
[ 92.636358] RAX: 0000000000000000 RBX: ffffffff88015daf RCX: ffffffff80156564
[ 92.643533] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000001
[ 92.650708] RBP: ffff8100059b9ee8 R08: 0000000000000001 R09: 0000000000000001
[ 92.657883] R10: ffffffff80166fbd R11: 0000000000000070 R12: ffffffffffffffe6
[ 92.665057] R13: 0000000000000040 R14: 0000000000000246 R15: 0000000000000000
[ 92.672233] FS: 0000000000000000(0000) GS:ffff810005997858(0000) knlGS:0000000000000000
[ 92.680368] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 92.686156] CR2: 000000000060c270 CR3: 0000000000101000 CR4: 00000000000006e0
[ 92.693334] Process swapper (pid: 0, threadinfo ffff8100059b8000, task ffff81000599c080)
[ 92.701467] Stack: ffff8100059b9ee8 ffff81000529b100 ffff8100059b9f48 ffffffff80174dc2
[ 92.709813] 0000000000000000 ffffffff80502e30 ffffffff80503030 00000000000002d8
[ 92.717490] 000000000000057c 0000000000000000 0000000000000000 0000000000000000
[ 92.724926] Call Trace:
[ 92.727668] [<ffffffff80174dc2>] start_secondary+0x46f/0x47e
[ 92.733457]
[ 92.735000]
[ 92.735000] Code: 65 48 8b 04 25 10 00 00 00 8b 80 38 e0 ff ff a8 08 0f 84 6f
[ 92.745073] NMI Watchdog detected LOCKUP on CPU 7
[ 98.269459] CPU 7
[ 98.271582] Modules linked in: processor fan unix
[ 98.276578] Pid: 0, comm: swapper Not tainted 2.6.20-rc4-dic64 #0
[ 98.282719] RIP: 0010:[<ffffffff80149c50>] [<ffffffff80149c50>] cpu_idle+0x3c/0xc7
[ 98.290487] RSP: 0018:ffff8100bf8f7ed8 EFLAGS: 00000046
[ 98.295846] RAX: ffff8100052ce400 RBX: ffffffff88015daf RCX: ffffffff80156564
[ 98.303023] RDX: ffff810084d78200 RSI: 0000000000000002 RDI: 0000000000000001
[ 98.310207] RBP: ffff8100bf8f7ee8 R08: 0000000000000001 R09: 0000000000000001
[ 98.317392] R10: ffffffff80166fbd R11: ffff8100052d0fe0 R12: ffffffffffffffc8
[ 98.324573] R13: 0000000000000040 R14: 0000000000000246 R15: 0000000000000000
[ 98.331748] FS: 0000000000000000(0000) GS:ffff8100bf82ee48(0000) knlGS:0000000000000000
[ 98.339893] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 98.345689] CR2: 000000000043d360 CR3: 0000000000101000 CR4: 00000000000006e0
[ 98.352866] Process swapper (pid: 0, threadinfo ffff8100bf8f6000, task ffff8100bf8d6080)
[ 98.361007] Stack: ffff8100bf8f7ee8 ffff8100052cea00 ffff8100bf8f7f48 ffffffff80174dc2
[ 98.369396] 0000000000000000 ffffffff80502e58 ffffffff80503058 0000000000000398
[ 98.377108] 00000000000006c1 0000000000000000 0000000000000000 0000000000000000
[ 98.384579] Call Trace:
[ 98.387326] [<ffffffff80174dc2>] start_secondary+0x46f/0x47e
[ 98.393117]
[ 98.394660]
[ 98.394661] Code: 48 8b 1d b1 03 42 00 48 c7 c0 1b a6 16 80 48 85 db 48 0f 44
(repeat until the machine reboots)
Darrick,
I tried 2.6.20-rc4 on a Dempsey system here in my lab and it worked
fine. No watchdog lockups.
Can you try idle routine with hlt instead of mwait. There is no boot
option for this in x86_64, but you can change
arch/x86_64/kernel/process.c:select_idle_routine() not to enable mwait.
With that default kernel should use hlt based idle.
Also, worth seeing will be, what happens when nmi_watchdog=0,
nmi_watchdog=1, and nmi_watchdog=2 boot options. That should tell us
whether nmi_watchdog is raising some false alarm or the CPUs are indeed
getting locked up here..
Thanks,
Venki
>-----Original Message-----
>From: Darrick J. Wong [mailto:[email protected]]
>Sent: Friday, January 12, 2007 1:01 PM
>To: Pallipadi, Venkatesh
>Cc: Linux Kernel Mailing List
>Subject: [BUG] NMI watchdog lockups caused by mwait_idle
>
>Hi Venkatesh,
>
>I have an IBM IntelliStation Z30 with two Dempsey CPUs. When I try to
>boot 2.6.20-rc4 on it, the system prints messages about NMI watchdog
>lockups. git-bisect determined that the patch "[PATCH] x86-64: Fix
>interrupt race in idle callback (3rd try)" was the source of these
>problems, and I can work around the problem either by passing
>"idle=poll" to get avoid mwait_idle or by reverting the patch.
>
>Other non-Dempsey Xeon machines with mwait support do not exhibit these
>symptoms. I will try to determine if this is a bug specific to Dempsey
>CPUs or this particular type of machine. I suspect the latter, but I
>don't know enough about monitor/mwait to pursue this much further.
>
>What else can I do to diagnose this?
>
>--D
>
Pallipadi, Venkatesh wrote:
> Darrick,
>
> I tried 2.6.20-rc4 on a Dempsey system here in my lab and it worked
> fine. No watchdog lockups.
> Can you try idle routine with hlt instead of mwait. There is no boot
> option for this in x86_64, but you can change
> arch/x86_64/kernel/process.c:select_idle_routine() not to enable mwait.
> With that default kernel should use hlt based idle.
>
> Also, worth seeing will be, what happens when nmi_watchdog=0,
> nmi_watchdog=1, and nmi_watchdog=2 boot options. That should tell us
> whether nmi_watchdog is raising some false alarm or the CPUs are indeed
> getting locked up here..
>
Locks up with hlt-based idle too. :(
Here's what I get with nmi_watchdog=0:
[ 206.088703] BUG: soft lockup detected on CPU#0!
[ 206.093284]
[ 206.093286] Call Trace:
[ 206.097324] <IRQ> [<ffffffff801b1f89>] softlockup_tick+0xd4/0xe9
[ 206.103618] [<ffffffff80173c55>] do_flush_tlb_all+0x0/0x68
[ 206.109238] [<ffffffff8014d8f8>] run_local_timers+0x13/0x15
[ 206.114949] [<ffffffff80192844>] update_process_times+0x4c/0x78
[ 206.121008] [<ffffffff80174fcd>] smp_local_timer_interrupt+0x34/0x51
[ 206.127498] [<ffffffff801756b1>] smp_apic_timer_interrupt+0x49/0x60
[ 206.133901] [<ffffffff8015cd16>] apic_timer_interrupt+0x66/0x70
[ 206.139956] <EOI> [<ffffffff80173baa>] __smp_call_function+0x66/0x87
[ 206.146594] [<ffffffff80173ba6>] __smp_call_function+0x62/0x87
[ 206.152564] [<ffffffff80173c55>] do_flush_tlb_all+0x0/0x68
[ 206.158188] [<ffffffff80173c55>] do_flush_tlb_all+0x0/0x68
[ 206.163813] [<ffffffff80173cef>] smp_call_function+0x32/0x49
[ 206.169611] [<ffffffff80173c55>] do_flush_tlb_all+0x0/0x68
[ 206.175236] [<ffffffff8018e117>] on_each_cpu+0x30/0x67
[ 206.180514] [<ffffffff80173d46>] flush_tlb_all+0x1c/0x1e
[ 206.185965] [<ffffffff80150f2a>] unmap_vm_area+0x1c3/0x265
[ 206.191590] [<ffffffff80101c20>] init_level4_pgt+0xc20/0x1000
[ 206.197474] [<ffffffff801bfc47>] remove_vm_area+0x41/0x67
[ 206.203010] [<ffffffff8017c33c>] iounmap+0x8e/0xc8
[ 206.207933] [<ffffffff80230032>] acpi_os_unmap_memory+0x9/0xb
[ 206.213810] [<ffffffff8023aaff>] acpi_ev_system_memory_region_setup+0x52/0x105
[ 206.221174] [<ffffffff80259465>] acpi_ut_delete_internal_obj+0x2c4/0x3b2
[ 206.228012] [<ffffffff802596d3>] acpi_ut_update_ref_count+0x180/0x1d2
[ 206.234587] [<ffffffff80259885>] acpi_ut_update_object_reference+0x160/0x207
[ 206.241770] [<ffffffff802599e1>] acpi_ut_remove_reference+0xb5/0xd5
[ 206.248173] [<ffffffff8024da8a>] acpi_ns_detach_object+0xca/0xee
[ 206.254318] [<ffffffff8024b08a>] acpi_ns_delete_namespace_by_owner+0xcf/0x154
[ 206.261597] [<ffffffff80234481>] acpi_ds_terminate_control_method+0xb5/0x14f
[ 206.268779] [<ffffffff8024ef7c>] acpi_ps_parse_aml+0x242/0x3a0
[ 206.274750] [<ffffffff80250a00>] acpi_ps_execute_pass+0xd5/0x10b
[ 206.280895] [<ffffffff80250c3c>] acpi_ps_execute_method+0x1bf/0x2cb
[ 206.287298] [<ffffffff8024b4da>] acpi_ns_evaluate+0x1f8/0x315
[ 206.293180] [<ffffffff8024abf1>] acpi_evaluate_object+0x1d9/0x2fa
[ 206.299411] [<ffffffff8010ab03>] kmem_cache_alloc+0xce/0xda
[ 206.305125] [<ffffffff880146a9>] :processor:acpi_processor_start+0x656/0x6fd
[ 206.312307] [<ffffffff801cc2a0>] kmem_cache_zalloc+0xce/0xf4
[ 206.318103] [<ffffffff80261097>] acpi_start_single_object+0x2a/0x54
[ 206.324509] [<ffffffff8026192d>] acpi_bus_register_driver+0xcd/0x14c
[ 206.331001] [<ffffffff88022061>] :processor:acpi_processor_init+0x61/0xb7
[ 206.337923] [<ffffffff801a4d6e>] sys_init_module+0xac/0x16c
[ 206.343630] [<ffffffff8015c11e>] system_call+0x7e/0x83
nmi_watchdog={1,2} produce the same errors.
--D