Sorry for the bump on this thread, but bringing up CPU0 on 4.16 causes ~60 sec
delay and then injects another 120 seconds of dumps. 4.14 has no delay:
Hardware: Supermicro H8DM8-2/H8DM8-2, 2-Quad Core Opetron
dmesg Summary (full dmesg attached):
[ 0.464392] NET: Registered protocol family 1
[ 0.464769] PCI Interrupt Link [LUB0] enabled at IRQ 23
<snip>
[ 10.412957] perf: interrupt took too long (10078 > 2500), lowering
kernel.perf_event_max_sample_rate to 19800
[ 10.466267] INFO: NMI handler (perf_event_nmi_handler) took too long to
run: 3.398 msecs
[ 10.466267] perf: interrupt took too long (23235 > 12597), lowering
kernel.perf_event_max_sample_rate to 8400
[ 11.139636] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to
run: 38.086 msecs
[ 11.138340] perf: interrupt took too long (979330 > 911352), lowering
kernel.perf_event_max_sample_rate to 300
[ 11.391081] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to
run: 72.692 msecs
[ 11.139636] INFO: NMI handler (perf_event_nmi_handler) took too long to
run: 381.378 msecs
[ 12.138796] perf: interrupt took too long (1897735 > 1224162), lowering
kernel.perf_event_max_sample_rate to 300
[ 25.142864] INFO: NMI handler (perf_event_nmi_handler) took too long to
run: 1947.833 msecs
[ 28.140139] perf: interrupt took too long (13755889 > 2372168), lowering
kernel.perf_event_max_sample_rate to 300
[ 28.138850] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
[ 28.138850] Modules linked in:
[ 28.138850] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.16.13-1-ARCH #1
[ 86.138261] Hardware name: Supermicro H8DM8-2/H8DM8-2, BIOS 080014 11/06/2009
[ 90.139366] RIP: 0010:__do_softirq+0x75/0x2e0
[ 90.139366] RSP: 0018:ffff9b36e7c03f70 EFLAGS: 00000297 ORIG_RAX:
ffffffffffffff12
[ 90.139366] RAX: 0000000000021000 RBX: ffff9b36e7c15f00 RCX: 000000003543891f
[ 91.138917] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to
run: 467.840 msecs
[ 91.138917] RDX: 0000000000000000 RSI: 000000003543891f RDI: 0000000000000100
[ 91.138917] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000078
[ 91.138917] R10: 000000005b1397ce R11: 0000000000000000 R12: 0000000000000000
[ 91.138917] R13: 0000000000000000 R14: 0000000000000282 R15: 0000000000000000
[ 91.138917] FS: 0000000000000000(0000) GS:ffff9b36e7c00000(0000)
knlGS:0000000000000000
[ 91.138917] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 91.138917] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
[ 92.138146] CR2: ffff9b33d2655000 CR3: 000000011200a000 CR4: 00000000000006f0
[ 92.138146] Call Trace:
[ 92.138146] <IRQ>
[ 92.138146] irq_exit+0xc9/0xe0
[ 93.138944] smp_apic_timer_interrupt+0x73/0x160
[ 94.137773] Modules linked in:
[ 94.137772] apic_timer_interrupt+0xf/0x20
[ 95.137578] </IRQ>
[ 95.137578] RIP: 0010:native_safe_halt+0x2/0x10
[ 95.137578] RSP: 0018:ffffffffbf003ea0 EFLAGS: 00000202
[ 96.136670] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.16.13-1-ARCH #1
[ 97.138510] Hardware name: Supermicro H8DM8-2/H8DM8-2, BIOS 080014 11/06/2009
[ 97.138081] ORIG_RAX: ffffffffffffff12
<Roughly 900 lines snipped>
then continues
[ 151.438482] Sending NMI from CPU 1 to CPUs 3:
[ 151.469721] NMI backtrace for cpu 3
[ 151.469722] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G L
4.16.13-1-ARCH #1
[ 151.469723] Hardware name: Supermicro H8DM8-2/H8DM8-2, BIOS 080014 11/06/2009
[ 151.469724] RIP: 0010:native_queued_spin_lock_slowpath+0x19a/0x1c0
[ 151.469725] RSP: 0018:ffff9b36e7cc3ef8 EFLAGS: 00000046
[ 151.469726] RAX: 0000000000000000 RBX: ffffffffbf051640 RCX: ffff9b36e7ce2480
[ 151.469726] RDX: 0000000000000002 RSI: ffffffffbee70e1c RDI: 0000000000100000
[ 151.469727] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000101
[ 151.469728] R10: 0000000000338518 R11: 0000000000000000 R12: ffff9b36e7ce2480
[ 151.469728] R13: ffffffffbeed81f2 R14: 0000000000000009 R15: 0000000000000000
[ 151.469729] FS: 0000000000000000(0000) GS:ffff9b36e7cc0000(0000)
knlGS:0000000000000000
[ 151.469730] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 151.469731] CR2: 0000000000000000 CR3: 000000011200a000 CR4: 00000000000006e0
[ 151.469731] Call Trace:
[ 151.469732] <IRQ>
[ 151.469733] _raw_spin_lock_irqsave+0x44/0x50
[ 151.469735] rcu_process_callbacks+0xce/0x3b0
[ 151.469737] __do_softirq+0xf1/0x2e0
[ 151.469739] irq_exit+0xc9/0xe0
[ 151.469741] smp_apic_timer_interrupt+0x73/0x160
[ 151.469743] apic_timer_interrupt+0xf/0x20
[ 151.469744] </IRQ>
[ 151.469745] RIP: 0010:native_safe_halt+0x2/0x10
[ 151.469745] RSP: 0018:ffff9d6a431bbec0 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff12
[ 151.469747] RAX: 0000000000000003 RBX: 0000000000000003 RCX: 0000000000000000
[ 151.469747] RDX: 0000000000000000 RSI: ffffffffbee70e1c RDI: ffffffffbee7107a
[ 151.469748] RBP: ffffffffbf132b60 R08: 0000000000000000 R09: 0000000000000000
[ 151.469748] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffbeed81f2
[ 151.469749] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 151.469752] default_idle+0x18/0x100
[ 151.469753] do_idle+0x179/0x1b0
[ 151.469755] cpu_startup_entry+0x6f/0x80
[ 151.469757] start_secondary+0x1aa/0x200
[ 151.469758] secondary_startup_64+0xa5/0xb0
[ 151.469760] Code: 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 80 24 02
00 48 03 04 d5 80 04 ef be 48 89 08 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85>
c0 74 f7 48 8b 01 48 85 c0 0f 84 54 ff ff ff 49 89 c0 0f 0d
[ 151.438482] Sending NMI from CPU 1 to CPUs 4:
[ 151.470739] NMI backtrace for cpu 4 skipped: idling at
native_safe_halt+0x2/0x10
[ 151.438482] Sending NMI from CPU 1 to CPUs 5:
[ 151.470007] NMI backtrace for cpu 5 skipped: idling at
native_safe_halt+0x2/0x10
[ 151.438482] Sending NMI from CPU 1 to CPUs 7:
[ 151.471011] NMI backtrace for cpu 7 skipped: idling at
native_safe_halt+0x2/0x10
[ 151.474216] PCI Interrupt Link [LUB2] enabled at IRQ 22
[ 151.490261] pci 0000:00:00.0: Found enabled HT MSI Mapping
and boot then completes normally.
All earlier kernels continue to boot fine without this delay. I have Archlinux
with both 4.16.13 and 4.14.47-LTS installed on this server, and 4.14.47 has no
such delay.
I have had no success trying to bisect the differences (I'm not that practiced
at it), but need help figuring out what changed and why I now have a 150
second delay where some test no loops through every core dumping info -- just
to figure out everything is OK and then continue the boot.
I have attached the full dmesg output here. If you need anything further, just
ask.
--
David C. Rankin, J.D.,P.E.