Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754107AbbHaSEh (ORCPT ); Mon, 31 Aug 2015 14:04:37 -0400 Received: from relay2.sgi.com ([192.48.180.65]:36018 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753850AbbHaSEg (ORCPT ); Mon, 31 Aug 2015 14:04:36 -0400 Date: Mon, 31 Aug 2015 13:04:33 -0500 From: Alex Thorlton To: linux-kernel@vger.kernel.org Cc: Peter Zijlstra , Ingo Molnar , John Stultz , Thomas Gleixner , Russ Anderson , Dimitri Sivanich , Alex Thorlton Subject: [BUG] Boot hangs at clocksource_done_booting on large configs Message-ID: <20150831180432.GQ20615@asylum.americas.sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7556 Lines: 169 Hey guys, We've been hitting issues with machines with 4096 or more logical cores hanging up at clocksource_done_booting, seemingly forever. The problem occurs somewhat intermittently, though it hits more consistently as you increase the core count. We've observed this problem on kernels starting at 3.10, all the way up through 4.2. We initially thought there was some issue with the stop_machine code, but Thomas G. cleared that up for us with a thorough explanation of the stop_machine mechanism. After ruling out issues in stop_machine, we started investigating possible hardware related scaling issues. I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule out any scaling issues related to multiple concurrent reads to our RTC's MMR. We start getting hung task timeouts during clocksource_done_booting: 8<---- calling clocksource_done_booting+0x0/0x42 @ 1 INFO: task swapper/0:1 blocked for more than 480 seconds. Not tainted 4.2.0-rc1-nortc+ #384 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. swapper/0 D ffff881efa933b48 0 1 0 0x00000000 ffff881efa933b48 ffff881ef45e0000 ffffb5de7a130000 000000fdfa933b38 ffff881efa934000 7fffffffffffffff ffff881efa933ca0 ffffb5de7a130000 0000000000000000 ffff881efa933b68 ffffffff8158eae7 ffff881efa933b68 Call Trace: [] schedule+0x37/0x80 [] schedule_timeout+0x1e7/0x270 [] ? _find_next_bit.part.0+0x19/0x70 [] wait_for_completion+0xa0/0x100 [] ? wake_up_q+0x70/0x70 [] ? cpu_stop_should_run+0x50/0x50 [] __stop_cpus+0x50/0x70 [] ? console_unlock+0x2dd/0x4a0 [] ? cpu_stop_should_run+0x50/0x50 [] stop_cpus+0x35/0x50 [] ? tk_setup_internals.constprop.10+0x140/0x140 [] __stop_machine+0xb2/0xe0 [] ? tk_setup_internals.constprop.10+0x140/0x140 [] ? tk_setup_internals.constprop.10+0x140/0x140 [] stop_machine+0x2e/0x50 [] timekeeping_notify+0x2d/0x50 [] __clocksource_select+0xcb/0x120 [] ? boot_override_clock+0x4d/0x4d [] clocksource_done_booting+0x32/0x42 [] do_one_initcall+0x169/0x1d0 [] kernel_init_freeable+0x1bb/0x248 [] ? rest_init+0x80/0x80 [] kernel_init+0xe/0xe0 [] ret_from_fork+0x3f/0x70 [] ? rest_init+0x80/0x80 ---->8 After this I NMId the machine to get backtraces for the other CPUs. Most are stuck around here: 8<---- CPU: 4867 PID: 19730 Comm: migration/4867 Tainted: G D 4.2.0-rc1-nortc+ #384 Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013 task: ffffad5e798cc500 ti: ffffad5e798d4000 task.ti: ffffad5e798d4000 RIP: 0010:[] [] uv_handle_nmi+0x19/0x790 RSP: 0000:ffffc90020156da8 EFLAGS: 00010092 RAX: 000002fe65b95384 RBX: 000000000000002d RCX: ffffc90020162a98 RDX: 000002fe65b95384 RSI: ffffc90020156ef8 RDI: 0000000000000001 RBP: ffffc90020156e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000003 R11: 0000000000000020 R12: 000002fe65b95384 R13: 0000000000000000 R14: ffffffff81a38880 R15: ffffc90020156ef8 FS: 0000000000000000(0000) GS:ffffc90020151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000da80 CR3: 0000000001a0b000 CR4: 00000000001406e0 Stack: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 000000000000002d 000002fe65b95384 0000000000000000 Call Trace: [] ? cpu_stop_should_run+0x50/0x50 [] nmi_handle+0x79/0x100 [] ? nmi_handle+0x83/0x100 [] ? cpu_stop_should_run+0x50/0x50 [] unknown_nmi_error+0x1c/0x90 [] default_do_nmi+0xd8/0xf0 [] do_nmi+0x83/0xb0 [] end_repeat_nmi+0x1e/0x2e [] ? cpu_stop_should_run+0x50/0x50 [] ? read_hpet+0x16/0x20 [] ? read_hpet+0x16/0x20 [] ? read_hpet+0x16/0x20 <> [] ktime_get+0x37/0xb0 [] clockevents_program_event+0x4b/0x130 [] tick_handle_periodic+0x57/0x70 [] local_apic_timer_interrupt+0x3c/0x70 [] smp_apic_timer_interrupt+0x41/0x60 [] apic_timer_interrupt+0x6b/0x70 [] ? update_blocked_averages+0x381/0x860 [] ? multi_cpu_stop+0x98/0xf0 [] ? __switch_to+0x26a/0x590 [] ? cpu_stop_should_run+0x50/0x50 [] cpu_stopper_thread+0x88/0x110 [] ? __schedule+0x338/0x900 [] smpboot_thread_fn+0x135/0x190 [] ? sort_range+0x30/0x30 [] kthread+0xdb/0x100 [] ? kthread_create_on_node+0x180/0x180 [] ret_from_fork+0x3f/0x70 [] ? kthread_create_on_node+0x180/0x180 Code: 31 c0 e8 a0 c2 52 00 e9 39 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f7 53 48 83 ec 48 <48> 8b 1d 00 f2 fa 7e 65 44 8b 2d 70 b8 fa 7e 9c 58 0f 1f 44 00 ---->8 I can provide full logs if desired. Anyways, based on my limited understanding of what's going on here, I tried disabling interrupts a bit further out in the cpu_stopper_thread code, to try and avoid that apic_timer_interrupt during the clocksource change. change_clocksource already disables interrupts, but, from the backtraces, it appears that we're taking an interrupt after the stopper thread wakes up, but before change_clocksource actually gets called. I used this patch to try and avoid taking a apic_timer_interrupt at any point during the clocksource change.: 8<--- diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c index fd643d8..8502521 100644 --- a/kernel/stop_machine.c +++ b/kernel/stop_machine.c @@ -417,8 +417,11 @@ static void cpu_stopper_thread(unsigned int cpu) { struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu); struct cpu_stop_work *work; + unsigned long flags; int ret; + local_irq_save(flags); + repeat: work = NULL; spin_lock_irq(&stopper->lock); @@ -452,6 +455,8 @@ repeat: cpu_stop_signal_done(done, true); goto repeat; } + + local_irq_restore(flags); } extern void sched_set_stop_task(int cpu, struct task_struct *stop); --->8 This was a bit of a shot in the dark, but it has worked on every kernel and every machine we've tried it on so far. I'm hoping to get some input from the experts in this area, first of all, on whether the problem I'm seeing is actually what I think it is, and, if so, if I've solved it in the correct way. Any input is greatly appreciated! - Alex -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/