Date: Mon, 31 Aug 2015 23:12:12 +0200 (CEST)
From: Thomas Gleixner
To: Alex Thorlton
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Ingo Molnar,
    John Stultz, Russ Anderson, Dimitri Sivanich
Subject: Re: [BUG] Boot hangs at clocksource_done_booting on large configs
In-Reply-To: <20150831180432.GQ20615@asylum.americas.sgi.com>
References: <20150831180432.GQ20615@asylum.americas.sgi.com>

On Mon, 31 Aug 2015, Alex Thorlton wrote:

> I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule
> out any scaling issues related to multiple concurrent reads to our
> RTC's MMR.

And to rule out scaling issues you replaced the RTC MMR with HPET. Not
a very good choice: HPET does not scale either. It's uncached memory
mapped I/O. See below.

> I'm hoping to get some input from the experts in this area, first of
> all, on whether the problem I'm seeing is actually what I think it is,
> and, if so, if I've solved it in the correct way.

I fear both the analysis and the solution are wrong.

Up to the point where the actual clocksource change happens, there is
no reason why timer interrupts should not happen. And the code which
actually changes the clocksource is definitely called with interrupts
disabled. When that function returns, the new clocksource is fully
functional and interrupts can happen again.

Now looking at your backtraces: most CPUs are in the migration thread
and a few (3073, 3078, 3079, 3082) are in the idle task.

From the trace artifacts (? read_hpet) it looks like the clocksource
change has been done and the CPUs are on the way back from stop
machine. But they are obviously held off by something, and that
something looks like the timekeeper sequence lock. Too bad that we
don't have a backtrace for CPU0 in the log.

I really wonder how a machine that large works with HPET as the
clocksource at all. read_hpet() is an uncached memory mapped I/O read
which takes thousands of CPU cycles. Last time I looked it was around
1us. Let's take that number to do some math.

If all CPUs do that access at the same time, then it takes NCPUS
microseconds to complete if the memory mapped I/O scheduling is
completely fair, which I doubt. So with 4k CPUs that's a whopping
4.096ms, and it gets worse if you go larger. That's more than a full
tick at HZ=250.

I'm quite sure that you are staring at the HPET scalability bottleneck
and not at some actual kernel bug. Your patch shifts some timing
around so the issue does not happen, but that's certainly not a
solution.

Thanks,

	tglx
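
For context on "called with interrupts disabled": the switch is driven
through stop_machine(), which parks every CPU in its migration thread
while a single CPU swaps the clocksource. A condensed sketch, loosely
following kernel/time/timekeeping.c around v4.2 (abbreviated, not the
verbatim kernel source):

/* Condensed sketch of the clocksource switch; details omitted. */
static int change_clocksource(void *data)
{
	struct timekeeper *tk = &tk_core.timekeeper;
	struct clocksource *new = data;
	unsigned long flags;

	raw_spin_lock_irqsave(&timekeeper_lock, flags);
	write_seqcount_begin(&tk_core.seq);	/* readers retry from here on */

	/* ... install 'new' as tk->tkr_mono.clock and resync ... */

	write_seqcount_end(&tk_core.seq);
	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
	return 0;
}

int timekeeping_notify(struct clocksource *clock)
{
	/* Runs change_clocksource() on one CPU while all the others spin
	 * in their migration threads with interrupts disabled. */
	stop_machine(change_clocksource, clock, NULL);
	tick_clock_notify();
	return tk_core.timekeeper.tkr_mono.clock == clock ? 0 : -1;
}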
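
The "held off by the timekeeper sequence lock" theory refers to the
retry loop that every clock reader sits in. Condensed from ktime_get()
in the same file (again a sketch, not verbatim):

ktime_t ktime_get(void)
{
	struct timekeeper *tk = &tk_core.timekeeper;
	unsigned int seq;
	ktime_t base;
	s64 nsecs;

	do {
		seq = read_seqcount_begin(&tk_core.seq);	/* snapshot */
		base = tk->tkr_mono.base;
		nsecs = timekeeping_get_ns(&tk->tkr_mono);	/* reads the clocksource */
	} while (read_seqcount_retry(&tk_core.seq, seq));	/* writer active? retry */

	return ktime_add_ns(base, nsecs);
}

While one CPU holds the write side in change_clocksource(), every other
CPU that samples time spins in this loop, and the moment the writer
finishes, thousands of readers hit the clocksource read nearly
simultaneously, which is where the HPET serialization bites.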
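
The ~1us figure is for a single uncached read of the HPET main counter.
The whole clocksource read path boils down to roughly this (from
arch/x86/kernel/hpet.c and asm/hpet.h around v4.2, lightly condensed):

static inline unsigned int hpet_readl(unsigned int a)
{
	return readl(hpet_virt_address + a);	/* uncached MMIO read */
}

static cycle_t read_hpet(struct clocksource *cs)
{
	return (cycle_t)hpet_readl(HPET_COUNTER);
}

There is no cache in front of the device, so concurrent readers
serialize on the single HPET counter register.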
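
The back-of-the-envelope numbers check out; a throwaway userspace
snippet (hypothetical, just re-doing the arithmetic above under the
stated assumptions of 4096 CPUs, ~1us per read, and perfectly fair
MMIO scheduling):

#include <stdio.h>

int main(void)
{
	const unsigned int ncpus = 4096;	/* CPUs hammering the HPET */
	const double read_us = 1.0;		/* ~1us per uncached read  */
	const unsigned int hz = 250;		/* kernel tick rate        */

	/* Fair serialization: each CPU waits behind all the others. */
	double worst_case_ms = ncpus * read_us / 1000.0;
	double tick_ms = 1000.0 / hz;

	printf("worst case HPET pile-up: %.3f ms\n", worst_case_ms);	/* 4.096 */
	printf("tick length at HZ=%u:    %.3f ms\n", hz, tick_ms);	/* 4.000 */
	return 0;
}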