Date: Mon, 31 Aug 2015 23:12:12 +0200 (CEST)
From: Thomas Gleixner
To: Alex Thorlton
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Ingo Molnar,
    John Stultz, Russ Anderson, Dimitri Sivanich
Subject: Re: [BUG] Boot hangs at clocksource_done_booting on large configs
In-Reply-To: <20150831180432.GQ20615@asylum.americas.sgi.com>
References: <20150831180432.GQ20615@asylum.americas.sgi.com>

On Mon, 31 Aug 2015, Alex Thorlton wrote:

> I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule
> out any scaling issues related to multiple concurrent reads to our
> RTC's MMR.

And to rule out scaling issues you replaced the RTC MMR with HPET. Not
a very good choice: HPET does not scale either. It's uncached memory
mapped I/O. See below.

> I'm hoping to get some input from the experts in this area, first of
> all, on whether the problem I'm seeing is actually what I think it is,
> and, if so, if I've solved it in the correct way.

I fear both the analysis and the solution are wrong.

Up to the point where the actual clocksource change happens, there is
no reason why timer interrupts should not happen. And the code which
actually changes the clocksource is definitely called with interrupts
disabled. When that function returns, the new clocksource is fully
functional and interrupts can happen again.

Now looking at your backtraces: most CPUs are in the migration thread
and a few (3073, 3078, 3079, 3082) are in the idle task.

From the trace artifacts (? read_hpet) it looks like the clocksource
change has been done and the CPUs are on the way back from stop
machine. But they are obviously held off by something, and that
something looks like the timekeeper sequence lock. Too bad that we
don't have a backtrace for CPU0 in the log.

I really wonder how a machine that large works with HPET as the
clocksource at all. read_hpet() is an uncached memory mapped I/O read
which takes thousands of CPU cycles. Last time I looked it was around
1us. Let's take that number to do some math.

If all CPUs do that access at the same time, then it takes NCPUS
microseconds to complete if the memory mapped I/O scheduling is
completely fair, which I doubt. So with 4k CPUs that's a whopping
4.096ms, and it gets worse if you go larger. That's more than a full
tick at HZ=250.

I'm quite sure that you are staring at the HPET scalability bottleneck
and not at some actual kernel bug. Your patch shifts some timing
around so the issue does not happen, but that's certainly not a
solution.

Thanks,

	tglx
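
For context on "called with interrupts disabled": the switch is driven
through stop_machine(), which parks every CPU in its migration thread
while a single CPU swaps the clocksource. A condensed sketch, loosely
following kernel/time/timekeeping.c around v4.2 (abbreviated, not the
verbatim kernel source):

/* Condensed sketch of the clocksource switch; details omitted. */
static int change_clocksource(void *data)
{
	struct timekeeper *tk = &tk_core.timekeeper;
	struct clocksource *new = data;
	unsigned long flags;

	raw_spin_lock_irqsave(&timekeeper_lock, flags);
	write_seqcount_begin(&tk_core.seq);	/* readers retry from here on */

	/* ... install 'new' as tk->tkr_mono.clock and resync ... */

	write_seqcount_end(&tk_core.seq);
	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
	return 0;
}

int timekeeping_notify(struct clocksource *clock)
{
	/* Runs change_clocksource() on one CPU while all the others spin
	 * in their migration threads with interrupts disabled. */
	stop_machine(change_clocksource, clock, NULL);
	tick_clock_notify();
	return tk_core.timekeeper.tkr_mono.clock == clock ? 0 : -1;
}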
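
The "held off by the timekeeper sequence lock" theory refers to the
retry loop that every clock reader sits in. Condensed from ktime_get()
in the same file (again a sketch, not verbatim):

ktime_t ktime_get(void)
{
	struct timekeeper *tk = &tk_core.timekeeper;
	unsigned int seq;
	ktime_t base;
	s64 nsecs;

	do {
		seq = read_seqcount_begin(&tk_core.seq);	/* snapshot */
		base = tk->tkr_mono.base;
		nsecs = timekeeping_get_ns(&tk->tkr_mono);	/* reads the clocksource */
	} while (read_seqcount_retry(&tk_core.seq, seq));	/* writer active? retry */

	return ktime_add_ns(base, nsecs);
}

While one CPU holds the write side in change_clocksource(), every other
CPU that samples time spins in this loop, and the moment the writer
finishes, thousands of readers hit the clocksource read nearly
simultaneously, which is where the HPET serialization bites.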
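
The ~1us figure is for a single uncached read of the HPET main counter.
The whole clocksource read path boils down to roughly this (from
arch/x86/kernel/hpet.c and asm/hpet.h around v4.2, lightly condensed):

static inline unsigned int hpet_readl(unsigned int a)
{
	return readl(hpet_virt_address + a);	/* uncached MMIO read */
}

static cycle_t read_hpet(struct clocksource *cs)
{
	return (cycle_t)hpet_readl(HPET_COUNTER);
}

There is no cache in front of the device, so concurrent readers
serialize on the single HPET counter register.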
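
The back-of-the-envelope numbers check out; a throwaway userspace
snippet (hypothetical, just re-doing the arithmetic above under the
stated assumptions of 4096 CPUs, ~1us per read, and perfectly fair
MMIO scheduling):

#include <stdio.h>

int main(void)
{
	const unsigned int ncpus = 4096;	/* CPUs hammering the HPET */
	const double read_us = 1.0;		/* ~1us per uncached read  */
	const unsigned int hz = 250;		/* kernel tick rate        */

	/* Fair serialization: each CPU waits behind all the others. */
	double worst_case_ms = ncpus * read_us / 1000.0;
	double tick_ms = 1000.0 / hz;

	printf("worst case HPET pile-up: %.3f ms\n", worst_case_ms);	/* 4.096 */
	printf("tick length at HZ=%u:    %.3f ms\n", hz, tick_ms);	/* 4.000 */
	return 0;
}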