Date: 10 Jun 2015 03:08:42 -0400
Message-ID: <20150610070842.4980.qmail@ns.horizon.com>
From: "George Spelvin"
To: adrian.hunter@intel.com, linux@horizon.com, tglx@linutronix.de
Cc: ak@linux.intel.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
    luto@amacapital.net, mingo@kernel.org, torvalds@linux-foundation.org
Subject: Discussion: quick_pit_calibrate is slow
In-Reply-To: <5576AE4A.4010802@intel.com>

Adrian Hunter wrote:
> A bigger issue for my case is that "slow" calibration is not that slow,
> taking only 10ms anyway which is much better than the 50ms max for so-called
> "quick" calibration.

I read the code, and after figuring out which comments are wrong, this
is absolutely right.  The "quick" code takes 20 ms if it works, or 50 ms
if it has problems.  The fallback code can finish in 10 ms.  When bit
rot has set in to this degree, something needs to be done.

Given that people would like to boot VMs in less than 1 second, 20 ms
is 2% of that budget, and something worth trying to improve.

* The current situation

While it would be lovely if there were a reliable hardware register
with the TSC frequency, there isn't.  And as long as there's a battle
between Intel's sales people (who love locking processor frequencies
and charging more for faster ones) and motherboard makers and
overclockers (who will go to great lengths to get around such limits),
I doubt there will be a trustworthy way for the CPU to determine its
own clock rate without using off-chip peripherals.

** quick_pit_calibrate()

The first thing to explain is why quick_pit_calibrate() takes 20 ms in
the best case.

quick_pit_calibrate() polls the PIT, looking for a carry from the
lsbyte to the msbyte.  When it sees the msbyte change, it assigns an
uncertainty window which contains the TSC value at the moment the
carry happened: it starts before the last read of the old value and
ends after the first read of the new value.  (This is returned in
*deltap from pit_expect_msb().)

That window includes four inb() operations and two rdtsc.  Given that
an inb() takes a bit over a microsecond on normal, working hardware,
that's a total of 5 us of uncertainty (5.4 us on the machine on my
desk).  Add the uncertainty at both ends of the time interval and you
have 10 us.

Now, quick_pit_calibrate() loops until the uncertainty is less than
1/2048 of the time interval, i.e. until the TSC rate is known to
within 500 ppm.  That means it won't stop until 20 ms have elapsed
(22 ms for the machine on my desk).
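To make that window concrete, here is roughly what the polling loop
behind pit_expect_msb() does.  This is a simplified paraphrase for the
sake of discussion, not the kernel code verbatim, and the function
name here is made up:

/* Poll PIT channel 2 until its msbyte is no longer "val". */
static int expect_msb_sketch(unsigned char val, u64 *tscp, u64 *deltap)
{
        u64 tsc = 0, prev_tsc = 0;
        int count;

        for (count = 0; count < 50000; count++) {
                inb(0x42);              /* lsbyte: read and discard */
                if (inb(0x42) != val)   /* msbyte changed: carry seen */
                        break;
                prev_tsc = tsc;
                tsc = get_cycles();
        }

        /*
         * The carry happened somewhere between prev_tsc (taken before
         * the last check that still saw the old msbyte) and here (taken
         * after the first check that saw the new one): four inb() plus
         * two rdtsc, i.e. roughly 5 us of uncertainty.
         */
        *deltap = get_cycles() - prev_tsc;
        *tscp = tsc;

        return count > 5;       /* require at least a few clean polls */
}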
** native_calibrate_tsc()

The "slow" calibration, on the other hand, tries to do it in 10 ms
(CAL_MS = 10; it falls back to CAL2_MS = 50 if 10 ms doesn't seem to
be enough).

The "slow" code uses the PIT plus one extra timer if available: the
HPET if possible, otherwise the ACPI PM timer (and all its fun
hardware bugs; see https://lkml.org/lkml/2008/8/10/77 for details).
It tries a number of things, all of which can fail, and repeats until
it gets something it feels it can go with.

The core is pit_calibrate_tsc(), which programs the PIT for a 10 ms
timeout and alternates get_cycles() with watching inb(0x61) for the
timer output bit to go high.  During this it measures the duration (in
TSC cycles) of each inb(), and if things look wonky (the maximum is
more than 10x the minimum), it fails (returns ULONG_MAX).  Otherwise
it computes the TSC rate as kHz = (tsc2 - tsc1) / 10 ms.

The code around that core reads the extra timer (capturing the TSC
before and after), calls pit_calibrate_tsc() to use up some time, and
reads the extra timer again.  There are two error cases in reading the
extra timer:

- It might not exist.
- Five attempts may all fail to read it in less than 50K clock cycles,
  indicating that there's some SMI shit going on.

If reading the extra timer works *and* reading the PIT worked, *and*
the two agree within 10%, the extra timer result is returned
immediately.  This is the case that works for Adrian.

If things are questionable, the whole process is repeated up to three
times.  Across those retries, the minimum computed clock rate is kept
separately for each of the two calibrations.  Delays cause
unexpectedly large TSC deltas, which read as overestimates of the
clock rate, so taking the minimum makes sense.  (But we can do
better.)

On the third repetition, if pit_calibrate_tsc() has failed twice, the
delay is increased to 50 ms.  pit_calibrate_tsc() will still fail, but
the HPET/PM timer will get an extra-long time interval.

If the loop ends (i.e. things weren't perfect after three retries),
then we have two results:

- tsc_pit_min, the PIT calibration result (or ULONG_MAX if it failed)
- tsc_ref_min, the extra timer calibration result (or ULONG_MAX)

The following cases apply:

- If both results indicate failure, disable the TSC and return failure.
- If the PIT failed, return tsc_ref_min (with a warning).
- If the extra timer failed, return the PIT result.
- If both succeeded, they must disagree by more than 10%.  Warn and
  use the PIT figure.

* The problem with the above

The regular calibration is full of ad-hoc constants and limits, and it
doesn't know how accurate a result it's producing.
quick_pit_calibrate() at least has a well-defined goal (500 ppm) and a
systematic way to achieve it.

The big problem is that polling a port, looking for a transition,
fundamentally leaves +/- one port read (i.e. +/-1.3 us) of uncertainty
as to when the transition actually happened.

* How to do better

If, instead, we did

        tsc1 = get_cycles();
        lsb = inb(0x42);
        tsc2 = get_cycles();
        msb = inb(0x42);
        delta = tsc2 - tsc1;
        min_delta = min(min_delta, delta);
        uncertainty = delta - min_delta;

and actually used the captured msb/lsb pair, we could do a lot better.

First of all, it's only one read, not two.  The 8254 timer latches the
msbyte when the lsbyte is read and returns the latched value on the
next read, so there's no need to include the second inb() in the
uncertainty window.  That gets you down to about +/-0.65 us.

But we can do better!  The idea comes from NTP.  The point is that,
since we're only comparing timer *rates*, we don't care where within
the tsc1..tsc2 window the PIT result was latched, as long as it's
consistent.

The way to do that is to keep track, over all of the hundreds of
iterations, of the *minimum* TSC delta across the inb().  This can be
assumed to be the "speed of light", with a perfectly consistent time
offset.  If a read takes longer than that, there was a delay, and we
don't know whether the delay happened before the PIT latched its
result or after, so we have to add the excess to the uncertainty
estimate of that read.

But the bulk of the 1.3 us of read time can simply be ignored.  It
doesn't matter whether it's a slow emulated access via a VM or SMI
trap; as long as the time taken is *consistent*, we can do an accurate
calibration.
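To flesh that out, here is a rough sketch of the read primitive I have
in mind.  Everything here (the names, the struct layout, where exactly
the sample's TSC is taken) is a placeholder for discussion, not
proposed code:

#include <linux/types.h>
#include <linux/timex.h>        /* get_cycles(), PIT_TICK_RATE */
#include <asm/io.h>             /* inb(), outb() */

struct pit_sample {
        u64 tsc;        /* TSC just before the lsbyte read */
        u64 delta;      /* how long this read took, in TSC cycles */
        u16 count;      /* latched 16-bit PIT count (counts down) */
};

/* Fastest PIT read seen so far: the "speed of light". */
static u64 min_delta = ~0ULL;

static void read_pit_sample(struct pit_sample *s)
{
        u64 tsc1, tsc2;
        u8 lsb, msb;

        tsc1 = get_cycles();
        lsb = inb(0x42);        /* latches the msbyte, as described above */
        tsc2 = get_cycles();
        msb = inb(0x42);        /* latched value, outside the window */

        s->tsc = tsc1;
        s->delta = tsc2 - tsc1;
        s->count = lsb | (msb << 8);

        if (s->delta < min_delta)
                min_delta = s->delta;
}

/* Only the part of a read that exceeds the minimum is uncertainty. */
static inline u64 sample_uncertainty(const struct pit_sample *s)
{
        return s->delta - min_delta;
}

Note that sample_uncertainty() can be re-evaluated after min_delta has
settled, so even early samples get the benefit of a minimum that was
only discovered later.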
This reduces the read uncertainty to +/-0.5 of a PIT tick, or
+/-0.42 us, plus the extra uncertainty due to the read time; but we
can repeat the read until we get a fast one with low uncertainty.
With a total of 0.84 us of read uncertainty (1/12 of what
quick_pit_calibrate has now), we can get to 500 ppm within about
1.75 ms, or do considerably better within 5 or 10 ms.

And the same technique can be applied to reading the HPET or PM timer.
The PM timer's PIIX hardware bugs make things trickier, but with some
careful coding it's possible to capture the TSC around just the
critical read and not include the surrounding validation reads, so we
get the full benefit of its 280 ns clock period.

** How I'd like to replace the existing code

The loop I'd write would start the PIT (and the RTC, if we want to)
and then go round-robin, reading all the time sources and the
associated TSC values.  This gives us lots of reads with which to
measure the minimum access time.  After a minimum number of iterations
(so the minimum access time is stable), once the ratio of elapsed time
to (start uncertainty + end uncertainty) is satisfactory (e.g. 2048
for the current 500 ppm spec), the various measures are sanity-checked
against each other and the best one is returned.  (A rough sketch of
the loop structure appears at the end of this mail.)

Rather than program the PIT for 10 ms, I'd program it for 55 ms
(0xffff) and run until I got something good enough, or it ran out.
inb(0x61) & 0x20 is set when the PIT runs out and is never implicitly
cleared, so it's an excellent way to detect even extreme delays.  As
long as that bit is clear, the PIT value is a reliable absolute time
within the calibration process.

* Request for comments

So, as I go about turning these ideas into code... any comments?
I realize this is a far bigger overhaul than Adrian proposed, but do
other people agree that some decruftification is warranted?

I'd also like to cut the time required to do the calibration.  Being
able to compute the quality of the approximation so far is a huge
help.  Any suggestions for a reasonable time/quality tradeoff?
500 ppm ASAP?  The best I can do in 10 ms?  Wait until the PIT reaches
500 ppm and then use the better result from a higher-resolution timer
if available?

It's some pretty low-level mucking about, but it's a nice, simple,
single-threaded, single-processor piece of code.
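For concreteness, here is the rough sketch of the PIT half of that
loop, building on the read primitive sketched earlier.  Everything in
it is a placeholder (CALIB_MIN_LOOPS, the function name, the exact
quality target), and the HPET/PM timer round-robin, the RTC, and the
cross-source sanity checks are all omitted:

#include <linux/math64.h>       /* div64_u64() */

#define CALIB_MIN_LOOPS 64      /* placeholder: let min_delta settle */

static unsigned long pit_calibrate_sketch(void)
{
        struct pit_sample start, now;
        u64 elapsed;
        int i;

        /* Gate on, speaker off; then channel 2, lobyte/hibyte, mode 0,
         * full 0xffff count (~55 ms). */
        outb((inb(0x61) & ~0x02) | 0x01, 0x61);
        outb(0xb0, 0x43);
        outb(0xff, 0x42);
        outb(0xff, 0x42);

        read_pit_sample(&start);

        for (i = 0; ; i++) {
                read_pit_sample(&now);

                /* Port 0x61 bit 5 set: the PIT ran out of its 55 ms. */
                if (inb(0x61) & 0x20)
                        break;

                if (i < CALIB_MIN_LOOPS)
                        continue;

                /* Stop once elapsed / uncertainty >= 2048, i.e. ~500 ppm.
                 * Early samples' uncertainty is recomputed against the
                 * now-settled min_delta. */
                if (sample_uncertainty(&start) + sample_uncertainty(&now) <
                    ((now.tsc - start.tsc) >> 11))
                        break;
        }

        /*
         * kHz = TSC cycles / elapsed milliseconds; the PIT counts down
         * at PIT_TICK_RATE (1193182 Hz).  Real code would also report
         * the quality actually achieved rather than assume success.
         */
        elapsed = now.tsc - start.tsc;
        return div64_u64(elapsed * PIT_TICK_RATE,
                         (u64)(start.count - now.count) * 1000);
}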