Date: 10 Jun 2015 03:08:42 -0400
Message-ID: <20150610070842.4980.qmail@ns.horizon.com>
From: "George Spelvin"
To: adrian.hunter@intel.com, linux@horizon.com, tglx@linutronix.de
Cc: ak@linux.intel.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
    luto@amacapital.net, mingo@kernel.org, torvalds@linux-foundation.org
Subject: Discussion: quick_pit_calibrate is slow
In-Reply-To: <5576AE4A.4010802@intel.com>

Adrian Hunter wrote:
> A bigger issue for my case is that "slow" calibration is not that slow,
> taking only 10ms anyway which is much better than the 50ms max for so-called
> "quick" calibration.

I read the code, and after figuring out which comments are wrong, this
is absolutely right.  The "quick" code takes 20 ms if it works, or 50 ms
if it has problems.  The fallback code can finish in 10 ms.  When bit
rot has set in to this degree, something needs to be done.

Given that people would like to boot VMs in less than 1 second, 20 ms
is 2% of that budget, and something worth trying to improve.

* The current situation

While it would be lovely if there were a reliable hardware register
with the TSC frequency, there isn't.  And as long as there's a battle
between Intel's sales people (who love locking processor frequencies
and charging more for faster ones) and motherboard makers and
overclockers (who will go to great lengths to get around such limits),
I doubt there will be a trustworthy way for the CPU to determine its
own clock rate without using off-chip peripherals.

** quick_pit_calibrate()

The first thing to explain is why quick_pit_calibrate() takes 20 ms in
the best case.

quick_pit_calibrate() polls the PIT, looking for a carry from the
lsbyte to the msbyte.  When it sees the msbyte change, it assigns an
uncertainty window which contains the TSC value at the moment the
carry happened: it starts before the last read of the old value and
ends after the first read of the new value.  (This is returned in
*deltap from pit_expect_msb().)

That window includes four inb() operations and two rdtsc.  Given that
an inb() takes a bit over a microsecond on normal, working hardware,
that's a total of 5 us of uncertainty (5.4 us on the machine on my
desk).  Add the uncertainty at both ends of the time interval and you
have 10 us.

Now, quick_pit_calibrate() loops until the uncertainty is less than
1/2048 of the time interval, i.e. until the TSC rate is known to
within 500 ppm.  That means it won't stop until 20 ms have elapsed
(22 ms for the machine on my desk).
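To make that window concrete, here is roughly what the polling loop
behind pit_expect_msb() does.  This is a simplified paraphrase for the
sake of discussion, not the kernel code verbatim, and the function
name here is made up:

/* Poll PIT channel 2 until its msbyte is no longer "val". */
static int expect_msb_sketch(unsigned char val, u64 *tscp, u64 *deltap)
{
        u64 tsc = 0, prev_tsc = 0;
        int count;

        for (count = 0; count < 50000; count++) {
                inb(0x42);              /* lsbyte: read and discard */
                if (inb(0x42) != val)   /* msbyte changed: carry seen */
                        break;
                prev_tsc = tsc;
                tsc = get_cycles();
        }

        /*
         * The carry happened somewhere between prev_tsc (taken before
         * the last check that still saw the old msbyte) and here (taken
         * after the first check that saw the new one): four inb() plus
         * two rdtsc, i.e. roughly 5 us of uncertainty.
         */
        *deltap = get_cycles() - prev_tsc;
        *tscp = tsc;

        return count > 5;       /* require at least a few clean polls */
}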
** native_calibrate_tsc()

The "slow" calibration, on the other hand, tries to do it in 10 ms
(CAL_MS = 10; it falls back to CAL2_MS = 50 if 10 ms doesn't seem to
be enough).

The "slow" code uses the PIT plus one extra timer if available: the
HPET if possible, otherwise the ACPI PM timer (and all its fun
hardware bugs; see https://lkml.org/lkml/2008/8/10/77 for details).
It tries a number of things, all of which can fail, and repeats until
it gets something it feels it can go with.

The core is pit_calibrate_tsc(), which programs the PIT for a 10 ms
timeout and alternates get_cycles() with watching inb(0x61) for the
timer output bit to go high.  During this it measures the duration (in
TSC cycles) of each inb(), and if things look wonky (the maximum is
more than 10x the minimum), it fails (returns ULONG_MAX).  Otherwise
it computes the TSC rate as kHz = (tsc2 - tsc1) / 10 ms.

The code around that core reads the extra timer (capturing the TSC
before and after), calls pit_calibrate_tsc() to use up some time, and
reads the extra timer again.  There are two error cases in reading the
extra timer:

- It might not exist.
- Five attempts may all fail to read it in less than 50K clock cycles,
  indicating that there's some SMI shit going on.

If reading the extra timer works *and* reading the PIT worked, *and*
the two agree within 10%, the extra timer result is returned
immediately.  This is the case that works for Adrian.

If things are questionable, the whole process is repeated up to three
times.  Across those retries, the minimum computed clock rate is kept
separately for each of the two calibrations.  Delays cause
unexpectedly large TSC deltas, which read as overestimates of the
clock rate, so taking the minimum makes sense.  (But we can do
better.)

On the third repetition, if pit_calibrate_tsc() has failed twice, the
delay is increased to 50 ms.  pit_calibrate_tsc() will still fail, but
the HPET/PM timer will get an extra-long time interval.

If the loop ends (i.e. things weren't perfect after three retries),
then we have two results:

- tsc_pit_min, the PIT calibration result (or ULONG_MAX if it failed)
- tsc_ref_min, the extra timer calibration result (or ULONG_MAX)

The following cases apply:

- If both results indicate failure, disable the TSC and return failure.
- If the PIT failed, return tsc_ref_min (with a warning).
- If the extra timer failed, return the PIT result.
- If both succeeded, they must disagree by more than 10%.  Warn and
  use the PIT figure.

* The problem with the above

The regular calibration is full of ad-hoc constants and limits, and it
doesn't know how accurate a result it's producing.
quick_pit_calibrate() at least has a well-defined goal (500 ppm) and a
systematic way to achieve it.

The big problem is that polling a port, looking for a transition,
fundamentally leaves +/- one port read (i.e. +/-1.3 us) of uncertainty
as to when the transition actually happened.

* How to do better

If, instead, we did

        tsc1 = get_cycles();
        lsb = inb(0x42);
        tsc2 = get_cycles();
        msb = inb(0x42);
        delta = tsc2 - tsc1;
        min_delta = min(min_delta, delta);
        uncertainty = delta - min_delta;

and actually used the captured msb/lsb pair, we could do a lot better.

First of all, it's only one read, not two.  The 8254 timer latches the
msbyte when the lsbyte is read and returns the latched value on the
next read, so there's no need to include the second inb() in the
uncertainty window.  That gets you down to about +/-0.65 us.

But we can do better!  The idea comes from NTP.  The point is that,
since we're only comparing timer *rates*, we don't care where within
the tsc1..tsc2 window the PIT result was latched, as long as it's
consistent.

The way to do that is to keep track, over all of the hundreds of
iterations, of the *minimum* TSC delta across the inb().  This can be
assumed to be the "speed of light", with a perfectly consistent time
offset.  If a read takes longer than that, there was a delay, and we
don't know whether the delay happened before the PIT latched its
result or after, so we have to add the excess to the uncertainty
estimate of that read.

But the bulk of the 1.3 us of read time can simply be ignored.  It
doesn't matter whether it's a slow emulated access via a VM or SMI
trap; as long as the time taken is *consistent*, we can do an accurate
calibration.
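To flesh that out, here is a rough sketch of the read primitive I have
in mind.  Everything here (the names, the struct layout, where exactly
the sample's TSC is taken) is a placeholder for discussion, not
proposed code:

#include <linux/types.h>
#include <linux/timex.h>        /* get_cycles(), PIT_TICK_RATE */
#include <asm/io.h>             /* inb(), outb() */

struct pit_sample {
        u64 tsc;        /* TSC just before the lsbyte read */
        u64 delta;      /* how long this read took, in TSC cycles */
        u16 count;      /* latched 16-bit PIT count (counts down) */
};

/* Fastest PIT read seen so far: the "speed of light". */
static u64 min_delta = ~0ULL;

static void read_pit_sample(struct pit_sample *s)
{
        u64 tsc1, tsc2;
        u8 lsb, msb;

        tsc1 = get_cycles();
        lsb = inb(0x42);        /* latches the msbyte, as described above */
        tsc2 = get_cycles();
        msb = inb(0x42);        /* latched value, outside the window */

        s->tsc = tsc1;
        s->delta = tsc2 - tsc1;
        s->count = lsb | (msb << 8);

        if (s->delta < min_delta)
                min_delta = s->delta;
}

/* Only the part of a read that exceeds the minimum is uncertainty. */
static inline u64 sample_uncertainty(const struct pit_sample *s)
{
        return s->delta - min_delta;
}

Note that sample_uncertainty() can be re-evaluated after min_delta has
settled, so even early samples get the benefit of a minimum that was
only discovered later.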
This reduces the read uncertainty to +/-0.5 of a PIT tick, or
+/-0.42 us, plus the extra uncertainty due to the read time; but we
can repeat the read until we get a fast one with low uncertainty.
With a total of 0.84 us of read uncertainty (1/12 of what
quick_pit_calibrate has now), we can get to 500 ppm within about
1.75 ms, or do considerably better within 5 or 10 ms.

And the same technique can be applied to reading the HPET or PM timer.
The PM timer's PIIX hardware bugs make things trickier, but with some
careful coding it's possible to capture the TSC around just the
critical read and not include the surrounding validation reads, so we
get the full benefit of its 280 ns clock period.

** How I'd like to replace the existing code

The loop I'd write would start the PIT (and the RTC, if we want to)
and then go round-robin, reading all the time sources and the
associated TSC values.  This gives us lots of reads with which to
measure the minimum access time.  After a minimum number of iterations
(so the minimum access time is stable), once the ratio of elapsed time
to (start uncertainty + end uncertainty) is satisfactory (e.g. 2048
for the current 500 ppm spec), the various measures are sanity-checked
against each other and the best one is returned.  (A rough sketch of
the loop structure appears at the end of this mail.)

Rather than program the PIT for 10 ms, I'd program it for 55 ms
(0xffff) and run until I got something good enough, or it ran out.
inb(0x61) & 0x20 is set when the PIT runs out and is never implicitly
cleared, so it's an excellent way to detect even extreme delays.  As
long as that bit is clear, the PIT value is a reliable absolute time
within the calibration process.

* Request for comments

So, as I go about turning these ideas into code... any comments?
I realize this is a far bigger overhaul than Adrian proposed, but do
other people agree that some decruftification is warranted?

I'd also like to cut the time required to do the calibration.  Being
able to compute the quality of the approximation so far is a huge
help.  Any suggestions for a reasonable time/quality tradeoff?
500 ppm ASAP?  The best I can do in 10 ms?  Wait until the PIT reaches
500 ppm and then use the better result from a higher-resolution timer
if available?

It's some pretty low-level mucking about, but it's a nice, simple,
single-threaded, single-processor piece of code.
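For concreteness, here is the rough sketch of the PIT half of that
loop, building on the read primitive sketched earlier.  Everything in
it is a placeholder (CALIB_MIN_LOOPS, the function name, the exact
quality target), and the HPET/PM timer round-robin, the RTC, and the
cross-source sanity checks are all omitted:

#include <linux/math64.h>       /* div64_u64() */

#define CALIB_MIN_LOOPS 64      /* placeholder: let min_delta settle */

static unsigned long pit_calibrate_sketch(void)
{
        struct pit_sample start, now;
        u64 elapsed;
        int i;

        /* Gate on, speaker off; then channel 2, lobyte/hibyte, mode 0,
         * full 0xffff count (~55 ms). */
        outb((inb(0x61) & ~0x02) | 0x01, 0x61);
        outb(0xb0, 0x43);
        outb(0xff, 0x42);
        outb(0xff, 0x42);

        read_pit_sample(&start);

        for (i = 0; ; i++) {
                read_pit_sample(&now);

                /* Port 0x61 bit 5 set: the PIT ran out of its 55 ms. */
                if (inb(0x61) & 0x20)
                        break;

                if (i < CALIB_MIN_LOOPS)
                        continue;

                /* Stop once elapsed / uncertainty >= 2048, i.e. ~500 ppm.
                 * Early samples' uncertainty is recomputed against the
                 * now-settled min_delta. */
                if (sample_uncertainty(&start) + sample_uncertainty(&now) <
                    ((now.tsc - start.tsc) >> 11))
                        break;
        }

        /*
         * kHz = TSC cycles / elapsed milliseconds; the PIT counts down
         * at PIT_TICK_RATE (1193182 Hz).  Real code would also report
         * the quality actually achieved rather than assume success.
         */
        elapsed = now.tsc - start.tsc;
        return div64_u64(elapsed * PIT_TICK_RATE,
                         (u64)(start.count - now.count) * 1000);
}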