Date: 23 Aug 2006 02:25:46 -0400
Message-ID: <20060823062546.12798.qmail@science.horizon.com>
From: linux@horizon.com
To: linux-kernel@vger.kernel.org
Subject: Re: Linux time code

Just to summarize the kernel/NTP interaction for those not familiar....

The NTP daemon exchanges pings with a number of time sources.  Each ping
produces a round-trip time and a time offset; the latter is computed by
assuming that the one-way trip times are equal.  This is of course not
true, but it is closest to true for the pings with the shortest round-trip
times, and NTP tries to use those.

The sources are individually sanity-checked, then checked against each
other, in a rather complex way that has proved robust in practice.  I
don't want to go into it in great detail, but there are three stages of
filtering after the per-source processing:

1) Selection, which takes all the sources' claims for the right time and
   the error bounds on those claims, and finds the interval where the
   largest possible number of them overlap.  Sources that do not
   participate in the overlap do not proceed to clustering.

2) Clustering, where the sources furthest from the average are repeatedly
   dropped to decrease the standard deviation.

3) Combining, where the remaining sources are averaged, weighted by their
   quality claims.  (A rough sketch of this step appears a little further
   down.)

Note that a tight accuracy claim increases a source's weight in the third
stage, but makes it more likely that the source will be excluded by the
first- and second-stage filters.

The above operation is run periodically (whenever there is new data from
one of the sources), and the output is a report on the amount by which the
local clock disagrees with the "right time", plus a few associated quality
metrics.

This single time offset (and its associated quality estimates) is the
input to the "clock discipline algorithm".  This is where it starts to get
relevant to the kernel.

NTP needs to divide the observed error into two categories: phase error
and frequency error.  The former is time offset that is uncorrelated from
sample to sample, and can be reduced by longer averaging.  The latter is
due to the local clock's frequency changing, and cannot be reduced by
averaging.  Indeed, if you try to average together successive errors of
+1 ms, +2 ms, +3 ms, +4 ms, etc., the longer you average the worse off
you'll be before you start correcting and doing something about the
problem.

Now, phase error, a.k.a. jitter, is dominant over short time spans.  One
simple source is the measurement granularity of your clock: if you can
only measure with 1 us resolution, you'll have +/-0.5 us of jitter just
from that.  Frequency error, a.k.a. wander, accumulates with time, so it
is dominant over longer time intervals, particularly intervals over 1000
seconds.

When NTP first starts up, it considers all error to be frequency error, to
get the clock into approximately the right range.  This is not terribly
accurate, but it is numerically very stable; it never oscillates.
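Going back to the combining stage (step 3 above), here is a minimal
sketch.  The peer_sample structure, the combine_offsets() name and the
use of a single "root distance" figure as the quality claim are all
invented for illustration; the real ntpd code is considerably more
elaborate.

struct peer_sample {
	double offset;		/* how far this source says we are off (s) */
	double root_distance;	/* the source's claimed error bound (s) */
};

static double combine_offsets(const struct peer_sample *p, int n)
{
	double num = 0.0, den = 0.0;
	int i;

	for (i = 0; i < n; i++) {
		/* a tighter error claim gives a larger weight */
		double w = 1.0 / p[i].root_distance;

		num += w * p[i].offset;
		den += w;
	}
	return den > 0.0 ? num / den : 0.0;
}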
After a while, NTP shifts to a phase-locked loop where most of the error
is deemed to be phase error, and only a bit is frequency error.  This can
produce the best time, but it can freak out and overshoot if the offsets
used as input are bizarrely behaved.  NTP shortens the polling interval,
to check with the clock sources more frequently, if it notices that things
are getting a bit weird, and falls back to the frequency-locked loop if it
notices that things have gotten really bad.

Anyway, at the end of this computation, you get a frequency and phase
(time) correction to be applied to the local clock.  This is then applied
by adjusting the frequency of the system clock.  The frequency correction
is applied permanently; the phase correction is applied by adjusting the
frequency a bit more for a short while.  To be precise, 1/64 of the
current phase error estimate is corrected per second, and the phase error
estimate is reduced accordingly.  This continues until the phase error is
reduced to zero, or a new phase error is computed.  So if the phase error
is 64 microseconds, the clock is adjusted by 1 ppm for one second, then by
63/64 ppm for the next second, and so on.  This gives a half-life of about
44 seconds.

Anyway, implementing the exponential phase correction in the kernel is
optional; what NTP really wants is a knob to adjust the clock frequency.
It could just turn that knob once per second to make phase corrections if
required.  In practice, we pass the frequency and phase corrections into
the kernel via the adjtimex() call and let the kernel amortize the phase
correction.  Although more than strictly necessary, this is not all bad,
as it avoids the need to wake up a daemon every second to fiddle with the
clock frequency.  Theoretically, you could code all of this so as not to
wake the kernel up every tick, although it's not implemented that way
right now.

Also note that the current exponential-average way of making gradual phase
corrections is not very critical.  You want to get the total right, but
typical closed-loop time-sync applications aren't even very sensitive to
errors there.  The details of the amortization schedule are quite
non-critical.  So if a tickless user-mode Linux instance is woken up after
a long sleep, it would be more than good enough to process the interval
as:

	/* each elapsed 44-second block halves the outstanding offset */
	half_lives = interval / 44;
	interval -= half_lives * 44;
	correction = time_offset;
	correction -= time_offset >>= half_lives;  /* amount slewed while asleep */
	xtime += correction;
	/* ... other necessary stuff from second_overflow() ... */
	while (interval--)
		second_overflow();

--- Postscript: Tangents ---

My pet idea on how to do precision timestamping is to separate grabbing a
low-level timer reading from converting it to portable time units.  If you
can bound the time elapsed between those two events, you can just keep the
information needed to do the conversion around for that long.

If you can tolerate some slop in old time conversions, you can easily do
lossy compression on old conversion parameters.  If you use a piecewise
linear transformation between raw and portable timestamps (seconds = raw *
period + offset, valid for some interval), then you can collapse two such
segments into one by a suitable average of the clock periods.

The one problem there is that, given two adjacent raw samples, if one is
delayed a long time before being converted to a portable timestamp, the
lossy compression can violate monotonicity; i.e. the portable timestamps
might come out in the wrong order.
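As a minimal sketch of that piecewise-linear scheme: the conv_segment
structure and the function names below are invented for illustration, and
a real implementation would live next to the clocksource code and use
fixed-point arithmetic rather than doubles, but the shape is:

#include <stdint.h>

struct conv_segment {
	uint64_t raw_start;	/* first raw counter value covered */
	uint64_t raw_end;	/* last raw counter value covered */
	double period;		/* portable seconds per raw tick */
	double offset;		/* portable seconds at raw == 0 */
};

/* seconds = raw * period + offset, valid while raw lies in the segment */
static double raw_to_seconds(const struct conv_segment *s, uint64_t raw)
{
	return (double)raw * s->period + s->offset;
}

/*
 * Lossy compression: collapse two adjacent segments into one by averaging
 * their periods, weighted by how many raw ticks each covered.  Raw
 * timestamps later converted through the merged segment can land slightly
 * earlier or later than they would have through the originals, which is
 * the monotonicity hazard just mentioned.
 */
static struct conv_segment merge_segments(const struct conv_segment *a,
					  const struct conv_segment *b)
{
	struct conv_segment m;
	double la = (double)(a->raw_end - a->raw_start);
	double lb = (double)(b->raw_end - b->raw_start);

	m.raw_start = a->raw_start;
	m.raw_end = b->raw_end;
	m.period = (a->period * la + b->period * lb) / (la + lb);
	/* keep the start of 'a' exact; later conversions shift slightly */
	m.offset = raw_to_seconds(a, a->raw_start) -
		   (double)a->raw_start * m.period;
	return m;
}

Pinning the start of the older segment exactly is an arbitrary choice;
the introduced error could just as well be split between the two ends.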
I have to confess, I can't think of a 100% fix other than a hard bound on
the time allowed to convert, or a probably-messier-than-it's-worth
registration of unconverted raw timestamps, which could then be converted
as part of pushing the conversion parameters out of scope.  (The whole
idea is to move work out of hardirq handlers.)  However, is this sort of
large conversion-delay skew for closely-spaced timestamps likely?  It
seems that they would have to come through different code paths, and then
does the ordering at the microsecond level even matter?

Oh, and if you're going to implement Posix gettimeofday(), have a look at
Markus Kuhn's UTS proposal (http://www.cl.cam.ac.uk/~mgk25/uts.txt).
Given that Posix mandates that days have 86400 seconds
(http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_14),
but the UTC standard maintained by the BIPM
(http://www.bipm.fr/en/scientific/tai/time_server.html) mandates that days
sometimes have 86401 seconds, his system is arguably the least insane way
to reconcile the irreconcilable.  (See also
http://www.cl.cam.ac.uk/~mgk25/volatile/ITU-R-TF.460-4.pdf)

E.g. the following handles positive leap seconds (only).  next_leap_second
is the Posix timestamp of the next leap second (23:59:60), which is always
a multiple of 86400.
(http://hpiers.obspm.fr/eop-pc/earthor/utc/TAI-UTC_tab.html)  If you
wanted to add the (strictly theoretical and never likely to be used)
negative leap second code, you could distinguish negative leap seconds by
the fact that next_leap_second is odd (congruent to -1 mod 86400).

#define MILLION 1000000
#define BILLION 1000000000

extern unsigned tai_minus_utc;	/* Currently 33 */
extern time_t next_leap_second;	/* UTC time after which tai_minus_utc++ */

	switch (clk_id) {
	case CLOCK_UTC:
		clock_gettime(CLOCK_TAI, tp);
		tp->tv_sec -= tai_minus_utc;
		/* Leap seconds per http://www.cl.cam.ac.uk/~mgk25/time/c/ */
		if (tp->tv_sec >= next_leap_second) {
			if (tp->tv_sec == next_leap_second)
				tp->tv_nsec += BILLION;	/* 23:59:60.xxx */
			tp->tv_sec--;
		}
		break;
	case CLOCK_UTS:	/* Recommended for gettimeofday() & time() */
		/* See http://www.cl.cam.ac.uk/~mgk25/uts.txt */
		clock_gettime(CLOCK_TAI, tp);
		tp->tv_sec -= tai_minus_utc;
		if (tp->tv_sec > next_leap_second) {
			tp->tv_sec--;
		} else if (next_leap_second - tp->tv_sec < 1000) {
			/* 1000 UTC/TAI seconds = 999 UTS seconds: subtract
			 * the slew accumulated so far in the window, which
			 * reaches one full second by the end of the leap. */
			uint32_t offset = 999 - (next_leap_second - tp->tv_sec);

			offset *= MILLION;
			offset += (uint32_t)tp->tv_nsec / 1000;
			if ((tp->tv_nsec -= offset) < 0) {
				tp->tv_nsec += BILLION;
				tp->tv_sec--;
			}
		}
		break;
	}
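And purely for completeness, a sketch of the negative-leap-second variant
suggested above, distinguishing it by next_leap_second being odd.  This is
untested, nothing like it has ever been scheduled, and only the CLOCK_UTC
case is shown; it might become something like:

	case CLOCK_UTC:
		clock_gettime(CLOCK_TAI, tp);
		tp->tv_sec -= tai_minus_utc;
		if (next_leap_second & 1) {
			/* Negative leap: the second labelled 23:59:59 is
			 * skipped, so readings at or past it move forward
			 * (tai_minus_utc has not been decremented yet). */
			if (tp->tv_sec >= next_leap_second)
				tp->tv_sec++;
		} else if (tp->tv_sec >= next_leap_second) {
			/* positive leap second, as above */
			if (tp->tv_sec == next_leap_second)
				tp->tv_nsec += BILLION;
			tp->tv_sec--;
		}
		break;

The CLOCK_UTS case would presumably slew in the other direction over a
similar window before the skipped second (999 TAI seconds shown as 1000
UTS seconds), but that is left out here.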