1998-12-03 08:38:51

[permalink] [raw]
##### Subject: Linux timekeeping plans

I've been formulating a plan for improving Linux's NTP performance
and general timekeeping. There are a number of interesting problems
to deal with, and I have solutions to most of them.

I'll refer people to the NTP papers to see how that works. Basically,
you need to find a reliable estimate of how much your current clock is off
from the desired time and then use that information to correct the clock.
The problem of correcting as quickly as possible without overcorrecting
an becoming unstable is so interesting that there is an entire branch
of mathematics called "control theory" devoted to it.)

The NTP documentation talks about a clock that ticks at a fixed rate,
and every ticks adds some delta to the "current time", then changes
that delta appropriately.

There is an equivalent formulation which just uses a ticks count and
then computes current_time = time_base + ticks * delta whenever
current_time is desired. For the very high-resolution ticks
available nowadays (like the TSC in 586 processors and up) this is
the way to go.

If you want to change delta to delta' as of a specific change_ticks,
then you need to choose time_base' so that current_time doesn't change
at that moment, so
time_base + change_ticks * delta = time_base' + change_ticks * delta'.

This has the obvious solution
time_base' = time_base + (delta-delta') * change_ticks.

In fact, a slightly more efficient formulation is
current_time = time_base + (ticks - tick_base) * delta.
At the expense of periodically recomputing the numbers, this reduces the
cost to query the current time by making the multiplication smaller and
more manageable.

The only thing that you have to be careful of is that anyone using
the numbers reads them atomically.

Anyway, a typical PC has two clock crystals in it. One is a 14.385 MHz
crystal which is fed to a PLL chip to produce the 60/66/100 MHz bus
cloxk and the 12 MHz serial/USB clock as well as divided by 12 to
feed the programmable timer.

The second is a 32768 Hz timekeeping crystal used by the battery-backed
clock. While the higher-speed crystal is theoretically more stable,
it turns out that such crystals are used in applications which don't
care very much about long-term stability, so manufacturers don't pay
much attention to it. 32768 Hz crystals, on the other hand, are used
almost exclusively in timekeeping applications, and manufacturers do
take care to achieve good performance.

(See http://bul.eecs.umich.edu/uffc/quartz/vig/vigtoc.htm for a very thorough
education on the subject of quartz crystals.)

Because of this, I want to use the RTC as the primary long-term
clock in the system. The TSC (or programmable tick timer, for
systems without a TSC) will be slaved to the RTC. The TSC is
in turn used as the source of gettimeofday() and other
routine system timekeeping.

Slaving the clock does require some care, but because the interrupt
latency situation inside a single box is not nearly as messy as
internet delays that NTP deals with, the algorithms aren't as
complex. Interrupts can be delayed, but they never arrive early,
so I have a "minimum filter" that seems to work very well at
removing the noise from interrupt arrival times.

There are two ways to slave the TSC to the RTC clock. The nicest is
to have the RTC generate periodic interrupts. (This can be used with
the existing RTC interrupt support by programming the interrupts
to the highest requested rate and dividing down in software for the
application that wants them less often. Since RTC interrupts are
always a power of two per second, there is no error.)

Alan Cox has pointed out to me that there are some machines which Linux
runs on which have an RTC, but the RTC interrupt output is wired to
the reset line. (The idea is that the programmable alarm is used
as a watchdog timer.) In this case, more hair is required.
The system tick interrupt has to read the RTC and compare what
it reads with the predicted value. Since (unlike when asking the RTC
for interrupts) there's no way to tell where in a second you are when
you read the RTC, you have to do more measurement and averaging to
get an accurate read. This results in less accurate tracking, but the
number of systems affected is small.

(It's possible to limit the frequency of slow RTC reading by noticing that
the important times to read it are just before and just after the second
ticks over.)

One of the annoyances in using the TSC is that many systems don't keep
the clock rate constant, but slow it down for power saving. Fortunately,
it's easy for the idle task to set a flag saying that it's done something
(like called APM idle or halted the processor on Cyrix chips) which
might have messed up the TSC. It's also possible to set a flag telling
the idle task not to do those things for a while if we need a while to
recalibrate the clock.

Any time the TSC is misbehaving, we fall back to using the programmable
tick timer for timekeepimg. (Or at least verifying the TSC time
against that.) Of course, there are problems with the PIT (programmable
interval timer) too. The microsecond timer patches or the PC audio speaker
patch needs to reprogram the timer on the fly. This can, however, be
compensated for.

APM power down is even more fun, since we need to reinitialize the
higher-resolution clocks from the RTC on power up. But since the code
is continuously tracking the exact relationship between the RTC and
the TSC, it isn't terribly difficult. The only gotcha is taking less
than a second to do it. My current proposal is to, on wakeup, read
the RTC and set the system time to the beginning of the current second
(or the sleep time, if we have been asleep less than a second). Then,
when the second ticks over, jump the time forward by the appropriate
fraction of a second. This fractional forward jump is pretty harmless,
especially since it just looks like an unexpected pause to the software
and the APM power down has already messed up anything that is going
to be confused by an unexpected pause.

I've been using PC-architecture-specific terms, but most other machines
have equivalents to all of these things, and fall somewhere within the
spectrum of possibilities spanned by various PC hardware.

The one thing that I'm still stuck on is SMP issues. While it appears
that the TSC counters are kept synchronized by current SMP PC hardware,
this is not guaranteed, and other processors probably behave differently.
I have to find a way to keep everything ticking together by somehow
synchronizing all of the TSC registers to each other. The unpredictable
way that

I could really use an understanding of Linux's SMP interrupt architecture.
That is, I need to understand what is guaranteed on *all* platforms,
so I can write portable code, rather than taking avantage of what
happens to work.

Currently, as I understand it, external interrupts are delivered to one
processor, but how that processor is chosen is undefined. It may or
may not be the same processor that received the last instance of that
external interrupt. And in particular, there is no guarantee that a
particular processor will ever see a particular external interrupt.

This makes generating a single systemwide clock a bit tricky. Some
inter-processor interrupts will definitely be required, but I don't
know if I can create a special one for timekeeping purposes.
Can I create a broadcast interrupt which all processors use for
timekeeping?

The one tricky thing about interrupts is that they can be delayed.
Rather a lot, in fact. Fortunately, they can never arrive early,
so in the case of inter-processor synchronization, there's an easy
fix. If I suspect that processor 2's clock is fast relative to
processor 1, I have processor 2 send an interrupt to processor 1.
Each processor keeps track of the time of the interrupt. If the
time that processor 1 thinks it arrived is before the time that
processor 2 thinks it sent it, then there is definitely an error.

But keeping track of this back-and-forth is going to be just a bit hairy.

On a UP, one possible approach to timekeeping is to have the kernel
export the TSC->time conversion factors and let the C library
implement gettimeofday() entirely in user space. (Of course,
one of the state variables would be "TSC unreliable; trap to kernel
for more complex handling.") I don't see how this is possible on
SMP, since there's no atomic way to read the TSC *and* know which
processor you're reading it on.

Anyway, that's a (hopefully not *too* confusing) dump of my current
thinking.
--
-Colin

1998-12-03 14:31:51

[permalink] [raw]
##### Subject: Re: Linux timekeeping plans

On Wed, 2 Dec 1998, Colin Plumb wrote:

> The one thing that I'm still stuck on is SMP issues. While it appears
> that the TSC counters are kept synchronized by current SMP PC hardware,
> this is not guaranteed, and other processors probably behave differently.

on PC SMP hardware you can take this as guaranteed. If anything breaks
this, we will fix it up during SMP initialization.

> Currently, as I understand it, external interrupts are delivered to one
> processor, but how that processor is chosen is undefined. It may or
> may not be the same processor that received the last instance of that
> external interrupt. And in particular, there is no guarantee that a
> particular processor will ever see a particular external interrupt.

exactly.

> This makes generating a single systemwide clock a bit tricky. Some
> inter-processor interrupts will definitely be required, but I don't
> know if I can create a special one for timekeeping purposes.
> Can I create a broadcast interrupt which all processors use for
> timekeeping?

this timekeeping architecture is already in place. The concept is: all
processors have a 'local' interrupt, which is used for profiling, process
time accounting and process timeslice rescheduling. And there is a global
external interrupt that expires timers.

> The one tricky thing about interrupts is that they can be delayed.
> Rather a lot, in fact. Fortunately, they can never arrive early,
> so in the case of inter-processor synchronization, there's an easy
> fix. If I suspect that processor 2's clock is fast relative to
> processor 1, I have processor 2 send an interrupt to processor 1.
> Each processor keeps track of the time of the interrupt. If the
> time that processor 1 thinks it arrived is before the time that
> processor 2 thinks it sent it, then there is definitely an error.

you can safely assume that there is globally valid TSC on SMP systems.
(last i checked this was true for all SMP Intel, Alpha and Sparc systems)
If you use cli/sti you can make any readout of the TSC value and the timer
tick counter atomic.

-- mingo

1998-12-06 11:24:36

[permalink] [raw]
##### Subject: Re: Linux timekeeping plans

Alan Cox wrote:
> Andi Kleen wrote:
>> Colin Plumb wrote:
>>> volatile master_lock = 1, slave_lock = 0;
>>>
>>> master:
>>> tell_slave_to_start_calibration();
>>>
>>> for (i = 0; i < 100; i++) {
>>> /* Send to slave */
>>> while (slave_lock) /* Wait for slave to start spinning */
>>> ;
>>> nop(); /* Wait a moment for slave to really start spinning */
>>> master_time[2*i] = rdtsc();
>>> master_lock = 1;
>>>
>>> /* Receive from slave */
>>> master_lock = 0; /* Announce that we're spinning */
>>> while (!slave_lock)
>>> ;

>> I think you need mb() after the setting of master/slave_lock to make sure
>> that the other CPU sees it as early as possible.

> Also you will need to allow for the cross CPU migration of cache lines and
> the store ordering. That will throw your estimate out wildly.

Um, the code thee is buggy, so it's not worth defending, but I'd still
like to understand the comments.

I think I understand the mb() comment. I've been seeing much higher
run times on P6 systems, and after reading some of the manuals, I think
that's the write combining buffer which needs flushing.

But I don't understand the cross-CPU migration issue.

My current code looks like this:

/*
* Variables for interprocessor communications.
* Padded to make sure thre is no cache line interference.
*/
static volatile int pad0[16];
static volatile int receive_ready;
static volatile int pad1[15];
static volatile int signal_sent;
static volatile int pad2[15];

#define rdtsc(hi,lo) asm volatile("rdtsc" : "=a" (lo), "=d" (hi))

/* Send a signal, returning the TSC just before it is sent. */
static unsigned
send(void)
{
unsigned hi, lo;

/* Send to slave */
signal_sent = 0;
while (!receive_ready)
;
rdtsc(hi,lo); /* Waste time */
receive_ready = 0;
rdtsc(hi,lo);
signal_sent = 1;

return lo;
}

/* Receive a signal, returning the TSC just after it is received. */
static unsigned
receive(void)
{
unsigned hi, lo;

while (signal_sent)
;
receive_ready = 1;
while (!signal_sent)
;
rdtsc(hi,lo);
return lo;
}

Now, I think that after an iteration or two, this will settle into a steady
state. The sequence of memory operations will be:

- signal_sent = 1, shared in both caches. receive_ready = 0, exclusive in the
receiver's cache.
- Sender broadcasts invalidate and gets signal_sent exclusively to write to 0.
- Receiver reads, sender intercepts the read and writes the data out.
We end up back in a shared state with signal_sent = 0.

- Before the receiver is through mucking with that, the sender has advanced
to reading receive_ready. This causes a bus operation which is intercepted
by the receiver and the current value(0) forwarded.

- Then the receiver sets receive_ready to 1 and falls into a spin loop
checking signal_sent. Which is already shared, so there is no bus traffic.
- This causes the same invalidate steps as before.

- The sender sees receive_ready and then clears it (obtaining exclusive
access to it in the process) and otherwise wastes time to ensure that
the receiver is happily in the spin loop with branch prediction operating
to give it the fastest possible loop time.

- The sender now reads the tsc. Timing starts now.
One point of concern I realize now that I'm stepping through this is that
the receive_ready update probably isn't finished yet.

- Then the sender writes signal_sent. This causes the same invalidate
cascade as before, ending up with signal_sent = 1 shared in both caches
(the initial condition)

- The receiver's poll of signal_sent completes and the receiver reads
the tsc.

It seems that the steps that take place between the sender reading the
tsc and the receiver reading it are pretty well defined. How will this
not work?

Here's a revised version I'm considering:

/* Send a signal, returning the TSC just before it is sent. */
static unsigned
send(void)
{
unsigned hi, lo;

/* Send to slave */
while (!receive_ready)
;
receive_ready = 0; /* Waste some time */
mb(); /* Finish bus protocol */
rdtsc(hi,lo);
signal_sent = 1;
mb();

return lo;
}

/* Receive a signal, returning the TSC just after it is received. */
static unsigned
receive(void)
{
unsigned hi, lo;

signal_sent = 0;
mb();
receive_ready = 1;
mb();
while (!signal_sent)
;
rdtsc(hi,lo);
return lo;
}

This removes the need to spin waiting for signal_sent to be cleared,
since it's subsumed by the receive_ready spin.