2005-03-03 14:51:24

by Dror Cohen

Subject: Time Drift Compensation on Linux Clusters

Hi all,
While working on a Linux cluster with kernel version 2.4.27 we've
encountered a consistent clock drift problem. We have devised a fix
for this problem which is based on the Pentium's TSC clock. We'd
appreciate any comments on the validity of the proposed solution and
on whether it might cause unexpected (and undesired) problems in other
parts of the kernel.

A detailed description of the problem and the fix follows.

Thanks everyone,
Dror
XIV ltd.

------------------------------------------------------------------------------------------------------------

Time Drift Compensation on Linux Clusters


1. Background

During operation of the 2.4.27 kernel under intense IO conditions, a clock
drift was detected: the system time, as returned by gettimeofday(), began
lagging behind the wall clock.

All PCs have a battery-driven hardware clock. This clock is used at boot time
to initialize the system time. Linux then uses the Programmable Interval Timer
(PIT) to keep an internal accounting of the flow of time. The PIT is set to
raise an interrupt every fixed interval. The number of interrupts per second
is defined by the kernel constant HZ, which typically equals 100 (1024 on
Alpha machines). The interrupt handler kernel/timer.c:do_timer() increments a
kernel-wide variable named jiffies and invokes the bottom half to take care
of timers and scheduling.
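
For reference, this is roughly what that handler looks like in the 2.4
sources (a simplified from-memory sketch, not a verbatim copy):

void do_timer(struct pt_regs *regs)
{
	(*(unsigned long *)&jiffies)++;
#ifndef CONFIG_SMP
	/* on SMP, process accounting is driven by the local APIC timer instead */
	update_process_times(user_mode(regs));
#endif
	mark_bh(TIMER_BH);	/* defer timer/scheduler work to the bottom half */
	if (TQ_ACTIVE(tq_timer))
		mark_bh(TQUEUE_BH);
}

Note the single, unconditional increment of jiffies: one interrupt, one jiffy.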

2. Analysis

In IO-intensive environments many CPU cycles are spent handling interrupts.
This is done by the device driver code for the different IO devices.
Typically, while dealing with the hardware, device drivers block other IRQs,
including the PIT's (timer) IRQ.
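
As an illustration only (example_isr is a made-up name, not taken from any
real driver), the classic 2.4-era pattern looks like this; if the critical
section runs longer than 1 second / HZ, a timer tick is lost:

static void example_isr(int irq, void *dev_id, struct pt_regs *regs)
{
	unsigned long flags;

	save_flags(flags);
	cli();	/* mask all local interrupts, the PIT's IRQ 0 included */

	/* ... talk to the hardware; while this code runs, no timer
	   interrupt can be delivered to this CPU ... */

	restore_flags(flags);
}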

When debugging drivers it is not uncommon to use printk() inside driver code
while IRQs are still blocked. printk() is by nature blocking and serialized:
only one CPU at a time can run it, even on SMP machines, because messages
printed by printk() must reach the console before any other operation might
cause the kernel to panic or block.

In cluster environments a serial console can be used to access machines
remotely. One drawback of using such a console is its longer response time,
which significantly lengthens the blocking time of printk().

Putting all this together results in an interrupt-rich environment in which
many interrupt handlers block IRQs for long periods of time. This matters
once 'long' exceeds 1 second / HZ (10 ms at HZ=100), at which point the CPU
misses timer interrupts.

Each time a timer interrupt is missed the system loses one jiffy. These lost
jiffies accumulate into a consistent clock drift.
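
To put rough numbers on this (assuming HZ=100): each lost tick sets the clock
back 10 ms, so a machine that loses just one tick per second drifts by
10 ms * 86400 = about 14.4 minutes per day.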

3. Fix

The fix proposed here relies on a feature of the Pentium processor and thus
does not apply to other architectures or to older (80386, 80486) CPUs.

The Pentium has an internal cycle counter called the Time Stamp Counter (TSC).
This counter is a 64-bit register which is incremented every CPU cycle and can
be read using a special assembly instruction. On a 1 GHz machine this register
wraps to zero once every 584 years.
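
Reading it costs a single instruction, RDTSC, which returns the 64-bit count
in the EDX:EAX register pair. A minimal sketch of a reader (essentially what
the kernel's rdtscll() macro expands to):

static inline unsigned long long read_tsc(void)
{
	unsigned long low, high;

	/* RDTSC puts the low 32 bits in EAX and the high 32 bits in EDX */
	__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
	return ((unsigned long long)high << 32) | low;
}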

The suggested fix adds code to the timer interrupt handler which
tracks the value
of this counter. When an interrupt occurs, the jiffies counter is
increased not just
by one but by the number of 1 second / HZ intervals that actually
passed according
to the TSC value change.

This involves modifying two kernel files:

kernel/timer.c:
modify the interrupt handler do_timer()

698a699,703
> #ifdef X86_TSC_DRIFT_COMPENSATION
> __u64 xtdc_step;
> __u64 xtdc_last = 0;
> #endif
>
700a706,712
> #ifdef X86_TSC_DRIFT_COMPENSATION
> 	__u64 cycles = get_cycles();
> 	while (xtdc_last < cycles) {
> 		(*(unsigned long *)&jiffies)++;
> 		xtdc_last += xtdc_step;
> 	}
> #else
701a714,715
> #endif
>
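
For readability, the same change shown applied (a reconstruction from the
hunks above; surrounding code elided):

#ifdef X86_TSC_DRIFT_COMPENSATION
__u64 xtdc_step;	/* TSC cycles per 1 second / HZ interval */
__u64 xtdc_last = 0;	/* TSC value up to which jiffies is accounted */
#endif

void do_timer(struct pt_regs *regs)
{
#ifdef X86_TSC_DRIFT_COMPENSATION
	/* advance jiffies by as many whole ticks as the TSC says passed */
	__u64 cycles = get_cycles();
	while (xtdc_last < cycles) {
		(*(unsigned long *)&jiffies)++;
		xtdc_last += xtdc_step;
	}
#else
	(*(unsigned long *)&jiffies)++;
#endif
	/* ... rest of do_timer() unchanged ... */
}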

arch/i386/kernel/time.c:
modify calibrate_tsc() to get the initial TSC count and to calculate the number
of TSC cycles per 1 second / HZ interval

768a769,777
> #ifdef X86_TSC_DRIFT_COMPENSATION
>
> extern __u64 xtdc_step;
> extern __u64 xtdc_last;
> #define CALIBRATE_LATCH (4 * LATCH)
> #define CALIBRATE_TIME (4 * 1000020/HZ)
>
> #else
>
771a781,782
> #endif
>
805a817,820
> #ifdef X86_TSC_DRIFT_COMPENSATION
> 	xtdc_last = (((__u64) endhigh) << 32) + (__u64) endlow;
> #endif
>
811a827,830
>
> #ifdef X86_TSC_DRIFT_COMPENSATION
> 	xtdc_step = ((((__u64) endhigh) << 32) + (__u64) endlow) >> 2;
> #endif
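
For context on the constants: CALIBRATE_LATCH is redefined so the PIT
calibration runs over four ticks, which is why the measured TSC delta is
shifted right by 2 at the end (an exact divide by four) to give xtdc_step,
the number of cycles per single 1 second / HZ interval. The TSC value read at
the end of calibration (endhigh:endlow) doubles as the starting point for
xtdc_last.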


2005-03-03 16:27:06

by Chris Friesen

Subject: Re: Time Drift Compensation on Linux Clusters

Dror Cohen wrote:
> Hi all,
> While working on a Linux cluster with kernel version 2.4.27 we've
> encountered a consistent clock drift problem. We have devised a fix
> for this problem which is based on the Pentium's TSC clock.

Any reason why you can't just use NTP?

Chris

2005-03-03 16:27:42

by Baruch Even

Subject: Re: Time Drift Compensation on Linux Clusters

Dror Cohen wrote:
> Hi all,
> While working on a Linux cluster with kernel version 2.4.27 we've
> encountered a consistent clock drift problem. We have devised a fix
> for this problem which is based on the Pentium's TSC clock. We'd
> appreciate any comments on the validity of the proposed solution and
> on whether it might cause unexpected (and undesired) problems in other
> parts of the kernel.

That's an interesting solution to a problem I've seen in the past.

One issue with submitting to LKML: you'd better submit a full patch in
unified diff format with context; the best command for this is:
diff -Nurp kernel-old kernel-new

The possible problem from jumping multiple jiffies at once is that
calculations might be skewed: some are likely done on the assumption of a
single-jiffy increase, and some by looking at the difference of jiffies from
some time in the past.

For example, the call immediately after the increment of jiffies assumes a
single jiffy passed, but that is not necessarily true. I do not know what
effects this might have; probably nothing that didn't happen with the clock
skew anyway.

What happens to the TSC when the CPU is halted due to idleness? I don't
remember exactly, but there are cases when it stops ticking, and then your
patch will not advance the clock at all.

You also advance the clock once too often; the condition should probably be
(xtdc_last + xtdc_step < cycles).

There is also the issue of a loop with 64-bit operations/comparisons in the
timer interrupt. Maybe this should be offloaded to the bh? The cost should be
thought through more carefully.

Baruch

2005-03-03 17:37:15

by Andi Kleen

Subject: Re: Time Drift Compensation on Linux Clusters

Dror Cohen <[email protected]> writes:

> Hi all,
> While working on a Linux cluster with kernel version 2.4.27 we've
> encountered a consistent clock drift problem. We have devised a fix

The normal fix is to use NTP.

The clock drift problem on lost ticks is known, but I don't
think your change is the right fix.

>
> In IO-intensive environments many CPU cycles are spent handling interrupts.
> This is done by the device driver code for the different IO devices.
> Typically, while dealing with the hardware, device drivers block other IRQs,
> including the PIT's (timer) IRQ.

Drivers are not supposed to block interrupts for a long time.
If they do in your environment it would be better to fix this.

> Each time a timer interrupt is missed the system loses one jiffy. These lost
> jiffies accumulate into a consistent clock drift.

It doesn't. jiffies is not used in gettimeofday(); only xtime and
wall_jiffies are. Fixing up jiffies would at best help you against an
"uptime drift" and make the kernel's internal timers a bit more accurate.

Did you really test your code? I bet it doesn't help.

In general, using the TSC like this is also dangerous, because there are
systems where its rate is not stable (e.g. it can vary when the CPU changes
to lower frequencies for power saving).

BTW, when you submit any patches, do it in diff -u format and ready to apply.
Your form is very reviewer- and user-unfriendly.

> > #ifdef X86_TSC_DRIFT_COMPENSATION
> > 	__u64 cycles = get_cycles();
> > 	while (xtdc_last < cycles) {
> > 		(*(unsigned long *)&jiffies)++;

This is also wrong because jiffies is really a 64-bit variable.

-Andi