Date: Mon, 22 Nov 2004 16:59:40 -0800
From: Tim Mann <mann@vmware.com>
To: linux-kernel@vger.kernel.org
Cc: mann@vmware.com
Subject: Spurious "lost ticks"

[I posted a version of this bug report earlier to the high-res-timers-discourse list, and John Stultz suggested passing it on to LKML too. Various major timer changes that John and others such as George Anzinger are working on should fix it, but it's still a bug in the present kernel code, and it seems likely to remain for quite a while until the larger changes are ready to go into the mainstream kernel.]

In 2.6, some code has been added to watch for "lost ticks" and increment the jiffies counter to compensate for them. A "lost tick" happens when timer interrupts are masked for so long that ticks pile up and the kernel doesn't see each one individually, so it loses count. Lost ticks are a real problem, especially in 2.6 with the base interrupt rate having been increased to 1000 Hz, and it's good that the kernel tries to correct for them. However, detecting when a tick has truly been lost is tricky. The code that has been added (both in timer_tsc.c's mark_offset_tsc and timer_pm.c's mark_offset_pmtmr) is overly simplistic and can produce false positives. Each time that happens, a spurious extra tick gets added in, causing the kernel's clock to run faster than real time.

The lost-ticks code in timer_pm.c essentially works as follows. Whenever we handle a timer tick interrupt, we note the current time as measured on a finer-grained clock (namely the PM timer). Let delta = current_tick_time - last_tick_time, measured in ticks. If delta >= 2.0, then we assume that the last floor(delta) - 1 ticks were lost and add that amount into the jiffies counter. The timer_tsc.c code is more complex but shares the same basic concept.

What's wrong with this? The problem is that when we get around to reading the PM timer or TSC in the timer interrupt handler, there may already be *another* timer interrupt pending. As folks on this list probably know, there is a very small amount of queuing in x86 interrupt controllers (PIC or APIC). It exists to handle the case where a device needs to request another interrupt in the window between when its previous interrupt request has been passed on from the controller to the CPU and when the OS's interrupt handler has run to completion and unmasked the interrupt. When that happens, the CPU gets interrupted again as soon as the interrupt is unmasked. The queue length here is only 1, but it's not 0.

This queuing means that if we are slow about responding to timer interrupts (due to having interrupts masked for too long, say), then when we finally get into the interrupt handler for timer interrupt number T, interrupt number T+1 may already be pending. If we handled interrupt T-1 on time, then at this point delta will be a little more than 2.0 ticks, because it's now past time for tick T+1 to happen, so the "lost ticks" code fires and adds an extra tick. But no ticks were really lost. We are handling tick T right now, and as soon as we return from the interrupt service routine and unmask the clock interrupt, we will immediately get another clock interrupt, the one for tick T+1. So checking whether delta >= 2.0 gives us false positives.

How to fix this? Because of the queuing, I believe there's no way to detect lost ticks without either false positives or false negatives just by looking at the spacing between the current tick and the last tick. The best idea I'm aware of is this: if we compare the number of ticks received over a long period with another clock, we can tell whether we're currently up to date with all the ticks we should have received or are behind by some number of ticks. Because the interrupt queue length is only 1, I think if we're behind by N ticks, we must have lost at least N-1 ticks and possibly (but not certainly) N. If we conservatively add N-1 ticks, in the worst case we may lose one tick and never correct it, but we won't fall more than 1 tick behind.
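To make the arithmetic concrete, here is a small user-space simulation; this is purely illustrative, not kernel code, and the clock units, scenario, and all names are made up. Three timer interrupts are delivered: the previous tick (time 0) was handled on time, tick 1's handler is delayed until tick 2 is already queued in the interrupt controller, and tick 3 arrives on time, so no ticks are actually lost. The naive delta test counts a spurious fourth tick; the conservative cumulative comparison does not:

#include <stdio.h>
#include <stdint.h>

#define TICK 1000ULL   /* fine-clock units per timer tick */

int main(void)
{
    /*
     * Ticks are due at 1000, 2000, 3000.  Interrupts stay masked
     * until 2100, so tick 1's handler runs at 2100 with tick 2
     * already queued; tick 2's handler runs immediately afterward
     * at 2110; tick 3's handler runs on time at 3000.
     */
    uint64_t service[] = { 2100, 2110, 3000 };
    int n = sizeof service / sizeof service[0];

    uint64_t naive = 0, last = 0;  /* delta-based scheme, as in 2.6 */
    uint64_t cum = 0;              /* cumulative scheme proposed above */

    for (int i = 0; i < n; i++) {
        uint64_t now = service[i];

        /* Naive: if the gap since the last interrupt is >= 2.0
         * ticks, assume floor(delta) - 1 ticks were lost. */
        uint64_t delta = now - last;
        if (delta >= 2 * TICK)
            naive += delta / TICK - 1;
        naive++;                   /* count the current tick */
        last = now;

        /* Cumulative: compare ticks counted so far against the
         * number the fine clock says should have arrived.  If we
         * are behind by N >= 2, at least N-1 were truly lost (one
         * more may merely be queued), so add only N-1. */
        cum++;                     /* count the current tick */
        uint64_t expected = now / TICK;
        if (expected > cum + 1)
            cum += (expected - cum) - 1;
    }

    printf("interrupts delivered: %d\n", n);                          /* 3 */
    printf("naive jiffies:       %llu\n", (unsigned long long)naive); /* 4 (one spurious) */
    printf("cumulative jiffies:  %llu\n", (unsigned long long)cum);   /* 3 (correct) */
    return 0;
}

Note that the cumulative scheme deliberately tolerates being one tick behind at any instant; that's exactly the slack that the length-1 interrupt controller queue can legitimately create.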
If we adopt this scheme, there's a possibility of incorrectly adding or removing an occasional tick if the other clock we're comparing against is less accurate (or we know its rate less accurately) than the clock that's generating timer interrupts. In particular, a fairly common case is that timer interrupts come from the PIT, which runs at a specified rate, while the other clock we have to compare it with is the TSC, whose rate we know only by measuring it against the PIT. This is tricky to deal with, and I don't really want to go into the issues in this message. Check the discussion at http://sourceforge.net/mailarchive/forum.php?forum=high-res-timers-discourse for more on this.

I should say that so far I haven't tried to test how much of an effect this bug has on real hardware, but it certainly can happen on any system where the lost-ticks code is needed at all. It has a big effect in VMware VMs. I've seen time in 2.6 kernels run as much as 10% fast using the code in timer_tsc.c (kernel command line option clock=tsc), and I've seen a gain of roughly 1 second per hour with clock=pmtmr. I understand why VMs would tickle the bug a lot more than real hardware does, and unfortunately it's not something I can do enough about within the VM implementation. Until it's fixed on the Linux side, all I can do is tell people to use clock=pit when they run 2.6 in a VM, which turns off all lost-ticks compensation.

-- 
Tim Mann
work: mann@vmware.com   home: tim@tim-mann.org
http://www.vmware.com   http://tim-mann.org