Date: Tue, 15 Jun 2010 13:27:55 -0700
From: Randy Dunlap <randy.dunlap@oracle.com>
To: Zachary Amsden <zamsden@redhat.com>
Cc: avi@redhat.com, mtosatti@redhat.com, glommer@redhat.com,
       kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 17/17] Add timekeeping documentation
Message-Id: <20100615132755.9b8d9041.randy.dunlap@oracle.com>
In-Reply-To: <1276587259-32319-18-git-send-email-zamsden@redhat.com>
References: <1276587259-32319-1-git-send-email-zamsden@redhat.com>
	<1276587259-32319-18-git-send-email-zamsden@redhat.com>
Organization: Oracle Linux Eng.
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 33188
Lines: 707

On Mon, 14 Jun 2010 21:34:19 -1000 Zachary Amsden wrote:

> Basic informational document about x86 timekeeping and how KVM
> is affected.

Nice job/information.  Thanks.

Just some typos etc. inline below.


> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  Documentation/kvm/timekeeping.txt |  599 +++++++++++++++++++++++++++++++++++++
>  1 files changed, 599 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/kvm/timekeeping.txt
> 
> diff --git a/Documentation/kvm/timekeeping.txt b/Documentation/kvm/timekeeping.txt
> new file mode 100644
> index 0000000..4ce8edf
> --- /dev/null
> +++ b/Documentation/kvm/timekeeping.txt
> @@ -0,0 +1,599 @@
> +
> +	Timekeeping Virtualization for X86-Based Architectures
> +
> +	Zachary Amsden <zamsden@redhat.com>
> +	Copyright (c) 2010, Red Hat.  All rights reserved.
> +
> +1) Overview
> +2) Timing Devices
> +3) TSC Hardware
> +4) Virtualization Problems
> +
> +=========================================================================
> +
> +1) Overview
> +
> +One of the most complicated parts of the X86 platform, and specifically,
> +the virtualization of this platform is the plethora of timing devices available
> +and the complexity of emulating those devices.  In addition, virtualization of
> +time introduces a new set of challenges because it introduces a multiplexed
> +division of time beyond the control of the guest CPU.
> +
> +First, we will describe the various timekeeping hardware available, then
> +present some of the problems which arise and solutions available, giving
> +specific recommendations for certain classes of KVM guests.
> +
> +The purpose of this document is to collect data and information relevant to
> +time keeping which may be difficult to find elsewhere, specifically,

   timekeeping

> +information relevant to KVM and hardware based virtualization.

                                   hardware-based

> +
> +=========================================================================
> +
> +2) Timing Devices
> +
> +First we discuss the basic hardware devices available.  TSC and the related
> +KVM clock are special enough to warrant a full exposition and are described in
> +the following section.
> +
> +2.1) i8254 - PIT
> +
> +One of the first timer devices available is the programmable interval timer,
> +or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
> +channels which can be programmed to deliver periodic or one-shot interrupts.
> +These three channels can be configured in different modes and have individual
> +counters.  Channel 1 and 2 were not available for general use in the original
> +IBM PC, and historically were connected to control RAM refresh and the PC
> +speaker.  Now the PIT is typically integrated as part of an emulated chipset
> +and a separate physical PIT is not used.
> +
> +The PIT uses I/O ports 0x40h - 0x43.  Access to the 16-bit counters is done

           drop the 'h'       ^

> +using single or multiple byte access to the I/O ports.  There are 6 modes
> +available, but not all modes are available to all timers, as only timer 2
> +has a connected gate input, required for modes 1 and 5.  The gate line is
> +controlled by port 61h, bit 0, as illustrated in the following diagram.
> +
> + --------------             ---------------- 
> +|              |           |                |
> +|  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
> +|    Clock     |   |       |                |
> + --------------    |    +->| GATE  TIMER 0  |
> +                   |        ---------------- 
> +                   |
> +                   |        ----------------
> +                   |       |                |
> +                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
> +                   |       |                |            (aka /dev/null)
> +                   |    +->| GATE  TIMER 1  |
> +                   |        ----------------
> +                   |
> +                   |        ---------------- 
> +                   |       |                |
> +                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
> +                           |                |      |
> +Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----
> +                            ----------------         _|    )---- Speaker
> +                                                    / *----
> +Port 61h, bit 1 -----------------------------------/
> +
> +The timer modes are now described.
> +
> +Mode 0: Single Timeout.   This is a one shot software timeout that counts down

                                       one-shot

> + when the gate is high (always true for timers 0 and 1).  When the count
> + reaches zero, the output goes high.
> +
> +Mode 1: Triggered One Shot.  The output is initially set high.  When the gate

                     One-shot (or One-Shot)

> + line is set high, a countdown is initiated (which does not stop if the gate is
> + lowered), during which the output is set low.  When the count reaches zero,
> + the output goes high.
> +
> +Mode 2: Rate Generator.  The output is initially set high.  When the countdown
> + reaches 1, the output goes low for one count and then returns high.  The value
> + is reloaded and the countdown automatically resume.  If the gate line goes

                                                resumes.

> + low, the count is halted.  If the output is low when the gate is lowered, the
> + output automatically goes high (this only affects timer 2).
> +
> +Mode 3: Square Wave.   This generates a sine wave.  The count determines the

A sine wave is not a square wave; presumably this should say "generates a square wave"?

> + length of the pulse, which alternates between high and low when zero is
> + reached.  The count only proceeds when gate is high and is automatically
> + reloaded on reaching zero.  The count is decremented twice at each clock.
> + If the count is even, the clock remains high for N/2 counts and low for N/2
> + counts; if the count is odd, the clock is high for (N+1)/2 counts and low
> + for (N-1)/2 counts.  Only even values are latched by the counter, so odd
> + values are not observed when reading.
> +
> +Mode 4: Software Strobe.  After programming this mode and loading the counter,
> + the output remains high until the counter reaches zero.  Then the output
> + goes low for 1 clock cycle and returns high.  The counter is not reloaded.
> + Counting only occurs when gate is high.
> +
> +Mode 5: Hardware Strobe.  After programming and loading the counter, the
> + output remains high.  When the gate is raised, a countdown is initiated
> + (which does not stop if the gate is lowered).  When the counter reaches zero,
> + the output goes low for 1 clock cycle and then returns high.  The counter is
> + not reloaded.
> +
> +In addition to normal binary counting, the PIT supports BCD counting.  The
> +command port, 0x43h is used to set the counter and mode for each of the three

                 0x43

> +timers.
> +
> +PIT commands, issued to port 0x43, use the following bit encoding:
> +
> +Bit 7-4: Command (See table below)
> +Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
> +Bit 0  : Binary (0) / BCD (1)
> +
> +Command table:
> +
> +0000 - Latch Timer 0 count for port 0x40
> +	sample and hold the count to be read in port 0x40;
> +	additional commands ignored until counter is read;
> +	mode bits ignored.
> +
> +0001 - Set Timer 0 LSB mode for port 0x40
> +	set timer to read LSB only and force MSB to zero;
> +	mode bits set timer mode
> +	
> +0010 - Set Timer 0 MSB mode for port 0x40
> +	set timer to read MSB only and force LSB to zero;
> +	mode bits set timer mode
> +
> +0011 - Set Timer 0 16-bit mode for port 0x40
> +	set timer to read / write LSB first, then MSB;
> +	mode bits set timer mode
> +
> +0100 - Latch Timer 1 count for port 0x41 - as described above
> +0101 - Set Timer 1 LSB mode for port 0x41 - as described above
> +0110 - Set Timer 1 MSB mode for port 0x41 - as described above
> +0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
> +
> +1000 - Latch Timer 2 count for port 0x42 - as described above
> +1001 - Set Timer 2 LSB mode for port 0x42 - as described above
> +1010 - Set Timer 2 MSB mode for port 0x42 - as described above
> +1011 - Set Timer 2 16-bit mode for port 0x42 - as described above
> +
> +1101 - General counter latch
> +	Latch combination of counters into corresponding ports
> +	Bit 3 = Counter 2
> +	Bit 2 = Counter 1
> +	Bit 1 = Counter 0
> +	Bit 0 = Unused
> +
> +1110 - Latch timer status
> +	Latch combination of counter mode into corresponding ports
> +	Bit 3 = Counter 2
> +	Bit 2 = Counter 1
> +	Bit 1 = Counter 0
> +
> +	The output of ports 0x40-0x42 following this command will be:
> +	
> +	Bit 7 = Output pin
> +	Bit 6 = Count loaded (0 if timer has expired)
> +	Bit 5-4 = Read / Write mode
> +	    01 = LSB only
> +	    10 = MSB only
> +	    11 = LSB / MSB (16-bit)
> +	Bit 3-1 = Mode
> +	Bit 0 = Binary (0) / BCD mode (1)
> +
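
In case a worked example helps readers of the command table, here is a rough
sketch (Python, purely illustrative) of how the command byte and a reload
value could be composed.  The function names are mine, not from the patch; the
bit layout and the 1.193182 MHz base clock are taken from the text above, and
rounding to the nearest count is my assumption.

```python
PIT_HZ = 1193182  # base clock from the text (1.193182 MHz)

def pit_command(command, mode, bcd=False):
    """Build the byte written to port 0x43 using the layout quoted above:
    bits 7-4 command, bits 3-1 mode, bit 0 binary (0) / BCD (1)."""
    if not 0 <= command <= 0b1111:
        raise ValueError("command is a 4-bit field")
    if not 0 <= mode <= 5:
        raise ValueError("only modes 0-5 are defined")
    return (command << 4) | (mode << 1) | int(bcd)

def pit_reload(hz):
    """Reload value for roughly `hz` interrupts per second (nearest count)."""
    count = (PIT_HZ + hz // 2) // hz
    if not 1 <= count <= 0x10000:
        raise ValueError("rate out of range for a 16-bit counter")
    return count

# Timer 0, 16-bit access (command 0011), mode 2 (rate generator), binary.
cmd = pit_command(0b0011, 2)
count = pit_reload(100)
# In 16-bit mode the LSB is written to port 0x40 first, then the MSB.
lsb, msb = count & 0xFF, (count >> 8) & 0xFF
```

With these inputs the command byte comes out as 0x34, and a 100 Hz tick needs
a reload value of 11932.
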
> +2.2) RTC
> +
> +The second device which was available in the original PC was the MC146818 real
> +time clock.  The original device is now obsolete, and usually emulated by the
> +system chipset, sometimes by an HPET and some frankenstein IRQ routing.
> +
> +The RTC is accessed through CMOS variables, which use an index register to
> +control which bytes are read.  Since there is only one index register, read
> +of the CMOS and read of the RTC require lock protection (in addition, it is
> +dangerous to allow userspace utilities such as hwclock to have direct RTC
> +access, as they could corrupt kernel reads and writes of CMOS memory).
> +
> +The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
> +can function as a once a day alarm, a periodic alarm, and can issue interrupts

(once a day is periodic ;)

> +after an update of the CMOS registers by the MC146818 is complete.  The type of
> +interrupt is signalled in the RTC status registers.
> +
> +The RTC will update the current time fields by battery power even while the
> +system is off.  The current time fields should not be read while an update is
> +in progress, as indicated in the status register.
> +
> +The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
> +programmed to a 32kHz divider if the RTC is to count seconds.
> +
> +This is the RAM map originally used for the RTC/CMOS:
> +
> +Location    Size    Description
> +------------------------------------------
> +00h         byte    Current second (BCD)
> +01h         byte    Seconds alarm (BCD)
> +02h         byte    Current minute (BCD)
> +03h         byte    Minutes alarm (BCD)
> +04h         byte    Current hour (BCD)
> +05h         byte    Hours alarm (BCD)
> +06h         byte    Current day of week (BCD)
> +07h         byte    Current day of month (BCD)
> +08h         byte    Current month (BCD)
> +09h         byte    Current year (BCD)
> +0Ah         byte    Register A
> +                       bit 7   = Update in progress
> +                       bit 6-4 = Divider for clock
> +                                  000 = 4.194 MHz
> +                                  001 = 1.049 MHz
> +                                  010 = 32 kHz
> +                                  10X = test modes
> +                                  110 = reset / disable
> +                                  111 = reset / disable
> +                       bit 3-0 = Rate selection for periodic interrupt
> +                                 0000 = periodic timer disabled
> +                                 0001 = 3.90625 mS
> +                                 0010 = 7.8125 mS
> +                                 0011 = .122070 mS
> +                                 0100 = .244141 mS
> +                                     ...
> +                                 1101 = 125 mS
> +                                 1110 = 250 mS
> +                                 1111 = 500 mS
> +0Bh         byte    Register B
> +                       bit 7   = Run (0) / Halt (1)
> +                       bit 6   = Periodic interrupt enable
> +                       bit 5   = Alarm interrupt enable
> +                       bit 4   = Update-ended interrupt enable
> +                       bit 3   = Square wave interrupt enable
> +                       bit 2   = BCD calendar (0) / Binary (1)
> +                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
> +                       bit 0   = 0 (DST off) / 1 (DST enabled)
> +0Ch         byte    Register C (read only)
> +                       bit 7   = interrupt request flag (IRQF)
> +                       bit 6   = periodic interrupt flag (PF)
> +                       bit 5   = alarm interrupt flag (AF)
> +                       bit 4   = update interrupt flag (UF)
> +                       bit 3-0 = reserved
> +0Dh         byte    Register D (read only)
> +                       bit 7   = RTC has power
> +                       bit 6-0 = reserved
> +32h         byte    Current century BCD (*)
> +  (*) location vendor specific and now determined from ACPI global tables
> +
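
A small illustration of the register A rate field and the BCD time fields
might be useful here.  This is only a sketch (Python, helper names are mine).
It assumes the 32.768 kHz crystal from the text, and it assumes that rate
settings 0001 and 0010 select 256 Hz and 128 Hz (3.90625 ms and 7.8125 ms),
which is how I remember the MC146818 datasheet; please double check.

```python
RTC_CRYSTAL_HZ = 32768  # 32.768 kHz crystal from the text

def rtc_periodic_hz(rate):
    """Periodic interrupt frequency for the 4-bit rate field of register A."""
    if not 0 <= rate <= 0b1111:
        raise ValueError("rate selection is a 4-bit field")
    if rate == 0:
        return 0  # periodic timer disabled
    if rate <= 2:
        rate += 7  # assumed: 0001 and 0010 select 256 Hz and 128 Hz
    return RTC_CRYSTAL_HZ >> (rate - 1)

def bcd_to_int(value):
    """Decode one BCD byte, e.g. the 'current second' field at location 00h."""
    return (value >> 4) * 10 + (value & 0x0F)

fastest = rtc_periodic_hz(0b0011)  # 8192 Hz, period about .122070 ms
slowest = rtc_periodic_hz(0b1111)  # 2 Hz, period 500 ms
seconds = bcd_to_int(0x59)         # BCD 0x59 is 59 seconds
```
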
> +2.3) APIC
> +
> +On Pentium and later processors, an on-board timer is available to each CPU
> +as part of the Advanced Programmable Interrupt Controller.  The APIC is
> +accessed through memory mapped registers and provides interrupt service to each

                    memory-mapped

> +CPU, used for IPIs and local timer interrupts.
> +
> +Although in theory the APIC is a safe and stable source for local interrupts,
> +in practice, many bugs and glitches have occurred due to the special nature of
> +the APIC CPU-local memory mapped hardware.  Beware that CPU errata may affect

                      memory-mapped

> +the use of the APIC and that workarounds may be required.  In addition, some of
> +these workarounds pose unique constraints for virtualization - requiring either
> +extra overhead incurred from extra reads of memory mapped I/O or additional

                                               ditto

> +functionality that may be more computationally expensive to implement.
> +
> +Since the APIC is documented quite well in the Intel and AMD manuals, we will
> +avoid repetition of the detail here.  It should be pointed out that the APIC
> +timer is programmed through the LVT (local vector table) timer register, is capable
> +of one-shot or periodic operation, and is based on the bus clock divided down
> +by the programmable divider register.
> +
> +2.4) HPET
> +
> +HPET is quite complex, and was originally intended to replace the PIT / RTC
> +support of the X86 PC.  It remains to be seen whether that will be the case, as
> +the de-facto standard of PC hardware is to emulate these older devices.  Some

       de facto

> +systems designated as legacy free may support only the HPET as a hardware timer
> +device.
> +
> +The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
> +but allowing implementation freedom to support many more.  It also imposes no
> +fixed rate on the timer frequency, but does impose some extremal values on
> +frequency, error and slew.
> +
> +In general, the HPET is recommended as a high precision (compared to PIT / RTC)
> +time source which is independent of local variation (as there is only one HPET
> +in any given system).  The HPET is also memory mapped, and its presence is

                                           memory-mapped,

> +indicated through ACPI table by the BIOS.

                     "an ACPI table" or "ACPI tables"

> +
> +Detailed specification of the HPET is beyond the current scope of this
> +document, as it is also very well documented elsewhere.
> +
> +2.5) Offboard Timers
> +
> +Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
> +timing chips built into the cards which may have registers which are accessible
> +to kernel or user drivers.  To the author's knowledge, using these to generate
> +a clocksource for a Linux or other kernel has not yet been attempted and is in
> +general frowned upon as not playing by the agreed rules of the game.  Such a
> +timer device would require additional support to be virtualized properly and is
> +not considered important at this time as no known operating system does this.
> +
> +=========================================================================
> +
> +3) TSC Hardware
> +
> +The TSC or time stamp counter is relatively simple in theory; it counts
> +instruction cycles issued by the processor, which can be used as a measure of
> +time.  In practice, due to a number of problems, it is the most complicated
> +time keeping device to use.

   timekeeping

> +
> +The TSC is represented internally as a 64-bit MSR which can be read with the
> +RDMSR, RDTSC, or RDTSCP (when available) instructions.  In the past, hardware
> +limitations made it possible to write the TSC, but generally on old hardware it
> +was only possible to write the low 32 bits of the 64-bit counter, and the upper
> +32 bits of the counter were cleared.  Now, however, on Intel processors family
> +0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
> +has been lifted and all 64 bits are writable.  On AMD systems, the ability to
> +write the TSC MSR is not an architectural guarantee.
> +
> +The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
> +means of the CR4.TSD bit, which disables CPL > 0 TSC access.
> +
> +Some vendors have implemented an additional instruction, RDTSCP, which returns
> +atomically not just the TSC, but an indicator which corresponds to the
> +processor number.  This can be used to index into an array of TSC variables to
> +determine offset information in SMP systems where TSCs are not synchronized.
> +
> +Both VMX and SVM provide extension fields in the virtualization hardware which
> +allow the guest-visible TSC to be offset by a constant.  Newer implementations
> +promise to allow the TSC to additionally be scaled, but this hardware is not
> +yet widely available.
> +
> +3.1) TSC synchronization
> +
> +The TSC is a CPU-local clock in most implementations.  This means, on SMP
> +platforms, the TSCs of different CPUs may start at different times depending
> +on when the CPUs are powered on.  Generally, CPUs on the same die will share
> +the same clock, however, this is not always the case.
> +
> +The BIOS may attempt to resynchronize the TSCs as a result during the poweron

"as a result" of what?  That phrase seems superfluous.

> +process and the operating system or other system software may attempt to do
> +this as well.  Several hardware limitations make the problem worse - if it is
> +not possible to write the full 64 bits of the TSC, it may be impossible to
> +match the TSC in newly arriving CPUs to that of the rest of the system,
> +resulting in unsynchronized TSCs.  This may be done by BIOS or system software,
> +but in practice, getting a perfectly synchronized TSC will not be possible
> +unless all values are read from the same clock, which generally only is
> +possible on single socket systems or those with special hardware support.
> +
> +3.2) TSC and CPU hotplug
> +
> +As touched on already, CPUs which arrive later than the boot time of the system
> +may not have a TSC value that is synchronized with the rest of the system.
> +Either system software, BIOS, or SMM code may actually try to establish the TSC
> +to a value matching the rest of the system, but a perfect match is usually not
> +a guarantee.
> +
> +3.3) TSC and multi-socket / NUMA
> +
> +Multi-socket systems, especially large multi-socket systems are likely to have
> +individual clocksources rather than a single, universally distributed clock.
> +Since these clocks are driven by different crystals, they will not have
> +perfectly matched frequency, and temperature and electrical variations will
> +cause the cpu clocks, and thus the TSCs to drift over time.  Depending on the

             CPU

> +exact clock and bus design, the drift may or may not be fixed in absolute
> +error, and may accumulate over time.
> +
> +In addition, very large systems may deliberately slew the clocks of individual
> +cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
> +clock frequency and harmonics of it, which may be required to pass FCC
> +standards for telecommunications and computer equipment.
> +
> +It is recommended not to trust the TSCs to remain synchronized on NUMA or
> +multiple socket systems for these reasons.
> +
> +3.4) TSC and C-states
> +
> +C-states, or idling states of the processor, especially C1E and deeper sleep
> +states may be problematic for TSC as well.  The TSC may stop advancing in such
> +a state, resulting in a TSC which is behind that of other CPUs when execution
> +is resumed.  Such CPUs must be detected and flagged by the operating system
> +based on CPU and chipset identifications.
> +
> +The TSC in such a case may be corrected by catching it up to a known external
> +clocksource.
> +
> +3.5) TSC frequency change / P-states
> +
> +To make things slightly more interesting, some CPUs may change requency.  They

                                                                  frequency.

> +may or may not run the TSC at the same rate, and because the frequency change
> +may be staggered or slewed, at some points in time, the TSC rate may not be
> +known other than falling within a range of values.  In this case, the TSC will
> +not be a stable time source, and must be calibrated against a known, stable,
> +external clock to be a usable source of time.
> +
> +Whether the TSC runs at a constant rate or scales with the P-state is model
> +dependent and must be determined by inspecting CPUID, chipset or various MSR
> +fields.
> +
> +In addition, some vendors have known bugs where the P-state is actually
> +compensated for properly during normal operation, but when the processor is
> +inactive, the P-state may be raised temporarily to service cache misses from
> +other processors.  In such cases, the TSC on halted CPUs could advance faster
> +than that of non-halted processors.  AMD Turion processors are known to have
> +this problem.
> +
> +3.6) TSC and STPCLK / T-states
> +
> +External signals given to the processor may also have the effect of stopping
> +the TSC.  This is typically done for thermal emergency power control to prevent
> +an overheating condition, and typically, there is no way to detect that this
> +condition has happened.
> +
> +3.7) TSC virtualization - VMX
> +
> +VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
> +instructions, which is enough for full virtualization of TSC in any manner.  In
> +addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
> +field specified in the VMCS.  Special instructions must be used to read and
> +write the VMCS field.
> +
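
The offset arithmetic might deserve one concrete illustration.  The sketch
below (Python, names are mine) only assumes what the text says: the guest
sees the host TSC plus a constant offset.  The wrap at 64 bits is my
assumption, based on the TSC being a 64-bit counter.

```python
MASK64 = (1 << 64) - 1  # the TSC is a 64-bit counter, so arithmetic wraps

def tsc_offset_for(host_tsc, desired_guest_tsc):
    """Offset that makes the guest read `desired_guest_tsc` right now."""
    return (desired_guest_tsc - host_tsc) & MASK64

def guest_tsc(host_tsc, offset):
    """Value a guest read would return with pass-through plus offset."""
    return (host_tsc + offset) & MASK64

# Start a guest whose TSC reads 0 while the host TSC is already at 1000.
offset = tsc_offset_for(1000, 0)
later = guest_tsc(1500, offset)  # 500 host cycles later the guest sees 500
```

A "negative" offset is simply stored as a large unsigned value, which is why
the mask is applied on both sides.
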
> +3.8) TSC virtualization - SVM
> +
> +VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP

   SVM

> +instructions, which is enough for full virtualization of TSC in any manner.  In
> +addition, SVM allows passing through the host TSC plus an additional offset
> +field specified in the SVM control block.
> +
> +3.9) TSC feature bits in Linux
> +
> +In summary, there is no way to guarantee the TSC remains in perfect
> +synchronization unless it is explicitly guaranteed by the architecture.  Even
> +if so, the TSCs in multi-socket or NUMA systems may still run independently
> +despite being locally consistent.
> +
> +The following feature bits are used by Linux to signal various TSC attributes,
> +but they can only be taken to be meaningful for UP or single node systems.
> +
> +X86_FEATURE_TSC 		: The TSC is available in hardware
> +X86_FEATURE_RDTSCP		: The RDTSCP instruction is available
> +X86_FEATURE_CONSTANT_TSC 	: The TSC rate is unchanged with P-states
> +X86_FEATURE_NONSTOP_TSC		: The TSC does not stop in C-states
> +X86_FEATURE_TSC_RELIABLE	: TSC sync checks are skipped (VMware)
> +
> +4) Virtualization Problems
> +
> +Timekeeping is especially problematic for virtualization because a number of
> +challenges arise.  The most obvious problem is that time is now shared between
> +the host and, potentially, a number of virtual machines.  This happens
> +naturally on X86 systems when SMM mode is used by the BIOS, but not to such a
> +degree nor with such frequency.  However, the fact that SMM mode may cause
> +similar problems to virtualization makes it a good justification for solving
> +many of these problems on bare metal.
> +
> +4.1) Interrupt clocking
> +
> +One of the most immediate problems that occurs with legacy operating systems
> +is that the system timekeeping routines are often designed to keep track of
> +time by counting periodic interrupts.  These interrupts may come from the PIT
> +or the RTC, but the problem is the same: the host virtualization engine may not
> +be able to deliver the proper number of interrupts per second, and so guest
> +time may fall behind.  This is especially problematic if a high interrupt rate
> +is selected, such as 1000 HZ, which is unfortunately the default for many Linux
> +guests.
> +
> +There are three approaches to solving this problem; first, it may be possible
> +to simply ignore it.  Guests which have a separate time source for tracking
> +'wall clock' or 'real time' may not need any adjustment of their interrupts to
> +maintain proper time.  If this is not sufficient, it may be necessary to inject
> +additional interrupts into the guest in order to increase the effective
> +interrupt rate.  This approach leads to complications in extreme conditions,
> +where host load or guest lag is too much to compensate for, and thus another
> +solution to the problem has arisen: the guest may need to become aware of lost
> +ticks and compensate for them internally.  Although promising in theory, the
> +implementation of this policy in Linux has been extremely error prone, and a
> +number of buggy variants of lost tick compensation are distributed across
> +commonly used Linux systems.
> +
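
To make "lost tick compensation" concrete for readers, something like the
following sketch could help.  It is Python and purely hypothetical: the class
and its names are mine, and it assumes the guest has some independent
reference clock (`now_ns`) to compare against, which is exactly the part that
is hard to get right in practice.

```python
NSEC_PER_SEC = 1_000_000_000

class TickClock:
    """Counts timer interrupts, crediting ticks that were never delivered."""

    def __init__(self, hz, now_ns):
        self.period_ns = NSEC_PER_SEC // hz
        self.last_tick_ns = now_ns  # reference time of the last counted tick
        self.jiffies = 0

    def on_tick(self, now_ns):
        """Handle one interrupt; returns how many ticks were lost before it."""
        elapsed = now_ns - self.last_tick_ns
        ticks = max(1, elapsed // self.period_ns)  # always count this one
        self.jiffies += ticks
        # Advance by whole periods so sub-period remainders are not dropped.
        self.last_tick_ns += ticks * self.period_ns
        return ticks - 1

clock = TickClock(hz=1000, now_ns=0)
lost_a = clock.on_tick(1_000_000)  # on time: nothing lost
lost_b = clock.on_tick(4_500_000)  # 3.5 ms later: two ticks went missing
lost_c = clock.on_tick(5_000_000)  # remainder carried over: nothing lost
```
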
> +Windows uses periodic RTC clocking as a means of keeping time internally, and
> +thus requires interrupt slewing to keep proper time.  It does use a low enough
> +rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
> +practice.
> +
> +4.2) TSC sampling and serialization
> +
> +As the highest precision time source available, the cycle counter of the CPU
> +has aroused much interest from developers.  As explained above, this timer has
> +many problems unique to its nature as a local, potentially unstable and
> +potentially unsynchronized source.  One issue which is not unique to the TSC,
> +but is highlighted because of it's very precise nature is sampling delay.  By

                                 its

> +definition, the counter, once read is already old.  However, it is also
> +possible for the counter to be read ahead of the actual use of the result.
> +This is a consequence of the superscalar execution of the instruction stream,
> +which may execute instructions out of order.  Such execution is called
> +non-serialized.  Forcing serialized execution is necessary for precise
> +measurement with the TSC, and requires a serializing instruction, such as CPUID
> +or an MSR read.
> +
> +Since CPUID may actually be virtualized by a trap and emulate mechanism, this
> +serialization can pose a performance issue for hardware virtualization.  An
> +accurate time stamp counter reading may therefore not always be available, and
> +it may be necessary for an implementation to guard against "backwards" reads of
> +the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
> +system.
> +
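
One common way to guard against "backwards" reads is to clamp each sample to
the last value handed out.  A minimal sketch (Python; the class name and the
`read_raw` callback are hypothetical, and a real implementation would want
something cheaper than a global lock):

```python
import threading

class MonotonicClock:
    """Wraps a raw counter so callers never observe time going backwards."""

    def __init__(self, read_raw):
        self._read_raw = read_raw  # hypothetical raw sampler, e.g. a TSC read
        self._last = 0
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            raw = self._read_raw()
            if raw > self._last:
                self._last = raw
            # A stale or reordered sample is replaced by the last value seen.
            return self._last

samples = iter([100, 105, 103, 110])  # third sample is "behind" the second
clock = MonotonicClock(lambda: next(samples))
readings = [clock.read() for _ in range(4)]  # 100, 105, 105, 110
```

The cost is that time can appear to stand still briefly, which is usually
preferable to it running backwards.
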
> +4.3) Timespec aliasing
> +
> +Additionally, this lack of serialization from the TSC poses another challenge
> +when using results of the TSC when measured against another time source.  As
> +the TSC is much higher precision, many possible values of the TSC may be read
> +while another clock is still expressing the same value.
> +
> +That is, you may read (T,T+10) while external clock C maintains the same value.
> +Due to non-serialized reads, you may actually end up with a range which
> +fluctuates - from (T-1.. T+10).  Thus, any time calculated from a TSC, but
> +calibrated against an external value may have a range of valid values.
> +Re-calibrating this computation may actually cause time, as computed after the
> +calibration, to go backwards, compared with time computed before the
> +calibration.
> +
> +This problem is particularly pronounced with an internal time source in Linux,
> +the kernel time, which is expressed in the theoretically high resultion

                                                                 resolution

> +timespec - but which advances in much larger granularity intervals, sometimes
> +at the rate of jiffies, and possibly in catchup modes, at a much larger step.
> +
> +This aliasing requires care in the computation and recalibration of kvmclock
> +and any other values derived from TSC computation (such as TSC virtualization
> +itself).
> +
> +4.4) Migration
> +
> +Migration of a virtual machine raises problems for timekeeping in two ways.
> +First, the migration itself may take time, during which interrupts can not be

                                                                      cannot

> +delivered, and after which, the guest time may need to be caught up.  NTP may
> +be able to help to some degree here, as the clock correction required is
> +typically small enough to fall in the NTP-correctable window.
> +
> +An additional concern is that timers based off the TSC (or HPET, if the raw bus
> +clock is exposed) may now be running at different rates, requiring compensation
> +in some way in the hypervisor by virtualizing these timers.  In addition,
> +migrating to a faster machine may preclude the use of a passthrough TSC, as a
> +faster clock can not be made visible to a guest without the potential of time

                cannot

> +advancing faster than usual.  A slower clock is less of a problem, as it can
> +always be caught up to the original rate.  KVM clock avoids these problems by
> +simply storing multipliers and offsets against the TSC for the guest to convert
> +back into nanosecond resolution values.
> +
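
A short example of the multiplier/offset conversion might make this paragraph
easier to follow.  The sketch below is Python and only an approximation: the
parameter names loosely follow what I remember of the pvclock structure and
should be treated as assumptions, as should the 32.32 fixed-point format.

```python
NSEC_PER_SEC = 1_000_000_000

def params_for(tsc_hz):
    """Pick (mul, shift) so that ns = ((delta shifted) * mul) >> 32."""
    shift = 0
    scaled_hz = tsc_hz
    # Pre-scale so mul lands in [2^31, 2^32) and keeps full precision.
    while scaled_hz > 2 * NSEC_PER_SEC:
        scaled_hz >>= 1
        shift -= 1
    while scaled_hz <= NSEC_PER_SEC:
        scaled_hz <<= 1
        shift += 1
    return (NSEC_PER_SEC << 32) // scaled_hz, shift

def guest_ns(tsc_now, tsc_timestamp, system_time_ns, mul, shift):
    """Guest-side conversion of a raw TSC read into nanoseconds."""
    delta = tsc_now - tsc_timestamp
    delta = delta >> -shift if shift < 0 else delta << shift
    return system_time_ns + ((delta * mul) >> 32)

# Before migration: 2 GHz host, snapshot taken at TSC 1_000_000 / 5_000 ns.
mul, shift = params_for(2_000_000_000)
before = guest_ns(1_002_000, 1_000_000, 5_000, mul, shift)

# After migration to a 1 GHz host the hypervisor publishes new parameters
# and a new snapshot; the guest keeps using the same conversion.
mul2, shift2 = params_for(1_000_000_000)
after = guest_ns(3_000, 2_000, before, mul2, shift2)
```

The point being that only the published parameters change across migration,
not the guest-side arithmetic.
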
> +4.5) Scheduling
> +
> +Since scheduling may be based on precise timing and firing of interrupts, the
> +scheduling algorithms of an operating system may be adversely affected by
> +virtualization.  In theory, the effect is random and should be uniformly
> +distributed, but in contrived as well as real scenarios (guest device access
> +causes virtualization exits and a possible context switch), this may not always be
> +the case.  The effect of this has not been well studied (ed: has it?  any
> +published results?).
> +
> +In an attempt to workaround this, several implementations have provided a

                    work around this,

> +paravirtualized scheduler clock, which reveals the true amount of CPU time for
> +which a virtual machine has been running.
> +
> +4.6) Watchdogs
> +
> +Watchdog timers, such as the lock detector in Linux may fire accidentally when
> +running under hardware virtualization due to timer interrupts being delayed or
> +misinterpretation of the passage of real time.  Usually, these warnings are
> +spurious and can be ignored, but in some circumstances it may be necessary to
> +disable such detection.
> +
> +4.7) Delays and precision timing
> +
> +Precise timing and delays may not be possible in a virtualized system.  This
> +can happen if the system is controlling physical hardware, or issues delays to
> +compensate for slower I/O to and from devices.  The first issue is not solvable
> +in general for a virtualized system; hardware control software can't be
> +adequately virtualized without a full real-time operating system, which would
> +require an RT aware virtualization platform.
> +
> +The second issue may cause performance problems, but this is unlikely to be a
> +significant issue.  In many cases these delays may be eliminated through
> +configuration or paravirtualization.
> +
> +4.8) Covert channels and leaks
> +
> +In addition to the above problems, time information will inevitably leak to the
> +guest about the host in anything but a perfect implementation of virtualized
> +time.  This may allow the guest to infer the presence of a hypervisor (as in a
> +red-pill type detection), and it may allow information to leak between guests
> +by using CPU utilization itself as a signalling channel.  Preventing such
> +problems would require completely isolated virtual time which may not track
> +real time any longer.  This may be useful in certain security or QA contexts,
> +but in general isn't recommended for real-world deployment scenarios.
> +
> -- 


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***