Date: Tue, 15 Jun 2010 13:27:55 -0700
From: Randy Dunlap
To: Zachary Amsden
Cc: avi@redhat.com, mtosatti@redhat.com, glommer@redhat.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 17/17] Add timekeeping documentation
Organization: Oracle Linux Eng.

On Mon, 14 Jun 2010 21:34:19 -1000 Zachary Amsden wrote:

> Basic informational document about x86 timekeeping and how KVM
> is affected.

Nice job/information. Thanks.
Just some typos etc. inline below.
> Signed-off-by: Zachary Amsden
> ---
>  Documentation/kvm/timekeeping.txt |  599 +++++++++++++++++++++++++++++++++++++
>  1 files changed, 599 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/kvm/timekeeping.txt
>
> diff --git a/Documentation/kvm/timekeeping.txt b/Documentation/kvm/timekeeping.txt
> new file mode 100644
> index 0000000..4ce8edf
> --- /dev/null
> +++ b/Documentation/kvm/timekeeping.txt
> @@ -0,0 +1,599 @@
> +
> +        Timekeeping Virtualization for X86-Based Architectures
> +
> +        Zachary Amsden
> +        Copyright (c) 2010, Red Hat. All rights reserved.
> +
> +1) Overview
> +2) Timing Devices
> +3) TSC Hardware
> +4) Virtualization Problems
> +
> +=========================================================================
> +
> +1) Overview
> +
> +One of the most complicated parts of the X86 platform, and specifically,
> +the virtualization of this platform is the plethora of timing devices available
> +and the complexity of emulating those devices. In addition, virtualization of
> +time introduces a new set of challenges because it introduces a multiplexed
> +division of time beyond the control of the guest CPU.
> +
> +First, we will describe the various timekeeping hardware available, then
> +present some of the problems which arise and solutions available, giving
> +specific recommendations for certain classes of KVM guests.
> +
> +The purpose of this document is to collect data and information relevant to
> +time keeping which may be difficult to find elsewhere, specifically, timekeeping
> +information relevant to KVM and hardware based virtualization.

                                   hardware-based

> +
> +=========================================================================
> +
> +2) Timing Devices
> +
> +First we discuss the basic hardware devices available. TSC and the related
> +KVM clock are special enough to warrant a full exposition and are described in
> +the following section.
> +
> +2.1) i8254 - PIT
> +
> +One of the first timer devices available is the programmable interval timer,
> +or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
> +channels which can be programmed to deliver periodic or one-shot interrupts.
> +These three channels can be configured in different modes and have individual
> +counters. Channel 1 and 2 were not available for general use in the original
> +IBM PC, and historically were connected to control RAM refresh and the PC
> +speaker. Now the PIT is typically integrated as part of an emulated chipset
> +and a separate physical PIT is not used.
> +
> +The PIT uses I/O ports 0x40h - 0x43. Access to the 16-bit counters is done

                 drop the 'h' ^

> +using single or multiple byte access to the I/O ports. There are 6 modes
> +available, but not all modes are available to all timers, as only timer 2
> +has a connected gate input, required for modes 1 and 5. The gate line is
> +controlled by port 61h, bit 0, as illustrated in the following diagram.
> +
> + --------------             ----------------
> +|              |           |                |
> +|  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
> +|    Clock     |   |       |                |
> + --------------    |    +->| GATE  TIMER 0  |
> +                   |        ----------------
> +                   |
> +                   |        ----------------
> +                   |       |                |
> +                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
> +                   |       |                |            (aka /dev/null)
> +                   |    +->| GATE  TIMER 1  |
> +                   |        ----------------
> +                   |
> +                   |        ----------------
> +                   |       |                |
> +                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
> +                           |                |      |
> +Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----
> +                            ----------------         _|    )---- Speaker
> +                                                    / *----
> +Port 61h, bit 1 -----------------------------------/
> +
> +The timer modes are now described.
> +
> +Mode 0: Single Timeout. This is a one shot software timeout that counts down

                                      one-shot

> + when the gate is high (always true for timers 0 and 1). When the count
> + reaches zero, the output goes high.
> +
> +Mode 1: Triggered One Shot. The output is initially set high. When the gate

                     One-shot (or One-Shot)

> + line is set high, a countdown is initiated (which does not stop if the gate is
> + lowered), during which the output is set low. When the count reaches zero,
> + the output goes high.
> +
> +Mode 2: Rate Generator. The output is initially set high. When the countdown
> + reaches 1, the output goes low for one count and then returns high. The value
> + is reloaded and the countdown automatically resume. If the gate line goes

                                                 resumes.

> + low, the count is halted. If the output is low when the gate is lowered, the
> + output automatically goes high (this only affects timer 2).
> +
> +Mode 3: Square Wave. This generates a sine wave. The count determines the

                        a sine wave is a square wave???

> + length of the pulse, which alternates between high and low when zero is
> + reached. The count only proceeds when gate is high and is automatically
> + reloaded on reaching zero. The count is decremented twice at each clock.
> + If the count is even, the clock remains high for N/2 counts and low for N/2
> + counts; if the count is odd, the clock is high for (N+1)/2 counts and low
> + for (N-1)/2 counts. Only even values are latched by the counter, so odd
> + values are not observed when reading.
> +
> +Mode 4: Software Strobe. After programming this mode and loading the counter,
> + the output remains high until the counter reaches zero. Then the output
> + goes low for 1 clock cycle and returns high. The counter is not reloaded.
> + Counting only occurs when gate is high.
> +
> +Mode 5: Hardware Strobe. After programming and loading the counter, the
> + output remains high. When the gate is raised, a countdown is initiated
> + (which does not stop if the gate is lowered). When the counter reaches zero,
> + the output goes low for 1 clock cycle and then returns high. The counter is
> + not reloaded.
> +
> +In addition to normal binary counting, the PIT supports BCD counting. The
> +command port, 0x43h is used to set the counter and mode for each of the three

                 0x43

> +timers.
> +
> +PIT commands, issued to port 0x43, using the following bit encoding:
> +
> +Bit 7-4: Command (See table below)
> +Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
> +Bit 0  : Binary (0) / BCD (1)
> +
> +Command table:
> +
> +0000 - Latch Timer 0 count for port 0x40
> +        sample and hold the count to be read in port 0x40;
> +        additional commands ignored until counter is read;
> +        mode bits ignored.
> +
> +0001 - Set Timer 0 LSB mode for port 0x40
> +        set timer to read LSB only and force MSB to zero;
> +        mode bits set timer mode
> +
> +0010 - Set Timer 0 MSB mode for port 0x40
> +        set timer to read MSB only and force LSB to zero;
> +        mode bits set timer mode
> +
> +0011 - Set Timer 0 16-bit mode for port 0x40
> +        set timer to read / write LSB first, then MSB;
> +        mode bits set timer mode
> +
> +0100 - Latch Timer 1 count for port 0x41 - as described above
> +0101 - Set Timer 1 LSB mode for port 0x41 - as described above
> +0110 - Set Timer 1 MSB mode for port 0x41 - as described above
> +0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
> +
> +1000 - Latch Timer 2 count for port 0x42 - as described above
> +1001 - Set Timer 2 LSB mode for port 0x42 - as described above
> +1010 - Set Timer 2 MSB mode for port 0x42 - as described above
> +1011 - Set Timer 2 16-bit mode for port 0x42 - as described above
> +
> +1101 - General counter latch
> +        Latch combination of counters into corresponding ports
> +        Bit 3 = Counter 2
> +        Bit 2 = Counter 1
> +        Bit 1 = Counter 0
> +        Bit 0 = Unused
> +
> +1110 - Latch timer status
> +        Latch combination of counter mode into corresponding ports
> +        Bit 3 = Counter 2
> +        Bit 2 = Counter 1
> +        Bit 1 = Counter 0
> +
> +        The output of ports 0x40-0x42 following this command will be:
> +
> +        Bit 7 = Output pin
> +        Bit 6 = Count loaded (0 if timer has expired)
> +        Bit 5-4 = Read / Write mode
> +            01 = LSB only
> +            10 = MSB only
> +            11 = LSB / MSB (16-bit)
> +        Bit 3-1 = Mode
> +        Bit 0 = Binary (0) / BCD mode (1)
> +
> +2.2) RTC
> +
> +The second device which was available in the original PC was the MC146818 real
> +time clock. The original device is now obsolete, and usually emulated by the
> +system chipset, sometimes by an HPET and some frankenstein IRQ routing.
> +
> +The RTC is accessed through CMOS variables, which uses an index register to
> +control which bytes are read. Since there is only one index register, read
> +of the CMOS and read of the RTC require lock protection (in addition, it is
> +dangerous to allow userspace utilities such as hwclock to have direct RTC
> +access, as they could corrupt kernel reads and writes of CMOS memory).
> +
> +The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt
> +can function as a once a day alarm, a periodic alarm, and can issue interrupts

                    (once a day is periodic ;)

> +after an update of the CMOS registers by the MC146818 is complete. The type of
> +interrupt is signalled in the RTC status registers.
> +
> +The RTC will update the current time fields by battery power even while the
> +system is off. The current time fields should not be read while an update is
> +in progress, as indicated in the status register.
> +
> +The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
> +programmed to a 32kHz divider if the RTC is to count seconds.
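
For readers following along, the link between the 32.768 kHz time base and the
periodic interrupt rates chosen by the low four bits of register A can be
sketched roughly as follows. This assumes the usual MC146818 behaviour, in which
rate codes 3 through 15 divide the time base by successive powers of two and
codes 1 and 2 repeat the periods of codes 8 and 9. The function name is made up
for this example and is not part of the patch.

```python
BASE_HZ = 32768  # 32.768 kHz crystal, as stated in the text above


def rtc_periodic_hz(rate_select):
    """Return the periodic interrupt frequency in Hz for a 4-bit rate code.

    Returns None when the periodic interrupt is disabled (code 0).
    Hypothetical helper, for illustration only.
    """
    if rate_select not in range(16):
        raise ValueError("rate_select must fit in 4 bits")
    if rate_select == 0:
        return None
    if rate_select in (1, 2):
        # Assumption: codes 1 and 2 alias the periods of codes 8 and 9.
        rate_select += 7
    return BASE_HZ >> (rate_select - 1)


for code in (0b0011, 0b0110, 0b1111):
    hz = rtc_periodic_hz(code)
    print(f"{code:04b}: {hz} Hz, period {1000.0 / hz:.6f} ms")
```

Under these assumptions code 1111 yields 2 Hz, i.e. a 500 ms period, which is
the slowest rate the periodic interrupt offers.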
> +
> +This is the RAM map originally used for the RTC/CMOS:
> +
> +Location    Size    Description
> +------------------------------------------
> +00h         byte    Current second (BCD)
> +01h         byte    Seconds alarm (BCD)
> +02h         byte    Current minute (BCD)
> +03h         byte    Minutes alarm (BCD)
> +04h         byte    Current hour (BCD)
> +05h         byte    Hours alarm (BCD)
> +06h         byte    Current day of week (BCD)
> +07h         byte    Current day of month (BCD)
> +08h         byte    Current month (BCD)
> +09h         byte    Current year (BCD)
> +0Ah         byte    Register A
> +                       bit 7   = Update in progress
> +                       bit 6-4 = Divider for clock
> +                                  000 = 4.194 MHz
> +                                  001 = 1.049 MHz
> +                                  010 = 32 kHz
> +                                  10X = test modes
> +                                  110 = reset / disable
> +                                  111 = reset / disable
> +                       bit 3-0 = Rate selection for periodic interrupt
> +                                  0000 = periodic timer disabled
> +                                  0001 = 3.90625 mS
> +                                  0010 = 7.8125 mS
> +                                  0011 = .122070 mS
> +                                  0100 = .244141 mS
> +                                     ...
> +                                  1101 = 125 mS
> +                                  1110 = 250 mS
> +                                  1111 = 500 mS
> +0Bh         byte    Register B
> +                       bit 7   = Run (0) / Halt (1)
> +                       bit 6   = Periodic interrupt enable
> +                       bit 5   = Alarm interrupt enable
> +                       bit 4   = Update-ended interrupt enable
> +                       bit 3   = Square wave interrupt enable
> +                       bit 2   = BCD calendar (0) / Binary (1)
> +                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
> +                       bit 0   = 0 (DST off) / 1 (DST enabled)
> +0Ch         byte    Register C (read only)
> +                       bit 7   = interrupt request flag (IRQF)
> +                       bit 6   = periodic interrupt flag (PF)
> +                       bit 5   = alarm interrupt flag (AF)
> +                       bit 4   = update interrupt flag (UF)
> +                       bit 3-0 = reserved
> +0Dh         byte    Register D (read only)
> +                       bit 7   = RTC has power
> +                       bit 6-0 = reserved
> +32h         byte    Current century BCD (*)
> +  (*) location vendor specific and now determined from ACPI global tables
> +
> +2.3) APIC
> +
> +On Pentium and later processors, an on-board timer is available to each CPU
> +as part of the Advanced Programmable Interrupt Controller. The APIC is
> +accessed through memory mapped registers and provides interrupt service to each

                    memory-mapped

> +CPU, used for IPIs and local timer interrupts.
> +
> +Although in theory the APIC is a safe and stable source for local interrupts,
> +in practice, many bugs and glitches have occurred due to the special nature of
> +the APIC CPU-local memory mapped hardware. Beware that CPU errata may affect

                          memory-mapped

> +the use of the APIC and that workarounds may be required. In addition, some of
> +these workarounds pose unique constraints for virtualization - requiring either
> +extra overhead incurred from extra reads of memory mapped I/O or additional

                                               ditto

> +functionality that may be more computationally expensive to implement.
> +
> +Since the APIC is documented quite well in the Intel and AMD manuals, we will
> +avoid repetition of the detail here. It should be pointed out that the APIC
> +timer is programmed through the LVT (local vector table) timer register, is
> +capable of one-shot or periodic operation, and is based on the bus clock
> +divided down by the programmable divider register.
> +
> +2.4) HPET
> +
> +HPET is quite complex, and was originally intended to replace the PIT / RTC
> +support of the X86 PC. It remains to be seen whether that will be the case, as
> +the de-facto standard of PC hardware is to emulate these older devices. Some

       de facto

> +systems designated as legacy free may support only the HPET as a hardware timer
> +device.
> +
> +The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
> +but allowing implementation freedom to support many more. It also imposes no
> +fixed rate on the timer frequency, but does impose some extremal values on
> +frequency, error and slew.
> +
> +In general, the HPET is recommended as a high precision (compared to PIT / RTC)
> +time source which is independent of local variation (as there is only one HPET
> +in any given system). The HPET is also memory mapped, and its presence is

                                           memory-mapped,

> +indicated through ACPI table by the BIOS.

                     "an ACPI table" or "ACPI tables"

> +
> +Detailed specification of the HPET is beyond the current scope of this
> +document, as it is also very well documented elsewhere.
> +
> +2.5) Offboard Timers
> +
> +Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
> +timing chips built into the cards which may have registers which are accessible
> +to kernel or user drivers. To the author's knowledge, using these to generate
> +a clocksource for a Linux or other kernel has not yet been attempted and is in
> +general frowned upon as not playing by the agreed rules of the game. Such a
> +timer device would require additional support to be virtualized properly and is
> +not considered important at this time as no known operating system does this.
> +
> +=========================================================================
> +
> +3) TSC Hardware
> +
> +The TSC or time stamp counter is relatively simple in theory; it counts
> +instruction cycles issued by the processor, which can be used as a measure of
> +time. In practice, due to a number of problems, it is the most complicated
> +time keeping device to use.

   timekeeping

> +
> +The TSC is represented internally as a 64-bit MSR which can be read with the
> +RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware
> +limitations made it possible to write the TSC, but generally on old hardware it
> +was only possible to write the low 32 bits of the 64-bit counter, and the upper
> +32 bits of the counter were cleared. Now, however, on Intel processors family
> +0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
> +has been lifted and all 64 bits are writable. On AMD systems, the ability to
> +write the TSC MSR is not an architectural guarantee.
> +
> +The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
> +means of the CR4.TSD bit, which, when set, disables CPL > 0 TSC access.
> +
> +Some vendors have implemented an additional instruction, RDTSCP, which returns
> +atomically not just the TSC, but an indicator which corresponds to the
> +processor number. This can be used to index into an array of TSC variables to
> +determine offset information in SMP systems where TSCs are not synchronized.
> +
> +Both VMX and SVM provide extension fields in the virtualization hardware which
> +allow the guest visible TSC to be offset by a constant. Newer implementations
> +promise to allow the TSC to additionally be scaled, but this hardware is not
> +yet widely available.
> +
> +3.1) TSC synchronization
> +
> +The TSC is a CPU-local clock in most implementations. This means, on SMP
> +platforms, the TSCs of different CPUs may start at different times depending
> +on when the CPUs are powered on. Generally, CPUs on the same die will share
> +the same clock, however, this is not always the case.
> +
> +The BIOS may attempt to resynchronize the TSCs as a result during the poweron

   "as a result" of what? That phrase seems superfluous.

> +process and the operating system or other system software may attempt to do
> +this as well. Several hardware limitations make the problem worse - if it is
> +not possible to write the full 64 bits of the TSC, it may be impossible to
> +match the TSC in newly arriving CPUs to that of the rest of the system,
> +resulting in unsynchronized TSCs. This may be done by BIOS or system software,
> +but in practice, getting a perfectly synchronized TSC will not be possible
> +unless all values are read from the same clock, which generally only is
> +possible on single socket systems or those with special hardware support.
> +
> +3.2) TSC and CPU hotplug
> +
> +As touched on already, CPUs which arrive later than the boot time of the system
> +may not have a TSC value that is synchronized with the rest of the system.
> +Either system software, BIOS, or SMM code may actually try to establish the TSC
> +to a value matching the rest of the system, but a perfect match is usually not
> +a guarantee.
> +
> +3.3) TSC and multi-socket / NUMA
> +
> +Multi-socket systems, especially large multi-socket systems are likely to have
> +individual clocksources rather than a single, universally distributed clock.
> +Since these clocks are driven by different crystals, they will not have
> +perfectly matched frequency, and temperature and electrical variations will
> +cause the cpu clocks, and thus the TSCs to drift over time. Depending on the

             CPU

> +exact clock and bus design, the drift may or may not be fixed in absolute
> +error, and may accumulate over time.
> +
> +In addition, very large systems may deliberately slew the clocks of individual
> +cores. This technique, known as spread-spectrum clocking, reduces EMI at the
> +clock frequency and harmonics of it, which may be required to pass FCC
> +standards for telecommunications and computer equipment.
> +
> +It is recommended not to trust the TSCs to remain synchronized on NUMA or
> +multiple socket systems for these reasons.
> +
> +3.4) TSC and C-states
> +
> +C-states, or idling states of the processor, especially C1E and deeper sleep
> +states may be problematic for TSC as well. The TSC may stop advancing in such
> +a state, resulting in a TSC which is behind that of other CPUs when execution
> +is resumed. Such CPUs must be detected and flagged by the operating system
> +based on CPU and chipset identifications.
> +
> +The TSC in such a case may be corrected by catching it up to a known external
> +clocksource.
> +
> +3.5) TSC frequency change / P-states
> +
> +To make things slightly more interesting, some CPUs may change requency. They

                                                                  frequency.
> +may or may not run the TSC at the same rate, and because the frequency change
> +may be staggered or slewed, at some points in time, the TSC rate may not be
> +known other than falling within a range of values. In this case, the TSC will
> +not be a stable time source, and must be calibrated against a known, stable,
> +external clock to be a usable source of time.
> +
> +Whether the TSC runs at a constant rate or scales with the P-state is model
> +dependent and must be determined by inspecting CPUID, chipset or various MSR
> +fields.
> +
> +In addition, some vendors have known bugs where the P-state is actually
> +compensated for properly during normal operation, but when the processor is
> +inactive, the P-state may be raised temporarily to service cache misses from
> +other processors. In such cases, the TSC on halted CPUs could advance faster
> +than that of non-halted processors. AMD Turion processors are known to have
> +this problem.
> +
> +3.6) TSC and STPCLK / T-states
> +
> +External signals given to the processor may also have the effect of stopping
> +the TSC. This is typically done for thermal emergency power control to prevent
> +an overheating condition, and typically, there is no way to detect that this
> +condition has happened.
> +
> +3.7) TSC virtualization - VMX
> +
> +VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
> +instructions, which is enough for full virtualization of TSC in any manner. In
> +addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
> +field specified in the VMCS. Special instructions must be used to read and
> +write the VMCS field.
> +
> +3.8) TSC virtualization - SVM
> +
> +VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP

   SVM

> +instructions, which is enough for full virtualization of TSC in any manner. In
> +addition, SVM allows passing through the host TSC plus an additional offset
> +field specified in the SVM control block.
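
Both offset mechanisms described in 3.7 and 3.8 come down to the same
arithmetic: the guest observes the host TSC plus an offset, truncated to 64
bits. A rough sketch of that arithmetic is below. The helper names are
hypothetical and do not correspond to any KVM, VMX or SVM interface.

```python
MASK64 = 2**64 - 1  # the TSC is a 64-bit counter


def offset_for(host_tsc, desired_guest_tsc):
    """Offset that makes the guest read desired_guest_tsc when the host reads host_tsc."""
    return (desired_guest_tsc - host_tsc) & MASK64


def guest_tsc(host_tsc, offset):
    """Value the guest sees: host TSC plus offset, wrapping modulo 2**64."""
    return (host_tsc + offset) & MASK64


host_now = 5_000_000_000
off = offset_for(host_now, 0)             # make the guest TSC start from zero
print(guest_tsc(host_now, off))           # 0
print(guest_tsc(host_now + 1234, off))    # 1234
```

Because the addition wraps, a "negative" offset is simply stored as its 64-bit
two's complement value, which is why the guest above starts counting from zero
even though the host counter is already large.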
> +
> +3.9) TSC feature bits in Linux
> +
> +In summary, there is no way to guarantee the TSC remains in perfect
> +synchronization unless it is explicitly guaranteed by the architecture. Even
> +if so, the TSCs in multi-sockets or NUMA systems may still run independently
> +despite being locally consistent.
> +
> +The following feature bits are used by Linux to signal various TSC attributes,
> +but they can only be taken to be meaningful for UP or single node systems.
> +
> +X86_FEATURE_TSC          : The TSC is available in hardware
> +X86_FEATURE_RDTSCP       : The RDTSCP instruction is available
> +X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states
> +X86_FEATURE_NONSTOP_TSC  : The TSC does not stop in C-states
> +X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware)
> +
> +4) Virtualization Problems
> +
> +Timekeeping is especially problematic for virtualization because a number of
> +challenges arise. The most obvious problem is that time is now shared between
> +the host and, potentially, a number of virtual machines. This happens
> +naturally on X86 systems when SMM mode is used by the BIOS, but not to such a
> +degree nor with such frequency. However, the fact that SMM mode may cause
> +similar problems to virtualization makes it a good justification for solving
> +many of these problems on bare metal.
> +
> +4.1) Interrupt clocking
> +
> +One of the most immediate problems that occurs with legacy operating systems
> +is that the system timekeeping routines are often designed to keep track of
> +time by counting periodic interrupts. These interrupts may come from the PIT
> +or the RTC, but the problem is the same: the host virtualization engine may not
> +be able to deliver the proper number of interrupts per second, and so guest
> +time may fall behind. This is especially problematic if a high interrupt rate
> +is selected, such as 1000 HZ, which is unfortunately the default for many Linux
> +guests.
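
To give a feel for the size of the effect, here is a toy model of a guest that
keeps time purely by counting ticks. The numbers are invented for illustration
and the function is hypothetical; it only restates the paragraph above as
arithmetic.

```python
def guest_lag(hz, real_seconds, ticks_delivered):
    """Return (guest_seconds, lag_seconds, missing_ticks) for a tick-counting guest.

    hz              -- interrupt rate the guest programmed
    real_seconds    -- wall-clock time that actually elapsed
    ticks_delivered -- interrupts the host managed to inject in that time
    """
    expected_ticks = round(hz * real_seconds)
    guest_seconds = ticks_delivered / hz        # what the guest believes elapsed
    missing_ticks = expected_ticks - ticks_delivered
    return guest_seconds, real_seconds - guest_seconds, missing_ticks


# A guest asking for 1000 interrupts per second that receives 97% of them
# over one minute of real time.
guest_s, lag_s, missing = guest_lag(hz=1000, real_seconds=60, ticks_delivered=58_200)
print(guest_s, lag_s, missing)
```

In this made-up case the guest falls 1800 ticks behind in a single minute, and
that many extra interrupts would have to be injected to bring it back in line.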
> +
> +There are three approaches to solving this problem; first, it may be possible
> +to simply ignore it. Guests which have a separate time source for tracking
> +'wall clock' or 'real time' may not need any adjustment of their interrupts to
> +maintain proper time. If this is not sufficient, it may be necessary to inject
> +additional interrupts into the guest in order to increase the effective
> +interrupt rate. This approach leads to complications in extreme conditions,
> +where host load or guest lag is too much to compensate for, and thus another
> +solution to the problem has arisen: the guest may need to become aware of lost
> +ticks and compensate for them internally. Although promising in theory, the
> +implementation of this policy in Linux has been extremely error prone, and a
> +number of buggy variants of lost tick compensation are distributed across
> +commonly used Linux systems.
> +
> +Windows uses periodic RTC clocking as a means of keeping time internally, and
> +thus requires interrupt slewing to keep proper time. It does use a low enough
> +rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
> +practice.
> +
> +4.2) TSC sampling and serialization
> +
> +As the highest precision time source available, the cycle counter of the CPU
> +has aroused much interest from developers. As explained above, this timer has
> +many problems unique to its nature as a local, potentially unstable and
> +potentially unsynchronized source. One issue which is not unique to the TSC,
> +but is highlighted because of it's very precise nature is sampling delay. By

                                 its

> +definition, the counter, once read is already old. However, it is also
> +possible for the counter to be read ahead of the actual use of the result.
> +This is a consequence of the superscalar execution of the instruction stream,
> +which may execute instructions out of order. Such execution is called
> +non-serialized. Forcing serialized execution is necessary for precise
> +measurement with the TSC, and requires a serializing instruction, such as CPUID
> +or an MSR read.
> +
> +Since CPUID may actually be virtualized by a trap and emulate mechanism, this
> +serialization can pose a performance issue for hardware virtualization. An
> +accurate time stamp counter reading may therefore not always be available, and
> +it may be necessary for an implementation to guard against "backwards" reads of
> +the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
> +system.
> +
> +4.3) Timespec aliasing
> +
> +Additionally, this lack of serialization from the TSC poses another challenge
> +when using results of the TSC when measured against another time source. As
> +the TSC is much higher precision, many possible values of the TSC may be read
> +while another clock is still expressing the same value.
> +
> +That is, you may read (T,T+10) while external clock C maintains the same value.
> +Due to non-serialized reads, you may actually end up with a range which
> +fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but
> +calibrated against an external value may have a range of valid values.
> +Re-calibrating this computation may actually cause time, as computed after the
> +calibration, to go backwards, compared with time computed before the
> +calibration.
> +
> +This problem is particularly pronounced with an internal time source in Linux,
> +the kernel time, which is expressed in the theoretically high resultion

                                                                 resolution

> +timespec - but which advances in much larger granularity intervals, sometimes
> +at the rate of jiffies, and possibly in catchup modes, at a much larger step.
> +
> +This aliasing requires care in the computation and recalibration of kvmclock
> +and any other values derived from TSC computation (such as TSC virtualization
> +itself).
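
One defensive pattern against the backwards step described in 4.3 is to
remember the last value handed out and never return anything older. The sketch
below illustrates only that pattern, combined with a fixed-point
multiplier-and-shift conversion from cycles to nanoseconds. It is not how
kvmclock is implemented, and every name in it is invented for the example.

```python
class ScaledClock:
    """Nanosecond clock derived from a cycle counter.

    ns = base_ns + (((tsc - base_tsc) * mult) >> shift)
    Hypothetical illustration; not a KVM data structure.
    """

    def __init__(self, base_tsc, base_ns, mult, shift):
        self.last_ns = 0
        self.recalibrate(base_tsc, base_ns, mult, shift)

    def recalibrate(self, base_tsc, base_ns, mult, shift):
        """Re-anchor the clock to a new (tsc, ns) pair read from an external source."""
        self.base_tsc, self.base_ns = base_tsc, base_ns
        self.mult, self.shift = mult, shift

    def read(self, tsc):
        ns = self.base_ns + (((tsc - self.base_tsc) * self.mult) >> self.shift)
        # Clamp: never report a time older than one already returned.
        self.last_ns = max(self.last_ns, ns)
        return self.last_ns


# Assume a 2 GHz counter: 0.5 ns per cycle, i.e. mult / 2**shift == 0.5.
clock = ScaledClock(base_tsc=0, base_ns=0, mult=2**31, shift=32)
t1 = clock.read(2000)                       # 1000 ns
# The external clock is coarse and reports 990 ns at cycle 2000.
clock.recalibrate(base_tsc=2000, base_ns=990, mult=2**31, shift=32)
t2 = clock.read(2010)                       # raw value 995 ns, clamped to 1000
t3 = clock.read(2040)                       # raw value 1010 ns
print(t1, t2, t3)
```

The clamp hides the step at the cost of time appearing to stand still briefly,
which may or may not be acceptable depending on the consumer.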
> +
> +4.4) Migration
> +
> +Migration of a virtual machine raises problems for timekeeping in two ways.
> +First, the migration itself may take time, during which interrupts can not be

                                                                      cannot

> +delivered, and after which, the guest time may need to be caught up. NTP may
> +be able to help to some degree here, as the clock correction required is
> +typically small enough to fall in the NTP-correctable window.
> +
> +An additional concern is that timers based off the TSC (or HPET, if the raw bus
> +clock is exposed) may now be running at different rates, requiring compensation
> +in some way in the hypervisor by virtualizing these timers. In addition,
> +migrating to a faster machine may preclude the use of a passthrough TSC, as a
> +faster clock can not be made visible to a guest without the potential of time

                cannot

> +advancing faster than usual. A slower clock is less of a problem, as it can
> +always be caught up to the original rate. KVM clock avoids these problems by
> +simply storing multipliers and offsets against the TSC for the guest to convert
> +back into nanosecond resolution values.
> +
> +4.5) Scheduling
> +
> +Since scheduling may be based on precise timing and firing of interrupts, the
> +scheduling algorithms of an operating system may be adversely affected by
> +virtualization. In theory, the effect is random and should be uniformly
> +distributed, but in contrived as well as real scenarios (guest device access,
> +causes virtualization exits, possible context switch), this may not always be
> +the case. The effect of this has not been well studied (ed: has it? any
> +published results?).
> +
> +In an attempt to workaround this, several implementations have provided a

                    work around this,

> +paravirtualized scheduler clock, which reveals the true amount of CPU time for
> +which a virtual machine has been running.
> +
> +4.6) Watchdogs
> +
> +Watchdog timers, such as the lock detector in Linux may fire accidentally when
> +running under hardware virtualization due to timer interrupts being delayed or
> +misinterpretation of the passage of real time. Usually, these warnings are
> +spurious and can be ignored, but in some circumstances it may be necessary to
> +disable such detection.
> +
> +4.7) Delays and precision timing
> +
> +Precise timing and delays may not be possible in a virtualized system. This
> +can happen if the system is controlling physical hardware, or issues delays to
> +compensate for slower I/O to and from devices. The first issue is not solvable
> +in general for a virtualized system; hardware control software can't be
> +adequately virtualized without a full real-time operating system, which would
> +require an RT aware virtualization platform.
> +
> +The second issue may cause performance problems, but this is unlikely to be a
> +significant issue. In many cases these delays may be eliminated through
> +configuration or paravirtualization.
> +
> +4.8) Covert channels and leaks
> +
> +In addition to the above problems, time information will inevitably leak to the
> +guest about the host in anything but a perfect implementation of virtualized
> +time. This may allow the guest to infer the presence of a hypervisor (as in a
> +red-pill type detection), and it may allow information to leak between guests
> +by using CPU utilization itself as a signalling channel. Preventing such
> +problems would require completely isolated virtual time which may not track
> +real time any longer. This may be useful in certain security or QA contexts,
> +but in general isn't recommended for real-world deployment scenarios.
> +
> --

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***