2010-06-15 07:34:25

by Zachary Amsden

[permalink] [raw]
Subject: TSC cleanups, fixes, documentation for KVM

This is an attempt to cleanup and plug some of the many existing holes
with TSC and as a direct result, kvmclock. It is not a perfect attempt;
no trap and emulate is yet done for TSC in cases in which it can not be
perfectly virtualized. In particular, on platforms with unstable /
buggy TSCs, it is not possible to even get a perfectly synchronized TSC.
TSC virtualization should be much improved and many bugs with suspend,
cpufreq as well as general synchronization errors will be eliminated.


2010-06-15 07:34:30

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 01/17] Eliminate duplicated timer code

Move duplicated timer related code into arch_vcpu_load rather
than vendor callouts. Should be an isomorphic transformation.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/kvm/svm.c | 2 --
arch/x86/kvm/vmx.c | 3 +--
arch/x86/kvm/x86.c | 4 ++++
3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2ae0c39..09b2145 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -960,8 +960,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (is_nested(svm))
svm->nested.hsave->control.tsc_offset += delta;
}
- vcpu->cpu = cpu;
- kvm_migrate_timers(vcpu);
svm->asid_generation = 0;
}

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b1542b9..a657e08 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -898,14 +898,12 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
struct desc_ptr dt;
unsigned long sysenter_esp;

- kvm_migrate_timers(vcpu);
set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);
local_irq_disable();
list_add(&vmx->local_vcpus_link,
&per_cpu(vcpus_on_cpu, cpu));
local_irq_enable();

- vcpu->cpu = cpu;
/*
* Linux uses per-cpu TSS and GDT, so set these when switching
* processors.
@@ -4106,6 +4104,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)

cpu = get_cpu();
vmx_vcpu_load(&vmx->vcpu, cpu);
+ vmx->vcpu.cpu = cpu;
err = vmx_vcpu_setup(vmx);
vmx_vcpu_put(&vmx->vcpu);
put_cpu();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7a4073b..05754c1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1799,6 +1799,9 @@ out:

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
+ if (unlikely(vcpu->cpu != cpu)) {
+ kvm_migrate_timers(vcpu);
+ }
kvm_x86_ops->vcpu_load(vcpu, cpu);
if (unlikely(per_cpu(cpu_tsc_khz, cpu) == 0)) {
unsigned long khz = cpufreq_quick_get(cpu);
@@ -1807,6 +1810,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
per_cpu(cpu_tsc_khz, cpu) = khz;
}
kvm_request_guest_time_update(vcpu);
+ vcpu->cpu = cpu;
}

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
--
1.7.1

2010-06-15 07:34:36

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 04/17] Fix deep C-state TSC desynchronization

When CPUs with unstable TSCs enter deep C-state, TSC may stop
running. This causes us to require resynchronization. Since
we can't tell when this may potentially happen, we assume the
worst by forcing re-compensation for it at every point the VCPU
task is descheduled.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/kvm/x86.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c8289d0..618c435 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1822,7 +1822,18 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
+
vcpu->arch.last_host_tsc = native_read_tsc();
+
+ /*
+ * When potentially leaving a CPU with unstable TSCs, we risk
+ * that the CPU enters deep C-state. If it does, the TSC may
+ * go out of sync but we will not recalibrate because the test
+ * vcpu->cpu != cpu can not detect this condition. So set
+ * vcpu->cpu = -1 to force the recalibration above.
+ */
+ if (check_tsc_unstable())
+ vcpu->cpu = -1;
}

static int is_efer_nx(void)
--
1.7.1

2010-06-15 07:34:43

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 07/17] Perform hardware_enable in CPU_STARTING callback

The CPU_STARTING callback was added upstream with the intention
of being used for KVM, specifically for the hardware enablement
that must be done before we can run in hardware virt. It had
bugs on the x86_64 architecture at the time, where it was called
after CPU_ONLINE. The arches have since merged and the bug is
gone.

It might be noted other features should probably start making
use of this callback; microcode updates in particular which
might be fixing important erratums would be best applied before
beginning to run user tasks.

Signed-off-by: Zachary Amsden <[email protected]>
---
virt/kvm/kvm_main.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 84a0906..01a8876 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1957,10 +1957,10 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
cpu);
hardware_disable(NULL);
break;
- case CPU_ONLINE:
+ case CPU_STARTING:
printk(KERN_INFO "kvm: enabling virtualization on CPU%d\n",
cpu);
- smp_call_function_single(cpu, hardware_enable, NULL, 1);
+ hardware_enable(NULL);
break;
}
return NOTIFY_OK;
@@ -2095,7 +2095,6 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,

static struct notifier_block kvm_cpu_notifier = {
.notifier_call = kvm_cpu_hotplug,
- .priority = 20, /* must be > scheduler priority */
};

static int vm_stat_get(void *_offset, u64 *val)
--
1.7.1

2010-06-15 07:34:50

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 08/17] Add clock sync request to hardware enable

If there are active VCPUs which are marked as belonging to
a particular hardware CPU, request a clock sync for them when
enabling hardware; the TSC could be desynchronized on a newly
arriving CPU, and we need to recompute guests system time
relative to boot after a suspend event.

This covers both cases.

Note that it is acceptable to take the spinlock, as either
no other tasks will be running and no locks held (BSP after
resume), or other tasks will be guaranteed to drop the lock
relatively quickly (AP on CPU_STARTING).

Noting we now get clock synchronization requests for VCPUs
which are starting up (or restarting), it is tempting to
attempt to remove the arch/x86/kvm/x86.c CPU hot-notifiers
at this time, however it is not correct to do so; they are
required for systems with non-constant TSC as the frequency
may not be known immediately after the processor has started
until the cpufreq driver has had a chance to run and query
the chipset.

Updated: implement better locking semantics for hardware_enable

Removed the hack of dropping and retaking the lock by adding the
semantic that we always hold kvm_lock when hardware_enable is
called. The one place that doesn't need to worry about it is
resume, as resuming a frozen CPU, the spinlock won't be taken.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/kvm/x86.c | 8 ++++++++
virt/kvm/kvm_main.c | 6 +++++-
2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b15d03..05c559d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5442,7 +5442,15 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)

int kvm_arch_hardware_enable(void *garbage)
{
+ struct kvm *kvm;
+ struct kvm_vcpu *vcpu;
+ int i;
+
kvm_shared_msr_cpu_online();
+ list_for_each_entry(kvm, &vm_list, vm_list)
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ if (vcpu->cpu == smp_processor_id())
+ kvm_request_guest_time_update(vcpu);
return kvm_x86_ops->hardware_enable(garbage);
}

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 01a8876..e35419a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1960,7 +1960,9 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
case CPU_STARTING:
printk(KERN_INFO "kvm: enabling virtualization on CPU%d\n",
cpu);
+ spin_lock(&kvm_lock);
hardware_enable(NULL);
+ spin_unlock(&kvm_lock);
break;
}
return NOTIFY_OK;
@@ -2165,8 +2167,10 @@ static int kvm_suspend(struct sys_device *dev, pm_message_t state)

static int kvm_resume(struct sys_device *dev)
{
- if (kvm_usage_count)
+ if (kvm_usage_count) {
+ WARN_ON(spin_is_locked(&kvm_lock));
hardware_enable(NULL);
+ }
return 0;
}

--
1.7.1

2010-06-15 07:34:59

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 14/17] Fix SVM VMCB reset

On reset, VMCB TSC should be set to zero. Instead, code was setting
tsc_offset to zero, which passes through the underlying TSC.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/kvm/svm.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 4654507..a032e62 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -767,7 +767,7 @@ static void init_vmcb(struct vcpu_svm *svm)

control->iopm_base_pa = iopm_base;
control->msrpm_base_pa = __pa(svm->msrpm);
- control->tsc_offset = 0;
+ guest_write_tsc(&svm->vcpu, 0);
control->int_ctl = V_INTR_MASKING_MASK;

init_seg(&save->es);
--
1.7.1

2010-06-15 07:35:07

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 16/17] TSC reset compensation

Attempt to synchronize TSCs which are reset to the same value. In the
case of a reliable hardware TSC, we can just re-use the same offset, but
on non-reliable hardware, we can get closer by adjusting the offset to
match the elapsed time.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/kvm/x86.c | 34 ++++++++++++++++++++++++++++++++--
1 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e836e9..cedb71f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -937,14 +937,44 @@ static inline void kvm_request_guest_time_update(struct kvm_vcpu *v)
set_bit(KVM_REQ_CLOCK_SYNC, &v->requests);
}

+static inline int kvm_tsc_reliable(void)
+{
+ return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+ boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+ !check_tsc_unstable());
+}
+
void guest_write_tsc(struct kvm_vcpu *vcpu, u64 data)
{
struct kvm *kvm = vcpu->kvm;
- u64 offset;
+ u64 offset, ns, elapsed;

spin_lock(&kvm->arch.tsc_write_lock);
offset = data - native_read_tsc();
- kvm->arch.last_tsc_nsec = get_kernel_ns();
+ ns = get_kernel_ns();
+ elapsed = ns - kvm->arch.last_tsc_nsec;
+
+ /*
+ * Special case: identical write to TSC within 5 seconds of
+ * another CPU is interpreted as an attempt to synchronize
+ * (the 5 seconds is to accomodate host load / swapping).
+ *
+ * In that case, for a reliable TSC, we can match TSC offsets,
+ * or make a best guest using kernel_ns value.
+ */
+ if (data == kvm->arch.last_tsc_write && elapsed < 5 * NSEC_PER_SEC) {
+ if (kvm_tsc_reliable()) {
+ offset = kvm->arch.last_tsc_offset;
+ pr_debug("kvm: matched tsc offset for %llu\n", data);
+ } else {
+ u64 tsc_delta = elapsed * __get_cpu_var(cpu_tsc_khz);
+ tsc_delta = tsc_delta / USEC_PER_SEC;
+ offset -= tsc_delta;
+ pr_debug("kvm: adjusted tsc offset by %llu\n", tsc_delta);
+ }
+ ns = kvm->arch.last_tsc_nsec;
+ }
+ kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = data;
kvm->arch.last_tsc_offset = offset;
kvm_x86_ops->write_tsc_offset(vcpu, offset);
--
1.7.1

2010-06-15 07:35:29

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 17/17] Add timekeeping documentation

Basic informational document about x86 timekeeping and how KVM
is affected.

Signed-off-by: Zachary Amsden <[email protected]>
---
Documentation/kvm/timekeeping.txt | 599 +++++++++++++++++++++++++++++++++++++
1 files changed, 599 insertions(+), 0 deletions(-)
create mode 100644 Documentation/kvm/timekeeping.txt

diff --git a/Documentation/kvm/timekeeping.txt b/Documentation/kvm/timekeeping.txt
new file mode 100644
index 0000000..4ce8edf
--- /dev/null
+++ b/Documentation/kvm/timekeeping.txt
@@ -0,0 +1,599 @@
+
+ Timekeeping Virtualization for X86-Based Architectures
+
+ Zachary Amsden <[email protected]>
+ Copyright (c) 2010, Red Hat. All rights reserved.
+
+1) Overview
+2) Timing Devices
+3) TSC Hardware
+4) Virtualization Problems
+
+=========================================================================
+
+1) Overview
+
+One of the most complicated parts of the X86 platform, and specifically,
+the virtualization of this platform is the plethora of timing devices available
+and the complexity of emulating those devices. In addition, virtualization of
+time introduces a new set of challenges because it introduces a multiplexed
+division of time beyond the control of the guest CPU.
+
+First, we will describe the various timekeeping hardware available, then
+present some of the problems which arise and solutions available, giving
+specific recommendations for certain classes of KVM guests.
+
+The purpose of this document is to collect data and information relevant to
+time keeping which may be difficult to find elsewhere, specifically,
+information relevant to KVM and hardware based virtualization.
+
+=========================================================================
+
+2) Timing Devices
+
+First we discuss the basic hardware devices available. TSC and the related
+KVM clock are special enough to warrant a full exposition and are described in
+the following section.
+
+2.1) i8254 - PIT
+
+One of the first timer devices available is the programmable interrupt timer,
+or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
+channels which can be programmed to deliver periodic or one-shot interrupts.
+These three channels can be configured in different modes and have individual
+counters. Channel 1 and 2 were not available for general use in the original
+IBM PC, and historically were connected to control RAM refresh and the PC
+speaker. Now the PIT is typically integrated as part of an emulated chipset
+and a separate physical PIT is not used.
+
+The PIT uses I/O ports 0x40h - 0x43. Access to the 16-bit counters is done
+using single or multiple byte access to the I/O ports. There are 6 modes
+available, but not all modes are available to all timers, as only timer 2
+has a connected gate input, required for modes 1 and 5. The gate line is
+controlled by port 61h, bit 0, as illustrated in the following diagram.
+
+ -------------- ----------------
+| | | |
+| 1.1932 MHz |---------->| CLOCK OUT | ---------> IRQ 0
+| Clock | | | |
+ -------------- | +->| GATE TIMER 0 |
+ | ----------------
+ |
+ | ----------------
+ | | |
+ |------>| CLOCK OUT | ---------> 66.3 KHZ DRAM
+ | | | (aka /dev/null)
+ | +->| GATE TIMER 1 |
+ | ----------------
+ |
+ | ----------------
+ | | |
+ |------>| CLOCK OUT | ---------> Port 61h, bit 5
+ | | |
+Port 61h, bit 0 ---------->| GATE TIMER 2 | \_.----
+ ---------------- _| )---- Speaker
+ / *----
+Port 61h, bit 1 -----------------------------------/
+
+The timer modes are now described.
+
+Mode 0: Single Timeout. This is a one shot software timeout that counts down
+ when the gate is high (always true for timers 0 and 1). When the count
+ reaches zero, the output goes high.
+
+Mode 1: Triggered One Shot. The output is intially set high. When the gate
+ line is set high, a countdown is initiated (which does not stop if the gate is
+ lowered), during which the output is set low. When the count reaches zero,
+ the output goes high.
+
+Mode 2: Rate Generator. The output is initially set high. When the countdown
+ reaches 1, the output goes low for one count and then returns high. The value
+ is reloaded and the countdown automatically resume. If the gate line goes
+ low, the count is halted. If the output is low when the gate is lowered, the
+ output automatically goes high (this only affects timer 2).
+
+Mode 3: Square Wave. This generates a sine wave. The count determines the
+ length of the pulse, which alternates between high and low when zero is
+ reached. The count only proceeds when gate is high and is automatically
+ reloaded on reaching zero. The count is decremented twice at each clock.
+ If the count is even, the clock remains high for N/2 counts and low for N/2
+ counts; if the clock is odd, the clock is high for (N+1)/2 counts and low
+ for (N-1)/2 counts. Only even values are latched by the counter, so odd
+ values are not observed when reading.
+
+Mode 4: Software Strobe. After programming this mode and loading the counter,
+ the output remains high until the counter reaches zero. Then the output
+ goes low for 1 clock cycle and returns high. The counter is not reloaded.
+ Counting only occurs when gate is high.
+
+Mode 5: Hardware Strobe. After programming and loading the counter, the
+ output remains high. When the gate is raised, a countdown is initiated
+ (which does not stop if the gate is lowered). When the counter reaches zero,
+ the output goes low for 1 clock cycle and then returns high. The counter is
+ not reloaded.
+
+In addition to normal binary counting, the PIT supports BCD counting. The
+command port, 0x43h is used to set the counter and mode for each of the three
+timers.
+
+PIT commands, issued to port 0x43, using the following bit encoding:
+
+Bit 7-4: Command (See table below)
+Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
+Bit 0 : Binary (0) / BCD (1)
+
+Command table:
+
+0000 - Latch Timer 0 count for port 0x40
+ sample and hold the count to be read in port 0x40;
+ additional commands ignored until counter is read;
+ mode bits ignored.
+
+0001 - Set Timer 0 LSB mode for port 0x40
+ set timer to read LSB only and force MSB to zero;
+ mode bits set timer mode
+
+0010 - Set Timer 0 MSB mode for port 0x40
+ set timer to read MSB only and force LSB to zero;
+ mode bits set timer mode
+
+0011 - Set Timer 0 16-bit mode for port 0x40
+ set timer to read / write LSB first, then MSB;
+ mode bits set timer mode
+
+0100 - Latch Timer 1 count for port 0x41 - as described above
+0101 - Set Timer 1 LSB mode for port 0x41 - as described above
+0110 - Set Timer 1 MSB mode for port 0x41 - as described above
+0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
+
+1000 - Latch Timer 2 count for port 0x42 - as described above
+1001 - Set Timer 2 LSB mode for port 0x42 - as described above
+1010 - Set Timer 2 MSB mode for port 0x42 - as described above
+1011 - Set Timer 2 16-bit mode for port 0x42 as described above
+
+1101 - General counter latch
+ Latch combination of counters into corresponding ports
+ Bit 3 = Counter 2
+ Bit 2 = Counter 1
+ Bit 1 = Counter 0
+ Bit 0 = Unused
+
+1110 - Latch timer status
+ Latch combination of counter mode into corresponding ports
+ Bit 3 = Counter 2
+ Bit 2 = Counter 1
+ Bit 1 = Counter 0
+
+ The output of ports 0x40-0x42 following this command will be:
+
+ Bit 7 = Output pin
+ Bit 6 = Count loaded (0 if timer has expired)
+ Bit 5-4 = Read / Write mode
+ 01 = MSB only
+ 10 = LSB only
+ 11 = LSB / MSB (16-bit)
+ Bit 3-1 = Mode
+ Bit 0 = Binary (0) / BCD mode (1)
+
+2.2) RTC
+
+The second device which was available in the original PC was the MC146818 real
+time clock. The original device is now obsolete, and usually emulated by the
+system chipset, sometimes by an HPET and some frankenstein IRQ routing.
+
+The RTC is accessed through CMOS variables, which uses an index register to
+control which bytes are read. Since there is only one index register, read
+of the CMOS and read of the RTC require lock protection (in addition, it is
+dangerous to allow userspace utilities such as hwclock to have direct RTC
+access, as they could corrupt kernel reads and writes of CMOS memory).
+
+The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt
+can function as a once a day alarm, a periodic alarm, and can issue interrupts
+after an update of the CMOS registers by the MC146818 is complete. The type of
+interrupt is signalled in the RTC status registers.
+
+The RTC will update the current time fields by battery power even while the
+system is off. The current time fields should not be read while an update is
+in progress, as indicated in the status register.
+
+The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
+programmed to a 32kHz divider if the RTC is to count seconds.
+
+This is the RAM map originally used for the RTC/CMOS:
+
+Location Size Description
+------------------------------------------
+00h byte Current second (BCD)
+01h byte Seconds alarm (BCD)
+02h byte Current minute (BCD)
+03h byte Minutes alarm (BCD)
+04h byte Current hour (BCD)
+05h byte Hours alarm (BCD)
+06h byte Current day of week (BCD)
+07h byte Current day of month (BCD)
+08h byte Current month (BCD)
+09h byte Current year (BCD)
+0Ah byte Register A
+ bit 7 = Update in progress
+ bit 6-4 = Divider for clock
+ 000 = 4.194 MHz
+ 001 = 1.049 MHz
+ 010 = 32 kHz
+ 10X = test modes
+ 110 = reset / disable
+ 111 = reset / disable
+ bit 3-0 = Rate selection for periodic interrupt
+ 000 = periodic timer disabled
+ 001 = 3.90625 uS
+ 010 = 7.8125 uS
+ 011 = .122070 mS
+ 100 = .244141 mS
+ ...
+ 1101 = 125 mS
+ 1110 = 250 mS
+ 1111 = 500 mS
+0Bh byte Register B
+ bit 7 = Run (0) / Halt (1)
+ bit 6 = Periodic interrupt enable
+ bit 5 = Alarm interrupt enable
+ bit 4 = Update-ended interrupt enable
+ bit 3 = Square wave interrupt enable
+ bit 2 = BCD calendar (0) / Binary (1)
+ bit 1 = 12-hour mode (0) / 24-hour mode (1)
+ bit 0 = 0 (DST off) / 1 (DST enabled)
+OCh byte Register C (read only)
+ bit 7 = interrupt request flag (IRQF)
+ bit 6 = periodic interrupt flag (PF)
+ bit 5 = alarm interrupt flag (AF)
+ bit 4 = update interrupt flag (UF)
+ bit 3-0 = reserved
+ODh byte Register D (read only)
+ bit 7 = RTC has power
+ bit 6-0 = reserved
+32h byte Current century BCD (*)
+ (*) location vendor specific and now determined from ACPI global tables
+
+2.3) APIC
+
+On Pentium and later processors, an on-board timer is available to each CPU
+as part of the Advanced Programmable Interrupt Controller. The APIC is
+accessed through memory mapped registers and provides interrupt service to each
+CPU, used for IPIs and local timer interrupts.
+
+Although in theory the APIC is a safe and stable source for local interrupts,
+in practice, many bugs and glitches have occurred due to the special nature of
+the APIC CPU-local memory mapped hardware. Beware that CPU errata may affect
+the use of the APIC and that workarounds may be required. In addition, some of
+these workarounds pose unique constraints for virtualization - requiring either
+extra overhead incurred from extra reads of memory mapped I/O or additional
+functionality that may be more computationally expensive to implement.
+
+Since the APIC is documented quite well in the Intel and AMD manuals, we will
+avoid repititon of the detail here. It should be pointed out that the APIC
+timer is programmed through the LVT (local vector timer) register, is capable
+of one-shot or periodic operation, and is based on the bus clock divided down
+by the programmable divider register.
+
+2.4) HPET
+
+HPET is quite complex, and was originally intended to replace the PIT / RTC
+support of the X86 PC. It remains to be seen whether that will be the case, as
+the de-facto standard of PC hardware is to emulate these older devices. Some
+systems designated as legacy free may support only the HPET as a hardware timer
+device.
+
+The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
+but allowing implementation freedom to support many more. It also imposes no
+fixed rate on the timer frequency, but does impose some extremal values on
+frequency, error and slew.
+
+In general, the HPET is recommended as a high precision (compared to PIT /RTC)
+time source which is independent of local variation (as there is only one HPET
+in any given system). The HPET is also memory mapped, and its presence is
+indicated through ACPI table by the BIOS.
+
+Detailed specification of the HPET is beyond the current scope of this
+document, as it is also very well documented elsewhere.
+
+2.5) Offboard Timers
+
+Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
+timing chips built into the cards which may have registers which are accessible
+to kernel or user drivers. To the author's knowledge, using these to generate
+a clocksource for a Linux or other kernel has not yet been attempted and is in
+general frowned upon as not playing by the agreed rules of the game. Such a
+timer device would require additional support to be virtualized properly and is
+not considered important at this time as no known operating system does this.
+
+=========================================================================
+
+3) TSC Hardware
+
+The TSC or time stamp counter is relatively simple in theory; it counts
+instruction cycles issued by the processor, which can be used as a measure of
+time. In practice, due to a number of problems, it is the most complicated
+time keeping device to use.
+
+The TSC is represented internally as a 64-bit MSR which can be read with the
+RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware
+limitations made it possible to write the TSC, but generally on old hardware it
+was only possible to write the low 32-bits of the 64-bit counter, and the upper
+32-bits of the counter were cleared. Now, however, on Intel processors family
+0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
+has been lifted and all 64-bits are writable. On AMD systems, the ability to
+write the TSC MSR is not an architectural guarantee.
+
+The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
+means of the CR4.TSD bit, which disables CPL > 0 TSC access.
+
+Some vendors have implemented an additional instruction, RDTSCP, which returns
+atomically not just the TSC, but an indicator which corresponds to the
+processor number. This can be used to index into an array of TSC variables to
+determine offset information in SMP systems where TSCs are not synchronized.
+
+Both VMX and SVM provide extension fields in the virtualization hardware which
+allows the guest visible TSC to be offset by a constant. Newer implementations
+promise to allow the TSC to additionally be scaled, but this hardware is not
+yet widely available.
+
+3.1) TSC synchronization
+
+The TSC is a CPU-local clock in most implementations. This means, on SMP
+platforms, the TSCs of different CPUs may start at different times depending
+on when the CPUs are powered on. Generally, CPUs on the same die will share
+the same clock, however, this is not always the case.
+
+The BIOS may attempt to resynchronize the TSCs as a result during the poweron
+process and the operating system or other system software may attempt to do
+this as well. Several hardware limitations make the problem worse - if it is
+not possible to write the full 32-bits of the TSC, it may be impossible to
+match the TSC in newly arriving CPUs to that of the rest of the system,
+resulting in unsynchronized TSCs. This may be done by BIOS or system software,
+but in practice, getting a perfectly synchronized TSC will not be possible
+unless all values are read from the same clock, which generally only is
+possible on single socket systems or those with special hardware support.
+
+3.2) TSC and CPU hotplug
+
+As touched on already, CPUs which arrive later than the boot time of the system
+may not have a TSC value that is synchronized with the rest of the system.
+Either system software, BIOS, or SMM code may actually try to establish the TSC
+to a value matching the rest of the system, but a perfect match is usually not
+a guarantee.
+
+3.3) TSC and multi-socket / NUMA
+
+Multi-socket systems, especially large multi-socket systems are likely to have
+individual clocksources rather than a single, universally distributed clock.
+Since these clocks are driven by different crystals, they will not have
+perfectly matched frequency, and temperature and electrical variations will
+cause the cpu clocks, and thus the TSCs to drift over time. Depending on the
+exact clock and bus design, the drift may or may not be fixed in absolute
+error, and may accumulate over time.
+
+In addition, very large systems may deliberately slew the clocks of individual
+cores. This technique, known as spread-spectrum clocking, reduces EMI at the
+clock frequency and harmonics of it, which may be required to pass FCC
+standards for telecommunications and computer equipment.
+
+It is recommended not to trust the TSCs to remain synchronized on NUMA or
+multiple socket systems for these reasons.
+
+3.4) TSC and C-states
+
+C-states, or idling states of the processor, especially C1E and deeper sleep
+states may be problematic for TSC as well. The TSC may stop advancing in such
+a state, resulting in a TSC which is behind that of other CPUs when execution
+is resumed. Such CPUs must be detected and flagged by the operating system
+based on CPU and chipset identifications.
+
+The TSC in such a case may be corrected by catching it up to a known external
+clocksource.
+
+3.5) TSC frequency change / P-states
+
+To make things slightly more interesting, some CPUs may change requency. They
+may or may not run the TSC at the same rate, and because the frequency change
+may be staggered or slewed, at some points in time, the TSC rate may not be
+known other than falling within a range of values. In this case, the TSC will
+not be a stable time source, and must be calibrated against a known, stable,
+external clock to be a usable source of time.
+
+Whether the TSC runs at a constant rate or scales with the P-state is model
+dependent and must be determined by inspecting CPUID, chipset or various MSR
+fields.
+
+In addition, some vendors have known bugs where the P-state is actually
+compensated for properly during normal operation, but when the processor is
+inactive, the P-state may be raised temporarily to service cache misses from
+other processors. In such cases, the TSC on halted CPUs could advance faster
+than that of non-halted processors. AMD Turion processors are known to have
+this problem.
+
+3.6) TSC and STPCLK / T-states
+
+External signals given to the processor may also have the affect of stopping
+the TSC. This is typically done for thermal emergency power control to prevent
+an overheating condition, and typically, there is no way to detect that this
+condition has happened.
+
+3.7) TSC virtualization - VMX
+
+VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner. In
+addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
+field specified in the VMCS. Special instructions must be used to read and
+write the VMCS field.
+
+3.8) TSC virtualization - SVM
+
+VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner. In
+addition, SVM allows passing through the host TSC plus an additional offset
+ield specified in SVM control block.
+
+3.9) TSC feature bits in Linux
+
+In summary, there is no way to guarantee the TSC remains in perfect
+synchronization unless it is explicitly guaranteed by the architecture. Even
+if so, the TSCs in multi-sockets or NUMA systems may still run independently
+despite being locally consistent.
+
+The following feature bits are used by Linux to signal various TSC attributes,
+but they can only be taken to be meaningful for UP or single node systems.
+
+X86_FEATURE_TSC : The TSC is available in hardware
+X86_FEATURE_RDTSCP : The RDTSCP instruction is available
+X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states
+X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states
+X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware)
+
+4) Virtualization Problems
+
+Timekeeping is especially problematic for virtualization because a number of
+challenges arise. The most obvious problem is that time is now shared between
+the host and, potentially, a number of virtual machines. This happens
+naturally on X86 systems when SMM mode is used by the BIOS, but not to such a
+degree nor with such frequency. However, the fact that SMM mode may cause
+similar problems to virtualization makes it a good justification for solving
+many of these problems on bare metal.
+
+4.1) Interrupt clocking
+
+One of the most immediate problems that occurs with legacy operating systems
+is that the system timekeeping routines are often designed to keep track of
+time by counting periodic interrupts. These interrupts may come from the PIT
+or the RTC, but the problem is the same: the host virtualization engine may not
+be able to deliver the proper number of interrupts per second, and so guest
+time may fall behind. This is especially problematic if a high interrupt rate
+is selected, such as 1000 HZ, which is unfortunately the default for many Linux
+guests.
+
+There are three approaches to solving this problem; first, it may be possible
+to simply ignore it. Guests which have a separate time source for tracking
+'wall clock' or 'real time' may not need any adjustment of their interrupts to
+maintain proper time. If this is not sufficient, it may be necessary to inject
+additional interrupts into the guest in order to increase the effective
+interrupt rate. This approach leads to complications in extreme conditions,
+where host load or guest lag is too much to compensate for, and thus another
+solution to the problem has risen: the guest may need to become aware of lost
+ticks and compensate for them internally. Although promising in theory, the
+implementation of this policy in Linux has been extremely error prone, and a
+number of buggy variants of lost tick compensation are distributed across
+commonly used Linux systems.
+
+Windows uses periodic RTC clocking as a means of keeping time internally, and
+thus requires interrupt slewing to keep proper time. It does use a low enough
+rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
+practice.
+
+4.2) TSC sampling and serialization
+
+As the highest precision time source available, the cycle counter of the CPU
+has aroused much interest from developers. As explained above, this timer has
+many problems unique to its nature as a local, potentially unstable and
+potentially unsynchronized source. One issue which is not unique to the TSC,
+but is highlighted because of it's very precise nature is sampling delay. By
+definition, the counter, once read is already old. However, it is also
+possible for the counter to be read ahead of the actual use of the result.
+This is a consequence of the superscalar execution of the instruction stream,
+which may execute instructions out of order. Such execution is called
+non-serialized. Forcing serialized execution is necessary for precise
+measurement with the TSC, and requires a serializing instruction, such as CPUID
+or an MSR read.
+
+Since CPUID may actually be virtualized by a trap and emulate mechanism, this
+serialization can pose a performance issue for hardware virtualization. An
+accurate time stamp counter reading may therefore not always be available, and
+it may be necessary for an implementation to guard against "backwards" reads of
+the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
+system.
+
+4.3) Timespec aliasing
+
+Additionally, this lack of serialization from the TSC poses another challenge
+when using results of the TSC when measured against another time source. As
+the TSC is much higher precision, many possible values of the TSC may be read
+while another clock is still expressing the same value.
+
+That is, you may read (T,T+10) while external clock C maintains the same value.
+Due to non-serialized reads, you may actually end up with a range which
+fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but
+calibrated against an external value may have a range of valid values.
+Re-calibrating this computation may actually cause time, as computed after the
+calibration, to go backwards, compared with time computed before the
+calibration.
+
+This problem is particularly pronounced with an internal time source in Linux,
+the kernel time, which is expressed in the theoretically high resultion
+timespec - but which advances in much larger granularity intervals, sometimes
+at the rate of jiffies, and possibly in catchup modes, at a much larger step.
+
+This aliasing requires care in the computation and recalibration of kvmclock
+and any other values derived from TSC computation (such as TSC virtualization
+itself).
+
+4.4) Migration
+
+Migration of a virtual machine raises problems for timekeeping in two ways.
+First, the migration itself may take time, during which interrupts can not be
+delivered, and after which, the guest time may need to be caught up. NTP may
+be able to help to some degree here, as the clock correction required is
+typically small enough to fall in the NTP-correctable window.
+
+An additional concern is that timers based off the TSC (or HPET, if the raw bus
+clock is exposed) may now be running at different rates, requiring compensation
+in some may in the hypervisor by virtualizing these timers. In addition,
+migrating to a faster machine may preclude the use of a passthrough TSC, as a
+faster clock can not be made visible to a guest without the potential of time
+advancing faster than usual. A slower clock is less of a problem, as it can
+always be caught up to the original rate. KVM clock avoids these problems by
+simply storing multipliers and offsets gainst the TSC for the guest to convert
+back into nanosecond resolution values.
+
+4.5) Scheduling
+
+Since scheduling may be based on precise timing and firing of interrupts, the
+scheduling algorithms of an operating system may be adversely affected by
+virtualization. In theory, the effect is random and should be universally
+distributed, but in contrived as well as real scenarios (guest device access,
+causes virtualization exits, possible context switch), this may not always be
+the case. The effect of this has not been well studied (ed: has it? any
+published results?).
+
+In an attempt to workaround this, several implementations have provided a
+paravirtualized scheduler clock, which reveals the true amount of CPU time for
+which a virtual machine has been running.
+
+4.6) Watchdogs
+
+Watchdog timers, such as the lock detector in Linux may fire accidentally when
+running under hardware virtualization due to timer interrupts being delayed or
+misinterpretation of the passage of real time. Usually, these warnings are
+spurious and can be ignored, but in some circumstances it may be necessary to
+disable such detection.
+
+4.7) Delays and precision timing
+
+Precise timing and delays may not be possible in a virtualized system. This
+can happen if the system is controlling physical hardware, or issues delays to
+compensate for slower I/O to and from devices. The first issue is not solvable
+in general for a virtualized system; hardware control software can't be
+adequately virtualized without a full real-time operating system, which would
+require an RT aware virtualization platform.
+
+The second issue may cause performance problems, but this is unlikely to be a
+significant issue. In many cases these delays may be eliminated through
+configuration or paravirtualization.
+
+4.8) Covert channels and leaks
+
+In addition to the above problems, time information will inevitably leak to the
+guest about the host in anything but a perfect implementation of virtualized
+time. This may allow the guest to infer the presence of a hypervisor (as in a
+red-pill type detection), and it may allow information to leak between guests
+by using CPU utilization itself as a signalling channel. Preventing such
+problems would require completely isolated virtual time which may not track
+real time any longer. This may be useful in certain security or QA contexts,
+but in general isn't recommended for real-world deployment scenarios.
+
--
1.7.1

2010-06-15 07:34:47

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 09/17] Move scale_delta into common header.

The scale_delta function for shift / multiply with 31-bit
precision moves to a common header so it can be used by both
kernel and kvm module.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/include/asm/pvclock.h | 38 ++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/pvclock.c | 3 ++-
2 files changed, 40 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index cd02f32..7f7e577 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -12,4 +12,42 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
struct pvclock_vcpu_time_info *vcpu,
struct timespec *ts);

+/*
+ * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
+ * yielding a 64-bit result.
+ */
+static inline u64 pvclock_scale_delta(u64 delta, u32 mul_frac, int shift)
+{
+ u64 product;
+#ifdef __i386__
+ u32 tmp1, tmp2;
+#endif
+
+ if (shift < 0)
+ delta >>= -shift;
+ else
+ delta <<= shift;
+
+#ifdef __i386__
+ __asm__ (
+ "mul %5 ; "
+ "mov %4,%%eax ; "
+ "mov %%edx,%4 ; "
+ "mul %5 ; "
+ "xor %5,%5 ; "
+ "add %4,%%eax ; "
+ "adc %5,%%edx ; "
+ : "=A" (product), "=r" (tmp1), "=r" (tmp2)
+ : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
+#elif defined(__x86_64__)
+ __asm__ (
+ "mul %%rdx ; shrd $32,%%rdx,%%rax"
+ : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
+#else
+#error implement me!
+#endif
+
+ return product;
+}
+
#endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 239427c..bab3b9e 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -82,7 +82,8 @@ static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
{
u64 delta = native_read_tsc() - shadow->tsc_timestamp;
- return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
+ return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
+ shadow->tsc_shift);
}

/*
--
1.7.1

2010-06-15 07:34:55

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 12/17] Add helper function get_kernel_ns

Add a helper function for the multiple places this is used. Note that it
must not be called in preemptible context, as that would mean the kernel
could enter software suspend state, which would cause non-atomic operation
of the monotonic_to_bootbased computation.

Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this
as well? Currently, they are not bootbased (but perhaps should be).

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/kvm/x86.c | 26 +++++++++++++-------------
1 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 703ea43..15c7317 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz,
__func__, base_khz, scaled_khz, shift, *pmultiplier);
}

+static inline u64 get_kernel_ns(void)
+{
+ struct timespec ts;
+
+ WARN_ON(preemptible());
+ ktime_get_ts(&ts);
+ monotonic_to_bootbased(&ts);
+ return timespec_to_ns(&ts);
+}
+
static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);

static inline void kvm_request_guest_time_update(struct kvm_vcpu *v)
@@ -924,7 +934,6 @@ static inline void kvm_request_guest_time_update(struct kvm_vcpu *v)

static int kvm_recompute_guest_time(struct kvm_vcpu *v)
{
- struct timespec ts;
unsigned long flags;
struct kvm_vcpu_arch *vcpu = &v->arch;
void *shared_kaddr;
@@ -944,9 +953,7 @@ static int kvm_recompute_guest_time(struct kvm_vcpu *v)
local_irq_save(flags);
this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
- ktime_get_ts(&ts);
- monotonic_to_bootbased(&ts);
- kernel_ns = timespec_to_ns(&ts);
+ kernel_ns = get_kernel_ns();
local_irq_restore(flags);

if (unlikely(this_tsc_khz == 0)) {
@@ -1865,11 +1872,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

/* Subtract elapsed cycle time from the delta computation */
if (check_tsc_unstable() && vcpu->arch.last_host_ns) {
- s64 delta;
- struct timespec ts;
- ktime_get_ts(&ts);
- monotonic_to_bootbased(&ts);
- delta = timespec_to_ns(&ts) - vcpu->arch.last_host_ns;
+ s64 delta = get_kernel_ns() - vcpu->arch.last_host_ns;
delta = delta * per_cpu(cpu_tsc_khz, cpu);
delta = delta / USEC_PER_SEC;
tsc_delta -= delta;
@@ -1898,10 +1901,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
* vcpu->cpu = -1 to force the recalibration above.
*/
if (check_tsc_unstable()) {
- struct timespec ts;
- ktime_get_ts(&ts);
- monotonic_to_bootbased(&ts);
- vcpu->arch.last_host_ns = timespec_to_ns(&ts);
+ vcpu->arch.last_host_ns = get_kernel_ns();
vcpu->cpu = -1;
}
}
--
1.7.1

2010-06-15 07:34:57

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 11/17] Fix a possible backwards warp of kvmclock

Kernel time, which advances in discrete steps may progress much slower
than TSC. As a result, when kvmclock is adjusted to a new base, the
apparent time to the guest, which runs at a much higher, nsec scaled
rate based on the current TSC, may have already been observed to have
a larger value (kernel_ns + scaled tsc) than the value to which we are
setting it (kernel_ns + 0).

We must instead compute the clock as potentially observed by the guest
for kernel_ns to make sure it does not go backwards.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++
arch/x86/kvm/x86.c | 79 +++++++++++++++++++++++++++++++++------
2 files changed, 71 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1afecd7..7ec2472 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -338,6 +338,8 @@ struct kvm_vcpu_arch {
struct page *time_page;
u64 last_host_tsc;
u64 last_host_ns;
+ u64 last_guest_tsc;
+ u64 last_kernel_ns;

bool nmi_pending;
bool nmi_injected;
@@ -455,6 +457,8 @@ struct kvm_vcpu_stat {
u32 hypercalls;
u32 irq_injections;
u32 nmi_injections;
+ u32 tsc_overshoot;
+ u32 tsc_ahead;
};

struct kvm_x86_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 52d7d34..703ea43 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -138,6 +138,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "insn_emulation_fail", VCPU_STAT(insn_emulation_fail) },
{ "irq_injections", VCPU_STAT(irq_injections) },
{ "nmi_injections", VCPU_STAT(nmi_injections) },
+ { "tsc_overshoot", VCPU_STAT(tsc_overshoot) },
+ { "tsc_ahead", VCPU_STAT(tsc_ahead) },
{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
@@ -927,33 +929,84 @@ static int kvm_recompute_guest_time(struct kvm_vcpu *v)
struct kvm_vcpu_arch *vcpu = &v->arch;
void *shared_kaddr;
unsigned long this_tsc_khz;
+ s64 kernel_ns, max_kernel_ns;
+ u64 tsc_timestamp;

if ((!vcpu->time_page))
return 0;

- this_tsc_khz = get_cpu_var(cpu_tsc_khz);
- put_cpu_var(cpu_tsc_khz);
+ /*
+ * The protection we require is simple: we must not be preempted from
+ * the CPU between our read of the TSC khz and our read of the TSC.
+ * Interrupt protection is not strictly required, but it does result in
+ * greater accuracy for the TSC / kernel_ns measurement.
+ */
+ local_irq_save(flags);
+ this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
+ kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
+ ktime_get_ts(&ts);
+ monotonic_to_bootbased(&ts);
+ kernel_ns = timespec_to_ns(&ts);
+ local_irq_restore(flags);
+
if (unlikely(this_tsc_khz == 0)) {
kvm_request_guest_time_update(v);
return 1;
}

+ /*
+ * Time as measured by the TSC may go backwards when resetting the base
+ * tsc_timestamp. The reason for this is that the TSC resolution is
+ * higher than the resolution of the other clock scales. Thus, many
+ * possible measurments of the TSC correspond to one measurement of any
+ * other clock, and so a spread of values is possible. This is not a
+ * problem for the computation of the nanosecond clock; with TSC rates
+ * around 1GHZ, there can only be a few cycles which correspond to one
+ * nanosecond value, and any path through this code will inevitably
+ * take longer than that. However, with the kernel_ns value itself,
+ * the precision may be much lower, down to HZ granularity. If the
+ * first sampling of TSC against kernel_ns ends in the low part of the
+ * range, and the second in the high end of the range, we can get:
+ *
+ * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
+ *
+ * As the sampling errors potentially range in the thousands of cycles,
+ * it is possible such a time value has already been observed by the
+ * guest. To protect against this, we must compute the system time as
+ * observed by the guest and ensure the new system time is greater.
+ */
+ max_kernel_ns = 0;
+ if (vcpu->hv_clock.tsc_timestamp) {
+ max_kernel_ns = vcpu->last_guest_tsc -
+ vcpu->hv_clock.tsc_timestamp;
+ max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
+ vcpu->hv_clock.tsc_to_system_mul,
+ vcpu->hv_clock.tsc_shift);
+ max_kernel_ns += vcpu->last_kernel_ns;
+ }
+
if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
- kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
+ kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
+ &vcpu->hv_clock.tsc_shift,
+ &vcpu->hv_clock.tsc_to_system_mul);
vcpu->hw_tsc_khz = this_tsc_khz;
}

- /* Keep irq disabled to prevent changes to the clock */
- local_irq_save(flags);
- kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp);
- ktime_get_ts(&ts);
- monotonic_to_bootbased(&ts);
- local_irq_restore(flags);
+ if (max_kernel_ns > kernel_ns) {
+ s64 overshoot = max_kernel_ns - kernel_ns;
+ ++v->stat.tsc_ahead;
+ if (overshoot > NSEC_PER_SEC / HZ) {
+ ++v->stat.tsc_overshoot;
+ if (printk_ratelimit())
+ pr_debug("ns overshoot: %lld\n", overshoot);
+ }
+ kernel_ns = max_kernel_ns;
+ }

/* With all the info we got, fill in the values */
-
- vcpu->hv_clock.system_time = ts.tv_nsec +
- (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset;
+ vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
+ vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
+ vcpu->last_kernel_ns = kernel_ns;

vcpu->hv_clock.flags = 0;

@@ -4836,6 +4889,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (hw_breakpoint_active())
hw_breakpoint_restore();

+ kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+
atomic_set(&vcpu->guest_mode, 0);
smp_wmb();
local_irq_enable();
--
1.7.1

2010-06-15 07:35:49

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 15/17] Fix AMD C1 TSC desynchronization

Some AMD based machines can have TSC drift when in C1 HLT state because
despite attempting to scale the TSC increment when dividing down the
P-state, the processor may return to full P-state to service cache
probes. The TSC of halted CPUs can advance faster than that of running
CPUs as a result, causing unpredictable TSC drift.

We implement a recommended workaround, which is disabling C1 clock ramping.
---
arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ef847ee..8e836e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -56,6 +56,11 @@
#include <asm/i387.h>
#include <asm/xcr.h>

+#ifdef CONFIG_K8_NB
+#include <linux/pci.h>
+#include <asm/k8.h>
+#endif
+
#define MAX_IO_MSRS 256
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -4287,10 +4292,43 @@ static struct notifier_block kvmclock_cpu_notifier_block = {
.priority = -INT_MAX
};

+static u8 disabled_c1_ramp = 0;
+
static void kvm_timer_init(void)
{
int cpu;

+ /*
+ * AMD processors can de-synchronize TSC on halt in C1 state, because
+ * processors in lower P state will have TSC scaled properly during
+ * normal operation, but will have TSC scaled improperly while
+ * servicing cache probes. Because there is no way to determine how
+ * TSC was adjusted during cache probes, there are two solutions:
+ * resynchronize after halt, or disable C1-clock ramping.
+ *
+ * We implemenent solution 2.
+ */
+#ifdef CONFIG_K8_NB
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+ boot_cpu_data.x86 == 0x0f &&
+ !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+ struct pci_dev *nb;
+ int i;
+ cache_k8_northbridges();
+ for (i = 0; i < num_k8_northbridges; i++) {
+ u8 byte;
+ nb = k8_northbridges[i];
+ pci_read_config_byte(nb, 0x87, &byte);
+ if (byte & 1) {
+ printk(KERN_INFO "%s: AMD C1 clock ramping detected, performing workaround\n", __func__);
+ disabled_c1_ramp = byte;
+ pci_write_config_byte(nb, 0x87, byte & 0xFC);
+
+ }
+ }
+ }
+#endif
+
register_hotcpu_notifier(&kvmclock_cpu_notifier_block);
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
@@ -4402,6 +4440,13 @@ void kvm_arch_exit(void)
unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
kvm_x86_ops = NULL;
kvm_mmu_module_exit();
+#ifdef CONFIG_K8_NB
+ if (disabled_c1_ramp) {
+ struct pci_dev **nb;
+ for (nb = k8_northbridges; *nb; nb++)
+ pci_write_config_byte(*nb, 0x87, disabled_c1_ramp);
+ }
+#endif
}

int kvm_emulate_halt(struct kvm_vcpu *vcpu)
--
1.7.1

2010-06-15 07:36:19

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 13/17] Add TSC offset tracking

Track the last TSC offset set for each VM and ensure that the storing of
the offset and the reading of the TSC are never preempted by taking a
spinlock.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 +++++-
arch/x86/kvm/svm.c | 30 +++++++++++++++++-------------
arch/x86/kvm/vmx.c | 21 +++++++--------------
arch/x86/kvm/x86.c | 20 +++++++++++++++++++-
arch/x86/kvm/x86.h | 2 ++
5 files changed, 50 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7ec2472..98d4de8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -410,8 +410,11 @@ struct kvm_arch {
gpa_t ept_identity_map_addr;

unsigned long irq_sources_bitmap;
- u64 vm_init_tsc;
s64 kvmclock_offset;
+ u64 last_tsc_write;
+ u64 last_tsc_offset;
+ u64 last_tsc_nsec;
+ spinlock_t tsc_write_lock;

struct kvm_xen_hvm_config xen_hvm_config;

@@ -536,6 +539,7 @@ struct kvm_x86_ops {
int (*get_lpage_level)(void);
bool (*rdtscp_supported)(void);
void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
+ void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);

void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ee2cf30..4654507 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2527,20 +2527,9 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
struct vcpu_svm *svm = to_svm(vcpu);

switch (ecx) {
- case MSR_IA32_TSC: {
- u64 tsc_offset = data - native_read_tsc();
- u64 g_tsc_offset = 0;
-
- if (is_nested(svm)) {
- g_tsc_offset = svm->vmcb->control.tsc_offset -
- svm->nested.hsave->control.tsc_offset;
- svm->nested.hsave->control.tsc_offset = tsc_offset;
- }
-
- svm->vmcb->control.tsc_offset = tsc_offset + g_tsc_offset;
-
+ case MSR_IA32_TSC:
+ guest_write_tsc(vcpu, data);
break;
- }
case MSR_K6_STAR:
svm->vmcb->save.star = data;
break;
@@ -3417,6 +3406,20 @@ static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
svm->nested.hsave->control.tsc_offset += adjustment;
}

+static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u64 g_tsc_offset = 0;
+
+ if (is_nested(svm)) {
+ g_tsc_offset = svm->vmcb->control.tsc_offset -
+ svm->nested.hsave->control.tsc_offset;
+ svm->nested.hsave->control.tsc_offset = offset;
+ }
+
+ svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
+}
+
static void svm_fpu_deactivate(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
@@ -3503,6 +3506,7 @@ static struct kvm_x86_ops svm_x86_ops = {
.set_supported_cpuid = svm_set_supported_cpuid,

.adjust_tsc_offset = svm_adjust_tsc_offset,
+ .write_tsc_offset = svm_write_tsc_offset,
};

static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a993e67..9b604b8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1133,12 +1133,11 @@ static u64 guest_read_tsc(void)
}

/*
- * writes 'guest_tsc' into guest's timestamp counter "register"
- * guest_tsc = host_tsc + tsc_offset ==> tsc_offset = guest_tsc - host_tsc
+ * writes 'offset' into guest's timestamp counter offset register
*/
-static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
+static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
{
- vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
+ vmcs_write64(TSC_OFFSET, offset);
}

static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
@@ -1217,7 +1216,6 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct shared_msr_entry *msr;
- u64 host_tsc;
int ret = 0;

switch (msr_index) {
@@ -1247,8 +1245,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
vmcs_writel(GUEST_SYSENTER_ESP, data);
break;
case MSR_IA32_TSC:
- rdtscll(host_tsc);
- guest_write_tsc(data, host_tsc);
+ guest_write_tsc(vcpu, data);
break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
@@ -2503,7 +2500,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
{
u32 host_sysenter_cs, msr_low, msr_high;
u32 junk;
- u64 host_pat, tsc_this, tsc_base;
+ u64 host_pat;
unsigned long a;
struct desc_ptr dt;
int i;
@@ -2644,12 +2641,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);

- tsc_base = vmx->vcpu.kvm->arch.vm_init_tsc;
- rdtscll(tsc_this);
- if (tsc_this < vmx->vcpu.kvm->arch.vm_init_tsc)
- tsc_base = tsc_this;
-
- guest_write_tsc(0, tsc_base);
+ guest_write_tsc(&vmx->vcpu, 0);

return 0;
}
@@ -4336,6 +4328,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
.set_supported_cpuid = vmx_set_supported_cpuid,

.adjust_tsc_offset = vmx_adjust_tsc_offset,
+ .write_tsc_offset = vmx_write_tsc_offset,
};

static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 15c7317..ef847ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -932,6 +932,24 @@ static inline void kvm_request_guest_time_update(struct kvm_vcpu *v)
set_bit(KVM_REQ_CLOCK_SYNC, &v->requests);
}

+void guest_write_tsc(struct kvm_vcpu *vcpu, u64 data)
+{
+ struct kvm *kvm = vcpu->kvm;
+ u64 offset;
+
+ spin_lock(&kvm->arch.tsc_write_lock);
+ offset = data - native_read_tsc();
+ kvm->arch.last_tsc_nsec = get_kernel_ns();
+ kvm->arch.last_tsc_write = data;
+ kvm->arch.last_tsc_offset = offset;
+ kvm_x86_ops->write_tsc_offset(vcpu, offset);
+ spin_unlock(&kvm->arch.tsc_write_lock);
+
+ /* Reset of TSC must disable overshoot protection below */
+ vcpu->arch.hv_clock.tsc_timestamp = 0;
+}
+EXPORT_SYMBOL_GPL(guest_write_tsc);
+
static int kvm_recompute_guest_time(struct kvm_vcpu *v)
{
unsigned long flags;
@@ -5616,7 +5634,7 @@ struct kvm *kvm_arch_create_vm(void)
/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);

- rdtscll(kvm->arch.vm_init_tsc);
+ spin_lock_init(&kvm->arch.tsc_write_lock);

return kvm;
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f4b5445..ce2aff8 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -75,4 +75,6 @@ static inline struct kvm_mem_aliases *kvm_aliases(struct kvm *kvm)
void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);

+void guest_write_tsc(struct kvm_vcpu *vcpu, u64 data);
+
#endif
--
1.7.1

2010-06-15 07:36:49

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 10/17] Make KVM clock computation work for other scales

The math in kvm_get_time_scale relies on the fact that
NSEC_PER_SEC < 2^32. To use the same function to compute
arbitrary time scales, we must extend the first reduction
step to shrink the base rate to a 32-bit value, and
possibly reduce the scaled rate into a 32-bit as well.

Note we must take care to avoid an arithmetic overflow
when scaling up the tps32 value (this could not happen
with the fixed scaled value of NSEC_PER_SEC, but can
happen with scaled rates above 2^31.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/kvm/x86.c | 26 +++++++++++++++-----------
1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 05c559d..52d7d34 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -882,31 +882,35 @@ static uint32_t div_frac(uint32_t dividend, uint32_t divisor)
return quotient;
}

-static void kvm_set_time_scale(uint32_t tsc_khz, struct pvclock_vcpu_time_info *hv_clock)
+static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz,
+ s8 *pshift, u32 *pmultiplier)
{
- uint64_t nsecs = 1000000000LL;
+ uint64_t scaled64;
int32_t shift = 0;
uint64_t tps64;
uint32_t tps32;

- tps64 = tsc_khz * 1000LL;
- while (tps64 > nsecs*2) {
+ tps64 = base_khz * 1000LL;
+ scaled64 = scaled_khz * 1000LL;
+ while (tps64 > scaled64*2 || tps64 & 0xffffffff00000000UL) {
tps64 >>= 1;
shift--;
}

tps32 = (uint32_t)tps64;
- while (tps32 <= (uint32_t)nsecs) {
- tps32 <<= 1;
+ while (tps32 <= scaled64 || scaled64 & 0xffffffff00000000UL) {
+ if (scaled64 & 0xffffffff00000000UL || tps32 & 0x80000000)
+ scaled64 >>= 1;
+ else
+ tps32 <<= 1;
shift++;
}

- hv_clock->tsc_shift = shift;
- hv_clock->tsc_to_system_mul = div_frac(nsecs, tps32);
+ *pshift = shift;
+ *pmultiplier = div_frac(scaled64, tps32);

- pr_debug("%s: tsc_khz %u, tsc_shift %d, tsc_mul %u\n",
- __func__, tsc_khz, hv_clock->tsc_shift,
- hv_clock->tsc_to_system_mul);
+ pr_debug("%s: base_khz %u => %u, shift %d, mul %u\n",
+ __func__, base_khz, scaled_khz, shift, *pmultiplier);
}

static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
--
1.7.1

2010-06-15 07:34:33

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 03/17] Unify vendor TSC logic

Move the TSC control logic from the vendor backends into x86.c
by adding adjust_tsc_offset to x86 ops. Now all TSC decisions
can be done in one place.

Also, rename some variable in the VCPU structure to more accurately
reflect their actual content.

VMX backend would record last observed TSC very late (possibly in an
IPI to clear the VMCS from a remote CPU), now it is done earlier.

Note this is still not yet perfect. We adjust TSC in both
directions when the TSC is unstable, resulting in desynchronized
TSCs for SMP guests. A better choice would be to compensate based
on a master reference clock.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 +++--
arch/x86/kvm/svm.c | 25 +++++++++++--------------
arch/x86/kvm/vmx.c | 20 ++++++++------------
arch/x86/kvm/x86.c | 16 +++++++++++-----
4 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 91631b8..94f6ce8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -253,7 +253,6 @@ struct kvm_mmu {
};

struct kvm_vcpu_arch {
- u64 host_tsc;
/*
* rip and regs accesses must go through
* kvm_{register,rip}_{read,write} functions.
@@ -334,9 +333,10 @@ struct kvm_vcpu_arch {

gpa_t time;
struct pvclock_vcpu_time_info hv_clock;
- unsigned int hv_clock_tsc_khz;
+ unsigned int hw_tsc_khz;
unsigned int time_offset;
struct page *time_page;
+ u64 last_host_tsc;

bool nmi_pending;
bool nmi_injected;
@@ -530,6 +530,7 @@ struct kvm_x86_ops {
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
int (*get_lpage_level)(void);
bool (*rdtscp_supported)(void);
+ void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);

void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 09b2145..ee2cf30 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -948,18 +948,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
int i;

if (unlikely(cpu != vcpu->cpu)) {
- u64 delta;
-
- if (check_tsc_unstable()) {
- /*
- * Make sure that the guest sees a monotonically
- * increasing TSC.
- */
- delta = vcpu->arch.host_tsc - native_read_tsc();
- svm->vmcb->control.tsc_offset += delta;
- if (is_nested(svm))
- svm->nested.hsave->control.tsc_offset += delta;
- }
svm->asid_generation = 0;
}

@@ -975,8 +963,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
++vcpu->stat.host_state_reload;
for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
-
- vcpu->arch.host_tsc = native_read_tsc();
}

static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
@@ -3422,6 +3408,15 @@ static bool svm_rdtscp_supported(void)
return false;
}

+static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm->vmcb->control.tsc_offset += adjustment;
+ if (is_nested(svm))
+ svm->nested.hsave->control.tsc_offset += adjustment;
+}
+
static void svm_fpu_deactivate(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
@@ -3506,6 +3501,8 @@ static struct kvm_x86_ops svm_x86_ops = {
.rdtscp_supported = svm_rdtscp_supported,

.set_supported_cpuid = svm_set_supported_cpuid,
+
+ .adjust_tsc_offset = svm_adjust_tsc_offset,
};

static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a657e08..a993e67 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -498,7 +498,6 @@ static void __vcpu_clear(void *arg)
vmcs_clear(vmx->vmcs);
if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
per_cpu(current_vmcs, cpu) = NULL;
- rdtscll(vmx->vcpu.arch.host_tsc);
list_del(&vmx->local_vcpus_link);
vmx->vcpu.cpu = -1;
vmx->launched = 0;
@@ -881,7 +880,6 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- u64 tsc_this, delta, new_offset;
u64 phys_addr = __pa(per_cpu(vmxarea, cpu));

if (!vmm_exclusive)
@@ -914,16 +912,6 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
-
- /*
- * Make sure the time stamp counter is monotonous.
- */
- rdtscll(tsc_this);
- if (tsc_this < vcpu->arch.host_tsc) {
- delta = vcpu->arch.host_tsc - tsc_this;
- new_offset = vmcs_read64(TSC_OFFSET) + delta;
- vmcs_write64(TSC_OFFSET, new_offset);
- }
}
}

@@ -1153,6 +1141,12 @@ static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
}

+static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+{
+ u64 offset = vmcs_read64(TSC_OFFSET);
+ vmcs_write64(TSC_OFFSET, offset + adjustment);
+}
+
/*
* Reads an msr value (of 'msr_index') into 'pdata'.
* Returns 0 on success, non-0 otherwise.
@@ -4340,6 +4334,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
.rdtscp_supported = vmx_rdtscp_supported,

.set_supported_cpuid = vmx_set_supported_cpuid,
+
+ .adjust_tsc_offset = vmx_adjust_tsc_offset,
};

static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0a102d3..c8289d0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -929,9 +929,9 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
return 1;
}

- if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) {
+ if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
- vcpu->hv_clock_tsc_khz = this_tsc_khz;
+ vcpu->hw_tsc_khz = this_tsc_khz;
}

/* Keep irq disabled to prevent changes to the clock */
@@ -1805,18 +1805,24 @@ out:

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
+ kvm_x86_ops->vcpu_load(vcpu, cpu);
if (unlikely(vcpu->cpu != cpu)) {
+ /* Make sure TSC doesn't go backwards */
+ s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
+ native_read_tsc() - vcpu->arch.last_host_tsc;
+ if (tsc_delta < 0 || check_tsc_unstable())
+ kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
kvm_migrate_timers(vcpu);
+ kvm_request_guest_time_update(vcpu);
+ vcpu->cpu = cpu;
}
- kvm_x86_ops->vcpu_load(vcpu, cpu);
- kvm_request_guest_time_update(vcpu);
- vcpu->cpu = cpu;
}

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
+ vcpu->arch.last_host_tsc = native_read_tsc();
}

static int is_efer_nx(void)
--
1.7.1

2010-06-15 07:34:41

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 06/17] Rename KVM_REQ_KVMCLOCK_UPDATE

This bit will be used to track all clock synchronization events,
so rename it KVM_REQ_CLOCK_SYNC. Also it makes some spacing better.

As prep for using it for all synchronization, we can't skip
setting the bit in the event there is no vcpu time page.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/kvm/x86.c | 26 ++++++++++----------------
include/linux/kvm_host.h | 2 +-
2 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b1bdf05..4b15d03 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -911,7 +911,12 @@ static void kvm_set_time_scale(uint32_t tsc_khz, struct pvclock_vcpu_time_info *

static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);

-static int kvm_write_guest_time(struct kvm_vcpu *v)
+static inline void kvm_request_guest_time_update(struct kvm_vcpu *v)
+{
+ set_bit(KVM_REQ_CLOCK_SYNC, &v->requests);
+}
+
+static int kvm_recompute_guest_time(struct kvm_vcpu *v)
{
struct timespec ts;
unsigned long flags;
@@ -925,7 +930,7 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
this_tsc_khz = get_cpu_var(cpu_tsc_khz);
put_cpu_var(cpu_tsc_khz);
if (unlikely(this_tsc_khz == 0)) {
- set_bit(KVM_REQ_KVMCLOCK_UPDATE, &v->requests);
+ kvm_request_guest_time_update(v);
return 1;
}

@@ -966,16 +971,6 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
return 0;
}

-static int kvm_request_guest_time_update(struct kvm_vcpu *v)
-{
- struct kvm_vcpu_arch *vcpu = &v->arch;
-
- if (!vcpu->time_page)
- return 0;
- set_bit(KVM_REQ_KVMCLOCK_UPDATE, &v->requests);
- return 1;
-}
-
static bool msr_mtrr_valid(unsigned msr)
{
switch (msr) {
@@ -4166,8 +4161,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
kvm_for_each_vcpu(i, vcpu, kvm) {
if (vcpu->cpu != freq->cpu)
continue;
- if (!kvm_request_guest_time_update(vcpu))
- continue;
+ kvm_request_guest_time_update(vcpu);
if (vcpu->cpu != smp_processor_id())
send_ipi = 1;
}
@@ -4752,8 +4746,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (vcpu->requests) {
if (test_and_clear_bit(KVM_REQ_MIGRATE_TIMER, &vcpu->requests))
__kvm_migrate_timers(vcpu);
- if (test_and_clear_bit(KVM_REQ_KVMCLOCK_UPDATE, &vcpu->requests)) {
- r = kvm_write_guest_time(vcpu);
+ if (test_and_clear_bit(KVM_REQ_CLOCK_SYNC, &vcpu->requests)) {
+ r = kvm_recompute_guest_time(vcpu);
if (unlikely(r))
goto out;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2d96555..142d025 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -36,7 +36,7 @@
#define KVM_REQ_PENDING_TIMER 5
#define KVM_REQ_UNHALT 6
#define KVM_REQ_MMU_SYNC 7
-#define KVM_REQ_KVMCLOCK_UPDATE 8
+#define KVM_REQ_CLOCK_SYNC 8
#define KVM_REQ_KICK 9
#define KVM_REQ_DEACTIVATE_FPU 10

--
1.7.1

2010-06-15 07:37:44

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 05/17] Keep SMP VMs more in sync on unstable TSC

SMP VMs on machines with unstable TSC have their TSC offset adjusted by the
local offset delta from last measurement. This does not take into account how
long it has been since the measurement, leading to drift. Minimize the drift
by accounting for any time difference the kernel has observed.

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 20 +++++++++++++++++++-
2 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 94f6ce8..1afecd7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -337,6 +337,7 @@ struct kvm_vcpu_arch {
unsigned int time_offset;
struct page *time_page;
u64 last_host_tsc;
+ u64 last_host_ns;

bool nmi_pending;
bool nmi_injected;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 618c435..b1bdf05 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1810,6 +1810,19 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
/* Make sure TSC doesn't go backwards */
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
native_read_tsc() - vcpu->arch.last_host_tsc;
+
+ /* Subtract elapsed cycle time from the delta computation */
+ if (check_tsc_unstable() && vcpu->arch.last_host_ns) {
+ s64 delta;
+ struct timespec ts;
+ ktime_get_ts(&ts);
+ monotonic_to_bootbased(&ts);
+ delta = timespec_to_ns(&ts) - vcpu->arch.last_host_ns;
+ delta = delta * per_cpu(cpu_tsc_khz, cpu);
+ delta = delta / USEC_PER_SEC;
+ tsc_delta -= delta;
+ }
+
if (tsc_delta < 0 || check_tsc_unstable())
kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
kvm_migrate_timers(vcpu);
@@ -1832,8 +1845,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
* vcpu->cpu != cpu can not detect this condition. So set
* vcpu->cpu = -1 to force the recalibration above.
*/
- if (check_tsc_unstable())
+ if (check_tsc_unstable()) {
+ struct timespec ts;
+ ktime_get_ts(&ts);
+ monotonic_to_bootbased(&ts);
+ vcpu->arch.last_host_ns = timespec_to_ns(&ts);
vcpu->cpu = -1;
+ }
}

static int is_efer_nx(void)
--
1.7.1

2010-06-15 07:38:20

by Zachary Amsden

[permalink] [raw]
Subject: [PATCH 02/17] Make cpu_tsc_khz updates use local CPU.

This simplifies much of the init code; we can now simply always
call tsc_khz_changed, optionally passing it a new value, or letting
it figure out the existing value (while interrupts are disabled, and
thus, by inference from the rule, not raceful against CPU hotplug or
frequency updates, which will issue IPIs to the local CPU to perform
this very same task).

Signed-off-by: Zachary Amsden <[email protected]>
---
arch/x86/kvm/x86.c | 136 ++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 100 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 05754c1..0a102d3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -911,7 +911,7 @@ static void kvm_set_time_scale(uint32_t tsc_khz, struct pvclock_vcpu_time_info *

static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);

-static void kvm_write_guest_time(struct kvm_vcpu *v)
+static int kvm_write_guest_time(struct kvm_vcpu *v)
{
struct timespec ts;
unsigned long flags;
@@ -920,14 +920,19 @@ static void kvm_write_guest_time(struct kvm_vcpu *v)
unsigned long this_tsc_khz;

if ((!vcpu->time_page))
- return;
+ return 0;

this_tsc_khz = get_cpu_var(cpu_tsc_khz);
+ put_cpu_var(cpu_tsc_khz);
+ if (unlikely(this_tsc_khz == 0)) {
+ set_bit(KVM_REQ_KVMCLOCK_UPDATE, &v->requests);
+ return 1;
+ }
+
if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) {
kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
vcpu->hv_clock_tsc_khz = this_tsc_khz;
}
- put_cpu_var(cpu_tsc_khz);

/* Keep irq disabled to prevent changes to the clock */
local_irq_save(flags);
@@ -958,6 +963,7 @@ static void kvm_write_guest_time(struct kvm_vcpu *v)
kunmap_atomic(shared_kaddr, KM_USER0);

mark_page_dirty(v->kvm, vcpu->time >> PAGE_SHIFT);
+ return 0;
}

static int kvm_request_guest_time_update(struct kvm_vcpu *v)
@@ -1803,12 +1809,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_migrate_timers(vcpu);
}
kvm_x86_ops->vcpu_load(vcpu, cpu);
- if (unlikely(per_cpu(cpu_tsc_khz, cpu) == 0)) {
- unsigned long khz = cpufreq_quick_get(cpu);
- if (!khz)
- khz = tsc_khz;
- per_cpu(cpu_tsc_khz, cpu) = khz;
- }
kvm_request_guest_time_update(vcpu);
vcpu->cpu = cpu;
}
@@ -4053,9 +4053,23 @@ int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port)
}
EXPORT_SYMBOL_GPL(kvm_fast_pio_out);

-static void bounce_off(void *info)
+static void tsc_bad(void *info)
{
- /* nothing */
+ per_cpu(cpu_tsc_khz, raw_smp_processor_id()) = 0;
+}
+
+static void tsc_khz_changed(void *data)
+{
+ struct cpufreq_freqs *freq = data;
+
+ if (data) {
+ per_cpu(cpu_tsc_khz, freq->cpu) = freq->new;
+ } else if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+ __get_cpu_var(cpu_tsc_khz) =
+ cpufreq_quick_get(raw_smp_processor_id());
+ }
+ if (!per_cpu(cpu_tsc_khz, raw_smp_processor_id()))
+ per_cpu(cpu_tsc_khz, raw_smp_processor_id()) = tsc_khz;
}

static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
@@ -4066,11 +4080,51 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
struct kvm_vcpu *vcpu;
int i, send_ipi = 0;

+ /*
+ * We allow guests to temporarily run on slowing clocks,
+ * provided we notify them after, or to run on accelerating
+ * clocks, provided we notify them before. Thus time never
+ * goes backwards.
+ *
+ * However, we have a problem. We can't atomically update
+ * the frequency of a given CPU from this function; it is
+ * merely a notifier, which can be called from any CPU.
+ * Changing the TSC frequency at arbitrary points in time
+ * requires a recomputation of local variables related to
+ * the TSC for each VCPU. We must flag these local variables
+ * to be updated and be sure the update takes place with the
+ * new frequency before any guests proceed.
+ *
+ * Unfortunately, the combination of hotplug CPU and frequency
+ * change creates an intractable locking scenario; the order
+ * of when these callouts happen is undefined with respect to
+ * CPU hotplug, and they can race with each other. As such,
+ * merely setting per_cpu(cpu_tsc_khz) = X during a hotadd is
+ * undefined; you can actually have a CPU frequency change take
+ * place in between the computation of X and the setting of the
+ * variable. To protect against this problem, all updates of
+ * the per_cpu tsc_khz variable are done in an interrupt
+ * protected IPI, and all callers wishing to update the value
+ * must wait for a synchronous IPI to complete (which is trivial
+ * if the caller is on the CPU already). This establishes the
+ * necessary total order on variable updates.
+ *
+ * Note that because the guest time update may take place
+ * anytime after the setting of the VCPU's request bit, the
+ * correct TSC value must be set before the request. However,
+ * to ensure the update actually makes it to any guest which
+ * starts running in hardware virtualization between the set
+ * and the acquisition of the spinlock, we must also ping the
+ * CPU after setting the request bit.
+ *
+ */
+
if (val == CPUFREQ_PRECHANGE && freq->old > freq->new)
return 0;
if (val == CPUFREQ_POSTCHANGE && freq->old < freq->new)
return 0;
- per_cpu(cpu_tsc_khz, freq->cpu) = freq->new;
+
+ smp_call_function_single(freq->cpu, tsc_khz_changed, freq, 1);

spin_lock(&kvm_lock);
list_for_each_entry(kvm, &vm_list, vm_list) {
@@ -4080,7 +4134,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
if (!kvm_request_guest_time_update(vcpu))
continue;
if (vcpu->cpu != smp_processor_id())
- send_ipi++;
+ send_ipi = 1;
}
}
spin_unlock(&kvm_lock);
@@ -4098,32 +4152,48 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
* guest context is entered kvmclock will be updated,
* so the guest will not see stale values.
*/
- smp_call_function_single(freq->cpu, bounce_off, NULL, 1);
+ smp_call_function_single(freq->cpu, tsc_khz_changed, freq, 1);
}
return 0;
}

static struct notifier_block kvmclock_cpufreq_notifier_block = {
- .notifier_call = kvmclock_cpufreq_notifier
+ .notifier_call = kvmclock_cpufreq_notifier
+};
+
+static int kvmclock_cpu_notifier(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (unsigned long)hcpu;
+
+ switch (action) {
+ case CPU_ONLINE:
+ case CPU_DOWN_FAILED:
+ smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
+ break;
+ case CPU_DOWN_PREPARE:
+ smp_call_function_single(cpu, tsc_bad, NULL, 1);
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block kvmclock_cpu_notifier_block = {
+ .notifier_call = kvmclock_cpu_notifier,
+ .priority = -INT_MAX
};

static void kvm_timer_init(void)
{
int cpu;

+ register_hotcpu_notifier(&kvmclock_cpu_notifier_block);
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
CPUFREQ_TRANSITION_NOTIFIER);
- for_each_online_cpu(cpu) {
- unsigned long khz = cpufreq_get(cpu);
- if (!khz)
- khz = tsc_khz;
- per_cpu(cpu_tsc_khz, cpu) = khz;
- }
- } else {
- for_each_possible_cpu(cpu)
- per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
}
+ for_each_online_cpu(cpu)
+ smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
}

static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
@@ -4225,6 +4295,7 @@ void kvm_arch_exit(void)
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
CPUFREQ_TRANSITION_NOTIFIER);
+ unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
kvm_x86_ops = NULL;
kvm_mmu_module_exit();
}
@@ -4646,8 +4717,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (vcpu->requests) {
if (test_and_clear_bit(KVM_REQ_MIGRATE_TIMER, &vcpu->requests))
__kvm_migrate_timers(vcpu);
- if (test_and_clear_bit(KVM_REQ_KVMCLOCK_UPDATE, &vcpu->requests))
- kvm_write_guest_time(vcpu);
+ if (test_and_clear_bit(KVM_REQ_KVMCLOCK_UPDATE, &vcpu->requests)) {
+ r = kvm_write_guest_time(vcpu);
+ if (unlikely(r))
+ goto out;
+ }
if (test_and_clear_bit(KVM_REQ_MMU_SYNC, &vcpu->requests))
kvm_mmu_sync_roots(vcpu);
if (test_and_clear_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
@@ -5339,17 +5413,7 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)

int kvm_arch_hardware_enable(void *garbage)
{
- /*
- * Since this may be called from a hotplug notifcation,
- * we can't get the CPU frequency directly.
- */
- if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
- int cpu = raw_smp_processor_id();
- per_cpu(cpu_tsc_khz, cpu) = 0;
- }
-
kvm_shared_msr_cpu_online();
-
return kvm_x86_ops->hardware_enable(garbage);
}

--
1.7.1

2010-06-15 08:02:54

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 02/17] Make cpu_tsc_khz updates use local CPU.

On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> This simplifies much of the init code; we can now simply always
> call tsc_khz_changed, optionally passing it a new value, or letting
> it figure out the existing value (while interrupts are disabled, and
> thus, by inference from the rule, not raceful against CPU hotplug or
> frequency updates, which will issue IPIs to the local CPU to perform
> this very same task).
>
>
> -static void bounce_off(void *info)
> +static void tsc_bad(void *info)
> {
> - /* nothing */
> + per_cpu(cpu_tsc_khz, raw_smp_processor_id()) = 0;
>

Can use __get_cpu_var(cpu_tsc_khz) = 0;

> +}
> +
> +static void tsc_khz_changed(void *data)
> +{
> + struct cpufreq_freqs *freq = data;
> +
> + if (data) {
> + per_cpu(cpu_tsc_khz, freq->cpu) = freq->new;
> + } else if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> + __get_cpu_var(cpu_tsc_khz) =
> + cpufreq_quick_get(raw_smp_processor_id());
> + }
>

Excessive bracketing.

> + if (!per_cpu(cpu_tsc_khz, raw_smp_processor_id()))
> + per_cpu(cpu_tsc_khz, raw_smp_processor_id()) = tsc_khz;
> }
>
>
> +static int kvmclock_cpu_notifier(struct notifier_block *nfb,
> + unsigned long action, void *hcpu)
> +{
> + unsigned int cpu = (unsigned long)hcpu;
> +
> + switch (action) {
> + case CPU_ONLINE:
> + case CPU_DOWN_FAILED:
> + smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
> + break;
> + case CPU_DOWN_PREPARE:
> + smp_call_function_single(cpu, tsc_bad, NULL, 1);
> + break;
>

A little frightening in the chance for livelock, but I think you got it
right. vcpus on the doomed cpu will have need_resched() set and will
not be trapped in a loop trying to write the clock and failing.


--
error compiling committee.c: too many arguments to function

2010-06-15 08:09:39

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 04/17] Fix deep C-state TSC desynchronization

On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> When CPUs with unstable TSCs enter deep C-state, TSC may stop
> running. This causes us to require resynchronization. Since
> we can't tell when this may potentially happen, we assume the
> worst by forcing re-compensation for it at every point the VCPU
> task is descheduled.
>
> Signed-off-by: Zachary Amsden<[email protected]>
> ---
> arch/x86/kvm/x86.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c8289d0..618c435 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1822,7 +1822,18 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> {
> kvm_x86_ops->vcpu_put(vcpu);
> kvm_put_guest_fpu(vcpu);
> +
> vcpu->arch.last_host_tsc = native_read_tsc();
> +
> + /*
> + * When potentially leaving a CPU with unstable TSCs, we risk
> + * that the CPU enters deep C-state. If it does, the TSC may
> + * go out of sync but we will not recalibrate because the test
> + * vcpu->cpu != cpu can not detect this condition. So set
> + * vcpu->cpu = -1 to force the recalibration above.
> + */
> + if (check_tsc_unstable())
> + vcpu->cpu = -1;
> }
>

That will cause us to miss a vmclear later on. Also it invalidates an
invariant that the per-cpu list vcpus_on_cpu has all the vcpus with
vcpu->cpu == cpu on this cpu (try saything that fast).

--
error compiling committee.c: too many arguments to function

2010-06-15 08:11:56

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 05/17] Keep SMP VMs more in sync on unstable TSC

On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> SMP VMs on machines with unstable TSC have their TSC offset adjusted by the
> local offset delta from last measurement. This does not take into account how
> long it has been since the measurement, leading to drift. Minimize the drift
> by accounting for any time difference the kernel has observed.
>
> Signed-off-by: Zachary Amsden<[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 20 +++++++++++++++++++-
> 2 files changed, 20 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 94f6ce8..1afecd7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -337,6 +337,7 @@ struct kvm_vcpu_arch {
> unsigned int time_offset;
> struct page *time_page;
> u64 last_host_tsc;
> + u64 last_host_ns;
>
> bool nmi_pending;
> bool nmi_injected;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 618c435..b1bdf05 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1810,6 +1810,19 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> /* Make sure TSC doesn't go backwards */
> s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
> native_read_tsc() - vcpu->arch.last_host_tsc;
> +
> + /* Subtract elapsed cycle time from the delta computation */
> + if (check_tsc_unstable()&& vcpu->arch.last_host_ns) {
> + s64 delta;
> + struct timespec ts;
> + ktime_get_ts(&ts);
> + monotonic_to_bootbased(&ts);
> + delta = timespec_to_ns(&ts) - vcpu->arch.last_host_ns;
> + delta = delta * per_cpu(cpu_tsc_khz, cpu);
> + delta = delta / USEC_PER_SEC;
> + tsc_delta -= delta;
> + }
> +
> if (tsc_delta< 0 || check_tsc_unstable())
> kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
> kvm_migrate_timers(vcpu);
> @@ -1832,8 +1845,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> * vcpu->cpu != cpu can not detect this condition. So set
> * vcpu->cpu = -1 to force the recalibration above.
> */
> - if (check_tsc_unstable())
> + if (check_tsc_unstable()) {
> + struct timespec ts;
> + ktime_get_ts(&ts);
> + monotonic_to_bootbased(&ts);
> + vcpu->arch.last_host_ns = timespec_to_ns(&ts);
> vcpu->cpu = -1;
> + }
> }
>

Is there no way to do this calculation entirely with ktime_ts? struct
timespec has nasty multiplies and divides.

--
error compiling committee.c: too many arguments to function

2010-06-15 08:14:10

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 04/17] Fix deep C-state TSC desynchronization

On 06/14/2010 10:09 PM, Avi Kivity wrote:
> On 06/15/2010 10:34 AM, Zachary Amsden wrote:
>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>> running. This causes us to require resynchronization. Since
>> we can't tell when this may potentially happen, we assume the
>> worst by forcing re-compensation for it at every point the VCPU
>> task is descheduled.
>>
>> Signed-off-by: Zachary Amsden<[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 11 +++++++++++
>> 1 files changed, 11 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index c8289d0..618c435 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1822,7 +1822,18 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>> {
>> kvm_x86_ops->vcpu_put(vcpu);
>> kvm_put_guest_fpu(vcpu);
>> +
>> vcpu->arch.last_host_tsc = native_read_tsc();
>> +
>> + /*
>> + * When potentially leaving a CPU with unstable TSCs, we risk
>> + * that the CPU enters deep C-state. If it does, the TSC may
>> + * go out of sync but we will not recalibrate because the test
>> + * vcpu->cpu != cpu can not detect this condition. So set
>> + * vcpu->cpu = -1 to force the recalibration above.
>> + */
>> + if (check_tsc_unstable())
>> + vcpu->cpu = -1;
>> }
>
> That will cause us to miss a vmclear later on. Also it invalidates an
> invariant that the per-cpu list vcpus_on_cpu has all the vcpus with
> vcpu->cpu == cpu on this cpu (try saything that fast).
>

Blasted SVM / VMX differences! There are other, less elegant ways to
accomplish this task however.

Good catch, BTW.

2010-06-15 08:24:56

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 08/17] Add clock sync request to hardware enable

On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> If there are active VCPUs which are marked as belonging to
> a particular hardware CPU, request a clock sync for them when
> enabling hardware; the TSC could be desynchronized on a newly
> arriving CPU, and we need to recompute guests system time
> relative to boot after a suspend event.
>
> This covers both cases.
>
> Note that it is acceptable to take the spinlock, as either
> no other tasks will be running and no locks held (BSP after
> resume), or other tasks will be guaranteed to drop the lock
> relatively quickly (AP on CPU_STARTING).
>
> Noting we now get clock synchronization requests for VCPUs
> which are starting up (or restarting), it is tempting to
> attempt to remove the arch/x86/kvm/x86.c CPU hot-notifiers
> at this time, however it is not correct to do so; they are
> required for systems with non-constant TSC as the frequency
> may not be known immediately after the processor has started
> until the cpufreq driver has had a chance to run and query
> the chipset.
>
> Updated: implement better locking semantics for hardware_enable
>
> Removed the hack of dropping and retaking the lock by adding the
> semantic that we always hold kvm_lock when hardware_enable is
> called. The one place that doesn't need to worry about it is
> resume, as resuming a frozen CPU, the spinlock won't be taken.
>
> Signed-off-by: Zachary Amsden<[email protected]>
> ---
> arch/x86/kvm/x86.c | 8 ++++++++
> virt/kvm/kvm_main.c | 6 +++++-
> 2 files changed, 13 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4b15d03..05c559d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5442,7 +5442,15 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
>
> int kvm_arch_hardware_enable(void *garbage)
> {
> + struct kvm *kvm;
> + struct kvm_vcpu *vcpu;
> + int i;
> +
> kvm_shared_msr_cpu_online();
> + list_for_each_entry(kvm,&vm_list, vm_list)
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + if (vcpu->cpu == smp_processor_id())
> + kvm_request_guest_time_update(vcpu);
> return kvm_x86_ops->hardware_enable(garbage);
> }
>
>

An alternative to this loop (and a similar one in the cpu frequency
notifier earlier) is to have a per-cpu cpu_freq_generation_counter,
which is checked on every entry against a snapshot of the counter in the
vcpu. This would replace the loop with an O(1) mechanism, at the cost
of another compare.

I don't think it's worthwhile at this point though, just something to
keep in mind.

--
error compiling committee.c: too many arguments to function

2010-06-15 08:40:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> Kernel time, which advances in discrete steps may progress much slower
> than TSC. As a result, when kvmclock is adjusted to a new base, the
> apparent time to the guest, which runs at a much higher, nsec scaled
> rate based on the current TSC, may have already been observed to have
> a larger value (kernel_ns + scaled tsc) than the value to which we are
> setting it (kernel_ns + 0).
>
> We must instead compute the clock as potentially observed by the guest
> for kernel_ns to make sure it does not go backwards.
>
> @@ -455,6 +457,8 @@ struct kvm_vcpu_stat {
> u32 hypercalls;
> u32 irq_injections;
> u32 nmi_injections;
> + u32 tsc_overshoot;
> + u32 tsc_ahead;
> };
>

Please don't add new stats, instead add tracepoints which can also be
observed as stats.

But does this really merit exposing? What would one do with this
information?

> struct kvm_vcpu_arch *vcpu =&v->arch;
> void *shared_kaddr;
> unsigned long this_tsc_khz;
> + s64 kernel_ns, max_kernel_ns;
> + u64 tsc_timestamp;
>
> if ((!vcpu->time_page))
> return 0;
>
> - this_tsc_khz = get_cpu_var(cpu_tsc_khz);
> - put_cpu_var(cpu_tsc_khz);
> + /*
> + * The protection we require is simple: we must not be preempted from
> + * the CPU between our read of the TSC khz and our read of the TSC.
> + * Interrupt protection is not strictly required, but it does result in
> + * greater accuracy for the TSC / kernel_ns measurement.
> + */
> + local_irq_save(flags);
> + this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
> + kvm_get_msr(v, MSR_IA32_TSC,&tsc_timestamp);
>

That's a slow path, since it has to go through kvm_get_msr()'s if tree.
Could use its own accessor.

But this isn't introduced by this patch, so it can be fixed by another.

> + ktime_get_ts(&ts);
> + monotonic_to_bootbased(&ts);
> + kernel_ns = timespec_to_ns(&ts);
> + local_irq_restore(flags);
> +
> if (unlikely(this_tsc_khz == 0)) {
> kvm_request_guest_time_update(v);
> return 1;
> }
>
> + /*
> + * Time as measured by the TSC may go backwards when resetting the base
> + * tsc_timestamp. The reason for this is that the TSC resolution is
> + * higher than the resolution of the other clock scales. Thus, many
> + * possible measurments of the TSC correspond to one measurement of any
> + * other clock, and so a spread of values is possible. This is not a
> + * problem for the computation of the nanosecond clock; with TSC rates
> + * around 1GHZ, there can only be a few cycles which correspond to one
> + * nanosecond value, and any path through this code will inevitably
> + * take longer than that. However, with the kernel_ns value itself,
> + * the precision may be much lower, down to HZ granularity. If the
> + * first sampling of TSC against kernel_ns ends in the low part of the
> + * range, and the second in the high end of the range, we can get:
> + *
> + * (TSC - offset_low) * S + kns_old> (TSC - offset_high) * S + kns_new
> + *
> + * As the sampling errors potentially range in the thousands of cycles,
> + * it is possible such a time value has already been observed by the
> + * guest. To protect against this, we must compute the system time as
> + * observed by the guest and ensure the new system time is greater.
> + */
> + max_kernel_ns = 0;
> + if (vcpu->hv_clock.tsc_timestamp) {
> + max_kernel_ns = vcpu->last_guest_tsc -
> + vcpu->hv_clock.tsc_timestamp;
> + max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
> + vcpu->hv_clock.tsc_to_system_mul,
> + vcpu->hv_clock.tsc_shift);
> + max_kernel_ns += vcpu->last_kernel_ns;
> + }
> +
> if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
> - kvm_set_time_scale(this_tsc_khz,&vcpu->hv_clock);
> + kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
> + &vcpu->hv_clock.tsc_shift,
> + &vcpu->hv_clock.tsc_to_system_mul);
> vcpu->hw_tsc_khz = this_tsc_khz;
> }
>
> - /* Keep irq disabled to prevent changes to the clock */
> - local_irq_save(flags);
> - kvm_get_msr(v, MSR_IA32_TSC,&vcpu->hv_clock.tsc_timestamp);
> - ktime_get_ts(&ts);
> - monotonic_to_bootbased(&ts);
> - local_irq_restore(flags);
> + if (max_kernel_ns> kernel_ns) {
> + s64 overshoot = max_kernel_ns - kernel_ns;
> + ++v->stat.tsc_ahead;
> + if (overshoot> NSEC_PER_SEC / HZ) {
> + ++v->stat.tsc_overshoot;
> + if (printk_ratelimit())
> + pr_debug("ns overshoot: %lld\n", overshoot);
> + }
>

A tracepoint here would allow recording both the number of overshoots
and the value of the overshoot. But I don't think this is of much use
day-to-day.

--
error compiling committee.c: too many arguments to function

2010-06-15 08:41:43

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 12/17] Add helper function get_kernel_ns

On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> Add a helper function for the multiple places this is used. Note that it
> must not be called in preemptible context, as that would mean the kernel
> could enter software suspend state, which would cause non-atomic operation
> of the monotonic_to_bootbased computation.
>
> Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this
> as well? Currently, they are not bootbased (but perhaps should be).
>
> Signed-off-by: Zachary Amsden<[email protected]>
> ---
> arch/x86/kvm/x86.c | 26 +++++++++++++-------------
> 1 files changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 703ea43..15c7317 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz,
> __func__, base_khz, scaled_khz, shift, *pmultiplier);
> }
>
> +static inline u64 get_kernel_ns(void)
> +{
> + struct timespec ts;
> +
> + WARN_ON(preemptible());
> + ktime_get_ts(&ts);
> + monotonic_to_bootbased(&ts);
> + return timespec_to_ns(&ts);
> +}
> +
> static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
>
>

Isn't something like this a candidate for the time infrastructure?

--
error compiling committee.c: too many arguments to function

2010-06-15 08:44:42

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 13/17] Add TSC offset tracking

On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> Track the last TSC offset set for each VM and ensure that the storing of
> the offset and the reading of the TSC are never preempted by taking a
> spinlock.
>
>

Totally missing the point - tsc is per vcpu, why do we need per-vm tracking?

--
error compiling committee.c: too many arguments to function

2010-06-15 08:47:25

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 15/17] Fix AMD C1 TSC desynchronization

On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> Some AMD based machines can have TSC drift when in C1 HLT state because
> despite attempting to scale the TSC increment when dividing down the
> P-state, the processor may return to full P-state to service cache
> probes. The TSC of halted CPUs can advance faster than that of running
> CPUs as a result, causing unpredictable TSC drift.
>
> We implement a recommended workaround, which is disabling C1 clock ramping.
> ---
> arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 45 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ef847ee..8e836e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -56,6 +56,11 @@
> #include<asm/i387.h>
> #include<asm/xcr.h>
>
> +#ifdef CONFIG_K8_NB
> +#include<linux/pci.h>
> +#include<asm/k8.h>
> +#endif
> +
> #define MAX_IO_MSRS 256
> #define CR0_RESERVED_BITS \
> (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -4287,10 +4292,43 @@ static struct notifier_block kvmclock_cpu_notifier_block = {
> .priority = -INT_MAX
> };
>
> +static u8 disabled_c1_ramp = 0;
> +
> static void kvm_timer_init(void)
> {
> int cpu;
>
> + /*
> + * AMD processors can de-synchronize TSC on halt in C1 state, because
> + * processors in lower P state will have TSC scaled properly during
> + * normal operation, but will have TSC scaled improperly while
> + * servicing cache probes. Because there is no way to determine how
> + * TSC was adjusted during cache probes, there are two solutions:
> + * resynchronize after halt, or disable C1-clock ramping.
> + *
> + * We implemenent solution 2.
> + */
> +#ifdef CONFIG_K8_NB
> + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD&&
> + boot_cpu_data.x86 == 0x0f&&
> + !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> + struct pci_dev *nb;
> + int i;
> + cache_k8_northbridges();
> + for (i = 0; i< num_k8_northbridges; i++) {
> + u8 byte;
> + nb = k8_northbridges[i];
> + pci_read_config_byte(nb, 0x87,&byte);
> + if (byte& 1) {
> + printk(KERN_INFO "%s: AMD C1 clock ramping detected, performing workaround\n", __func__);
> + disabled_c1_ramp = byte;
> + pci_write_config_byte(nb, 0x87, byte& 0xFC);
> +
> + }
> + }
> + }
> +#endif
> +
> register_hotcpu_notifier(&kvmclock_cpu_notifier_block);
> if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
> @@ -4402,6 +4440,13 @@ void kvm_arch_exit(void)
> unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
> kvm_x86_ops = NULL;
> kvm_mmu_module_exit();
> +#ifdef CONFIG_K8_NB
> + if (disabled_c1_ramp) {
> + struct pci_dev **nb;
> + for (nb = k8_northbridges; *nb; nb++)
> + pci_write_config_byte(*nb, 0x87, disabled_c1_ramp);
> + }
> +#endif
> }
>

Such platform hackery should be in the platform code, not in kvm. kvm
might request to enable it (why not enable it unconditionally? should
we disable it on hardware_disable()?


--
error compiling committee.c: too many arguments to function

2010-06-15 08:51:10

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 16/17] TSC reset compensation

On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> Attempt to synchronize TSCs which are reset to the same value. In the
> case of a reliable hardware TSC, we can just re-use the same offset, but
> on non-reliable hardware, we can get closer by adjusting the offset to
> match the elapsed time.
>
>

Answers a question from earlier.

I wonder about guests that might try to be clever an compensate for the
IPI round trip, so not writing the same value. On the other hand,
really clever guests will synchronize though memory, not an IPI.

> Signed-off-by: Zachary Amsden<[email protected]>
> ---
> arch/x86/kvm/x86.c | 34 ++++++++++++++++++++++++++++++++--
> 1 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8e836e9..cedb71f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -937,14 +937,44 @@ static inline void kvm_request_guest_time_update(struct kvm_vcpu *v)
> set_bit(KVM_REQ_CLOCK_SYNC,&v->requests);
> }
>
> +static inline int kvm_tsc_reliable(void)
> +{
> + return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)&&
> + boot_cpu_has(X86_FEATURE_NONSTOP_TSC)&&
> + !check_tsc_unstable());
> +}
> +
> void guest_write_tsc(struct kvm_vcpu *vcpu, u64 data)
> {
> struct kvm *kvm = vcpu->kvm;
> - u64 offset;
> + u64 offset, ns, elapsed;
>
> spin_lock(&kvm->arch.tsc_write_lock);
> offset = data - native_read_tsc();
> - kvm->arch.last_tsc_nsec = get_kernel_ns();
> + ns = get_kernel_ns();
> + elapsed = ns - kvm->arch.last_tsc_nsec;
> +
> + /*
> + * Special case: identical write to TSC within 5 seconds of
> + * another CPU is interpreted as an attempt to synchronize
> + * (the 5 seconds is to accomodate host load / swapping).
> + *
> + * In that case, for a reliable TSC, we can match TSC offsets,
> + * or make a best guest using kernel_ns value.
> + */
> + if (data == kvm->arch.last_tsc_write&& elapsed< 5 * NSEC_PER_SEC) {
>

5e9 will overflow on i386.

> + if (kvm_tsc_reliable()) {
> + offset = kvm->arch.last_tsc_offset;
> + pr_debug("kvm: matched tsc offset for %llu\n", data);
> + } else {
> + u64 tsc_delta = elapsed * __get_cpu_var(cpu_tsc_khz);
> + tsc_delta = tsc_delta / USEC_PER_SEC;
> + offset -= tsc_delta;
> + pr_debug("kvm: adjusted tsc offset by %llu\n", tsc_delta);
> + }
> + ns = kvm->arch.last_tsc_nsec;
> + }
> + kvm->arch.last_tsc_nsec = ns;
>

Shouldn't we check that the older write was on a different vcpu?

--
error compiling committee.c: too many arguments to function

2010-06-15 08:51:57

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 17/17] Add timekeeping documentation

On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> Basic informational document about x86 timekeeping and how KVM
> is affected.
>

Excellent.

--
error compiling committee.c: too many arguments to function

2010-06-15 09:21:18

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 15/17] Fix AMD C1 TSC desynchronization

On 06/14/2010 10:47 PM, Avi Kivity wrote:
> On 06/15/2010 10:34 AM, Zachary Amsden wrote:
>> Some AMD based machines can have TSC drift when in C1 HLT state because
>> despite attempting to scale the TSC increment when dividing down the
>> P-state, the processor may return to full P-state to service cache
>> probes. The TSC of halted CPUs can advance faster than that of running
>> CPUs as a result, causing unpredictable TSC drift.
>>
>> We implement a recommended workaround, which is disabling C1 clock
>> ramping.
>> ---
>> arch/x86/kvm/x86.c | 45
>> +++++++++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 45 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index ef847ee..8e836e9 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -56,6 +56,11 @@
>> #include<asm/i387.h>
>> #include<asm/xcr.h>
>>
>> +#ifdef CONFIG_K8_NB
>> +#include<linux/pci.h>
>> +#include<asm/k8.h>
>> +#endif
>> +
>> #define MAX_IO_MSRS 256
>> #define CR0_RESERVED_BITS \
>> (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM |
>> X86_CR0_TS \
>> @@ -4287,10 +4292,43 @@ static struct notifier_block
>> kvmclock_cpu_notifier_block = {
>> .priority = -INT_MAX
>> };
>>
>> +static u8 disabled_c1_ramp = 0;
>> +
>> static void kvm_timer_init(void)
>> {
>> int cpu;
>>
>> + /*
>> + * AMD processors can de-synchronize TSC on halt in C1 state,
>> because
>> + * processors in lower P state will have TSC scaled properly during
>> + * normal operation, but will have TSC scaled improperly while
>> + * servicing cache probes. Because there is no way to determine
>> how
>> + * TSC was adjusted during cache probes, there are two solutions:
>> + * resynchronize after halt, or disable C1-clock ramping.
>> + *
>> + * We implemenent solution 2.
>> + */
>> +#ifdef CONFIG_K8_NB
>> + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD&&
>> + boot_cpu_data.x86 == 0x0f&&
>> + !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
>> + struct pci_dev *nb;
>> + int i;
>> + cache_k8_northbridges();
>> + for (i = 0; i< num_k8_northbridges; i++) {
>> + u8 byte;
>> + nb = k8_northbridges[i];
>> + pci_read_config_byte(nb, 0x87,&byte);
>> + if (byte& 1) {
>> + printk(KERN_INFO "%s: AMD C1 clock ramping detected,
>> performing workaround\n", __func__);
>> + disabled_c1_ramp = byte;
>> + pci_write_config_byte(nb, 0x87, byte& 0xFC);
>> +
>> + }
>> + }
>> + }
>> +#endif
>> +
>> register_hotcpu_notifier(&kvmclock_cpu_notifier_block);
>> if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
>> cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
>> @@ -4402,6 +4440,13 @@ void kvm_arch_exit(void)
>> unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
>> kvm_x86_ops = NULL;
>> kvm_mmu_module_exit();
>> +#ifdef CONFIG_K8_NB
>> + if (disabled_c1_ramp) {
>> + struct pci_dev **nb;
>> + for (nb = k8_northbridges; *nb; nb++)
>> + pci_write_config_byte(*nb, 0x87, disabled_c1_ramp);
>> + }
>> +#endif
>> }
>
> Such platform hackery should be in the platform code, not in kvm. kvm
> might request to enable it (why not enable it unconditionally? should
> we disable it on hardware_disable()?

I actually have some negative effects to report from this patch - when
under stress, my laptop spontaneously shut down. Thermal problems?

I agree it is complete hackery. I do not recommend this patch for
upstream inclusion unless it is proposed also by someone more familiar
with the hardware.

However, it was required for my testing.

2010-06-15 14:46:18

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 15/17] Fix AMD C1 TSC desynchronization

On Tue, Jun 15, 2010 at 05:21:11AM -0400, Zachary Amsden wrote:
> On 06/14/2010 10:47 PM, Avi Kivity wrote:
> > On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> >> + /*
> >> + * AMD processors can de-synchronize TSC on halt in C1 state,
> >> because
> >> + * processors in lower P state will have TSC scaled properly during
> >> + * normal operation, but will have TSC scaled improperly while
> >> + * servicing cache probes. Because there is no way to determine
> >> how
> >> + * TSC was adjusted during cache probes, there are two solutions:
> >> + * resynchronize after halt, or disable C1-clock ramping.
> >> + *
> >> + * We implemenent solution 2.
> >> + */
> >> +#ifdef CONFIG_K8_NB
> >> + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD&&
> >> + boot_cpu_data.x86 == 0x0f&&
> >> + !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> >> + struct pci_dev *nb;
> >> + int i;
> >> + cache_k8_northbridges();
> >> + for (i = 0; i< num_k8_northbridges; i++) {
> >> + u8 byte;
> >> + nb = k8_northbridges[i];
> >> + pci_read_config_byte(nb, 0x87,&byte);
> >> + if (byte& 1) {
> >> + printk(KERN_INFO "%s: AMD C1 clock ramping detected,
> >> performing workaround\n", __func__);
> >> + disabled_c1_ramp = byte;
> >> + pci_write_config_byte(nb, 0x87, byte& 0xFC);
> >> +
> >> + }
> >> + }
> >> + }
> >> +#endif

This is dangerous to do as a general thing on all Fam0fh based AMD
systems, especially mobile ones and especially since some distributions
load the kvm modules at boot per default.
The BIOS should set meaningful default values for PMM7 and the
recommended default value for servers already set this bit to zero.

> I agree it is complete hackery. I do not recommend this patch for
> upstream inclusion unless it is proposed also by someone more familiar
> with the hardware.

I will check again with some other people inside AMD, but my current
suggestion would be to not provide kvm-clock to the guest if running on
an AMD Fam0fh processor in an SMP environemnt with PMM7.0 set. This also
sucks in its own way but is safer than just disable clock ramping on all
affected processors.

Joerg

2010-06-15 20:29:06

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 17/17] Add timekeeping documentation

On Mon, 14 Jun 2010 21:34:19 -1000 Zachary Amsden wrote:

> Basic informational document about x86 timekeeping and how KVM
> is affected.

Nice job/information. Thanks.

Just some typos etc. inline below.


> Signed-off-by: Zachary Amsden <[email protected]>
> ---
> Documentation/kvm/timekeeping.txt | 599 +++++++++++++++++++++++++++++++++++++
> 1 files changed, 599 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/kvm/timekeeping.txt
>
> diff --git a/Documentation/kvm/timekeeping.txt b/Documentation/kvm/timekeeping.txt
> new file mode 100644
> index 0000000..4ce8edf
> --- /dev/null
> +++ b/Documentation/kvm/timekeeping.txt
> @@ -0,0 +1,599 @@
> +
> + Timekeeping Virtualization for X86-Based Architectures
> +
> + Zachary Amsden <[email protected]>
> + Copyright (c) 2010, Red Hat. All rights reserved.
> +
> +1) Overview
> +2) Timing Devices
> +3) TSC Hardware
> +4) Virtualization Problems
> +
> +=========================================================================
> +
> +1) Overview
> +
> +One of the most complicated parts of the X86 platform, and specifically,
> +the virtualization of this platform is the plethora of timing devices available
> +and the complexity of emulating those devices. In addition, virtualization of
> +time introduces a new set of challenges because it introduces a multiplexed
> +division of time beyond the control of the guest CPU.
> +
> +First, we will describe the various timekeeping hardware available, then
> +present some of the problems which arise and solutions available, giving
> +specific recommendations for certain classes of KVM guests.
> +
> +The purpose of this document is to collect data and information relevant to
> +time keeping which may be difficult to find elsewhere, specifically,

timekeeping

> +information relevant to KVM and hardware based virtualization.

hardware-based

> +
> +=========================================================================
> +
> +2) Timing Devices
> +
> +First we discuss the basic hardware devices available. TSC and the related
> +KVM clock are special enough to warrant a full exposition and are described in
> +the following section.
> +
> +2.1) i8254 - PIT
> +
> +One of the first timer devices available is the programmable interrupt timer,
> +or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
> +channels which can be programmed to deliver periodic or one-shot interrupts.
> +These three channels can be configured in different modes and have individual
> +counters. Channel 1 and 2 were not available for general use in the original
> +IBM PC, and historically were connected to control RAM refresh and the PC
> +speaker. Now the PIT is typically integrated as part of an emulated chipset
> +and a separate physical PIT is not used.
> +
> +The PIT uses I/O ports 0x40h - 0x43. Access to the 16-bit counters is done

drop the 'h' ^

> +using single or multiple byte access to the I/O ports. There are 6 modes
> +available, but not all modes are available to all timers, as only timer 2
> +has a connected gate input, required for modes 1 and 5. The gate line is
> +controlled by port 61h, bit 0, as illustrated in the following diagram.
> +
> + -------------- ----------------
> +| | | |
> +| 1.1932 MHz |---------->| CLOCK OUT | ---------> IRQ 0
> +| Clock | | | |
> + -------------- | +->| GATE TIMER 0 |
> + | ----------------
> + |
> + | ----------------
> + | | |
> + |------>| CLOCK OUT | ---------> 66.3 KHZ DRAM
> + | | | (aka /dev/null)
> + | +->| GATE TIMER 1 |
> + | ----------------
> + |
> + | ----------------
> + | | |
> + |------>| CLOCK OUT | ---------> Port 61h, bit 5
> + | | |
> +Port 61h, bit 0 ---------->| GATE TIMER 2 | \_.----
> + ---------------- _| )---- Speaker
> + / *----
> +Port 61h, bit 1 -----------------------------------/
> +
> +The timer modes are now described.
> +
> +Mode 0: Single Timeout. This is a one shot software timeout that counts down

one-shot

> + when the gate is high (always true for timers 0 and 1). When the count
> + reaches zero, the output goes high.
> +
> +Mode 1: Triggered One Shot. The output is intially set high. When the gate

One-shot (or One-Shot)

> + line is set high, a countdown is initiated (which does not stop if the gate is
> + lowered), during which the output is set low. When the count reaches zero,
> + the output goes high.
> +
> +Mode 2: Rate Generator. The output is initially set high. When the countdown
> + reaches 1, the output goes low for one count and then returns high. The value
> + is reloaded and the countdown automatically resume. If the gate line goes

resumes.

> + low, the count is halted. If the output is low when the gate is lowered, the
> + output automatically goes high (this only affects timer 2).
> +
> +Mode 3: Square Wave. This generates a sine wave. The count determines the

a sine wave is a square wave???

> + length of the pulse, which alternates between high and low when zero is
> + reached. The count only proceeds when gate is high and is automatically
> + reloaded on reaching zero. The count is decremented twice at each clock.
> + If the count is even, the clock remains high for N/2 counts and low for N/2
> + counts; if the clock is odd, the clock is high for (N+1)/2 counts and low
> + for (N-1)/2 counts. Only even values are latched by the counter, so odd
> + values are not observed when reading.
> +
> +Mode 4: Software Strobe. After programming this mode and loading the counter,
> + the output remains high until the counter reaches zero. Then the output
> + goes low for 1 clock cycle and returns high. The counter is not reloaded.
> + Counting only occurs when gate is high.
> +
> +Mode 5: Hardware Strobe. After programming and loading the counter, the
> + output remains high. When the gate is raised, a countdown is initiated
> + (which does not stop if the gate is lowered). When the counter reaches zero,
> + the output goes low for 1 clock cycle and then returns high. The counter is
> + not reloaded.
> +
> +In addition to normal binary counting, the PIT supports BCD counting. The
> +command port, 0x43h is used to set the counter and mode for each of the three

0x43

> +timers.
> +
> +PIT commands, issued to port 0x43, using the following bit encoding:
> +
> +Bit 7-4: Command (See table below)
> +Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
> +Bit 0 : Binary (0) / BCD (1)
> +
> +Command table:
> +
> +0000 - Latch Timer 0 count for port 0x40
> + sample and hold the count to be read in port 0x40;
> + additional commands ignored until counter is read;
> + mode bits ignored.
> +
> +0001 - Set Timer 0 LSB mode for port 0x40
> + set timer to read LSB only and force MSB to zero;
> + mode bits set timer mode
> +
> +0010 - Set Timer 0 MSB mode for port 0x40
> + set timer to read MSB only and force LSB to zero;
> + mode bits set timer mode
> +
> +0011 - Set Timer 0 16-bit mode for port 0x40
> + set timer to read / write LSB first, then MSB;
> + mode bits set timer mode
> +
> +0100 - Latch Timer 1 count for port 0x41 - as described above
> +0101 - Set Timer 1 LSB mode for port 0x41 - as described above
> +0110 - Set Timer 1 MSB mode for port 0x41 - as described above
> +0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
> +
> +1000 - Latch Timer 2 count for port 0x42 - as described above
> +1001 - Set Timer 2 LSB mode for port 0x42 - as described above
> +1010 - Set Timer 2 MSB mode for port 0x42 - as described above
> +1011 - Set Timer 2 16-bit mode for port 0x42 as described above
> +
> +1101 - General counter latch
> + Latch combination of counters into corresponding ports
> + Bit 3 = Counter 2
> + Bit 2 = Counter 1
> + Bit 1 = Counter 0
> + Bit 0 = Unused
> +
> +1110 - Latch timer status
> + Latch combination of counter mode into corresponding ports
> + Bit 3 = Counter 2
> + Bit 2 = Counter 1
> + Bit 1 = Counter 0
> +
> + The output of ports 0x40-0x42 following this command will be:
> +
> + Bit 7 = Output pin
> + Bit 6 = Count loaded (0 if timer has expired)
> + Bit 5-4 = Read / Write mode
> + 01 = MSB only
> + 10 = LSB only
> + 11 = LSB / MSB (16-bit)
> + Bit 3-1 = Mode
> + Bit 0 = Binary (0) / BCD mode (1)
> +
> +2.2) RTC
> +
> +The second device which was available in the original PC was the MC146818 real
> +time clock. The original device is now obsolete, and usually emulated by the
> +system chipset, sometimes by an HPET and some frankenstein IRQ routing.
> +
> +The RTC is accessed through CMOS variables, which uses an index register to
> +control which bytes are read. Since there is only one index register, read
> +of the CMOS and read of the RTC require lock protection (in addition, it is
> +dangerous to allow userspace utilities such as hwclock to have direct RTC
> +access, as they could corrupt kernel reads and writes of CMOS memory).
> +
> +The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt
> +can function as a once a day alarm, a periodic alarm, and can issue interrupts

(once a day is periodic ;)

> +after an update of the CMOS registers by the MC146818 is complete. The type of
> +interrupt is signalled in the RTC status registers.
> +
> +The RTC will update the current time fields by battery power even while the
> +system is off. The current time fields should not be read while an update is
> +in progress, as indicated in the status register.
> +
> +The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
> +programmed to a 32kHz divider if the RTC is to count seconds.
> +
> +This is the RAM map originally used for the RTC/CMOS:
> +
> +Location Size Description
> +------------------------------------------
> +00h byte Current second (BCD)
> +01h byte Seconds alarm (BCD)
> +02h byte Current minute (BCD)
> +03h byte Minutes alarm (BCD)
> +04h byte Current hour (BCD)
> +05h byte Hours alarm (BCD)
> +06h byte Current day of week (BCD)
> +07h byte Current day of month (BCD)
> +08h byte Current month (BCD)
> +09h byte Current year (BCD)
> +0Ah byte Register A
> + bit 7 = Update in progress
> + bit 6-4 = Divider for clock
> + 000 = 4.194 MHz
> + 001 = 1.049 MHz
> + 010 = 32 kHz
> + 10X = test modes
> + 110 = reset / disable
> + 111 = reset / disable
> + bit 3-0 = Rate selection for periodic interrupt
> + 000 = periodic timer disabled
> + 001 = 3.90625 uS
> + 010 = 7.8125 uS
> + 011 = .122070 mS
> + 100 = .244141 mS
> + ...
> + 1101 = 125 mS
> + 1110 = 250 mS
> + 1111 = 500 mS
> +0Bh byte Register B
> + bit 7 = Run (0) / Halt (1)
> + bit 6 = Periodic interrupt enable
> + bit 5 = Alarm interrupt enable
> + bit 4 = Update-ended interrupt enable
> + bit 3 = Square wave interrupt enable
> + bit 2 = BCD calendar (0) / Binary (1)
> + bit 1 = 12-hour mode (0) / 24-hour mode (1)
> + bit 0 = 0 (DST off) / 1 (DST enabled)
> +OCh byte Register C (read only)
> + bit 7 = interrupt request flag (IRQF)
> + bit 6 = periodic interrupt flag (PF)
> + bit 5 = alarm interrupt flag (AF)
> + bit 4 = update interrupt flag (UF)
> + bit 3-0 = reserved
> +ODh byte Register D (read only)
> + bit 7 = RTC has power
> + bit 6-0 = reserved
> +32h byte Current century BCD (*)
> + (*) location vendor specific and now determined from ACPI global tables
> +
> +2.3) APIC
> +
> +On Pentium and later processors, an on-board timer is available to each CPU
> +as part of the Advanced Programmable Interrupt Controller. The APIC is
> +accessed through memory mapped registers and provides interrupt service to each

memory-mapped

> +CPU, used for IPIs and local timer interrupts.
> +
> +Although in theory the APIC is a safe and stable source for local interrupts,
> +in practice, many bugs and glitches have occurred due to the special nature of
> +the APIC CPU-local memory mapped hardware. Beware that CPU errata may affect

memory-mapped

> +the use of the APIC and that workarounds may be required. In addition, some of
> +these workarounds pose unique constraints for virtualization - requiring either
> +extra overhead incurred from extra reads of memory mapped I/O or additional

ditto

> +functionality that may be more computationally expensive to implement.
> +
> +Since the APIC is documented quite well in the Intel and AMD manuals, we will
> +avoid repititon of the detail here. It should be pointed out that the APIC
> +timer is programmed through the LVT (local vector timer) register, is capable
> +of one-shot or periodic operation, and is based on the bus clock divided down
> +by the programmable divider register.
> +
> +2.4) HPET
> +
> +HPET is quite complex, and was originally intended to replace the PIT / RTC
> +support of the X86 PC. It remains to be seen whether that will be the case, as
> +the de-facto standard of PC hardware is to emulate these older devices. Some

de facto

> +systems designated as legacy free may support only the HPET as a hardware timer
> +device.
> +
> +The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
> +but allowing implementation freedom to support many more. It also imposes no
> +fixed rate on the timer frequency, but does impose some extremal values on
> +frequency, error and slew.
> +
> +In general, the HPET is recommended as a high precision (compared to PIT /RTC)
> +time source which is independent of local variation (as there is only one HPET
> +in any given system). The HPET is also memory mapped, and its presence is

memory-mapped,

> +indicated through ACPI table by the BIOS.

"an ACPI table" or "ACPI tables"

> +
> +Detailed specification of the HPET is beyond the current scope of this
> +document, as it is also very well documented elsewhere.
> +
> +2.5) Offboard Timers
> +
> +Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
> +timing chips built into the cards which may have registers which are accessible
> +to kernel or user drivers. To the author's knowledge, using these to generate
> +a clocksource for a Linux or other kernel has not yet been attempted and is in
> +general frowned upon as not playing by the agreed rules of the game. Such a
> +timer device would require additional support to be virtualized properly and is
> +not considered important at this time as no known operating system does this.
> +
> +=========================================================================
> +
> +3) TSC Hardware
> +
> +The TSC or time stamp counter is relatively simple in theory; it counts
> +instruction cycles issued by the processor, which can be used as a measure of
> +time. In practice, due to a number of problems, it is the most complicated
> +time keeping device to use.

timekeeping

> +
> +The TSC is represented internally as a 64-bit MSR which can be read with the
> +RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware
> +limitations made it possible to write the TSC, but generally on old hardware it
> +was only possible to write the low 32-bits of the 64-bit counter, and the upper
> +32-bits of the counter were cleared. Now, however, on Intel processors family
> +0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
> +has been lifted and all 64-bits are writable. On AMD systems, the ability to
> +write the TSC MSR is not an architectural guarantee.
> +
> +The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
> +means of the CR4.TSD bit, which disables CPL > 0 TSC access.
> +
> +Some vendors have implemented an additional instruction, RDTSCP, which returns
> +atomically not just the TSC, but an indicator which corresponds to the
> +processor number. This can be used to index into an array of TSC variables to
> +determine offset information in SMP systems where TSCs are not synchronized.
> +
> +Both VMX and SVM provide extension fields in the virtualization hardware which
> +allows the guest visible TSC to be offset by a constant. Newer implementations
> +promise to allow the TSC to additionally be scaled, but this hardware is not
> +yet widely available.
> +
> +3.1) TSC synchronization
> +
> +The TSC is a CPU-local clock in most implementations. This means, on SMP
> +platforms, the TSCs of different CPUs may start at different times depending
> +on when the CPUs are powered on. Generally, CPUs on the same die will share
> +the same clock, however, this is not always the case.
> +
> +The BIOS may attempt to resynchronize the TSCs as a result during the poweron

"as a result" of what? That phrase seems superfluous.

> +process and the operating system or other system software may attempt to do
> +this as well. Several hardware limitations make the problem worse - if it is
> +not possible to write the full 32-bits of the TSC, it may be impossible to
> +match the TSC in newly arriving CPUs to that of the rest of the system,
> +resulting in unsynchronized TSCs. This may be done by BIOS or system software,
> +but in practice, getting a perfectly synchronized TSC will not be possible
> +unless all values are read from the same clock, which generally only is
> +possible on single socket systems or those with special hardware support.
> +
> +3.2) TSC and CPU hotplug
> +
> +As touched on already, CPUs which arrive later than the boot time of the system
> +may not have a TSC value that is synchronized with the rest of the system.
> +Either system software, BIOS, or SMM code may actually try to establish the TSC
> +to a value matching the rest of the system, but a perfect match is usually not
> +a guarantee.
> +
> +3.3) TSC and multi-socket / NUMA
> +
> +Multi-socket systems, especially large multi-socket systems are likely to have
> +individual clocksources rather than a single, universally distributed clock.
> +Since these clocks are driven by different crystals, they will not have
> +perfectly matched frequency, and temperature and electrical variations will
> +cause the cpu clocks, and thus the TSCs to drift over time. Depending on the

CPU

> +exact clock and bus design, the drift may or may not be fixed in absolute
> +error, and may accumulate over time.
> +
> +In addition, very large systems may deliberately slew the clocks of individual
> +cores. This technique, known as spread-spectrum clocking, reduces EMI at the
> +clock frequency and harmonics of it, which may be required to pass FCC
> +standards for telecommunications and computer equipment.
> +
> +It is recommended not to trust the TSCs to remain synchronized on NUMA or
> +multiple socket systems for these reasons.
> +
> +3.4) TSC and C-states
> +
> +C-states, or idling states of the processor, especially C1E and deeper sleep
> +states may be problematic for TSC as well. The TSC may stop advancing in such
> +a state, resulting in a TSC which is behind that of other CPUs when execution
> +is resumed. Such CPUs must be detected and flagged by the operating system
> +based on CPU and chipset identifications.
> +
> +The TSC in such a case may be corrected by catching it up to a known external
> +clocksource.
> +
> +3.5) TSC frequency change / P-states
> +
> +To make things slightly more interesting, some CPUs may change requency. They

frequency.

> +may or may not run the TSC at the same rate, and because the frequency change
> +may be staggered or slewed, at some points in time, the TSC rate may not be
> +known other than falling within a range of values. In this case, the TSC will
> +not be a stable time source, and must be calibrated against a known, stable,
> +external clock to be a usable source of time.
> +
> +Whether the TSC runs at a constant rate or scales with the P-state is model
> +dependent and must be determined by inspecting CPUID, chipset or various MSR
> +fields.
> +
> +In addition, some vendors have known bugs where the P-state is actually
> +compensated for properly during normal operation, but when the processor is
> +inactive, the P-state may be raised temporarily to service cache misses from
> +other processors. In such cases, the TSC on halted CPUs could advance faster
> +than that of non-halted processors. AMD Turion processors are known to have
> +this problem.
> +
> +3.6) TSC and STPCLK / T-states
> +
> +External signals given to the processor may also have the affect of stopping
> +the TSC. This is typically done for thermal emergency power control to prevent
> +an overheating condition, and typically, there is no way to detect that this
> +condition has happened.
> +
> +3.7) TSC virtualization - VMX
> +
> +VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
> +instructions, which is enough for full virtualization of TSC in any manner. In
> +addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
> +field specified in the VMCS. Special instructions must be used to read and
> +write the VMCS field.
> +
> +3.8) TSC virtualization - SVM
> +
> +VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP

SVM

> +instructions, which is enough for full virtualization of TSC in any manner. In
> +addition, SVM allows passing through the host TSC plus an additional offset
> +ield specified in SVM control block.
> +
> +3.9) TSC feature bits in Linux
> +
> +In summary, there is no way to guarantee the TSC remains in perfect
> +synchronization unless it is explicitly guaranteed by the architecture. Even
> +if so, the TSCs in multi-sockets or NUMA systems may still run independently
> +despite being locally consistent.
> +
> +The following feature bits are used by Linux to signal various TSC attributes,
> +but they can only be taken to be meaningful for UP or single node systems.
> +
> +X86_FEATURE_TSC : The TSC is available in hardware
> +X86_FEATURE_RDTSCP : The RDTSCP instruction is available
> +X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states
> +X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states
> +X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware)
> +
> +4) Virtualization Problems
> +
> +Timekeeping is especially problematic for virtualization because a number of
> +challenges arise. The most obvious problem is that time is now shared between
> +the host and, potentially, a number of virtual machines. This happens
> +naturally on X86 systems when SMM mode is used by the BIOS, but not to such a
> +degree nor with such frequency. However, the fact that SMM mode may cause
> +similar problems to virtualization makes it a good justification for solving
> +many of these problems on bare metal.
> +
> +4.1) Interrupt clocking
> +
> +One of the most immediate problems that occurs with legacy operating systems
> +is that the system timekeeping routines are often designed to keep track of
> +time by counting periodic interrupts. These interrupts may come from the PIT
> +or the RTC, but the problem is the same: the host virtualization engine may not
> +be able to deliver the proper number of interrupts per second, and so guest
> +time may fall behind. This is especially problematic if a high interrupt rate
> +is selected, such as 1000 HZ, which is unfortunately the default for many Linux
> +guests.
> +
> +There are three approaches to solving this problem; first, it may be possible
> +to simply ignore it. Guests which have a separate time source for tracking
> +'wall clock' or 'real time' may not need any adjustment of their interrupts to
> +maintain proper time. If this is not sufficient, it may be necessary to inject
> +additional interrupts into the guest in order to increase the effective
> +interrupt rate. This approach leads to complications in extreme conditions,
> +where host load or guest lag is too much to compensate for, and thus another
> +solution to the problem has risen: the guest may need to become aware of lost
> +ticks and compensate for them internally. Although promising in theory, the
> +implementation of this policy in Linux has been extremely error prone, and a
> +number of buggy variants of lost tick compensation are distributed across
> +commonly used Linux systems.
> +
> +Windows uses periodic RTC clocking as a means of keeping time internally, and
> +thus requires interrupt slewing to keep proper time. It does use a low enough
> +rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
> +practice.
> +
> +4.2) TSC sampling and serialization
> +
> +As the highest precision time source available, the cycle counter of the CPU
> +has aroused much interest from developers. As explained above, this timer has
> +many problems unique to its nature as a local, potentially unstable and
> +potentially unsynchronized source. One issue which is not unique to the TSC,
> +but is highlighted because of it's very precise nature is sampling delay. By

its

> +definition, the counter, once read is already old. However, it is also
> +possible for the counter to be read ahead of the actual use of the result.
> +This is a consequence of the superscalar execution of the instruction stream,
> +which may execute instructions out of order. Such execution is called
> +non-serialized. Forcing serialized execution is necessary for precise
> +measurement with the TSC, and requires a serializing instruction, such as CPUID
> +or an MSR read.
> +
> +Since CPUID may actually be virtualized by a trap and emulate mechanism, this
> +serialization can pose a performance issue for hardware virtualization. An
> +accurate time stamp counter reading may therefore not always be available, and
> +it may be necessary for an implementation to guard against "backwards" reads of
> +the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
> +system.
> +
> +4.3) Timespec aliasing
> +
> +Additionally, this lack of serialization from the TSC poses another challenge
> +when using results of the TSC when measured against another time source. As
> +the TSC is much higher precision, many possible values of the TSC may be read
> +while another clock is still expressing the same value.
> +
> +That is, you may read (T,T+10) while external clock C maintains the same value.
> +Due to non-serialized reads, you may actually end up with a range which
> +fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but
> +calibrated against an external value may have a range of valid values.
> +Re-calibrating this computation may actually cause time, as computed after the
> +calibration, to go backwards, compared with time computed before the
> +calibration.
> +
> +This problem is particularly pronounced with an internal time source in Linux,
> +the kernel time, which is expressed in the theoretically high resultion

resolution

> +timespec - but which advances in much larger granularity intervals, sometimes
> +at the rate of jiffies, and possibly in catchup modes, at a much larger step.
> +
> +This aliasing requires care in the computation and recalibration of kvmclock
> +and any other values derived from TSC computation (such as TSC virtualization
> +itself).
> +
> +4.4) Migration
> +
> +Migration of a virtual machine raises problems for timekeeping in two ways.
> +First, the migration itself may take time, during which interrupts can not be

cannot

> +delivered, and after which, the guest time may need to be caught up. NTP may
> +be able to help to some degree here, as the clock correction required is
> +typically small enough to fall in the NTP-correctable window.
> +
> +An additional concern is that timers based off the TSC (or HPET, if the raw bus
> +clock is exposed) may now be running at different rates, requiring compensation
> +in some may in the hypervisor by virtualizing these timers. In addition,
> +migrating to a faster machine may preclude the use of a passthrough TSC, as a
> +faster clock can not be made visible to a guest without the potential of time

cannot

> +advancing faster than usual. A slower clock is less of a problem, as it can
> +always be caught up to the original rate. KVM clock avoids these problems by
> +simply storing multipliers and offsets gainst the TSC for the guest to convert
> +back into nanosecond resolution values.
> +
> +4.5) Scheduling
> +
> +Since scheduling may be based on precise timing and firing of interrupts, the
> +scheduling algorithms of an operating system may be adversely affected by
> +virtualization. In theory, the effect is random and should be universally
> +distributed, but in contrived as well as real scenarios (guest device access,
> +causes virtualization exits, possible context switch), this may not always be
> +the case. The effect of this has not been well studied (ed: has it? any
> +published results?).
> +
> +In an attempt to workaround this, several implementations have provided a

work around this,

> +paravirtualized scheduler clock, which reveals the true amount of CPU time for
> +which a virtual machine has been running.
> +
> +4.6) Watchdogs
> +
> +Watchdog timers, such as the lock detector in Linux may fire accidentally when
> +running under hardware virtualization due to timer interrupts being delayed or
> +misinterpretation of the passage of real time. Usually, these warnings are
> +spurious and can be ignored, but in some circumstances it may be necessary to
> +disable such detection.
> +
> +4.7) Delays and precision timing
> +
> +Precise timing and delays may not be possible in a virtualized system. This
> +can happen if the system is controlling physical hardware, or issues delays to
> +compensate for slower I/O to and from devices. The first issue is not solvable
> +in general for a virtualized system; hardware control software can't be
> +adequately virtualized without a full real-time operating system, which would
> +require an RT aware virtualization platform.
> +
> +The second issue may cause performance problems, but this is unlikely to be a
> +significant issue. In many cases these delays may be eliminated through
> +configuration or paravirtualization.
> +
> +4.8) Covert channels and leaks
> +
> +In addition to the above problems, time information will inevitably leak to the
> +guest about the host in anything but a perfect implementation of virtualized
> +time. This may allow the guest to infer the presence of a hypervisor (as in a
> +red-pill type detection), and it may allow information to leak between guests
> +by using CPU utilization itself as a signalling channel. Preventing such
> +problems would require completely isolated virtual time which may not track
> +real time any longer. This may be useful in certain security or QA contexts,
> +but in general isn't recommended for real-world deployment scenarios.
> +
> --


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

2010-06-15 20:32:43

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 16/17] TSC reset compensation

On 06/14/2010 10:51 PM, Avi Kivity wrote:
> On 06/15/2010 10:34 AM, Zachary Amsden wrote:
>> Attempt to synchronize TSCs which are reset to the same value. In the
>> case of a reliable hardware TSC, we can just re-use the same offset, but
>> on non-reliable hardware, we can get closer by adjusting the offset to
>> match the elapsed time.
>>
>
> Answers a question from earlier.
>
> I wonder about guests that might try to be clever an compensate for
> the IPI round trip, so not writing the same value. On the other hand,
> really clever guests will synchronize though memory, not an IPI.

Really, really clever guests will use an NMI so as to block further NMIs
and then synchronize through memory from the NMI handler. Or an SMI, if
they have enough understanding of the chipset to make it work.

>
>> Signed-off-by: Zachary Amsden<[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 34 ++++++++++++++++++++++++++++++++--
>> 1 files changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 8e836e9..cedb71f 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -937,14 +937,44 @@ static inline void
>> kvm_request_guest_time_update(struct kvm_vcpu *v)
>> set_bit(KVM_REQ_CLOCK_SYNC,&v->requests);
>> }
>>
>> +static inline int kvm_tsc_reliable(void)
>> +{
>> + return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)&&
>> + boot_cpu_has(X86_FEATURE_NONSTOP_TSC)&&
>> + !check_tsc_unstable());
>> +}
>> +
>> void guest_write_tsc(struct kvm_vcpu *vcpu, u64 data)
>> {
>> struct kvm *kvm = vcpu->kvm;
>> - u64 offset;
>> + u64 offset, ns, elapsed;
>>
>> spin_lock(&kvm->arch.tsc_write_lock);
>> offset = data - native_read_tsc();
>> - kvm->arch.last_tsc_nsec = get_kernel_ns();
>> + ns = get_kernel_ns();
>> + elapsed = ns - kvm->arch.last_tsc_nsec;
>> +
>> + /*
>> + * Special case: identical write to TSC within 5 seconds of
>> + * another CPU is interpreted as an attempt to synchronize
>> + * (the 5 seconds is to accomodate host load / swapping).
>> + *
>> + * In that case, for a reliable TSC, we can match TSC offsets,
>> + * or make a best guest using kernel_ns value.
>> + */
>> + if (data == kvm->arch.last_tsc_write&& elapsed< 5 *
>> NSEC_PER_SEC) {
>
> 5e9 will overflow on i386.

Better make it 4 ;)

>
>> + if (kvm_tsc_reliable()) {
>> + offset = kvm->arch.last_tsc_offset;
>> + pr_debug("kvm: matched tsc offset for %llu\n", data);
>> + } else {
>> + u64 tsc_delta = elapsed * __get_cpu_var(cpu_tsc_khz);
>> + tsc_delta = tsc_delta / USEC_PER_SEC;
>> + offset -= tsc_delta;
>> + pr_debug("kvm: adjusted tsc offset by %llu\n", tsc_delta);
>> + }
>> + ns = kvm->arch.last_tsc_nsec;
>> + }
>> + kvm->arch.last_tsc_nsec = ns;
>
> Shouldn't we check that the older write was on a different vcpu?

I thought about it; the pattern isn't necessarily the same with every OS
(or even every qemu-kvm binary), but at least on AMD hardware, we see 6
writes of the TSC for a 2-VCPU VM. Two are done from ioctls for each
VCPU at creation and setup time. The other two are triggered by the
INIT and STARTUP IPI signals for each AP from the Linux bootup code.

So if we want to keep the APs in sync with the BSP, they must tolerate
four resets in a row of the TSC on a single CPU. We could eliminate the
second setup_vcpu reset with refactoring and choose not to reset the TSC
for SIPI signals, but for now this seems the simplest and best approach.

2010-06-15 20:37:07

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

On 06/14/2010 10:40 PM, Avi Kivity wrote:
> On 06/15/2010 10:34 AM, Zachary Amsden wrote:
>> Kernel time, which advances in discrete steps may progress much slower
>> than TSC. As a result, when kvmclock is adjusted to a new base, the
>> apparent time to the guest, which runs at a much higher, nsec scaled
>> rate based on the current TSC, may have already been observed to have
>> a larger value (kernel_ns + scaled tsc) than the value to which we are
>> setting it (kernel_ns + 0).
>>
>> We must instead compute the clock as potentially observed by the guest
>> for kernel_ns to make sure it does not go backwards.
>>
>> @@ -455,6 +457,8 @@ struct kvm_vcpu_stat {
>> u32 hypercalls;
>> u32 irq_injections;
>> u32 nmi_injections;
>> + u32 tsc_overshoot;
>> + u32 tsc_ahead;
>> };
>
> Please don't add new stats, instead add tracepoints which can also be
> observed as stats.
>
> But does this really merit exposing? What would one do with this
> information?
>
>> struct kvm_vcpu_arch *vcpu =&v->arch;
>> void *shared_kaddr;
>> unsigned long this_tsc_khz;
>> + s64 kernel_ns, max_kernel_ns;
>> + u64 tsc_timestamp;
>>
>> if ((!vcpu->time_page))
>> return 0;
>>
>> - this_tsc_khz = get_cpu_var(cpu_tsc_khz);
>> - put_cpu_var(cpu_tsc_khz);
>> + /*
>> + * The protection we require is simple: we must not be preempted
>> from
>> + * the CPU between our read of the TSC khz and our read of the TSC.
>> + * Interrupt protection is not strictly required, but it does
>> result in
>> + * greater accuracy for the TSC / kernel_ns measurement.
>> + */
>> + local_irq_save(flags);
>> + this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
>> + kvm_get_msr(v, MSR_IA32_TSC,&tsc_timestamp);
>
> That's a slow path, since it has to go through kvm_get_msr()'s if
> tree. Could use its own accessor.
>
> But this isn't introduced by this patch, so it can be fixed by another.
>
>> + ktime_get_ts(&ts);
>> + monotonic_to_bootbased(&ts);
>> + kernel_ns = timespec_to_ns(&ts);
>> + local_irq_restore(flags);
>> +
>> if (unlikely(this_tsc_khz == 0)) {
>> kvm_request_guest_time_update(v);
>> return 1;
>> }
>>
>> + /*
>> + * Time as measured by the TSC may go backwards when resetting
>> the base
>> + * tsc_timestamp. The reason for this is that the TSC
>> resolution is
>> + * higher than the resolution of the other clock scales. Thus,
>> many
>> + * possible measurments of the TSC correspond to one measurement
>> of any
>> + * other clock, and so a spread of values is possible. This is
>> not a
>> + * problem for the computation of the nanosecond clock; with TSC
>> rates
>> + * around 1GHZ, there can only be a few cycles which correspond
>> to one
>> + * nanosecond value, and any path through this code will inevitably
>> + * take longer than that. However, with the kernel_ns value
>> itself,
>> + * the precision may be much lower, down to HZ granularity. If the
>> + * first sampling of TSC against kernel_ns ends in the low part
>> of the
>> + * range, and the second in the high end of the range, we can get:
>> + *
>> + * (TSC - offset_low) * S + kns_old> (TSC - offset_high) * S +
>> kns_new
>> + *
>> + * As the sampling errors potentially range in the thousands of
>> cycles,
>> + * it is possible such a time value has already been observed by
>> the
>> + * guest. To protect against this, we must compute the system
>> time as
>> + * observed by the guest and ensure the new system time is greater.
>> + */
>> + max_kernel_ns = 0;
>> + if (vcpu->hv_clock.tsc_timestamp) {
>> + max_kernel_ns = vcpu->last_guest_tsc -
>> + vcpu->hv_clock.tsc_timestamp;
>> + max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
>> + vcpu->hv_clock.tsc_to_system_mul,
>> + vcpu->hv_clock.tsc_shift);
>> + max_kernel_ns += vcpu->last_kernel_ns;
>> + }
>> +
>> if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>> - kvm_set_time_scale(this_tsc_khz,&vcpu->hv_clock);
>> + kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
>> + &vcpu->hv_clock.tsc_shift,
>> + &vcpu->hv_clock.tsc_to_system_mul);
>> vcpu->hw_tsc_khz = this_tsc_khz;
>> }
>>
>> - /* Keep irq disabled to prevent changes to the clock */
>> - local_irq_save(flags);
>> - kvm_get_msr(v, MSR_IA32_TSC,&vcpu->hv_clock.tsc_timestamp);
>> - ktime_get_ts(&ts);
>> - monotonic_to_bootbased(&ts);
>> - local_irq_restore(flags);
>> + if (max_kernel_ns> kernel_ns) {
>> + s64 overshoot = max_kernel_ns - kernel_ns;
>> + ++v->stat.tsc_ahead;
>> + if (overshoot> NSEC_PER_SEC / HZ) {
>> + ++v->stat.tsc_overshoot;
>> + if (printk_ratelimit())
>> + pr_debug("ns overshoot: %lld\n", overshoot);
>> + }
>
> A tracepoint here would allow recording both the number of overshoots
> and the value of the overshoot. But I don't think this is of much use
> day-to-day.

FWIW, I was using this to track how often this case would hit and by how
much. Originally, tsc_ahead was firing near 100% and tsc_overshoot near
0%, but moving the observation of last_guest_tsc into the exit path
decreased both number to near zero. Obviously it's a bit hardware
dependent, as it matters how high resolution the kernel clocksource is
(and how recent your kernel).

I'll rip the stats stuff for sure.

2010-06-15 21:03:47

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 12/17] Add helper function get_kernel_ns

On 06/14/2010 10:41 PM, Avi Kivity wrote:
> On 06/15/2010 10:34 AM, Zachary Amsden wrote:
>> Add a helper function for the multiple places this is used. Note
>> that it
>> must not be called in preemptible context, as that would mean the kernel
>> could enter software suspend state, which would cause non-atomic
>> operation
>> of the monotonic_to_bootbased computation.
>>
>> Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this
>> as well? Currently, they are not bootbased (but perhaps should be).
>>
>> Signed-off-by: Zachary Amsden<[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 26 +++++++++++++-------------
>> 1 files changed, 13 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 703ea43..15c7317 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t
>> scaled_khz, uint32_t base_khz,
>> __func__, base_khz, scaled_khz, shift, *pmultiplier);
>> }
>>
>> +static inline u64 get_kernel_ns(void)
>> +{
>> + struct timespec ts;
>> +
>> + WARN_ON(preemptible());
>> + ktime_get_ts(&ts);
>> + monotonic_to_bootbased(&ts);
>> + return timespec_to_ns(&ts);
>> +}
>> +
>> static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
>>
>
> Isn't something like this a candidate for the time infrastructure?
>

Should it be? It certainly seems reasonable.


Also, should we be using it for the cases KVM_GET_CLOCK /
KVM_SET_CLOCK? It seems global kvmclock_offset for the VM ignores the
bootbased conversion.

Zach

2010-06-15 21:13:28

by john stultz

[permalink] [raw]
Subject: Re: [PATCH 12/17] Add helper function get_kernel_ns

On Tue, 2010-06-15 at 11:03 -1000, Zachary Amsden wrote:
> On 06/14/2010 10:41 PM, Avi Kivity wrote:
> > On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> >> Add a helper function for the multiple places this is used. Note
> >> that it
> >> must not be called in preemptible context, as that would mean the kernel
> >> could enter software suspend state, which would cause non-atomic
> >> operation
> >> of the monotonic_to_bootbased computation.
> >>
> >> Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this
> >> as well? Currently, they are not bootbased (but perhaps should be).
> >>
> >> Signed-off-by: Zachary Amsden<[email protected]>
> >> ---
> >> arch/x86/kvm/x86.c | 26 +++++++++++++-------------
> >> 1 files changed, 13 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index 703ea43..15c7317 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t
> >> scaled_khz, uint32_t base_khz,
> >> __func__, base_khz, scaled_khz, shift, *pmultiplier);
> >> }
> >>
> >> +static inline u64 get_kernel_ns(void)
> >> +{
> >> + struct timespec ts;
> >> +
> >> + WARN_ON(preemptible());
> >> + ktime_get_ts(&ts);
> >> + monotonic_to_bootbased(&ts);
> >> + return timespec_to_ns(&ts);
> >> +}
> >> +
> >> static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
> >>
> >
> > Isn't something like this a candidate for the time infrastructure?
> >
>
> Should it be? It certainly seems reasonable.

Yea, probably should move to timekeeping.c or time.h.

You might also want a more descriptive name, since get_kernel_ns()
doesn't really express that this is the bootbased monotonic time.

The similar sounding current_kernel_time() returns a coarse tick
granular CLOCK_REALTIME, so it could lead to confusion.

thanks
-john

2010-06-16 00:11:33

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

On Mon, Jun 14, 2010 at 09:34:13PM -1000, Zachary Amsden wrote:
> Kernel time, which advances in discrete steps may progress much slower
> than TSC. As a result, when kvmclock is adjusted to a new base, the
> apparent time to the guest, which runs at a much higher, nsec scaled
> rate based on the current TSC, may have already been observed to have
> a larger value (kernel_ns + scaled tsc) than the value to which we are
> setting it (kernel_ns + 0).
>
> We must instead compute the clock as potentially observed by the guest
> for kernel_ns to make sure it does not go backwards.
>
> Signed-off-by: Zachary Amsden <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 ++
> arch/x86/kvm/x86.c | 79 +++++++++++++++++++++++++++++++++------
> 2 files changed, 71 insertions(+), 12 deletions(-)
>

> + /*
> + * The protection we require is simple: we must not be preempted from
> + * the CPU between our read of the TSC khz and our read of the TSC.
> + * Interrupt protection is not strictly required, but it does result in
> + * greater accuracy for the TSC / kernel_ns measurement.
> + */
> + local_irq_save(flags);
> + this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
> + kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> + ktime_get_ts(&ts);
> + monotonic_to_bootbased(&ts);
> + kernel_ns = timespec_to_ns(&ts);
> + local_irq_restore(flags);
> +
> if (unlikely(this_tsc_khz == 0)) {
> kvm_request_guest_time_update(v);
> return 1;
> }
>
> + /*
> + * Time as measured by the TSC may go backwards when resetting the base
> + * tsc_timestamp. The reason for this is that the TSC resolution is
> + * higher than the resolution of the other clock scales. Thus, many
> + * possible measurments of the TSC correspond to one measurement of any
> + * other clock, and so a spread of values is possible. This is not a
> + * problem for the computation of the nanosecond clock; with TSC rates
> + * around 1GHZ, there can only be a few cycles which correspond to one
> + * nanosecond value, and any path through this code will inevitably
> + * take longer than that. However, with the kernel_ns value itself,
> + * the precision may be much lower, down to HZ granularity. If the
> + * first sampling of TSC against kernel_ns ends in the low part of the
> + * range, and the second in the high end of the range, we can get:
> + *
> + * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
> + *
> + * As the sampling errors potentially range in the thousands of cycles,
> + * it is possible such a time value has already been observed by the
> + * guest. To protect against this, we must compute the system time as
> + * observed by the guest and ensure the new system time is greater.
> + */
> + max_kernel_ns = 0;
> + if (vcpu->hv_clock.tsc_timestamp) {
> + max_kernel_ns = vcpu->last_guest_tsc -
> + vcpu->hv_clock.tsc_timestamp;
> + max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
> + vcpu->hv_clock.tsc_to_system_mul,
> + vcpu->hv_clock.tsc_shift);
> + max_kernel_ns += vcpu->last_kernel_ns;
> + }
> +
> if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
> - kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
> + kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
> + &vcpu->hv_clock.tsc_shift,
> + &vcpu->hv_clock.tsc_to_system_mul);
> vcpu->hw_tsc_khz = this_tsc_khz;
> }
>
> - /* Keep irq disabled to prevent changes to the clock */
> - local_irq_save(flags);
> - kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp);
> - ktime_get_ts(&ts);
> - monotonic_to_bootbased(&ts);
> - local_irq_restore(flags);
> + if (max_kernel_ns > kernel_ns) {
> + s64 overshoot = max_kernel_ns - kernel_ns;
> + ++v->stat.tsc_ahead;
> + if (overshoot > NSEC_PER_SEC / HZ) {
> + ++v->stat.tsc_overshoot;
> + if (printk_ratelimit())
> + pr_debug("ns overshoot: %lld\n", overshoot);
> + }
> + kernel_ns = max_kernel_ns;
> + }
>
> /* With all the info we got, fill in the values */
> -
> - vcpu->hv_clock.system_time = ts.tv_nsec +
> - (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset;
> + vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
> + vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
> + vcpu->last_kernel_ns = kernel_ns;
>
> vcpu->hv_clock.flags = 0;
>
> @@ -4836,6 +4889,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> if (hw_breakpoint_active())
> hw_breakpoint_restore();
>
> + kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +
> atomic_set(&vcpu->guest_mode, 0);
> smp_wmb();
> local_irq_enable();

Is this still needed with the guest side global counter fix?

2010-06-16 00:21:22

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

On 06/15/2010 01:47 PM, Marcelo Tosatti wrote:
> On Mon, Jun 14, 2010 at 09:34:13PM -1000, Zachary Amsden wrote:
>
>> Kernel time, which advances in discrete steps may progress much slower
>> than TSC. As a result, when kvmclock is adjusted to a new base, the
>> apparent time to the guest, which runs at a much higher, nsec scaled
>> rate based on the current TSC, may have already been observed to have
>> a larger value (kernel_ns + scaled tsc) than the value to which we are
>> setting it (kernel_ns + 0).
>>
>> We must instead compute the clock as potentially observed by the guest
>> for kernel_ns to make sure it does not go backwards.
>>
>> Signed-off-by: Zachary Amsden<[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 4 ++
>> arch/x86/kvm/x86.c | 79 +++++++++++++++++++++++++++++++++------
>> 2 files changed, 71 insertions(+), 12 deletions(-)
>>
>>
>
>> + /*
>> + * The protection we require is simple: we must not be preempted from
>> + * the CPU between our read of the TSC khz and our read of the TSC.
>> + * Interrupt protection is not strictly required, but it does result in
>> + * greater accuracy for the TSC / kernel_ns measurement.
>> + */
>> + local_irq_save(flags);
>> + this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
>> + kvm_get_msr(v, MSR_IA32_TSC,&tsc_timestamp);
>> + ktime_get_ts(&ts);
>> + monotonic_to_bootbased(&ts);
>> + kernel_ns = timespec_to_ns(&ts);
>> + local_irq_restore(flags);
>> +
>> if (unlikely(this_tsc_khz == 0)) {
>> kvm_request_guest_time_update(v);
>> return 1;
>> }
>>
>> + /*
>> + * Time as measured by the TSC may go backwards when resetting the base
>> + * tsc_timestamp. The reason for this is that the TSC resolution is
>> + * higher than the resolution of the other clock scales. Thus, many
>> + * possible measurments of the TSC correspond to one measurement of any
>> + * other clock, and so a spread of values is possible. This is not a
>> + * problem for the computation of the nanosecond clock; with TSC rates
>> + * around 1GHZ, there can only be a few cycles which correspond to one
>> + * nanosecond value, and any path through this code will inevitably
>> + * take longer than that. However, with the kernel_ns value itself,
>> + * the precision may be much lower, down to HZ granularity. If the
>> + * first sampling of TSC against kernel_ns ends in the low part of the
>> + * range, and the second in the high end of the range, we can get:
>> + *
>> + * (TSC - offset_low) * S + kns_old> (TSC - offset_high) * S + kns_new
>> + *
>> + * As the sampling errors potentially range in the thousands of cycles,
>> + * it is possible such a time value has already been observed by the
>> + * guest. To protect against this, we must compute the system time as
>> + * observed by the guest and ensure the new system time is greater.
>> + */
>> + max_kernel_ns = 0;
>> + if (vcpu->hv_clock.tsc_timestamp) {
>> + max_kernel_ns = vcpu->last_guest_tsc -
>> + vcpu->hv_clock.tsc_timestamp;
>> + max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
>> + vcpu->hv_clock.tsc_to_system_mul,
>> + vcpu->hv_clock.tsc_shift);
>> + max_kernel_ns += vcpu->last_kernel_ns;
>> + }
>> +
>> if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>> - kvm_set_time_scale(this_tsc_khz,&vcpu->hv_clock);
>> + kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
>> + &vcpu->hv_clock.tsc_shift,
>> + &vcpu->hv_clock.tsc_to_system_mul);
>> vcpu->hw_tsc_khz = this_tsc_khz;
>> }
>>
>> - /* Keep irq disabled to prevent changes to the clock */
>> - local_irq_save(flags);
>> - kvm_get_msr(v, MSR_IA32_TSC,&vcpu->hv_clock.tsc_timestamp);
>> - ktime_get_ts(&ts);
>> - monotonic_to_bootbased(&ts);
>> - local_irq_restore(flags);
>> + if (max_kernel_ns> kernel_ns) {
>> + s64 overshoot = max_kernel_ns - kernel_ns;
>> + ++v->stat.tsc_ahead;
>> + if (overshoot> NSEC_PER_SEC / HZ) {
>> + ++v->stat.tsc_overshoot;
>> + if (printk_ratelimit())
>> + pr_debug("ns overshoot: %lld\n", overshoot);
>> + }
>> + kernel_ns = max_kernel_ns;
>> + }
>>
>> /* With all the info we got, fill in the values */
>> -
>> - vcpu->hv_clock.system_time = ts.tv_nsec +
>> - (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset;
>> + vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
>> + vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
>> + vcpu->last_kernel_ns = kernel_ns;
>>
>> vcpu->hv_clock.flags = 0;
>>
>> @@ -4836,6 +4889,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>> if (hw_breakpoint_active())
>> hw_breakpoint_restore();
>>
>> + kvm_get_msr(vcpu, MSR_IA32_TSC,&vcpu->arch.last_guest_tsc);
>> +
>> atomic_set(&vcpu->guest_mode, 0);
>> smp_wmb();
>> local_irq_enable();
>>
> Is this still needed with the guest side global counter fix?
>

It's debatable. Instrumentation showed this happen 100% of the time
when measuring TSC in the compensation sequence. When measuring TSC in
the hot-path exit from hardware virt, before interrupts are enabled, the
compensation rate drops to 0%.

That's with an HPET clocksource for kernel time. Kernels with less
accurate and more granular clocksources would have worse problems, of
course.

If we're ever going to turn on the "kvmclock is reliable" bit, though, I
think at least paying attention to the potential need for compensation
is required - it technically is a backwards warp of time, and even if we
spend so long getting out of and back into hardware virtualization that
the guest can't notice it today, that might not be true on a faster
processor.

Zach

2010-06-16 00:27:32

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 16/17] TSC reset compensation

On Mon, Jun 14, 2010 at 09:34:18PM -1000, Zachary Amsden wrote:
> Attempt to synchronize TSCs which are reset to the same value. In the
> case of a reliable hardware TSC, we can just re-use the same offset, but
> on non-reliable hardware, we can get closer by adjusting the offset to
> match the elapsed time.
>
> Signed-off-by: Zachary Amsden <[email protected]>
> ---
> arch/x86/kvm/x86.c | 34 ++++++++++++++++++++++++++++++++--
> 1 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8e836e9..cedb71f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -937,14 +937,44 @@ static inline void kvm_request_guest_time_update(struct kvm_vcpu *v)
> set_bit(KVM_REQ_CLOCK_SYNC, &v->requests);
> }
>
> +static inline int kvm_tsc_reliable(void)
> +{
> + return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> + !check_tsc_unstable());
> +}
> +
> void guest_write_tsc(struct kvm_vcpu *vcpu, u64 data)
> {
> struct kvm *kvm = vcpu->kvm;
> - u64 offset;
> + u64 offset, ns, elapsed;
>
> spin_lock(&kvm->arch.tsc_write_lock);
> offset = data - native_read_tsc();
> - kvm->arch.last_tsc_nsec = get_kernel_ns();
> + ns = get_kernel_ns();
> + elapsed = ns - kvm->arch.last_tsc_nsec;
> +
> + /*
> + * Special case: identical write to TSC within 5 seconds of
> + * another CPU is interpreted as an attempt to synchronize
> + * (the 5 seconds is to accomodate host load / swapping).
> + *
> + * In that case, for a reliable TSC, we can match TSC offsets,
> + * or make a best guest using kernel_ns value.
> + */
> + if (data == kvm->arch.last_tsc_write && elapsed < 5 * NSEC_PER_SEC) {
> + if (kvm_tsc_reliable()) {
> + offset = kvm->arch.last_tsc_offset;
> + pr_debug("kvm: matched tsc offset for %llu\n", data);
> + } else {
> + u64 tsc_delta = elapsed * __get_cpu_var(cpu_tsc_khz);
> + tsc_delta = tsc_delta / USEC_PER_SEC;
> + offset -= tsc_delta;
> + pr_debug("kvm: adjusted tsc offset by %llu\n", tsc_delta);
> + }
> + ns = kvm->arch.last_tsc_nsec;
> + }
> + kvm->arch.last_tsc_nsec = ns;
> kvm->arch.last_tsc_write = data;
> kvm->arch.last_tsc_offset = offset;
> kvm_x86_ops->write_tsc_offset(vcpu, offset);
> --

Could extend this to handle migration.

2010-06-16 00:32:24

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 16/17] TSC reset compensation

On 06/15/2010 02:27 PM, Marcelo Tosatti wrote:
> On Mon, Jun 14, 2010 at 09:34:18PM -1000, Zachary Amsden wrote:
>
>> Attempt to synchronize TSCs which are reset to the same value. In the
>> case of a reliable hardware TSC, we can just re-use the same offset, but
>> on non-reliable hardware, we can get closer by adjusting the offset to
>> match the elapsed time.
>>
>> Signed-off-by: Zachary Amsden<[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 34 ++++++++++++++++++++++++++++++++--
>> 1 files changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 8e836e9..cedb71f 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -937,14 +937,44 @@ static inline void kvm_request_guest_time_update(struct kvm_vcpu *v)
>> set_bit(KVM_REQ_CLOCK_SYNC,&v->requests);
>> }
>>
>> +static inline int kvm_tsc_reliable(void)
>> +{
>> + return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)&&
>> + boot_cpu_has(X86_FEATURE_NONSTOP_TSC)&&
>> + !check_tsc_unstable());
>> +}
>> +
>> void guest_write_tsc(struct kvm_vcpu *vcpu, u64 data)
>> {
>> struct kvm *kvm = vcpu->kvm;
>> - u64 offset;
>> + u64 offset, ns, elapsed;
>>
>> spin_lock(&kvm->arch.tsc_write_lock);
>> offset = data - native_read_tsc();
>> - kvm->arch.last_tsc_nsec = get_kernel_ns();
>> + ns = get_kernel_ns();
>> + elapsed = ns - kvm->arch.last_tsc_nsec;
>> +
>> + /*
>> + * Special case: identical write to TSC within 5 seconds of
>> + * another CPU is interpreted as an attempt to synchronize
>> + * (the 5 seconds is to accomodate host load / swapping).
>> + *
>> + * In that case, for a reliable TSC, we can match TSC offsets,
>> + * or make a best guest using kernel_ns value.
>> + */
>> + if (data == kvm->arch.last_tsc_write&& elapsed< 5 * NSEC_PER_SEC) {
>> + if (kvm_tsc_reliable()) {
>> + offset = kvm->arch.last_tsc_offset;
>> + pr_debug("kvm: matched tsc offset for %llu\n", data);
>> + } else {
>> + u64 tsc_delta = elapsed * __get_cpu_var(cpu_tsc_khz);
>> + tsc_delta = tsc_delta / USEC_PER_SEC;
>> + offset -= tsc_delta;
>> + pr_debug("kvm: adjusted tsc offset by %llu\n", tsc_delta);
>> + }
>> + ns = kvm->arch.last_tsc_nsec;
>> + }
>> + kvm->arch.last_tsc_nsec = ns;
>> kvm->arch.last_tsc_write = data;
>> kvm->arch.last_tsc_offset = offset;
>> kvm_x86_ops->write_tsc_offset(vcpu, offset);
>> --
>>
> Could extend this to handle migration.
>

Also, this could be extended to cover the kvmclock variables themselves;
then, if tsc is reliable, we need not ever recalibrate the kvmclock. In
fact, all VMs would have the same parameters for kvmclock in that case,
just with a different kvm->arch.kvmclock_offset.

Zach

2010-06-16 00:39:48

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

On Tue, Jun 15, 2010 at 02:21:17PM -1000, Zachary Amsden wrote:
> On 06/15/2010 01:47 PM, Marcelo Tosatti wrote:
> >On Mon, Jun 14, 2010 at 09:34:13PM -1000, Zachary Amsden wrote:
> >>Kernel time, which advances in discrete steps may progress much slower
> >>than TSC. As a result, when kvmclock is adjusted to a new base, the
> >>apparent time to the guest, which runs at a much higher, nsec scaled
> >>rate based on the current TSC, may have already been observed to have
> >>a larger value (kernel_ns + scaled tsc) than the value to which we are
> >>setting it (kernel_ns + 0).
> >>
> >>We must instead compute the clock as potentially observed by the guest
> >>for kernel_ns to make sure it does not go backwards.
> >>
> >>Signed-off-by: Zachary Amsden<[email protected]>
> >>---
> >> arch/x86/include/asm/kvm_host.h | 4 ++
> >> arch/x86/kvm/x86.c | 79 +++++++++++++++++++++++++++++++++------
> >> 2 files changed, 71 insertions(+), 12 deletions(-)
> >>
> >>+ /*
> >>+ * The protection we require is simple: we must not be preempted from
> >>+ * the CPU between our read of the TSC khz and our read of the TSC.
> >>+ * Interrupt protection is not strictly required, but it does result in
> >>+ * greater accuracy for the TSC / kernel_ns measurement.
> >>+ */
> >>+ local_irq_save(flags);
> >>+ this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
> >>+ kvm_get_msr(v, MSR_IA32_TSC,&tsc_timestamp);
> >>+ ktime_get_ts(&ts);
> >>+ monotonic_to_bootbased(&ts);
> >>+ kernel_ns = timespec_to_ns(&ts);
> >>+ local_irq_restore(flags);
> >>+
> >> if (unlikely(this_tsc_khz == 0)) {
> >> kvm_request_guest_time_update(v);
> >> return 1;
> >> }
> >>
> >>+ /*
> >>+ * Time as measured by the TSC may go backwards when resetting the base
> >>+ * tsc_timestamp. The reason for this is that the TSC resolution is
> >>+ * higher than the resolution of the other clock scales. Thus, many
> >>+ * possible measurments of the TSC correspond to one measurement of any
> >>+ * other clock, and so a spread of values is possible. This is not a
> >>+ * problem for the computation of the nanosecond clock; with TSC rates
> >>+ * around 1GHZ, there can only be a few cycles which correspond to one
> >>+ * nanosecond value, and any path through this code will inevitably
> >>+ * take longer than that. However, with the kernel_ns value itself,
> >>+ * the precision may be much lower, down to HZ granularity. If the
> >>+ * first sampling of TSC against kernel_ns ends in the low part of the
> >>+ * range, and the second in the high end of the range, we can get:
> >>+ *
> >>+ * (TSC - offset_low) * S + kns_old> (TSC - offset_high) * S + kns_new
> >>+ *
> >>+ * As the sampling errors potentially range in the thousands of cycles,
> >>+ * it is possible such a time value has already been observed by the
> >>+ * guest. To protect against this, we must compute the system time as
> >>+ * observed by the guest and ensure the new system time is greater.
> >>+ */
> >>+ max_kernel_ns = 0;
> >>+ if (vcpu->hv_clock.tsc_timestamp) {
> >>+ max_kernel_ns = vcpu->last_guest_tsc -
> >>+ vcpu->hv_clock.tsc_timestamp;
> >>+ max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
> >>+ vcpu->hv_clock.tsc_to_system_mul,
> >>+ vcpu->hv_clock.tsc_shift);
> >>+ max_kernel_ns += vcpu->last_kernel_ns;
> >>+ }
> >>+
> >> if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
> >>- kvm_set_time_scale(this_tsc_khz,&vcpu->hv_clock);
> >>+ kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
> >>+ &vcpu->hv_clock.tsc_shift,
> >>+ &vcpu->hv_clock.tsc_to_system_mul);
> >> vcpu->hw_tsc_khz = this_tsc_khz;
> >> }
> >>
> >>- /* Keep irq disabled to prevent changes to the clock */
> >>- local_irq_save(flags);
> >>- kvm_get_msr(v, MSR_IA32_TSC,&vcpu->hv_clock.tsc_timestamp);
> >>- ktime_get_ts(&ts);
> >>- monotonic_to_bootbased(&ts);
> >>- local_irq_restore(flags);
> >>+ if (max_kernel_ns> kernel_ns) {
> >>+ s64 overshoot = max_kernel_ns - kernel_ns;
> >>+ ++v->stat.tsc_ahead;
> >>+ if (overshoot> NSEC_PER_SEC / HZ) {
> >>+ ++v->stat.tsc_overshoot;
> >>+ if (printk_ratelimit())
> >>+ pr_debug("ns overshoot: %lld\n", overshoot);
> >>+ }
> >>+ kernel_ns = max_kernel_ns;
> >>+ }
> >>
> >> /* With all the info we got, fill in the values */
> >>-
> >>- vcpu->hv_clock.system_time = ts.tv_nsec +
> >>- (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset;
> >>+ vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
> >>+ vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
> >>+ vcpu->last_kernel_ns = kernel_ns;
> >>
> >> vcpu->hv_clock.flags = 0;
> >>
> >>@@ -4836,6 +4889,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >> if (hw_breakpoint_active())
> >> hw_breakpoint_restore();
> >>
> >>+ kvm_get_msr(vcpu, MSR_IA32_TSC,&vcpu->arch.last_guest_tsc);
> >>+
> >> atomic_set(&vcpu->guest_mode, 0);
> >> smp_wmb();
> >> local_irq_enable();
> >Is this still needed with the guest side global counter fix?
>
> It's debatable. Instrumentation showed this happen 100% of the time
> when measuring TSC in the compensation sequence. When measuring TSC
> in the hot-path exit from hardware virt, before interrupts are
> enabled, the compensation rate drops to 0%.
>
> That's with an HPET clocksource for kernel time. Kernels with less
> accurate and more granular clocksources would have worse problems,
> of course.
>
> If we're ever going to turn on the "kvmclock is reliable" bit,
> though, I think at least paying attention to the potential need for
> compensation is required - it technically is a backwards warp of
> time, and even if we spend so long getting out of and back into
> hardware virtualization that the guest can't notice it today, that
> might not be true on a faster processor.
>
> Zach

What is worrying is that if this keeps happening the guest clock will
advance faster then it should. The solution you had before with "if
(kernel_ns <= last_ns) compensate()" was simpler and more resistant in
that respect, if i'm not missing anything.

2010-06-16 08:04:37

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 03/17] Unify vendor TSC logic

Zachary Amsden wrote:
> Move the TSC control logic from the vendor backends into x86.c
> by adding adjust_tsc_offset to x86 ops. Now all TSC decisions
> can be done in one place.
>
> Also, rename some variable in the VCPU structure to more accurately
> reflect their actual content.
>
> VMX backend would record last observed TSC very late (possibly in an
> IPI to clear the VMCS from a remote CPU), now it is done earlier.
>
> Note this is still not yet perfect. We adjust TSC in both
> directions when the TSC is unstable, resulting in desynchronized
> TSCs for SMP guests. A better choice would be to compensate based
> on a master reference clock.
>
> Signed-off-by: Zachary Amsden <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 5 +++--
> arch/x86/kvm/svm.c | 25 +++++++++++--------------
> arch/x86/kvm/vmx.c | 20 ++++++++------------
> arch/x86/kvm/x86.c | 16 +++++++++++-----
> 4 files changed, 33 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 91631b8..94f6ce8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -253,7 +253,6 @@ struct kvm_mmu {
> };
>
> struct kvm_vcpu_arch {
> - u64 host_tsc;
> /*
> * rip and regs accesses must go through
> * kvm_{register,rip}_{read,write} functions.
> @@ -334,9 +333,10 @@ struct kvm_vcpu_arch {
>
> gpa_t time;
> struct pvclock_vcpu_time_info hv_clock;
> - unsigned int hv_clock_tsc_khz;
> + unsigned int hw_tsc_khz;
> unsigned int time_offset;
> struct page *time_page;
> + u64 last_host_tsc;
>
> bool nmi_pending;
> bool nmi_injected;
> @@ -530,6 +530,7 @@ struct kvm_x86_ops {
> u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> int (*get_lpage_level)(void);
> bool (*rdtscp_supported)(void);
> + void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
>
> void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 09b2145..ee2cf30 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -948,18 +948,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> int i;
>
> if (unlikely(cpu != vcpu->cpu)) {
> - u64 delta;
> -
> - if (check_tsc_unstable()) {
> - /*
> - * Make sure that the guest sees a monotonically
> - * increasing TSC.
> - */
> - delta = vcpu->arch.host_tsc - native_read_tsc();
> - svm->vmcb->control.tsc_offset += delta;
> - if (is_nested(svm))
> - svm->nested.hsave->control.tsc_offset += delta;
> - }
> svm->asid_generation = 0;
> }
>
> @@ -975,8 +963,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
> ++vcpu->stat.host_state_reload;
> for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
> wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
> -
> - vcpu->arch.host_tsc = native_read_tsc();
> }
>
> static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
> @@ -3422,6 +3408,15 @@ static bool svm_rdtscp_supported(void)
> return false;
> }
>
> +static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
> +{
> + struct vcpu_svm *svm = to_svm(vcpu);
> +
> + svm->vmcb->control.tsc_offset += adjustment;
> + if (is_nested(svm))
> + svm->nested.hsave->control.tsc_offset += adjustment;
> +}
> +
> static void svm_fpu_deactivate(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> @@ -3506,6 +3501,8 @@ static struct kvm_x86_ops svm_x86_ops = {
> .rdtscp_supported = svm_rdtscp_supported,
>
> .set_supported_cpuid = svm_set_supported_cpuid,
> +
> + .adjust_tsc_offset = svm_adjust_tsc_offset,
> };
>
> static int __init svm_init(void)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a657e08..a993e67 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -498,7 +498,6 @@ static void __vcpu_clear(void *arg)
> vmcs_clear(vmx->vmcs);
> if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
> per_cpu(current_vmcs, cpu) = NULL;
> - rdtscll(vmx->vcpu.arch.host_tsc);
> list_del(&vmx->local_vcpus_link);
> vmx->vcpu.cpu = -1;
> vmx->launched = 0;
> @@ -881,7 +880,6 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
> static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> - u64 tsc_this, delta, new_offset;
> u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
>
> if (!vmm_exclusive)
> @@ -914,16 +912,6 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>
> rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
> vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
> -
> - /*
> - * Make sure the time stamp counter is monotonous.
> - */
> - rdtscll(tsc_this);
> - if (tsc_this < vcpu->arch.host_tsc) {
> - delta = vcpu->arch.host_tsc - tsc_this;
> - new_offset = vmcs_read64(TSC_OFFSET) + delta;
> - vmcs_write64(TSC_OFFSET, new_offset);
> - }
> }
> }
>
> @@ -1153,6 +1141,12 @@ static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
> vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
> }
>
> +static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
> +{
> + u64 offset = vmcs_read64(TSC_OFFSET);
> + vmcs_write64(TSC_OFFSET, offset + adjustment);
> +}
> +
> /*
> * Reads an msr value (of 'msr_index') into 'pdata'.
> * Returns 0 on success, non-0 otherwise.
> @@ -4340,6 +4334,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
> .rdtscp_supported = vmx_rdtscp_supported,
>
> .set_supported_cpuid = vmx_set_supported_cpuid,
> +
> + .adjust_tsc_offset = vmx_adjust_tsc_offset,
> };
>
> static int __init vmx_init(void)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0a102d3..c8289d0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -929,9 +929,9 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
> return 1;
> }
>
> - if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) {
> + if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
> kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
> - vcpu->hv_clock_tsc_khz = this_tsc_khz;
> + vcpu->hw_tsc_khz = this_tsc_khz;
> }
>
> /* Keep irq disabled to prevent changes to the clock */
> @@ -1805,18 +1805,24 @@ out:
>
> void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> {
> + kvm_x86_ops->vcpu_load(vcpu, cpu);
> if (unlikely(vcpu->cpu != cpu)) {
> + /* Make sure TSC doesn't go backwards */
> + s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
> + native_read_tsc() - vcpu->arch.last_host_tsc;
> + if (tsc_delta < 0 || check_tsc_unstable())
>
It's better to do the adjustment also when tsc_delta > 0
> + kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
> kvm_migrate_timers(vcpu);
> + kvm_request_guest_time_update(vcpu);
> + vcpu->cpu = cpu;
> }
> - kvm_x86_ops->vcpu_load(vcpu, cpu);
> - kvm_request_guest_time_update(vcpu);
> - vcpu->cpu = cpu;
> }
>
> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> {
> kvm_x86_ops->vcpu_put(vcpu);
> kvm_put_guest_fpu(vcpu);
> + vcpu->arch.last_host_tsc = native_read_tsc();
> }
>
> static int is_efer_nx(void)
>

2010-06-16 08:05:17

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 04/17] Fix deep C-state TSC desynchronization

Zachary Amsden wrote:
> When CPUs with unstable TSCs enter deep C-state, TSC may stop
> running. This causes us to require resynchronization. Since
> we can't tell when this may potentially happen, we assume the
> worst by forcing re-compensation for it at every point the VCPU
> task is descheduled.
>
>
Is there any method to do the compensation on host TSC when recovering
from C-state? It should be simpler than do it on guest TSC.
> Signed-off-by: Zachary Amsden <[email protected]>
> ---
> arch/x86/kvm/x86.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c8289d0..618c435 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1822,7 +1822,18 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> {
> kvm_x86_ops->vcpu_put(vcpu);
> kvm_put_guest_fpu(vcpu);
> +
> vcpu->arch.last_host_tsc = native_read_tsc();
> +
> + /*
> + * When potentially leaving a CPU with unstable TSCs, we risk
> + * that the CPU enters deep C-state. If it does, the TSC may
> + * go out of sync but we will not recalibrate because the test
> + * vcpu->cpu != cpu can not detect this condition. So set
> + * vcpu->cpu = -1 to force the recalibration above.
> + */
> + if (check_tsc_unstable())
> + vcpu->cpu = -1;
> }
>
> static int is_efer_nx(void)
>

2010-06-16 08:05:33

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 05/17] Keep SMP VMs more in sync on unstable TSC

Zachary Amsden wrote:
> SMP VMs on machines with unstable TSC have their TSC offset adjusted by the
> local offset delta from last measurement. This does not take into account how
> long it has been since the measurement, leading to drift. Minimize the drift
> by accounting for any time difference the kernel has observed.
>
> Signed-off-by: Zachary Amsden <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 20 +++++++++++++++++++-
> 2 files changed, 20 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 94f6ce8..1afecd7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -337,6 +337,7 @@ struct kvm_vcpu_arch {
> unsigned int time_offset;
> struct page *time_page;
> u64 last_host_tsc;
> + u64 last_host_ns;
>
> bool nmi_pending;
> bool nmi_injected;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 618c435..b1bdf05 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1810,6 +1810,19 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> /* Make sure TSC doesn't go backwards */
> s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
> native_read_tsc() - vcpu->arch.last_host_tsc;
> +
> + /* Subtract elapsed cycle time from the delta computation */
> + if (check_tsc_unstable() && vcpu->arch.last_host_ns) {
> + s64 delta;
> + struct timespec ts;
> + ktime_get_ts(&ts);
> + monotonic_to_bootbased(&ts);
> + delta = timespec_to_ns(&ts) - vcpu->arch.last_host_ns;
> + delta = delta * per_cpu(cpu_tsc_khz, cpu);
>
This seems not work well on a cpu w/o CONSTANT_TSC.
> + delta = delta / USEC_PER_SEC;
> + tsc_delta -= delta;
> + }
> +
> if (tsc_delta < 0 || check_tsc_unstable())
> kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
> kvm_migrate_timers(vcpu);
> @@ -1832,8 +1845,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> * vcpu->cpu != cpu can not detect this condition. So set
> * vcpu->cpu = -1 to force the recalibration above.
> */
> - if (check_tsc_unstable())
> + if (check_tsc_unstable()) {
> + struct timespec ts;
> + ktime_get_ts(&ts);
> + monotonic_to_bootbased(&ts);
> + vcpu->arch.last_host_ns = timespec_to_ns(&ts);
> vcpu->cpu = -1;
> + }
> }
>
> static int is_efer_nx(void)
>

2010-06-16 08:05:54

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

Zachary Amsden wrote:
> Kernel time, which advances in discrete steps may progress much slower
> than TSC. As a result, when kvmclock is adjusted to a new base, the
> apparent time to the guest, which runs at a much higher, nsec scaled
> rate based on the current TSC, may have already been observed to have
> a larger value (kernel_ns + scaled tsc) than the value to which we are
> setting it (kernel_ns + 0).
>
This is one issue of kvmclock which tries to supply a clocksource whose
precision may even higher than host.
> We must instead compute the clock as potentially observed by the guest
> for kernel_ns to make sure it does not go backwards.
>
> Signed-off-by: Zachary Amsden <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 ++
> arch/x86/kvm/x86.c | 79 +++++++++++++++++++++++++++++++++------
> 2 files changed, 71 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1afecd7..7ec2472 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -338,6 +338,8 @@ struct kvm_vcpu_arch {
> struct page *time_page;
> u64 last_host_tsc;
> u64 last_host_ns;
> + u64 last_guest_tsc;
> + u64 last_kernel_ns;
>
> bool nmi_pending;
> bool nmi_injected;
> @@ -455,6 +457,8 @@ struct kvm_vcpu_stat {
> u32 hypercalls;
> u32 irq_injections;
> u32 nmi_injections;
> + u32 tsc_overshoot;
> + u32 tsc_ahead;
> };
>
> struct kvm_x86_ops {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 52d7d34..703ea43 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -138,6 +138,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> { "insn_emulation_fail", VCPU_STAT(insn_emulation_fail) },
> { "irq_injections", VCPU_STAT(irq_injections) },
> { "nmi_injections", VCPU_STAT(nmi_injections) },
> + { "tsc_overshoot", VCPU_STAT(tsc_overshoot) },
> + { "tsc_ahead", VCPU_STAT(tsc_ahead) },
> { "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
> { "mmu_pte_write", VM_STAT(mmu_pte_write) },
> { "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
> @@ -927,33 +929,84 @@ static int kvm_recompute_guest_time(struct kvm_vcpu *v)
> struct kvm_vcpu_arch *vcpu = &v->arch;
> void *shared_kaddr;
> unsigned long this_tsc_khz;
> + s64 kernel_ns, max_kernel_ns;
> + u64 tsc_timestamp;
>
> if ((!vcpu->time_page))
> return 0;
>
> - this_tsc_khz = get_cpu_var(cpu_tsc_khz);
> - put_cpu_var(cpu_tsc_khz);
> + /*
> + * The protection we require is simple: we must not be preempted from
> + * the CPU between our read of the TSC khz and our read of the TSC.
> + * Interrupt protection is not strictly required, but it does result in
> + * greater accuracy for the TSC / kernel_ns measurement.
> + */
> + local_irq_save(flags);
> + this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
> + kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> + ktime_get_ts(&ts);
> + monotonic_to_bootbased(&ts);
> + kernel_ns = timespec_to_ns(&ts);
> + local_irq_restore(flags);
> +
> if (unlikely(this_tsc_khz == 0)) {
> kvm_request_guest_time_update(v);
> return 1;
> }
>
> + /*
> + * Time as measured by the TSC may go backwards when resetting the base
> + * tsc_timestamp. The reason for this is that the TSC resolution is
> + * higher than the resolution of the other clock scales. Thus, many
> + * possible measurments of the TSC correspond to one measurement of any
> + * other clock, and so a spread of values is possible. This is not a
> + * problem for the computation of the nanosecond clock; with TSC rates
> + * around 1GHZ, there can only be a few cycles which correspond to one
> + * nanosecond value, and any path through this code will inevitably
> + * take longer than that. However, with the kernel_ns value itself,
> + * the precision may be much lower, down to HZ granularity. If the
> + * first sampling of TSC against kernel_ns ends in the low part of the
> + * range, and the second in the high end of the range, we can get:
> + *
> + * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
> + *
> + * As the sampling errors potentially range in the thousands of cycles,
> + * it is possible such a time value has already been observed by the
> + * guest. To protect against this, we must compute the system time as
> + * observed by the guest and ensure the new system time is greater.
> + */
> + max_kernel_ns = 0;
> + if (vcpu->hv_clock.tsc_timestamp) {
> + max_kernel_ns = vcpu->last_guest_tsc -
>
Since you do the comparison with kernel_ns, so what you need here is
tsc_timestamp which looks more like the 'last' tsc seen by guest.
> + vcpu->hv_clock.tsc_timestamp;
> + max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
> + vcpu->hv_clock.tsc_to_system_mul,
> + vcpu->hv_clock.tsc_shift);
> + max_kernel_ns += vcpu->last_kernel_ns;
> + }
> +
> if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
> - kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
> + kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
> + &vcpu->hv_clock.tsc_shift,
> + &vcpu->hv_clock.tsc_to_system_mul);
> vcpu->hw_tsc_khz = this_tsc_khz;
> }
>
> - /* Keep irq disabled to prevent changes to the clock */
> - local_irq_save(flags);
> - kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp);
> - ktime_get_ts(&ts);
> - monotonic_to_bootbased(&ts);
> - local_irq_restore(flags);
> + if (max_kernel_ns > kernel_ns) {
>
Both max_kernel_ns and kernel_ns are not adjusted by kvmclock_offset, so
this comparing is not safe after migration.
> + s64 overshoot = max_kernel_ns - kernel_ns;
> + ++v->stat.tsc_ahead;
> + if (overshoot > NSEC_PER_SEC / HZ) {
> + ++v->stat.tsc_overshoot;
> + if (printk_ratelimit())
> + pr_debug("ns overshoot: %lld\n", overshoot);
> + }
> + kernel_ns = max_kernel_ns;
> + }
>
A tsc_behind or something like this would make the problem more clear,
and tsc_ahead should be zero when host using tsc as its clocksource.
>
> /* With all the info we got, fill in the values */
> -
> - vcpu->hv_clock.system_time = ts.tv_nsec +
> - (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset;
> + vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
> + vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
> + vcpu->last_kernel_ns = kernel_ns;
>
> vcpu->hv_clock.flags = 0;
>
> @@ -4836,6 +4889,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> if (hw_breakpoint_active())
> hw_breakpoint_restore();
>
> + kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>
This could be dropped since it does not take the time of guest execution
into account.
> +
> atomic_set(&vcpu->guest_mode, 0);
> smp_wmb();
> local_irq_enable();
>

2010-06-16 08:07:18

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 12/17] Add helper function get_kernel_ns

Zachary Amsden wrote:
> On 06/14/2010 10:41 PM, Avi Kivity wrote:
>
>> On 06/15/2010 10:34 AM, Zachary Amsden wrote:
>>
>>> Add a helper function for the multiple places this is used. Note
>>> that it
>>> must not be called in preemptible context, as that would mean the kernel
>>> could enter software suspend state, which would cause non-atomic
>>> operation
>>> of the monotonic_to_bootbased computation.
>>>
>>> Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this
>>> as well? Currently, they are not bootbased (but perhaps should be).
>>>
>>> Signed-off-by: Zachary Amsden<[email protected]>
>>> ---
>>> arch/x86/kvm/x86.c | 26 +++++++++++++-------------
>>> 1 files changed, 13 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 703ea43..15c7317 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t
>>> scaled_khz, uint32_t base_khz,
>>> __func__, base_khz, scaled_khz, shift, *pmultiplier);
>>> }
>>>
>>> +static inline u64 get_kernel_ns(void)
>>> +{
>>> + struct timespec ts;
>>> +
>>> + WARN_ON(preemptible());
>>> + ktime_get_ts(&ts);
>>> + monotonic_to_bootbased(&ts);
>>> + return timespec_to_ns(&ts);
>>> +}
>>> +
>>> static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
>>>
>>>
>> Isn't something like this a candidate for the time infrastructure?
>>
>>
>
> Should it be? It certainly seems reasonable.
>
>
> Also, should we be using it for the cases KVM_GET_CLOCK /
> KVM_SET_CLOCK? It seems global kvmclock_offset for the VM ignores the
> bootbased conversion.
>
>
Yes we should.
> Zach
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2010-06-16 13:07:18

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 01/17] Eliminate duplicated timer code

On Mon, Jun 14, 2010 at 09:34:03PM -1000, Zachary Amsden wrote:
> Move duplicated timer related code into arch_vcpu_load rather
> than vendor callouts. Should be an isomorphic transformation.
>
> Signed-off-by: Zachary Amsden <[email protected]>
Acked-by: Glauber Costa <[email protected]>

2010-06-16 13:22:29

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 03/17] Unify vendor TSC logic

On Wed, Jun 16, 2010 at 04:10:10PM +0800, Jason Wang wrote:
> Zachary Amsden wrote:
> >
> > void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > {
> > + kvm_x86_ops->vcpu_load(vcpu, cpu);
> > if (unlikely(vcpu->cpu != cpu)) {
> > + /* Make sure TSC doesn't go backwards */
> > + s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
> > + native_read_tsc() - vcpu->arch.last_host_tsc;
> > + if (tsc_delta < 0 || check_tsc_unstable())
> >
> It's better to do the adjustment also when tsc_delta > 0
And why do you think so? Doing it on tsc_delta > 0 would force us to adjust
at every entry but the first. And I guess we want to adjust as few times as
we can.

For example, we would adjust on every cpu bounce even for machines that has
a perfectly sync tsc. This could introduce an error not present before.

2010-06-16 13:24:41

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 04/17] Fix deep C-state TSC desynchronization

On Mon, Jun 14, 2010 at 09:34:06PM -1000, Zachary Amsden wrote:
> When CPUs with unstable TSCs enter deep C-state, TSC may stop
> running. This causes us to require resynchronization. Since
> we can't tell when this may potentially happen, we assume the
> worst by forcing re-compensation for it at every point the VCPU
> task is descheduled.
can't we just compensate everytime check_tsc_unstable() is true?

2010-06-16 13:33:01

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 05/17] Keep SMP VMs more in sync on unstable TSC

On Mon, Jun 14, 2010 at 09:34:07PM -1000, Zachary Amsden wrote:
> SMP VMs on machines with unstable TSC have their TSC offset adjusted by the
> local offset delta from last measurement. This does not take into account how
> long it has been since the measurement, leading to drift. Minimize the drift
> by accounting for any time difference the kernel has observed.
>
> Signed-off-by: Zachary Amsden <[email protected]>
I believe this should be done not only if we have check_tsc_unstable() == true,
but anytime we adjust the tsc. I mean:

Sure it is expected to be much more relevant in this case, but if we're testing
generally for tsc_delta < 0 in the adjustment code, it is because we believe
it can happen, even if tsc is stable (otherwise, we'd better take it off completely).

And in that case, we should account elapsed time too.

> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 20 +++++++++++++++++++-
> 2 files changed, 20 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 94f6ce8..1afecd7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -337,6 +337,7 @@ struct kvm_vcpu_arch {
> unsigned int time_offset;
> struct page *time_page;
> u64 last_host_tsc;
> + u64 last_host_ns;
>
> bool nmi_pending;
> bool nmi_injected;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 618c435..b1bdf05 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1810,6 +1810,19 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> /* Make sure TSC doesn't go backwards */
> s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
> native_read_tsc() - vcpu->arch.last_host_tsc;
> +
> + /* Subtract elapsed cycle time from the delta computation */
> + if (check_tsc_unstable() && vcpu->arch.last_host_ns) {
> + s64 delta;
> + struct timespec ts;
> + ktime_get_ts(&ts);
> + monotonic_to_bootbased(&ts);
> + delta = timespec_to_ns(&ts) - vcpu->arch.last_host_ns;
> + delta = delta * per_cpu(cpu_tsc_khz, cpu);
> + delta = delta / USEC_PER_SEC;
> + tsc_delta -= delta;
> + }
> +
> if (tsc_delta < 0 || check_tsc_unstable())
> kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
> kvm_migrate_timers(vcpu);
> @@ -1832,8 +1845,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> * vcpu->cpu != cpu can not detect this condition. So set
> * vcpu->cpu = -1 to force the recalibration above.
> */
> - if (check_tsc_unstable())
> + if (check_tsc_unstable()) {
> + struct timespec ts;
> + ktime_get_ts(&ts);
> + monotonic_to_bootbased(&ts);
> + vcpu->arch.last_host_ns = timespec_to_ns(&ts);
> vcpu->cpu = -1;
> + }
> }
>
> static int is_efer_nx(void)
> --
> 1.7.1
>

2010-06-16 13:52:19

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 16/17] TSC reset compensation

On Mon, Jun 14, 2010 at 09:34:18PM -1000, Zachary Amsden wrote:
> Attempt to synchronize TSCs which are reset to the same value. In the
> case of a reliable hardware TSC, we can just re-use the same offset, but
> on non-reliable hardware, we can get closer by adjusting the offset to
> match the elapsed time.
>
> Signed-off-by: Zachary Amsden <[email protected]>
> ---
> arch/x86/kvm/x86.c | 34 ++++++++++++++++++++++++++++++++--
> 1 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8e836e9..cedb71f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -937,14 +937,44 @@ static inline void kvm_request_guest_time_update(struct kvm_vcpu *v)
> set_bit(KVM_REQ_CLOCK_SYNC, &v->requests);
> }
>
> +static inline int kvm_tsc_reliable(void)
> +{
> + return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> + !check_tsc_unstable());
> +}
> +
why can't we re-use vmware TSC_RELIABLE flag?

2010-06-16 13:58:50

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

On Wed, Jun 16, 2010 at 04:11:26PM +0800, Jason Wang wrote:
> Zachary Amsden wrote:
> > Kernel time, which advances in discrete steps may progress much slower
> > than TSC. As a result, when kvmclock is adjusted to a new base, the
> > apparent time to the guest, which runs at a much higher, nsec scaled
> > rate based on the current TSC, may have already been observed to have
> > a larger value (kernel_ns + scaled tsc) than the value to which we are
> > setting it (kernel_ns + 0).
> >
> This is one issue of kvmclock which tries to supply a clocksource whose
> precision may even higher than host.
What if we export to the guest the current clock resolution, and when doing guest
reads, simply chop whatever value we got to the lowest acceptable value?

2010-06-16 14:04:25

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 12/17] Add helper function get_kernel_ns

On Wed, Jun 16, 2010 at 04:12:46PM +0800, Jason Wang wrote:
> Zachary Amsden wrote:
> > On 06/14/2010 10:41 PM, Avi Kivity wrote:
> >
> >> On 06/15/2010 10:34 AM, Zachary Amsden wrote:
> >>
> >>> Add a helper function for the multiple places this is used. Note
> >>> that it
> >>> must not be called in preemptible context, as that would mean the kernel
> >>> could enter software suspend state, which would cause non-atomic
> >>> operation
> >>> of the monotonic_to_bootbased computation.
> >>>
> >>> Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this
> >>> as well? Currently, they are not bootbased (but perhaps should be).
> >>>
> >>> Signed-off-by: Zachary Amsden<[email protected]>
> >>> ---
> >>> arch/x86/kvm/x86.c | 26 +++++++++++++-------------
> >>> 1 files changed, 13 insertions(+), 13 deletions(-)
> >>>
> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>> index 703ea43..15c7317 100644
> >>> --- a/arch/x86/kvm/x86.c
> >>> +++ b/arch/x86/kvm/x86.c
> >>> @@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t
> >>> scaled_khz, uint32_t base_khz,
> >>> __func__, base_khz, scaled_khz, shift, *pmultiplier);
> >>> }
> >>>
> >>> +static inline u64 get_kernel_ns(void)
> >>> +{
> >>> + struct timespec ts;
> >>> +
> >>> + WARN_ON(preemptible());
> >>> + ktime_get_ts(&ts);
> >>> + monotonic_to_bootbased(&ts);
> >>> + return timespec_to_ns(&ts);
> >>> +}
> >>> +
> >>> static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
> >>>
> >>>
> >> Isn't something like this a candidate for the time infrastructure?
> >>
> >>
> >
> > Should it be? It certainly seems reasonable.
> >
> >
> > Also, should we be using it for the cases KVM_GET_CLOCK /
> > KVM_SET_CLOCK? It seems global kvmclock_offset for the VM ignores the
> > bootbased conversion.
> >
> >
> Yes we should.
indeed. Also shows the importance of doing it in a central place.
Have we had an acessor like that before, we would probably not have
made this mistake.


2010-06-16 14:13:07

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

On 06/16/2010 04:58 PM, Glauber Costa wrote:
> On Wed, Jun 16, 2010 at 04:11:26PM +0800, Jason Wang wrote:
>
>> Zachary Amsden wrote:
>>
>>> Kernel time, which advances in discrete steps may progress much slower
>>> than TSC. As a result, when kvmclock is adjusted to a new base, the
>>> apparent time to the guest, which runs at a much higher, nsec scaled
>>> rate based on the current TSC, may have already been observed to have
>>> a larger value (kernel_ns + scaled tsc) than the value to which we are
>>> setting it (kernel_ns + 0).
>>>
>>>
>> This is one issue of kvmclock which tries to supply a clocksource whose
>> precision may even higher than host.
>>
> What if we export to the guest the current clock resolution, and when doing guest
> reads, simply chop whatever value we got to the lowest acceptable value?
>

The clock resolution can change, and while we can expose it reliably
through pvclock, do we need a notification so that the guest can update
other internal structures?

--
error compiling committee.c: too many arguments to function

2010-06-16 14:58:32

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

On Wed, Jun 16, 2010 at 05:13:02PM +0300, Avi Kivity wrote:
> On 06/16/2010 04:58 PM, Glauber Costa wrote:
> >On Wed, Jun 16, 2010 at 04:11:26PM +0800, Jason Wang wrote:
> >>Zachary Amsden wrote:
> >>>Kernel time, which advances in discrete steps may progress much slower
> >>>than TSC. As a result, when kvmclock is adjusted to a new base, the
> >>>apparent time to the guest, which runs at a much higher, nsec scaled
> >>>rate based on the current TSC, may have already been observed to have
> >>>a larger value (kernel_ns + scaled tsc) than the value to which we are
> >>>setting it (kernel_ns + 0).
> >>>
> >>This is one issue of kvmclock which tries to supply a clocksource whose
> >>precision may even higher than host.
> >What if we export to the guest the current clock resolution, and when doing guest
> >reads, simply chop whatever value we got to the lowest acceptable value?
>
> The clock resolution can change, and while we can expose it reliably
> through pvclock, do we need a notification so that the guest can
> update other internal structures?
I believe the only thing we need is to warn the guest whenever this changes.
We can probably fit it in one of the paddings we have, and add a flag to say
it is valid (now that we have the infrastructure).

2010-06-16 18:42:25

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 03/17] Unify vendor TSC logic

On 06/15/2010 10:10 PM, Jason Wang wrote:
> Zachary Amsden wrote:
>
>> Move the TSC control logic from the vendor backends into x86.c
>> by adding adjust_tsc_offset to x86 ops. Now all TSC decisions
>> can be done in one place.
>>
>> Also, rename some variable in the VCPU structure to more accurately
>> reflect their actual content.
>>
>> VMX backend would record last observed TSC very late (possibly in an
>> IPI to clear the VMCS from a remote CPU), now it is done earlier.
>>
>> Note this is still not yet perfect. We adjust TSC in both
>> directions when the TSC is unstable, resulting in desynchronized
>> TSCs for SMP guests. A better choice would be to compensate based
>> on a master reference clock.
>>
>> Signed-off-by: Zachary Amsden<[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 5 +++--
>> arch/x86/kvm/svm.c | 25 +++++++++++--------------
>> arch/x86/kvm/vmx.c | 20 ++++++++------------
>> arch/x86/kvm/x86.c | 16 +++++++++++-----
>> 4 files changed, 33 insertions(+), 33 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 91631b8..94f6ce8 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -253,7 +253,6 @@ struct kvm_mmu {
>> };
>>
>> struct kvm_vcpu_arch {
>> - u64 host_tsc;
>> /*
>> * rip and regs accesses must go through
>> * kvm_{register,rip}_{read,write} functions.
>> @@ -334,9 +333,10 @@ struct kvm_vcpu_arch {
>>
>> gpa_t time;
>> struct pvclock_vcpu_time_info hv_clock;
>> - unsigned int hv_clock_tsc_khz;
>> + unsigned int hw_tsc_khz;
>> unsigned int time_offset;
>> struct page *time_page;
>> + u64 last_host_tsc;
>>
>> bool nmi_pending;
>> bool nmi_injected;
>> @@ -530,6 +530,7 @@ struct kvm_x86_ops {
>> u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>> int (*get_lpage_level)(void);
>> bool (*rdtscp_supported)(void);
>> + void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
>>
>> void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
>>
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 09b2145..ee2cf30 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -948,18 +948,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>> int i;
>>
>> if (unlikely(cpu != vcpu->cpu)) {
>> - u64 delta;
>> -
>> - if (check_tsc_unstable()) {
>> - /*
>> - * Make sure that the guest sees a monotonically
>> - * increasing TSC.
>> - */
>> - delta = vcpu->arch.host_tsc - native_read_tsc();
>> - svm->vmcb->control.tsc_offset += delta;
>> - if (is_nested(svm))
>> - svm->nested.hsave->control.tsc_offset += delta;
>> - }
>> svm->asid_generation = 0;
>> }
>>
>> @@ -975,8 +963,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
>> ++vcpu->stat.host_state_reload;
>> for (i = 0; i< NR_HOST_SAVE_USER_MSRS; i++)
>> wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
>> -
>> - vcpu->arch.host_tsc = native_read_tsc();
>> }
>>
>> static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
>> @@ -3422,6 +3408,15 @@ static bool svm_rdtscp_supported(void)
>> return false;
>> }
>>
>> +static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
>> +{
>> + struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> + svm->vmcb->control.tsc_offset += adjustment;
>> + if (is_nested(svm))
>> + svm->nested.hsave->control.tsc_offset += adjustment;
>> +}
>> +
>> static void svm_fpu_deactivate(struct kvm_vcpu *vcpu)
>> {
>> struct vcpu_svm *svm = to_svm(vcpu);
>> @@ -3506,6 +3501,8 @@ static struct kvm_x86_ops svm_x86_ops = {
>> .rdtscp_supported = svm_rdtscp_supported,
>>
>> .set_supported_cpuid = svm_set_supported_cpuid,
>> +
>> + .adjust_tsc_offset = svm_adjust_tsc_offset,
>> };
>>
>> static int __init svm_init(void)
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index a657e08..a993e67 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -498,7 +498,6 @@ static void __vcpu_clear(void *arg)
>> vmcs_clear(vmx->vmcs);
>> if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
>> per_cpu(current_vmcs, cpu) = NULL;
>> - rdtscll(vmx->vcpu.arch.host_tsc);
>> list_del(&vmx->local_vcpus_link);
>> vmx->vcpu.cpu = -1;
>> vmx->launched = 0;
>> @@ -881,7 +880,6 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
>> static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>> {
>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>> - u64 tsc_this, delta, new_offset;
>> u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
>>
>> if (!vmm_exclusive)
>> @@ -914,16 +912,6 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>
>> rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
>> vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
>> -
>> - /*
>> - * Make sure the time stamp counter is monotonous.
>> - */
>> - rdtscll(tsc_this);
>> - if (tsc_this< vcpu->arch.host_tsc) {
>> - delta = vcpu->arch.host_tsc - tsc_this;
>> - new_offset = vmcs_read64(TSC_OFFSET) + delta;
>> - vmcs_write64(TSC_OFFSET, new_offset);
>> - }
>> }
>> }
>>
>> @@ -1153,6 +1141,12 @@ static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
>> vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
>> }
>>
>> +static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
>> +{
>> + u64 offset = vmcs_read64(TSC_OFFSET);
>> + vmcs_write64(TSC_OFFSET, offset + adjustment);
>> +}
>> +
>> /*
>> * Reads an msr value (of 'msr_index') into 'pdata'.
>> * Returns 0 on success, non-0 otherwise.
>> @@ -4340,6 +4334,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
>> .rdtscp_supported = vmx_rdtscp_supported,
>>
>> .set_supported_cpuid = vmx_set_supported_cpuid,
>> +
>> + .adjust_tsc_offset = vmx_adjust_tsc_offset,
>> };
>>
>> static int __init vmx_init(void)
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 0a102d3..c8289d0 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -929,9 +929,9 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
>> return 1;
>> }
>>
>> - if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) {
>> + if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>> kvm_set_time_scale(this_tsc_khz,&vcpu->hv_clock);
>> - vcpu->hv_clock_tsc_khz = this_tsc_khz;
>> + vcpu->hw_tsc_khz = this_tsc_khz;
>> }
>>
>> /* Keep irq disabled to prevent changes to the clock */
>> @@ -1805,18 +1805,24 @@ out:
>>
>> void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>> {
>> + kvm_x86_ops->vcpu_load(vcpu, cpu);
>> if (unlikely(vcpu->cpu != cpu)) {
>> + /* Make sure TSC doesn't go backwards */
>> + s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>> + native_read_tsc() - vcpu->arch.last_host_tsc;
>> + if (tsc_delta< 0 || check_tsc_unstable())
>>
>>
> It's better to do the adjustment also when tsc_delta> 0
>

No, that causes time to stop.

2010-06-16 18:43:48

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 04/17] Fix deep C-state TSC desynchronization

On 06/15/2010 10:10 PM, Jason Wang wrote:
> Zachary Amsden wrote:
>
>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>> running. This causes us to require resynchronization. Since
>> we can't tell when this may potentially happen, we assume the
>> worst by forcing re-compensation for it at every point the VCPU
>> task is descheduled.
>>
>>
>>
> Is there any method to do the compensation on host TSC when recovering
> from C-state? It should be simpler than do it on guest TSC.
>

Actually, yes, we could re-calibrate from HPET. Considering this patch
is already broken, maybe that's a better idea.

2010-06-16 19:20:35

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 04/17] Fix deep C-state TSC desynchronization

On 06/16/2010 03:24 AM, Glauber Costa wrote:
> On Mon, Jun 14, 2010 at 09:34:06PM -1000, Zachary Amsden wrote:
>
>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>> running. This causes us to require resynchronization. Since
>> we can't tell when this may potentially happen, we assume the
>> worst by forcing re-compensation for it at every point the VCPU
>> task is descheduled.
>>
> can't we just compensate everytime check_tsc_unstable() is true?
>

We should probably not rely on check_tsc_unstable, we should do this
only if we don't have NONSTOP_TSC.

2010-06-16 19:36:31

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

On 06/15/2010 10:11 PM, Jason Wang wrote:
> Zachary Amsden wrote:
>
>> Kernel time, which advances in discrete steps may progress much slower
>> than TSC. As a result, when kvmclock is adjusted to a new base, the
>> apparent time to the guest, which runs at a much higher, nsec scaled
>> rate based on the current TSC, may have already been observed to have
>> a larger value (kernel_ns + scaled tsc) than the value to which we are
>> setting it (kernel_ns + 0).
>>
>>
> This is one issue of kvmclock which tries to supply a clocksource whose
> precision may even higher than host.
>
>> We must instead compute the clock as potentially observed by the guest
>> for kernel_ns to make sure it does not go backwards.
>>
>> Signed-off-by: Zachary Amsden<[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 4 ++
>> arch/x86/kvm/x86.c | 79 +++++++++++++++++++++++++++++++++------
>> 2 files changed, 71 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 1afecd7..7ec2472 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -338,6 +338,8 @@ struct kvm_vcpu_arch {
>> struct page *time_page;
>> u64 last_host_tsc;
>> u64 last_host_ns;
>> + u64 last_guest_tsc;
>> + u64 last_kernel_ns;
>>
>> bool nmi_pending;
>> bool nmi_injected;
>> @@ -455,6 +457,8 @@ struct kvm_vcpu_stat {
>> u32 hypercalls;
>> u32 irq_injections;
>> u32 nmi_injections;
>> + u32 tsc_overshoot;
>> + u32 tsc_ahead;
>> };
>>
>> struct kvm_x86_ops {
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 52d7d34..703ea43 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -138,6 +138,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
>> { "insn_emulation_fail", VCPU_STAT(insn_emulation_fail) },
>> { "irq_injections", VCPU_STAT(irq_injections) },
>> { "nmi_injections", VCPU_STAT(nmi_injections) },
>> + { "tsc_overshoot", VCPU_STAT(tsc_overshoot) },
>> + { "tsc_ahead", VCPU_STAT(tsc_ahead) },
>> { "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
>> { "mmu_pte_write", VM_STAT(mmu_pte_write) },
>> { "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
>> @@ -927,33 +929,84 @@ static int kvm_recompute_guest_time(struct kvm_vcpu *v)
>> struct kvm_vcpu_arch *vcpu =&v->arch;
>> void *shared_kaddr;
>> unsigned long this_tsc_khz;
>> + s64 kernel_ns, max_kernel_ns;
>> + u64 tsc_timestamp;
>>
>> if ((!vcpu->time_page))
>> return 0;
>>
>> - this_tsc_khz = get_cpu_var(cpu_tsc_khz);
>> - put_cpu_var(cpu_tsc_khz);
>> + /*
>> + * The protection we require is simple: we must not be preempted from
>> + * the CPU between our read of the TSC khz and our read of the TSC.
>> + * Interrupt protection is not strictly required, but it does result in
>> + * greater accuracy for the TSC / kernel_ns measurement.
>> + */
>> + local_irq_save(flags);
>> + this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
>> + kvm_get_msr(v, MSR_IA32_TSC,&tsc_timestamp);
>> + ktime_get_ts(&ts);
>> + monotonic_to_bootbased(&ts);
>> + kernel_ns = timespec_to_ns(&ts);
>> + local_irq_restore(flags);
>> +
>> if (unlikely(this_tsc_khz == 0)) {
>> kvm_request_guest_time_update(v);
>> return 1;
>> }
>>
>> + /*
>> + * Time as measured by the TSC may go backwards when resetting the base
>> + * tsc_timestamp. The reason for this is that the TSC resolution is
>> + * higher than the resolution of the other clock scales. Thus, many
>> + * possible measurments of the TSC correspond to one measurement of any
>> + * other clock, and so a spread of values is possible. This is not a
>> + * problem for the computation of the nanosecond clock; with TSC rates
>> + * around 1GHZ, there can only be a few cycles which correspond to one
>> + * nanosecond value, and any path through this code will inevitably
>> + * take longer than that. However, with the kernel_ns value itself,
>> + * the precision may be much lower, down to HZ granularity. If the
>> + * first sampling of TSC against kernel_ns ends in the low part of the
>> + * range, and the second in the high end of the range, we can get:
>> + *
>> + * (TSC - offset_low) * S + kns_old> (TSC - offset_high) * S + kns_new
>> + *
>> + * As the sampling errors potentially range in the thousands of cycles,
>> + * it is possible such a time value has already been observed by the
>> + * guest. To protect against this, we must compute the system time as
>> + * observed by the guest and ensure the new system time is greater.
>> + */
>> + max_kernel_ns = 0;
>> + if (vcpu->hv_clock.tsc_timestamp) {
>> + max_kernel_ns = vcpu->last_guest_tsc -
>>
>>
> Since you do the comparison with kernel_ns, so what you need here is
> tsc_timestamp which looks more like the 'last' tsc seen by guest.
>

What this is computing is the highest bootbased nanosecond time value
potentially seen by the guest:

last_guest_tsc - hv_clock.tsc_timestamp is the maximum cycle offset the
guest has seen against the last version of kvmclock.

Then it is scaled and added to the last_kernel_ns value used for
hv_clock. I chose to cache vcpu->last_kernel_ns separately from
hv_clock so that kvmclock_offset can not change in the meantime, so the
value deliberately discounts kvmclock_offset.

>> + vcpu->hv_clock.tsc_timestamp;
>> + max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
>> + vcpu->hv_clock.tsc_to_system_mul,
>> + vcpu->hv_clock.tsc_shift);
>> + max_kernel_ns += vcpu->last_kernel_ns;
>> + }
>> +
>> if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>> - kvm_set_time_scale(this_tsc_khz,&vcpu->hv_clock);
>> + kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
>> + &vcpu->hv_clock.tsc_shift,
>> + &vcpu->hv_clock.tsc_to_system_mul);
>> vcpu->hw_tsc_khz = this_tsc_khz;
>> }
>>
>> - /* Keep irq disabled to prevent changes to the clock */
>> - local_irq_save(flags);
>> - kvm_get_msr(v, MSR_IA32_TSC,&vcpu->hv_clock.tsc_timestamp);
>> - ktime_get_ts(&ts);
>> - monotonic_to_bootbased(&ts);
>> - local_irq_restore(flags);
>> + if (max_kernel_ns> kernel_ns) {
>>
>>
> Both max_kernel_ns and kernel_ns are not adjusted by kvmclock_offset, so
> this comparing is not safe after migration.
>

They are deliberately not adjusted by kvmclock_offset, so they are
simply scalar bootbased nanosecond values, not kvmclock_offset values.
Thus, we can add, subtract and take maximums of them without worrying
about kvmclock_offset at all.

>> + s64 overshoot = max_kernel_ns - kernel_ns;
>> + ++v->stat.tsc_ahead;
>> + if (overshoot> NSEC_PER_SEC / HZ) {
>> + ++v->stat.tsc_overshoot;
>> + if (printk_ratelimit())
>> + pr_debug("ns overshoot: %lld\n", overshoot);
>> + }
>> + kernel_ns = max_kernel_ns;
>> + }
>>
>>
> A tsc_behind or something like this would make the problem more clear,
> and tsc_ahead should be zero when host using tsc as its clocksource.
>
>>
>> /* With all the info we got, fill in the values */
>> -
>> - vcpu->hv_clock.system_time = ts.tv_nsec +
>> - (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset;
>> + vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
>> + vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
>> + vcpu->last_kernel_ns = kernel_ns;
>>
>> vcpu->hv_clock.flags = 0;
>>
>> @@ -4836,6 +4889,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>> if (hw_breakpoint_active())
>> hw_breakpoint_restore();
>>
>> + kvm_get_msr(vcpu, MSR_IA32_TSC,&vcpu->arch.last_guest_tsc);
>>
>>
> This could be dropped since it does not take the time of guest execution
> into account.
>

This is required is for the maximum cycle offset above.

2010-06-16 21:34:46

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 05/17] Keep SMP VMs more in sync on unstable TSC

On 06/16/2010 03:32 AM, Glauber Costa wrote:
> On Mon, Jun 14, 2010 at 09:34:07PM -1000, Zachary Amsden wrote:
>
>> SMP VMs on machines with unstable TSC have their TSC offset adjusted by the
>> local offset delta from last measurement. This does not take into account how
>> long it has been since the measurement, leading to drift. Minimize the drift
>> by accounting for any time difference the kernel has observed.
>>
>> Signed-off-by: Zachary Amsden<[email protected]>
>>
> I believe this should be done not only if we have check_tsc_unstable() == true,
> but anytime we adjust the tsc. I mean:
>
> Sure it is expected to be much more relevant in this case, but if we're testing
> generally for tsc_delta< 0 in the adjustment code, it is because we believe
> it can happen, even if tsc is stable (otherwise, we'd better take it off completely).
>
> And in that case, we should account elapsed time too.
>

If we get tsc_delta < 0 test turning true, we've got an unstable tsc to
begin with, so perhaps we should just check that and let the TSC code
deal with detecting an unstable TSC for us.

2010-06-16 22:36:22

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 16/17] TSC reset compensation

On 06/16/2010 03:52 AM, Glauber Costa wrote:
> On Mon, Jun 14, 2010 at 09:34:18PM -1000, Zachary Amsden wrote:
>
>> Attempt to synchronize TSCs which are reset to the same value. In the
>> case of a reliable hardware TSC, we can just re-use the same offset, but
>> on non-reliable hardware, we can get closer by adjusting the offset to
>> match the elapsed time.
>>
>> Signed-off-by: Zachary Amsden<[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 34 ++++++++++++++++++++++++++++++++--
>> 1 files changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 8e836e9..cedb71f 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -937,14 +937,44 @@ static inline void kvm_request_guest_time_update(struct kvm_vcpu *v)
>> set_bit(KVM_REQ_CLOCK_SYNC,&v->requests);
>> }
>>
>> +static inline int kvm_tsc_reliable(void)
>> +{
>> + return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)&&
>> + boot_cpu_has(X86_FEATURE_NONSTOP_TSC)&&
>> + !check_tsc_unstable());
>> +}
>> +
>>
> why can't we re-use vmware TSC_RELIABLE flag?
>

It's only set for VMware. Basically, it means "you are running in a
VMware hypervisor, TSC is reliable". Which KVM won't ever be, at least,
not in production use, so it doesn't make that sort of sense here.
Besides, a system with a reliable TSC can become a system without a
reliable TSC : CPU hotplug will always guarantee this.

We could, however, have the guest set the TSC_RELIABLE flag for itself
if KVM somehow makes that promise (currently, it does not).

Zach

2010-06-16 22:38:36

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

On 06/16/2010 03:58 AM, Glauber Costa wrote:
> On Wed, Jun 16, 2010 at 04:11:26PM +0800, Jason Wang wrote:
>
>> Zachary Amsden wrote:
>>
>>> Kernel time, which advances in discrete steps may progress much slower
>>> than TSC. As a result, when kvmclock is adjusted to a new base, the
>>> apparent time to the guest, which runs at a much higher, nsec scaled
>>> rate based on the current TSC, may have already been observed to have
>>> a larger value (kernel_ns + scaled tsc) than the value to which we are
>>> setting it (kernel_ns + 0).
>>>
>>>
>> This is one issue of kvmclock which tries to supply a clocksource whose
>> precision may even higher than host.
>>
> What if we export to the guest the current clock resolution, and when doing guest
> reads, simply chop whatever value we got to the lowest acceptable value?
>

I considered it, but it still doesn't solve the problem, at least, not
without adding TSC trap and emulate. If you can see a higher resolution
clock advance faster than the resolution of the kernel, you still have
the problem, and any visible TSC will do that.

2010-06-16 23:59:55

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 17/17] Add timekeeping documentation

On 06/15/2010 10:27 AM, Randy Dunlap wrote:
> On Mon, 14 Jun 2010 21:34:19 -1000 Zachary Amsden wrote:
>
>
>> Basic informational document about x86 timekeeping and how KVM
>> is affected.
>>
> Nice job/information. Thanks.
>
> Just some typos etc. inline below.
>

Thanks for all the detailed feedback! I'll include it all in the next
revision. I only have one response:
>
>> + low, the count is halted. If the output is low when the gate is lowered, the
>> + output automatically goes high (this only affects timer 2).
>> +
>> +Mode 3: Square Wave. This generates a sine wave. The count determines the
>>
> a sine wave is a square wave???
>

Actually, yes. The hardware output from timer 2 output goes through a
low pass filter which effectively smooths the square wave to a sine to
make the PC speaker sound somewhat less annoying. I didn't have room to
depict this in the schematic. Of course, the counter output isn't
directly a sine wave, and the LPF may not be all that effective, but it
was the intent.

Zach

2010-06-17 07:57:55

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 03/17] Unify vendor TSC logic

Glauber Costa wrote:
> On Wed, Jun 16, 2010 at 04:10:10PM +0800, Jason Wang wrote:
>
>> Zachary Amsden wrote:
>>
>>>
>>> void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>> {
>>> + kvm_x86_ops->vcpu_load(vcpu, cpu);
>>> if (unlikely(vcpu->cpu != cpu)) {
>>> + /* Make sure TSC doesn't go backwards */
>>> + s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>> + native_read_tsc() - vcpu->arch.last_host_tsc;
>>> + if (tsc_delta < 0 || check_tsc_unstable())
>>>
>>>
>> It's better to do the adjustment also when tsc_delta > 0
>>
> And why do you think so? Doing it on tsc_delta > 0 would force us to adjust
> at every entry but the first. And I guess we want to adjust as few times as
> we can.
>
>
This is not strange and is what current SVM code does. If we do not do
this, guest may see a jump in the value of TSC when tsc_delta > 0.
> For example, we would adjust on every cpu bounce even for machines that has
> a perfectly sync tsc. This could introduce an error not present before.
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2010-06-17 08:10:22

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 03/17] Unify vendor TSC logic

Zachary Amsden wrote:
> On 06/15/2010 10:10 PM, Jason Wang wrote:
>
>> Zachary Amsden wrote:
>>
>>
>>> Move the TSC control logic from the vendor backends into x86.c
>>> by adding adjust_tsc_offset to x86 ops. Now all TSC decisions
>>> can be done in one place.
>>>
>>> Also, rename some variable in the VCPU structure to more accurately
>>> reflect their actual content.
>>>
>>> VMX backend would record last observed TSC very late (possibly in an
>>> IPI to clear the VMCS from a remote CPU), now it is done earlier.
>>>
>>> Note this is still not yet perfect. We adjust TSC in both
>>> directions when the TSC is unstable, resulting in desynchronized
>>> TSCs for SMP guests. A better choice would be to compensate based
>>> on a master reference clock.
>>>
>>> Signed-off-by: Zachary Amsden<[email protected]>
>>> ---
>>> arch/x86/include/asm/kvm_host.h | 5 +++--
>>> arch/x86/kvm/svm.c | 25 +++++++++++--------------
>>> arch/x86/kvm/vmx.c | 20 ++++++++------------
>>> arch/x86/kvm/x86.c | 16 +++++++++++-----
>>> 4 files changed, 33 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 91631b8..94f6ce8 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -253,7 +253,6 @@ struct kvm_mmu {
>>> };
>>>
>>> struct kvm_vcpu_arch {
>>> - u64 host_tsc;
>>> /*
>>> * rip and regs accesses must go through
>>> * kvm_{register,rip}_{read,write} functions.
>>> @@ -334,9 +333,10 @@ struct kvm_vcpu_arch {
>>>
>>> gpa_t time;
>>> struct pvclock_vcpu_time_info hv_clock;
>>> - unsigned int hv_clock_tsc_khz;
>>> + unsigned int hw_tsc_khz;
>>> unsigned int time_offset;
>>> struct page *time_page;
>>> + u64 last_host_tsc;
>>>
>>> bool nmi_pending;
>>> bool nmi_injected;
>>> @@ -530,6 +530,7 @@ struct kvm_x86_ops {
>>> u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>>> int (*get_lpage_level)(void);
>>> bool (*rdtscp_supported)(void);
>>> + void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
>>>
>>> void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
>>>
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index 09b2145..ee2cf30 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -948,18 +948,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>> int i;
>>>
>>> if (unlikely(cpu != vcpu->cpu)) {
>>> - u64 delta;
>>> -
>>> - if (check_tsc_unstable()) {
>>> - /*
>>> - * Make sure that the guest sees a monotonically
>>> - * increasing TSC.
>>> - */
>>> - delta = vcpu->arch.host_tsc - native_read_tsc();
>>> - svm->vmcb->control.tsc_offset += delta;
>>> - if (is_nested(svm))
>>> - svm->nested.hsave->control.tsc_offset += delta;
>>> - }
>>> svm->asid_generation = 0;
>>> }
>>>
>>> @@ -975,8 +963,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
>>> ++vcpu->stat.host_state_reload;
>>> for (i = 0; i< NR_HOST_SAVE_USER_MSRS; i++)
>>> wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
>>> -
>>> - vcpu->arch.host_tsc = native_read_tsc();
>>> }
>>>
>>> static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
>>> @@ -3422,6 +3408,15 @@ static bool svm_rdtscp_supported(void)
>>> return false;
>>> }
>>>
>>> +static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
>>> +{
>>> + struct vcpu_svm *svm = to_svm(vcpu);
>>> +
>>> + svm->vmcb->control.tsc_offset += adjustment;
>>> + if (is_nested(svm))
>>> + svm->nested.hsave->control.tsc_offset += adjustment;
>>> +}
>>> +
>>> static void svm_fpu_deactivate(struct kvm_vcpu *vcpu)
>>> {
>>> struct vcpu_svm *svm = to_svm(vcpu);
>>> @@ -3506,6 +3501,8 @@ static struct kvm_x86_ops svm_x86_ops = {
>>> .rdtscp_supported = svm_rdtscp_supported,
>>>
>>> .set_supported_cpuid = svm_set_supported_cpuid,
>>> +
>>> + .adjust_tsc_offset = svm_adjust_tsc_offset,
>>> };
>>>
>>> static int __init svm_init(void)
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index a657e08..a993e67 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -498,7 +498,6 @@ static void __vcpu_clear(void *arg)
>>> vmcs_clear(vmx->vmcs);
>>> if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
>>> per_cpu(current_vmcs, cpu) = NULL;
>>> - rdtscll(vmx->vcpu.arch.host_tsc);
>>> list_del(&vmx->local_vcpus_link);
>>> vmx->vcpu.cpu = -1;
>>> vmx->launched = 0;
>>> @@ -881,7 +880,6 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
>>> static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>> {
>>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>>> - u64 tsc_this, delta, new_offset;
>>> u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
>>>
>>> if (!vmm_exclusive)
>>> @@ -914,16 +912,6 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>
>>> rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
>>> vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
>>> -
>>> - /*
>>> - * Make sure the time stamp counter is monotonous.
>>> - */
>>> - rdtscll(tsc_this);
>>> - if (tsc_this< vcpu->arch.host_tsc) {
>>> - delta = vcpu->arch.host_tsc - tsc_this;
>>> - new_offset = vmcs_read64(TSC_OFFSET) + delta;
>>> - vmcs_write64(TSC_OFFSET, new_offset);
>>> - }
>>> }
>>> }
>>>
>>> @@ -1153,6 +1141,12 @@ static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
>>> vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
>>> }
>>>
>>> +static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
>>> +{
>>> + u64 offset = vmcs_read64(TSC_OFFSET);
>>> + vmcs_write64(TSC_OFFSET, offset + adjustment);
>>> +}
>>> +
>>> /*
>>> * Reads an msr value (of 'msr_index') into 'pdata'.
>>> * Returns 0 on success, non-0 otherwise.
>>> @@ -4340,6 +4334,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
>>> .rdtscp_supported = vmx_rdtscp_supported,
>>>
>>> .set_supported_cpuid = vmx_set_supported_cpuid,
>>> +
>>> + .adjust_tsc_offset = vmx_adjust_tsc_offset,
>>> };
>>>
>>> static int __init vmx_init(void)
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 0a102d3..c8289d0 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -929,9 +929,9 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
>>> return 1;
>>> }
>>>
>>> - if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) {
>>> + if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>>> kvm_set_time_scale(this_tsc_khz,&vcpu->hv_clock);
>>> - vcpu->hv_clock_tsc_khz = this_tsc_khz;
>>> + vcpu->hw_tsc_khz = this_tsc_khz;
>>> }
>>>
>>> /* Keep irq disabled to prevent changes to the clock */
>>> @@ -1805,18 +1805,24 @@ out:
>>>
>>> void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>> {
>>> + kvm_x86_ops->vcpu_load(vcpu, cpu);
>>> if (unlikely(vcpu->cpu != cpu)) {
>>> + /* Make sure TSC doesn't go backwards */
>>> + s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>> + native_read_tsc() - vcpu->arch.last_host_tsc;
>>> + if (tsc_delta< 0 || check_tsc_unstable())
>>>
>>>
>>>
>> It's better to do the adjustment also when tsc_delta> 0
>>
>>
>
> No, that causes time to stop.
>

You have done the compensation in PATCH 5. And another big question is
why did you choose to drop the IPI based method which have done the
compensation like PATCH 5? And more accurate value could be got through
measuring the RTT of IPI.

> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2010-06-17 08:55:25

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 17/17] Add timekeeping documentation

Zachary Amsden <[email protected]> writes:

I think listing all the obscure bits in the PIT was an attempt to
weed out the weak and weary readers early, right?

> +this as well. Several hardware limitations make the problem worse - if it is
> +not possible to write the full 32-bits of the TSC, it may be impossible to
> +match the TSC in newly arriving CPUs to that of the rest of the system,
> +resulting in unsynchronized TSCs. This may be done by BIOS or system software,
> +but in practice, getting a perfectly synchronized TSC will not be possible
> +unless all values are read from the same clock, which generally only is
> +possible on single socket systems or those with special hardware
> +support.

That's not true, single crystal for all sockets is very common
as long as you only have a single motherboard.

Of course there might be other reasons why the TSC is unsynchronized
(e.g. stop count in C-states), but the single clock is not the problem.

> +3.4) TSC and C-states
> +
> +C-states, or idling states of the processor, especially C1E and deeper sleep
> +states may be problematic for TSC as well. The TSC may stop advancing in such
> +a state, resulting in a TSC which is behind that of other CPUs when execution
> +is resumed. Such CPUs must be detected and flagged by the operating system
> +based on CPU and chipset identifications.
> +
> +The TSC in such a case may be corrected by catching it up to a known external
> +clocksource.

... This is fixed in recent CPUs ...

> +
> +3.5) TSC frequency change / P-states
> +
> +To make things slightly more interesting, some CPUs may change requency. They
> +may or may not run the TSC at the same rate, and because the frequency change
> +may be staggered or slewed, at some points in time, the TSC rate may not be
> +known other than falling within a range of values. In this case, the TSC will
> +not be a stable time source, and must be calibrated against a known, stable,
> +external clock to be a usable source of time.
> +
> +Whether the TSC runs at a constant rate or scales with the P-state is model
> +dependent and must be determined by inspecting CPUID, chipset or various MSR
> +fields.

... In general newer CPUs should not have problems with this anymore

> +
> +4) Virtualization Problems
> +
> +Timekeeping is especially problematic for virtualization because a number of
> +challenges arise. The most obvious problem is that time is now shared between
> +the host and, potentially, a number of virtual machines. This happens
> +naturally on X86 systems when SMM mode is used by the BIOS, but not to such a
> +degree nor with such frequency. However, the fact that SMM mode may cause

The SMM reference here seems at best odd.

-Andi
--
[email protected] -- Speaking for myself only.

2010-06-17 20:31:05

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 03/17] Unify vendor TSC logic

On 06/16/2010 10:15 PM, Jason Wang wrote:
> Zachary Amsden wrote:
>
>> On 06/15/2010 10:10 PM, Jason Wang wrote:
>>
>>
>>> Zachary Amsden wrote:
>>>
>>>
>>>
>>>> Move the TSC control logic from the vendor backends into x86.c
>>>> by adding adjust_tsc_offset to x86 ops. Now all TSC decisions
>>>> can be done in one place.
>>>>
>>>> Also, rename some variable in the VCPU structure to more accurately
>>>> reflect their actual content.
>>>>
>>>> VMX backend would record last observed TSC very late (possibly in an
>>>> IPI to clear the VMCS from a remote CPU), now it is done earlier.
>>>>
>>>> Note this is still not yet perfect. We adjust TSC in both
>>>> directions when the TSC is unstable, resulting in desynchronized
>>>> TSCs for SMP guests. A better choice would be to compensate based
>>>> on a master reference clock.
>>>>
>>>> Signed-off-by: Zachary Amsden<[email protected]>
>>>> ---
>>>> arch/x86/include/asm/kvm_host.h | 5 +++--
>>>> arch/x86/kvm/svm.c | 25 +++++++++++--------------
>>>> arch/x86/kvm/vmx.c | 20 ++++++++------------
>>>> arch/x86/kvm/x86.c | 16 +++++++++++-----
>>>> 4 files changed, 33 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>> index 91631b8..94f6ce8 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -253,7 +253,6 @@ struct kvm_mmu {
>>>> };
>>>>
>>>> struct kvm_vcpu_arch {
>>>> - u64 host_tsc;
>>>> /*
>>>> * rip and regs accesses must go through
>>>> * kvm_{register,rip}_{read,write} functions.
>>>> @@ -334,9 +333,10 @@ struct kvm_vcpu_arch {
>>>>
>>>> gpa_t time;
>>>> struct pvclock_vcpu_time_info hv_clock;
>>>> - unsigned int hv_clock_tsc_khz;
>>>> + unsigned int hw_tsc_khz;
>>>> unsigned int time_offset;
>>>> struct page *time_page;
>>>> + u64 last_host_tsc;
>>>>
>>>> bool nmi_pending;
>>>> bool nmi_injected;
>>>> @@ -530,6 +530,7 @@ struct kvm_x86_ops {
>>>> u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>>>> int (*get_lpage_level)(void);
>>>> bool (*rdtscp_supported)(void);
>>>> + void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
>>>>
>>>> void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
>>>>
>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>> index 09b2145..ee2cf30 100644
>>>> --- a/arch/x86/kvm/svm.c
>>>> +++ b/arch/x86/kvm/svm.c
>>>> @@ -948,18 +948,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>> int i;
>>>>
>>>> if (unlikely(cpu != vcpu->cpu)) {
>>>> - u64 delta;
>>>> -
>>>> - if (check_tsc_unstable()) {
>>>> - /*
>>>> - * Make sure that the guest sees a monotonically
>>>> - * increasing TSC.
>>>> - */
>>>> - delta = vcpu->arch.host_tsc - native_read_tsc();
>>>> - svm->vmcb->control.tsc_offset += delta;
>>>> - if (is_nested(svm))
>>>> - svm->nested.hsave->control.tsc_offset += delta;
>>>> - }
>>>> svm->asid_generation = 0;
>>>> }
>>>>
>>>> @@ -975,8 +963,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
>>>> ++vcpu->stat.host_state_reload;
>>>> for (i = 0; i< NR_HOST_SAVE_USER_MSRS; i++)
>>>> wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
>>>> -
>>>> - vcpu->arch.host_tsc = native_read_tsc();
>>>> }
>>>>
>>>> static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
>>>> @@ -3422,6 +3408,15 @@ static bool svm_rdtscp_supported(void)
>>>> return false;
>>>> }
>>>>
>>>> +static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
>>>> +{
>>>> + struct vcpu_svm *svm = to_svm(vcpu);
>>>> +
>>>> + svm->vmcb->control.tsc_offset += adjustment;
>>>> + if (is_nested(svm))
>>>> + svm->nested.hsave->control.tsc_offset += adjustment;
>>>> +}
>>>> +
>>>> static void svm_fpu_deactivate(struct kvm_vcpu *vcpu)
>>>> {
>>>> struct vcpu_svm *svm = to_svm(vcpu);
>>>> @@ -3506,6 +3501,8 @@ static struct kvm_x86_ops svm_x86_ops = {
>>>> .rdtscp_supported = svm_rdtscp_supported,
>>>>
>>>> .set_supported_cpuid = svm_set_supported_cpuid,
>>>> +
>>>> + .adjust_tsc_offset = svm_adjust_tsc_offset,
>>>> };
>>>>
>>>> static int __init svm_init(void)
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index a657e08..a993e67 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -498,7 +498,6 @@ static void __vcpu_clear(void *arg)
>>>> vmcs_clear(vmx->vmcs);
>>>> if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
>>>> per_cpu(current_vmcs, cpu) = NULL;
>>>> - rdtscll(vmx->vcpu.arch.host_tsc);
>>>> list_del(&vmx->local_vcpus_link);
>>>> vmx->vcpu.cpu = -1;
>>>> vmx->launched = 0;
>>>> @@ -881,7 +880,6 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
>>>> static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>> {
>>>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>> - u64 tsc_this, delta, new_offset;
>>>> u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
>>>>
>>>> if (!vmm_exclusive)
>>>> @@ -914,16 +912,6 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>
>>>> rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
>>>> vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
>>>> -
>>>> - /*
>>>> - * Make sure the time stamp counter is monotonous.
>>>> - */
>>>> - rdtscll(tsc_this);
>>>> - if (tsc_this< vcpu->arch.host_tsc) {
>>>> - delta = vcpu->arch.host_tsc - tsc_this;
>>>> - new_offset = vmcs_read64(TSC_OFFSET) + delta;
>>>> - vmcs_write64(TSC_OFFSET, new_offset);
>>>> - }
>>>> }
>>>> }
>>>>
>>>> @@ -1153,6 +1141,12 @@ static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
>>>> vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
>>>> }
>>>>
>>>> +static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
>>>> +{
>>>> + u64 offset = vmcs_read64(TSC_OFFSET);
>>>> + vmcs_write64(TSC_OFFSET, offset + adjustment);
>>>> +}
>>>> +
>>>> /*
>>>> * Reads an msr value (of 'msr_index') into 'pdata'.
>>>> * Returns 0 on success, non-0 otherwise.
>>>> @@ -4340,6 +4334,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
>>>> .rdtscp_supported = vmx_rdtscp_supported,
>>>>
>>>> .set_supported_cpuid = vmx_set_supported_cpuid,
>>>> +
>>>> + .adjust_tsc_offset = vmx_adjust_tsc_offset,
>>>> };
>>>>
>>>> static int __init vmx_init(void)
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index 0a102d3..c8289d0 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -929,9 +929,9 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
>>>> return 1;
>>>> }
>>>>
>>>> - if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) {
>>>> + if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>>>> kvm_set_time_scale(this_tsc_khz,&vcpu->hv_clock);
>>>> - vcpu->hv_clock_tsc_khz = this_tsc_khz;
>>>> + vcpu->hw_tsc_khz = this_tsc_khz;
>>>> }
>>>>
>>>> /* Keep irq disabled to prevent changes to the clock */
>>>> @@ -1805,18 +1805,24 @@ out:
>>>>
>>>> void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>> {
>>>> + kvm_x86_ops->vcpu_load(vcpu, cpu);
>>>> if (unlikely(vcpu->cpu != cpu)) {
>>>> + /* Make sure TSC doesn't go backwards */
>>>> + s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>> + native_read_tsc() - vcpu->arch.last_host_tsc;
>>>> + if (tsc_delta< 0 || check_tsc_unstable())
>>>>
>>>>
>>>>
>>>>
>>> It's better to do the adjustment also when tsc_delta> 0
>>>
>>>
>>>
>> No, that causes time to stop.
>>
>>
> You have done the compensation in PATCH 5. And another big question is
>

Consider a scheduling diagram of what happens if you always compensate
for local TSC for an SMP VM:

VCPU1 VCPU2
------------------ -------------------
on pcpu0,tsc=0 on pcup1,tsc=9000

off,tsc=2
idle

off,tsc=9005
on pcpu1,tsc=9006 on pcpu0,tsc=6

off,tsc=9008 off,tsc=8
------------------ -------------------
VTSC: 5 8

At the end, virtual TSC is desynchronized, even though both hardware
TSCs are running in sync.

This is 100% wrong to do if TSC is stable and in sync. With unstable
TSC, the best you can do is compensate for local TSC AND account for
elapsed time (the idle time above which causes VCPU1 to lose cycles).
You will still have error because the clock used to time elapsed time is
sampled from a different source, but there isn't much you can do about
it. If you knew the TSC delta, then you could avoid multiplying
compensation errors over time, but most of the time, you don't even have
a guarantee the TSC will stay running at the same rate and in the same
periods of time, so the delta is changing.

> why did you choose to drop the IPI based method which have done the
> compensation like PATCH 5? And more accurate value could be got through
> measuring the RTT of IPI.
>

You could measure the delta at every potential discontinuity with an
IPI, but it's way too heavy a hammer to use. If your TSC stops in C1
idle state, you have to resynchronize after every halt with a round of
IPIs to the master? Might as well constantly hit it on the head with a
frying pan. You're better off just disabling C1 idle (or the passive
idle loop entirely). Also, the locking is horrendously complicated.

Given all this cost and complexity, you'd want some positive result.

However, it still doesn't solve the problem above; there's going to be
error and a potentially observed backwards clock across SMP VCPUs
because even the IPI based measurement isn't perfect. The only way to
solve that is trap all the reads of the TSC.

So it doesn't seem a good design choice to make things more complex and
costly in the common case to get a slightly better, but still broken
solution in the outlying case.

Zach

2010-06-17 21:14:26

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 17/17] Add timekeeping documentation

On 06/16/2010 10:55 PM, Andi Kleen wrote:
> Zachary Amsden<[email protected]> writes:
>
> I think listing all the obscure bits in the PIT was an attempt to
> weed out the weak and weary readers early, right?
>

Very perceptive of you ;)

>
>> +this as well. Several hardware limitations make the problem worse - if it is
>> +not possible to write the full 32-bits of the TSC, it may be impossible to
>> +match the TSC in newly arriving CPUs to that of the rest of the system,
>> +resulting in unsynchronized TSCs. This may be done by BIOS or system software,
>> +but in practice, getting a perfectly synchronized TSC will not be possible
>> +unless all values are read from the same clock, which generally only is
>> +possible on single socket systems or those with special hardware
>> +support.
>>
> That's not true, single crystal for all sockets is very common
> as long as you only have a single motherboard.
>
> Of course there might be other reasons why the TSC is unsynchronized
> (e.g. stop count in C-states), but the single clock is not the problem.
>

The point is about hotplug CPUs. Any hotplugged CPU will not have a
perfectly synchronized TSC, ever, even on a single socket, single
crystal board.

>
>> +3.4) TSC and C-states
>> +
>> +C-states, or idling states of the processor, especially C1E and deeper sleep
>> +states may be problematic for TSC as well. The TSC may stop advancing in such
>> +a state, resulting in a TSC which is behind that of other CPUs when execution
>> +is resumed. Such CPUs must be detected and flagged by the operating system
>> +based on CPU and chipset identifications.
>> +
>> +The TSC in such a case may be corrected by catching it up to a known external
>> +clocksource.
>>
> ... This is fixed in recent CPUs ...
>

And has a CPU flag associated with it (NONSTOP_TSC). But whether it
remains fixed across all models and vendors remains to be seen.

>> +
>> +3.5) TSC frequency change / P-states
>> +
>> +To make things slightly more interesting, some CPUs may change requency. They
>> +may or may not run the TSC at the same rate, and because the frequency change
>> +may be staggered or slewed, at some points in time, the TSC rate may not be
>> +known other than falling within a range of values. In this case, the TSC will
>> +not be a stable time source, and must be calibrated against a known, stable,
>> +external clock to be a usable source of time.
>> +
>> +Whether the TSC runs at a constant rate or scales with the P-state is model
>> +dependent and must be determined by inspecting CPUID, chipset or various MSR
>> +fields.
>>
> ... In general newer CPUs should not have problems with this anymore
>

But that's not the point. Old CPUs will, and I'm detailing all of the
existing issues, relevant to new CPUs or not. A lot of these "old" CPUs
are still in service and will be for quite some time.

>
>> +
>> +4) Virtualization Problems
>> +
>> +Timekeeping is especially problematic for virtualization because a number of
>> +challenges arise. The most obvious problem is that time is now shared between
>> +the host and, potentially, a number of virtual machines. This happens
>> +naturally on X86 systems when SMM mode is used by the BIOS, but not to such a
>> +degree nor with such frequency. However, the fact that SMM mode may cause
>>
> The SMM reference here seems at best odd.
>

SMIs are notorious for frustrating writers of careful timing loops, and
several pieces of kernel code take time measurements multiple times to
rule out outliers from it.

Seems a perfectly reasonable reference to me, perhaps I should explain
it better.

2010-06-18 07:49:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 17/17] Add timekeeping documentation

> The point is about hotplug CPUs. Any hotplugged CPU will not have a
> perfectly synchronized TSC, ever, even on a single socket, single crystal
> board.

hotplug was in the next section, not in this.

Besides most systems do not support hotplug CPUs.

-Andi

2010-06-18 17:35:25

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 17/17] Add timekeeping documentation

On 06/17/2010 09:49 PM, Andi Kleen wrote:
>> The point is about hotplug CPUs. Any hotplugged CPU will not have a
>> perfectly synchronized TSC, ever, even on a single socket, single crystal
>> board.
>>
> hotplug was in the next section, not in this.
>

Yeah, I reread it and this section was totally confused. Hopefully it
makes more sense this time around.

> Besides most systems do not support hotplug CPUs.
>

Yet.