2010-04-26 17:50:38

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 0/6] pvclock fixes

Hi,

This is the last series I've sent, with comments from you merged.
The first 5 patches are the same, only with the suggested fixes.
I am leaving documentation out, since the basics won't change, and
we're still discussing the details.

Patch 6 is new, and is the guest side of the skipping updates
avi asked for. I haven't yet done any HV work on this
(specially because I am not convinced exactly where it is safe to
do).

Let me know what you think.

Thanks

Glauber Costa (6):
Enable pvclock flags in vcpu_time_info structure
Add a global synchronization point for pvclock
change msr numbers for kvmclock
export new cpuid KVM_CAP
Try using new kvm clock msrs
don't compute pvclock adjustments if we trust the tsc

arch/x86/include/asm/kvm_para.h | 13 ++++++++
arch/x86/include/asm/pvclock-abi.h | 4 ++-
arch/x86/include/asm/pvclock.h | 1 +
arch/x86/kernel/kvmclock.c | 59 +++++++++++++++++++++++-------------
arch/x86/kernel/pvclock.c | 37 ++++++++++++++++++++++
arch/x86/kvm/x86.c | 13 +++++++-
include/linux/kvm.h | 1 +
7 files changed, 105 insertions(+), 23 deletions(-)


2010-04-26 17:50:52

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 6/6] don't compute pvclock adjustments if we trust the tsc

If the HV told us we can fully trust the TSC, skip any
correction

Signed-off-by: Glauber Costa <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 5 +++++
arch/x86/include/asm/pvclock-abi.h | 1 +
arch/x86/kernel/kvmclock.c | 3 +++
arch/x86/kernel/pvclock.c | 4 ++++
4 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f019f8c..615ebb1 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,11 @@
*/
#define KVM_FEATURE_CLOCKSOURCE2 3

+/* The last 8 bits are used to indicate how to interpret the flags field
+ * in pvclock structure. If no bits are set, all flags are ignored.
+ */
+#define KVM_FEATURE_CLOCKSOURCE_STABLE_TSC 0xf8
+
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12

diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index ec5c41a..b123bd7 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -39,5 +39,6 @@ struct pvclock_wall_clock {
u32 nsec;
} __attribute__((__packed__));

+#define PVCLOCK_STABLE_TSC (1 << 0)
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_PVCLOCK_ABI_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f2f6aee..aca2d3c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -218,4 +218,7 @@ void __init kvmclock_init(void)
clocksource_register(&kvm_clock);
pv_info.paravirt_enabled = 1;
pv_info.name = "KVM";
+
+ if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_TSC))
+ pvclock_valid_flags(PVCLOCK_STABLE_TSC);
}
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 6cf6dec..43ae8d5 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -135,6 +135,10 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
barrier();
} while (version != src->version);

+ if ((valid_flags & PVCLOCK_STABLE_TSC) &&
+ (shadow->flags & PVCLOCK_STABLE_TSC))
+ return ret;
+
/*
* Assumption here is that last_value, a global accumulator, always goes
* forward. If we are less than that, we should not be much smaller.
--
1.6.2.2

2010-04-26 17:50:44

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 1/6] Enable pvclock flags in vcpu_time_info structure

This patch removes one padding byte and transform it into a flags
field. New versions of guests using pvclock will query these flags
upon each read.

Flags, however, will only be interpreted when the guest decides to.
It uses the pvclock_valid_flags function to signal that a specific
set of flags should be taken into consideration. Which flags are valid
are usually devised via HV negotiation.

Signed-off-by: Glauber Costa <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/pvclock-abi.h | 3 ++-
arch/x86/include/asm/pvclock.h | 1 +
arch/x86/kernel/pvclock.c | 9 +++++++++
3 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index 6d93508..ec5c41a 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -29,7 +29,8 @@ struct pvclock_vcpu_time_info {
u64 system_time;
u32 tsc_to_system_mul;
s8 tsc_shift;
- u8 pad[3];
+ u8 flags;
+ u8 pad[2];
} __attribute__((__packed__)); /* 32 bytes */

struct pvclock_wall_clock {
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 53235fd..c50823f 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@

/* some helper functions for xen and kvm pv clock sources */
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+void pvclock_valid_flags(u8 flags);
unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
struct pvclock_vcpu_time_info *vcpu,
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..8f4af7b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -31,8 +31,16 @@ struct pvclock_shadow_time {
u32 tsc_to_nsec_mul;
int tsc_shift;
u32 version;
+ u8 flags;
};

+static u8 valid_flags = 0;
+
+void pvclock_valid_flags(u8 flags)
+{
+ valid_flags = flags;
+}
+
/*
* Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
* yielding a 64-bit result.
@@ -91,6 +99,7 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
dst->system_timestamp = src->system_time;
dst->tsc_to_nsec_mul = src->tsc_to_system_mul;
dst->tsc_shift = src->tsc_shift;
+ dst->flags = src->flags;
rmb(); /* test version after fetching data */
} while ((src->version & 1) || (dst->version != src->version));

--
1.6.2.2

2010-04-26 17:50:43

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 2/6] Add a global synchronization point for pvclock

In recent stress tests, it was found that pvclock-based systems
could seriously warp in smp systems. Using ingo's time-warp-test.c,
I could trigger a scenario as bad as 1.5mi warps a minute in some systems.
(to be fair, it wasn't that bad in most of them). Investigating further, I
found out that such warps were caused by the very offset-based calculation
pvclock is based on.

This happens even on some machines that report constant_tsc in its tsc flags,
specially on multi-socket ones.

Two reads of the same kernel timestamp at approx the same time, will likely
have tsc timestamped in different occasions too. This means the delta we
calculate is unpredictable at best, and can probably be smaller in a cpu
that is legitimately reading clock in a forward ocasion.

Some adjustments on the host could make this window less likely to happen,
but still, it pretty much poses as an intrinsic problem of the mechanism.

A while ago, I though about using a shared variable anyway, to hold clock
last state, but gave up due to the high contention locking was likely
to introduce, possibly rendering the thing useless on big machines. I argue,
however, that locking is not necessary.

We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detected that our
clock timestamp is less than the current global, we know that we need to
return a higher one, even though it is not exactly the one we compared to.

OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new readings. This do causes contention on big
boxes (but big here means *BIG*), but it seems like a good trade off, since
it provide us with a time source guaranteed to be stable wrt time warps.

After this patch is applied, I don't see a single warp in time during 5 days
of execution, in any of the machines I saw them before.

Signed-off-by: Glauber Costa <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Marcelo Tosatti <[email protected]>
CC: Zachary Amsden <[email protected]>
---
arch/x86/kernel/pvclock.c | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 8f4af7b..6cf6dec 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -118,11 +118,14 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
return pv_tsc_khz;
}

+static atomic64_t last_value = ATOMIC64_INIT(0);
+
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
struct pvclock_shadow_time shadow;
unsigned version;
cycle_t ret, offset;
+ u64 last;

do {
version = pvclock_get_time_values(&shadow, src);
@@ -132,6 +135,27 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
barrier();
} while (version != src->version);

+ /*
+ * Assumption here is that last_value, a global accumulator, always goes
+ * forward. If we are less than that, we should not be much smaller.
+ * We assume there is an error marging we're inside, and then the correction
+ * does not sacrifice accuracy.
+ *
+ * For reads: global may have changed between test and return,
+ * but this means someone else updated poked the clock at a later time.
+ * We just need to make sure we are not seeing a backwards event.
+ *
+ * For updates: last_value = ret is not enough, since two vcpus could be
+ * updating at the same time, and one of them could be slightly behind,
+ * making the assumption that last_value always go forward fail to hold.
+ */
+ last = atomic64_read(&last_value);
+ do {
+ if (ret < last)
+ return last;
+ last = atomic64_cmpxchg(&last_value, last, ret);
+ } while (unlikely(last != ret));
+
return ret;
}

--
1.6.2.2

2010-04-26 17:51:17

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 5/6] Try using new kvm clock msrs

We now added a new set of clock-related msrs in replacement of the old
ones. In theory, we could just try to use them and get a return value
indicating they do not exist, due to our use of kvm_write_msr_save.

However, kvm clock registration happens very early, and if we ever
try to write to a non-existant MSR, we raise a lethal #GP, since our
idt handlers are not in place yet.

So this patch tests for a cpuid feature exported by the host to
decide which set of msrs are supported.

Signed-off-by: Glauber Costa <[email protected]>
---
arch/x86/kernel/kvmclock.c | 56 +++++++++++++++++++++++++++----------------
1 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index feaeb0d..f2f6aee 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -29,6 +29,8 @@
#define KVM_SCALE 22

static int kvmclock = 1;
+static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;

static int parse_no_kvmclock(char *arg)
{
@@ -41,6 +43,7 @@ early_param("no-kvmclock", parse_no_kvmclock);
static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
static struct pvclock_wall_clock wall_clock;

+
/*
* The wallclock is the time of day when we booted. Since then, some time may
* have elapsed since the hypervisor wrote the data. So we try to account for
@@ -54,7 +57,8 @@ static unsigned long kvm_get_wallclock(void)

low = (int)__pa_symbol(&wall_clock);
high = ((u64)__pa_symbol(&wall_clock) >> 32);
- native_write_msr(MSR_KVM_WALL_CLOCK, low, high);
+
+ native_write_msr_safe(msr_kvm_wall_clock, low, high);

vcpu_time = &get_cpu_var(hv_clock);
pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
@@ -130,7 +134,8 @@ static int kvm_register_clock(char *txt)
high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
cpu, high, low, txt);
- return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
+
+ return native_write_msr_safe(msr_kvm_system_time, low, high);
}

#ifdef CONFIG_X86_LOCAL_APIC
@@ -165,14 +170,15 @@ static void __init kvm_smp_prepare_boot_cpu(void)
#ifdef CONFIG_KEXEC
static void kvm_crash_shutdown(struct pt_regs *regs)
{
- native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+
+ native_write_msr(msr_kvm_system_time, 0, 0);
native_machine_crash_shutdown(regs);
}
#endif

static void kvm_shutdown(void)
{
- native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+ native_write_msr(msr_kvm_system_time, 0, 0);
native_machine_shutdown();
}

@@ -181,27 +187,35 @@ void __init kvmclock_init(void)
if (!kvm_para_available())
return;

- if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
- if (kvm_register_clock("boot clock"))
- return;
- pv_time_ops.sched_clock = kvm_clock_read;
- x86_platform.calibrate_tsc = kvm_get_tsc_khz;
- x86_platform.get_wallclock = kvm_get_wallclock;
- x86_platform.set_wallclock = kvm_set_wallclock;
+ if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
+ msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
+ msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
+ }
+ else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+ return;
+
+ printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+ msr_kvm_system_time, msr_kvm_wall_clock);
+
+ if (kvm_register_clock("boot clock"))
+ return;
+ pv_time_ops.sched_clock = kvm_clock_read;
+ x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+ x86_platform.get_wallclock = kvm_get_wallclock;
+ x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
- x86_cpuinit.setup_percpu_clockev =
- kvm_setup_secondary_clock;
+ x86_cpuinit.setup_percpu_clockev =
+ kvm_setup_secondary_clock;
#endif
#ifdef CONFIG_SMP
- smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+ smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
#endif
- machine_ops.shutdown = kvm_shutdown;
+ machine_ops.shutdown = kvm_shutdown;
#ifdef CONFIG_KEXEC
- machine_ops.crash_shutdown = kvm_crash_shutdown;
+ machine_ops.crash_shutdown = kvm_crash_shutdown;
#endif
- kvm_get_preset_lpj();
- clocksource_register(&kvm_clock);
- pv_info.paravirt_enabled = 1;
- pv_info.name = "KVM";
- }
+ kvm_get_preset_lpj();
+ clocksource_register(&kvm_clock);
+ pv_info.paravirt_enabled = 1;
+ pv_info.name = "KVM";
}
--
1.6.2.2

2010-04-26 17:51:39

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 4/6] export new cpuid KVM_CAP

Since we're changing the msrs kvmclock uses, we have to communicate
that to the guest, through cpuid. We can add a new KVM_CAP to the
hypervisor, and then patch userspace to recognize it.

And if we ever add a new cpuid bit in the future, we have to do that again,
which create some complexity and delay in feature adoption.

Instead, what I'm proposing in this patch is a new capability, called
KVM_CAP_X86_CPUID_FEATURE_LIST, that returns the current feature list
currently supported by the hypervisor. If we ever want to add or remove
some feature, we only need to tweak into the HV, leaving userspace untouched.

Signed-off-by: Glauber Costa <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 4 ++++
arch/x86/kvm/x86.c | 6 ++++++
include/linux/kvm.h | 1 +
3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 9734808..f019f8c 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -16,6 +16,10 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+/* This indicates that the new set of kvmclock msrs
+ * are available. The use of 0x11 and 0x12 is deprecated
+ */
+#define KVM_FEATURE_CLOCKSOURCE2 3

#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a2ead7f..04f04aa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_MCE:
r = KVM_MAX_MCE_BANKS;
break;
+ case KVM_CAP_X86_CPUID_FEATURE_LIST:
+ r = (1 << KVM_FEATURE_CLOCKSOURCE) |
+ (1 << KVM_FEATURE_NOP_IO_DELAY) |
+ (1 << KVM_FEATURE_MMU_OP) |
+ (1 << KVM_FEATURE_CLOCKSOURCE2);
+ break;
default:
r = 0;
break;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ce28767..1ce124f 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -507,6 +507,7 @@ struct kvm_ioeventfd {
#define KVM_CAP_DEBUGREGS 50
#endif
#define KVM_CAP_X86_ROBUST_SINGLESTEP 51
+#define KVM_CAP_X86_CPUID_FEATURE_LIST 52

#ifdef KVM_CAP_IRQ_ROUTING

--
1.6.2.2

2010-04-26 17:51:41

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 3/6] change msr numbers for kvmclock

Avi pointed out a while ago that those MSRs falls into the pentium
PMU range. So the idea here is to add new ones, and after a while,
deprecate the old ones.

Signed-off-by: Glauber Costa <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 4 ++++
arch/x86/kvm/x86.c | 7 ++++++-
2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index ffae142..9734808 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -20,6 +20,10 @@
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12

+/* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
+#define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00
+#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
+
#define KVM_MAX_MMU_OP_BATCH 32

/* Operations for KVM_HC_MMU_OP */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8824b73..a2ead7f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -575,9 +575,10 @@ static inline u32 bit(int bitno)
* kvm-specific. Those are put in the beginning of the list.
*/

-#define KVM_SAVE_MSRS_BEGIN 5
+#define KVM_SAVE_MSRS_BEGIN 7
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
+ MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1099,10 +1100,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
case MSR_IA32_MISC_ENABLE:
vcpu->arch.ia32_misc_enable_msr = data;
break;
+ case MSR_KVM_WALL_CLOCK_NEW:
case MSR_KVM_WALL_CLOCK:
vcpu->kvm->arch.wall_clock = data;
kvm_write_wall_clock(vcpu->kvm, data);
break;
+ case MSR_KVM_SYSTEM_TIME_NEW:
case MSR_KVM_SYSTEM_TIME: {
if (vcpu->arch.time_page) {
kvm_release_page_dirty(vcpu->arch.time_page);
@@ -1374,9 +1377,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
data = vcpu->arch.efer;
break;
case MSR_KVM_WALL_CLOCK:
+ case MSR_KVM_WALL_CLOCK_NEW:
data = vcpu->kvm->arch.wall_clock;
break;
case MSR_KVM_SYSTEM_TIME:
+ case MSR_KVM_SYSTEM_TIME_NEW:
data = vcpu->arch.time;
break;
case MSR_IA32_P5_MC_ADDR:
--
1.6.2.2

2010-04-26 18:12:04

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 1/6] Enable pvclock flags in vcpu_time_info structure

On 04/26/2010 10:46 AM, Glauber Costa wrote:
> This patch removes one padding byte and transform it into a flags
> field. New versions of guests using pvclock will query these flags
> upon each read.
>

Is this necessary? Why not just make the pvclock driver maintain a
local flag set, and have the HV backend call into it to update it. Why
does it need to be part of the pvclock structure?

J

> Flags, however, will only be interpreted when the guest decides to.
> It uses the pvclock_valid_flags function to signal that a specific
> set of flags should be taken into consideration. Which flags are valid
> are usually devised via HV negotiation.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/include/asm/pvclock-abi.h | 3 ++-
> arch/x86/include/asm/pvclock.h | 1 +
> arch/x86/kernel/pvclock.c | 9 +++++++++
> 3 files changed, 12 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
> index 6d93508..ec5c41a 100644
> --- a/arch/x86/include/asm/pvclock-abi.h
> +++ b/arch/x86/include/asm/pvclock-abi.h
> @@ -29,7 +29,8 @@ struct pvclock_vcpu_time_info {
> u64 system_time;
> u32 tsc_to_system_mul;
> s8 tsc_shift;
> - u8 pad[3];
> + u8 flags;
> + u8 pad[2];
> } __attribute__((__packed__)); /* 32 bytes */
>
> struct pvclock_wall_clock {
> diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
> index 53235fd..c50823f 100644
> --- a/arch/x86/include/asm/pvclock.h
> +++ b/arch/x86/include/asm/pvclock.h
> @@ -6,6 +6,7 @@
>
> /* some helper functions for xen and kvm pv clock sources */
> cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
> +void pvclock_valid_flags(u8 flags);
> unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
> void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
> struct pvclock_vcpu_time_info *vcpu,
> diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
> index 03801f2..8f4af7b 100644
> --- a/arch/x86/kernel/pvclock.c
> +++ b/arch/x86/kernel/pvclock.c
> @@ -31,8 +31,16 @@ struct pvclock_shadow_time {
> u32 tsc_to_nsec_mul;
> int tsc_shift;
> u32 version;
> + u8 flags;
> };
>
> +static u8 valid_flags = 0;
> +
> +void pvclock_valid_flags(u8 flags)
> +{
> + valid_flags = flags;
> +}
> +
> /*
> * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
> * yielding a 64-bit result.
> @@ -91,6 +99,7 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
> dst->system_timestamp = src->system_time;
> dst->tsc_to_nsec_mul = src->tsc_to_system_mul;
> dst->tsc_shift = src->tsc_shift;
> + dst->flags = src->flags;
> rmb(); /* test version after fetching data */
> } while ((src->version & 1) || (dst->version != src->version));
>
>

2010-04-26 18:45:16

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 1/6] Enable pvclock flags in vcpu_time_info structure

On Mon, Apr 26, 2010 at 11:11:57AM -0700, Jeremy Fitzhardinge wrote:
> On 04/26/2010 10:46 AM, Glauber Costa wrote:
> > This patch removes one padding byte and transform it into a flags
> > field. New versions of guests using pvclock will query these flags
> > upon each read.
> >
>
> Is this necessary? Why not just make the pvclock driver maintain a
> local flag set, and have the HV backend call into it to update it. Why
> does it need to be part of the pvclock structure?
Because it is already there, and we have plenty of space left?

There are obvious other ways, but I don't see any of them being simpler.
If we go by the method you suggested, we'd have, for instance, to register
the memory area where this flags lives. Which is a duplication of the
infrastructure already present in kvmclock.

2010-04-27 02:21:46

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 0/6] pvclock fixes

On 04/26/2010 07:46 AM, Glauber Costa wrote:
> Hi,
>
> This is the last series I've sent, with comments from you merged.
> The first 5 patches are the same, only with the suggested fixes.
> I am leaving documentation out, since the basics won't change, and
> we're still discussing the details.
>
> Patch 6 is new, and is the guest side of the skipping updates
> avi asked for. I haven't yet done any HV work on this
> (specially because I am not convinced exactly where it is safe to
> do).
>
> Let me know what you think.
>

I'm rebasing my patches on top of this series to address the host side
issues. I noticed a couple issues patching against Avi's tree, not sure
if those issues are also present in trunk.

Can you keep me CC'd on any updates to this series?

Thanks,

Zach

2010-04-27 13:41:49

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 6/6] don't compute pvclock adjustments if we trust the tsc

On Mon, Apr 26, 2010 at 01:46:28PM -0400, Glauber Costa wrote:
> If the HV told us we can fully trust the TSC, skip any
> correction
>
> Signed-off-by: Glauber Costa <[email protected]>
> ---
> arch/x86/include/asm/kvm_para.h | 5 +++++
> arch/x86/include/asm/pvclock-abi.h | 1 +
> arch/x86/kernel/kvmclock.c | 3 +++
> arch/x86/kernel/pvclock.c | 4 ++++
> 4 files changed, 13 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index f019f8c..615ebb1 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -21,6 +21,11 @@
> */
> #define KVM_FEATURE_CLOCKSOURCE2 3
>
> +/* The last 8 bits are used to indicate how to interpret the flags field
> + * in pvclock structure. If no bits are set, all flags are ignored.
> + */
> +#define KVM_FEATURE_CLOCKSOURCE_STABLE_TSC 0xf8
> +
> #define MSR_KVM_WALL_CLOCK 0x11
> #define MSR_KVM_SYSTEM_TIME 0x12

Please drop this. Its not certain what is the best method to reduce
contention on the global variable. Whatever it is, can be done later.

2010-04-27 13:41:53

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 4/6] export new cpuid KVM_CAP

On Mon, Apr 26, 2010 at 01:46:26PM -0400, Glauber Costa wrote:
> Since we're changing the msrs kvmclock uses, we have to communicate
> that to the guest, through cpuid. We can add a new KVM_CAP to the
> hypervisor, and then patch userspace to recognize it.
>
> And if we ever add a new cpuid bit in the future, we have to do that again,
> which create some complexity and delay in feature adoption.
>
> Instead, what I'm proposing in this patch is a new capability, called
> KVM_CAP_X86_CPUID_FEATURE_LIST, that returns the current feature list
> currently supported by the hypervisor. If we ever want to add or remove
> some feature, we only need to tweak into the HV, leaving userspace untouched.
>
> Signed-off-by: Glauber Costa <[email protected]>
> ---
> arch/x86/include/asm/kvm_para.h | 4 ++++
> arch/x86/kvm/x86.c | 6 ++++++
> include/linux/kvm.h | 1 +
> 3 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 9734808..f019f8c 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -16,6 +16,10 @@
> #define KVM_FEATURE_CLOCKSOURCE 0
> #define KVM_FEATURE_NOP_IO_DELAY 1
> #define KVM_FEATURE_MMU_OP 2
> +/* This indicates that the new set of kvmclock msrs
> + * are available. The use of 0x11 and 0x12 is deprecated
> + */
> +#define KVM_FEATURE_CLOCKSOURCE2 3
>
> #define MSR_KVM_WALL_CLOCK 0x11
> #define MSR_KVM_SYSTEM_TIME 0x12
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a2ead7f..04f04aa 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
> case KVM_CAP_MCE:
> r = KVM_MAX_MCE_BANKS;
> break;
> + case KVM_CAP_X86_CPUID_FEATURE_LIST:
> + r = (1 << KVM_FEATURE_CLOCKSOURCE) |
> + (1 << KVM_FEATURE_NOP_IO_DELAY) |
> + (1 << KVM_FEATURE_MMU_OP) |

Remove this flag, it has been deprecated.

> + (1 << KVM_FEATURE_CLOCKSOURCE2);
> + break;
> default:

Also missing qemu-kvm/qemu patches.

2010-04-27 13:41:56

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 5/6] Try using new kvm clock msrs

On Mon, Apr 26, 2010 at 01:46:27PM -0400, Glauber Costa wrote:
> We now added a new set of clock-related msrs in replacement of the old
> ones. In theory, we could just try to use them and get a return value
> indicating they do not exist, due to our use of kvm_write_msr_save.
>
> However, kvm clock registration happens very early, and if we ever
> try to write to a non-existant MSR, we raise a lethal #GP, since our
> idt handlers are not in place yet.
>
> So this patch tests for a cpuid feature exported by the host to
> decide which set of msrs are supported.
>
> Signed-off-by: Glauber Costa <[email protected]>
> ---
> arch/x86/kernel/kvmclock.c | 56 +++++++++++++++++++++++++++----------------
> 1 files changed, 35 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index feaeb0d..f2f6aee 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -29,6 +29,8 @@
> #define KVM_SCALE 22
>
> static int kvmclock = 1;
> +static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
> +static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
>
> static int parse_no_kvmclock(char *arg)
> {
> @@ -41,6 +43,7 @@ early_param("no-kvmclock", parse_no_kvmclock);
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
> static struct pvclock_wall_clock wall_clock;
>
> +

Extra newline.

> /*
> * The wallclock is the time of day when we booted. Since then, some time may
> * have elapsed since the hypervisor wrote the data. So we try to account for
> @@ -54,7 +57,8 @@ static unsigned long kvm_get_wallclock(void)
>
> low = (int)__pa_symbol(&wall_clock);
> high = ((u64)__pa_symbol(&wall_clock) >> 32);
> - native_write_msr(MSR_KVM_WALL_CLOCK, low, high);
> +
> + native_write_msr_safe(msr_kvm_wall_clock, low, high);
>
> vcpu_time = &get_cpu_var(hv_clock);
> pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
> @@ -130,7 +134,8 @@ static int kvm_register_clock(char *txt)
> high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
> printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
> cpu, high, low, txt);
> - return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
> +
> + return native_write_msr_safe(msr_kvm_system_time, low, high);
> }
>
> #ifdef CONFIG_X86_LOCAL_APIC
> @@ -165,14 +170,15 @@ static void __init kvm_smp_prepare_boot_cpu(void)
> #ifdef CONFIG_KEXEC
> static void kvm_crash_shutdown(struct pt_regs *regs)
> {
> - native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
> +
> + native_write_msr(msr_kvm_system_time, 0, 0);
> native_machine_crash_shutdown(regs);
> }
> #endif
>
> static void kvm_shutdown(void)
> {
> - native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
> + native_write_msr(msr_kvm_system_time, 0, 0);
> native_machine_shutdown();
> }
>
> @@ -181,27 +187,35 @@ void __init kvmclock_init(void)
> if (!kvm_para_available())
> return;
>
> - if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
> - if (kvm_register_clock("boot clock"))
> - return;
> - pv_time_ops.sched_clock = kvm_clock_read;
> - x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> - x86_platform.get_wallclock = kvm_get_wallclock;
> - x86_platform.set_wallclock = kvm_set_wallclock;
> + if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
> + msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
> + msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
> + }
> + else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
> + return;

Non conformant indentation.

2010-04-27 13:41:59

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 2/6] Add a global synchronization point for pvclock

On Mon, Apr 26, 2010 at 01:46:24PM -0400, Glauber Costa wrote:
> In recent stress tests, it was found that pvclock-based systems
> could seriously warp in smp systems. Using ingo's time-warp-test.c,
> I could trigger a scenario as bad as 1.5mi warps a minute in some systems.
> (to be fair, it wasn't that bad in most of them). Investigating further, I
> found out that such warps were caused by the very offset-based calculation
> pvclock is based on.
>
> This happens even on some machines that report constant_tsc in its tsc flags,
> specially on multi-socket ones.
>
> Two reads of the same kernel timestamp at approx the same time, will likely
> have tsc timestamped in different occasions too. This means the delta we
> calculate is unpredictable at best, and can probably be smaller in a cpu
> that is legitimately reading clock in a forward ocasion.
>
> Some adjustments on the host could make this window less likely to happen,
> but still, it pretty much poses as an intrinsic problem of the mechanism.
>
> A while ago, I though about using a shared variable anyway, to hold clock
> last state, but gave up due to the high contention locking was likely
> to introduce, possibly rendering the thing useless on big machines. I argue,
> however, that locking is not necessary.
>
> We do a read-and-return sequence in pvclock, and between read and return,
> the global value can have changed. However, it can only have changed
> by means of an addition of a positive value. So if we detected that our
> clock timestamp is less than the current global, we know that we need to
> return a higher one, even though it is not exactly the one we compared to.
>
> OTOH, if we detect we're greater than the current time source, we atomically
> replace the value with our new readings. This do causes contention on big
> boxes (but big here means *BIG*), but it seems like a good trade off, since
> it provide us with a time source guaranteed to be stable wrt time warps.
>
> After this patch is applied, I don't see a single warp in time during 5 days
> of execution, in any of the machines I saw them before.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Marcelo Tosatti <[email protected]>
> CC: Zachary Amsden <[email protected]>
> ---
> arch/x86/kernel/pvclock.c | 24 ++++++++++++++++++++++++
> 1 files changed, 24 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
> index 8f4af7b..6cf6dec 100644
> --- a/arch/x86/kernel/pvclock.c
> +++ b/arch/x86/kernel/pvclock.c
> @@ -118,11 +118,14 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
> return pv_tsc_khz;
> }
>
> +static atomic64_t last_value = ATOMIC64_INIT(0);
> +
> cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> {
> struct pvclock_shadow_time shadow;
> unsigned version;
> cycle_t ret, offset;
> + u64 last;
>
> do {
> version = pvclock_get_time_values(&shadow, src);
> @@ -132,6 +135,27 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> barrier();
> } while (version != src->version);
>
> + /*
> + * Assumption here is that last_value, a global accumulator, always goes
> + * forward. If we are less than that, we should not be much smaller.
> + * We assume there is an error marging we're inside, and then the correction
> + * does not sacrifice accuracy.
> + *
> + * For reads: global may have changed between test and return,
> + * but this means someone else updated poked the clock at a later time.
> + * We just need to make sure we are not seeing a backwards event.
> + *
> + * For updates: last_value = ret is not enough, since two vcpus could be
> + * updating at the same time, and one of them could be slightly behind,
> + * making the assumption that last_value always go forward fail to hold.
> + */
> + last = atomic64_read(&last_value);
> + do {
> + if (ret < last)
> + return last;
> + last = atomic64_cmpxchg(&last_value, last, ret);
> + } while (unlikely(last != ret));

Wraparound?

2010-04-27 15:10:05

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 4/6] export new cpuid KVM_CAP

On Tue, Apr 27, 2010 at 10:30:48AM -0300, Marcelo Tosatti wrote:
> On Mon, Apr 26, 2010 at 01:46:26PM -0400, Glauber Costa wrote:
> > Since we're changing the msrs kvmclock uses, we have to communicate
> > that to the guest, through cpuid. We can add a new KVM_CAP to the
> > hypervisor, and then patch userspace to recognize it.
> >
> > And if we ever add a new cpuid bit in the future, we have to do that again,
> > which create some complexity and delay in feature adoption.
> >
> > Instead, what I'm proposing in this patch is a new capability, called
> > KVM_CAP_X86_CPUID_FEATURE_LIST, that returns the current feature list
> > currently supported by the hypervisor. If we ever want to add or remove
> > some feature, we only need to tweak into the HV, leaving userspace untouched.
> >
> > Signed-off-by: Glauber Costa <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_para.h | 4 ++++
> > arch/x86/kvm/x86.c | 6 ++++++
> > include/linux/kvm.h | 1 +
> > 3 files changed, 11 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index 9734808..f019f8c 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -16,6 +16,10 @@
> > #define KVM_FEATURE_CLOCKSOURCE 0
> > #define KVM_FEATURE_NOP_IO_DELAY 1
> > #define KVM_FEATURE_MMU_OP 2
> > +/* This indicates that the new set of kvmclock msrs
> > + * are available. The use of 0x11 and 0x12 is deprecated
> > + */
> > +#define KVM_FEATURE_CLOCKSOURCE2 3
> >
> > #define MSR_KVM_WALL_CLOCK 0x11
> > #define MSR_KVM_SYSTEM_TIME 0x12
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a2ead7f..04f04aa 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
> > case KVM_CAP_MCE:
> > r = KVM_MAX_MCE_BANKS;
> > break;
> > + case KVM_CAP_X86_CPUID_FEATURE_LIST:
> > + r = (1 << KVM_FEATURE_CLOCKSOURCE) |
> > + (1 << KVM_FEATURE_NOP_IO_DELAY) |
> > + (1 << KVM_FEATURE_MMU_OP) |
>
> Remove this flag, it has been deprecated.

The deprecation of this flag has nothing to do with this series.
Anyway, we can't pick its number, even if it is really deprecated,
since we can still have old hvs around, so, don't really see the point here

>
> > + (1 << KVM_FEATURE_CLOCKSOURCE2);
> > + break;
> > default:
>
> Also missing qemu-kvm/qemu patches.
>
Of course it is missing. It is a totally separate story.

2010-04-27 15:11:41

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 6/6] don't compute pvclock adjustments if we trust the tsc

On Tue, Apr 27, 2010 at 10:40:13AM -0300, Marcelo Tosatti wrote:
> On Mon, Apr 26, 2010 at 01:46:28PM -0400, Glauber Costa wrote:
> > If the HV told us we can fully trust the TSC, skip any
> > correction
> >
> > Signed-off-by: Glauber Costa <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_para.h | 5 +++++
> > arch/x86/include/asm/pvclock-abi.h | 1 +
> > arch/x86/kernel/kvmclock.c | 3 +++
> > arch/x86/kernel/pvclock.c | 4 ++++
> > 4 files changed, 13 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index f019f8c..615ebb1 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -21,6 +21,11 @@
> > */
> > #define KVM_FEATURE_CLOCKSOURCE2 3
> >
> > +/* The last 8 bits are used to indicate how to interpret the flags field
> > + * in pvclock structure. If no bits are set, all flags are ignored.
> > + */
> > +#define KVM_FEATURE_CLOCKSOURCE_STABLE_TSC 0xf8
> > +
> > #define MSR_KVM_WALL_CLOCK 0x11
> > #define MSR_KVM_SYSTEM_TIME 0x12
>
> Please drop this. Its not certain what is the best method to reduce
> contention on the global variable. Whatever it is, can be done later.
You tell me to drop it, avi tells me to write this, I am happy to pick
any one of the two options. Just don't want to get ping-ponged between
haveit-dropit.

2010-04-27 16:55:29

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 4/6] export new cpuid KVM_CAP

On Tue, Apr 27, 2010 at 10:30:48AM -0300, Marcelo Tosatti wrote:
> On Mon, Apr 26, 2010 at 01:46:26PM -0400, Glauber Costa wrote:
> > Since we're changing the msrs kvmclock uses, we have to communicate
> > that to the guest, through cpuid. We can add a new KVM_CAP to the
> > hypervisor, and then patch userspace to recognize it.
> >
> > And if we ever add a new cpuid bit in the future, we have to do that again,
> > which create some complexity and delay in feature adoption.
> >
> > Instead, what I'm proposing in this patch is a new capability, called
> > KVM_CAP_X86_CPUID_FEATURE_LIST, that returns the current feature list
> > currently supported by the hypervisor. If we ever want to add or remove
> > some feature, we only need to tweak into the HV, leaving userspace untouched.
> >
> > Signed-off-by: Glauber Costa <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_para.h | 4 ++++
> > arch/x86/kvm/x86.c | 6 ++++++
> > include/linux/kvm.h | 1 +
> > 3 files changed, 11 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index 9734808..f019f8c 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -16,6 +16,10 @@
> > #define KVM_FEATURE_CLOCKSOURCE 0
> > #define KVM_FEATURE_NOP_IO_DELAY 1
> > #define KVM_FEATURE_MMU_OP 2
> > +/* This indicates that the new set of kvmclock msrs
> > + * are available. The use of 0x11 and 0x12 is deprecated
> > + */
> > +#define KVM_FEATURE_CLOCKSOURCE2 3
> >
> > #define MSR_KVM_WALL_CLOCK 0x11
> > #define MSR_KVM_SYSTEM_TIME 0x12
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a2ead7f..04f04aa 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
> > case KVM_CAP_MCE:
> > r = KVM_MAX_MCE_BANKS;
> > break;
> > + case KVM_CAP_X86_CPUID_FEATURE_LIST:
> > + r = (1 << KVM_FEATURE_CLOCKSOURCE) |
> > + (1 << KVM_FEATURE_NOP_IO_DELAY) |
> > + (1 << KVM_FEATURE_MMU_OP) |
>
> Remove this flag, it has been deprecated.

Ok, sorry. My bad here.

at a quick glance, I did not realize you were commenting on this
patch specifically. KVM_CAP_X86_CPUID_FEATURE_LIST is a new
entity, so there is no need to return a deprecated feature.

I will remove that.

2010-04-27 18:00:43

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 2/6] Add a global synchronization point for pvclock

On 04/27/2010 06:28 AM, Marcelo Tosatti wrote:
>> + last = atomic64_read(&last_value);
>> + do {
>> + if (ret < last)
>> + return last;
>> + last = atomic64_cmpxchg(&last_value, last, ret);
>> + } while (unlikely(last != ret));
>>
> Wraparound?
>

If the units nanoseconds, it's good for ~500,000 years of uptime.

J

2010-04-27 18:04:05

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 6/6] don't compute pvclock adjustments if we trust the tsc

On 04/27/2010 04:40 PM, Marcelo Tosatti wrote:
> On Mon, Apr 26, 2010 at 01:46:28PM -0400, Glauber Costa wrote:
>
>> If the HV told us we can fully trust the TSC, skip any
>> correction
>>
>>
> Please drop this. Its not certain what is the best method to reduce
> contention on the global variable. Whatever it is, can be done later.
>
>

The problem is lead time. If we (or the cpu vendors) fix this in the
hypervisor/processors in a year, then we want existing guests to take
advantage of it immediately.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-27 18:07:25

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/6] Enable pvclock flags in vcpu_time_info structure

On 04/26/2010 08:46 PM, Glauber Costa wrote:
> This patch removes one padding byte and transform it into a flags
> field. New versions of guests using pvclock will query these flags
> upon each read.
>
> Flags, however, will only be interpreted when the guest decides to.
> It uses the pvclock_valid_flags function to signal that a specific
> set of flags should be taken into consideration. Which flags are valid
> are usually devised via HV negotiation.
>
>
> +void pvclock_valid_flags(u8 flags);
>

set_pvclock_valid_flags() (or just set_pvclock_flags?
pvclock_set_flags()?). The original name looks like a getter.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-27 18:12:53

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 4/6] export new cpuid KVM_CAP

On 04/26/2010 08:46 PM, Glauber Costa wrote:
> Since we're changing the msrs kvmclock uses, we have to communicate
> that to the guest, through cpuid. We can add a new KVM_CAP to the
> hypervisor, and then patch userspace to recognize it.
>
> And if we ever add a new cpuid bit in the future, we have to do that again,
> which create some complexity and delay in feature adoption.
>
> Instead, what I'm proposing in this patch is a new capability, called
> KVM_CAP_X86_CPUID_FEATURE_LIST, that returns the current feature list
> currently supported by the hypervisor. If we ever want to add or remove
> some feature, we only need to tweak into the HV, leaving userspace untouched.
>
> Signed-off-by: Glauber Costa<[email protected]>
> ---
> arch/x86/include/asm/kvm_para.h | 4 ++++
> arch/x86/kvm/x86.c | 6 ++++++
> include/linux/kvm.h | 1 +
> 3 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 9734808..f019f8c 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -16,6 +16,10 @@
> #define KVM_FEATURE_CLOCKSOURCE 0
> #define KVM_FEATURE_NOP_IO_DELAY 1
> #define KVM_FEATURE_MMU_OP 2
> +/* This indicates that the new set of kvmclock msrs
> + * are available. The use of 0x11 and 0x12 is deprecated
> + */
> +#define KVM_FEATURE_CLOCKSOURCE2 3
>

Please document this in Documentation/kvm/cpuid.txt (also /msr.txt).

>
> #define MSR_KVM_WALL_CLOCK 0x11
> #define MSR_KVM_SYSTEM_TIME 0x12
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a2ead7f..04f04aa 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
> case KVM_CAP_MCE:
> r = KVM_MAX_MCE_BANKS;
> break;
> + case KVM_CAP_X86_CPUID_FEATURE_LIST:
> + r = (1<< KVM_FEATURE_CLOCKSOURCE) |
> + (1<< KVM_FEATURE_NOP_IO_DELAY) |
> + (1<< KVM_FEATURE_MMU_OP) |
> + (1<< KVM_FEATURE_CLOCKSOURCE2);
> + break;
> default:
> r = 0;
> break;
>

Hmm. We already have an API to get cpuid bits:
KVM_GET_SUPPORTED_CPUID2. The nice thing about it is that it will mean
-cpu host will work out of the box.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-27 18:57:57

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 6/6] don't compute pvclock adjustments if we trust the tsc

On Tue, Apr 27, 2010 at 09:03:55PM +0300, Avi Kivity wrote:
> On 04/27/2010 04:40 PM, Marcelo Tosatti wrote:
> >On Mon, Apr 26, 2010 at 01:46:28PM -0400, Glauber Costa wrote:
> >>If the HV told us we can fully trust the TSC, skip any
> >>correction
> >>
> >Please drop this. Its not certain what is the best method to reduce
> >contention on the global variable. Whatever it is, can be done later.
> >
>
> The problem is lead time. If we (or the cpu vendors) fix this in
> the hypervisor/processors in a year, then we want existing guests to
> take advantage of it immediately.
100 % Agreed.

2010-04-27 19:09:07

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 4/6] export new cpuid KVM_CAP

On Tue, Apr 27, 2010 at 09:12:47PM +0300, Avi Kivity wrote:
> On 04/26/2010 08:46 PM, Glauber Costa wrote:
> >Since we're changing the msrs kvmclock uses, we have to communicate
> >that to the guest, through cpuid. We can add a new KVM_CAP to the
> >hypervisor, and then patch userspace to recognize it.
> >
> >And if we ever add a new cpuid bit in the future, we have to do that again,
> >which create some complexity and delay in feature adoption.
> >
> >Instead, what I'm proposing in this patch is a new capability, called
> >KVM_CAP_X86_CPUID_FEATURE_LIST, that returns the current feature list
> >currently supported by the hypervisor. If we ever want to add or remove
> >some feature, we only need to tweak into the HV, leaving userspace untouched.
> >
> >Signed-off-by: Glauber Costa<[email protected]>
> >---
> > arch/x86/include/asm/kvm_para.h | 4 ++++
> > arch/x86/kvm/x86.c | 6 ++++++
> > include/linux/kvm.h | 1 +
> > 3 files changed, 11 insertions(+), 0 deletions(-)
> >
> >diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> >index 9734808..f019f8c 100644
> >--- a/arch/x86/include/asm/kvm_para.h
> >+++ b/arch/x86/include/asm/kvm_para.h
> >@@ -16,6 +16,10 @@
> > #define KVM_FEATURE_CLOCKSOURCE 0
> > #define KVM_FEATURE_NOP_IO_DELAY 1
> > #define KVM_FEATURE_MMU_OP 2
> >+/* This indicates that the new set of kvmclock msrs
> >+ * are available. The use of 0x11 and 0x12 is deprecated
> >+ */
> >+#define KVM_FEATURE_CLOCKSOURCE2 3
>
> Please document this in Documentation/kvm/cpuid.txt (also /msr.txt).
>
> >
> > #define MSR_KVM_WALL_CLOCK 0x11
> > #define MSR_KVM_SYSTEM_TIME 0x12
> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >index a2ead7f..04f04aa 100644
> >--- a/arch/x86/kvm/x86.c
> >+++ b/arch/x86/kvm/x86.c
> >@@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
> > case KVM_CAP_MCE:
> > r = KVM_MAX_MCE_BANKS;
> > break;
> >+ case KVM_CAP_X86_CPUID_FEATURE_LIST:
> >+ r = (1<< KVM_FEATURE_CLOCKSOURCE) |
> >+ (1<< KVM_FEATURE_NOP_IO_DELAY) |
> >+ (1<< KVM_FEATURE_MMU_OP) |
> >+ (1<< KVM_FEATURE_CLOCKSOURCE2);
> >+ break;
> > default:
> > r = 0;
> > break;
>
> Hmm. We already have an API to get cpuid bits:
> KVM_GET_SUPPORTED_CPUID2. The nice thing about it is that it will
> mean -cpu host will work out of the box.

Ok, from what I understand, KVM_GET_CPUID2 gets a set of features, and tells
userspace which of them are available. Right?

If this is the case, userspace could ask for 0xffffffff, and then we tell them
which of them are present.

Does that make sense?

2010-04-27 19:09:37

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 1/6] Enable pvclock flags in vcpu_time_info structure

On Tue, Apr 27, 2010 at 09:07:18PM +0300, Avi Kivity wrote:
> On 04/26/2010 08:46 PM, Glauber Costa wrote:
> >This patch removes one padding byte and transform it into a flags
> >field. New versions of guests using pvclock will query these flags
> >upon each read.
> >
> >Flags, however, will only be interpreted when the guest decides to.
> >It uses the pvclock_valid_flags function to signal that a specific
> >set of flags should be taken into consideration. Which flags are valid
> >are usually devised via HV negotiation.
> >
> >
> >+void pvclock_valid_flags(u8 flags);
>
> set_pvclock_valid_flags() (or just set_pvclock_flags?
> pvclock_set_flags()?). The original name looks like a getter.
Ok, will change it.

2010-04-27 19:20:56

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 4/6] export new cpuid KVM_CAP

On 04/27/2010 10:09 PM, Glauber Costa wrote:
>
>> Hmm. We already have an API to get cpuid bits:
>> KVM_GET_SUPPORTED_CPUID2. The nice thing about it is that it will
>> mean -cpu host will work out of the box.
>>
> Ok, from what I understand, KVM_GET_CPUID2 gets a set of features, and tells
> userspace which of them are available. Right?
>

No. KVM_GET_CPUID2 reads what was set by KVM_SET_CPUID, as modified by
the guest executing the cpuid instruction. KVM_GET_SUPPORTED_CPUID
tells userspace which bits are supported by the host cpu and kvm.

> If this is the case, userspace could ask for 0xffffffff, and then we tell them
> which of them are present.
>
> Does that make sense?
>

The API for KVM_GET_SUPPORTED_CPUID returns all cpuid leaves supported
in one go, IIRC.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.