NB: patch 1 doesn't really belong here, but it makes this a lot
easier for me to test. Patch 1, if it's okay at all, should go
though the kvm tree. The rest should probably go through
tip:x86/vdso once they're reviewed.
I'll do a followup to enable vdso pvclock on 32-bit guests.
I'm not currently set up to test it. (The KVM people could also
do it very easily on top of these patches.)
Andy Lutomirski (5):
x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
x86/vdso: Remove pvclock fixmap machinery
x86/vdso: Enable vdso pvclock access on all vdso variants
arch/x86/entry/vdso/vclock_gettime.c | 151 ++++++++++++++++------------------
arch/x86/entry/vdso/vdso-layout.lds.S | 3 +-
arch/x86/entry/vdso/vdso2c.c | 3 +
arch/x86/entry/vdso/vma.c | 14 ++++
arch/x86/include/asm/fixmap.h | 5 --
arch/x86/include/asm/pvclock.h | 14 ++--
arch/x86/include/asm/vdso.h | 1 +
arch/x86/kernel/kvmclock.c | 11 ++-
arch/x86/kernel/pvclock.c | 24 ------
arch/x86/kvm/x86.c | 75 +----------------
10 files changed, 110 insertions(+), 191 deletions(-)
--
2.5.0
This gets rid of the "did TSC go backwards" logic and just updates
all clocks. It should work better (no more disabling of fast
timing) and more reliably (all of the clocks are actually updated).
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kvm/x86.c | 75 +++---------------------------------------------------
1 file changed, 3 insertions(+), 72 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eed32283d22c..c88f91f4b1a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
unsigned int __read_mostly lapic_timer_advance_ns = 0;
module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
-static bool __read_mostly backwards_tsc_observed = false;
-
#define KVM_NR_SHARED_MSRS 16
struct kvm_shared_msrs_global {
@@ -1671,7 +1669,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
&ka->master_cycle_now);
ka->use_master_clock = host_tsc_clocksource && vcpus_matched
- && !backwards_tsc_observed
&& !ka->boot_vcpu_runs_old_kvmclock;
if (ka->use_master_clock)
@@ -7369,88 +7366,22 @@ int kvm_arch_hardware_enable(void)
struct kvm_vcpu *vcpu;
int i;
int ret;
- u64 local_tsc;
- u64 max_tsc = 0;
- bool stable, backwards_tsc = false;
kvm_shared_msr_cpu_online();
ret = kvm_x86_ops->hardware_enable();
if (ret != 0)
return ret;
- local_tsc = rdtsc();
- stable = !check_tsc_unstable();
list_for_each_entry(kvm, &vm_list, vm_list) {
kvm_for_each_vcpu(i, vcpu, kvm) {
- if (!stable && vcpu->cpu == smp_processor_id())
+ if (vcpu->cpu == smp_processor_id()) {
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
- if (stable && vcpu->arch.last_host_tsc > local_tsc) {
- backwards_tsc = true;
- if (vcpu->arch.last_host_tsc > max_tsc)
- max_tsc = vcpu->arch.last_host_tsc;
+ kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
+ vcpu);
}
}
}
- /*
- * Sometimes, even reliable TSCs go backwards. This happens on
- * platforms that reset TSC during suspend or hibernate actions, but
- * maintain synchronization. We must compensate. Fortunately, we can
- * detect that condition here, which happens early in CPU bringup,
- * before any KVM threads can be running. Unfortunately, we can't
- * bring the TSCs fully up to date with real time, as we aren't yet far
- * enough into CPU bringup that we know how much real time has actually
- * elapsed; our helper function, get_kernel_ns() will be using boot
- * variables that haven't been updated yet.
- *
- * So we simply find the maximum observed TSC above, then record the
- * adjustment to TSC in each VCPU. When the VCPU later gets loaded,
- * the adjustment will be applied. Note that we accumulate
- * adjustments, in case multiple suspend cycles happen before some VCPU
- * gets a chance to run again. In the event that no KVM threads get a
- * chance to run, we will miss the entire elapsed period, as we'll have
- * reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
- * loose cycle time. This isn't too big a deal, since the loss will be
- * uniform across all VCPUs (not to mention the scenario is extremely
- * unlikely). It is possible that a second hibernate recovery happens
- * much faster than a first, causing the observed TSC here to be
- * smaller; this would require additional padding adjustment, which is
- * why we set last_host_tsc to the local tsc observed here.
- *
- * N.B. - this code below runs only on platforms with reliable TSC,
- * as that is the only way backwards_tsc is set above. Also note
- * that this runs for ALL vcpus, which is not a bug; all VCPUs should
- * have the same delta_cyc adjustment applied if backwards_tsc
- * is detected. Note further, this adjustment is only done once,
- * as we reset last_host_tsc on all VCPUs to stop this from being
- * called multiple times (one for each physical CPU bringup).
- *
- * Platforms with unreliable TSCs don't have to deal with this, they
- * will be compensated by the logic in vcpu_load, which sets the TSC to
- * catchup mode. This will catchup all VCPUs to real time, but cannot
- * guarantee that they stay in perfect synchronization.
- */
- if (backwards_tsc) {
- u64 delta_cyc = max_tsc - local_tsc;
- backwards_tsc_observed = true;
- list_for_each_entry(kvm, &vm_list, vm_list) {
- kvm_for_each_vcpu(i, vcpu, kvm) {
- vcpu->arch.tsc_offset_adjustment += delta_cyc;
- vcpu->arch.last_host_tsc = local_tsc;
- kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
- }
-
- /*
- * We have to disable TSC offset matching.. if you were
- * booting a VM while issuing an S4 host suspend....
- * you may have some problem. Solving this issue is
- * left as an exercise to the reader.
- */
- kvm->arch.last_tsc_nsec = 0;
- kvm->arch.last_tsc_write = 0;
- }
-
- }
return 0;
}
--
2.5.0
From: Andy Lutomirski <[email protected]>
The pvclock vdso code was too abstracted to understand easily and
excessively paranoid. Simplify it for a huge speedup.
This opens the door for additional simplifications, as the vdso no
longer accesses the pvti for any vcpu other than vcpu 0.
Before, vclock_gettime using kvm-clock took about 45ns on my machine.
With this change, it takes 29ns, which is almost as fast as the pure TSC
implementation.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 81 ++++++++++++++++++++----------------
1 file changed, 46 insertions(+), 35 deletions(-)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index ca94fa649251..c325ba1bdddf 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
static notrace cycle_t vread_pvclock(int *mode)
{
- const struct pvclock_vsyscall_time_info *pvti;
+ const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
cycle_t ret;
- u64 last;
- u32 version;
- u8 flags;
- unsigned cpu, cpu1;
-
+ u64 tsc, pvti_tsc;
+ u64 last, delta, pvti_system_time;
+ u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
/*
- * Note: hypervisor must guarantee that:
- * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
- * 2. that per-CPU pvclock time info is updated if the
- * underlying CPU changes.
- * 3. that version is increased whenever underlying CPU
- * changes.
+ * Note: The kernel and hypervisor must guarantee that cpu ID
+ * number maps 1:1 to per-CPU pvclock time info.
+ *
+ * Because the hypervisor is entirely unaware of guest userspace
+ * preemption, it cannot guarantee that per-CPU pvclock time
+ * info is updated if the underlying CPU changes or that that
+ * version is increased whenever underlying CPU changes.
*
+ * On KVM, we are guaranteed that pvti updates for any vCPU are
+ * atomic as seen by *all* vCPUs. This is an even stronger
+ * guarantee than we get with a normal seqlock.
+ *
+ * On Xen, we don't appear to have that guarantee, but Xen still
+ * supplies a valid seqlock using the version field.
+
+ * We only do pvclock vdso timing at all if
+ * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
+ * mean that all vCPUs have matching pvti and that the TSC is
+ * synced, so we can just look at vCPU 0's pvti.
*/
- do {
- cpu = __getcpu() & VGETCPU_CPU_MASK;
- /* TODO: We can put vcpu id into higher bits of pvti.version.
- * This will save a couple of cycles by getting rid of
- * __getcpu() calls (Gleb).
- */
-
- pvti = get_pvti(cpu);
-
- version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
-
- /*
- * Test we're still on the cpu as well as the version.
- * We could have been migrated just after the first
- * vgetcpu but before fetching the version, so we
- * wouldn't notice a version change.
- */
- cpu1 = __getcpu() & VGETCPU_CPU_MASK;
- } while (unlikely(cpu != cpu1 ||
- (pvti->pvti.version & 1) ||
- pvti->pvti.version != version));
-
- if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+
+ if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
+ return 0;
+ }
+
+ do {
+ version = pvti->version;
+
+ /* This is also a read barrier, so we'll read version first. */
+ tsc = rdtsc_ordered();
+
+ pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
+ pvti_tsc_shift = pvti->tsc_shift;
+ pvti_system_time = pvti->system_time;
+ pvti_tsc = pvti->tsc_timestamp;
+
+ /* Make sure that the version double-check is last. */
+ smp_rmb();
+ } while (unlikely((version & 1) || version != pvti->version));
+
+ delta = tsc - pvti_tsc;
+ ret = pvti_system_time +
+ pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
+ pvti_tsc_shift);
/* refer to tsc.c read_tsc() comment for rationale */
last = gtod->cycle_last;
--
2.5.0
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 20 ++++++++------------
arch/x86/entry/vdso/vdso-layout.lds.S | 3 ++-
arch/x86/entry/vdso/vdso2c.c | 3 +++
arch/x86/entry/vdso/vma.c | 13 +++++++++++++
arch/x86/include/asm/pvclock.h | 9 +++++++++
arch/x86/include/asm/vdso.h | 1 +
arch/x86/kernel/kvmclock.c | 5 +++++
7 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index c325ba1bdddf..5dd363d54348 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void)
}
#endif
+#ifdef CONFIG_PARAVIRT_CLOCK
+extern u8 pvclock_page
+ __attribute__((visibility("hidden")));
+#endif
+
#ifndef BUILD_VDSO32
#include <linux/kernel.h>
@@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
#ifdef CONFIG_PARAVIRT_CLOCK
-static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
{
- const struct pvclock_vsyscall_time_info *pvti_base;
- int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
- int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
-
- BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
-
- pvti_base = (struct pvclock_vsyscall_time_info *)
- __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
-
- return &pvti_base[offset];
+ return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
}
static notrace cycle_t vread_pvclock(int *mode)
{
- const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
+ const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
cycle_t ret;
u64 tsc, pvti_tsc;
u64 last, delta, pvti_system_time;
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index de2c921025f5..4158acc17df0 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -25,7 +25,7 @@ SECTIONS
* segment.
*/
- vvar_start = . - 2 * PAGE_SIZE;
+ vvar_start = . - 3 * PAGE_SIZE;
vvar_page = vvar_start;
/* Place all vvars at the offsets in asm/vvar.h. */
@@ -36,6 +36,7 @@ SECTIONS
#undef EMIT_VVAR
hpet_page = vvar_start + PAGE_SIZE;
+ pvclock_page = vvar_start + 2 * PAGE_SIZE;
. = SIZEOF_HEADERS;
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 785d9922b106..491020b2826d 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -73,6 +73,7 @@ enum {
sym_vvar_start,
sym_vvar_page,
sym_hpet_page,
+ sym_pvclock_page,
sym_VDSO_FAKE_SECTION_TABLE_START,
sym_VDSO_FAKE_SECTION_TABLE_END,
};
@@ -80,6 +81,7 @@ enum {
const int special_pages[] = {
sym_vvar_page,
sym_hpet_page,
+ sym_pvclock_page,
};
struct vdso_sym {
@@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = {
[sym_vvar_start] = {"vvar_start", true},
[sym_vvar_page] = {"vvar_page", true},
[sym_hpet_page] = {"hpet_page", true},
+ [sym_pvclock_page] = {"pvclock_page", true},
[sym_VDSO_FAKE_SECTION_TABLE_START] = {
"VDSO_FAKE_SECTION_TABLE_START", false
},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 64df47148160..aa828191c654 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
.name = "[vvar]",
.pages = no_pages,
};
+ struct pvclock_vsyscall_time_info *pvti;
if (calculate_addr) {
addr = vdso_addr(current->mm->start_stack,
@@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
}
#endif
+ pvti = pvclock_pvti_cpu0_va();
+ if (pvti && image->sym_pvclock_page) {
+ ret = remap_pfn_range(vma,
+ text_start + image->sym_pvclock_page,
+ __pa(pvti) >> PAGE_SHIFT,
+ PAGE_SIZE,
+ PAGE_READONLY);
+
+ if (ret)
+ goto up_fail;
+ }
+
up_fail:
if (ret)
current->mm->context.vdso = NULL;
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 7a6bed5c08bc..3864398c7cb2 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,6 +4,15 @@
#include <linux/clocksource.h>
#include <asm/pvclock-abi.h>
+#ifdef CONFIG_PARAVIRT_CLOCK
+extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+ return NULL;
+}
+#endif
+
/* some helper functions for xen and kvm pv clock sources */
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 756de9190aec..deabaf9759b6 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -22,6 +22,7 @@ struct vdso_image {
long sym_vvar_page;
long sym_hpet_page;
+ long sym_pvclock_page;
long sym_VDSO32_NOTE_MASK;
long sym___kernel_sigreturn;
long sym___kernel_rt_sigreturn;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 2bd81e302427..ec1b06dc82d2 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -45,6 +45,11 @@ early_param("no-kvmclock", parse_no_kvmclock);
static struct pvclock_vsyscall_time_info *hv_clock;
static struct pvclock_wall_clock wall_clock;
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+ return hv_clock;
+}
+
/*
* The wallclock is the time of day when we booted. Since then, some time may
* have elapsed since the hypervisor wrote the data. So we try to account for
--
2.5.0
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 1 -
arch/x86/entry/vdso/vma.c | 1 +
arch/x86/include/asm/fixmap.h | 5 -----
arch/x86/include/asm/pvclock.h | 5 -----
arch/x86/kernel/kvmclock.c | 6 ------
arch/x86/kernel/pvclock.c | 24 ------------------------
6 files changed, 1 insertion(+), 41 deletions(-)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 5dd363d54348..59a98c25bde7 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -45,7 +45,6 @@ extern u8 pvclock_page
#include <linux/kernel.h>
#include <asm/vsyscall.h>
-#include <asm/fixmap.h>
#include <asm/pvclock.h>
notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index aa828191c654..b8f69e264ac4 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -12,6 +12,7 @@
#include <linux/random.h>
#include <linux/elf.h>
#include <linux/cpu.h>
+#include <asm/pvclock.h>
#include <asm/vgtod.h>
#include <asm/proto.h>
#include <asm/vdso.h>
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index f80d70009ff8..6d7d0e52ed5a 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -19,7 +19,6 @@
#include <asm/acpi.h>
#include <asm/apicdef.h>
#include <asm/page.h>
-#include <asm/pvclock.h>
#ifdef CONFIG_X86_32
#include <linux/threads.h>
#include <asm/kmap_types.h>
@@ -72,10 +71,6 @@ enum fixed_addresses {
#ifdef CONFIG_X86_VSYSCALL_EMULATION
VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
#endif
-#ifdef CONFIG_PARAVIRT_CLOCK
- PVCLOCK_FIXMAP_BEGIN,
- PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
-#endif
#endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 3864398c7cb2..66df22b2e0c9 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info {
} __attribute__((__aligned__(SMP_CACHE_BYTES)));
#define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
-#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
- int size);
-struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
#endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ec1b06dc82d2..72cef58693c7 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
{
#ifdef CONFIG_X86_64
int cpu;
- int ret;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
unsigned int size;
@@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
return 1;
}
- if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
- put_cpu();
- return ret;
- }
-
put_cpu();
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 2f355d229a58..99bfc025111d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
}
-
-#ifdef CONFIG_X86_64
-/*
- * Initialize the generic pvclock vsyscall state. This will allocate
- * a/some page(s) for the per-vcpu pvclock information, set up a
- * fixmap mapping for the page(s)
- */
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
- int size)
-{
- int idx;
-
- WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
-
- for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
- __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
- __pa(i) + (idx*PAGE_SIZE),
- PAGE_KERNEL_VVAR);
- }
-
- return 0;
-}
-#endif
--
2.5.0
Now that pvclock doesn't require access to the fixmap, all vdso
variants can use it.
The kernel side isn't wired up for 32-bit kernels yet, but this
covers 32-bit and x32 userspace on 64-bit kernels.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 91 ++++++++++++++++--------------------
1 file changed, 40 insertions(+), 51 deletions(-)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 59a98c25bde7..8602f06c759f 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -17,8 +17,10 @@
#include <asm/vvar.h>
#include <asm/unistd.h>
#include <asm/msr.h>
+#include <asm/pvclock.h>
#include <linux/math64.h>
#include <linux/time.h>
+#include <linux/kernel.h>
#define gtod (&VVAR(vsyscall_gtod_data))
@@ -43,10 +45,6 @@ extern u8 pvclock_page
#ifndef BUILD_VDSO32
-#include <linux/kernel.h>
-#include <asm/vsyscall.h>
-#include <asm/pvclock.h>
-
notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
{
long ret;
@@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
return ret;
}
-#ifdef CONFIG_PARAVIRT_CLOCK
+#else
+
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+{
+ long ret;
+
+ asm(
+ "mov %%ebx, %%edx \n"
+ "mov %2, %%ebx \n"
+ "call __kernel_vsyscall \n"
+ "mov %%edx, %%ebx \n"
+ : "=a" (ret)
+ : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
+ : "memory", "edx");
+ return ret;
+}
+
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+ long ret;
+
+ asm(
+ "mov %%ebx, %%edx \n"
+ "mov %2, %%ebx \n"
+ "call __kernel_vsyscall \n"
+ "mov %%edx, %%ebx \n"
+ : "=a" (ret)
+ : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
+ : "memory", "edx");
+ return ret;
+}
+
+#endif
+
+#ifdef CONFIG_PARAVIRT_CLOCK
static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
{
return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
@@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode)
do {
version = pvti->version;
- /* This is also a read barrier, so we'll read version first. */
- tsc = rdtsc_ordered();
+ smp_rmb();
+ tsc = rdtsc_ordered();
pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
pvti_tsc_shift = pvti->tsc_shift;
pvti_system_time = pvti->system_time;
@@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode)
pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
pvti_tsc_shift);
- /* refer to tsc.c read_tsc() comment for rationale */
+ /* refer to vread_tsc() comment for rationale */
last = gtod->cycle_last;
if (likely(ret >= last))
@@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode)
}
#endif
-#else
-
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
-{
- long ret;
-
- asm(
- "mov %%ebx, %%edx \n"
- "mov %2, %%ebx \n"
- "call __kernel_vsyscall \n"
- "mov %%edx, %%ebx \n"
- : "=a" (ret)
- : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
- : "memory", "edx");
- return ret;
-}
-
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
- long ret;
-
- asm(
- "mov %%ebx, %%edx \n"
- "mov %2, %%ebx \n"
- "call __kernel_vsyscall \n"
- "mov %%edx, %%ebx \n"
- : "=a" (ret)
- : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
- : "memory", "edx");
- return ret;
-}
-
-#ifdef CONFIG_PARAVIRT_CLOCK
-
-static notrace cycle_t vread_pvclock(int *mode)
-{
- *mode = VCLOCK_NONE;
- return 0;
-}
-#endif
-
-#endif
-
notrace static cycle_t vread_tsc(void)
{
cycle_t ret = (cycle_t)rdtsc_ordered();
--
2.5.0
On 10/12/2015 00:12, Andy Lutomirski wrote:
> From: Andy Lutomirski <[email protected]>
>
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid. Simplify it for a huge speedup.
>
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
>
> Before, vclock_gettime using kvm-clock took about 45ns on my machine.
> With this change, it takes 29ns, which is almost as fast as the pure TSC
> implementation.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/entry/vdso/vclock_gettime.c | 81 ++++++++++++++++++++----------------
> 1 file changed, 46 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
> index ca94fa649251..c325ba1bdddf 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>
> static notrace cycle_t vread_pvclock(int *mode)
> {
> - const struct pvclock_vsyscall_time_info *pvti;
> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> cycle_t ret;
> - u64 last;
> - u32 version;
> - u8 flags;
> - unsigned cpu, cpu1;
> -
> + u64 tsc, pvti_tsc;
> + u64 last, delta, pvti_system_time;
> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>
> /*
> - * Note: hypervisor must guarantee that:
> - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> - * 2. that per-CPU pvclock time info is updated if the
> - * underlying CPU changes.
> - * 3. that version is increased whenever underlying CPU
> - * changes.
> + * Note: The kernel and hypervisor must guarantee that cpu ID
> + * number maps 1:1 to per-CPU pvclock time info.
> + *
> + * Because the hypervisor is entirely unaware of guest userspace
> + * preemption, it cannot guarantee that per-CPU pvclock time
> + * info is updated if the underlying CPU changes or that that
> + * version is increased whenever underlying CPU changes.
> *
> + * On KVM, we are guaranteed that pvti updates for any vCPU are
> + * atomic as seen by *all* vCPUs. This is an even stronger
> + * guarantee than we get with a normal seqlock.
> + *
> + * On Xen, we don't appear to have that guarantee, but Xen still
> + * supplies a valid seqlock using the version field.
> +
> + * We only do pvclock vdso timing at all if
> + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> + * mean that all vCPUs have matching pvti and that the TSC is
> + * synced, so we can just look at vCPU 0's pvti.
> */
> - do {
> - cpu = __getcpu() & VGETCPU_CPU_MASK;
> - /* TODO: We can put vcpu id into higher bits of pvti.version.
> - * This will save a couple of cycles by getting rid of
> - * __getcpu() calls (Gleb).
> - */
> -
> - pvti = get_pvti(cpu);
> -
> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> - /*
> - * Test we're still on the cpu as well as the version.
> - * We could have been migrated just after the first
> - * vgetcpu but before fetching the version, so we
> - * wouldn't notice a version change.
> - */
> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> - } while (unlikely(cpu != cpu1 ||
> - (pvti->pvti.version & 1) ||
> - pvti->pvti.version != version));
> -
> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> *mode = VCLOCK_NONE;
> + return 0;
> + }
> +
> + do {
> + version = pvti->version;
> +
> + /* This is also a read barrier, so we'll read version first. */
> + tsc = rdtsc_ordered();
> +
> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> + pvti_tsc_shift = pvti->tsc_shift;
> + pvti_system_time = pvti->system_time;
> + pvti_tsc = pvti->tsc_timestamp;
> +
> + /* Make sure that the version double-check is last. */
> + smp_rmb();
> + } while (unlikely((version & 1) || version != pvti->version));
> +
> + delta = tsc - pvti_tsc;
> + ret = pvti_system_time +
> + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> + pvti_tsc_shift);
>
> /* refer to tsc.c read_tsc() comment for rationale */
> last = gtod->cycle_last;
>
Reviewed-by: Paolo Bonzini <[email protected]>
On 10/12/2015 00:12, Andy Lutomirski wrote:
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/entry/vdso/vclock_gettime.c | 20 ++++++++------------
> arch/x86/entry/vdso/vdso-layout.lds.S | 3 ++-
> arch/x86/entry/vdso/vdso2c.c | 3 +++
> arch/x86/entry/vdso/vma.c | 13 +++++++++++++
> arch/x86/include/asm/pvclock.h | 9 +++++++++
> arch/x86/include/asm/vdso.h | 1 +
> arch/x86/kernel/kvmclock.c | 5 +++++
> 7 files changed, 41 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
> index c325ba1bdddf..5dd363d54348 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void)
> }
> #endif
>
> +#ifdef CONFIG_PARAVIRT_CLOCK
> +extern u8 pvclock_page
> + __attribute__((visibility("hidden")));
> +#endif
> +
> #ifndef BUILD_VDSO32
>
> #include <linux/kernel.h>
> @@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
>
> #ifdef CONFIG_PARAVIRT_CLOCK
>
> -static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> +static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
> {
> - const struct pvclock_vsyscall_time_info *pvti_base;
> - int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
> - int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
> -
> - BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
> -
> - pvti_base = (struct pvclock_vsyscall_time_info *)
> - __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
> -
> - return &pvti_base[offset];
> + return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
> }
>
> static notrace cycle_t vread_pvclock(int *mode)
> {
> - const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> + const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
> cycle_t ret;
> u64 tsc, pvti_tsc;
> u64 last, delta, pvti_system_time;
> diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
> index de2c921025f5..4158acc17df0 100644
> --- a/arch/x86/entry/vdso/vdso-layout.lds.S
> +++ b/arch/x86/entry/vdso/vdso-layout.lds.S
> @@ -25,7 +25,7 @@ SECTIONS
> * segment.
> */
>
> - vvar_start = . - 2 * PAGE_SIZE;
> + vvar_start = . - 3 * PAGE_SIZE;
> vvar_page = vvar_start;
>
> /* Place all vvars at the offsets in asm/vvar.h. */
> @@ -36,6 +36,7 @@ SECTIONS
> #undef EMIT_VVAR
>
> hpet_page = vvar_start + PAGE_SIZE;
> + pvclock_page = vvar_start + 2 * PAGE_SIZE;
>
> . = SIZEOF_HEADERS;
>
> diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
> index 785d9922b106..491020b2826d 100644
> --- a/arch/x86/entry/vdso/vdso2c.c
> +++ b/arch/x86/entry/vdso/vdso2c.c
> @@ -73,6 +73,7 @@ enum {
> sym_vvar_start,
> sym_vvar_page,
> sym_hpet_page,
> + sym_pvclock_page,
> sym_VDSO_FAKE_SECTION_TABLE_START,
> sym_VDSO_FAKE_SECTION_TABLE_END,
> };
> @@ -80,6 +81,7 @@ enum {
> const int special_pages[] = {
> sym_vvar_page,
> sym_hpet_page,
> + sym_pvclock_page,
> };
>
> struct vdso_sym {
> @@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = {
> [sym_vvar_start] = {"vvar_start", true},
> [sym_vvar_page] = {"vvar_page", true},
> [sym_hpet_page] = {"hpet_page", true},
> + [sym_pvclock_page] = {"pvclock_page", true},
> [sym_VDSO_FAKE_SECTION_TABLE_START] = {
> "VDSO_FAKE_SECTION_TABLE_START", false
> },
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index 64df47148160..aa828191c654 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
> .name = "[vvar]",
> .pages = no_pages,
> };
> + struct pvclock_vsyscall_time_info *pvti;
>
> if (calculate_addr) {
> addr = vdso_addr(current->mm->start_stack,
> @@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
> }
> #endif
>
> + pvti = pvclock_pvti_cpu0_va();
> + if (pvti && image->sym_pvclock_page) {
> + ret = remap_pfn_range(vma,
> + text_start + image->sym_pvclock_page,
> + __pa(pvti) >> PAGE_SHIFT,
> + PAGE_SIZE,
> + PAGE_READONLY);
> +
> + if (ret)
> + goto up_fail;
> + }
> +
> up_fail:
> if (ret)
> current->mm->context.vdso = NULL;
> diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
> index 7a6bed5c08bc..3864398c7cb2 100644
> --- a/arch/x86/include/asm/pvclock.h
> +++ b/arch/x86/include/asm/pvclock.h
> @@ -4,6 +4,15 @@
> #include <linux/clocksource.h>
> #include <asm/pvclock-abi.h>
>
> +#ifdef CONFIG_PARAVIRT_CLOCK
> +extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
> +#else
> +static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
> +{
> + return NULL;
> +}
> +#endif
> +
> /* some helper functions for xen and kvm pv clock sources */
> cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
> u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
> diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> index 756de9190aec..deabaf9759b6 100644
> --- a/arch/x86/include/asm/vdso.h
> +++ b/arch/x86/include/asm/vdso.h
> @@ -22,6 +22,7 @@ struct vdso_image {
>
> long sym_vvar_page;
> long sym_hpet_page;
> + long sym_pvclock_page;
> long sym_VDSO32_NOTE_MASK;
> long sym___kernel_sigreturn;
> long sym___kernel_rt_sigreturn;
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 2bd81e302427..ec1b06dc82d2 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -45,6 +45,11 @@ early_param("no-kvmclock", parse_no_kvmclock);
> static struct pvclock_vsyscall_time_info *hv_clock;
> static struct pvclock_wall_clock wall_clock;
>
> +struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
> +{
> + return hv_clock;
> +}
> +
> /*
> * The wallclock is the time of day when we booted. Since then, some time may
> * have elapsed since the hypervisor wrote the data. So we try to account for
>
Acked-by: Paolo Bonzini <[email protected]>
On 10/12/2015 00:12, Andy Lutomirski wrote:
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/entry/vdso/vclock_gettime.c | 1 -
> arch/x86/entry/vdso/vma.c | 1 +
> arch/x86/include/asm/fixmap.h | 5 -----
> arch/x86/include/asm/pvclock.h | 5 -----
> arch/x86/kernel/kvmclock.c | 6 ------
> arch/x86/kernel/pvclock.c | 24 ------------------------
> 6 files changed, 1 insertion(+), 41 deletions(-)
>
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
> index 5dd363d54348..59a98c25bde7 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -45,7 +45,6 @@ extern u8 pvclock_page
>
> #include <linux/kernel.h>
> #include <asm/vsyscall.h>
> -#include <asm/fixmap.h>
> #include <asm/pvclock.h>
>
> notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index aa828191c654..b8f69e264ac4 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -12,6 +12,7 @@
> #include <linux/random.h>
> #include <linux/elf.h>
> #include <linux/cpu.h>
> +#include <asm/pvclock.h>
> #include <asm/vgtod.h>
> #include <asm/proto.h>
> #include <asm/vdso.h>
> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index f80d70009ff8..6d7d0e52ed5a 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -19,7 +19,6 @@
> #include <asm/acpi.h>
> #include <asm/apicdef.h>
> #include <asm/page.h>
> -#include <asm/pvclock.h>
> #ifdef CONFIG_X86_32
> #include <linux/threads.h>
> #include <asm/kmap_types.h>
> @@ -72,10 +71,6 @@ enum fixed_addresses {
> #ifdef CONFIG_X86_VSYSCALL_EMULATION
> VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
> #endif
> -#ifdef CONFIG_PARAVIRT_CLOCK
> - PVCLOCK_FIXMAP_BEGIN,
> - PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
> -#endif
> #endif
> FIX_DBGP_BASE,
> FIX_EARLYCON_MEM_BASE,
> diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
> index 3864398c7cb2..66df22b2e0c9 100644
> --- a/arch/x86/include/asm/pvclock.h
> +++ b/arch/x86/include/asm/pvclock.h
> @@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info {
> } __attribute__((__aligned__(SMP_CACHE_BYTES)));
>
> #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
> -#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
> -
> -int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
> - int size);
> -struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
>
> #endif /* _ASM_X86_PVCLOCK_H */
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index ec1b06dc82d2..72cef58693c7 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
> {
> #ifdef CONFIG_X86_64
> int cpu;
> - int ret;
> u8 flags;
> struct pvclock_vcpu_time_info *vcpu_time;
> unsigned int size;
> @@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
> return 1;
> }
>
> - if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
> - put_cpu();
> - return ret;
> - }
> -
> put_cpu();
>
> kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
> diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
> index 2f355d229a58..99bfc025111d 100644
> --- a/arch/x86/kernel/pvclock.c
> +++ b/arch/x86/kernel/pvclock.c
> @@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
>
> set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
> }
> -
> -#ifdef CONFIG_X86_64
> -/*
> - * Initialize the generic pvclock vsyscall state. This will allocate
> - * a/some page(s) for the per-vcpu pvclock information, set up a
> - * fixmap mapping for the page(s)
> - */
> -
> -int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
> - int size)
> -{
> - int idx;
> -
> - WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
> -
> - for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
> - __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
> - __pa(i) + (idx*PAGE_SIZE),
> - PAGE_KERNEL_VVAR);
> - }
> -
> - return 0;
> -}
> -#endif
>
Acked-by: Paolo Bonzini <[email protected]>
On 10/12/2015 00:12, Andy Lutomirski wrote:
> Now that pvclock doesn't require access to the fixmap, all vdso
> variants can use it.
>
> The kernel side isn't wired up for 32-bit kernels yet, but this
> covers 32-bit and x32 userspace on 64-bit kernels.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/entry/vdso/vclock_gettime.c | 91 ++++++++++++++++--------------------
> 1 file changed, 40 insertions(+), 51 deletions(-)
>
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
> index 59a98c25bde7..8602f06c759f 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -17,8 +17,10 @@
> #include <asm/vvar.h>
> #include <asm/unistd.h>
> #include <asm/msr.h>
> +#include <asm/pvclock.h>
> #include <linux/math64.h>
> #include <linux/time.h>
> +#include <linux/kernel.h>
>
> #define gtod (&VVAR(vsyscall_gtod_data))
>
> @@ -43,10 +45,6 @@ extern u8 pvclock_page
>
> #ifndef BUILD_VDSO32
>
> -#include <linux/kernel.h>
> -#include <asm/vsyscall.h>
> -#include <asm/pvclock.h>
> -
> notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
> {
> long ret;
> @@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
> return ret;
> }
>
> -#ifdef CONFIG_PARAVIRT_CLOCK
>
> +#else
> +
> +notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
> +{
> + long ret;
> +
> + asm(
> + "mov %%ebx, %%edx \n"
> + "mov %2, %%ebx \n"
> + "call __kernel_vsyscall \n"
> + "mov %%edx, %%ebx \n"
> + : "=a" (ret)
> + : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
> + : "memory", "edx");
> + return ret;
> +}
> +
> +notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
> +{
> + long ret;
> +
> + asm(
> + "mov %%ebx, %%edx \n"
> + "mov %2, %%ebx \n"
> + "call __kernel_vsyscall \n"
> + "mov %%edx, %%ebx \n"
> + : "=a" (ret)
> + : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
> + : "memory", "edx");
> + return ret;
> +}
> +
> +#endif
> +
> +#ifdef CONFIG_PARAVIRT_CLOCK
> static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
> {
> return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
> @@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode)
> do {
> version = pvti->version;
>
> - /* This is also a read barrier, so we'll read version first. */
> - tsc = rdtsc_ordered();
> + smp_rmb();
>
> + tsc = rdtsc_ordered();
> pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> pvti_tsc_shift = pvti->tsc_shift;
> pvti_system_time = pvti->system_time;
> @@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode)
> pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> pvti_tsc_shift);
>
> - /* refer to tsc.c read_tsc() comment for rationale */
> + /* refer to vread_tsc() comment for rationale */
> last = gtod->cycle_last;
>
> if (likely(ret >= last))
> @@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode)
> }
> #endif
>
> -#else
> -
> -notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
> -{
> - long ret;
> -
> - asm(
> - "mov %%ebx, %%edx \n"
> - "mov %2, %%ebx \n"
> - "call __kernel_vsyscall \n"
> - "mov %%edx, %%ebx \n"
> - : "=a" (ret)
> - : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
> - : "memory", "edx");
> - return ret;
> -}
> -
> -notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
> -{
> - long ret;
> -
> - asm(
> - "mov %%ebx, %%edx \n"
> - "mov %2, %%ebx \n"
> - "call __kernel_vsyscall \n"
> - "mov %%edx, %%ebx \n"
> - : "=a" (ret)
> - : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
> - : "memory", "edx");
> - return ret;
> -}
> -
> -#ifdef CONFIG_PARAVIRT_CLOCK
> -
> -static notrace cycle_t vread_pvclock(int *mode)
> -{
> - *mode = VCLOCK_NONE;
> - return 0;
> -}
> -#endif
> -
> -#endif
> -
> notrace static cycle_t vread_tsc(void)
> {
> cycle_t ret = (cycle_t)rdtsc_ordered();
>
Acked-by: Paolo Bonzini <[email protected]>
On Thu, Dec 10, 2015 at 7:20 PM, Andy Lutomirski <[email protected]> wrote:
> NB: patch 1 doesn't really belong here, but it makes this a lot
Ugh, please disregard the resend. I typoed my git send-email command slightly.
--Andy
* Paolo Bonzini <[email protected]> wrote:
>
>
> On 10/12/2015 00:12, Andy Lutomirski wrote:
> > From: Andy Lutomirski <[email protected]>
> >
> > The pvclock vdso code was too abstracted to understand easily and
> > excessively paranoid. Simplify it for a huge speedup.
> >
> > This opens the door for additional simplifications, as the vdso no
> > longer accesses the pvti for any vcpu other than vcpu 0.
> >
> > Before, vclock_gettime using kvm-clock took about 45ns on my machine.
> > With this change, it takes 29ns, which is almost as fast as the pure TSC
> > implementation.
> >
> > Signed-off-by: Andy Lutomirski <[email protected]>
> > ---
> > arch/x86/entry/vdso/vclock_gettime.c | 81 ++++++++++++++++++++----------------
> > 1 file changed, 46 insertions(+), 35 deletions(-)
> >
> > diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
> > index ca94fa649251..c325ba1bdddf 100644
> > --- a/arch/x86/entry/vdso/vclock_gettime.c
> > +++ b/arch/x86/entry/vdso/vclock_gettime.c
> > @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> >
> > static notrace cycle_t vread_pvclock(int *mode)
> > {
> > - const struct pvclock_vsyscall_time_info *pvti;
> > + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> > cycle_t ret;
> > - u64 last;
> > - u32 version;
> > - u8 flags;
> > - unsigned cpu, cpu1;
> > -
> > + u64 tsc, pvti_tsc;
> > + u64 last, delta, pvti_system_time;
> > + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
> >
> > /*
> > - * Note: hypervisor must guarantee that:
> > - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> > - * 2. that per-CPU pvclock time info is updated if the
> > - * underlying CPU changes.
> > - * 3. that version is increased whenever underlying CPU
> > - * changes.
> > + * Note: The kernel and hypervisor must guarantee that cpu ID
> > + * number maps 1:1 to per-CPU pvclock time info.
> > + *
> > + * Because the hypervisor is entirely unaware of guest userspace
> > + * preemption, it cannot guarantee that per-CPU pvclock time
> > + * info is updated if the underlying CPU changes or that that
> > + * version is increased whenever underlying CPU changes.
> > *
> > + * On KVM, we are guaranteed that pvti updates for any vCPU are
> > + * atomic as seen by *all* vCPUs. This is an even stronger
> > + * guarantee than we get with a normal seqlock.
> > + *
> > + * On Xen, we don't appear to have that guarantee, but Xen still
> > + * supplies a valid seqlock using the version field.
> > +
> > + * We only do pvclock vdso timing at all if
> > + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> > + * mean that all vCPUs have matching pvti and that the TSC is
> > + * synced, so we can just look at vCPU 0's pvti.
> > */
> > - do {
> > - cpu = __getcpu() & VGETCPU_CPU_MASK;
> > - /* TODO: We can put vcpu id into higher bits of pvti.version.
> > - * This will save a couple of cycles by getting rid of
> > - * __getcpu() calls (Gleb).
> > - */
> > -
> > - pvti = get_pvti(cpu);
> > -
> > - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> > -
> > - /*
> > - * Test we're still on the cpu as well as the version.
> > - * We could have been migrated just after the first
> > - * vgetcpu but before fetching the version, so we
> > - * wouldn't notice a version change.
> > - */
> > - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> > - } while (unlikely(cpu != cpu1 ||
> > - (pvti->pvti.version & 1) ||
> > - pvti->pvti.version != version));
> > -
> > - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> > +
> > + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> > *mode = VCLOCK_NONE;
> > + return 0;
> > + }
> > +
> > + do {
> > + version = pvti->version;
> > +
> > + /* This is also a read barrier, so we'll read version first. */
> > + tsc = rdtsc_ordered();
> > +
> > + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> > + pvti_tsc_shift = pvti->tsc_shift;
> > + pvti_system_time = pvti->system_time;
> > + pvti_tsc = pvti->tsc_timestamp;
> > +
> > + /* Make sure that the version double-check is last. */
> > + smp_rmb();
> > + } while (unlikely((version & 1) || version != pvti->version));
> > +
> > + delta = tsc - pvti_tsc;
> > + ret = pvti_system_time +
> > + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> > + pvti_tsc_shift);
> >
> > /* refer to tsc.c read_tsc() comment for rationale */
> > last = gtod->cycle_last;
> >
>
> Reviewed-by: Paolo Bonzini <[email protected]>
Thanks. I've added your Reviewed-by to the 1/5 patch as well - to be able to put
the whole series into the tip:x86/entry tree. Let me know if you'd like it to be
done differently.
Thanks,
Ingo
* Andy Lutomirski <[email protected]> wrote:
> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index f80d70009ff8..6d7d0e52ed5a 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -19,7 +19,6 @@
> #include <asm/acpi.h>
> #include <asm/apicdef.h>
> #include <asm/page.h>
> -#include <asm/pvclock.h>
> #ifdef CONFIG_X86_32
> #include <linux/threads.h>
> #include <asm/kmap_types.h>
So this change triggered a build failure on 64-bit allmodconfig - fixed via the
patch below. Your change unearthed a latent bug, a missing header inclusion.
Thanks,
Ingo
============>
>From d51953b0873358d13b189996e6976dfa12a9b59d Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Fri, 11 Dec 2015 09:01:30 +0100
Subject: [PATCH] x86/platform/uv: Include clocksource.h for clocksource_touch_watchdog()
This build failure triggers on 64-bit allmodconfig:
arch/x86/platform/uv/uv_nmi.c:493:2: error: implicit declaration of function ‘clocksource_touch_watchdog’ [-Werror=implicit-function-declaration]
which is caused by recent changes exposing a missing clocksource.h include
in uv_nmi.c:
cc1e24fdb064 x86/vdso: Remove pvclock fixmap machinery
this file got clocksource.h indirectly via fixmap.h - that stealth route
of header inclusion is now gone.
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/platform/uv/uv_nmi.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/platform/uv/uv_nmi.c b/arch/x86/platform/uv/uv_nmi.c
index 327f21c3bde1..8dd80050d705 100644
--- a/arch/x86/platform/uv/uv_nmi.c
+++ b/arch/x86/platform/uv/uv_nmi.c
@@ -28,6 +28,7 @@
#include <linux/nmi.h>
#include <linux/sched.h>
#include <linux/slab.h>
+#include <linux/clocksource.h>
#include <asm/apic.h>
#include <asm/current.h>
On 11/12/2015 08:52, Ingo Molnar wrote:
>
> * Paolo Bonzini <[email protected]> wrote:
>
>>
>>
>> On 10/12/2015 00:12, Andy Lutomirski wrote:
>>> From: Andy Lutomirski <[email protected]>
>>>
>>> The pvclock vdso code was too abstracted to understand easily and
>>> excessively paranoid. Simplify it for a huge speedup.
>>>
>>> This opens the door for additional simplifications, as the vdso no
>>> longer accesses the pvti for any vcpu other than vcpu 0.
>>>
>>> Before, vclock_gettime using kvm-clock took about 45ns on my machine.
>>> With this change, it takes 29ns, which is almost as fast as the pure TSC
>>> implementation.
>>>
>>> Signed-off-by: Andy Lutomirski <[email protected]>
>>> ---
>>> arch/x86/entry/vdso/vclock_gettime.c | 81 ++++++++++++++++++++----------------
>>> 1 file changed, 46 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
>>> index ca94fa649251..c325ba1bdddf 100644
>>> --- a/arch/x86/entry/vdso/vclock_gettime.c
>>> +++ b/arch/x86/entry/vdso/vclock_gettime.c
>>> @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>>>
>>> static notrace cycle_t vread_pvclock(int *mode)
>>> {
>>> - const struct pvclock_vsyscall_time_info *pvti;
>>> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>>> cycle_t ret;
>>> - u64 last;
>>> - u32 version;
>>> - u8 flags;
>>> - unsigned cpu, cpu1;
>>> -
>>> + u64 tsc, pvti_tsc;
>>> + u64 last, delta, pvti_system_time;
>>> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>>>
>>> /*
>>> - * Note: hypervisor must guarantee that:
>>> - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>>> - * 2. that per-CPU pvclock time info is updated if the
>>> - * underlying CPU changes.
>>> - * 3. that version is increased whenever underlying CPU
>>> - * changes.
>>> + * Note: The kernel and hypervisor must guarantee that cpu ID
>>> + * number maps 1:1 to per-CPU pvclock time info.
>>> + *
>>> + * Because the hypervisor is entirely unaware of guest userspace
>>> + * preemption, it cannot guarantee that per-CPU pvclock time
>>> + * info is updated if the underlying CPU changes or that that
>>> + * version is increased whenever underlying CPU changes.
>>> *
>>> + * On KVM, we are guaranteed that pvti updates for any vCPU are
>>> + * atomic as seen by *all* vCPUs. This is an even stronger
>>> + * guarantee than we get with a normal seqlock.
>>> + *
>>> + * On Xen, we don't appear to have that guarantee, but Xen still
>>> + * supplies a valid seqlock using the version field.
>>> +
>>> + * We only do pvclock vdso timing at all if
>>> + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>>> + * mean that all vCPUs have matching pvti and that the TSC is
>>> + * synced, so we can just look at vCPU 0's pvti.
>>> */
>>> - do {
>>> - cpu = __getcpu() & VGETCPU_CPU_MASK;
>>> - /* TODO: We can put vcpu id into higher bits of pvti.version.
>>> - * This will save a couple of cycles by getting rid of
>>> - * __getcpu() calls (Gleb).
>>> - */
>>> -
>>> - pvti = get_pvti(cpu);
>>> -
>>> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>>> -
>>> - /*
>>> - * Test we're still on the cpu as well as the version.
>>> - * We could have been migrated just after the first
>>> - * vgetcpu but before fetching the version, so we
>>> - * wouldn't notice a version change.
>>> - */
>>> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>>> - } while (unlikely(cpu != cpu1 ||
>>> - (pvti->pvti.version & 1) ||
>>> - pvti->pvti.version != version));
>>> -
>>> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>>> +
>>> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>>> *mode = VCLOCK_NONE;
>>> + return 0;
>>> + }
>>> +
>>> + do {
>>> + version = pvti->version;
>>> +
>>> + /* This is also a read barrier, so we'll read version first. */
>>> + tsc = rdtsc_ordered();
>>> +
>>> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
>>> + pvti_tsc_shift = pvti->tsc_shift;
>>> + pvti_system_time = pvti->system_time;
>>> + pvti_tsc = pvti->tsc_timestamp;
>>> +
>>> + /* Make sure that the version double-check is last. */
>>> + smp_rmb();
>>> + } while (unlikely((version & 1) || version != pvti->version));
>>> +
>>> + delta = tsc - pvti_tsc;
>>> + ret = pvti_system_time +
>>> + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
>>> + pvti_tsc_shift);
>>>
>>> /* refer to tsc.c read_tsc() comment for rationale */
>>> last = gtod->cycle_last;
>>>
>>
>> Reviewed-by: Paolo Bonzini <[email protected]>
>
> Thanks. I've added your Reviewed-by to the 1/5 patch as well - to be able to put
> the whole series into the tip:x86/entry tree. Let me know if you'd like it to be
> done differently.
The 1/5 patch is entirely in KVM and is not necessary for the rest of
the series to work. I would like it to be separate, because Marcelo has
not yet chimed in to say why it was necessary.
Can you just apply patches 2-5?
Paolo
On Fri, Dec 11, 2015 at 12:06 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
>> index f80d70009ff8..6d7d0e52ed5a 100644
>> --- a/arch/x86/include/asm/fixmap.h
>> +++ b/arch/x86/include/asm/fixmap.h
>> @@ -19,7 +19,6 @@
>> #include <asm/acpi.h>
>> #include <asm/apicdef.h>
>> #include <asm/page.h>
>> -#include <asm/pvclock.h>
>> #ifdef CONFIG_X86_32
>> #include <linux/threads.h>
>> #include <asm/kmap_types.h>
>
> So this change triggered a build failure on 64-bit allmodconfig - fixed via the
> patch below. Your change unearthed a latent bug, a missing header inclusion.
>
> Thanks,
>
> Ingo
>
> ============>
> From d51953b0873358d13b189996e6976dfa12a9b59d Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <[email protected]>
> Date: Fri, 11 Dec 2015 09:01:30 +0100
> Subject: [PATCH] x86/platform/uv: Include clocksource.h for clocksource_touch_watchdog()
>
> This build failure triggers on 64-bit allmodconfig:
>
> arch/x86/platform/uv/uv_nmi.c:493:2: error: implicit declaration of function ‘clocksource_touch_watchdog’ [-Werror=implicit-function-declaration]
>
> which is caused by recent changes exposing a missing clocksource.h include
> in uv_nmi.c:
>
> cc1e24fdb064 x86/vdso: Remove pvclock fixmap machinery
>
> this file got clocksource.h indirectly via fixmap.h - that stealth route
> of header inclusion is now gone.
>
> Cc: Borislav Petkov <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
LGTM.
--Andy
On Fri, Dec 11, 2015 at 12:42 AM, Paolo Bonzini <[email protected]> wrote:
>
>
> On 11/12/2015 08:52, Ingo Molnar wrote:
>>
>> * Paolo Bonzini <[email protected]> wrote:
>>>
>>> Reviewed-by: Paolo Bonzini <[email protected]>
>>
>> Thanks. I've added your Reviewed-by to the 1/5 patch as well - to be able to put
>> the whole series into the tip:x86/entry tree. Let me know if you'd like it to be
>> done differently.
>
> The 1/5 patch is entirely in KVM and is not necessary for the rest of
> the series to work. I would like it to be separate, because Marcelo has
> not yet chimed in to say why it was necessary.
>
> Can you just apply patches 2-5?
Yes, please. I don't grok the clock update mechanism in the KVM host
well enough to be sure that patch 1 is actually correct. All I know
is that it works better on my laptop with the patch than without the
patch and that it seems at least conceptually correct.
In any event, patch 1 is a host patch and 2-5 are guest patches, and
they only interact to the extent that it's hard for me to test 2-5 on
the guest without patch 1 on the host because without patch 1 my
laptop's host kernel tends to disable stable kvmclock, thus disabling
the entire mechanism in the guest.
--Andy
Commit-ID: 677a73a9aa5433ea728200c26a7b3506d5eaa92b
Gitweb: http://git.kernel.org/tip/677a73a9aa5433ea728200c26a7b3506d5eaa92b
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 10 Dec 2015 19:20:18 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 11 Dec 2015 08:56:02 +0100
x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
This gets rid of the "did TSC go backwards" logic and just
updates all clocks. It should work better (no more disabling of
fast timing) and more reliably (all of the clocks are actually
updated).
Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/861716d768a1da6d1fd257b7972f8df13baf7f85.1449702533.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kvm/x86.c | 75 +++---------------------------------------------------
1 file changed, 3 insertions(+), 72 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00462bd..6e32e87 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
unsigned int __read_mostly lapic_timer_advance_ns = 0;
module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
-static bool __read_mostly backwards_tsc_observed = false;
-
#define KVM_NR_SHARED_MSRS 16
struct kvm_shared_msrs_global {
@@ -1671,7 +1669,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
&ka->master_cycle_now);
ka->use_master_clock = host_tsc_clocksource && vcpus_matched
- && !backwards_tsc_observed
&& !ka->boot_vcpu_runs_old_kvmclock;
if (ka->use_master_clock)
@@ -7366,88 +7363,22 @@ int kvm_arch_hardware_enable(void)
struct kvm_vcpu *vcpu;
int i;
int ret;
- u64 local_tsc;
- u64 max_tsc = 0;
- bool stable, backwards_tsc = false;
kvm_shared_msr_cpu_online();
ret = kvm_x86_ops->hardware_enable();
if (ret != 0)
return ret;
- local_tsc = rdtsc();
- stable = !check_tsc_unstable();
list_for_each_entry(kvm, &vm_list, vm_list) {
kvm_for_each_vcpu(i, vcpu, kvm) {
- if (!stable && vcpu->cpu == smp_processor_id())
+ if (vcpu->cpu == smp_processor_id()) {
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
- if (stable && vcpu->arch.last_host_tsc > local_tsc) {
- backwards_tsc = true;
- if (vcpu->arch.last_host_tsc > max_tsc)
- max_tsc = vcpu->arch.last_host_tsc;
+ kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
+ vcpu);
}
}
}
- /*
- * Sometimes, even reliable TSCs go backwards. This happens on
- * platforms that reset TSC during suspend or hibernate actions, but
- * maintain synchronization. We must compensate. Fortunately, we can
- * detect that condition here, which happens early in CPU bringup,
- * before any KVM threads can be running. Unfortunately, we can't
- * bring the TSCs fully up to date with real time, as we aren't yet far
- * enough into CPU bringup that we know how much real time has actually
- * elapsed; our helper function, get_kernel_ns() will be using boot
- * variables that haven't been updated yet.
- *
- * So we simply find the maximum observed TSC above, then record the
- * adjustment to TSC in each VCPU. When the VCPU later gets loaded,
- * the adjustment will be applied. Note that we accumulate
- * adjustments, in case multiple suspend cycles happen before some VCPU
- * gets a chance to run again. In the event that no KVM threads get a
- * chance to run, we will miss the entire elapsed period, as we'll have
- * reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
- * loose cycle time. This isn't too big a deal, since the loss will be
- * uniform across all VCPUs (not to mention the scenario is extremely
- * unlikely). It is possible that a second hibernate recovery happens
- * much faster than a first, causing the observed TSC here to be
- * smaller; this would require additional padding adjustment, which is
- * why we set last_host_tsc to the local tsc observed here.
- *
- * N.B. - this code below runs only on platforms with reliable TSC,
- * as that is the only way backwards_tsc is set above. Also note
- * that this runs for ALL vcpus, which is not a bug; all VCPUs should
- * have the same delta_cyc adjustment applied if backwards_tsc
- * is detected. Note further, this adjustment is only done once,
- * as we reset last_host_tsc on all VCPUs to stop this from being
- * called multiple times (one for each physical CPU bringup).
- *
- * Platforms with unreliable TSCs don't have to deal with this, they
- * will be compensated by the logic in vcpu_load, which sets the TSC to
- * catchup mode. This will catchup all VCPUs to real time, but cannot
- * guarantee that they stay in perfect synchronization.
- */
- if (backwards_tsc) {
- u64 delta_cyc = max_tsc - local_tsc;
- backwards_tsc_observed = true;
- list_for_each_entry(kvm, &vm_list, vm_list) {
- kvm_for_each_vcpu(i, vcpu, kvm) {
- vcpu->arch.tsc_offset_adjustment += delta_cyc;
- vcpu->arch.last_host_tsc = local_tsc;
- kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
- }
-
- /*
- * We have to disable TSC offset matching.. if you were
- * booting a VM while issuing an S4 host suspend....
- * you may have some problem. Solving this issue is
- * left as an exercise to the reader.
- */
- kvm->arch.last_tsc_nsec = 0;
- kvm->arch.last_tsc_write = 0;
- }
-
- }
return 0;
}
Commit-ID: 6b078f5de7fc0851af4102493c7b5bb07e49c4cb
Gitweb: http://git.kernel.org/tip/6b078f5de7fc0851af4102493c7b5bb07e49c4cb
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 10 Dec 2015 19:20:19 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 11 Dec 2015 08:56:02 +0100
x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
The pvclock vdso code was too abstracted to understand easily
and excessively paranoid. Simplify it for a huge speedup.
This opens the door for additional simplifications, as the vdso
no longer accesses the pvti for any vcpu other than vcpu 0.
Before, vclock_gettime using kvm-clock took about 45ns on my
machine. With this change, it takes 29ns, which is almost as
fast as the pure TSC implementation.
Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/6b51dcc41f1b101f963945c5ec7093d72bdac429.1449702533.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 81 ++++++++++++++++++++----------------
1 file changed, 46 insertions(+), 35 deletions(-)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index ca94fa6..c325ba1 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
static notrace cycle_t vread_pvclock(int *mode)
{
- const struct pvclock_vsyscall_time_info *pvti;
+ const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
cycle_t ret;
- u64 last;
- u32 version;
- u8 flags;
- unsigned cpu, cpu1;
-
+ u64 tsc, pvti_tsc;
+ u64 last, delta, pvti_system_time;
+ u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
/*
- * Note: hypervisor must guarantee that:
- * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
- * 2. that per-CPU pvclock time info is updated if the
- * underlying CPU changes.
- * 3. that version is increased whenever underlying CPU
- * changes.
+ * Note: The kernel and hypervisor must guarantee that cpu ID
+ * number maps 1:1 to per-CPU pvclock time info.
+ *
+ * Because the hypervisor is entirely unaware of guest userspace
+ * preemption, it cannot guarantee that per-CPU pvclock time
+ * info is updated if the underlying CPU changes or that that
+ * version is increased whenever underlying CPU changes.
*
+ * On KVM, we are guaranteed that pvti updates for any vCPU are
+ * atomic as seen by *all* vCPUs. This is an even stronger
+ * guarantee than we get with a normal seqlock.
+ *
+ * On Xen, we don't appear to have that guarantee, but Xen still
+ * supplies a valid seqlock using the version field.
+
+ * We only do pvclock vdso timing at all if
+ * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
+ * mean that all vCPUs have matching pvti and that the TSC is
+ * synced, so we can just look at vCPU 0's pvti.
*/
- do {
- cpu = __getcpu() & VGETCPU_CPU_MASK;
- /* TODO: We can put vcpu id into higher bits of pvti.version.
- * This will save a couple of cycles by getting rid of
- * __getcpu() calls (Gleb).
- */
-
- pvti = get_pvti(cpu);
-
- version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
-
- /*
- * Test we're still on the cpu as well as the version.
- * We could have been migrated just after the first
- * vgetcpu but before fetching the version, so we
- * wouldn't notice a version change.
- */
- cpu1 = __getcpu() & VGETCPU_CPU_MASK;
- } while (unlikely(cpu != cpu1 ||
- (pvti->pvti.version & 1) ||
- pvti->pvti.version != version));
-
- if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+
+ if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
+ return 0;
+ }
+
+ do {
+ version = pvti->version;
+
+ /* This is also a read barrier, so we'll read version first. */
+ tsc = rdtsc_ordered();
+
+ pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
+ pvti_tsc_shift = pvti->tsc_shift;
+ pvti_system_time = pvti->system_time;
+ pvti_tsc = pvti->tsc_timestamp;
+
+ /* Make sure that the version double-check is last. */
+ smp_rmb();
+ } while (unlikely((version & 1) || version != pvti->version));
+
+ delta = tsc - pvti_tsc;
+ ret = pvti_system_time +
+ pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
+ pvti_tsc_shift);
/* refer to tsc.c read_tsc() comment for rationale */
last = gtod->cycle_last;
Commit-ID: dac16fba6fc590fa7239676b35ed75dae4c4cd2b
Gitweb: http://git.kernel.org/tip/dac16fba6fc590fa7239676b35ed75dae4c4cd2b
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 10 Dec 2015 19:20:20 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 11 Dec 2015 08:56:03 +0100
x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/9d37826fdc7e2d2809efe31d5345f97186859284.1449702533.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 20 ++++++++------------
arch/x86/entry/vdso/vdso-layout.lds.S | 3 ++-
arch/x86/entry/vdso/vdso2c.c | 3 +++
arch/x86/entry/vdso/vma.c | 13 +++++++++++++
arch/x86/include/asm/pvclock.h | 9 +++++++++
arch/x86/include/asm/vdso.h | 1 +
arch/x86/kernel/kvmclock.c | 5 +++++
7 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index c325ba1..5dd363d 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void)
}
#endif
+#ifdef CONFIG_PARAVIRT_CLOCK
+extern u8 pvclock_page
+ __attribute__((visibility("hidden")));
+#endif
+
#ifndef BUILD_VDSO32
#include <linux/kernel.h>
@@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
#ifdef CONFIG_PARAVIRT_CLOCK
-static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
{
- const struct pvclock_vsyscall_time_info *pvti_base;
- int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
- int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
-
- BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
-
- pvti_base = (struct pvclock_vsyscall_time_info *)
- __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
-
- return &pvti_base[offset];
+ return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
}
static notrace cycle_t vread_pvclock(int *mode)
{
- const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
+ const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
cycle_t ret;
u64 tsc, pvti_tsc;
u64 last, delta, pvti_system_time;
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index de2c921..4158acc 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -25,7 +25,7 @@ SECTIONS
* segment.
*/
- vvar_start = . - 2 * PAGE_SIZE;
+ vvar_start = . - 3 * PAGE_SIZE;
vvar_page = vvar_start;
/* Place all vvars at the offsets in asm/vvar.h. */
@@ -36,6 +36,7 @@ SECTIONS
#undef EMIT_VVAR
hpet_page = vvar_start + PAGE_SIZE;
+ pvclock_page = vvar_start + 2 * PAGE_SIZE;
. = SIZEOF_HEADERS;
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 785d992..491020b 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -73,6 +73,7 @@ enum {
sym_vvar_start,
sym_vvar_page,
sym_hpet_page,
+ sym_pvclock_page,
sym_VDSO_FAKE_SECTION_TABLE_START,
sym_VDSO_FAKE_SECTION_TABLE_END,
};
@@ -80,6 +81,7 @@ enum {
const int special_pages[] = {
sym_vvar_page,
sym_hpet_page,
+ sym_pvclock_page,
};
struct vdso_sym {
@@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = {
[sym_vvar_start] = {"vvar_start", true},
[sym_vvar_page] = {"vvar_page", true},
[sym_hpet_page] = {"hpet_page", true},
+ [sym_pvclock_page] = {"pvclock_page", true},
[sym_VDSO_FAKE_SECTION_TABLE_START] = {
"VDSO_FAKE_SECTION_TABLE_START", false
},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 64df471..aa82819 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
.name = "[vvar]",
.pages = no_pages,
};
+ struct pvclock_vsyscall_time_info *pvti;
if (calculate_addr) {
addr = vdso_addr(current->mm->start_stack,
@@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
}
#endif
+ pvti = pvclock_pvti_cpu0_va();
+ if (pvti && image->sym_pvclock_page) {
+ ret = remap_pfn_range(vma,
+ text_start + image->sym_pvclock_page,
+ __pa(pvti) >> PAGE_SHIFT,
+ PAGE_SIZE,
+ PAGE_READONLY);
+
+ if (ret)
+ goto up_fail;
+ }
+
up_fail:
if (ret)
current->mm->context.vdso = NULL;
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 7a6bed5..3864398 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,6 +4,15 @@
#include <linux/clocksource.h>
#include <asm/pvclock-abi.h>
+#ifdef CONFIG_PARAVIRT_CLOCK
+extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+ return NULL;
+}
+#endif
+
/* some helper functions for xen and kvm pv clock sources */
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 756de91..deabaf9 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -22,6 +22,7 @@ struct vdso_image {
long sym_vvar_page;
long sym_hpet_page;
+ long sym_pvclock_page;
long sym_VDSO32_NOTE_MASK;
long sym___kernel_sigreturn;
long sym___kernel_rt_sigreturn;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 2bd81e3..ec1b06d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -45,6 +45,11 @@ early_param("no-kvmclock", parse_no_kvmclock);
static struct pvclock_vsyscall_time_info *hv_clock;
static struct pvclock_wall_clock wall_clock;
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+ return hv_clock;
+}
+
/*
* The wallclock is the time of day when we booted. Since then, some time may
* have elapsed since the hypervisor wrote the data. So we try to account for
Commit-ID: cc1e24fdb064d3126a494716f22ad4fc39306742
Gitweb: http://git.kernel.org/tip/cc1e24fdb064d3126a494716f22ad4fc39306742
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 10 Dec 2015 19:20:21 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 11 Dec 2015 08:56:03 +0100
x86/vdso: Remove pvclock fixmap machinery
Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/4933029991103ae44672c82b97a20035f5c1fe4f.1449702533.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 1 -
arch/x86/entry/vdso/vma.c | 1 +
arch/x86/include/asm/fixmap.h | 5 -----
arch/x86/include/asm/pvclock.h | 5 -----
arch/x86/kernel/kvmclock.c | 6 ------
arch/x86/kernel/pvclock.c | 24 ------------------------
6 files changed, 1 insertion(+), 41 deletions(-)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 5dd363d..59a98c2 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -45,7 +45,6 @@ extern u8 pvclock_page
#include <linux/kernel.h>
#include <asm/vsyscall.h>
-#include <asm/fixmap.h>
#include <asm/pvclock.h>
notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index aa82819..b8f69e2 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -12,6 +12,7 @@
#include <linux/random.h>
#include <linux/elf.h>
#include <linux/cpu.h>
+#include <asm/pvclock.h>
#include <asm/vgtod.h>
#include <asm/proto.h>
#include <asm/vdso.h>
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index f80d700..6d7d0e5 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -19,7 +19,6 @@
#include <asm/acpi.h>
#include <asm/apicdef.h>
#include <asm/page.h>
-#include <asm/pvclock.h>
#ifdef CONFIG_X86_32
#include <linux/threads.h>
#include <asm/kmap_types.h>
@@ -72,10 +71,6 @@ enum fixed_addresses {
#ifdef CONFIG_X86_VSYSCALL_EMULATION
VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
#endif
-#ifdef CONFIG_PARAVIRT_CLOCK
- PVCLOCK_FIXMAP_BEGIN,
- PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
-#endif
#endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 3864398..66df22b 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info {
} __attribute__((__aligned__(SMP_CACHE_BYTES)));
#define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
-#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
- int size);
-struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
#endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ec1b06d..72cef58 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
{
#ifdef CONFIG_X86_64
int cpu;
- int ret;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
unsigned int size;
@@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
return 1;
}
- if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
- put_cpu();
- return ret;
- }
-
put_cpu();
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 2f355d2..99bfc02 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
}
-
-#ifdef CONFIG_X86_64
-/*
- * Initialize the generic pvclock vsyscall state. This will allocate
- * a/some page(s) for the per-vcpu pvclock information, set up a
- * fixmap mapping for the page(s)
- */
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
- int size)
-{
- int idx;
-
- WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
-
- for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
- __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
- __pa(i) + (idx*PAGE_SIZE),
- PAGE_KERNEL_VVAR);
- }
-
- return 0;
-}
-#endif
Commit-ID: 76480a6a55a03d0fe5dd6290ccde7f78678ab85e
Gitweb: http://git.kernel.org/tip/76480a6a55a03d0fe5dd6290ccde7f78678ab85e
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 10 Dec 2015 19:20:22 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 11 Dec 2015 08:56:03 +0100
x86/vdso: Enable vdso pvclock access on all vdso variants
Now that pvclock doesn't require access to the fixmap, all vdso
variants can use it.
The kernel side isn't wired up for 32-bit kernels yet, but this
covers 32-bit and x32 userspace on 64-bit kernels.
Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/a7ef693b7a4c88dd2173dc1d4bf6bc27023626eb.1449702533.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 91 ++++++++++++++++--------------------
1 file changed, 40 insertions(+), 51 deletions(-)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 59a98c2..8602f06 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -17,8 +17,10 @@
#include <asm/vvar.h>
#include <asm/unistd.h>
#include <asm/msr.h>
+#include <asm/pvclock.h>
#include <linux/math64.h>
#include <linux/time.h>
+#include <linux/kernel.h>
#define gtod (&VVAR(vsyscall_gtod_data))
@@ -43,10 +45,6 @@ extern u8 pvclock_page
#ifndef BUILD_VDSO32
-#include <linux/kernel.h>
-#include <asm/vsyscall.h>
-#include <asm/pvclock.h>
-
notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
{
long ret;
@@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
return ret;
}
-#ifdef CONFIG_PARAVIRT_CLOCK
+#else
+
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+{
+ long ret;
+
+ asm(
+ "mov %%ebx, %%edx \n"
+ "mov %2, %%ebx \n"
+ "call __kernel_vsyscall \n"
+ "mov %%edx, %%ebx \n"
+ : "=a" (ret)
+ : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
+ : "memory", "edx");
+ return ret;
+}
+
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+ long ret;
+
+ asm(
+ "mov %%ebx, %%edx \n"
+ "mov %2, %%ebx \n"
+ "call __kernel_vsyscall \n"
+ "mov %%edx, %%ebx \n"
+ : "=a" (ret)
+ : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
+ : "memory", "edx");
+ return ret;
+}
+
+#endif
+
+#ifdef CONFIG_PARAVIRT_CLOCK
static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
{
return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
@@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode)
do {
version = pvti->version;
- /* This is also a read barrier, so we'll read version first. */
- tsc = rdtsc_ordered();
+ smp_rmb();
+ tsc = rdtsc_ordered();
pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
pvti_tsc_shift = pvti->tsc_shift;
pvti_system_time = pvti->system_time;
@@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode)
pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
pvti_tsc_shift);
- /* refer to tsc.c read_tsc() comment for rationale */
+ /* refer to vread_tsc() comment for rationale */
last = gtod->cycle_last;
if (likely(ret >= last))
@@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode)
}
#endif
-#else
-
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
-{
- long ret;
-
- asm(
- "mov %%ebx, %%edx \n"
- "mov %2, %%ebx \n"
- "call __kernel_vsyscall \n"
- "mov %%edx, %%ebx \n"
- : "=a" (ret)
- : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
- : "memory", "edx");
- return ret;
-}
-
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
- long ret;
-
- asm(
- "mov %%ebx, %%edx \n"
- "mov %2, %%ebx \n"
- "call __kernel_vsyscall \n"
- "mov %%edx, %%ebx \n"
- : "=a" (ret)
- : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
- : "memory", "edx");
- return ret;
-}
-
-#ifdef CONFIG_PARAVIRT_CLOCK
-
-static notrace cycle_t vread_pvclock(int *mode)
-{
- *mode = VCLOCK_NONE;
- return 0;
-}
-#endif
-
-#endif
-
notrace static cycle_t vread_tsc(void)
{
cycle_t ret = (cycle_t)rdtsc_ordered();
On 14/12/2015 09:16, tip-bot for Andy Lutomirski wrote:
> Commit-ID: 677a73a9aa5433ea728200c26a7b3506d5eaa92b
> Gitweb: http://git.kernel.org/tip/677a73a9aa5433ea728200c26a7b3506d5eaa92b
> Author: Andy Lutomirski <[email protected]>
> AuthorDate: Thu, 10 Dec 2015 19:20:18 -0800
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Fri, 11 Dec 2015 08:56:02 +0100
>
> x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
>
> This gets rid of the "did TSC go backwards" logic and just
> updates all clocks. It should work better (no more disabling of
> fast timing) and more reliably (all of the clocks are actually
> updated).
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Link: http://lkml.kernel.org/r/861716d768a1da6d1fd257b7972f8df13baf7f85.1449702533.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/kvm/x86.c | 75 +++---------------------------------------------------
> 1 file changed, 3 insertions(+), 72 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 00462bd..6e32e87 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> unsigned int __read_mostly lapic_timer_advance_ns = 0;
> module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
>
> -static bool __read_mostly backwards_tsc_observed = false;
> -
> #define KVM_NR_SHARED_MSRS 16
>
> struct kvm_shared_msrs_global {
> @@ -1671,7 +1669,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
> &ka->master_cycle_now);
>
> ka->use_master_clock = host_tsc_clocksource && vcpus_matched
> - && !backwards_tsc_observed
> && !ka->boot_vcpu_runs_old_kvmclock;
>
> if (ka->use_master_clock)
> @@ -7366,88 +7363,22 @@ int kvm_arch_hardware_enable(void)
> struct kvm_vcpu *vcpu;
> int i;
> int ret;
> - u64 local_tsc;
> - u64 max_tsc = 0;
> - bool stable, backwards_tsc = false;
>
> kvm_shared_msr_cpu_online();
> ret = kvm_x86_ops->hardware_enable();
> if (ret != 0)
> return ret;
>
> - local_tsc = rdtsc();
> - stable = !check_tsc_unstable();
> list_for_each_entry(kvm, &vm_list, vm_list) {
> kvm_for_each_vcpu(i, vcpu, kvm) {
> - if (!stable && vcpu->cpu == smp_processor_id())
> + if (vcpu->cpu == smp_processor_id()) {
> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> - if (stable && vcpu->arch.last_host_tsc > local_tsc) {
> - backwards_tsc = true;
> - if (vcpu->arch.last_host_tsc > max_tsc)
> - max_tsc = vcpu->arch.last_host_tsc;
> + kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
> + vcpu);
> }
> }
> }
>
> - /*
> - * Sometimes, even reliable TSCs go backwards. This happens on
> - * platforms that reset TSC during suspend or hibernate actions, but
> - * maintain synchronization. We must compensate. Fortunately, we can
> - * detect that condition here, which happens early in CPU bringup,
> - * before any KVM threads can be running. Unfortunately, we can't
> - * bring the TSCs fully up to date with real time, as we aren't yet far
> - * enough into CPU bringup that we know how much real time has actually
> - * elapsed; our helper function, get_kernel_ns() will be using boot
> - * variables that haven't been updated yet.
> - *
> - * So we simply find the maximum observed TSC above, then record the
> - * adjustment to TSC in each VCPU. When the VCPU later gets loaded,
> - * the adjustment will be applied. Note that we accumulate
> - * adjustments, in case multiple suspend cycles happen before some VCPU
> - * gets a chance to run again. In the event that no KVM threads get a
> - * chance to run, we will miss the entire elapsed period, as we'll have
> - * reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
> - * loose cycle time. This isn't too big a deal, since the loss will be
> - * uniform across all VCPUs (not to mention the scenario is extremely
> - * unlikely). It is possible that a second hibernate recovery happens
> - * much faster than a first, causing the observed TSC here to be
> - * smaller; this would require additional padding adjustment, which is
> - * why we set last_host_tsc to the local tsc observed here.
> - *
> - * N.B. - this code below runs only on platforms with reliable TSC,
> - * as that is the only way backwards_tsc is set above. Also note
> - * that this runs for ALL vcpus, which is not a bug; all VCPUs should
> - * have the same delta_cyc adjustment applied if backwards_tsc
> - * is detected. Note further, this adjustment is only done once,
> - * as we reset last_host_tsc on all VCPUs to stop this from being
> - * called multiple times (one for each physical CPU bringup).
> - *
> - * Platforms with unreliable TSCs don't have to deal with this, they
> - * will be compensated by the logic in vcpu_load, which sets the TSC to
> - * catchup mode. This will catchup all VCPUs to real time, but cannot
> - * guarantee that they stay in perfect synchronization.
> - */
> - if (backwards_tsc) {
> - u64 delta_cyc = max_tsc - local_tsc;
> - backwards_tsc_observed = true;
> - list_for_each_entry(kvm, &vm_list, vm_list) {
> - kvm_for_each_vcpu(i, vcpu, kvm) {
> - vcpu->arch.tsc_offset_adjustment += delta_cyc;
> - vcpu->arch.last_host_tsc = local_tsc;
> - kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
> - }
> -
> - /*
> - * We have to disable TSC offset matching.. if you were
> - * booting a VM while issuing an S4 host suspend....
> - * you may have some problem. Solving this issue is
> - * left as an exercise to the reader.
> - */
> - kvm->arch.last_tsc_nsec = 0;
> - kvm->arch.last_tsc_write = 0;
> - }
> -
> - }
> return 0;
> }
>
>
NACK.
Please remove this patch from the tip tree.
Paolo
2015-12-09 15:12-0800, Andy Lutomirski:
> This gets rid of the "did TSC go backwards" logic and just updates
> all clocks. It should work better (no more disabling of fast
> timing) and more reliably (all of the clocks are actually updated).
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -7369,88 +7366,22 @@ int kvm_arch_hardware_enable(void)
> list_for_each_entry(kvm, &vm_list, vm_list) {
> kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (vcpu->cpu == smp_processor_id()) {
(vmm_exclusive sets vcpu->cpu to -1, so KVM_REQ_MASTERCLOCK_UPDATE might
not run, but vmm_exclusive probably doesn't work anyway.)
> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> + kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
> + vcpu);
> }
(Requesting KVM_REQ_MASTERCLOCK_UPDATE once per VM is enough.)
> - if (backwards_tsc) {
> - u64 delta_cyc = max_tsc - local_tsc;
> - backwards_tsc_observed = true;
> - list_for_each_entry(kvm, &vm_list, vm_list) {
> - kvm_for_each_vcpu(i, vcpu, kvm) {
> - vcpu->arch.tsc_offset_adjustment += delta_cyc;
> - vcpu->arch.last_host_tsc = local_tsc;
tsc_offset_adjustment was set for
/* Apply any externally detected TSC adjustments (due to suspend) */
if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
vcpu->arch.tsc_offset_adjustment = 0;
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
}
Guest TSC is going to jump backward with this patch, which would make
the guest think that a lot of cycles passed. This has no bearing on
guest timekeeping, because the guest shouldn't be using raw TSC.
If we wanted to do something though, there are at least two options:
1) Fake that TSC continued at roughly its specified rate: compute how
many cycles could have elapsed while the CPU was suspended (using
host time before/after suspend and guest TSC frequency) and adjust
guest TSC.
2) Resume guest TSC at its last cycle before suspend.
(Roughly what KVM does now.)
What are your opinions on TSC faking?
Thanks.
---
Btw. I'll be spending some days to decipher kvmclock, so I'd also fix
the masterclock+suspend issue, if you don't mind ... So far, I don't
even see a reason to update kvmclock on kvm_arch_hardware_enable().
Suspend is a condition that we want to handle, so kvm_resume would be a
better place, but we handle suspend only because TSC and timekeeping has
changed, so I think that the right place is in their event notifiers.
On Wed, Mar 16, 2016 at 3:06 PM, Radim Krcmar <[email protected]> wrote:
> 2015-12-09 15:12-0800, Andy Lutomirski:
>> This gets rid of the "did TSC go backwards" logic and just updates
>> all clocks. It should work better (no more disabling of fast
>> timing) and more reliably (all of the clocks are actually updated).
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> @@ -7369,88 +7366,22 @@ int kvm_arch_hardware_enable(void)
>> list_for_each_entry(kvm, &vm_list, vm_list) {
>> kvm_for_each_vcpu(i, vcpu, kvm) {
>> + if (vcpu->cpu == smp_processor_id()) {
>
> (vmm_exclusive sets vcpu->cpu to -1, so KVM_REQ_MASTERCLOCK_UPDATE might
> not run, but vmm_exclusive probably doesn't work anyway.)
>
>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>> + kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
>> + vcpu);
>> }
>
> (Requesting KVM_REQ_MASTERCLOCK_UPDATE once per VM is enough.)
>
>> - if (backwards_tsc) {
>> - u64 delta_cyc = max_tsc - local_tsc;
>> - backwards_tsc_observed = true;
>> - list_for_each_entry(kvm, &vm_list, vm_list) {
>> - kvm_for_each_vcpu(i, vcpu, kvm) {
>> - vcpu->arch.tsc_offset_adjustment += delta_cyc;
>> - vcpu->arch.last_host_tsc = local_tsc;
>
> tsc_offset_adjustment was set for
>
> /* Apply any externally detected TSC adjustments (due to suspend) */
> if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
> adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
> vcpu->arch.tsc_offset_adjustment = 0;
> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> }
>
> Guest TSC is going to jump backward with this patch, which would make
> the guest think that a lot of cycles passed. This has no bearing on
> guest timekeeping, because the guest shouldn't be using raw TSC.
> If we wanted to do something though, there are at least two options:
> 1) Fake that TSC continued at roughly its specified rate: compute how
> many cycles could have elapsed while the CPU was suspended (using
> host time before/after suspend and guest TSC frequency) and adjust
> guest TSC.
> 2) Resume guest TSC at its last cycle before suspend.
> (Roughly what KVM does now.)
>
> What are your opinions on TSC faking?
I'd suggest restarting it wherever it left off, because it's simpler.
If there was a CLOCK_BOOT_RAW, you could try to track it, but I'm not
sure that such a thing exists.
FWIW, if you ever intend to support ART ("always running timer")
passthrough, this is going to be a giant clusterfsck. Good luck. I
haven't gotten a straight answer as to what hardware actually supports
that thing, so even testing isn't no easy.
>
> Thanks.
>
>
> ---
> Btw. I'll be spending some days to decipher kvmclock, so I'd also fix
> the masterclock+suspend issue, if you don't mind ... So far, I don't
> even see a reason to update kvmclock on kvm_arch_hardware_enable().
> Suspend is a condition that we want to handle, so kvm_resume would be a
> better place, but we handle suspend only because TSC and timekeeping has
> changed, so I think that the right place is in their event notifiers.
I'd be glad to try to review things. Please cc me.
One of the Xen people pointed me at the MS Viridian spec for handling
TSC rate changes on migration to or from hosts that don't support TSC
scaling. I wonder if KVM could use the same technique or even the
same API.
--Andy
2016-03-16 15:15-0700, Andy Lutomirski:
> On Wed, Mar 16, 2016 at 3:06 PM, Radim Krcmar <[email protected]> wrote:
>> Guest TSC is going to jump backward with this patch, which would make
>> the guest think that a lot of cycles passed. This has no bearing on
>> guest timekeeping, because the guest shouldn't be using raw TSC.
>> If we wanted to do something though, there are at least two options:
>> 1) Fake that TSC continued at roughly its specified rate: compute how
>> many cycles could have elapsed while the CPU was suspended (using
>> host time before/after suspend and guest TSC frequency) and adjust
>> guest TSC.
>> 2) Resume guest TSC at its last cycle before suspend.
>> (Roughly what KVM does now.)
>>
>> What are your opinions on TSC faking?
>
> I'd suggest restarting it wherever it left off, because it's simpler.
> If there was a CLOCK_BOOT_RAW, you could try to track it, but I'm not
> sure that such a thing exists.
CLOCK_MONOTONIC_RAW can count in suspend, so CLOCK_BOOT_RAW would be a
conditional alias and it probably doesn't exist because of that.
> FWIW, if you ever intend to support ART ("always running timer")
> passthrough, this is going to be a giant clusterfsck. Good luck. I
> haven't gotten a straight answer as to what hardware actually supports
> that thing, so even testing isn't no easy.
Hm, AR TSC would be best handled by doing nothing ... dropping the
faking logic just became tempting.
>> ---
>> Btw. I'll be spending some days to decipher kvmclock, so I'd also fix
>> the masterclock+suspend issue, if you don't mind ... So far, I don't
>> even see a reason to update kvmclock on kvm_arch_hardware_enable().
>> Suspend is a condition that we want to handle, so kvm_resume would be a
>> better place, but we handle suspend only because TSC and timekeeping has
>> changed, so I think that the right place is in their event notifiers.
>
> I'd be glad to try to review things. Please cc me.
Ok.
> One of the Xen people pointed me at the MS Viridian spec for handling
> TSC rate changes on migration to or from hosts that don't support TSC
> scaling. I wonder if KVM could use the same technique or even the
> same API.
The TSC frequency MSR is read-only in Xen, so I guess it's equivalent to
pvclock. I'll take a deeper look, thanks for pointers.
On Wed, Mar 16, 2016 at 3:59 PM, Radim Krcmar <[email protected]> wrote:
> 2016-03-16 15:15-0700, Andy Lutomirski:
>> On Wed, Mar 16, 2016 at 3:06 PM, Radim Krcmar <[email protected]> wrote:
>>> Guest TSC is going to jump backward with this patch, which would make
>>> the guest think that a lot of cycles passed. This has no bearing on
>>> guest timekeeping, because the guest shouldn't be using raw TSC.
>>> If we wanted to do something though, there are at least two options:
>>> 1) Fake that TSC continued at roughly its specified rate: compute how
>>> many cycles could have elapsed while the CPU was suspended (using
>>> host time before/after suspend and guest TSC frequency) and adjust
>>> guest TSC.
>>> 2) Resume guest TSC at its last cycle before suspend.
>>> (Roughly what KVM does now.)
>>>
>>> What are your opinions on TSC faking?
>>
>> I'd suggest restarting it wherever it left off, because it's simpler.
>> If there was a CLOCK_BOOT_RAW, you could try to track it, but I'm not
>> sure that such a thing exists.
>
> CLOCK_MONOTONIC_RAW can count in suspend, so CLOCK_BOOT_RAW would be a
> conditional alias and it probably doesn't exist because of that.
>
>> FWIW, if you ever intend to support ART ("always running timer")
>> passthrough, this is going to be a giant clusterfsck. Good luck. I
>> haven't gotten a straight answer as to what hardware actually supports
>> that thing, so even testing isn't no easy.
>
> Hm, AR TSC would be best handled by doing nothing ... dropping the
> faking logic just became tempting.
As it stands, ART is screwed if you adjust the VMCS's tsc offset. But
I think it's also screwed if you migrate to a machine with a different
ratio of guest TSC ticks to host ART ticks or a different offset,
because the host isn't going to do the rdmsr every time it tries to
access the ART, so passing it through might require a paravirt
mechanism no matter what.
ISTM that, if KVM tries to keep the guest TSC monotonic across
migration, it should probably also keep it monotonic across host
suspend/resume. After all, host suspend/resume is kind of like
migrating from the pre-suspend host to the post-resume host. Maybe it
could even share code.
>
>>> ---
>>> Btw. I'll be spending some days to decipher kvmclock, so I'd also fix
>>> the masterclock+suspend issue, if you don't mind ... So far, I don't
>>> even see a reason to update kvmclock on kvm_arch_hardware_enable().
>>> Suspend is a condition that we want to handle, so kvm_resume would be a
>>> better place, but we handle suspend only because TSC and timekeeping has
>>> changed, so I think that the right place is in their event notifiers.
>>
>> I'd be glad to try to review things. Please cc me.
>
> Ok.
>
>> One of the Xen people pointed me at the MS Viridian spec for handling
>> TSC rate changes on migration to or from hosts that don't support TSC
>> scaling. I wonder if KVM could use the same technique or even the
>> same API.
>
> The TSC frequency MSR is read-only in Xen, so I guess it's equivalent to
> pvclock. I'll take a deeper look, thanks for pointers.
--
Andy Lutomirski
AMA Capital Management, LLC
2016-03-16 16:07-0700, Andy Lutomirski:
> On Wed, Mar 16, 2016 at 3:59 PM, Radim Krcmar <[email protected]> wrote:
>> 2016-03-16 15:15-0700, Andy Lutomirski:
>>> FWIW, if you ever intend to support ART ("always running timer")
>>> passthrough, this is going to be a giant clusterfsck. Good luck. I
>>> haven't gotten a straight answer as to what hardware actually supports
>>> that thing, so even testing isn't no easy.
>>
>> Hm, AR TSC would be best handled by doing nothing ... dropping the
>> faking logic just became tempting.
ART is different from what I initially thought, it's the underlying
mechanism for invariant TSC and nothing more ... we already forbid
migrations when the guest knows about invariant TSC, so we could do the
same and let ART be virtualized. (Suspend has to be forbidden too.)
> As it stands, ART is screwed if you adjust the VMCS's tsc offset. But
Luckily, assigning real hardware can prevent migration or suspend, so we
won't need to adjust the offset during runtime. TSC is a generally
unmigratable device that just happens to live on the CPU.
(It would have been better to hide TSC capability from the guest and only
use rdtsc for kvmclock if the guest wanted fancy features.)
> I think it's also screwed if you migrate to a machine with a different
> ratio of guest TSC ticks to host ART ticks or a different offset,
> because the host isn't going to do the rdmsr every time it tries to
> access the ART, so passing it through might require a paravirt
> mechanism no matter what.
It's almost certain that the other host will have a different offset,
which makes TSC unmigratable in software without even considering ART
or frequencies. Well, KVM already emulates different TSC frequency, so
we could emulate ART without sinking much lower. :)
> ISTM that, if KVM tries to keep the guest TSC monotonic across
> migration, it should probably also keep it monotonic across host
> suspend/resume.
Yes, "Pausing" TSC during suspend or migration is one way of improving
the TSC estimate. If we want to emulate ART, then the estimate is
noticeably lacking, because TSC and ART are defined by a simple
equation (SDM 2015-12, 17.14.4 Invariant Time-Keeping):
TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )/ CPUID.15H:EAX[31:0] + K
where the guest thinks that CPUID and K are constant (between events
that the guest knows of), so we should give the best estimate of how
many TSC cycles have passed. (The best estimate is still lacking.)
> After all, host suspend/resume is kind of like
> migrating from the pre-suspend host to the post-resume host. Maybe it
> could even share code.
Hopefully ... host suspend/resume is driven by kernel and migration is
driven by userspace, which might complicate sharing.
On Mar 17, 2016 8:10 AM, "Radim Krcmar" <[email protected]> wrote:
>
> 2016-03-16 16:07-0700, Andy Lutomirski:
> > On Wed, Mar 16, 2016 at 3:59 PM, Radim Krcmar <[email protected]> wrote:
> >> 2016-03-16 15:15-0700, Andy Lutomirski:
> >>> FWIW, if you ever intend to support ART ("always running timer")
> >>> passthrough, this is going to be a giant clusterfsck. Good luck. I
> >>> haven't gotten a straight answer as to what hardware actually supports
> >>> that thing, so even testing isn't no easy.
> >>
> >> Hm, AR TSC would be best handled by doing nothing ... dropping the
> >> faking logic just became tempting.
>
> ART is different from what I initially thought, it's the underlying
> mechanism for invariant TSC and nothing more ... we already forbid
> migrations when the guest knows about invariant TSC, so we could do the
> same and let ART be virtualized. (Suspend has to be forbidden too.)
It's more than that -- it's a TSC-like clock that can be read by PCIe devices.
>
> > As it stands, ART is screwed if you adjust the VMCS's tsc offset. But
>
> Luckily, assigning real hardware can prevent migration or suspend, so we
> won't need to adjust the offset during runtime. TSC is a generally
> unmigratable device that just happens to live on the CPU.
>
> (It would have been better to hide TSC capability from the guest and only
> use rdtsc for kvmclock if the guest wanted fancy features.)
>
I think that, if KVM passes through an ART-supporting NIC, it might be
rather messy to try to avoid passing through TSC as well. But maybe a
pvclock-like structure could expose the ART-kvmclock offset and scale.
> > I think it's also screwed if you migrate to a machine with a different
> > ratio of guest TSC ticks to host ART ticks or a different offset,
> > because the host isn't going to do the rdmsr every time it tries to
> > access the ART, so passing it through might require a paravirt
> > mechanism no matter what.
>
> It's almost certain that the other host will have a different offset,
> which makes TSC unmigratable in software without even considering ART
> or frequencies. Well, KVM already emulates different TSC frequency, so
> we could emulate ART without sinking much lower. :)
>
> > ISTM that, if KVM tries to keep the guest TSC monotonic across
> > migration, it should probably also keep it monotonic across host
> > suspend/resume.
>
> Yes, "Pausing" TSC during suspend or migration is one way of improving
> the TSC estimate. If we want to emulate ART, then the estimate is
> noticeably lacking, because TSC and ART are defined by a simple
> equation (SDM 2015-12, 17.14.4 Invariant Time-Keeping):
> TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )/ CPUID.15H:EAX[31:0] + K
>
> where the guest thinks that CPUID and K are constant (between events
> that the guest knows of), so we should give the best estimate of how
> many TSC cycles have passed. (The best estimate is still lacking.)
>
> > After all, host suspend/resume is kind of like
> > migrating from the pre-suspend host to the post-resume host. Maybe it
> > could even share code.
>
> Hopefully ... host suspend/resume is driven by kernel and migration is
> driven by userspace, which might complicate sharing.
Good point.
--Andy
2016-03-17 11:22-0700, Andy Lutomirski:
> On Mar 17, 2016 8:10 AM, "Radim Krcmar" <[email protected]> wrote:
>> 2016-03-16 16:07-0700, Andy Lutomirski:
>>> On Wed, Mar 16, 2016 at 3:59 PM, Radim Krcmar <[email protected]> wrote:
>>>> 2016-03-16 15:15-0700, Andy Lutomirski:
>>>>> FWIW, if you ever intend to support ART ("always running timer")
>>>>> passthrough, this is going to be a giant clusterfsck. Good luck. I
>>>>> haven't gotten a straight answer as to what hardware actually supports
>>>>> that thing, so even testing isn't no easy.
>>>>
>>>> Hm, AR TSC would be best handled by doing nothing ... dropping the
>>>> faking logic just became tempting.
>>
>> ART is different from what I initially thought, it's the underlying
>> mechanism for invariant TSC and nothing more ... we already forbid
>> migrations when the guest knows about invariant TSC, so we could do the
>> same and let ART be virtualized. (Suspend has to be forbidden too.)
>
> It's more than that -- it's a TSC-like clock that can be read by PCIe devices.
So ART is for time synchronization within the machine. Makes sense now.
>>> As it stands, ART is screwed if you adjust the VMCS's tsc offset. But
>>
>> Luckily, assigning real hardware can prevent migration or suspend, so we
>> won't need to adjust the offset during runtime. TSC is a generally
>> unmigratable device that just happens to live on the CPU.
>>
>> (It would have been better to hide TSC capability from the guest and only
>> use rdtsc for kvmclock if the guest wanted fancy features.)
>>
>
> I think that, if KVM passes through an ART-supporting NIC, it might be
> rather messy to try to avoid passing through TSC as well.
I agree. Migrating a guest with ART-supporting NIC is going to be hard
or impossible, so there is no big drawback in exposing TSC.
If KVM adds host TSC_ADJUST and VMCS TSC-offset to guest TSC_ADJUST,
then ART-supporting NIC should use timestamps compatible with VCPUs.
> But maybe a
> pvclock-like structure could expose the ART-kvmclock offset and scale.
I think that getting ART from kvmclock would turn out to be horrible.