2009-10-06 00:50:54

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: [PATCH RFC] Extending pvclock down to usermode for vsyscall

This series implements the vread method for the clocksource, which allows
it to be used directly from usermode via vsyscall.

Most of the work is in the generic pvclock code, with a bit of
hypervisor-specific code to set things up. Specifically, it sets up a
page which is an array of pvclock_vcpu_time_info structures indexed by
vcpu number. The usermode code gets the current (v)cpu via vgetcpu,
gets the appropriate time info and computes system time from the tsc.
After doing all that it rechecks the version number to make things
happened consistently. The kernel also installs a preempt notifier to
bump the version number so that usermode can tell it has had its cpu
context switched under it.

I've included the Xen implementation, but it should be simple to implement
for KVM (especially since it already has the feature I had to add to
Xen to implement this: the ability to place the pvclock_vcpu_time_info
structure at an arbitrary address).

With this in place, I can do a gettimeofday in about 100ns on a 2.4GHz
Q6600. I'm sure this could be tuned a bit more, but it is already much
better than a syscall.

(There's also a couple of other little generic pvclock fixes I'll send
separately.)

Thanks,
J

arch/x86/Kconfig | 4 +
arch/x86/include/asm/fixmap.h | 21 +++++
arch/x86/include/asm/pvclock.h | 6 +
arch/x86/include/asm/vsyscall.h | 3
arch/x86/kernel/Makefile | 2
arch/x86/kernel/pvclock.c | 156 ++++++++++++++++++++++++++++++++++++----
arch/x86/xen/Kconfig | 6 +
arch/x86/xen/mmu.c | 3
arch/x86/xen/smp.c | 2
arch/x86/xen/time.c | 52 +++++++++++++
arch/x86/xen/xen-ops.h | 8 ++
include/xen/interface/vcpu.h | 41 ++++++++++
12 files changed, 291 insertions(+), 13 deletions(-)


2009-10-06 00:50:57

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: [PATCH 1/5] x86/pvclock: make sure rdtsc doesn't speculate out of region

rdtsc is not necessarily ordered with surrounding instructions so we
need to make sure that it doesn't get sampled out of order with
respect to reading the time_info.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/pvclock.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..963f349 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -117,10 +117,10 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)

do {
version = pvclock_get_time_values(&shadow, src);
- barrier();
+ rdtsc_barrier();
offset = pvclock_get_nsec_offset(&shadow);
ret = shadow.system_timestamp + offset;
- barrier();
+ rdtsc_barrier();
} while (version != src->version);

return ret;
--
1.6.2.5

2009-10-06 00:51:01

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: [PATCH 2/5] x86/pvclock: no need to use strong read barriers in pvclock_get_time_values

The time values are guaranteed to be only updated by the pcpu that the
vcpu is currently running on (preempting the guest), so there's never
any risk of cross-pcpu races. Therefore, we only need a barrier to
prevent compiler reordering.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/pvclock.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 963f349..5ecce7f 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -86,12 +86,12 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
{
do {
dst->version = src->version;
- rmb(); /* fetch version before data */
+ barrier(); /* fetch version before data */
dst->tsc_timestamp = src->tsc_timestamp;
dst->system_timestamp = src->system_time;
dst->tsc_to_nsec_mul = src->tsc_to_system_mul;
dst->tsc_shift = src->tsc_shift;
- rmb(); /* test version after fetching data */
+ barrier(); /* test version after fetching data */
} while ((src->version & 1) || (dst->version != src->version));

return dst->version;
--
1.6.2.5

2009-10-06 00:51:08

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: [PATCH 3/5] x86/pvclock: add vsyscall implementation

This patch allows the pvclock mechanism to be used in usermode. To
do this, we map an extra page into usermode containing an array of
pvclock_vcpu_time_info structures which give the information required
to compute a global system clock from the tsc. With this, we can
implement pvclock_clocksource_vread().

One complication is that usermode is subject to two levels of scheduling:
kernel scheduling of tasks onto vcpus, and hypervisor scheduling of
vcpus onto pcpus. In either case the underlying pcpu changed, and with
it, the correct set of parameters to compute tsc->system clock. To
address this we install a preempt notifier on sched_out to increment
that vcpu's version number. Usermode can then check the version number
is unchanged while computing the time and retry if it has (the only
difference from the kernel's version of the algorithm is that the vcpu
may have changed, so we may need to switch pvclock_vcpu_time_info
structures.

To use this feature, hypervisor-specific code is required
to call pvclock_init_vsyscall(), and if successful:
- cause the pvclock_vcpu_time_info structure at
pvclock_get_vsyscall_time_info(cpu) to be updated appropriately for
each vcpu.
- use pvclock_clocksource_vread as the implementation of clocksource
.vread.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Keir Fraser <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Dan Magenheimer <[email protected]>
---
arch/x86/Kconfig | 4 +
arch/x86/include/asm/fixmap.h | 3 +
arch/x86/include/asm/pvclock.h | 6 ++
arch/x86/include/asm/vsyscall.h | 3 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/pvclock.c | 152 ++++++++++++++++++++++++++++++++++++---
6 files changed, 160 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 13ffa5d..93346ff 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -518,6 +518,10 @@ config PARAVIRT_CLOCK
bool
default n

+config PARAVIRT_CLOCK_VSYSCALL
+ bool
+ depends on PARAVIRT_CLOCK && PREEMPT_NOTIFIERS
+
endif

config PARAVIRT_DEBUG
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 7b2d71d..ff3cffa 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -80,6 +80,9 @@ enum fixed_addresses {
+ ((VSYSCALL_END-VSYSCALL_START) >> PAGE_SHIFT) - 1,
VSYSCALL_HPET,
#endif
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+ FIX_PVCLOCK_TIME_INFO,
+#endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 53235fd..d2402b3 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -3,6 +3,7 @@

#include <linux/clocksource.h>
#include <asm/pvclock-abi.h>
+#include <asm/vsyscall.h>

/* some helper functions for xen and kvm pv clock sources */
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
@@ -11,4 +12,9 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
struct pvclock_vcpu_time_info *vcpu,
struct timespec *ts);

+int __init pvclock_init_vsyscall(void);
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
+
+cycle_t __vsyscall_fn pvclock_clocksource_vread(void);
+
#endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index d0983d2..df5fb43 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -33,6 +33,9 @@ enum vsyscall_num {
extern int __vgetcpu_mode;
extern volatile unsigned long __jiffies;

+struct getcpu_cache;
+extern long vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache);
+
/* kernel space (writeable) */
extern int vgetcpu_mode;
extern struct timezone sys_tz;
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 430d5b2..97d2e88 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -24,10 +24,12 @@ CFLAGS_vsyscall_64.o := $(PROFILING) -g0 $(nostackp)
CFLAGS_hpet.o := $(nostackp)
CFLAGS_tsc.o := $(nostackp)
CFLAGS_paravirt.o := $(nostackp)
+CFLAGS_pvclock.o := $(nostackp)
GCOV_PROFILE_vsyscall_64.o := n
GCOV_PROFILE_hpet.o := n
GCOV_PROFILE_tsc.o := n
GCOV_PROFILE_paravirt.o := n
+GCOV_PROFILE_pvclock.o := n

obj-y := process_$(BITS).o signal.o entry_$(BITS).o
obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 5ecce7f..14de7f3 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -17,7 +17,9 @@

#include <linux/kernel.h>
#include <linux/percpu.h>
+
#include <asm/pvclock.h>
+#include <asm/vsyscall.h>

/*
* These are perodically updated
@@ -71,9 +73,10 @@ static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
return product;
}

-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+static __always_inline
+u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
{
- u64 delta = native_read_tsc() - shadow->tsc_timestamp;
+ u64 delta = __native_read_tsc() - shadow->tsc_timestamp;
return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
}

@@ -81,8 +84,9 @@ static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
* Reads a consistent set of time-base values from hypervisor,
* into a shadow data area.
*/
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
- struct pvclock_vcpu_time_info *src)
+static __always_inline
+unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
+ const struct pvclock_vcpu_time_info *src)
{
do {
dst->version = src->version;
@@ -109,18 +113,31 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
return pv_tsc_khz;
}

-cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+static __always_inline
+unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
+ cycle_t *cycles)
{
struct pvclock_shadow_time shadow;
unsigned version;
cycle_t ret, offset;

+ version = pvclock_get_time_values(&shadow, src);
+ rdtsc_barrier();
+ offset = pvclock_get_nsec_offset(&shadow);
+ ret = shadow.system_timestamp + offset;
+ rdtsc_barrier();
+
+ *cycles = ret;
+ return version;
+}
+
+cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+{
+ unsigned version;
+ cycle_t ret;
+
do {
- version = pvclock_get_time_values(&shadow, src);
- rdtsc_barrier();
- offset = pvclock_get_nsec_offset(&shadow);
- ret = shadow.system_timestamp + offset;
- rdtsc_barrier();
+ version = __pvclock_read_cycles(src, &ret);
} while (version != src->version);

return ret;
@@ -151,3 +168,118 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,

set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
}
+
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+
+static struct pvclock_vcpu_time_info *pvclock_vsyscall_time_info;
+
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
+{
+ if (pvclock_vsyscall_time_info == NULL)
+ return NULL;
+
+ return &pvclock_vsyscall_time_info[cpu];
+}
+
+static void vti_inc_version(struct pvclock_vcpu_time_info *pvti)
+{
+ /*
+ * This increments the version in an interrupt-atomic way.
+ * We're not concerned about global bus (inter-cpu) atomicity,
+ * but we just need to make sure the update can't be
+ * interrupted by the hypervisor preempting us.
+ */
+#ifdef CONFIG_X86
+ asm("add $2, %0\n" : "+m" (pvti->version));
+#else
+#error FIXME
+#endif
+}
+
+/*
+ * Increment version when switching away from a task so that it can
+ * tell if it has switched vcpus (hypervisor's update of the version
+ * will tell it if it switches pcpus).
+ */
+static void pvclock_vsyscall_sched_out(struct preempt_notifier *pn,
+ struct task_struct *next)
+{
+ int cpu = smp_processor_id();
+ struct pvclock_vcpu_time_info *pvti;
+
+ pvti = pvclock_get_vsyscall_time_info(cpu);
+ if (pvti)
+ vti_inc_version(pvti);
+}
+
+/* Don't care about scheduling in */
+static void pvclock_vsyscall_sched_in(struct preempt_notifier *notifier, int cpu)
+{
+}
+
+static struct preempt_notifier pvclock_vsyscall_notifier;
+static struct preempt_ops pvclock_vsyscall_preempt_ops = {
+ .sched_in = pvclock_vsyscall_sched_in,
+ .sched_out = pvclock_vsyscall_sched_out,
+};
+
+cycle_t __vsyscall_fn pvclock_clocksource_vread(void)
+{
+ const struct pvclock_vcpu_time_info *pvti_base;
+ const struct pvclock_vcpu_time_info *pvti;
+ cycle_t ret;
+ u32 version;
+
+ pvti_base = (struct pvclock_vcpu_time_info *)fix_to_virt(FIX_PVCLOCK_TIME_INFO);
+
+ /*
+ * When looping to get a consistent (time-info, tsc) pair, we
+ * also need to deal with the possibility we can switch vcpus,
+ * so make sure we always re-fetch time-info for the current vcpu.
+ */
+ do {
+ unsigned cpu;
+
+ vgetcpu(&cpu, NULL, NULL);
+ pvti = &pvti_base[cpu];
+
+ version = __pvclock_read_cycles(pvti, &ret);
+ } while (unlikely(pvti->version != version));
+
+ return ret;
+}
+
+/*
+ * Initialize the generic pvclock vsyscall state. This will allocate
+ * a/some page(s) for the per-vcpu pvclock information, set up a
+ * fixmap mapping for the page(s)
+ */
+int __init pvclock_init_vsyscall(void)
+{
+ int cpu;
+
+ /* Just one page for now */
+ if (nr_cpu_ids * sizeof(struct vcpu_time_info) > PAGE_SIZE) {
+ printk(KERN_WARNING "pvclock_vsyscall: too many CPUs to fit time_info into a single page\n");
+ return -ENOSPC;
+ }
+
+ pvclock_vsyscall_time_info =
+ (struct pvclock_vcpu_time_info *)get_zeroed_page(GFP_KERNEL);
+ if (pvclock_vsyscall_time_info == NULL)
+ return -ENOMEM;
+
+ for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+ pvclock_vsyscall_time_info[cpu].version = ~0;
+
+ __set_fixmap(FIX_PVCLOCK_TIME_INFO, __pa(pvclock_vsyscall_time_info),
+ PAGE_KERNEL_VSYSCALL);
+
+ preempt_notifier_init(&pvclock_vsyscall_notifier,
+ &pvclock_vsyscall_preempt_ops);
+ preempt_notifier_register(&pvclock_vsyscall_notifier);
+
+ return 0;
+}
+
+#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */
--
1.6.2.5

2009-10-06 00:51:31

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: [PATCH 4/5] x86/fixmap: add a predicate for usermode fixmaps

Some set of fixmaps are intented to be mapped into usermode. Add an
predicate to test for those fixmaps.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/fixmap.h | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index ff3cffa..0edc2bb 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -181,6 +181,24 @@ static inline void __set_fixmap(enum fixed_addresses idx,

extern void __this_fixmap_does_not_exist(void);

+static inline bool user_fixmap(enum fixed_addresses fixmap)
+{
+ switch (fixmap) {
+#ifdef CONFIG_X86_32
+ case FIX_HOLE ... FIX_VDSO:
+#else
+ case VSYSCALL_LAST_PAGE ... VSYSCALL_HPET:
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+ case FIX_PVCLOCK_TIME_INFO:
+#endif
+#endif
+ return true;
+
+ default:
+ return false;
+ }
+}
+
/*
* 'index to address' translation. If anyone tries to use the idx
* directly without translation, we catch the bug with a NULL-deference
--
1.6.2.5

2009-10-06 00:51:38

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: [PATCH 5/5] xen/time: add pvclock_clocksource_vread support

Add support to register pvclock_vcpu_time_info structures in the userspace
mapped page and set the xen clocksource .vread method if that works.
The common pvclock code does everything else.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/Kconfig | 6 +++++
arch/x86/xen/mmu.c | 3 +-
arch/x86/xen/smp.c | 2 +
arch/x86/xen/time.c | 52 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/xen/xen-ops.h | 8 ++++++
include/xen/interface/vcpu.h | 41 +++++++++++++++++++++++++++++++++
6 files changed, 111 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index b83e119..a002004 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -13,6 +13,11 @@ config XEN
kernel to boot in a paravirtualized environment under the
Xen hypervisor.

+config XEN_TIME_VSYSCALL
+ def_bool y
+ depends on PREEMPT_NOTIFIERS
+ select PARAVIRT_CLOCK_VSYSCALL
+
config XEN_MAX_DOMAIN_MEMORY
int "Maximum allowed size of a domain in gigabytes"
default 8 if X86_32
@@ -36,3 +41,4 @@ config XEN_DEBUG_FS
help
Enable statistics output and various tuning options in debugfs.
Enabling this option may incur a significant performance overhead.
+
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 4ceb285..99b8aa5 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1814,6 +1814,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
# endif
#else
case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE:
+ case FIX_PVCLOCK_TIME_INFO:
#endif
#ifdef CONFIG_X86_LOCAL_APIC
case FIX_APIC_BASE: /* maps dummy local APIC */
@@ -1834,7 +1835,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
#ifdef CONFIG_X86_64
/* Replicate changes to map the vsyscall page into the user
pagetable vsyscall mapping. */
- if (idx >= VSYSCALL_LAST_PAGE && idx <= VSYSCALL_FIRST_PAGE) {
+ if (user_fixmap(idx)) {
unsigned long vaddr = __fix_to_virt(idx);
set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte);
}
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 429834e..a2ee882 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -313,6 +313,8 @@ static int __cpuinit xen_cpu_up(unsigned int cpu)
if (rc)
return rc;

+ xen_setup_vcpu_vsyscall_time_info(cpu);
+
rc = HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
BUG_ON(rc);

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 0a5aa44..ab3119e 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -474,6 +474,55 @@ void xen_timer_resume(void)
}
}

+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+void xen_setup_vcpu_vsyscall_time_info(int cpu)
+{
+ int ret;
+ struct pvclock_vcpu_time_info *pvti;
+ struct vcpu_register_time_memory_area t;
+
+ pvti = pvclock_get_vsyscall_time_info(cpu);
+ if (!pvti)
+ return;
+
+ t.addr.pv = pvti;
+
+ ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area,
+ cpu, &t);
+ /*
+ * If the call succeeds, it will update the vcpu_time_info and
+ * set the version to something valid. If it fails, we set
+ * the version to invalid so that usermode doesn't try to use
+ * it.
+ */
+ if (ret != 0)
+ pvti->version = ~0;
+}
+
+static int __init xen_setup_vsyscall_timeinfo(int cpu)
+{
+ int ret;
+
+ ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area,
+ cpu, NULL);
+ if (ret == -ENOSYS) {
+ printk(KERN_INFO "xen: vcpu_time_info placement not supported\n");
+ return ret;
+ }
+
+ ret = pvclock_init_vsyscall();
+ if (ret != 0) {
+ printk(KERN_INFO "xen: Failed to initialize pvclock vsyscall: %d\n",
+ ret);
+ return ret;
+ }
+
+ xen_setup_vcpu_vsyscall_time_info(cpu);
+
+ return 0;
+}
+#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */
+
__init void xen_time_init(void)
{
int cpu = smp_processor_id();
@@ -496,4 +545,7 @@ __init void xen_time_init(void)

xen_setup_timer(cpu);
xen_setup_cpu_clockevents();
+
+ if (xen_setup_vsyscall_timeinfo(cpu) == 0)
+ xen_clocksource.vread = pvclock_clocksource_vread;
}
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 22494fd..d92ddc8 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -58,6 +58,14 @@ bool xen_vcpu_stolen(int vcpu);

void xen_setup_vcpu_info_placement(void);

+#ifdef CONFIG_XEN_TIME_VSYSCALL
+void xen_setup_vcpu_vsyscall_time_info(int cpu);
+#else
+static inline void xen_setup_vcpu_vsyscall_time_info(int cpu)
+{
+}
+#endif
+
#ifdef CONFIG_SMP
void xen_smp_init(void);

diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h
index 87e6f8a..0a8edfd 100644
--- a/include/xen/interface/vcpu.h
+++ b/include/xen/interface/vcpu.h
@@ -170,4 +170,45 @@ struct vcpu_register_vcpu_info {
};
DEFINE_GUEST_HANDLE_STRUCT(vcpu_register_vcpu_info);

+
+/*
+ * Register a memory location to get a secondary copy of the vcpu time
+ * parameters. The master copy still exists as part of the vcpu
+ * shared memory area, and this secondary copy is updated whenever the
+ * master copy is updated.
+ *
+ * The intent is that this copy may be mapped (RO) into userspace so
+ * that usermode can compute system time using the time info and the
+ * tsc. Usermode will see an array of vcpu_time_info structures, one
+ * for each vcpu, and choose the right one by an existing mechanism
+ * which allows it to get the current vcpu number (such as via a
+ * segment limit). It can then apply the normal algorithm to compute
+ * system time from the tsc.
+ *
+ * However, because usermode threads are subject to two levels of
+ * scheduling (kernel scheduling of threads to vcpus, and Xen
+ * scheduling of vcpus to pcpus), we must make sure that the thread
+ * knows it has had a race with either (or both) of these two events.
+ * To allow this, the guest kernel updates the time_info version
+ * number when the vcpu does a context switch, so that usermode will
+ * always see a version number change when the parameters need to be
+ * revalidated. Xen makes sure that it always updates the guest's
+ * version rather than overwriting it. (It assumes that a vcpu will
+ * always update its own version number, so there are no cross-cpu
+ * synchronization issues; the only concern is that if the guest
+ * kernel gets preempted by Xen it doesn't revert the version number
+ * to an older value.)
+ *
+ * @extra_arg == pointer to vcpu_register_time_info_memory_area structure.
+ */
+#define VCPUOP_register_vcpu_time_memory_area 13
+
+struct vcpu_register_time_memory_area {
+ union {
+ struct vcpu_time_info *v;
+ struct pvclock_vcpu_time_info *pv;
+ uint64_t p;
+ } addr;
+};
+
#endif /* __XEN_PUBLIC_VCPU_H__ */
--
1.6.2.5

2009-10-06 09:06:27

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/06/2009 02:50 AM, Jeremy Fitzhardinge wrote:
> This patch allows the pvclock mechanism to be used in usermode. To
> do this, we map an extra page into usermode containing an array of
> pvclock_vcpu_time_info structures which give the information required
> to compute a global system clock from the tsc. With this, we can
> implement pvclock_clocksource_vread().
>
> One complication is that usermode is subject to two levels of scheduling:
> kernel scheduling of tasks onto vcpus, and hypervisor scheduling of
> vcpus onto pcpus. In either case the underlying pcpu changed, and with
> it, the correct set of parameters to compute tsc->system clock. To
> address this we install a preempt notifier on sched_out to increment
> that vcpu's version number. Usermode can then check the version number
> is unchanged while computing the time and retry if it has (the only
> difference from the kernel's version of the algorithm is that the vcpu
> may have changed, so we may need to switch pvclock_vcpu_time_info
> structures.
>
> To use this feature, hypervisor-specific code is required
> to call pvclock_init_vsyscall(), and if successful:
> - cause the pvclock_vcpu_time_info structure at
> pvclock_get_vsyscall_time_info(cpu) to be updated appropriately for
> each vcpu.
> - use pvclock_clocksource_vread as the implementation of clocksource
> .vread.
>
> +
> +cycle_t __vsyscall_fn pvclock_clocksource_vread(void)
> +{
> + const struct pvclock_vcpu_time_info *pvti_base;
> + const struct pvclock_vcpu_time_info *pvti;
> + cycle_t ret;
> + u32 version;
> +
> + pvti_base = (struct pvclock_vcpu_time_info *)fix_to_virt(FIX_PVCLOCK_TIME_INFO);
> +
> + /*
> + * When looping to get a consistent (time-info, tsc) pair, we
> + * also need to deal with the possibility we can switch vcpus,
> + * so make sure we always re-fetch time-info for the current vcpu.
> + */
> + do {
> + unsigned cpu;
> +
> + vgetcpu(&cpu, NULL, NULL);
> + pvti =&pvti_base[cpu];
> +
> + version = __pvclock_read_cycles(pvti,&ret);
> + } while (unlikely(pvti->version != version));
> +
> + return ret;
> +}
>

Instead of using vgetcpu() and rdtsc() independently, you can use rdtscp
to read both atomically. This removes the need for the preempt notifier.

> +
> +/*
> + * Initialize the generic pvclock vsyscall state. This will allocate
> + * a/some page(s) for the per-vcpu pvclock information, set up a
> + * fixmap mapping for the page(s)
> + */
> +int __init pvclock_init_vsyscall(void)
> +{
> + int cpu;
> +
> + /* Just one page for now */
> + if (nr_cpu_ids * sizeof(struct vcpu_time_info)> PAGE_SIZE) {
> + printk(KERN_WARNING "pvclock_vsyscall: too many CPUs to fit time_info into a single page\n");
> + return -ENOSPC;
> + }
> +
> + pvclock_vsyscall_time_info =
> + (struct pvclock_vcpu_time_info *)get_zeroed_page(GFP_KERNEL);
> + if (pvclock_vsyscall_time_info == NULL)
> + return -ENOMEM;
> +
>

Need to align the vcpu_time_infos on a cacheline boundary.

> + for (cpu = 0; cpu< nr_cpu_ids; cpu++)
> + pvclock_vsyscall_time_info[cpu].version = ~0;
> +
> + __set_fixmap(FIX_PVCLOCK_TIME_INFO, __pa(pvclock_vsyscall_time_info),
> + PAGE_KERNEL_VSYSCALL);
> +
> + preempt_notifier_init(&pvclock_vsyscall_notifier,
> + &pvclock_vsyscall_preempt_ops);
> + preempt_notifier_register(&pvclock_vsyscall_notifier);
> +
>

preempt notifiers are per-thread, not global, and will upset the cycle
counters. I'd drop them and use rdtscp instead (and give up if the
processor doesn't support it).

--
error compiling committee.c: too many arguments to function

2009-10-06 10:24:37

by Jan Beulich

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 4/5] x86/fixmap: add a predicate for usermode fixmaps

>>> Jeremy Fitzhardinge <[email protected]> 06.10.09 02:50 >>>
>--- a/arch/x86/include/asm/fixmap.h
>+++ b/arch/x86/include/asm/fixmap.h
>@@ -181,6 +181,24 @@ static inline void __set_fixmap(enum fixed_addresses idx,
>
> extern void __this_fixmap_does_not_exist(void);
>
>+static inline bool user_fixmap(enum fixed_addresses fixmap)
>+{
>+ switch (fixmap) {
>+#ifdef CONFIG_X86_32
>+ case FIX_HOLE ... FIX_VDSO:
>+#else
>+ case VSYSCALL_LAST_PAGE ... VSYSCALL_HPET:
>+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
>+ case FIX_PVCLOCK_TIME_INFO:
>+#endif
>+#endif
>+ return true;
>+
>+ default:

Isn't that nested #ifdef rather meant to be successive to the 32-/64-bit
one?

Jan

2009-10-06 10:29:06

by Jan Beulich

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 5/5] xen/time: add pvclock_clocksource_vread support

>>> Jeremy Fitzhardinge <[email protected]> 06.10.09 02:50 >>>
>--- a/arch/x86/xen/mmu.c
>+++ b/arch/x86/xen/mmu.c
>@@ -1814,6 +1814,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
> # endif
> #else
> case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE:
>+ case FIX_PVCLOCK_TIME_INFO:
> #endif
> #ifdef CONFIG_X86_LOCAL_APIC
> case FIX_APIC_BASE: /* maps dummy local APIC */

And I think this added case also needs to move past the #endif.

Jan

2009-10-06 14:23:46

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> From: Jeremy Fitzhardinge [mailto:[email protected]]
> With this in place, I can do a gettimeofday in about 100ns on a 2.4GHz
> Q6600. I'm sure this could be tuned a bit more, but it is
> already much better than a syscall.

To evaluate the goodness of this, we really need a full
set of measurements for:

a) cost of rdtsc (and rdtscp if different)
b) cost of vsyscall+pvclock
c) cost of rdtsc emulated
d) cost of a hypercall that returns "hypervisor system time"

On a E6850 (3Ghz but let's use cycles), I measured;

a == 72 cycles
c == 1080 cycles
d == 780 cycles

It may be partly apples and oranges, but it looks
like a good guess for b on my machine is

b == 240 cycles

Not bad, but is there any additional context switch
cost to support it?

> From: Avi Kivity [mailto:[email protected]]
> Instead of using vgetcpu() and rdtsc() independently, you can
> use rdtscp
> to read both atomically. This removes the need for the
> preempt notifier.

Xen does not currently expose rdtscp and so does not emulate
(or context switch) TSC_AUX. Context switching TSC_AUX
is certainly possible, but will likely be expensive.
If the primary reason for vsyscall+pvclock is to maximize
performance for gettimeofday/clock_gettime, this cost
would need to be added to the mix.

> preempt notifiers are per-thread, not global, and will upset
> the cycle
> counters. I'd drop them and use rdtscp instead (and give up if the
> processor doesn't support it).

Even if rdtscp is used, in the Intel processor lineup
only the very latest (Nehalem) supports rdtscp, so
"give up" doesn't seem like a very good option, at least
in the near future.

2009-10-06 15:12:55

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/06/2009 04:19 PM, Dan Magenheimer wrote:
>> From: Jeremy Fitzhardinge [mailto:[email protected]]
>> With this in place, I can do a gettimeofday in about 100ns on a 2.4GHz
>> Q6600. I'm sure this could be tuned a bit more, but it is
>> already much better than a syscall.
>>
> To evaluate the goodness of this, we really need a full
> set of measurements for:
>
> a) cost of rdtsc (and rdtscp if different)
> b) cost of vsyscall+pvclock
> c) cost of rdtsc emulated
> d) cost of a hypercall that returns "hypervisor system time"
>
> On a E6850 (3Ghz but let's use cycles), I measured;
>
> a == 72 cycles
> c == 1080 cycles
> d == 780 cycles
>
> It may be partly apples and oranges, but it looks
> like a good guess for b on my machine is
>
> b == 240 cycles
>

Two rdtscps should suffice (and I think they are much faster on modern
machines).

> Not bad, but is there any additional context switch
> cost to support it?
>

rdtscp requires an additional msr read/write on heavyweight host context
switches. Should be negligible compared to the savings.

>> From: Avi Kivity [mailto:[email protected]]
>> Instead of using vgetcpu() and rdtsc() independently, you can
>> use rdtscp
>> to read both atomically. This removes the need for the
>> preempt notifier.
>>
> Xen does not currently expose rdtscp and so does not emulate
> (or context switch) TSC_AUX. Context switching TSC_AUX
> is certainly possible, but will likely be expensive.
> If the primary reason for vsyscall+pvclock is to maximize
> performance for gettimeofday/clock_gettime, this cost
> would need to be added to the mix.
>

It will cost ~100 cycles on heavyweight host context switch
(guest-to-guest).

>> preempt notifiers are per-thread, not global, and will upset
>> the cycle
>> counters. I'd drop them and use rdtscp instead (and give up if the
>> processor doesn't support it).
>>
> Even if rdtscp is used, in the Intel processor lineup
> only the very latest (Nehalem) supports rdtscp, so
> "give up" doesn't seem like a very good option, at least
> in the near future.
>

Why not? we still fall back to the guest kernel. By the time guest
kernels with rdtscp support are in the field, these machines will be
quiet old.

--
error compiling committee.c: too many arguments to function

2009-10-06 18:46:53

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/06/09 02:04, Avi Kivity wrote:
> On 10/06/2009 02:50 AM, Jeremy Fitzhardinge wrote:
>> This patch allows the pvclock mechanism to be used in usermode. To
>> do this, we map an extra page into usermode containing an array of
>> pvclock_vcpu_time_info structures which give the information required
>> to compute a global system clock from the tsc. With this, we can
>> implement pvclock_clocksource_vread().
>>
>> One complication is that usermode is subject to two levels of
>> scheduling:
>> kernel scheduling of tasks onto vcpus, and hypervisor scheduling of
>> vcpus onto pcpus. In either case the underlying pcpu changed, and with
>> it, the correct set of parameters to compute tsc->system clock. To
>> address this we install a preempt notifier on sched_out to increment
>> that vcpu's version number. Usermode can then check the version number
>> is unchanged while computing the time and retry if it has (the only
>> difference from the kernel's version of the algorithm is that the vcpu
>> may have changed, so we may need to switch pvclock_vcpu_time_info
>> structures.
>>
>> To use this feature, hypervisor-specific code is required
>> to call pvclock_init_vsyscall(), and if successful:
>> - cause the pvclock_vcpu_time_info structure at
>> pvclock_get_vsyscall_time_info(cpu) to be updated appropriately for
>> each vcpu.
>> - use pvclock_clocksource_vread as the implementation of clocksource
>> .vread.
>>
>> +
>> +cycle_t __vsyscall_fn pvclock_clocksource_vread(void)
>> +{
>> + const struct pvclock_vcpu_time_info *pvti_base;
>> + const struct pvclock_vcpu_time_info *pvti;
>> + cycle_t ret;
>> + u32 version;
>> +
>> + pvti_base = (struct pvclock_vcpu_time_info
>> *)fix_to_virt(FIX_PVCLOCK_TIME_INFO);
>> +
>> + /*
>> + * When looping to get a consistent (time-info, tsc) pair, we
>> + * also need to deal with the possibility we can switch vcpus,
>> + * so make sure we always re-fetch time-info for the current vcpu.
>> + */
>> + do {
>> + unsigned cpu;
>> +
>> + vgetcpu(&cpu, NULL, NULL);
>> + pvti =&pvti_base[cpu];
>> +
>> + version = __pvclock_read_cycles(pvti,&ret);
>> + } while (unlikely(pvti->version != version));
>> +
>> + return ret;
>> +}
>>
>
> Instead of using vgetcpu() and rdtsc() independently, you can use
> rdtscp to read both atomically. This removes the need for the preempt
> notifier.

rdtscp first appeared on Intel with Nehalem, so we need to support older
Intel chips.

You could use rdscp to get (tsc,cpu) atomically, but that's not
sufficient to be able to get a consistent snapshot of (tsc, time_info)
because it doesn't give you the pvclock_vcpu_time_info version number.
If TSC_AUX contained that too, it might be possible. Alternatively you
could compare the tsc with pvclock.tsc_timestamp, but unfortunately the
ABI doesn't specify that tsc_timestamp is updated in any particular
order compared to the rest of the fields, so you still can't use that to
get a consistent snapshot (we can revise the ABI, of course).

So either way it doesn't avoid the need to iterate. vgetcpu will use
rdtscp if available, but I agree it is unfortunate we need to do a
redundant rdtsc in that case.

Avoiding the preempt notifier would be nice. Definitely worth a
followup-patch.

>> +
>> +/*
>> + * Initialize the generic pvclock vsyscall state. This will allocate
>> + * a/some page(s) for the per-vcpu pvclock information, set up a
>> + * fixmap mapping for the page(s)
>> + */
>> +int __init pvclock_init_vsyscall(void)
>> +{
>> + int cpu;
>> +
>> + /* Just one page for now */
>> + if (nr_cpu_ids * sizeof(struct vcpu_time_info)> PAGE_SIZE) {
>> + printk(KERN_WARNING "pvclock_vsyscall: too many CPUs to fit
>> time_info into a single page\n");
>> + return -ENOSPC;
>> + }
>> +
>> + pvclock_vsyscall_time_info =
>> + (struct pvclock_vcpu_time_info *)get_zeroed_page(GFP_KERNEL);
>> + if (pvclock_vsyscall_time_info == NULL)
>> + return -ENOMEM;
>> +
>>
>
> Need to align the vcpu_time_infos on a cacheline boundary.

OK.

>> + for (cpu = 0; cpu< nr_cpu_ids; cpu++)
>> + pvclock_vsyscall_time_info[cpu].version = ~0;
>> +
>> + __set_fixmap(FIX_PVCLOCK_TIME_INFO,
>> __pa(pvclock_vsyscall_time_info),
>> + PAGE_KERNEL_VSYSCALL);
>> +
>> + preempt_notifier_init(&pvclock_vsyscall_notifier,
>> + &pvclock_vsyscall_preempt_ops);
>> + preempt_notifier_register(&pvclock_vsyscall_notifier);
>> +
>>
>
> preempt notifiers are per-thread, not global, and will upset the cycle
> counters.

Ah, so I need to register it on every new thread? That's a bit awkward.

This is intended to satisfy the cycle-counters who want to do
gettimeofday a million times a second, where I guess the tradeoff of
avoiding a pile of syscalls is worth a bit of context-switch overhead.

> I'd drop them and use rdtscp instead (and give up if the processor
> doesn't support it).
>

None of my test machines have rdtscp, so that won't do ;)

J

2009-10-06 18:48:21

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 4/5] x86/fixmap: add a predicate for usermode fixmaps

On 10/06/09 03:23, Jan Beulich wrote:
>>>> Jeremy Fitzhardinge <[email protected]> 06.10.09 02:50 >>>
>>>>
>> --- a/arch/x86/include/asm/fixmap.h
>> +++ b/arch/x86/include/asm/fixmap.h
>> @@ -181,6 +181,24 @@ static inline void __set_fixmap(enum fixed_addresses idx,
>>
>> extern void __this_fixmap_does_not_exist(void);
>>
>> +static inline bool user_fixmap(enum fixed_addresses fixmap)
>> +{
>> + switch (fixmap) {
>> +#ifdef CONFIG_X86_32
>> + case FIX_HOLE ... FIX_VDSO:
>> +#else
>> + case VSYSCALL_LAST_PAGE ... VSYSCALL_HPET:
>> +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
>> + case FIX_PVCLOCK_TIME_INFO:
>> +#endif
>> +#endif
>> + return true;
>> +
>> + default:
>>
> Isn't that nested #ifdef rather meant to be successive to the 32-/64-bit
> one?
>

No, because 32-bit doesn't support vsyscall at present. But I've been
meaning to add a CONFIG_X86_VSYSCALL to make all the vsyscall stuff
conditional on, so if/when 32-bit vsyscall gets done its easy to turn it
all on.

J

2009-10-06 18:48:49

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 5/5] xen/time: add pvclock_clocksource_vread support

On 10/06/09 03:28, Jan Beulich wrote:
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -1814,6 +1814,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
>> # endif
>> #else
>> case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE:
>> + case FIX_PVCLOCK_TIME_INFO:
>> #endif
>> #ifdef CONFIG_X86_LOCAL_APIC
>> case FIX_APIC_BASE: /* maps dummy local APIC */
>>
> And I think this added case also needs to move past the #endif.
>

No, for the same reason.

J

2009-10-07 10:27:05

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/06/2009 08:46 PM, Jeremy Fitzhardinge wrote:
>
>> Instead of using vgetcpu() and rdtsc() independently, you can use
>> rdtscp to read both atomically. This removes the need for the preempt
>> notifier.
>>
> rdtscp first appeared on Intel with Nehalem, so we need to support older
> Intel chips.
>

We can support them by falling back to the kernel. I'm a bit worried
about the kernel playing with the hypervisor's version field. It's
better to introduce yet a new version for the kernel, and check both.

> You could use rdscp to get (tsc,cpu) atomically, but that's not
> sufficient to be able to get a consistent snapshot of (tsc, time_info)
> because it doesn't give you the pvclock_vcpu_time_info version number.
> If TSC_AUX contained that too, it might be possible. Alternatively you
> could compare the tsc with pvclock.tsc_timestamp, but unfortunately the
> ABI doesn't specify that tsc_timestamp is updated in any particular
> order compared to the rest of the fields, so you still can't use that to
> get a consistent snapshot (we can revise the ABI, of course).
>
> So either way it doesn't avoid the need to iterate. vgetcpu will use
> rdtscp if available, but I agree it is unfortunate we need to do a
> redundant rdtsc in that case.
>
>

def try_pvclock_vtime():
tsc, p0 = rdtscp()
v0 = pvclock[p0].version
tsc, p = rdtscp()
t = pvclock_time(pvclock[p], tsc)
if p != p0 or pvclock[p].version != v0:
raise Exception("Processor or timebased change under our feet")
return t

def pvclock_time():
while True:
try:
return try_pvlock_time()
except:
pass

So, two rdtscps and two compares.

>>> + for (cpu = 0; cpu< nr_cpu_ids; cpu++)
>>> + pvclock_vsyscall_time_info[cpu].version = ~0;
>>> +
>>> + __set_fixmap(FIX_PVCLOCK_TIME_INFO,
>>> __pa(pvclock_vsyscall_time_info),
>>> + PAGE_KERNEL_VSYSCALL);
>>> +
>>> + preempt_notifier_init(&pvclock_vsyscall_notifier,
>>> +&pvclock_vsyscall_preempt_ops);
>>> + preempt_notifier_register(&pvclock_vsyscall_notifier);
>>> +
>>>
>> preempt notifiers are per-thread, not global, and will upset the cycle
>> counters.
>>
> Ah, so I need to register it on every new thread? That's a bit awkward.
>

It's used to manage processor registers, much like the fpu. If a thread
uses a register that's not saved and restored by the normal context
switch code, it can register a preempt notifier to do that instead.

> This is intended to satisfy the cycle-counters who want to do
> gettimeofday a million times a second, where I guess the tradeoff of
> avoiding a pile of syscalls is worth a bit of context-switch overhead.
>

It's sufficient to increment a version counter on thread migration, no
need to do it on context switch.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-07 19:30:07

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/07/09 03:25, Avi Kivity wrote:
> On 10/06/2009 08:46 PM, Jeremy Fitzhardinge wrote:
>>
>>> Instead of using vgetcpu() and rdtsc() independently, you can use
>>> rdtscp to read both atomically. This removes the need for the preempt
>>> notifier.
>>>
>> rdtscp first appeared on Intel with Nehalem, so we need to support older
>> Intel chips.
>>
>
> We can support them by falling back to the kernel.

Yes, but its easy enough to support them with the fast-path.

> I'm a bit worried about the kernel playing with the hypervisor's
> version field.

For Xen I explicitly made it not a problem by adding the notion of a
secondary pvclock_vcpu_time_info structure which is updated by copying,
aside from the version number which is updated as-is.

As far as I can tell it isn't a problem for KVM either. The guest
version number is atomic with respect to preemption by the hypervisor so
there's no scope for racing. (The ABI already guarantees that the
pvclock structures are never updated cross-cpu.)

It ultimately doesn't matter what the version number is so long as it
changes when the parameters are updated, and version numbers can't be
reused within a window where things get confused.

> It's better to introduce yet a new version for the kernel, and check
> both.

Two version numbers are awkward to read atomically at least on 32-bit.
And I don't think its necessary.

> def try_pvclock_vtime():
> tsc, p0 = rdtscp()
> v0 = pvclock[p0].version
> tsc, p = rdtscp()
> t = pvclock_time(pvclock[p], tsc)
> if p != p0 or pvclock[p].version != v0:
> raise Exception("Processor or timebased change under our feet")
> return t
>
> def pvclock_time():
> while True:
> try:
> return try_pvlock_time()
> except:
> pass
>
> So, two rdtscps and two compares.

Yep, that would work.

> It's sufficient to increment a version counter on thread migration, no
> need to do it on context switch.
>

That's true; switch_out is a pessimistic approximation of that. But is
there a convenient hook to test for migration?

J

2009-10-07 20:10:57

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/07/2009 09:29 PM, Jeremy Fitzhardinge wrote:
>
>> I'm a bit worried about the kernel playing with the hypervisor's
>> version field.
>>
> For Xen I explicitly made it not a problem by adding the notion of a
> secondary pvclock_vcpu_time_info structure which is updated by copying,
> aside from the version number which is updated as-is.
>

When do you copy?

I'd rather have a single copy for guest and host.

> As far as I can tell it isn't a problem for KVM either. The guest
> version number is atomic with respect to preemption by the hypervisor so
> there's no scope for racing. (The ABI already guarantees that the
> pvclock structures are never updated cross-cpu.)
>
> It ultimately doesn't matter what the version number is so long as it
> changes when the parameters are updated, and version numbers can't be
> reused within a window where things get confused.
>

If the hypervisor does a pvclock->version = somethingelse->version++
then the guest may get confused. But I understand you have a
guest-private ->version?

>> It's better to introduce yet a new version for the kernel, and check
>> both.
>>
> Two version numbers are awkward to read atomically at least on 32-bit.
> And I don't think its necessary.
>

No need to read them atomically.

cpu1 = vgetcpu()
hver1 = pvclock[cpu1].hver
kver1 = pvclock[cpu1].kver
tsc = rdtsc
/* multipication magic with pvclock[cpu1]*/
cpu2 = vgetcpu()
hver2 = pvclock[cpu2].hver
kver2 = pvclock[cpu2].kver
valid = cpu1 == cpu2 && hver1 == hver2 && kver1 == kver2

>> It's sufficient to increment a version counter on thread migration, no
>> need to do it on context switch.
>>
>>
> That's true; switch_out is a pessimistic approximation of that. But is
> there a convenient hook to test for migration?
>

I'd guess not but it's probably easy to add one in the migration thread.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-07 20:52:23

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> We can support them by falling back to the kernel. I'm a bit worried
> about the kernel playing with the hypervisor's version field. It's
> better to introduce yet a new version for the kernel, and check both.

On Nehalem, apps that need timestamp information at a high
frequency will likely use rdtsc/rdtscp directly.

I very much support Jeremy's efforts to make vsyscall+pvclock
work fast on processors other than the very newest ones.

Dan

> -----Original Message-----
> From: Avi Kivity [mailto:[email protected]]
> Sent: Wednesday, October 07, 2009 4:26 AM
> To: Jeremy Fitzhardinge
> Cc: Jeremy Fitzhardinge; Dan Magenheimer; Xen-devel; Kurt Hackel; the
> arch/x86 maintainers; Linux Kernel Mailing List; Glauber de Oliveira
> Costa; Keir Fraser; Zach Brown; Chris Mason
> Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> implementation
>
>
> On 10/06/2009 08:46 PM, Jeremy Fitzhardinge wrote:
> >
> >> Instead of using vgetcpu() and rdtsc() independently, you can use
> >> rdtscp to read both atomically. This removes the need for
> the preempt
> >> notifier.
> >>
> > rdtscp first appeared on Intel with Nehalem, so we need to
> support older
> > Intel chips.
> >
>
> We can support them by falling back to the kernel. I'm a bit worried
> about the kernel playing with the hypervisor's version field. It's
> better to introduce yet a new version for the kernel, and check both.
>
> > You could use rdscp to get (tsc,cpu) atomically, but that's not
> > sufficient to be able to get a consistent snapshot of (tsc,
> time_info)
> > because it doesn't give you the pvclock_vcpu_time_info
> version number.
> > If TSC_AUX contained that too, it might be possible.
> Alternatively you
> > could compare the tsc with pvclock.tsc_timestamp, but
> unfortunately the
> > ABI doesn't specify that tsc_timestamp is updated in any particular
> > order compared to the rest of the fields, so you still
> can't use that to
> > get a consistent snapshot (we can revise the ABI, of course).
> >
> > So either way it doesn't avoid the need to iterate.
> vgetcpu will use
> > rdtscp if available, but I agree it is unfortunate we need to do a
> > redundant rdtsc in that case.
> >
> >
>
> def try_pvclock_vtime():
> tsc, p0 = rdtscp()
> v0 = pvclock[p0].version
> tsc, p = rdtscp()
> t = pvclock_time(pvclock[p], tsc)
> if p != p0 or pvclock[p].version != v0:
> raise Exception("Processor or timebased change under our feet")
> return t
>
> def pvclock_time():
> while True:
> try:
> return try_pvlock_time()
> except:
> pass
>
> So, two rdtscps and two compares.
>
> >>> + for (cpu = 0; cpu< nr_cpu_ids; cpu++)
> >>> + pvclock_vsyscall_time_info[cpu].version = ~0;
> >>> +
> >>> + __set_fixmap(FIX_PVCLOCK_TIME_INFO,
> >>> __pa(pvclock_vsyscall_time_info),
> >>> + PAGE_KERNEL_VSYSCALL);
> >>> +
> >>> + preempt_notifier_init(&pvclock_vsyscall_notifier,
> >>> +&pvclock_vsyscall_preempt_ops);
> >>> + preempt_notifier_register(&pvclock_vsyscall_notifier);
> >>> +
> >>>
> >> preempt notifiers are per-thread, not global, and will
> upset the cycle
> >> counters.
> >>
> > Ah, so I need to register it on every new thread? That's a
> bit awkward.
> >
>
> It's used to manage processor registers, much like the fpu.
> If a thread
> uses a register that's not saved and restored by the normal context
> switch code, it can register a preempt notifier to do that instead.
>
> > This is intended to satisfy the cycle-counters who want to do
> > gettimeofday a million times a second, where I guess the tradeoff of
> > avoiding a pile of syscalls is worth a bit of
> context-switch overhead.
> >
>
> It's sufficient to increment a version counter on thread
> migration, no
> need to do it on context switch.
>
> --
> Do not meddle in the internals of kernels, for they are
> subtle and quick to panic.
>
>
>

2009-10-07 21:09:42

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/07/2009 10:48 PM, Dan Magenheimer wrote:
>> We can support them by falling back to the kernel. I'm a bit worried
>> about the kernel playing with the hypervisor's version field. It's
>> better to introduce yet a new version for the kernel, and check both.
>>
> On Nehalem, apps that need timestamp information at a high
> frequency will likely use rdtsc/rdtscp directly.
>
>

Then they will get incorrect timing once they are live migrated.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-10-07 21:19:58

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/07/09 13:09, Avi Kivity wrote:
> On 10/07/2009 09:29 PM, Jeremy Fitzhardinge wrote:
>>
>>> I'm a bit worried about the kernel playing with the hypervisor's
>>> version field.
>>>
>> For Xen I explicitly made it not a problem by adding the notion of a
>> secondary pvclock_vcpu_time_info structure which is updated by copying,
>> aside from the version number which is updated as-is.
>>
>
> When do you copy?
>
> I'd rather have a single copy for guest and host.

When Xen updates the parameters normally. The interface never really
needed to share the memory between hypervisor and guest, and I think
avoiding it is a bit more robust.

But for KVM, you already use the MSR to place the pvclock_vcpu_time_info
structure, so you could just place it in the page and use the same
memory for kernel and usermode.

> If the hypervisor does a pvclock->version = somethingelse->version++
> then the guest may get confused. But I understand you have a
> guest-private ->version?

The guest should never get confused by the version being changed by the
hypervisor. It's already part of the ABI. Or did you mean something else?

I'm not sure what you mean by "guest-private version"; the versions are
always guest-private: te version is part of the pvclock structure,
which is per-vcpu, which is private to each guest. The guest nevern
maintains a separate long-term copy of the structure, only a transient
snapshot while computing time from the tsc (that's the current pvclock.c
code).

> No need to read them atomically.
>
> cpu1 = vgetcpu()
> hver1 = pvclock[cpu1].hver
> kver1 = pvclock[cpu1].kver
> tsc = rdtsc
> /* multipication magic with pvclock[cpu1]*/
> cpu2 = vgetcpu()
> hver2 = pvclock[cpu2].hver
> kver2 = pvclock[cpu2].kver
> valid = cpu1 == cpu2 && hver1 == hver2 && kver1 == kver2

I don't think that's necessary, but I can certainly live with it if it
makes you happier.

>>> It's sufficient to increment a version counter on thread migration, no
>>> need to do it on context switch.
>>>
>>>
>> That's true; switch_out is a pessimistic approximation of that. But is
>> there a convenient hook to test for migration?
>>
>
> I'd guess not but it's probably easy to add one in the migration thread.

Looking at that now.

J

2009-10-07 21:39:25

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/07/2009 11:19 PM, Jeremy Fitzhardinge wrote:
>
>> When do you copy?
>>
>> I'd rather have a single copy for guest and host.
>>
> When Xen updates the parameters normally. The interface never really
> needed to share the memory between hypervisor and guest, and I think
> avoiding it is a bit more robust.
>
> But for KVM, you already use the MSR to place the pvclock_vcpu_time_info
> structure, so you could just place it in the page and use the same
> memory for kernel and usermode.
>

Yes.

>> If the hypervisor does a pvclock->version = somethingelse->version++
>> then the guest may get confused. But I understand you have a
>> guest-private ->version?
>>
> The guest should never get confused by the version being changed by the
> hypervisor. It's already part of the ABI. Or did you mean something else?
>

If the guest does a RMW on the version, but the host does not (copying
it from somewhere else), then the guest RMW can be lost.

Looking at the code, that's what kvm does:

vcpu->hv_clock.version += 2;

shared_kaddr = kmap_atomic(vcpu->time_page, KM_USER0);

memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
sizeof(vcpu->hv_clock));

so a guest-side ++version can be lost.

> I'm not sure what you mean by "guest-private version"; the versions are
> always guest-private: te version is part of the pvclock structure,
> which is per-vcpu, which is private to each guest. The guest nevern
> maintains a separate long-term copy of the structure, only a transient
> snapshot while computing time from the tsc (that's the current pvclock.c
> code).
>

Same for kvm. I'm not worried about cross-guest corruption, just the
guest and host working together to confuse the guest.

>> No need to read them atomically.
>>
>> cpu1 = vgetcpu()
>> hver1 = pvclock[cpu1].hver
>> kver1 = pvclock[cpu1].kver
>> tsc = rdtsc
>> /* multipication magic with pvclock[cpu1]*/
>> cpu2 = vgetcpu()
>> hver2 = pvclock[cpu2].hver
>> kver2 = pvclock[cpu2].kver
>> valid = cpu1 == cpu2&& hver1 == hver2&& kver1 == kver2
>>
> I don't think that's necessary, but I can certainly live with it if it
> makes you happier.
>

I think the version issue requires it.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-10-07 21:52:31

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/07/09 14:37, Avi Kivity wrote:
> If the guest does a RMW on the version, but the host does not (copying
> it from somewhere else), then the guest RMW can be lost.
>
> Looking at the code, that's what kvm does:
>
> vcpu->hv_clock.version += 2;
>
> shared_kaddr = kmap_atomic(vcpu->time_page, KM_USER0);
>
> memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
> sizeof(vcpu->hv_clock));
>
> so a guest-side ++version can be lost.

I see, yes. The Xen code explicitly reads back the guest version and
increments that (I realize now that's what you meant by guest-private
version). If you were to have a second version number it would have to
be separated as well to avoid being overwritten by the hypervisor.

J

2009-10-07 21:55:27

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/07/2009 11:51 PM, Jeremy Fitzhardinge wrote:
> On 10/07/09 14:37, Avi Kivity wrote:
>
>> If the guest does a RMW on the version, but the host does not (copying
>> it from somewhere else), then the guest RMW can be lost.
>>
>> Looking at the code, that's what kvm does:
>>
>> vcpu->hv_clock.version += 2;
>>
>> shared_kaddr = kmap_atomic(vcpu->time_page, KM_USER0);
>>
>> memcpy(shared_kaddr + vcpu->time_offset,&vcpu->hv_clock,
>> sizeof(vcpu->hv_clock));
>>
>> so a guest-side ++version can be lost.
>>
> I see, yes. The Xen code explicitly reads back the guest version and
> increments that (I realize now that's what you meant by guest-private
> version). If you were to have a second version number it would have to
> be separated as well to avoid being overwritten by the hypervisor.
>

Yes. We have the space since a cacheline is 64 bytes (minimum) vs 32
bytes of pvclock data.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-10-07 22:40:12

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> Then they will get incorrect timing once they are live migrated.

I've posted a proposed (OS-independent) solution for that and
am (slowly) in the process of implementing it.

> -----Original Message-----
> From: Avi Kivity [mailto:[email protected]]
> Sent: Wednesday, October 07, 2009 3:08 PM
> To: Dan Magenheimer
> Cc: Jeremy Fitzhardinge; Jeremy Fitzhardinge; Xen-devel; Kurt Hackel;
> the arch/x86 maintainers; Linux Kernel Mailing List; Glauber
> de Oliveira
> Costa; Keir Fraser; Zach Brown; Chris Mason
> Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> implementation
>
>
> On 10/07/2009 10:48 PM, Dan Magenheimer wrote:
> >> We can support them by falling back to the kernel. I'm a
> bit worried
> >> about the kernel playing with the hypervisor's version field. It's
> >> better to introduce yet a new version for the kernel, and
> check both.
> >>
> > On Nehalem, apps that need timestamp information at a high
> > frequency will likely use rdtsc/rdtscp directly.
> >
> >
>
> Then they will get incorrect timing once they are live migrated.
>
> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>
>
>

2009-10-10 00:51:55

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/07/09 03:25, Avi Kivity wrote:
> def try_pvclock_vtime():
> tsc, p0 = rdtscp()
> v0 = pvclock[p0].version
> tsc, p = rdtscp()
> t = pvclock_time(pvclock[p], tsc)
> if p != p0 or pvclock[p].version != v0:
> raise Exception("Processor or timebased change under our feet")
> return t

This doesn't quite work.

If we end up migrating some time after the first rdtscp, then the
accesses to pvclock[] will be cross-cpu. Since we don't made any strong
SMP memory ordering guarantees on updating the structure, the snapshot
isn't guaranteed to be consistent even if we re-check the version at the
end.

So to use rdtscp we need to either redefine the update of
pvclock_vcpu_time_info to be SMP-safe, or keep the additional migration
check.

J

2009-10-10 18:11:36

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/10/2009 02:24 AM, Jeremy Fitzhardinge wrote:
> On 10/07/09 03:25, Avi Kivity wrote:
>
>> def try_pvclock_vtime():
>> tsc, p0 = rdtscp()
>> v0 = pvclock[p0].version
>> tsc, p = rdtscp()
>> t = pvclock_time(pvclock[p], tsc)
>> if p != p0 or pvclock[p].version != v0:
>> raise Exception("Processor or timebased change under our feet")
>> return t
>>
> This doesn't quite work.
>
> If we end up migrating some time after the first rdtscp, then the
> accesses to pvclock[] will be cross-cpu. Since we don't made any strong
> SMP memory ordering guarantees on updating the structure, the snapshot
> isn't guaranteed to be consistent even if we re-check the version at the
> end.
>

We only hit this if we have a double migration, otherwise we see p != p0.

Most likely all existing implementations do have a write barrier on the
guest entry path, so if we add a read barrier between the two compares,
that ensures we're reading from the same cpu again.

> So to use rdtscp we need to either redefine the update of
> pvclock_vcpu_time_info to be SMP-safe, or keep the additional migration
> check.
>

I think we can update the ABI after verifying all implementations do
have a write barrier.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-12 18:21:41

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/10/09 11:10, Avi Kivity wrote:
> On 10/10/2009 02:24 AM, Jeremy Fitzhardinge wrote:
>> On 10/07/09 03:25, Avi Kivity wrote:
>>
>>> def try_pvclock_vtime():
>>> tsc, p0 = rdtscp()
>>> v0 = pvclock[p0].version
>>> tsc, p = rdtscp()
>>> t = pvclock_time(pvclock[p], tsc)
>>> if p != p0 or pvclock[p].version != v0:
>>> raise Exception("Processor or timebased change under our feet")
>>> return t
>>>
>> This doesn't quite work.
>>
>> If we end up migrating some time after the first rdtscp, then the
>> accesses to pvclock[] will be cross-cpu. Since we don't made any strong
>> SMP memory ordering guarantees on updating the structure, the snapshot
>> isn't guaranteed to be consistent even if we re-check the version at the
>> end.
>>
>
> We only hit this if we have a double migration, otherwise we see p != p0.
>
> Most likely all existing implementations do have a write barrier on
> the guest entry path, so if we add a read barrier between the two
> compares, that ensures we're reading from the same cpu again.

There's a second problem: If the time_info gets updated between the
first rdtscp and the first version fetch, then we won't have a
consistent tsc,time_info pair. You could check if tsc_timestamp is >
tsc, but that won't necessarily work on save/restore/migrate.

>> So to use rdtscp we need to either redefine the update of
>> pvclock_vcpu_time_info to be SMP-safe, or keep the additional migration
>> check.
>>
>
> I think we can update the ABI after verifying all implementations do
> have a write barrier.
>

I suppose that works if you assume that:

1. every task->vcpu migration is associated with a hv/guest context
switch, and
2. every hv/guest context switch is a write barrier

I guess 2 is a given, but I can at least imagine cases where 1 might not
be true. Maybe. It all seems very subtle.

And I don't really see a gain. You avoid maintaining a second version
number, but at the cost of two rdtscps. In my measurements, the whole
vsyscall takes around 100ns to run, and a single rdtsc takes about 30,
so 30% of total. Unlike rdtsc, rdtscp is documented as being ordered in
the instruction stream, and so will take at least as long; two of them
will completely blow the vsyscall execution time.

(By contrast, lsl only takes around 10ns, which suggests it should be
used preferentially in vgetcpu anyway.)

AMD CPUs have traditionally been much better than Intel at these kinds
of things, so maybe rdtscp makes sense there. Or maybe Nehalem is much
better than my Core2 Q6600.

J

2009-10-12 18:31:37

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/12/2009 08:20 PM, Jeremy Fitzhardinge wrote:
> On 10/10/09 11:10, Avi Kivity wrote:
>
>> On 10/10/2009 02:24 AM, Jeremy Fitzhardinge wrote:
>>
>>> On 10/07/09 03:25, Avi Kivity wrote:
>>>
>>>
>>>> def try_pvclock_vtime():
>>>> tsc, p0 = rdtscp()
>>>> v0 = pvclock[p0].version
>>>> tsc, p = rdtscp()
>>>> t = pvclock_time(pvclock[p], tsc)
>>>> if p != p0 or pvclock[p].version != v0:
>>>> raise Exception("Processor or timebased change under our feet")
>>>> return t
>>>>
> There's a second problem: If the time_info gets updated between the
> first rdtscp and the first version fetch, then we won't have a
> consistent tsc,time_info pair. You could check if tsc_timestamp is>
> tsc, but that won't necessarily work on save/restore/migrate.
>

Good catch. Doesn't that invalidate rdtscp based vgettimeofday on
non-virt as well (assuming p == cpu)?

> I suppose that works if you assume that:
>
> 1. every task->vcpu migration is associated with a hv/guest context
> switch, and
> 2. every hv/guest context switch is a write barrier
>
> I guess 2 is a given, but I can at least imagine cases where 1 might not
> be true. Maybe. It all seems very subtle.
>

What is 1 exactly? task switching to another vcpu? that doesn't incur
hypervisor involvement. vcpu moving to another cpu? That does.

> And I don't really see a gain. You avoid maintaining a second version
> number, but at the cost of two rdtscps. In my measurements, the whole
> vsyscall takes around 100ns to run, and a single rdtsc takes about 30,
> so 30% of total. Unlike rdtsc, rdtscp is documented as being ordered in
> the instruction stream, and so will take at least as long; two of them
> will completely blow the vsyscall execution time.
>

I agree, let's stick with the rdtscpless implementation.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-10-12 19:14:18

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/12/09 11:29, Avi Kivity wrote:
> Good catch. Doesn't that invalidate rdtscp based vgettimeofday on
> non-virt as well (assuming p == cpu)?

The tsc clocksource assumes the tsc is (mostly?) synced; it doesn't use
rdtscp or make any attempt at per-cpu corrections.

>> I suppose that works if you assume that:
>>
>> 1. every task->vcpu migration is associated with a hv/guest context
>> switch, and
>> 2. every hv/guest context switch is a write barrier
>>
>> I guess 2 is a given, but I can at least imagine cases where 1 might not
>> be true. Maybe. It all seems very subtle.
>>
>
> What is 1 exactly? task switching to another vcpu? that doesn't
> incur hypervisor involvement. vcpu moving to another cpu? That does.
Aie... OK. So no barrier is required for a task double migration on
vcpus, because it ends up on the same pcpu and the ordering is local; if
there's a vcpu migration to a new pcpu in there too, then we always
expect a barrier.

>> And I don't really see a gain. You avoid maintaining a second version
>> number, but at the cost of two rdtscps. In my measurements, the whole
>> vsyscall takes around 100ns to run, and a single rdtsc takes about 30,
>> so 30% of total. Unlike rdtsc, rdtscp is documented as being ordered in
>> the instruction stream, and so will take at least as long; two of them
>> will completely blow the vsyscall execution time.
>>
>
> I agree, let's stick with the rdtscpless implementation.

OK, I'll use PeterZ's hint to try and find a more complete set of
migration points.

J

2009-10-13 06:41:15

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/12/2009 09:13 PM, Jeremy Fitzhardinge wrote:
> On 10/12/09 11:29, Avi Kivity wrote:
>
>> Good catch. Doesn't that invalidate rdtscp based vgettimeofday on
>> non-virt as well (assuming p == cpu)?
>>
> The tsc clocksource assumes the tsc is (mostly?) synced; it doesn't use
> rdtscp or make any attempt at per-cpu corrections.
>

So it's broken or disabled when that assumption is wrong? We could
easily fix that now. Might even reuse the pvclock structures.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-10-13 20:00:58

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/12/09 23:39, Avi Kivity wrote:
> On 10/12/2009 09:13 PM, Jeremy Fitzhardinge wrote:
>> On 10/12/09 11:29, Avi Kivity wrote:
>>
>>> Good catch. Doesn't that invalidate rdtscp based vgettimeofday on
>>> non-virt as well (assuming p == cpu)?
>>>
>> The tsc clocksource assumes the tsc is (mostly?) synced; it doesn't use
>> rdtscp or make any attempt at per-cpu corrections.
>>
>
> So it's broken or disabled when that assumption is wrong? We could
> easily fix that now. Might even reuse the pvclock structures.

Well, the kernel internally makes more or less the same assumption; the
vsyscall clocksource is the same as the kernel's internal one. I think
idea is that it just drops back to something like hpet if the tsc
doesn't have very simple SMP characteristics.

If the kernel could characterize the per-cpu properties of the tsc more
accurately, then it could use the pvclock mechanism on native.

J

2009-10-14 12:34:28

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/14/2009 05:00 AM, Jeremy Fitzhardinge wrote:
>
>> So it's broken or disabled when that assumption is wrong? We could
>> easily fix that now. Might even reuse the pvclock structures.
>>
> Well, the kernel internally makes more or less the same assumption; the
> vsyscall clocksource is the same as the kernel's internal one. I think
> idea is that it just drops back to something like hpet if the tsc
> doesn't have very simple SMP characteristics.
>
> If the kernel could characterize the per-cpu properties of the tsc more
> accurately, then it could use the pvclock mechanism on native.
>

It does - that's how kvm implements pvclock on the host side. See
kvmclock_cpufreq_notifier() in arch/x86/kvm/x86.c.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-10-15 19:18:13

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/14/09 05:32, Avi Kivity wrote:
> On 10/14/2009 05:00 AM, Jeremy Fitzhardinge wrote:
>>
>>> So it's broken or disabled when that assumption is wrong? We could
>>> easily fix that now. Might even reuse the pvclock structures.
>>>
>> Well, the kernel internally makes more or less the same assumption; the
>> vsyscall clocksource is the same as the kernel's internal one. I think
>> idea is that it just drops back to something like hpet if the tsc
>> doesn't have very simple SMP characteristics.
>>
>> If the kernel could characterize the per-cpu properties of the tsc more
>> accurately, then it could use the pvclock mechanism on native.
>>
>
> It does - that's how kvm implements pvclock on the host side. See
> kvmclock_cpufreq_notifier() in arch/x86/kvm/x86.c.

The tsc clocksource currently seems a fair bit more picky though; it
doesn't attempt to sync tscs or work out their relative offsets or
rates. Which seems a bit defeatist.

J

2009-10-27 17:31:59

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

Is there any way for an application to conclusively determine
programmatically if the "fast vsyscall" pvclock is functional
vs the much much slower gettimeofday/clock_gettime equivalents?

If not, might it be possible to implement some (sysfs?)
way to determine this, that would also be backwards compatible
to existing OS's that don't have pvclock+vsyscall supported?

Thanks,
Dan

> -----Original Message-----
> From: Avi Kivity [mailto:[email protected]]
> Sent: Wednesday, October 14, 2009 6:33 AM
> To: Jeremy Fitzhardinge
> Cc: Dan Magenheimer; Jeremy Fitzhardinge; Kurt Hackel; the arch/x86
> maintainers; Linux Kernel Mailing List; Glauber de Oliveira Costa;
> Xen-devel; Keir Fraser; Zach Brown; Chris Mason; Ingo Molnar
> Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> implementation
>
>
> On 10/14/2009 05:00 AM, Jeremy Fitzhardinge wrote:
> >
> >> So it's broken or disabled when that assumption is wrong? We could
> >> easily fix that now. Might even reuse the pvclock structures.
> >>
> > Well, the kernel internally makes more or less the same
> assumption; the
> > vsyscall clocksource is the same as the kernel's internal
> one. I think
> > idea is that it just drops back to something like hpet if the tsc
> > doesn't have very simple SMP characteristics.
> >
> > If the kernel could characterize the per-cpu properties of
> the tsc more
> > accurately, then it could use the pvclock mechanism on native.
> >
>
> It does - that's how kvm implements pvclock on the host side. See
> kvmclock_cpufreq_notifier() in arch/x86/kvm/x86.c.
>
> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>
>
>

2009-10-27 18:20:07

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/27/09 10:29, Dan Magenheimer wrote:
> Is there any way for an application to conclusively determine
> programmatically if the "fast vsyscall" pvclock is functional
> vs the much much slower gettimeofday/clock_gettime equivalents?
>
> If not, might it be possible to implement some (sysfs?)
> way to determine this, that would also be backwards compatible
> to existing OS's that don't have pvclock+vsyscall supported?
>

It would probably be simplest and most portable for the app to just
measure the performance of gettimeofday and see if it meets its needs.

J

2009-10-28 05:52:55

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/27/2009 08:20 PM, Jeremy Fitzhardinge wrote:
> On 10/27/09 10:29, Dan Magenheimer wrote:
>
>> Is there any way for an application to conclusively determine
>> programmatically if the "fast vsyscall" pvclock is functional
>> vs the much much slower gettimeofday/clock_gettime equivalents?
>>
>> If not, might it be possible to implement some (sysfs?)
>> way to determine this, that would also be backwards compatible
>> to existing OS's that don't have pvclock+vsyscall supported?
>>
>>
> It would probably be simplest and most portable for the app to just
> measure the performance of gettimeofday and see if it meets its needs.
>

How can you reliably measure performance in a virtualized environment?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-28 09:29:48

by Glauber Costa

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On Wed, Oct 28, 2009 at 07:52:04AM +0200, Avi Kivity wrote:
> On 10/27/2009 08:20 PM, Jeremy Fitzhardinge wrote:
>> On 10/27/09 10:29, Dan Magenheimer wrote:
>>
>>> Is there any way for an application to conclusively determine
>>> programmatically if the "fast vsyscall" pvclock is functional
>>> vs the much much slower gettimeofday/clock_gettime equivalents?
>>>
>>> If not, might it be possible to implement some (sysfs?)
>>> way to determine this, that would also be backwards compatible
>>> to existing OS's that don't have pvclock+vsyscall supported?
>>>
>>>
>> It would probably be simplest and most portable for the app to just
>> measure the performance of gettimeofday and see if it meets its needs.
>>
>
> How can you reliably measure performance in a virtualized environment?
If we loop gettimeofday(), I would expect the vsyscall-based version not to show
up in strace, right?

2009-10-28 09:34:36

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/28/2009 11:29 AM, Glauber Costa wrote:
>> How can you reliably measure performance in a virtualized environment?
>>
> If we loop gettimeofday(), I would expect the vsyscall-based version not to show
> up in strace, right?
>

Much better to have an API for this. Life is hacky enough already.

--
error compiling committee.c: too many arguments to function

2009-10-28 17:47:37

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/28/09 02:34, Avi Kivity wrote:
> On 10/28/2009 11:29 AM, Glauber Costa wrote:
>>> How can you reliably measure performance in a virtualized environment?
>>>
>> If we loop gettimeofday(), I would expect the vsyscall-based version
>> not to show
>> up in strace, right?
>>
>
> Much better to have an API for this. Life is hacky enough already.

My point is that if an app cares about property X then it should just
measure property X. The fact that gettimeofday is a vsyscall is just an
implementation detail that apps don't really care about. What they care
about is whether gettimeofday is fast or not.

If the environment has such unstable timing that the effect can't be
measured, then it is moot whether its a vsyscall or not (but in that
case its almost certainly better to use the standard API rather than
trying to roll your own timesource with rdtsc).

J

2009-10-29 12:14:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/28/2009 07:47 PM, Jeremy Fitzhardinge wrote:
>> Much better to have an API for this. Life is hacky enough already.
>>
> My point is that if an app cares about property X then it should just
> measure property X. The fact that gettimeofday is a vsyscall is just an
> implementation detail that apps don't really care about. What they care
> about is whether gettimeofday is fast or not.
>

But we can not make a reliable measurement.

> If the environment has such unstable timing that the effect can't be
> measured, then it is moot whether its a vsyscall or not (but in that
> case its almost certainly better to use the standard API rather than
> trying to roll your own timesource with rdtsc).
>

If you're interested in gettimeofday() for a global monotonic counter
you can fall back to atomic_fetch_and_add() which will be faster than a
syscall even on large systems. Maybe we should provide a vsyscall for
global monotonic counters and implement it using a atomics or tsc
instead of these hacks (I'm assuming here that the gettimeofday() calls
are used to implement an atomic counter - are they?)

--
error compiling committee.c: too many arguments to function

2009-10-29 13:03:56

by Chris Mason

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On Thu, Oct 29, 2009 at 02:13:50PM +0200, Avi Kivity wrote:
> On 10/28/2009 07:47 PM, Jeremy Fitzhardinge wrote:
> >>Much better to have an API for this. Life is hacky enough already.
> >My point is that if an app cares about property X then it should just
> >measure property X. The fact that gettimeofday is a vsyscall is just an
> >implementation detail that apps don't really care about. What they care
> >about is whether gettimeofday is fast or not.
>
> But we can not make a reliable measurement.

I can't imagine how we'd decide what fast is? Please don't make the
applications guess.

-chris

2009-10-29 14:47:11

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> From: Avi Kivity [mailto:[email protected]]
>
> On 10/28/2009 07:47 PM, Jeremy Fitzhardinge wrote:
> >> Much better to have an API for this. Life is hacky enough already.
> >>
> > My point is that if an app cares about property X then it
> should just
> > measure property X. The fact that gettimeofday is a
> vsyscall is just an
> > implementation detail that apps don't really care about.
> What they care
> > about is whether gettimeofday is fast or not.
> >
>
> But we can not make a reliable measurement.
>
> > If the environment has such unstable timing that the effect can't be
> > measured, then it is moot whether its a vsyscall or not (but in that
> > case its almost certainly better to use the standard API rather than
> > trying to roll your own timesource with rdtsc).
> >
>
> If you're interested in gettimeofday() for a global monotonic counter
> you can fall back to atomic_fetch_and_add() which will be
> faster than a
> syscall even on large systems. Maybe we should provide a
> vsyscall for
> global monotonic counters and implement it using a atomics or tsc
> instead of these hacks (I'm assuming here that the
> gettimeofday() calls
> are used to implement an atomic counter - are they?)

No, the apps I'm familiar with (a DB and a JVM) need a timestamp
not a monotonic counter. The timestamps must be relatively
accurate (e.g. we've been talking about gettimeofday generically,
but these apps would use clock_gettime for nsec resolution),
monotonically increasing, and work properly across a VM
migration. The timestamps are taken up to a 100K/sec or
more so the apps need to ensure they are using the fastest
mechanism available that meets those requirements.

2009-10-29 15:07:56

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/29/2009 04:46 PM, Dan Magenheimer wrote:
> No, the apps I'm familiar with (a DB and a JVM) need a timestamp
> not a monotonic counter. The timestamps must be relatively
> accurate (e.g. we've been talking about gettimeofday generically,
> but these apps would use clock_gettime for nsec resolution),
> monotonically increasing, and work properly across a VM
> migration. The timestamps are taken up to a 100K/sec or
> more so the apps need to ensure they are using the fastest
> mechanism available that meets those requirements.
>

Out of interest, do you know (and can you relate) why those apps need
100k/sec monotonically increasing timestamps?

--
error compiling committee.c: too many arguments to function

2009-10-29 15:57:07

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> From: Avi Kivity [mailto:[email protected]]
> Sent: Thursday, October 29, 2009 9:07 AM
> To: Dan Magenheimer
> Cc: Jeremy Fitzhardinge; Glauber Costa; Jeremy Fitzhardinge; Kurt
> Hackel; the arch/x86 maintainers; Linux Kernel Mailing List;
> Glauber de
> Oliveira Costa; Xen-devel; Keir Fraser; Zach Brown; Chris Mason; Ingo
> Molnar
> Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> implementation
>
>
> On 10/29/2009 04:46 PM, Dan Magenheimer wrote:
> > No, the apps I'm familiar with (a DB and a JVM) need a timestamp
> > not a monotonic counter. The timestamps must be relatively
> > accurate (e.g. we've been talking about gettimeofday generically,
> > but these apps would use clock_gettime for nsec resolution),
> > monotonically increasing, and work properly across a VM
> > migration. The timestamps are taken up to a 100K/sec or
> > more so the apps need to ensure they are using the fastest
> > mechanism available that meets those requirements.
>
> Out of interest, do you know (and can you relate) why those apps need
> 100k/sec monotonically increasing timestamps?

I don't have any public data available for this DB usage, but basically
assume it is measuring transactions at a very high throughput, some
of which are to a memory-resident portion of the DB. Anecdotally,
I'm told the difference between non-vsyscall gettimeofday
and native rdtsc (on a machine with Invariant TSC support) can
affect overall DB performance by as much as 10-20%.

I did find the following public link for the JVM:

http://download.oracle.com/docs/cd/E13188_01/jrockit/tools/intro/jmc3.html

Search for "flight recorder". This feature is intended to
be enabled all the time, but with non-vsyscall gettimeofday
the performance impact is unacceptably high, so they are using
rdtscp instead (on those machines where it is available). With
rdtscp, the performance impact is not measureable.

Though the processor/server vendors have finally fixed the
"unsynced TSC" problem on recent x86 platforms, thus allowing
enterprise software to obtain timestamps at rdtsc performance,
the problem comes back all over again with virtualization
because of migration. Jeremy's vsyscall+pvclock is a great
solution if the app can ensure that it is present; if not,
the apps will instead continue to use rdtsc as even emulated
rdtsc is 2-3x faster than non-vsyscall gettimeofday.

Does that help?

2009-10-29 16:17:08

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On a related note, though some topic drift, many of
the problems that occur in virtualization due to migration
could be better addressed if Linux had an architected
interface to allow it to be signaled if a migration
occurred, and if Linux could signal applications of
the same. I don't have any cycles (pun intended) to
think about this right now, but if anyone else starts
looking at it, I'd love to be cc'ed.

Thanks,
Dan

> -----Original Message-----
> From: Dan Magenheimer
> Sent: Thursday, October 29, 2009 9:56 AM
> To: Avi Kivity
> Cc: Jeremy Fitzhardinge; Jeremy Fitzhardinge; Kurt Hackel; Glauber
> Costa; the arch/x86 maintainers; Linux Kernel Mailing List; Glauber de
> Oliveira Costa; Xen-devel; Keir Fraser; Zach Brown; Ingo Molnar; Chris
> Mason
> Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> implementation
>
>
> > From: Avi Kivity [mailto:[email protected]]
> > Sent: Thursday, October 29, 2009 9:07 AM
> > To: Dan Magenheimer
> > Cc: Jeremy Fitzhardinge; Glauber Costa; Jeremy Fitzhardinge; Kurt
> > Hackel; the arch/x86 maintainers; Linux Kernel Mailing List;
> > Glauber de
> > Oliveira Costa; Xen-devel; Keir Fraser; Zach Brown; Chris
> Mason; Ingo
> > Molnar
> > Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> > implementation
> >
> >
> > On 10/29/2009 04:46 PM, Dan Magenheimer wrote:
> > > No, the apps I'm familiar with (a DB and a JVM) need a timestamp
> > > not a monotonic counter. The timestamps must be relatively
> > > accurate (e.g. we've been talking about gettimeofday generically,
> > > but these apps would use clock_gettime for nsec resolution),
> > > monotonically increasing, and work properly across a VM
> > > migration. The timestamps are taken up to a 100K/sec or
> > > more so the apps need to ensure they are using the fastest
> > > mechanism available that meets those requirements.
> >
> > Out of interest, do you know (and can you relate) why those
> apps need
> > 100k/sec monotonically increasing timestamps?
>
> I don't have any public data available for this DB usage, but
> basically
> assume it is measuring transactions at a very high throughput, some
> of which are to a memory-resident portion of the DB. Anecdotally,
> I'm told the difference between non-vsyscall gettimeofday
> and native rdtsc (on a machine with Invariant TSC support) can
> affect overall DB performance by as much as 10-20%.
>
> I did find the following public link for the JVM:
>
> http://download.oracle.com/docs/cd/E13188_01/jrockit/tools/int
ro/jmc3.html

Search for "flight recorder". This feature is intended to
be enabled all the time, but with non-vsyscall gettimeofday
the performance impact is unacceptably high, so they are using
rdtscp instead (on those machines where it is available). With
rdtscp, the performance impact is not measureable.

Though the processor/server vendors have finally fixed the
"unsynced TSC" problem on recent x86 platforms, thus allowing
enterprise software to obtain timestamps at rdtsc performance,
the problem comes back all over again with virtualization
because of migration. Jeremy's vsyscall+pvclock is a great
solution if the app can ensure that it is present; if not,
the apps will instead continue to use rdtsc as even emulated
rdtsc is 2-3x faster than non-vsyscall gettimeofday.

Does that help?

_______________________________________________
Xen-devel mailing list
[email protected]
http://lists.xensource.com/xen-devel

2009-11-01 09:29:35

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/29/2009 06:15 PM, Dan Magenheimer wrote:
> On a related note, though some topic drift, many of
> the problems that occur in virtualization due to migration
> could be better addressed if Linux had an architected
> interface to allow it to be signaled if a migration
> occurred, and if Linux could signal applications of
> the same. I don't have any cycles (pun intended) to
> think about this right now, but if anyone else starts
> looking at it, I'd love to be cc'ed.
>

IMO that's not a good direction. The hypervisor should not depend on
the guest for migration (the guest may be broken, or malicious, or being
debugged, or slow). So the notification must be asynchronous, which
means that it will only be delivered to applications after migration has
completed.

Instead of a "migration has occured, run for the hills" signal we're
better of finding out why applications want to know about this event and
addressing specific needs.

--
error compiling committee.c: too many arguments to function

2009-11-01 09:33:28

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 10/29/2009 05:55 PM, Dan Magenheimer wrote:
>> From: Avi Kivity [mailto:[email protected]]
>> Sent: Thursday, October 29, 2009 9:07 AM
>> To: Dan Magenheimer
>> Cc: Jeremy Fitzhardinge; Glauber Costa; Jeremy Fitzhardinge; Kurt
>> Hackel; the arch/x86 maintainers; Linux Kernel Mailing List;
>> Glauber de
>> Oliveira Costa; Xen-devel; Keir Fraser; Zach Brown; Chris Mason; Ingo
>> Molnar
>> Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
>> implementation
>>
>>
>> On 10/29/2009 04:46 PM, Dan Magenheimer wrote:
>>
>>> No, the apps I'm familiar with (a DB and a JVM) need a timestamp
>>> not a monotonic counter. The timestamps must be relatively
>>> accurate (e.g. we've been talking about gettimeofday generically,
>>> but these apps would use clock_gettime for nsec resolution),
>>> monotonically increasing, and work properly across a VM
>>> migration. The timestamps are taken up to a 100K/sec or
>>> more so the apps need to ensure they are using the fastest
>>> mechanism available that meets those requirements.
>>>
>> Out of interest, do you know (and can you relate) why those apps need
>> 100k/sec monotonically increasing timestamps?
>>
> I don't have any public data available for this DB usage, but basically
> assume it is measuring transactions at a very high throughput, some
> of which are to a memory-resident portion of the DB. Anecdotally,
> I'm told the difference between non-vsyscall gettimeofday
> and native rdtsc (on a machine with Invariant TSC support) can
> affect overall DB performance by as much as 10-20%.
>

Sorry, that doesn't explain anything.

> I did find the following public link for the JVM:
>
> http://download.oracle.com/docs/cd/E13188_01/jrockit/tools/intro/jmc3.html
>
> Search for "flight recorder". This feature is intended to
> be enabled all the time, but with non-vsyscall gettimeofday
> the performance impact is unacceptably high, so they are using
> rdtscp instead (on those machines where it is available). With
> rdtscp, the performance impact is not measureable.
>
> Though the processor/server vendors have finally fixed the
> "unsynced TSC" problem on recent x86 platforms, thus allowing
> enterprise software to obtain timestamps at rdtsc performance,
> the problem comes back all over again with virtualization
> because of migration. Jeremy's vsyscall+pvclock is a great
> solution if the app can ensure that it is present; if not,
> the apps will instead continue to use rdtsc as even emulated
> rdtsc is 2-3x faster than non-vsyscall gettimeofday.
>
> Does that help?
>

For profiling work fast timestamping is of course great, but surely
there is no monotonicity requirement?

I don't think we'll be able to provide monotonicity with vsyscall on
tsc-broken hosts, so we'll be limited to correcting the tsc frequency
after migration for good-tsc hosts.

--
error compiling committee.c: too many arguments to function

2009-11-02 15:29:42

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> From: Avi Kivity [mailto:[email protected]]
>
> On 10/29/2009 06:15 PM, Dan Magenheimer wrote:
> > On a related note, though some topic drift, many of
> > the problems that occur in virtualization due to migration
> > could be better addressed if Linux had an architected
> > interface to allow it to be signaled if a migration
> > occurred, and if Linux could signal applications of
> > the same. I don't have any cycles (pun intended) to
> > think about this right now, but if anyone else starts
> > looking at it, I'd love to be cc'ed.
>
> IMO that's not a good direction. The hypervisor should not depend on
> the guest for migration (the guest may be broken, or
> malicious, or being
> debugged, or slow). So the notification must be asynchronous, which
> means that it will only be delivered to applications after
> migration has
> completed.

I definitely agree that the hypervisor can't wait for a guest
to respond.

You've likely thought through this a lot more than I have,
but I was thinking that if the kernel received the notification
as some form of interrupt, it could determine immediately
if any running threads had registered for "SIG_MIGRATE"
and deliver the signal synchronously.

> Instead of a "migration has occured, run for the hills" signal we're
> better of finding out why applications want to know about
> this event and
> addressing specific needs.

Perhaps. It certainly isn't warranted for this one
special case of timestamp handling. But I'll bet 5-10 years
from now, after we've handled a few special cases, we'll
wish that we would have handled it more generically.

Dan

2009-11-02 15:42:00

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 11/02/2009 05:28 PM, Dan Magenheimer wrote:
>> From: Avi Kivity [mailto:[email protected]]
>>
>> On 10/29/2009 06:15 PM, Dan Magenheimer wrote:
>>
>>> On a related note, though some topic drift, many of
>>> the problems that occur in virtualization due to migration
>>> could be better addressed if Linux had an architected
>>> interface to allow it to be signaled if a migration
>>> occurred, and if Linux could signal applications of
>>> the same. I don't have any cycles (pun intended) to
>>> think about this right now, but if anyone else starts
>>> looking at it, I'd love to be cc'ed.
>>>
>> IMO that's not a good direction. The hypervisor should not depend on
>> the guest for migration (the guest may be broken, or
>> malicious, or being
>> debugged, or slow). So the notification must be asynchronous, which
>> means that it will only be delivered to applications after
>> migration has
>> completed.
>>
> I definitely agree that the hypervisor can't wait for a guest
> to respond.
>
> You've likely thought through this a lot more than I have,
> but I was thinking that if the kernel received the notification
> as some form of interrupt, it could determine immediately
> if any running threads had registered for "SIG_MIGRATE"
> and deliver the signal synchronously.
>

Interrupts cannot be delivered immediately. Exceptions can, but not all
guest code is prepared to handle them. Once you start to handle the
exception, migration is complete and you are late.


>> Instead of a "migration has occured, run for the hills" signal we're
>> better of finding out why applications want to know about
>> this event and
>> addressing specific needs.
>>
> Perhaps. It certainly isn't warranted for this one
> special case of timestamp handling. But I'll bet 5-10 years
> from now, after we've handled a few special cases, we'll
> wish that we would have handled it more generically.
>

Or we'll find that backwards compatibility for the generic signal is
killing some optimization.

--
error compiling committee.c: too many arguments to function

2009-11-02 15:47:19

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> > I don't have any public data available for this DB usage,
>
> Sorry, that doesn't explain anything.

Well for now just consider the DB usage as another use
of profiling. But one can easily draw scenarios where
a monotonic timestamp is also used to guarantee transaction
ordering.

> > Search for "flight recorder". This feature is intended to
> > be enabled all the time, but with non-vsyscall gettimeofday
> > the performance impact is unacceptably high, so they are using
>
> For profiling work fast timestamping is of course great, but surely
> there is no monotonicity requirement?

Yes and no. Monotonicity is a poor substitute for a more
generic mechanism that might provide an indication that a
discontinuity has occurred (forward or backward); if an app
could get both the timestamp AND some kind of "continuity
generation counter" (basically a much more sophisticated
form of TSC_AUX that changes whenever the timestamp is
coming from a different source), perhaps all problems could be solved.

> I don't think we'll be able to provide monotonicity with vsyscall on
> tsc-broken hosts, so we'll be limited to correcting the tsc frequency
> after migration for good-tsc hosts.

True, though clock_gettime(CLOCK_MONOTONIC) can provide
the monotonicity where it is required.

Dan

2009-11-03 05:12:36

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 11/02/2009 05:46 PM, Dan Magenheimer wrote:
>>> I don't have any public data available for this DB usage,
>>>
>> Sorry, that doesn't explain anything.
>>
> Well for now just consider the DB usage as another use
> of profiling. But one can easily draw scenarios where
> a monotonic timestamp is also used to guarantee transaction
> ordering.
>

In this case we should provide a facility for this. Providing a global
monotonic counter may be easier than providing a monotonic clock. Hence
my question.

>>> Search for "flight recorder". This feature is intended to
>>> be enabled all the time, but with non-vsyscall gettimeofday
>>> the performance impact is unacceptably high, so they are using
>>>
>> For profiling work fast timestamping is of course great, but surely
>> there is no monotonicity requirement?
>>
> Yes and no. Monotonicity is a poor substitute for a more
> generic mechanism that might provide an indication that a
> discontinuity has occurred (forward or backward); if an app
> could get both the timestamp AND some kind of "continuity
> generation counter" (basically a much more sophisticated
> form of TSC_AUX that changes whenever the timestamp is
> coming from a different source), perhaps all problems could be solved.
>

I doubt it. A discontinuity has occured, but what do we know about it?
nothing.

>> I don't think we'll be able to provide monotonicity with vsyscall on
>> tsc-broken hosts, so we'll be limited to correcting the tsc frequency
>> after migration for good-tsc hosts.
>>
> True, though clock_gettime(CLOCK_MONOTONIC) can provide
> the monotonicity where it is required.
>

We have that already. The question is how to implement it in a vsyscall.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-11-04 20:31:34

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> From: Avi Kivity [mailto:[email protected]]
> Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> implementation
>
> On 11/02/2009 05:46 PM, Dan Magenheimer wrote:
> >>> I don't have any public data available for this DB usage,
> >>>
> >> Sorry, that doesn't explain anything.
> >>
> > Well for now just consider the DB usage as another use
> > of profiling. But one can easily draw scenarios where
> > a monotonic timestamp is also used to guarantee transaction
> > ordering.
> >
>
> In this case we should provide a facility for this.
> Providing a global
> monotonic counter may be easier than providing a monotonic
> clock. Hence
> my question.

Maybe I'm misunderstanding something, but enterprise
apps can do this entirely on their own without any
kernel help, correct? Or are you trying to provide
it across guests, e.g. for clusters or something?

> >> For profiling work fast timestamping is of course great, but surely
> >> there is no monotonicity requirement?
> >>
> > Yes and no. Monotonicity is a poor substitute for a more
> > generic mechanism that might provide an indication that a
> > discontinuity has occurred (forward or backward); if an app
> > could get both the timestamp AND some kind of "continuity
> > generation counter" (basically a much more sophisticated
> > form of TSC_AUX that changes whenever the timestamp is
> > coming from a different source), perhaps all problems could
> be solved.
>
> I doubt it. A discontinuity has occured, but what do we know
> about it? nothing.

Actually, I think for many/most profiling applications,
just knowing a discontinuity occurred between two
timestamps is very useful as that one specific measurement
can be discarded. If a discontinuity is invisible,
one clearly knows that a negative interval is bad,
but if an interval is very small or very large,
one never knows if it is due to a discontinuity or
due to some other reason.

This would argue for a syscall/vsyscall that can
"return" two values: the "time" and a second
"continuity generation" counter.

> >> I don't think we'll be able to provide monotonicity with
> vsyscall on
> >> tsc-broken hosts, so we'll be limited to correcting the
> tsc frequency
> >> after migration for good-tsc hosts.
> >>
> > True, though clock_gettime(CLOCK_MONOTONIC) can provide
> > the monotonicity where it is required.
>
> We have that already. The question is how to implement it in
> a vsyscall.

Oh, I see. I missed that very crucial point.

So, just to verify/clarify... There is NO WAY for
a vsyscall to ensure monotonicity (presumably because
the previous reading can't be safely stored?). So
speed and "correctness" are mutually exclusive?

If true, yes, that's a potentially significant problem\
though an intelligent app can layer monotonicity
on top of the call I suppose.

2009-11-04 21:19:36

by john stultz

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On Thu, Oct 29, 2009 at 7:07 AM, Avi Kivity <[email protected]> wrote:
> On 10/29/2009 04:46 PM, Dan Magenheimer wrote:
>>
>> No, the apps I'm familiar with (a DB and a JVM) need a timestamp
>> not a monotonic counter. ?The timestamps must be relatively
>> accurate (e.g. we've been talking about gettimeofday generically,
>> but these apps would use clock_gettime for nsec resolution),
>> monotonically increasing, and work properly across a VM
>> migration. ?The timestamps are taken up to a 100K/sec or
>> more so the apps need to ensure they are using the fastest
>> mechanism available that meets those requirements.
>>
>
> Out of interest, do you know (and can you relate) why those apps need
> 100k/sec monotonically increasing timestamps?

This is sort of tangential, but depending on the need, this might be
of interest: Recently I've added a new clock_id,
CLOCK_MONOTONIC_COARSE (as well as CLOCK_REALTIME_COARSE), which
return a HZ granular timestamp (same granularity as filesystem
timestamps). Its very fast to access, since there's no hardware to
touch, and is accessible via vsyscall.

The idea being, if your hitting clock_gettime 100k/sec but you really
don't have the need for nsec granular timestamps, it might provide a
really nice performance boost.

Here's the commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=da15cfdae03351c689736f8d142618592e3cebc3

thanks
-john

2009-11-04 21:29:36

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> From: john stultz [mailto:[email protected]]
> On Thu, Oct 29, 2009 at 7:07 AM, Avi Kivity <[email protected]> wrote:
> >
> > Out of interest, do you know (and can you relate) why those
> apps need
> > 100k/sec monotonically increasing timestamps?
>
> This is sort of tangential, but depending on the need, this might be
> of interest: Recently I've added a new clock_id,
> CLOCK_MONOTONIC_COARSE (as well as CLOCK_REALTIME_COARSE), which
> return a HZ granular timestamp (same granularity as filesystem
> timestamps). Its very fast to access, since there's no hardware to
> touch, and is accessible via vsyscall.
>
> The idea being, if your hitting clock_gettime 100k/sec but you really
> don't have the need for nsec granular timestamps, it might provide a
> really nice performance boost.
>
> Here's the commit:

Hi John --

Yes, possibly of interest. But does it work with CONFIG_NO_HZ?
(I'm expecting that over time NO_HZ will become widespread
for VM OS's, though interested in if you agree.)

Also very interested in your thoughts about a variation
that returns something similar to a TSC_AUX to notify
caller that the underlying reference clock has/may have
changed.

Dan

2009-11-05 00:02:23

by john stultz

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On Wed, 2009-11-04 at 13:28 -0800, Dan Magenheimer wrote:
> > From: john stultz [mailto:[email protected]]
> > On Thu, Oct 29, 2009 at 7:07 AM, Avi Kivity <[email protected]> wrote:
> > >
> > > Out of interest, do you know (and can you relate) why those
> > apps need
> > > 100k/sec monotonically increasing timestamps?
> >
> > This is sort of tangential, but depending on the need, this might be
> > of interest: Recently I've added a new clock_id,
> > CLOCK_MONOTONIC_COARSE (as well as CLOCK_REALTIME_COARSE), which
> > return a HZ granular timestamp (same granularity as filesystem
> > timestamps). Its very fast to access, since there's no hardware to
> > touch, and is accessible via vsyscall.
> >
> > The idea being, if your hitting clock_gettime 100k/sec but you really
> > don't have the need for nsec granular timestamps, it might provide a
> > really nice performance boost.
> >
> > Here's the commit:
>
> Hi John --
>
> Yes, possibly of interest. But does it work with CONFIG_NO_HZ?
> (I'm expecting that over time NO_HZ will become widespread
> for VM OS's, though interested in if you agree.)

It should work, with CONFIG_NO_HZ, as soon as we come out of a long idle
(likely due to a timer tick), the timekeeping code will accumulate all
the skipped ticks.

If we ever get to non-idle NOHZ, we'll need some extra work here
(probably lazy accumulation done conditionally in the read path), but
that's also true for filesystem timestamps.


> Also very interested in your thoughts about a variation
> that returns something similar to a TSC_AUX to notify
> caller that the underlying reference clock has/may have
> changed.

I haven't been following that closely. Personally, experience makes me
skeptical of workarounds for unsynced TSCs. But I'm sure there's sharper
folks out there that might make it work. The kernel just requires that
it *really really* works, and not "mostly" works. :)

thanks
-john



2009-11-05 00:47:17

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> > Yes, possibly of interest. But does it work with CONFIG_NO_HZ?
> > (I'm expecting that over time NO_HZ will become widespread
> > for VM OS's, though interested in if you agree.)
>
> It should work, with CONFIG_NO_HZ, as soon as we come out of
> a long idle
> (likely due to a timer tick), the timekeeping code will accumulate all
> the skipped ticks.
>
> If we ever get to non-idle NOHZ, we'll need some extra work here
> (probably lazy accumulation done conditionally in the read path), but
> that's also true for filesystem timestamps.

OK, sounds good.

> > Also very interested in your thoughts about a variation
> > that returns something similar to a TSC_AUX to notify
> > caller that the underlying reference clock has/may have
> > changed.
>
> I haven't been following that closely. Personally, experience makes me
> skeptical of workarounds for unsynced TSCs. But I'm sure
> there's sharper
> folks out there that might make it work. The kernel just requires that
> it *really really* works, and not "mostly" works. :)

This is less a workaround for unsynced TSCs than it
is for VM migration (and maybe also time where a
VM is out-of-context or moved to a different pcpu)
though it could probably
be made to work on unsynced TSC boxes also.
Basically an application needing hi-res profiling
info would do:

nsec1 = clock_gettime2(MONOTONIC,&aux1);
(time passes)
nsec2 = clock_gettime2(MONOTONIC,&aux2);
if (aux1 != aux2)
discard_measurement();
else
use_measurement(nsec2-nsec1);

and system software (hypervisor or kernel or
both) is responsible for ensuring aux value
monotonically increases whenever a different
crystal is used.

Without something like this as a vsyscall,
apps will just use rdtscp (which must be emulated
to work properly across a migration).

2009-11-05 06:47:47

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 11/04/2009 10:30 PM, Dan Magenheimer wrote:
>>
>> In this case we should provide a facility for this.
>> Providing a global
>> monotonic counter may be easier than providing a monotonic
>> clock. Hence
>> my question.
>>
> Maybe I'm misunderstanding something, but enterprise
> apps can do this entirely on their own without any
> kernel help, correct?
>

Within a process, yes. Across processes, not without writable shared
memory.

That's why I'm trying to understand what the actual requirements are.
Real monotonic, accurate, high resolution, low cost time sources are
hard to come by.

>> I doubt it. A discontinuity has occured, but what do we know
>> about it? nothing.
>>
> Actually, I think for many/most profiling applications,
> just knowing a discontinuity occurred between two
> timestamps is very useful as that one specific measurement
> can be discarded. If a discontinuity is invisible,
> one clearly knows that a negative interval is bad,
> but if an interval is very small or very large,
> one never knows if it is due to a discontinuity or
> due to some other reason.
>
> This would argue for a syscall/vsyscall that can
> "return" two values: the "time" and a second
> "continuity generation" counter.
>
>

I doubt it. You should expect discontinuities in user space due to
being swapped out, scheduled out, migrated to a different cpu, or your
laptop lid being closed. There are no guarantees to a userspace
application. Even the kernel can expect discontinuities due to SMIs.
So an explicit notification about one type of discontinuity adds nothing.

>>> True, though clock_gettime(CLOCK_MONOTONIC) can provide
>>> the monotonicity where it is required.
>>>
>> We have that already. The question is how to implement it in
>> a vsyscall.
>>
> Oh, I see. I missed that very crucial point.
>
> So, just to verify/clarify... There is NO WAY for
> a vsyscall to ensure monotonicity (presumably because
> the previous reading can't be safely stored?). So
> speed and "correctness" are mutually exclusive?
>

Yes.

> If true, yes, that's a potentially significant problem\
> though an intelligent app can layer monotonicity
> on top of the call I suppose.
>

Unless it's a multi-process app with limited trust.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-11-05 14:55:15

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

> From: Avi Kivity [mailto:[email protected]]
>
> Within a process, yes. Across processes, not without writable shared
> memory.
>
> That's why I'm trying to understand what the actual
> requirements are.
> Real monotonic, accurate, high resolution, low cost time sources are
> hard to come by.

Hmmm... this has significant implications for the rdtsc
emulation discussion on xen-devel. Since that's not
a Linux question, I'll start another thread on xen-devel
with a shorter cc list.

> > Actually, I think for many/most profiling applications,
> > just knowing a discontinuity occurred between two
> > timestamps is very useful as that one specific measurement
> > can be discarded. If a discontinuity is invisible,
> > one clearly knows that a negative interval is bad,
> > but if an interval is very small or very large,
> > one never knows if it is due to a discontinuity or
> > due to some other reason.
> >
> > This would argue for a syscall/vsyscall that can
> > "return" two values: the "time" and a second
> > "continuity generation" counter.
>
> I doubt it. You should expect discontinuities in user space due to
> being swapped out, scheduled out, migrated to a different
> cpu, or your
> laptop lid being closed. There are no guarantees to a userspace
> application. Even the kernel can expect discontinuities due
> to SMIs.
> So an explicit notification about one type of discontinuity
> adds nothing.

Good point. I'm interested in enterprise apps that have more
control over the machine (and rarely suffer from laptop lid
closures :-) and would intend for all discontinuities visible
to a hypervisor or kernel to increment "AUX", but bare-metal-
kernel-invisible discontinuities such as SMI do throw a wrench
in the works.

Well, all this discussion has convince me that
my original proposals do make sense for enterprise apps to be
virtualization-aware and use rdtsc/p directly for timestamping
needs rather than OS APIs (with the hypervisor deciding
whether or not to emulate rdtsc/p based on the underlying
physical machine and whether or not migration is enabled
or has occurred).

2009-11-05 15:18:17

by Keir Fraser

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation

On 05/11/2009 14:52, "Dan Magenheimer" <[email protected]> wrote:

> Well, all this discussion has convince me that
> my original proposals do make sense

You surprise me, Dan. ;-)

-- Keir