2009-10-14 19:29:19

by Jeremy Fitzhardinge

Subject: [GIT PULL RFC] pvclock cleanups and pvclock vsyscall support

Hi all,

This series contains several things:

- Unify the separate vdso and vsyscall implementations of vgettimeofday and
vgetcpu. There's at least one bugfix which was only applied to one copy
(ignore tcache in vgetcpu was only applied to the vdso version); this
should avoid such skews in future.

- Bug fixes for the Xen and KVM clocksource.read functions to make sure
the returned time doesn't regress compared to clocksource.cycle_last.
(Probably stable material.)

- Make sure the pvclock rdtsc is surrounded by appropriate barriers so
that it doesn't get speculated to the wrong place with respect to reading
the time parameters. (Probably stable material.)

- General cleanups of the pvclock algorithm (there's no need to make a local
copy of the time parameters before use).

- Add a new CONFIG_X86_VSYSCALL to control the compilation of vsyscall-related
code rather than just using CONFIG_X86_64 - we may want to implement 32-bit
vsyscall at some point, and this will make it easier.

- Add the sched notifier for task migration between CPUs, for use by
pvclock vread.

- Implement a pvclock vread function, so that pvclock-using clocksources can be
used by vsyscall/vdso vgettimeofday and vclock_gettime.

- Use pvclock vread in the Xen clocksource.
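
For orientation, the usermode read loop that the last three items build up
to looks roughly like this (a compressed sketch using names introduced in
patches 10 and 12; "pvti_array" here stands in for the fixmap-mapped
per-vcpu array):

    cycle_t pvclock_vread_sketch(void)
    {
            const struct pvclock_vsyscall_time_info *pvti;
            unsigned cpu, cpu1;
            u32 version, migrate_count;
            cycle_t ret;

            do {
                    /* find the time parameters for the vcpu we're on */
                    vgetcpu(&cpu, NULL, NULL);
                    pvti = &pvti_array[cpu];

                    migrate_count = pvti->migrate_count;
                    version = __pvclock_read_cycles(&pvti->pvti, &ret);

                    /* recheck the cpu in case we migrated mid-read */
                    vgetcpu(&cpu1, NULL, NULL);
            } while (cpu != cpu1 ||
                     pvti->pvti.version != version ||
                     pvti->migrate_count != migrate_count);

            return ret;
    }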

The following changes since commit 74fca6a42863ffacaf7ba6f1936a9f228950f657:
Linus Torvalds (1):
Linux 2.6.31

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git xen/vsyscall

Jeremy Fitzhardinge (12):
x86/vsyscall: use common implementation for vgetcpu
x86/vgetcpu: ignore tcache in common code.
x86/vsyscall: use common implementation for vgettimeofday
kvm/pvclock: add monotonicity check
xen/pvclock: add monotonicity check
x86/pvclock: make sure rdtsc doesn't speculate out of region
pvclock: there's no need to copy time_info into shadow
x86: create specific X86_VSYSCALL config variable
sched: add notifier for cross-cpu migrations
x86/pvclock: add vsyscall implementation
x86/fixmap: add a predicate for usermode fixmaps
xen/time: add pvclock_clocksource_vread support

arch/x86/Kconfig | 10 ++-
arch/x86/include/asm/fixmap.h | 25 +++++-
arch/x86/include/asm/pvclock.h | 6 ++
arch/x86/include/asm/vgtod.h | 64 +++++++++++++-
arch/x86/include/asm/vsyscall.h | 30 ++++++
arch/x86/kernel/Makefile | 5 +-
arch/x86/kernel/hpet.c | 4 +-
arch/x86/kernel/kvmclock.c | 7 ++-
arch/x86/kernel/pvclock.c | 186 +++++++++++++++++++++++++++++---------
arch/x86/kernel/tsc.c | 4 +-
arch/x86/kernel/vsyscall_64.c | 91 +-------------------
arch/x86/vdso/vclock_gettime.c | 46 +---------
arch/x86/vdso/vgetcpu.c | 17 +---
arch/x86/xen/Kconfig | 6 ++
arch/x86/xen/mmu.c | 5 +-
arch/x86/xen/smp.c | 2 +
arch/x86/xen/time.c | 61 +++++++++++++-
arch/x86/xen/xen-ops.h | 8 ++
include/linux/sched.h | 8 ++
include/xen/interface/vcpu.h | 41 +++++++++
kernel/sched.c | 14 +++
21 files changed, 438 insertions(+), 202 deletions(-)


2009-10-14 19:29:23

by Jeremy Fitzhardinge

Subject: [PATCH 01/12] x86/vsyscall: use common implementation for vgetcpu

There were two implementations of vgetcpu: one for vdso and one for
vsyscalls. They are actually functionally different, as the vdso one
ignores the tcache parameter (due to change 4307d1e5ada595c8), but the
vsyscall implementation honours it. This patch preserves that functional
difference, but a subsequent patch will make both behave the same way.

The actual compiled code is still duplicated as the vdso and vsyscall
environments are quite different (variable vs fixed address, for example),
and the access to kernel variables is different as well. The
implementation is made common as an inline function, so there's only
one copy of the source.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/vsyscall.h | 38 ++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/vsyscall_64.c | 30 +-----------------------------
arch/x86/vdso/vgetcpu.c | 18 +++---------------
3 files changed, 42 insertions(+), 44 deletions(-)

diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index d0983d2..2fcd505 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -39,6 +39,44 @@ extern struct timezone sys_tz;

extern void map_vsyscall(void);

+#include <linux/getcpu.h>
+
+static __always_inline
+int __vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache,
+ unsigned long jiffies, int vgetcpu_mode)
+{
+ unsigned int p;
+ unsigned long j = 0;
+
+ /* Fast cache - only recompute value once per jiffies and avoid
+ relatively costly rdtscp/cpuid otherwise.
+ This works because the scheduler usually keeps the process
+ on the same CPU and this syscall doesn't guarantee its
+ results anyways.
+ We do this here because otherwise user space would do it on
+ its own in a likely inferior way (no access to jiffies).
+ If you don't like it pass NULL. */
+ if (tcache && tcache->blob[0] == (j = jiffies)) {
+ p = tcache->blob[1];
+ } else if (vgetcpu_mode == VGETCPU_RDTSCP) {
+ /* Load per CPU data from RDTSCP */
+ native_read_tscp(&p);
+ } else {
+ /* Load per CPU data from GDT */
+ asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+ }
+ if (tcache) {
+ tcache->blob[0] = j;
+ tcache->blob[1] = p;
+ }
+ if (cpu)
+ *cpu = p & 0xfff;
+ if (node)
+ *node = p >> 12;
+ return 0;
+}
+
+
#endif /* __KERNEL__ */

#endif /* _ASM_X86_VSYSCALL_H */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 25ee06a..f71dda9 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -190,35 +190,7 @@ time_t __vsyscall(1) vtime(time_t *t)
long __vsyscall(2)
vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
{
- unsigned int p;
- unsigned long j = 0;
-
- /* Fast cache - only recompute value once per jiffies and avoid
- relatively costly rdtscp/cpuid otherwise.
- This works because the scheduler usually keeps the process
- on the same CPU and this syscall doesn't guarantee its
- results anyways.
- We do this here because otherwise user space would do it on
- its own in a likely inferior way (no access to jiffies).
- If you don't like it pass NULL. */
- if (tcache && tcache->blob[0] == (j = __jiffies)) {
- p = tcache->blob[1];
- } else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
- /* Load per CPU data from RDTSCP */
- native_read_tscp(&p);
- } else {
- /* Load per CPU data from GDT */
- asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
- }
- if (tcache) {
- tcache->blob[0] = j;
- tcache->blob[1] = p;
- }
- if (cpu)
- *cpu = p & 0xfff;
- if (node)
- *node = p >> 12;
- return 0;
+ return __vgetcpu(cpu, node, tcache, __jiffies, __vgetcpu_mode);
}

static long __vsyscall(3) venosys_1(void)
diff --git a/arch/x86/vdso/vgetcpu.c b/arch/x86/vdso/vgetcpu.c
index 9fbc6b2..1f19f74 100644
--- a/arch/x86/vdso/vgetcpu.c
+++ b/arch/x86/vdso/vgetcpu.c
@@ -14,22 +14,10 @@
#include "vextern.h"

notrace long
-__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
{
- unsigned int p;
-
- if (*vdso_vgetcpu_mode == VGETCPU_RDTSCP) {
- /* Load per CPU data from RDTSCP */
- native_read_tscp(&p);
- } else {
- /* Load per CPU data from GDT */
- asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
- }
- if (cpu)
- *cpu = p & 0xfff;
- if (node)
- *node = p >> 12;
- return 0;
+ /* Ignore tcache to preserve old behaviour */
+ return __vgetcpu(cpu, node, NULL, *vdso_jiffies, *vdso_vgetcpu_mode);
}

long getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
--
1.6.2.5

2009-10-14 19:29:20

by Jeremy Fitzhardinge

Subject: [PATCH 02/12] x86/vgetcpu: ignore tcache in common code.

The vdso implementation of vgetcpu has been ignoring the tcache
parameter since change 4307d1e5ada595c87f9a4d16db16ba5edb70dcb1,
but that overlooked the vsyscall implementation of vgetcpu.

This change ignores tcache in the common code, but passes the
pointer in from vdso so that it can still be used if there's a
new tcache implementation which addresses the problems mentioned
in 4307d1e5ada595c87f.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/vsyscall.h | 19 ++++---------------
arch/x86/vdso/vgetcpu.c | 3 +--
2 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index 2fcd505..bb90047 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -48,27 +48,16 @@ int __vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache,
unsigned int p;
unsigned long j = 0;

- /* Fast cache - only recompute value once per jiffies and avoid
- relatively costly rdtscp/cpuid otherwise.
- This works because the scheduler usually keeps the process
- on the same CPU and this syscall doesn't guarantee its
- results anyways.
- We do this here because otherwise user space would do it on
- its own in a likely inferior way (no access to jiffies).
- If you don't like it pass NULL. */
- if (tcache && tcache->blob[0] == (j = jiffies)) {
- p = tcache->blob[1];
- } else if (vgetcpu_mode == VGETCPU_RDTSCP) {
+ /* tcache is ignored - it is too unreliable */
+
+ if (vgetcpu_mode == VGETCPU_RDTSCP) {
/* Load per CPU data from RDTSCP */
native_read_tscp(&p);
} else {
/* Load per CPU data from GDT */
asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
}
- if (tcache) {
- tcache->blob[0] = j;
- tcache->blob[1] = p;
- }
+
if (cpu)
*cpu = p & 0xfff;
if (node)
diff --git a/arch/x86/vdso/vgetcpu.c b/arch/x86/vdso/vgetcpu.c
index 1f19f74..2c4f9ab 100644
--- a/arch/x86/vdso/vgetcpu.c
+++ b/arch/x86/vdso/vgetcpu.c
@@ -16,8 +16,7 @@
notrace long
__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
{
- /* Ignore tcache to preserve old behaviour */
- return __vgetcpu(cpu, node, NULL, *vdso_jiffies, *vdso_vgetcpu_mode);
+ return __vgetcpu(cpu, node, tcache, *vdso_jiffies, *vdso_vgetcpu_mode);
}

long getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
--
1.6.2.5

2009-10-14 19:29:18

by Jeremy Fitzhardinge

Subject: [PATCH 03/12] x86/vsyscall: use common implementation for vgettimeofday

There are two implementations of vgettimeofday; one as a vsyscall
and one in the vdso. They are functionally identical, but the code
is duplicated in two more-or-less equivalent forms.

The vdso implementation is also shared with the vdso vclock_gettime.

To unify the two implementations, the vdso code is hoisted into vgtod.h
as inline functions, and the callers modified accordingly. Because
vdso and vsyscall have different ways to access struct vsyscall_gtod_data,
it is passed into each function.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: John Wright <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/vgtod.h | 64 +++++++++++++++++++++++++++++++++++++++-
arch/x86/kernel/vsyscall_64.c | 61 +-------------------------------------
arch/x86/vdso/vclock_gettime.c | 46 ++---------------------------
3 files changed, 67 insertions(+), 104 deletions(-)

diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index dc27a69..9045ea4 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -1,9 +1,11 @@
#ifndef _ASM_X86_VGTOD_H
#define _ASM_X86_VGTOD_H

-#include <asm/vsyscall.h>
#include <linux/clocksource.h>

+#include <asm/vsyscall.h>
+#include <asm/unistd.h>
+
struct vsyscall_gtod_data {
seqlock_t lock;

@@ -26,4 +28,64 @@ extern struct vsyscall_gtod_data __vsyscall_gtod_data
__section_vsyscall_gtod_data;
extern struct vsyscall_gtod_data vsyscall_gtod_data;

+/*
+ * Common implementation of vdso/vsyscall time functions. This code
+ * is used in usermode, exported via either vdso or vsyscall. Because
+ * of this, it must be inlined rather than linked, hence implemented
+ * in a .h as inline functions.
+ */
+notrace static __always_inline
+long vgetns(const struct vsyscall_gtod_data *gtod)
+{
+ long v;
+ cycles_t (*vread)(void);
+
+ vread = gtod->clock.vread;
+ v = (vread() - gtod->clock.cycle_last) & gtod->clock.mask;
+ return (v * gtod->clock.mult) >> gtod->clock.shift;
+}
+
+notrace static __always_inline
+int __do_realtime(const struct vsyscall_gtod_data *gtod,
+ struct timespec *ts)
+{
+ unsigned long seq, ns;
+ do {
+ seq = read_seqbegin(&gtod->lock);
+ ts->tv_sec = gtod->wall_time_sec;
+ ts->tv_nsec = gtod->wall_time_nsec;
+ ns = vgetns(gtod);
+ } while (unlikely(read_seqretry(&gtod->lock, seq)));
+ timespec_add_ns(ts, ns);
+ return 0;
+}
+
+notrace static __always_inline
+int __do_vgettimeofday(const struct vsyscall_gtod_data *gtod,
+ struct timeval *tv, struct timezone *tz)
+{
+ long ret;
+
+ if (likely(gtod->sysctl_enabled && gtod->clock.vread)) {
+ if (likely(tv != NULL)) {
+ BUILD_BUG_ON(offsetof(struct timeval, tv_usec) !=
+ offsetof(struct timespec, tv_nsec) ||
+ sizeof(*tv) != sizeof(struct timespec));
+ __do_realtime(gtod, (struct timespec *)tv);
+ tv->tv_usec /= 1000;
+ }
+
+ if (unlikely(tz != NULL)) {
+ /* Avoid memcpy. Some old compilers fail to inline it */
+ tz->tz_minuteswest = gtod->sys_tz.tz_minuteswest;
+ tz->tz_dsttime = gtod->sys_tz.tz_dsttime;
+ }
+ return 0;
+ }
+
+ asm("syscall" : "=a" (ret) :
+ "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
+ return ret;
+}
+
#endif /* _ASM_X86_VGTOD_H */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index f71dda9..e19a60e 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -90,24 +90,6 @@ void update_vsyscall(struct timespec *wall_time, struct clocksource *clock)
write_sequnlock_irqrestore(&vsyscall_gtod_data.lock, flags);
}

-/* RED-PEN may want to readd seq locking, but then the variable should be
- * write-once.
- */
-static __always_inline void do_get_tz(struct timezone * tz)
-{
- *tz = __vsyscall_gtod_data.sys_tz;
-}
-
-static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz)
-{
- int ret;
- asm volatile("syscall"
- : "=a" (ret)
- : "0" (__NR_gettimeofday),"D" (tv),"S" (tz)
- : __syscall_clobber );
- return ret;
-}
-
static __always_inline long time_syscall(long *t)
{
long secs;
@@ -117,50 +99,9 @@ static __always_inline long time_syscall(long *t)
return secs;
}

-static __always_inline void do_vgettimeofday(struct timeval * tv)
-{
- cycle_t now, base, mask, cycle_delta;
- unsigned seq;
- unsigned long mult, shift, nsec;
- cycle_t (*vread)(void);
- do {
- seq = read_seqbegin(&__vsyscall_gtod_data.lock);
-
- vread = __vsyscall_gtod_data.clock.vread;
- if (unlikely(!__vsyscall_gtod_data.sysctl_enabled || !vread)) {
- gettimeofday(tv,NULL);
- return;
- }
-
- now = vread();
- base = __vsyscall_gtod_data.clock.cycle_last;
- mask = __vsyscall_gtod_data.clock.mask;
- mult = __vsyscall_gtod_data.clock.mult;
- shift = __vsyscall_gtod_data.clock.shift;
-
- tv->tv_sec = __vsyscall_gtod_data.wall_time_sec;
- nsec = __vsyscall_gtod_data.wall_time_nsec;
- } while (read_seqretry(&__vsyscall_gtod_data.lock, seq));
-
- /* calculate interval: */
- cycle_delta = (now - base) & mask;
- /* convert to nsecs: */
- nsec += (cycle_delta * mult) >> shift;
-
- while (nsec >= NSEC_PER_SEC) {
- tv->tv_sec += 1;
- nsec -= NSEC_PER_SEC;
- }
- tv->tv_usec = nsec / NSEC_PER_USEC;
-}
-
int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz)
{
- if (tv)
- do_vgettimeofday(tv);
- if (tz)
- do_get_tz(tz);
- return 0;
+ return __do_vgettimeofday(&__vsyscall_gtod_data, tv, tz);
}

/* This will break when the xtime seconds get inaccurate, but that is
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 6a40b78..723a84f 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -34,28 +34,6 @@ notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
return ret;
}

-notrace static inline long vgetns(void)
-{
- long v;
- cycles_t (*vread)(void);
- vread = gtod->clock.vread;
- v = (vread() - gtod->clock.cycle_last) & gtod->clock.mask;
- return (v * gtod->clock.mult) >> gtod->clock.shift;
-}
-
-notrace static noinline int do_realtime(struct timespec *ts)
-{
- unsigned long seq, ns;
- do {
- seq = read_seqbegin(&gtod->lock);
- ts->tv_sec = gtod->wall_time_sec;
- ts->tv_nsec = gtod->wall_time_nsec;
- ns = vgetns();
- } while (unlikely(read_seqretry(&gtod->lock, seq)));
- timespec_add_ns(ts, ns);
- return 0;
-}
-
/* Copy of the version in kernel/time.c which we cannot directly access */
notrace static void
vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
@@ -78,7 +56,7 @@ notrace static noinline int do_monotonic(struct timespec *ts)
do {
seq = read_seqbegin(&gtod->lock);
secs = gtod->wall_time_sec;
- ns = gtod->wall_time_nsec + vgetns();
+ ns = gtod->wall_time_nsec + vgetns(gtod);
secs += gtod->wall_to_monotonic.tv_sec;
ns += gtod->wall_to_monotonic.tv_nsec;
} while (unlikely(read_seqretry(&gtod->lock, seq)));
@@ -91,7 +69,7 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
if (likely(gtod->sysctl_enabled && gtod->clock.vread))
switch (clock) {
case CLOCK_REALTIME:
- return do_realtime(ts);
+ return __do_realtime(gtod, ts);
case CLOCK_MONOTONIC:
return do_monotonic(ts);
}
@@ -102,25 +80,7 @@ int clock_gettime(clockid_t, struct timespec *)

notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
- long ret;
- if (likely(gtod->sysctl_enabled && gtod->clock.vread)) {
- if (likely(tv != NULL)) {
- BUILD_BUG_ON(offsetof(struct timeval, tv_usec) !=
- offsetof(struct timespec, tv_nsec) ||
- sizeof(*tv) != sizeof(struct timespec));
- do_realtime((struct timespec *)tv);
- tv->tv_usec /= 1000;
- }
- if (unlikely(tz != NULL)) {
- /* Avoid memcpy. Some old compilers fail to inline it */
- tz->tz_minuteswest = gtod->sys_tz.tz_minuteswest;
- tz->tz_dsttime = gtod->sys_tz.tz_dsttime;
- }
- return 0;
- }
- asm("syscall" : "=a" (ret) :
- "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
- return ret;
+ return __do_vgettimeofday(gtod, tv, tz);
}
int gettimeofday(struct timeval *, struct timezone *)
__attribute__((weak, alias("__vdso_gettimeofday")));
--
1.6.2.5

2009-10-14 19:29:48

by Jeremy Fitzhardinge

Subject: [PATCH 04/12] kvm/pvclock: add monotonicity check

Other tsc-based clocksources add a monotonicity test to make sure there's
no regression in the returned cycles.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Avi Kivity <[email protected]>
---
arch/x86/kernel/kvmclock.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 223af43..46cb77f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -77,9 +77,14 @@ static cycle_t kvm_clock_read(void)
return ret;
}

+static struct clocksource kvm_clock;
+
static cycle_t kvm_clock_get_cycles(struct clocksource *cs)
{
- return kvm_clock_read();
+ cycle_t ret = kvm_clock_read();
+
+ return ret >= kvm_clock.cycle_last ?
+ ret : kvm_clock.cycle_last;
}

/*
--
1.6.2.5

2009-10-14 19:29:59

by Jeremy Fitzhardinge

Subject: [PATCH 05/12] xen/pvclock: add monotonicity check

Other tsc-based clocksources add a monotonicity test to make sure there's
no regression in the returned cycles.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/time.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 0a5aa44..00f06cc 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -213,9 +213,14 @@ cycle_t xen_clocksource_read(void)
return ret;
}

+static struct clocksource xen_clocksource;
+
static cycle_t xen_clocksource_get_cycles(struct clocksource *cs)
{
- return xen_clocksource_read();
+ cycle_t ret = xen_clocksource_read();
+
+ return ret >= xen_clocksource.cycle_last ?
+ ret : xen_clocksource.cycle_last;
}

static void xen_read_wallclock(struct timespec *ts)
--
1.6.2.5

2009-10-14 19:30:48

by Jeremy Fitzhardinge

Subject: [PATCH 06/12] x86/pvclock: make sure rdtsc doesn't speculate out of region

rdtsc is not necessarily ordered with surrounding instructions so we
need to make sure that it doesn't get sampled out of order with
respect to reading the time_info.
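
For reference, rdtsc_barrier() is the existing kernel helper for this;
conceptually it boils down to something like the sketch below (simplified:
the real implementation uses the alternatives mechanism to patch in LFENCE
or MFENCE, or NOPs, depending on CPU feature bits):

    static inline void rdtsc_barrier_sketch(void)
    {
            /* keep the rdtsc from being hoisted past this point */
            asm volatile("mfence" ::: "memory");
    }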

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
---
arch/x86/kernel/pvclock.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..963f349 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -117,10 +117,10 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)

do {
version = pvclock_get_time_values(&shadow, src);
- barrier();
+ rdtsc_barrier();
offset = pvclock_get_nsec_offset(&shadow);
ret = shadow.system_timestamp + offset;
- barrier();
+ rdtsc_barrier();
} while (version != src->version);

return ret;
--
1.6.2.5

2009-10-14 19:30:02

by Jeremy Fitzhardinge

Subject: [PATCH 07/12] pvclock: there's no need to copy time_info into shadow

We can use the vcpu_time_info as-is, without needing to copy it. We just
need to grab a copy of the version number before starting, then we can
compute the time using the time_info directly. If things changed under
our feet, then we just go back and do the whole thing again.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
---
arch/x86/kernel/pvclock.c | 47 +++++---------------------------------------
1 files changed, 6 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 963f349..e43cd78 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -20,20 +20,6 @@
#include <asm/pvclock.h>

/*
- * These are perodically updated
- * xen: magic shared_info page
- * kvm: gpa registered via msr
- * and then copied here.
- */
-struct pvclock_shadow_time {
- u64 tsc_timestamp; /* TSC at last update of time vals. */
- u64 system_timestamp; /* Time, in nanosecs, since boot. */
- u32 tsc_to_nsec_mul;
- int tsc_shift;
- u32 version;
-};
-
-/*
* Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
* yielding a 64-bit result.
*/
@@ -71,30 +57,10 @@ static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
return product;
}

-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
-{
- u64 delta = native_read_tsc() - shadow->tsc_timestamp;
- return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
-}
-
-/*
- * Reads a consistent set of time-base values from hypervisor,
- * into a shadow data area.
- */
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
- struct pvclock_vcpu_time_info *src)
+static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
{
- do {
- dst->version = src->version;
- rmb(); /* fetch version before data */
- dst->tsc_timestamp = src->tsc_timestamp;
- dst->system_timestamp = src->system_time;
- dst->tsc_to_nsec_mul = src->tsc_to_system_mul;
- dst->tsc_shift = src->tsc_shift;
- rmb(); /* test version after fetching data */
- } while ((src->version & 1) || (dst->version != src->version));
-
- return dst->version;
+ u64 delta = native_read_tsc() - src->tsc_timestamp;
+ return scale_delta(delta, src->tsc_to_system_mul, src->tsc_shift);
}

unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
@@ -111,15 +77,14 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)

cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
- struct pvclock_shadow_time shadow;
unsigned version;
cycle_t ret, offset;

do {
- version = pvclock_get_time_values(&shadow, src);
+ version = src->version;
rdtsc_barrier();
- offset = pvclock_get_nsec_offset(&shadow);
- ret = shadow.system_timestamp + offset;
+ offset = pvclock_get_nsec_offset(src);
+ ret = src->system_time + offset;
rdtsc_barrier();
} while (version != src->version);

--
1.6.2.5

2009-10-14 19:31:12

by Jeremy Fitzhardinge

Subject: [PATCH 08/12] x86: create specific X86_VSYSCALL config variable

vsyscalls are only implemented on 64-bit at present, though there's no
fundamental reason why they couldn't be implemented for 32-bit. Create
a new CONFIG_X86_VSYSCALL to control compilation of vsyscall-related
code and definitions to make it clearer where they are.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/Kconfig | 6 +++++-
arch/x86/include/asm/fixmap.h | 7 +++++--
arch/x86/kernel/Makefile | 3 ++-
arch/x86/kernel/hpet.c | 4 ++--
arch/x86/kernel/tsc.c | 4 ++--
5 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 13ffa5d..7950d54 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -136,7 +136,7 @@ config GENERIC_CALIBRATE_DELAY

config GENERIC_TIME_VSYSCALL
bool
- default X86_64
+ default X86_VSYSCALL

config ARCH_HAS_CPU_RELAX
def_bool y
@@ -222,6 +222,10 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR

+config X86_VSYSCALL
+ def_bool y
+ depends on X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 7b2d71d..3b63b57 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -74,12 +74,15 @@ enum fixed_addresses {
#ifdef CONFIG_X86_32
FIX_HOLE,
FIX_VDSO,
-#else
+#endif
+
+#ifdef CONFIG_X86_VSYSCALL
VSYSCALL_LAST_PAGE,
VSYSCALL_FIRST_PAGE = VSYSCALL_LAST_PAGE
+ ((VSYSCALL_END-VSYSCALL_START) >> PAGE_SHIFT) - 1,
VSYSCALL_HPET,
-#endif
+#endif /* CONFIG_X86_VSYSCALL */
+
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 430d5b2..1c9ec2f 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -37,7 +37,8 @@ obj-$(CONFIG_X86_VISWS) += visws_quirks.o
obj-$(CONFIG_X86_32) += probe_roms_32.o
obj-$(CONFIG_X86_32) += sys_i386_32.o i386_ksyms_32.o
obj-$(CONFIG_X86_64) += sys_x86_64.o x8664_ksyms_64.o
-obj-$(CONFIG_X86_64) += syscall_64.o vsyscall_64.o
+obj-$(CONFIG_X86_64) += syscall_64.o
+obj-$(CONFIG_X86_VSYSCALL) += vsyscall_64.o
obj-y += bootflag.o e820.o
obj-y += pci-dma.o quirks.o i8237.o topology.o kdebugfs.o
obj-y += alternative.o i8253.o pci-nommu.o
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index dedc2bd..2cac930 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -744,7 +744,7 @@ static cycle_t read_hpet(struct clocksource *cs)
return (cycle_t)hpet_readl(HPET_COUNTER);
}

-#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_VSYSCALL
static cycle_t __vsyscall_fn vread_hpet(void)
{
return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
@@ -759,7 +759,7 @@ static struct clocksource clocksource_hpet = {
.shift = HPET_SHIFT,
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
.resume = hpet_resume_counter,
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_VSYSCALL
.vread = vread_hpet,
#endif
};
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 71f4368..fe7174f 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -725,7 +725,7 @@ static cycle_t read_tsc(struct clocksource *cs)
ret : clocksource_tsc.cycle_last;
}

-#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_VSYSCALL
static cycle_t __vsyscall_fn vread_tsc(void)
{
cycle_t ret;
@@ -752,7 +752,7 @@ static struct clocksource clocksource_tsc = {
.shift = 22,
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_MUST_VERIFY,
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_VSYSCALL
.vread = vread_tsc,
#endif
};
--
1.6.2.5

2009-10-14 19:31:28

by Jeremy Fitzhardinge

Subject: [PATCH 09/12] sched: add notifier for cross-cpu migrations

Add a notifier to publish when a task gets migrated from one cpu to
another. The first intended use is to export a "migration counter"
to usermode so that a task can tell whether it's still running on the
same cpu (for use in vgetcpu and other vsyscall code).
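
A consumer hooks the chain in the usual notifier style, for example (a
hypothetical sketch, not part of this patch; the real consumer arrives
with the pvclock vsyscall patch later in the series):

    static int my_migration_cb(struct notifier_block *nb,
                               unsigned long val, void *data)
    {
            struct task_migration_notifier *mn = data;

            /* mn->task just moved from mn->from_cpu to mn->to_cpu */
            return NOTIFY_DONE;
    }

    static struct notifier_block my_migration_nb = {
            .notifier_call = my_migration_cb,
    };

    /* at init time: */
    register_task_migration_notifier(&my_migration_nb);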

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andi Kleen <[email protected]>
---
include/linux/sched.h | 8 ++++++++
kernel/sched.c | 14 ++++++++++++++
2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0f1ea4a..5186dd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -141,6 +141,14 @@ extern unsigned long nr_iowait(void);
extern void calc_global_load(void);
extern u64 cpu_nr_migrations(int cpu);

+/* Notifier for when a task gets migrated to a new CPU */
+struct task_migration_notifier {
+ struct task_struct *task;
+ int from_cpu;
+ int to_cpu;
+};
+extern void register_task_migration_notifier(struct notifier_block *n);
+
extern unsigned long get_parent_ip(unsigned long addr);

struct seq_file;
diff --git a/kernel/sched.c b/kernel/sched.c
index 1b59e26..3982e8e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1951,6 +1951,12 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
return delta < (s64)sysctl_sched_migration_cost;
}

+static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
+
+void register_task_migration_notifier(struct notifier_block *n)
+{
+ atomic_notifier_chain_register(&task_migration_notifier, n);
+}

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
@@ -1973,6 +1979,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
p->se.block_start -= clock_offset;
#endif
if (old_cpu != new_cpu) {
+ struct task_migration_notifier tmn;
+
p->se.nr_migrations++;
new_rq->nr_migrations_in++;
#ifdef CONFIG_SCHEDSTATS
@@ -1981,6 +1989,12 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
#endif
perf_swcounter_event(PERF_COUNT_SW_CPU_MIGRATIONS,
1, 1, NULL, 0);
+
+ tmn.task = p;
+ tmn.from_cpu = old_cpu;
+ tmn.to_cpu = new_cpu;
+
+ atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
}
p->se.vruntime -= old_cfsrq->min_vruntime -
new_cfsrq->min_vruntime;
--
1.6.2.5

2009-10-14 19:30:08

by Jeremy Fitzhardinge

Subject: [PATCH 10/12] x86/pvclock: add vsyscall implementation

This patch allows the pvclock mechanism to be used in usermode. To
do this, we map an extra page into usermode containing an array of
pvclock_vcpu_time_info structures which give the information required
to compute a global system clock from the tsc. With this, we can
implement pvclock_clocksource_vread().

One complication is that usermode is subject to two levels of scheduling:
kernel scheduling of tasks onto vcpus, and hypervisor scheduling of vcpus
onto pcpus. In either case the underlying pcpu can change, and with it,
the correct set of parameters for computing the tsc->system clock.

To address this we install a task migration notifier to update the "old"
vcpu's "migrate_count" number (ie, vcpu has had a task migrated away
from it). Usermode can then check the migrate_count while computing
the time and retry with a new cpu number if it has changed.

To use this feature, hypervisor-specific code is required
to call pvclock_init_vsyscall(), and if successful:
- cause the pvclock_vcpu_time_info structure at
pvclock_get_vsyscall_time_info(cpu) to be updated appropriately for
each vcpu.
- use pvclock_clocksource_vread as the implementation of clocksource
.vread.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Keir Fraser <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Dan Magenheimer <[email protected]>
---
arch/x86/Kconfig | 4 +
arch/x86/include/asm/fixmap.h | 4 +-
arch/x86/include/asm/pvclock.h | 6 ++
arch/x86/include/asm/vsyscall.h | 3 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/pvclock.c | 149 ++++++++++++++++++++++++++++++++++++--
6 files changed, 159 insertions(+), 9 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7950d54..50a5771 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -522,6 +522,10 @@ config PARAVIRT_CLOCK
bool
default n

+config PARAVIRT_CLOCK_VSYSCALL
+ bool
+ depends on PARAVIRT_CLOCK && X86_VSYSCALL
+
endif

config PARAVIRT_DEBUG
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 3b63b57..b15c865 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -82,7 +82,9 @@ enum fixed_addresses {
+ ((VSYSCALL_END-VSYSCALL_START) >> PAGE_SHIFT) - 1,
VSYSCALL_HPET,
#endif /* CONFIG_X86_VSYSCALL */
-
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+ FIX_PVCLOCK_TIME_INFO,
+#endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 53235fd..d2402b3 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -3,6 +3,7 @@

#include <linux/clocksource.h>
#include <asm/pvclock-abi.h>
+#include <asm/vsyscall.h>

/* some helper functions for xen and kvm pv clock sources */
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
@@ -11,4 +12,9 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
struct pvclock_vcpu_time_info *vcpu,
struct timespec *ts);

+int __init pvclock_init_vsyscall(void);
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
+
+cycle_t __vsyscall_fn pvclock_clocksource_vread(void);
+
#endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index bb90047..80a027d 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -33,6 +33,9 @@ enum vsyscall_num {
extern int __vgetcpu_mode;
extern volatile unsigned long __jiffies;

+struct getcpu_cache;
+extern long vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache);
+
/* kernel space (writeable) */
extern int vgetcpu_mode;
extern struct timezone sys_tz;
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 1c9ec2f..88d51cb 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -24,10 +24,12 @@ CFLAGS_vsyscall_64.o := $(PROFILING) -g0 $(nostackp)
CFLAGS_hpet.o := $(nostackp)
CFLAGS_tsc.o := $(nostackp)
CFLAGS_paravirt.o := $(nostackp)
+CFLAGS_pvclock.o := $(nostackp)
GCOV_PROFILE_vsyscall_64.o := n
GCOV_PROFILE_hpet.o := n
GCOV_PROFILE_tsc.o := n
GCOV_PROFILE_paravirt.o := n
+GCOV_PROFILE_pvclock.o := n

obj-y := process_$(BITS).o signal.o entry_$(BITS).o
obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index e43cd78..0bed867 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -17,7 +17,11 @@

#include <linux/kernel.h>
#include <linux/percpu.h>
+#include <linux/sched.h>
+
#include <asm/pvclock.h>
+#include <asm/vsyscall.h>
+#include <asm/vgtod.h>

/*
* Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
@@ -57,9 +61,10 @@ static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
return product;
}

-static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
+static __always_inline
+u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
{
- u64 delta = native_read_tsc() - src->tsc_timestamp;
+ u64 delta = __native_read_tsc() - src->tsc_timestamp;
return scale_delta(delta, src->tsc_to_system_mul, src->tsc_shift);
}

@@ -75,17 +80,30 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
return pv_tsc_khz;
}

-cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+static __always_inline
+unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
+ cycle_t *cycles)
{
unsigned version;
cycle_t ret, offset;

+ version = src->version;
+ rdtsc_barrier();
+ offset = pvclock_get_nsec_offset(src);
+ ret = src->system_time + offset;
+ rdtsc_barrier();
+
+ *cycles = ret;
+ return version;
+}
+
+cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+{
+ unsigned version;
+ cycle_t ret;
+
do {
- version = src->version;
- rdtsc_barrier();
- offset = pvclock_get_nsec_offset(src);
- ret = src->system_time + offset;
- rdtsc_barrier();
+ version = __pvclock_read_cycles(src, &ret);
} while (version != src->version);

return ret;
@@ -116,3 +134,118 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,

set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
}
+
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+
+typedef union {
+ struct pvclock_vsyscall_time_info {
+ struct pvclock_vcpu_time_info pvti;
+ u32 migrate_count;
+ } info;
+ char pad[SMP_CACHE_BYTES];
+} aligned_pvti_t ____cacheline_aligned;
+
+static aligned_pvti_t *pvclock_vsyscall_time_info;
+
+static struct pvclock_vsyscall_time_info *pvclock_get_vsyscall_user_time_info(int cpu)
+{
+ if (pvclock_vsyscall_time_info == NULL)
+ return NULL;
+
+ return &pvclock_vsyscall_time_info[cpu].info;
+}
+
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
+{
+ return &pvclock_get_vsyscall_user_time_info(cpu)->pvti;
+}
+
+cycle_t __vsyscall_fn pvclock_clocksource_vread(void)
+{
+ const aligned_pvti_t *pvti_base;
+ const struct pvclock_vsyscall_time_info *pvti;
+ cycle_t ret;
+ u32 version;
+ u32 migrate_count;
+ unsigned cpu, cpu1;
+
+ pvti_base = (aligned_pvti_t *)fix_to_virt(FIX_PVCLOCK_TIME_INFO);
+
+ /*
+ * When looping to get a consistent (time-info, tsc) pair, we
+ * also need to deal with the possibility we can switch vcpus,
+ * so make sure we always re-fetch time-info for the current vcpu.
+ */
+ do {
+ vgetcpu(&cpu, NULL, NULL);
+ pvti = &pvti_base[cpu].info;
+
+ migrate_count = pvti->migrate_count;
+ version = __pvclock_read_cycles(&pvti->pvti, &ret);
+
+ /*
+ * Test we're still on the cpu as well as the version.
+ * We could have been migrated just after the first
+ * vgetcpu but before fetching the version, so we
+ * wouldn't notice a version change.
+ */
+ vgetcpu(&cpu1, NULL, NULL);
+ } while (unlikely(cpu != cpu1 ||
+ pvti->pvti.version != version ||
+ pvti->migrate_count != migrate_count));
+
+ return ret >= __vsyscall_gtod_data.clock.cycle_last ?
+ ret : __vsyscall_gtod_data.clock.cycle_last;
+}
+
+int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v)
+{
+ struct task_migration_notifier *mn = v;
+ struct pvclock_vsyscall_time_info *pvti;
+
+ pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
+
+ if (pvti == NULL)
+ return NOTIFY_DONE;
+
+ pvti->migrate_count++;
+
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block pvclock_migrate = {
+ .notifier_call = pvclock_task_migrate,
+};
+
+/*
+ * Initialize the generic pvclock vsyscall state. This will allocate
+ * a/some page(s) for the per-vcpu pvclock information and set up a
+ * fixmap mapping for the page(s).
+ */
+int __init pvclock_init_vsyscall(void)
+{
+ int cpu;
+
+ /* Just one page for now */
+ if (nr_cpu_ids * sizeof(aligned_pvti_t) > PAGE_SIZE) {
+ printk(KERN_WARNING "pvclock_vsyscall: too many CPUs to fit time_info into a single page\n");
+ return -ENOSPC;
+ }
+
+ pvclock_vsyscall_time_info =
+ (aligned_pvti_t *)get_zeroed_page(GFP_KERNEL);
+ if (pvclock_vsyscall_time_info == NULL)
+ return -ENOMEM;
+
+ for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+ pvclock_vsyscall_time_info[cpu].info.pvti.version = ~0;
+
+ __set_fixmap(FIX_PVCLOCK_TIME_INFO, __pa(pvclock_vsyscall_time_info),
+ PAGE_KERNEL_VSYSCALL);
+
+ register_task_migration_notifier(&pvclock_migrate);
+
+ return 0;
+}
+
+#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */
--
1.6.2.5

2009-10-14 19:30:58

by Jeremy Fitzhardinge

Subject: [PATCH 11/12] x86/fixmap: add a predicate for usermode fixmaps

Some fixmaps are intended to be mapped into usermode. Add a
predicate to test for those fixmaps.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/fixmap.h | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index b15c865..026ee10 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -183,6 +183,24 @@ static inline void __set_fixmap(enum fixed_addresses idx,

extern void __this_fixmap_does_not_exist(void);

+static inline bool user_fixmap(enum fixed_addresses fixmap)
+{
+ switch (fixmap) {
+#ifdef CONFIG_X86_32
+ case FIX_HOLE ... FIX_VDSO:
+#else
+ case VSYSCALL_LAST_PAGE ... VSYSCALL_HPET:
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+ case FIX_PVCLOCK_TIME_INFO:
+#endif
+#endif
+ return true;
+
+ default:
+ return false;
+ }
+}
+
/*
* 'index to address' translation. If anyone tries to use the idx
* directly without translation, we catch the bug with a NULL-deference
--
1.6.2.5

2009-10-14 19:31:01

by Jeremy Fitzhardinge

Subject: [PATCH 12/12] xen/time: add pvclock_clocksource_vread support

Add support to register pvclock_vcpu_time_info structures in the
userspace-mapped page and set the xen clocksource .vread method if that works.
The common pvclock code does everything else.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/Kconfig | 6 ++++
arch/x86/xen/mmu.c | 5 +++-
arch/x86/xen/smp.c | 2 +
arch/x86/xen/time.c | 54 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/xen/xen-ops.h | 8 ++++++
include/xen/interface/vcpu.h | 41 +++++++++++++++++++++++++++++++
6 files changed, 115 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index b83e119..a9ce09d 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -13,6 +13,11 @@ config XEN
kernel to boot in a paravirtualized environment under the
Xen hypervisor.

+config XEN_TIME_VSYSCALL
+ def_bool y
+ depends on X86_VSYSCALL && PARAVIRT_CLOCK
+ select PARAVIRT_CLOCK_VSYSCALL
+
config XEN_MAX_DOMAIN_MEMORY
int "Maximum allowed size of a domain in gigabytes"
default 8 if X86_32
@@ -36,3 +41,4 @@ config XEN_DEBUG_FS
help
Enable statistics output and various tuning options in debugfs.
Enabling this option may incur a significant performance overhead.
+
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 4ceb285..858d407 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1814,6 +1814,9 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
# endif
#else
case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE:
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+ case FIX_PVCLOCK_TIME_INFO:
+#endif
#endif
#ifdef CONFIG_X86_LOCAL_APIC
case FIX_APIC_BASE: /* maps dummy local APIC */
@@ -1834,7 +1837,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
#ifdef CONFIG_X86_64
/* Replicate changes to map the vsyscall page into the user
pagetable vsyscall mapping. */
- if (idx >= VSYSCALL_LAST_PAGE && idx <= VSYSCALL_FIRST_PAGE) {
+ if (user_fixmap(idx)) {
unsigned long vaddr = __fix_to_virt(idx);
set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte);
}
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 429834e..a2ee882 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -313,6 +313,8 @@ static int __cpuinit xen_cpu_up(unsigned int cpu)
if (rc)
return rc;

+ xen_setup_vcpu_vsyscall_time_info(cpu);
+
rc = HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
BUG_ON(rc);

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 00f06cc..5cf93f8 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -479,6 +479,55 @@ void xen_timer_resume(void)
}
}

+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+void xen_setup_vcpu_vsyscall_time_info(int cpu)
+{
+ int ret;
+ struct pvclock_vcpu_time_info *pvti;
+ struct vcpu_register_time_memory_area t;
+
+ pvti = pvclock_get_vsyscall_time_info(cpu);
+ if (!pvti)
+ return;
+
+ t.addr.pv = pvti;
+
+ ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area,
+ cpu, &t);
+ /*
+ * If the call succeeds, it will update the vcpu_time_info and
+ * set the version to something valid. If it fails, we set
+ * the version to invalid so that usermode doesn't try to use
+ * it.
+ */
+ if (ret != 0)
+ pvti->version = ~0;
+}
+
+static int __init xen_setup_vsyscall_timeinfo(int cpu)
+{
+ int ret;
+
+ ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area,
+ cpu, NULL);
+ if (ret == -ENOSYS) {
+ printk(KERN_INFO "xen: vcpu_time_info placement not supported\n");
+ return ret;
+ }
+
+ ret = pvclock_init_vsyscall();
+ if (ret != 0) {
+ printk(KERN_INFO "xen: Failed to initialize pvclock vsyscall: %d\n",
+ ret);
+ return ret;
+ }
+
+ xen_setup_vcpu_vsyscall_time_info(cpu);
+
+ return 0;
+}
+#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */
+
__init void xen_time_init(void)
{
int cpu = smp_processor_id();
@@ -501,4 +550,9 @@ __init void xen_time_init(void)

xen_setup_timer(cpu);
xen_setup_cpu_clockevents();
+
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+ if (xen_setup_vsyscall_timeinfo(cpu) == 0)
+ xen_clocksource.vread = pvclock_clocksource_vread;
+#endif
}
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 22494fd..f8d8874 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -58,6 +58,14 @@ bool xen_vcpu_stolen(int vcpu);

void xen_setup_vcpu_info_placement(void);

+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+void xen_setup_vcpu_vsyscall_time_info(int cpu);
+#else
+static inline void xen_setup_vcpu_vsyscall_time_info(int cpu)
+{
+}
+#endif
+
#ifdef CONFIG_SMP
void xen_smp_init(void);

diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h
index 87e6f8a..0a8edfd 100644
--- a/include/xen/interface/vcpu.h
+++ b/include/xen/interface/vcpu.h
@@ -170,4 +170,45 @@ struct vcpu_register_vcpu_info {
};
DEFINE_GUEST_HANDLE_STRUCT(vcpu_register_vcpu_info);

+
+/*
+ * Register a memory location to get a secondary copy of the vcpu time
+ * parameters. The master copy still exists as part of the vcpu
+ * shared memory area, and this secondary copy is updated whenever the
+ * master copy is updated.
+ *
+ * The intent is that this copy may be mapped (RO) into userspace so
+ * that usermode can compute system time using the time info and the
+ * tsc. Usermode will see an array of vcpu_time_info structures, one
+ * for each vcpu, and choose the right one by an existing mechanism
+ * which allows it to get the current vcpu number (such as via a
+ * segment limit). It can then apply the normal algorithm to compute
+ * system time from the tsc.
+ *
+ * However, because usermode threads are subject to two levels of
+ * scheduling (kernel scheduling of threads to vcpus, and Xen
+ * scheduling of vcpus to pcpus), we must make sure that the thread
+ * knows it has had a race with either (or both) of these two events.
+ * To allow this, the guest kernel updates the time_info version
+ * number when the vcpu does a context switch, so that usermode will
+ * always see a version number change when the parameters need to be
+ * revalidated. Xen makes sure that it always updates the guest's
+ * version rather than overwriting it. (It assumes that a vcpu will
+ * always update its own version number, so there are no cross-cpu
+ * synchronization issues; the only concern is that if the guest
+ * kernel gets preempted by Xen it doesn't revert the version number
+ * to an older value.)
+ *
+ * @extra_arg == pointer to vcpu_register_time_memory_area structure.
+ */
+#define VCPUOP_register_vcpu_time_memory_area 13
+
+struct vcpu_register_time_memory_area {
+ union {
+ struct vcpu_time_info *v;
+ struct pvclock_vcpu_time_info *pv;
+ uint64_t p;
+ } addr;
+};
+
#endif /* __XEN_PUBLIC_VCPU_H__ */
--
1.6.2.5

2009-10-15 03:30:24

by Dan Magenheimer

Subject: RE: [Xen-devel] [PATCH 05/12] xen/pvclock: add monotonicity check

> Other tsc-based clocksources add a monotonicity test to make
> sure there's
> no regression in the returned cycles.
>
> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/xen/time.c | 7 ++++++-
> 1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 0a5aa44..00f06cc 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -213,9 +213,14 @@ cycle_t xen_clocksource_read(void)
> return ret;
> }
>
> +static struct clocksource xen_clocksource;
> +
> static cycle_t xen_clocksource_get_cycles(struct clocksource *cs)
> {
> - return xen_clocksource_read();
> + cycle_t ret = xen_clocksource_read();
> +
> + return ret >= xen_clocksource.cycle_last ?
> + ret : xen_clocksource.cycle_last;

As long as we are going through the trouble of making
this monotonic, shouldn't it be monotonically increasing
(rather than just monotonically non-decreasing)? The
rdtsc instruction and any suitably high-precision
hardware timer will never return the same value
on subsequent uses so this might be a reasonable
precedent to obey. E.g.

+ return ret > xen_clocksource.cycle_last ?
+ ret : ++xen_clocksource.cycle_last;

2009-10-15 05:24:39

by Jeremy Fitzhardinge

Subject: Re: [Xen-devel] [PATCH 05/12] xen/pvclock: add monotonicity check

On 10/14/09 20:26, Dan Magenheimer wrote:
> As long as we are going through the trouble of making
> this monotonic, shouldn't it be monotonically increasing
> (rather than just monotonically non-decreasing)? The
> rdtsc instruction and any suitably high-precision
> hardware timer will never return the same value
> on subsequent uses so this might be a reasonable
> precedent to obey. E.g.
>
> + return ret > xen_clocksource.cycle_last ?
> + ret : ++xen_clocksource.cycle_last;
>

No, cycle_last isn't updated on every read, only on timer ticks. This
test doesn't seem to be intended to make sure that every
clocksource_read is globally monotonic, but just to avoid some boundary
conditions in the timer interrupt. I just copied it directly from
read_tsc().

J

2009-10-15 06:53:18

by Avi Kivity

Subject: Re: [GIT PULL RFC] pvclock cleanups and pvclock vsyscall support

On 10/15/2009 04:28 AM, Jeremy Fitzhardinge wrote:
> Hi all,
>
> This series contains several things:
>
> - Unify the separate vdso and vsyscall implementations of vgettimeofday and
> vgetcpu. There's at least one bugfix which was only applied to one copy
> (ignore tcache in vgetcpu was only applied to the vdso version); this
> should avoid such skews in future.
>
> - Bug fixes for the Xen and KVM clocksource.read functions to make sure
> the returned time doesn't regress compared to clocksource.cycle_last.
> (Probably stable material.)
>
> - Make sure the pvclock rdtsc is surrounded by appropriate barriers so
> that it doesn't get speculated to the wrong place with respect to reading
> the time parameters. (Probably stable material.)
>
> - General cleanups of the pvclock algorithm (there's no need to make a local
> copy of the time parameters before use).
>
> - Add a new CONFIG_X86_VSYSCALL to control the compilation of vsyscall-related
> code rather than just using CONFIG_X86_64 - we may want to implement 32-bit
> vsyscall at some point, and this will make it easier.
>
> - Add the sched notifier for task migration between CPUs, for use by
> pvclock vread.
>
> - Implement a pvclock vread function, so that pvclock-using clocksources can be
> used by vsyscall/vdso vgettimeofday and vclock_gettime.
>
> - Use pvclock vread in the Xen clocksource.
>

Looks good to me.

Acked-by etc.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-10-15 13:36:06

by Dan Magenheimer

Subject: RE: [Xen-devel] [PATCH 05/12] xen/pvclock: add monotonicity check

> On 10/14/09 20:26, Dan Magenheimer wrote:
> > As long as we are going through the trouble of making
> > this monotonic, shouldn't it be monotonically increasing
> > (rather than just monotonically non-decreasing)? The
> > rdtsc instruction and any suitably high-precision
> > hardware timer will never return the same value
> > on subsequent uses so this might be a reasonable
> > precedent to obey. E.g.
> >
> > + return ret > xen_clocksource.cycle_last ?
> > + ret : ++xen_clocksource.cycle_last;
>
> No, cycle_last isn't updated on every read, only on timer ticks. This
> test doesn't seem to be intended to make sure that every
> clocksource_read is globally monotonic, but just to avoid
> some boundary
> conditions in the timer interrupt. I just copied it directly from
> read_tsc().

I understand but you are now essentially emulating a
reliable platform timer with a potentially unreliable
(but still high resolution) per-CPU timer AND probably
delivering that result to userland.

Read_tsc should only be used if either CONSTANT_TSC
or TSC_RELIABLE is true, so read_tsc is guaranteed
to be monotonically-strictly-increasing by hardware
(and enforced for CONSTANT_TSC by check_tsc_warp
at boot).

2009-10-15 19:20:00

by Jeremy Fitzhardinge

Subject: Re: [Xen-devel] [PATCH 05/12] xen/pvclock: add monotonicity check

On 10/15/09 06:27, Dan Magenheimer wrote:
> I understand but you are now essentially emulating a
> reliable platform timer with a potentially unreliable
> (but still high resolution) per-CPU timer AND probably
> delivering that result to userland.
>
> Read_tsc should only be used if either CONSTANT_TSC
> or TSC_RELIABLE is true, so read_tsc is guaranteed
> to be monotonically-strictly-increasing by hardware
> (and enforced for CONSTANT_TSC by check_tsc_warp
> at boot).
>

read_tsc clearly isn't expected to produce absolutely globally monotonic
results; if it were then the check wouldn't be necessary.

J

2009-10-15 19:47:19

by Jeremy Fitzhardinge

Subject: Re: [GIT PULL RFC] pvclock cleanups and pvclock vsyscall support

On 10/14/09 23:51, Avi Kivity wrote:
>> - Implement a pvclock vread function, so that pvclock-using
>> clocksources can be
>> used by vsyscall/vdso vgettimeofday and vclock_gettime.
>>
>> - Use pvclock vread in the Xen clocksource.
>>
>
>
> Looks good to me.

Care to cook up a patch to implement the kvm bits to make sure it all
works OK for you?

Thanks,
J

2009-10-16 01:33:05

by john stultz

Subject: Re: [Xen-devel] [PATCH 05/12] xen/pvclock: add monotonicity check

On Thu, Oct 15, 2009 at 6:27 AM, Dan Magenheimer
<[email protected]> wrote:
>> On 10/14/09 20:26, Dan Magenheimer wrote:
>> > As long as we are going through the trouble of making
>> > this monotonic, shouldn't it be monotonically increasing
>> > (rather than just monotonically non-decreasing)? The
>> > rdtsc instruction and any suitably high-precision
>> > hardware timer will never return the same value
>> > on subsequent uses so this might be a reasonable
>> > precedent to obey. E.g.
>> >
>> > + return ret > xen_clocksource.cycle_last ?
>> > + ret : ++xen_clocksource.cycle_last;

Oof. Modifying .cycle_last will cause timekeeping wreckage. Please don't.
Also the above would break on SMP.

Ideally we would have moved cycle_last to the timekeeper structure,
since it's a timekeeping-specific value, not really clocksource
related.

However, there is the situation where we don't have perfectly
synced TSCs, but TSCs that are very, very close. In this case, the only
time we might see skew is if update_wall_time were to run on the
slightly faster TSC, and then immediately after, the gettimeofday()
code runs and sees the cycle_last value ahead of the rdtsc.

So the cycle_last check is hackish, but it lets folks use the much
faster TSC when we'd have to otherwise throw it out.

If you wanted something like this, a global last_tsc value could be
used and cmpxchged to ensure you always have an increasing TSC value.
However I suspect the performance hit there would be painful.
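
Roughly like this, say (a sketch of the cmpxchg idea only, not proposed
code):

    static u64 last_value;

    static cycle_t monotonic_read(struct clocksource *cs)
    {
            u64 now, last;

            do {
                    now = cs->read(cs);
                    last = last_value;
                    if (now <= last)
                            return last;    /* clamp: never go backwards */
            } while (cmpxchg64(&last_value, last, now) != last);

            return now;
    }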

>> No, cycle_last isn't updated on every read, only on timer ticks. This
>> test doesn't seem to be intended to make sure that every
>> clocksource_read is globally monotonic, but just to avoid
>> some boundary
>> conditions in the timer interrupt. I just copied it directly from
>> read_tsc().
>
> I understand but you are now essentially emulating a
> reliable platform timer with a potentially unreliable
> (but still high resolution) per-CPU timer AND probably
> delivering that result to userland.
>
> Read_tsc should only be used if either CONSTANT_TSC
> or TSC_RELIABLE is true, so read_tsc is guaranteed
> to be monotonically-strictly-increasing by hardware
> (and enforced for CONSTANT_TSC by check_tsc_warp
> at boot).

Ideally, yes, only perfect TSCs should be used.

But in reality, it's a big performance win for folks who can get away
with just slightly offset TSCs.

thanks
-john

2009-10-16 03:11:35

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 05/12] xen/pvclock: add monotonicity check

On 10/15/09 18:32, john stultz wrote:
>>> No, cycle_last isn't updated on every read, only on timer ticks. This
>>> test doesn't seem to be intended to make sure that every
>>> clocksource_read is globally monotonic, but just to avoid
>>> some boundary
>>> conditions in the timer interrupt. I just copied it directly from
>>> read_tsc().
>>>
>> I understand but you are now essentially emulating a
>> reliable platform timer with a potentially unreliable
>> (but still high resolution) per-CPU timer AND probably
>> delivering that result to userland.
>>
>> Read_tsc should only be used if either CONSTANT_TSC
>> or TSC_RELIABLE is true, so read_tsc is guaranteed
>> to be monotonically-strictly-increasing by hardware
>> (and enforced for CONSTANT_TSC by check_tsc_warp
>> at boot).
>>
> Ideally, yes, only perfect TSCs should be used.
>
> But in reality, it's a big performance win for folks who can get away
> with just slightly offset TSCs.
>

What monotonicity guarantees do we make to usermode, for both syscall
and vsyscall gettimeofday and clock_gettime?

Though it's not clear to me how usermode would even notice very small
amounts of cross-thread/cpu non-monotonicity anyway. It would need to
make sure that it samples the time and stores it to some globally
visible place atomically (with locks, compare-and-swap, etc.), which is
going to be pretty expensive. And if it's going to all that effort, it
may as well do its own monotonicity checking/adjustments if it's all
that important.

(I can think of plenty of ways of doing it incorrectly, where you'd get
apparent non-monotonicity regardless of the quality of the time source.)
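
For illustration, a correct-ish version of that check might look like
the following (entirely made-up userspace code, not part of this
series; getting the load/sample/publish ordering right is exactly where
the subtle bugs live):

#include <stdint.h>
#include <time.h>

static uint64_t last_ns;	/* last published sample, shared by all threads */

/* Returns 1 if this thread provably saw time go backwards. */
static int sample_and_check(void)
{
	struct timespec ts;
	uint64_t now, prev, seen;

	prev = last_ns;		/* sketch assumes a 64-bit atomic load */
	for (;;) {
		clock_gettime(CLOCK_MONOTONIC, &ts);
		now = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;

		/* prev was published before we sampled, so this is real */
		if (now < prev)
			return 1;

		seen = __sync_val_compare_and_swap(&last_ns, prev, now);
		if (seen == prev)
			return 0;	/* published our sample */
		prev = seen;		/* lost the race: resample and recheck */
	}
}

(Built with gcc; clock_gettime needed -lrt at the time.)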

J

2009-10-16 17:58:26

by john stultz

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 05/12] xen/pvclock: add monotonicity check

On Thu, 2009-10-15 at 20:10 -0700, Jeremy Fitzhardinge wrote:
> On 10/15/09 18:32, john stultz wrote:
> >>> No, cycle_last isn't updated on every read, only on timer ticks. This
> >>> test doesn't seem to be intended to make sure that every
> >>> clocksource_read is globally monotonic, but just to avoid
> >>> some boundary
> >>> conditions in the timer interrupt. I just copied it directly from
> >>> read_tsc().
> >>>
> >> I understand but you are now essentially emulating a
> >> reliable platform timer with a potentially unreliable
> >> (but still high resolution) per-CPU timer AND probably
> >> delivering that result to userland.
> >>
> >> Read_tsc should only be used if either CONSTANT_TSC
> >> or TSC_RELIABLE is true, so read_tsc is guaranteed
> >> to be monotonically-strictly-increasing by hardware
> >> (and enforced for CONSTANT_TSC by check_tsc_warp
> >> at boot).
> >>
> > Ideally, yes, only perfect TSCs should be used.
> >
> > But in reality, it's a big performance win for folks who can get away
> > with just slightly offset TSCs.
> >
>
> What monotonicity guarantees do we make to usermode, for both syscall
> and vsyscall gettimeofday and clock_gettime?

The guarantee is that time won't go backwards. It may not always
increase between two calls, but applications should never see an
earlier time after a later one from clock_gettime()/gettimeofday().

> Though it's not clear to me how usermode would even notice very small
> amounts of cross-thread/cpu non-monotonicity anyway. It would need to
> make sure that it samples the time and stores it to some globally
> visible place atomically (with locks, compare-and-swap, etc.), which is
> going to be pretty expensive. And if it's going to all that effort, it
> may as well do its own monotonicity checking/adjustments if it's all
> that important.

If the TSCs are offset enough for a thread to move between cpus and see
an inconsistency, then the TSC needs to be thrown out. The TSC sync
check at boot should provide this.

The cycle_last check in read_tsc() is really only for very, very slight
offsets that would otherwise pass the sync check and that a process
could not detect when switching between cpus (the skew would have to be
smaller than the time it takes to migrate between cpus).
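
For reference, the check in question is the clamp at the end of
read_tsc() in arch/x86/kernel/tsc.c, roughly:

static cycle_t read_tsc(struct clocksource *cs)
{
	cycle_t ret = (cycle_t)get_cycles();

	return ret >= clocksource_tsc.cycle_last ?
		ret : clocksource_tsc.cycle_last;
}

Note that it only clamps the returned value; it never writes
cycle_last, which is what keeps it safe on SMP, unlike the ++cycle_last
variant proposed earlier.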


> (I can think of plenty of ways of doing it incorrectly, where you'd get
> apparent non-monotonicity regardless of the quality of the time source.)

There's been some interesting talk of creating a more offset-robust TSC
clocksource using per-cpu TSC offsets synced periodically against a
global counter like the HPET. It seems like it could work, but there
are a lot of edge cases and it really has to be right all of the time,
so I don't think it's quite as trivial as some folks have thought. But
it would be interesting to see!

thanks
-john

2009-10-18 06:44:04

by Avi Kivity

[permalink] [raw]
Subject: Re: [GIT PULL RFC] pvclock cleanups and pvclock vsyscall support

On 10/16/2009 04:46 AM, Jeremy Fitzhardinge wrote:
> Care to cook up a patch to implement the kvm bits to make sure it all
> works OK for you?
>

I've started to do that, but it occurs to me that we're missing out on
NUMA placement by forcing all clocks to be on the same page. OTOH, if
the clocks are heavily used, they'll stay in cache, and if not, who cares.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-10-18 08:18:11

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 05/12] xen/pvclock: add monotonicity check

On 10/17/09 02:58, john stultz wrote:
>> (I can think of plenty of ways of doing it incorrectly, where you'd get
>> apparent non-monotonicity regardless of the quality of the time source.)
>>
> There's been some interesting talk of creating a more offset-robust TSC
> clocksource using per-cpu TSC offsets synced periodically against a
> global counter like the HPET. It seems like it could work, but there
> are a lot of edge cases and it really has to be right all of the time,
> so I don't think it's quite as trivial as some folks have thought. But
> it would be interesting to see!
>

Xen and KVM do this already. The host system provides the tsc rate and
offset information for each cpu so that the guest kernel can compute a
global system time (ns since boot) from the tsc.

I suspect it's quite difficult to guarantee that you get completely
monotonic results if the tscs run at different rates, but if they're
the same rate and merely offset it should give completely OK results.
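
Roughly, each vcpu gets a pvclock_vcpu_time_info from the host and
scales its local tsc into global ns-since-boot along these lines
(simplified sketch: the real code in arch/x86/kernel/pvclock.c uses a
wider 64x32->96 bit multiply so large deltas don't overflow, and
rereads everything under the version counter with barriers around the
rdtsc):

#include <asm/msr.h>		/* native_read_tsc() */
#include <asm/pvclock-abi.h>	/* struct pvclock_vcpu_time_info */

static u64 pvclock_ns_since_boot(const struct pvclock_vcpu_time_info *src)
{
	u64 delta = native_read_tsc() - src->tsc_timestamp;

	/* scale the tsc delta by the host-provided rate */
	if (src->tsc_shift >= 0)
		delta <<= src->tsc_shift;
	else
		delta >>= -src->tsc_shift;

	return src->system_time + ((delta * src->tsc_to_system_mul) >> 32);
}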

J

2009-10-18 08:18:23

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [GIT PULL RFC] pvclock cleanups and pvclock vsyscall support

On 10/18/09 15:43, Avi Kivity wrote:
> On 10/16/2009 04:46 AM, Jeremy Fitzhardinge wrote:
>> Care to cook up a patch to implement the kvm bits to make sure it all
>> works OK for you?
>>
>
> I've started to do that, but it occurs to me that we're missing out on
> NUMA placement by forcing all clocks to be on the same page. OTOH, if
> the clocks are heavily used, they'll stay in cache, and if not, who
> cares.
>

Yes, I'd say so. I'd expect the data to be very close to read-only, so
the lines should be shared pretty efficiently.

On the other hand, there's nothing to stop us from moving to multiple
pages in future (either to support NUMA placement, or just more than 64
cpus).

J

2009-10-18 08:24:22

by Avi Kivity

[permalink] [raw]
Subject: Re: [GIT PULL RFC] pvclock cleanups and pvclock vsyscall support

On 10/18/2009 05:18 PM, Jeremy Fitzhardinge wrote:
> On 10/18/09 15:43, Avi Kivity wrote:
>
>> On 10/16/2009 04:46 AM, Jeremy Fitzhardinge wrote:
>>
>>> Care to cook up a patch to implement the kvm bits to make sure it all
>>> works OK for you?
>>>
>>>
>> I've started to do that, but it occurs to me that we're missing out on
>> NUMA placement by forcing all clocks to be on the same page. OTOH, if
>> the clocks are heavily used, they'll stay in cache, and if not, who
>> cares.
>>
> Yes, I'd say so. I'd expect the data to be very close to read-only, so
> the lines should be shared pretty efficiently.
>
>

There wouldn't be any sharing since each clock is on its own cache
line. But the point is valid regardless.

> On the other hand, there's nothing to stop us from moving to multiple
> pages in future (either to support NUMA placement, or just more than 64
> cpus).
>

I'm already allocating multiple pages, so we'd just need to adjust the
fixmap.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-10-18 08:44:17

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [GIT PULL RFC] pvclock cleanups and pvclock vsyscall support

On 10/18/09 17:23, Avi Kivity wrote:
>> On the other hand, there's nothing to stop us from moving to multiple
>> pages in future (either to support NUMA placement, or just more than 64
>> cpus).
>>
>
> I'm already allocating multiple pages, so we'd just need to adjust the
> fixmap.

Are you changing pvclock_init_vsyscall() to allocate more pages? More
fixmap slots should be no problem.

J