This series speeds up vclock_gettime(CLOCK_MONOTONIC) by almost 30%
(tested on Sandy Bridge).
I'm hoping someone can test this with Ingo's time-warp-test, which
you can get here:
http://people.redhat.com/mingo/time-warp-test/time-warp-test.c
You'll need to change TEST_CLOCK to 1. I'm especially interested
in Core 2, Pentium D, Westmere, or AMD systems with usable TSCs.
(I've already tested on Sandy Bridge and Bloomfield (Xeon W3520)).
The changes and timings (fastest of 20 trials of 100M iterations each on
Sandy Bridge; a sketch of the timing loop follows the numbers) are:
CLOCK_MONOTONIC: 22.09ns -> 15.66ns
CLOCK_REALTIME_COARSE: 4.23ns -> 3.44ns
CLOCK_MONOTONIC_COARSE: 5.65ns -> 4.23ns
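(For reference, here's roughly the kind of loop behind those numbers. This
is a sketch, not the exact harness I used: ITERS and the single-trial
structure are illustrative, and older glibc needs -lrt for clock_gettime.
Build with something like gcc -O2 bench.c.)

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define ITERS 100000000ULL      /* 100M iterations, as above */

int main(void)
{
        struct timespec ts;
        uint64_t i, start, end;

        /* One trial: time ITERS back-to-back calls; report ns per call.
         * The numbers above are the fastest of several such trials. */
        clock_gettime(CLOCK_MONOTONIC, &ts);
        start = ts.tv_sec * 1000000000ULL + ts.tv_nsec;
        for (i = 0; i < ITERS; i++)
                clock_gettime(CLOCK_MONOTONIC, &ts);
        end = ts.tv_sec * 1000000000ULL + ts.tv_nsec;

        printf("%.2f ns/iter\n", (double)(end - start) / ITERS);
        return 0;
}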
x86-64: Clean up vdso/kernel shared variables
Because vsyscall_gtod_data's address isn't known until load time, the
code contains unnecessary address calculations. The code is also
rather complicated. Clean it up and use addresses that are known at
compile time.
x86-64: Optimize vread_tsc's barriers
This replaces lfence;rdtsc;lfence with a faster sequence with similar
ordering guarantees.
x86-64: Don't generate cmov in vread_tsc
GCC likes to generate a cmov on a branch that's almost completely
predictable. Force it to generate a real branch instead.
x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
vset_normalized_timespec was more general than necessary. Open-code
the appropriate normalization loops. This is a big win for
CLOCK_MONOTONIC_COARSE.
x86-64: Move vread_tsc into a new file with sensible options
This way vread_tsc doesn't have a frame pointer, which saves about
0.3ns. I guess that the CPU's stack frame optimizations aren't quite
as good as I thought.
x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO
We're building the vDSO with optimizations disabled that were meant
for kernel code. Override that, except for -fno-omit-frame-pointer,
which might make userspace debugging harder.
Changes from v1:
- Redo the vsyscall_gtod_data address patch to make the code
cleaner instead of uglier and to make it work for all the
vsyscall variables.
- Improve the comments for clarity and formatting.
- Fix up the changelog for the nsec < 0 tweak (the normalization
code can't be inline because the two callers are different).
- Move vread_tsc into its own file, removing a GCC version
dependence and making it more maintainable.
Ingo, I looked at moving vread_tsc into a .S file, but I think
it would be less maintainable for a few reasons:
- rdtsc_barrier() would need an assembly version. It uses
alternatives.
- The code needs access to the VVAR magic, which would need an
assembly-callable version. (This wouldn't be so bad, but
it's more code.)
- It needs to know the offset of cycle_last. This would
involve adding an extra asm offset.
- I don't think it's that bad in C, and the code it generates
looks good.
Andy Lutomirski (6):
x86-64: Clean up vdso/kernel shared variables
x86-64: Optimize vread_tsc's barriers
x86-64: Don't generate cmov in vread_tsc
x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
x86-64: Move vread_tsc into a new file with sensible options
x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO
arch/x86/include/asm/tsc.h | 4 +++
arch/x86/include/asm/vdso.h | 14 ----------
arch/x86/include/asm/vgtod.h | 2 -
arch/x86/include/asm/vsyscall.h | 12 +-------
arch/x86/include/asm/vvar.h | 52 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/Makefile | 8 +++--
arch/x86/kernel/time.c | 2 +-
arch/x86/kernel/tsc.c | 19 -------------
arch/x86/kernel/vmlinux.lds.S | 34 ++++++++----------------
arch/x86/kernel/vread_tsc_64.c | 55 +++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/vsyscall_64.c | 46 ++++++++++++++------------------
arch/x86/vdso/Makefile | 17 ++++++++++-
arch/x86/vdso/vclock_gettime.c | 43 ++++++++++++++++--------------
arch/x86/vdso/vdso.lds.S | 7 -----
arch/x86/vdso/vextern.h | 16 -----------
arch/x86/vdso/vgetcpu.c | 3 +-
arch/x86/vdso/vma.c | 27 -------------------
arch/x86/vdso/vvar.c | 12 --------
18 files changed, 189 insertions(+), 184 deletions(-)
create mode 100644 arch/x86/include/asm/vvar.h
create mode 100644 arch/x86/kernel/vread_tsc_64.c
delete mode 100644 arch/x86/vdso/vextern.h
delete mode 100644 arch/x86/vdso/vvar.c
--
1.7.4
Variables that are shared between the vdso and the kernel are
currently a bit of a mess. They are each defined with their own
magic, they are accessed differently in the kernel, the vsyscall page,
and the vdso, and one of them (vsyscall_clock) doesn't even really
exist.
This changes them all to use a common mechanism. All of them are
declared in vvar.h with a fixed address (validated by the linker
script). In the kernel (as before), they look like ordinary
read-write variables. In the vsyscall page and the vdso, they are
accessed through a new macro VVAR, which gives read-only access.
The vdso is now loaded verbatim into memory without any fixups. As a
side bonus, access from the vdso is faster because a level of
indirection is removed.
---
arch/x86/include/asm/vdso.h | 14 ----------
arch/x86/include/asm/vgtod.h | 2 -
arch/x86/include/asm/vsyscall.h | 12 +-------
arch/x86/include/asm/vvar.h | 52 +++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/time.c | 2 +-
arch/x86/kernel/tsc.c | 4 +-
arch/x86/kernel/vmlinux.lds.S | 34 ++++++++-----------------
arch/x86/kernel/vsyscall_64.c | 46 +++++++++++++++-------------------
arch/x86/vdso/Makefile | 2 +-
arch/x86/vdso/vclock_gettime.c | 3 +-
arch/x86/vdso/vdso.lds.S | 7 -----
arch/x86/vdso/vextern.h | 16 ------------
arch/x86/vdso/vgetcpu.c | 3 +-
arch/x86/vdso/vma.c | 27 --------------------
arch/x86/vdso/vvar.c | 12 ---------
15 files changed, 91 insertions(+), 145 deletions(-)
create mode 100644 arch/x86/include/asm/vvar.h
delete mode 100644 arch/x86/vdso/vextern.h
delete mode 100644 arch/x86/vdso/vvar.c
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 9064052..bb05228 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -1,20 +1,6 @@
#ifndef _ASM_X86_VDSO_H
#define _ASM_X86_VDSO_H
-#ifdef CONFIG_X86_64
-extern const char VDSO64_PRELINK[];
-
-/*
- * Given a pointer to the vDSO image, find the pointer to VDSO64_name
- * as that symbol is defined in the vDSO sources or linker script.
- */
-#define VDSO64_SYMBOL(base, name) \
-({ \
- extern const char VDSO64_##name[]; \
- (void *)(VDSO64_##name - VDSO64_PRELINK + (unsigned long)(base)); \
-})
-#endif
-
#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
extern const char VDSO32_PRELINK[];
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index 3d61e20..646b4c1 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -23,8 +23,6 @@ struct vsyscall_gtod_data {
struct timespec wall_to_monotonic;
struct timespec wall_time_coarse;
};
-extern struct vsyscall_gtod_data __vsyscall_gtod_data
-__section_vsyscall_gtod_data;
extern struct vsyscall_gtod_data vsyscall_gtod_data;
#endif /* _ASM_X86_VGTOD_H */
diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index d0983d2..d555973 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -16,27 +16,19 @@ enum vsyscall_num {
#ifdef __KERNEL__
#include <linux/seqlock.h>
-#define __section_vgetcpu_mode __attribute__ ((unused, __section__ (".vgetcpu_mode"), aligned(16)))
-#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16)))
-
/* Definitions for CONFIG_GENERIC_TIME definitions */
-#define __section_vsyscall_gtod_data __attribute__ \
- ((unused, __section__ (".vsyscall_gtod_data"),aligned(16)))
-#define __section_vsyscall_clock __attribute__ \
- ((unused, __section__ (".vsyscall_clock"),aligned(16)))
#define __vsyscall_fn \
__attribute__ ((unused, __section__(".vsyscall_fn"))) notrace
#define VGETCPU_RDTSCP 1
#define VGETCPU_LSL 2
-extern int __vgetcpu_mode;
-extern volatile unsigned long __jiffies;
-
/* kernel space (writeable) */
extern int vgetcpu_mode;
extern struct timezone sys_tz;
+#include <asm/vvar.h>
+
extern void map_vsyscall(void);
#endif /* __KERNEL__ */
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
new file mode 100644
index 0000000..131ce61
--- /dev/null
+++ b/arch/x86/include/asm/vvar.h
@@ -0,0 +1,52 @@
+/*
+ * vvar.h: Shared vDSO/kernel variable declarations
+ * Copyright (c) 2011 Andy Lutomirski
+ * Subject to the GNU General Public License, version 2
+ *
+ * A handful of variables are accessible (read-only) from userspace
+ * code in the vsyscall page and the vdso. They are declared here.
+ * Some other file must define them with DEFINE_VVAR.
+ *
+ * In normal kernel code, they are used like any other variable.
+ * In user code, they are accessed through the VVAR macro.
+ *
+ * Each of these variables lives in the vsyscall page, and each
+ * one needs a unique offset within the little piece of the page
+ * reserved for vvars. Specify that offset in DECLARE_VVAR.
+ * (There are 896 bytes available. If you mess up, the linker will
+ * catch it.)
+ */
+
+/* Offset of vars within vsyscall page */
+#define VSYSCALL_VARS_OFFSET (3072 + 128)
+
+#if defined(__VVAR_KERNEL_LDS)
+
+/* The kernel linker script defines its own magic to put vvars in the
+ * right place.
+ */
+#define DECLARE_VVAR(offset, type, name) \
+ EMIT_VVAR(name, VSYSCALL_VARS_OFFSET + offset)
+
+#else
+
+#define DECLARE_VVAR(offset, type, name) \
+ static type const * const vvaraddr_ ## name = \
+ (void *)(VSYSCALL_START + VSYSCALL_VARS_OFFSET + (offset));
+
+#define DEFINE_VVAR(type, name) \
+ type __vvar_ ## name \
+ __attribute__((section(".vsyscall_var_" #name), aligned(16)))
+
+#define VVAR(name) (*vvaraddr_ ## name)
+
+#endif
+
+/* DECLARE_VVAR(offset, type, name) */
+
+DECLARE_VVAR(0, volatile unsigned long, jiffies)
+DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
+DECLARE_VVAR(256, int, vgetcpu_mode)
+
+#undef DECLARE_VVAR
+#undef VSYSCALL_VARS_OFFSET
diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index 25a28a2..00cbb27 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -23,7 +23,7 @@
#include <asm/time.h>
#ifdef CONFIG_X86_64
-volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES;
+DEFINE_VVAR(volatile unsigned long, jiffies) = INITIAL_JIFFIES;
#endif
unsigned long profile_pc(struct pt_regs *regs)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ffe5755..bc46566 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -777,8 +777,8 @@ static cycle_t __vsyscall_fn vread_tsc(void)
ret = (cycle_t)vget_cycles();
rdtsc_barrier();
- return ret >= __vsyscall_gtod_data.clock.cycle_last ?
- ret : __vsyscall_gtod_data.clock.cycle_last;
+ return ret >= VVAR(vsyscall_gtod_data).clock.cycle_last ?
+ ret : VVAR(vsyscall_gtod_data).clock.cycle_last;
}
#endif
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index bf47007..bbe6cc2 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -160,6 +160,12 @@ SECTIONS
#define VVIRT_OFFSET (VSYSCALL_ADDR - __vsyscall_0)
#define VVIRT(x) (ADDR(x) - VVIRT_OFFSET)
+#define EMIT_VVAR(x, offset) .vsyscall_var_ ## x \
+ ADDR(.vsyscall_0) + offset \
+ : AT(VLOAD(.vsyscall_var_ ## x)) { \
+ *(.vsyscall_var_ ## x) \
+ } \
+ x = VVIRT(.vsyscall_var_ ## x);
. = ALIGN(4096);
__vsyscall_0 = .;
@@ -174,18 +180,6 @@ SECTIONS
*(.vsyscall_fn)
}
- . = ALIGN(L1_CACHE_BYTES);
- .vsyscall_gtod_data : AT(VLOAD(.vsyscall_gtod_data)) {
- *(.vsyscall_gtod_data)
- }
-
- vsyscall_gtod_data = VVIRT(.vsyscall_gtod_data);
- .vsyscall_clock : AT(VLOAD(.vsyscall_clock)) {
- *(.vsyscall_clock)
- }
- vsyscall_clock = VVIRT(.vsyscall_clock);
-
-
.vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) {
*(.vsyscall_1)
}
@@ -193,21 +187,14 @@ SECTIONS
*(.vsyscall_2)
}
- .vgetcpu_mode : AT(VLOAD(.vgetcpu_mode)) {
- *(.vgetcpu_mode)
- }
- vgetcpu_mode = VVIRT(.vgetcpu_mode);
-
- . = ALIGN(L1_CACHE_BYTES);
- .jiffies : AT(VLOAD(.jiffies)) {
- *(.jiffies)
- }
- jiffies = VVIRT(.jiffies);
-
.vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) {
*(.vsyscall_3)
}
+#define __VVAR_KERNEL_LDS
+#include <asm/vvar.h>
+#undef __VVAR_KERNEL_LDS
+
. = __vsyscall_0 + PAGE_SIZE;
#undef VSYSCALL_ADDR
@@ -215,6 +202,7 @@ SECTIONS
#undef VLOAD
#undef VVIRT_OFFSET
#undef VVIRT
+#undef EMIT_VVAR
#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index dcbb28c..5f6ad03 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -49,15 +49,8 @@
__attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace
#define __syscall_clobber "r11","cx","memory"
-/*
- * vsyscall_gtod_data contains data that is :
- * - readonly from vsyscalls
- * - written by timer interrupt or systcl (/proc/sys/kernel/vsyscall64)
- * Try to keep this structure as small as possible to avoid cache line ping pongs
- */
-int __vgetcpu_mode __section_vgetcpu_mode;
-
-struct vsyscall_gtod_data __vsyscall_gtod_data __section_vsyscall_gtod_data =
+DEFINE_VVAR(int, vgetcpu_mode);
+DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
{
.lock = SEQLOCK_UNLOCKED,
.sysctl_enabled = 1,
@@ -97,7 +90,7 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
*/
static __always_inline void do_get_tz(struct timezone * tz)
{
- *tz = __vsyscall_gtod_data.sys_tz;
+ *tz = VVAR(vsyscall_gtod_data).sys_tz;
}
static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz)
@@ -126,23 +119,24 @@ static __always_inline void do_vgettimeofday(struct timeval * tv)
unsigned long mult, shift, nsec;
cycle_t (*vread)(void);
do {
- seq = read_seqbegin(&__vsyscall_gtod_data.lock);
+ seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
- vread = __vsyscall_gtod_data.clock.vread;
- if (unlikely(!__vsyscall_gtod_data.sysctl_enabled || !vread)) {
+ vread = VVAR(vsyscall_gtod_data).clock.vread;
+ if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled ||
+ !vread)) {
gettimeofday(tv,NULL);
return;
}
now = vread();
- base = __vsyscall_gtod_data.clock.cycle_last;
- mask = __vsyscall_gtod_data.clock.mask;
- mult = __vsyscall_gtod_data.clock.mult;
- shift = __vsyscall_gtod_data.clock.shift;
+ base = VVAR(vsyscall_gtod_data).clock.cycle_last;
+ mask = VVAR(vsyscall_gtod_data).clock.mask;
+ mult = VVAR(vsyscall_gtod_data).clock.mult;
+ shift = VVAR(vsyscall_gtod_data).clock.shift;
- tv->tv_sec = __vsyscall_gtod_data.wall_time_sec;
- nsec = __vsyscall_gtod_data.wall_time_nsec;
- } while (read_seqretry(&__vsyscall_gtod_data.lock, seq));
+ tv->tv_sec = VVAR(vsyscall_gtod_data).wall_time_sec;
+ nsec = VVAR(vsyscall_gtod_data).wall_time_nsec;
+ } while (read_seqretry(&VVAR(vsyscall_gtod_data).lock, seq));
/* calculate interval: */
cycle_delta = (now - base) & mask;
@@ -171,15 +165,15 @@ time_t __vsyscall(1) vtime(time_t *t)
{
unsigned seq;
time_t result;
- if (unlikely(!__vsyscall_gtod_data.sysctl_enabled))
+ if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled))
return time_syscall(t);
do {
- seq = read_seqbegin(&__vsyscall_gtod_data.lock);
+ seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
- result = __vsyscall_gtod_data.wall_time_sec;
+ result = VVAR(vsyscall_gtod_data).wall_time_sec;
- } while (read_seqretry(&__vsyscall_gtod_data.lock, seq));
+ } while (read_seqretry(&VVAR(vsyscall_gtod_data).lock, seq));
if (t)
*t = result;
@@ -208,9 +202,9 @@ vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
We do this here because otherwise user space would do it on
its own in a likely inferior way (no access to jiffies).
If you don't like it pass NULL. */
- if (tcache && tcache->blob[0] == (j = __jiffies)) {
+ if (tcache && tcache->blob[0] == (j = VVAR(jiffies))) {
p = tcache->blob[1];
- } else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
+ } else if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
/* Load per CPU data from RDTSCP */
native_read_tscp(&p);
} else {
diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
index b6552b1..a651861 100644
--- a/arch/x86/vdso/Makefile
+++ b/arch/x86/vdso/Makefile
@@ -11,7 +11,7 @@ vdso-install-$(VDSO32-y) += $(vdso32-images)
# files to link into the vdso
-vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vvar.o
+vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
# files to link into kernel
obj-$(VDSO64-y) += vma.o vdso.o
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index ee55754..0b873d4 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -22,9 +22,8 @@
#include <asm/hpet.h>
#include <asm/unistd.h>
#include <asm/io.h>
-#include "vextern.h"
-#define gtod vdso_vsyscall_gtod_data
+#define gtod (&VVAR(vsyscall_gtod_data))
notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
{
diff --git a/arch/x86/vdso/vdso.lds.S b/arch/x86/vdso/vdso.lds.S
index 4e5dd3b..81f2500 100644
--- a/arch/x86/vdso/vdso.lds.S
+++ b/arch/x86/vdso/vdso.lds.S
@@ -28,10 +28,3 @@ VERSION {
}
VDSO64_PRELINK = VDSO_PRELINK;
-
-/*
- * Define VDSO64_x for each VEXTERN(x), for use via VDSO64_SYMBOL.
- */
-#define VEXTERN(x) VDSO64_ ## x = vdso_ ## x;
-#include "vextern.h"
-#undef VEXTERN
diff --git a/arch/x86/vdso/vextern.h b/arch/x86/vdso/vextern.h
deleted file mode 100644
index 1683ba2..0000000
--- a/arch/x86/vdso/vextern.h
+++ /dev/null
@@ -1,16 +0,0 @@
-#ifndef VEXTERN
-#include <asm/vsyscall.h>
-#define VEXTERN(x) \
- extern typeof(x) *vdso_ ## x __attribute__((visibility("hidden")));
-#endif
-
-#define VMAGIC 0xfeedbabeabcdefabUL
-
-/* Any kernel variables used in the vDSO must be exported in the main
- kernel's vmlinux.lds.S/vsyscall.h/proper __section and
- put into vextern.h and be referenced as a pointer with vdso prefix.
- The main kernel later fills in the values. */
-
-VEXTERN(jiffies)
-VEXTERN(vgetcpu_mode)
-VEXTERN(vsyscall_gtod_data)
diff --git a/arch/x86/vdso/vgetcpu.c b/arch/x86/vdso/vgetcpu.c
index 9fbc6b2..5463ad5 100644
--- a/arch/x86/vdso/vgetcpu.c
+++ b/arch/x86/vdso/vgetcpu.c
@@ -11,14 +11,13 @@
#include <linux/time.h>
#include <asm/vsyscall.h>
#include <asm/vgtod.h>
-#include "vextern.h"
notrace long
__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
{
unsigned int p;
- if (*vdso_vgetcpu_mode == VGETCPU_RDTSCP) {
+ if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
/* Load per CPU data from RDTSCP */
native_read_tscp(&p);
} else {
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 4b5d26f..7abd2be 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -15,9 +15,6 @@
#include <asm/proto.h>
#include <asm/vdso.h>
-#include "vextern.h" /* Just for VMAGIC. */
-#undef VEXTERN
-
unsigned int __read_mostly vdso_enabled = 1;
extern char vdso_start[], vdso_end[];
@@ -26,20 +23,10 @@ extern unsigned short vdso_sync_cpuid;
static struct page **vdso_pages;
static unsigned vdso_size;
-static inline void *var_ref(void *p, char *name)
-{
- if (*(void **)p != (void *)VMAGIC) {
- printk("VDSO: variable %s broken\n", name);
- vdso_enabled = 0;
- }
- return p;
-}
-
static int __init init_vdso_vars(void)
{
int npages = (vdso_end - vdso_start + PAGE_SIZE - 1) / PAGE_SIZE;
int i;
- char *vbase;
vdso_size = npages << PAGE_SHIFT;
vdso_pages = kmalloc(sizeof(struct page *) * npages, GFP_KERNEL);
@@ -54,20 +41,6 @@ static int __init init_vdso_vars(void)
copy_page(page_address(p), vdso_start + i*PAGE_SIZE);
}
- vbase = vmap(vdso_pages, npages, 0, PAGE_KERNEL);
- if (!vbase)
- goto oom;
-
- if (memcmp(vbase, "\177ELF", 4)) {
- printk("VDSO: I'm broken; not ELF\n");
- vdso_enabled = 0;
- }
-
-#define VEXTERN(x) \
- *(typeof(__ ## x) **) var_ref(VDSO64_SYMBOL(vbase, x), #x) = &__ ## x;
-#include "vextern.h"
-#undef VEXTERN
- vunmap(vbase);
return 0;
oom:
diff --git a/arch/x86/vdso/vvar.c b/arch/x86/vdso/vvar.c
deleted file mode 100644
index 1b7e703..0000000
--- a/arch/x86/vdso/vvar.c
+++ /dev/null
@@ -1,12 +0,0 @@
-/* Define pointer to external vDSO variables.
- These are part of the vDSO. The kernel fills in the real addresses
- at boot time. This is done because when the vdso is linked the
- kernel isn't yet and we don't know the final addresses. */
-#include <linux/kernel.h>
-#include <linux/time.h>
-#include <asm/vsyscall.h>
-#include <asm/timex.h>
-#include <asm/vgtod.h>
-
-#define VEXTERN(x) typeof (__ ## x) *const vdso_ ## x = (void *)VMAGIC;
-#include "vextern.h"
--
1.7.4
vclock_gettime's do_monotonic helper can't ever generate a negative
nsec value, so it doesn't need to check whether it's negative. In
the CLOCK_MONOTONIC_COARSE case, ns is the sum of two already
normalized values, each less than NSEC_PER_SEC, so it can't ever
exceed 2e9-1 and we can avoid the loop entirely. This saves a
single easily-predicted branch.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/vdso/vclock_gettime.c | 40 ++++++++++++++++++++++------------------
1 files changed, 22 insertions(+), 18 deletions(-)
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 0b873d4..28b2c00 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -55,22 +55,6 @@ notrace static noinline int do_realtime(struct timespec *ts)
return 0;
}
-/* Copy of the version in kernel/time.c which we cannot directly access */
-notrace static void
-vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
-{
- while (nsec >= NSEC_PER_SEC) {
- nsec -= NSEC_PER_SEC;
- ++sec;
- }
- while (nsec < 0) {
- nsec += NSEC_PER_SEC;
- --sec;
- }
- ts->tv_sec = sec;
- ts->tv_nsec = nsec;
-}
-
notrace static noinline int do_monotonic(struct timespec *ts)
{
unsigned long seq, ns, secs;
@@ -81,7 +65,17 @@ notrace static noinline int do_monotonic(struct timespec *ts)
secs += gtod->wall_to_monotonic.tv_sec;
ns += gtod->wall_to_monotonic.tv_nsec;
} while (unlikely(read_seqretry(&gtod->lock, seq)));
- vset_normalized_timespec(ts, secs, ns);
+
+ /* wall_time_nsec, vgetns(), and wall_to_monotonic.tv_nsec
+ * are all guaranteed to be nonnegative.
+ */
+ while (ns >= NSEC_PER_SEC) {
+ ns -= NSEC_PER_SEC;
+ ++secs;
+ }
+ ts->tv_sec = secs;
+ ts->tv_nsec = ns;
+
return 0;
}
@@ -106,7 +100,17 @@ notrace static noinline int do_monotonic_coarse(struct timespec *ts)
secs += gtod->wall_to_monotonic.tv_sec;
ns += gtod->wall_to_monotonic.tv_nsec;
} while (unlikely(read_seqretry(&gtod->lock, seq)));
- vset_normalized_timespec(ts, secs, ns);
+
+ /* wall_time_nsec and wall_to_monotonic.tv_nsec are
+ * guaranteed to be between 0 and NSEC_PER_SEC.
+ */
+ if (ns >= NSEC_PER_SEC) {
+ ns -= NSEC_PER_SEC;
+ ++secs;
+ }
+ ts->tv_sec = secs;
+ ts->tv_nsec = ns;
+
return 0;
}
--
1.7.4
RDTSC is completely unordered on modern Intel and AMD CPUs. The
Intel manual says that lfence;rdtsc causes all previous instructions
to complete before the tsc is read, and the AMD manual says to use
mfence;rdtsc to do the same thing.
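(For reference, the kernel already encodes that per-vendor choice in
rdtsc_barrier(), roughly like this; I'm quoting from memory, so treat it
as a sketch:)

static __always_inline void rdtsc_barrier(void)
{
        /* NOPs unless the CPU sets the corresponding feature bit, in
         * which case the right fence is patched in at boot. */
        alternative(ASM_NOP3, "mfence", X86_FEATURE_MFENCE_RDTSC);
        alternative(ASM_NOP3, "lfence", X86_FEATURE_LFENCE_RDTSC);
}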
We want a stronger guarantee, though: we want the tsc to be read
before any memory access that occurs after the call to
vclock_gettime (or vgettimeofday). We currently guarantee that with
a second lfence or mfence. This sequence is not really supported by
the manual (AFAICT) and it's also slow.
This patch changes the rdtsc to use implicit memory ordering instead
of the second fence. The sequence looks like this:
{l,m}fence
rdtsc
mov [something dependent on edx],[tmp]
return [some function of tmp]
This means that the time stamp has to be read before the load, and
the return value depends on tmp. All x86-64 chips guarantee that no
memory access after a load moves before that load. This means that
all memory access after vread_tsc occurs after the time stamp is
read.
The trick is that the answer should not actually change as a result
of the sneaky memory access. I accomplish this by shifting rdx left
by 32 bits, twice, to generate the number zero. (I can't imagine
that any CPU can break that dependency.) Then I use "zero" as an
offset to a memory access that we had to do anyway.
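Here's the same trick restated as a standalone userspace sketch, in case
anyone wants to play with it outside the vdso. 'dummy' stands in for
cycle_last and a bare lfence stands in for rdtsc_barrier(), so this is an
illustration rather than the code in the patch:

static volatile unsigned long dummy;    /* stands in for cycle_last */

static inline unsigned long rdtsc_ordered_sketch(void)
{
        unsigned long ret, zero;

        asm volatile ("lfence\n\t"              /* order wrt earlier loads */
                      "rdtsc\n\t"
                      "shl $0x20,%%rdx\n\t"
                      "or %%rdx,%%rax\n\t"
                      "shl $0x20,%%rdx"         /* rdx == 0, but data-dependent on rdtsc */
                      : "=a" (ret), "=d" (zero) : : "cc");

        /* This load can't issue until rdtsc produces rdx; x86-64 keeps
         * loads ordered, so every later load stays after rdtsc too. */
        (void)*(volatile unsigned long *)((char *)&dummy + zero);

        return ret;
}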
On Sandy Bridge (i7-2600), this improves a loop of
clock_gettime(CLOCK_MONOTONIC) by 5 ns/iter (from ~22.7 to ~17.7).
time-warp-test still passes.
I suspect that it's sufficient to just load something dependent on
edx without using the result, but I don't see any solid evidence in
the manual that CPUs won't eliminate useless loads. I leave scary
stuff like that to the real experts.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/tsc.c | 37 ++++++++++++++++++++++++++++++-------
1 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index bc46566..858c084 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -767,18 +767,41 @@ static cycle_t read_tsc(struct clocksource *cs)
static cycle_t __vsyscall_fn vread_tsc(void)
{
cycle_t ret;
+ u64 zero, last;
/*
- * Surround the RDTSC by barriers, to make sure it's not
- * speculated to outside the seqlock critical section and
- * does not cause time warps:
+ * rdtsc is unordered, and we want it to be ordered like
+ * a load with respect to other CPUs (and we don't want
+ * it to execute absurdly early wrt code on this CPU).
+ * rdtsc_barrier() is a barrier that provides this ordering
+ * with respect to *earlier* loads. (Which barrier to use
+ * depends on the CPU.)
*/
rdtsc_barrier();
- ret = (cycle_t)vget_cycles();
- rdtsc_barrier();
- return ret >= VVAR(vsyscall_gtod_data).clock.cycle_last ?
- ret : VVAR(vsyscall_gtod_data).clock.cycle_last;
+ asm volatile ("rdtsc\n\t"
+ "shl $0x20,%%rdx\n\t"
+ "or %%rdx,%%rax\n\t"
+ "shl $0x20,%%rdx"
+ : "=a" (ret), "=d" (zero) : : "cc");
+
+ /*
+ * zero == 0, but as far as the processor is concerned, zero
+ * depends on the output of rdtsc. So we can use it as a
+ * load barrier by loading something that depends on it.
+ * x86-64 keeps all loads in order wrt each other, so this
+ * ensures that rdtsc is ordered wrt all later loads.
+ */
+
+ /*
+ * This doesn't multiply 'zero' by anything, which *should*
+ * generate nicer code, except that gcc cleverly embeds the
+ * dereference into the cmp and the cmovae. Oh, well.
+ */
+ last = *( (cycle_t *)
+ ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
+
+ return ret >= last ? ret : last;
}
#endif
--
1.7.4
vread_tsc is short and hot, and it's userspace code so the usual
reasons to keep frame pointers around, enable -pg, and turn off
sibling calls don't apply.
(OK, turning off sibling calls has no effect. But it might
someday...)
As an added benefit, tsc.c is profilable now.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/tsc.h | 4 +++
arch/x86/kernel/Makefile | 8 +++--
arch/x86/kernel/tsc.c | 53 --------------------------------------
arch/x86/kernel/vread_tsc_64.c | 55 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 64 insertions(+), 56 deletions(-)
create mode 100644 arch/x86/kernel/vread_tsc_64.c
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 1ca132f..8f2b3c6 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -51,6 +51,10 @@ extern int unsynchronized_tsc(void);
extern int check_tsc_unstable(void);
extern unsigned long native_calibrate_tsc(void);
+#ifdef CONFIG_X86_64
+extern cycles_t vread_tsc(void);
+#endif
+
/*
* Boot-time check whether the TSCs are synchronized across
* all CPUs/cores:
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 34244b2..7626fb8 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -8,7 +8,6 @@ CPPFLAGS_vmlinux.lds += -U$(UTS_MACHINE)
ifdef CONFIG_FUNCTION_TRACER
# Do not profile debug and lowlevel utilities
-CFLAGS_REMOVE_tsc.o = -pg
CFLAGS_REMOVE_rtc.o = -pg
CFLAGS_REMOVE_paravirt-spinlocks.o = -pg
CFLAGS_REMOVE_pvclock.o = -pg
@@ -24,13 +23,16 @@ endif
nostackp := $(call cc-option, -fno-stack-protector)
CFLAGS_vsyscall_64.o := $(PROFILING) -g0 $(nostackp)
CFLAGS_hpet.o := $(nostackp)
-CFLAGS_tsc.o := $(nostackp)
+CFLAGS_vread_tsc_64.o := $(nostackp)
CFLAGS_paravirt.o := $(nostackp)
GCOV_PROFILE_vsyscall_64.o := n
GCOV_PROFILE_hpet.o := n
GCOV_PROFILE_tsc.o := n
GCOV_PROFILE_paravirt.o := n
+# vread_tsc_64 is hot and should be fully optimized:
+CFLAGS_REMOVE_vread_tsc_64.o = -pg -fno-omit-frame-pointer -fno-optimize-sibling-calls
+
obj-y := process_$(BITS).o signal.o entry_$(BITS).o
obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
obj-y += time.o ioport.o ldt.o dumpstack.o
@@ -39,7 +41,7 @@ obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-$(CONFIG_X86_32) += probe_roms_32.o
obj-$(CONFIG_X86_32) += sys_i386_32.o i386_ksyms_32.o
obj-$(CONFIG_X86_64) += sys_x86_64.o x8664_ksyms_64.o
-obj-$(CONFIG_X86_64) += syscall_64.o vsyscall_64.o
+obj-$(CONFIG_X86_64) += syscall_64.o vsyscall_64.o vread_tsc_64.o
obj-y += bootflag.o e820.o
obj-y += pci-dma.o quirks.o i8237.o topology.o kdebugfs.o
obj-y += alternative.o i8253.o pci-nommu.o hw_breakpoint.o
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 69ff619..5346381 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -763,59 +763,6 @@ static cycle_t read_tsc(struct clocksource *cs)
ret : clocksource_tsc.cycle_last;
}
-#ifdef CONFIG_X86_64
-static cycle_t __vsyscall_fn vread_tsc(void)
-{
- cycle_t ret;
- u64 zero, last;
-
- /*
- * rdtsc is unordered, and we want it to be ordered like
- * a load with respect to other CPUs (and we don't want
- * it to execute absurdly early wrt code on this CPU).
- * rdtsc_barrier() is a barrier that provides this ordering
- * with respect to *earlier* loads. (Which barrier to use
- * depends on the CPU.)
- */
- rdtsc_barrier();
-
- asm volatile ("rdtsc\n\t"
- "shl $0x20,%%rdx\n\t"
- "or %%rdx,%%rax\n\t"
- "shl $0x20,%%rdx"
- : "=a" (ret), "=d" (zero) : : "cc");
-
- /*
- * zero == 0, but as far as the processor is concerned, zero
- * depends on the output of rdtsc. So we can use it as a
- * load barrier by loading something that depends on it.
- * x86-64 keeps all loads in order wrt each other, so this
- * ensures that rdtsc is ordered wrt all later loads.
- */
-
- /*
- * This doesn't multiply 'zero' by anything, which generates
- * very slightly nicer code than multiplying it by 8.
- */
- last = *( (cycle_t *)
- ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
-
- if (likely(ret >= last))
- return ret;
-
- /*
- * GCC likes to generate cmov here, but this branch is extremely
- * predictable (it's just a function of time and the likely is
- * very likely) and there's a data dependence, so force GCC
- * to generate a branch instead. I don't barrier() because
- * we don't actually need a barrier, and if this function
- * ever gets inlined it will generate worse code.
- */
- asm volatile ("");
- return last;
-}
-#endif
-
static void resume_tsc(struct clocksource *cs)
{
clocksource_tsc.cycle_last = 0;
diff --git a/arch/x86/kernel/vread_tsc_64.c b/arch/x86/kernel/vread_tsc_64.c
new file mode 100644
index 0000000..856330e
--- /dev/null
+++ b/arch/x86/kernel/vread_tsc_64.c
@@ -0,0 +1,55 @@
+/* This code runs in userspace. */
+
+#define DISABLE_BRANCH_PROFILING
+#include <asm/vgtod.h>
+
+notrace cycle_t __vsyscall_fn vread_tsc(void)
+{
+ cycle_t ret;
+ u64 zero, last;
+
+ /*
+ * rdtsc is unordered, and we want it to be ordered like
+ * a load with respect to other CPUs (and we don't want
+ * it to execute absurdly early wrt code on this CPU).
+ * rdtsc_barrier() is a barrier that provides this ordering
+ * with respect to *earlier* loads. (Which barrier to use
+ * depends on the CPU.)
+ */
+ rdtsc_barrier();
+
+ asm volatile ("rdtsc\n\t"
+ "shl $0x20,%%rdx\n\t"
+ "or %%rdx,%%rax\n\t"
+ "shl $0x20,%%rdx"
+ : "=a" (ret), "=d" (zero) : : "cc");
+
+ /*
+ * zero == 0, but as far as the processor is concerned, zero
+ * depends on the output of rdtsc. So we can use it as a
+ * load barrier by loading something that depends on it.
+ * x86-64 keeps all loads in order wrt each other, so this
+ * ensures that rdtsc is ordered wrt all later loads.
+ */
+
+ /*
+ * This doesn't multiply 'zero' by anything, which generates
+ * very slightly nicer code than multiplying it by 8.
+ */
+ last = *( (cycle_t *)
+ ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
+
+ if (likely(ret >= last))
+ return ret;
+
+ /*
+ * GCC likes to generate cmov here, but this branch is extremely
+ * predictable (it's just a function of time and the likely is
+ * very likely) and there's a data dependence, so force GCC
+ * to generate a branch instead. I don't barrier() because
+ * we don't actually need a barrier, and if this function
+ * ever gets inlined it will generate worse code.
+ */
+ asm volatile ("");
+ return last;
+}
--
1.7.4
The vDSO isn't part of the kernel, so profiling and kernel
backtraces don't really matter.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/vdso/Makefile | 15 ++++++++++++++-
1 files changed, 14 insertions(+), 1 deletions(-)
diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
index a651861..bef0bc9 100644
--- a/arch/x86/vdso/Makefile
+++ b/arch/x86/vdso/Makefile
@@ -37,11 +37,24 @@ $(obj)/%.so: OBJCOPYFLAGS := -S
$(obj)/%.so: $(obj)/%.so.dbg FORCE
$(call if_changed,objcopy)
+#
+# Don't omit frame pointers for ease of userspace debugging, but do
+# optimize sibling calls.
+#
CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \
- $(filter -g%,$(KBUILD_CFLAGS)) $(call cc-option, -fno-stack-protector)
+ $(filter -g%,$(KBUILD_CFLAGS)) $(call cc-option, -fno-stack-protector) \
+ -fno-omit-frame-pointer -foptimize-sibling-calls
$(vobjs): KBUILD_CFLAGS += $(CFL)
+#
+# vDSO code runs in userspace and -pg doesn't help with profiling anyway.
+#
+CFLAGS_REMOVE_vdso-note.o = -pg
+CFLAGS_REMOVE_vclock_gettime.o = -pg
+CFLAGS_REMOVE_vgetcpu.o = -pg
+CFLAGS_REMOVE_vvar.o = -pg
+
targets += vdso-syms.lds
obj-$(VDSO64-y) += vdso-syms.lds
--
1.7.4
vread_tsc checks whether rdtsc returns something less than
cycle_last, which is an extremely predictable branch. GCC likes
to generate a cmov anyway, which is several cycles slower than
a predicted branch. This saves a couple of nanoseconds.
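Boiled down to a generic sketch (function and variable names made up), the
trick is just an empty asm between the two return paths, which is enough to
keep GCC from if-converting the branch into a cmov:

static unsigned long clamp_not_below(unsigned long v, unsigned long floor)
{
        if (__builtin_expect(v >= floor, 1))
                return v;
        asm volatile ("");      /* optimization barrier: forces a real branch */
        return floor;
}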
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/tsc.c | 19 +++++++++++++++----
1 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 858c084..69ff619 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -794,14 +794,25 @@ static cycle_t __vsyscall_fn vread_tsc(void)
*/
/*
- * This doesn't multiply 'zero' by anything, which *should*
- * generate nicer code, except that gcc cleverly embeds the
- * dereference into the cmp and the cmovae. Oh, well.
+ * This doesn't multiply 'zero' by anything, which generates
+ * very slightly nicer code than multiplying it by 8.
*/
last = *( (cycle_t *)
((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
- return ret >= last ? ret : last;
+ if (likely(ret >= last))
+ return ret;
+
+ /*
+ * GCC likes to generate cmov here, but this branch is extremely
+ * predictable (it's just a function of time and the likely is
+ * very likely) and there's a data dependence, so force GCC
+ * to generate a branch instead. I don't barrier() because
+ * we don't actually need a barrier, and if this function
+ * ever gets inlined it will generate worse code.
+ */
+ asm volatile ("");
+ return last;
}
#endif
--
1.7.4
* Andy Lutomirski <[email protected]> wrote:
> vread_tsc checks whether rdtsc returns something less than
> cycle_last, which is an extremely predictable branch. GCC likes
> to generate a cmov anyway, which is several cycles slower than
> a predicted branch. This saves a couple of nanoseconds.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/kernel/tsc.c | 19 +++++++++++++++----
> 1 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 858c084..69ff619 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -794,14 +794,25 @@ static cycle_t __vsyscall_fn vread_tsc(void)
> */
>
> /*
> - * This doesn't multiply 'zero' by anything, which *should*
> - * generate nicer code, except that gcc cleverly embeds the
> - * dereference into the cmp and the cmovae. Oh, well.
> + * This doesn't multiply 'zero' by anything, which generates
> + * very slightly nicer code than multiplying it by 8.
> */
> last = *( (cycle_t *)
> ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
>
> - return ret >= last ? ret : last;
> + if (likely(ret >= last))
> + return ret;
> +
> + /*
> + * GCC likes to generate cmov here, but this branch is extremely
> + * predictable (it's just a function of time and the likely is
> + * very likely) and there's a data dependence, so force GCC
> + * to generate a branch instead. I don't barrier() because
> + * we don't actually need a barrier, and if this function
> + * ever gets inlined it will generate worse code.
> + */
> + asm volatile ("");
Hm, you have not addressed the review feedback i gave in:
Message-ID: <[email protected]>
Thanks,
Ingo
* Andy Lutomirski <[email protected]> wrote:
> +
> + /* wall_time_nsec, vgetns(), and wall_to_monotonic.tv_nsec
> + * are all guaranteed to be nonnegative.
> + */
> + /* wall_time_nsec and wall_to_monotonic.tv_nsec are
> + * guaranteed to be between 0 and NSEC_PER_SEC.
> + */
Nor have you addressed stylistic review feedback i gave before ...
Really, we absolutely must insist on 110% attention to small details for
patches that modify assembly code.
Thanks,
Ingo
* Andy Lutomirski <[email protected]> wrote:
> The vDSO isn't part of the kernel, so profiling and kernel
> backtraces don't really matter.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/vdso/Makefile | 15 ++++++++++++++-
> 1 files changed, 14 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
> index a651861..bef0bc9 100644
> --- a/arch/x86/vdso/Makefile
> +++ b/arch/x86/vdso/Makefile
> @@ -37,11 +37,24 @@ $(obj)/%.so: OBJCOPYFLAGS := -S
> $(obj)/%.so: $(obj)/%.so.dbg FORCE
> $(call if_changed,objcopy)
>
> +#
> +# Don't omit frame pointers for ease of userspace debugging, but do
> +# optimize sibling calls.
> +#
> CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \
> - $(filter -g%,$(KBUILD_CFLAGS)) $(call cc-option, -fno-stack-protector)
> + $(filter -g%,$(KBUILD_CFLAGS)) $(call cc-option, -fno-stack-protector) \
> + -fno-omit-frame-pointer -foptimize-sibling-calls
>
> $(vobjs): KBUILD_CFLAGS += $(CFL)
>
> +#
> +# vDSO code runs in userspace and -pg doesn't help with profiling anyway.
> +#
> +CFLAGS_REMOVE_vdso-note.o = -pg
> +CFLAGS_REMOVE_vclock_gettime.o = -pg
> +CFLAGS_REMOVE_vgetcpu.o = -pg
> +CFLAGS_REMOVE_vvar.o = -pg
The comment makes no sense in this form: -pg is not used for profiling but
for in-kernel tracing.
Also, please do not do multiple changes in a single patch; split out the -pg,
the -fno-omit-frame-pointer and the -foptimize-sibling-calls changes into two
patches. This will also help bisection, should some GCC version do something
silly with -foptimize-sibling-calls ...
Thanks,
Ingo
* Andy Lutomirski <[email protected]> wrote:
> Variables that are shared between the vdso and the kernel are
> currently a bit of a mess. They are each defined with their own
> magic, they are accessed differently in the kernel, the vsyscall page,
> and the vdso, and one of them (vsyscall_clock) doesn't even really
> exist.
>
> This changes them all to use a common mechanism. All of them are
> declared in vvar.h with a fixed address (validated by the linker
> script). In the kernel (as before), they look like ordinary
> read-write variables. In the vsyscall page and the vdso, they are
> accessed through a new macro VVAR, which gives read-only access.
>
> The vdso is now loaded verbatim into memory without any fixups. As a
> side bonus, access from the vdso is faster because a level of
> indirection is removed.
Ok, that's a pretty nice consolidation and speedup.
This layout:
> +DECLARE_VVAR(0, volatile unsigned long, jiffies)
> +DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
> +DECLARE_VVAR(256, int, vgetcpu_mode)
Is spread out too much, using up several separate cachelines with nothing else
on them. Why not pack it tightly, with natural alignment taken into
consideration?
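Something tightly packed along these lines, say (offsets purely
illustrative, natural alignment kept):

  DECLARE_VVAR(0, volatile unsigned long, jiffies)
  DECLARE_VVAR(8, int, vgetcpu_mode)
  DECLARE_VVAR(16, struct vsyscall_gtod_data, vsyscall_gtod_data)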
Thanks,
Ingo
(Cc:-ed more memory ordering folks.)
* Andy Lutomirski <[email protected]> wrote:
> RDTSC is completely unordered on modern Intel and AMD CPUs. The
> Intel manual says that lfence;rdtsc causes all previous instructions
> to complete before the tsc is read, and the AMD manual says to use
> mfence;rdtsc to do the same thing.
>
> We want a stronger guarantee, though: we want the tsc to be read
> before any memory access that occurs after the call to
> vclock_gettime (or vgettimeofday). We currently guarantee that with
> a second lfence or mfence. This sequence is not really supported by
> the manual (AFAICT) and it's also slow.
>
> This patch changes the rdtsc to use implicit memory ordering instead
> of the second fence. The sequence looks like this:
>
> {l,m}fence
> rdtsc
> mov [something dependent on edx],[tmp]
> return [some function of tmp]
>
> This means that the time stamp has to be read before the load, and
> the return value depends on tmp. All x86-64 chips guarantee that no
> memory access after a load moves before that load. This means that
> all memory access after vread_tsc occurs after the time stamp is
> read.
>
> The trick is that the answer should not actually change as a result
> of the sneaky memory access. I accomplish this by shifting rdx left
> by 32 bits, twice, to generate the number zero. (I can't imagine
> that any CPU can break that dependency.) Then I use "zero" as an
> offset to a memory access that we had to do anyway.
>
> On Sandy Bridge (i7-2600), this improves a loop of
> clock_gettime(CLOCK_MONOTONIC) by 5 ns/iter (from ~22.7 to ~17.7).
> time-warp-test still passes.
>
> I suspect that it's sufficient to just load something dependent on
> edx without using the result, but I don't see any solid evidence in
> the manual that CPUs won't eliminate useless loads. I leave scary
> stuff like that to the real experts.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/kernel/tsc.c | 37 ++++++++++++++++++++++++++++++-------
> 1 files changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index bc46566..858c084 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -767,18 +767,41 @@ static cycle_t read_tsc(struct clocksource *cs)
> static cycle_t __vsyscall_fn vread_tsc(void)
> {
> cycle_t ret;
> + u64 zero, last;
>
> /*
> - * Surround the RDTSC by barriers, to make sure it's not
> - * speculated to outside the seqlock critical section and
> - * does not cause time warps:
> + * rdtsc is unordered, and we want it to be ordered like
> + * a load with respect to other CPUs (and we don't want
> + * it to execute absurdly early wrt code on this CPU).
> + * rdtsc_barrier() is a barrier that provides this ordering
> + * with respect to *earlier* loads. (Which barrier to use
> + * depends on the CPU.)
> */
> rdtsc_barrier();
> - ret = (cycle_t)vget_cycles();
> - rdtsc_barrier();
>
> - return ret >= VVAR(vsyscall_gtod_data).clock.cycle_last ?
> - ret : VVAR(vsyscall_gtod_data).clock.cycle_last;
> + asm volatile ("rdtsc\n\t"
> + "shl $0x20,%%rdx\n\t"
> + "or %%rdx,%%rax\n\t"
> + "shl $0x20,%%rdx"
> + : "=a" (ret), "=d" (zero) : : "cc");
> +
> + /*
> + * zero == 0, but as far as the processor is concerned, zero
> + * depends on the output of rdtsc. So we can use it as a
> + * load barrier by loading something that depends on it.
> + * x86-64 keeps all loads in order wrt each other, so this
> + * ensures that rdtsc is ordered wrt all later loads.
> + */
> +
> + /*
> + * This doesn't multiply 'zero' by anything, which *should*
> + * generate nicer code, except that gcc cleverly embeds the
> + * dereference into the cmp and the cmovae. Oh, well.
> + */
> + last = *( (cycle_t *)
> + ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
> +
> + return ret >= last ? ret : last;
Ok, that's like totally sick ;-)
It's a software barrier in essence, using data dependency obfuscation.
First objection would be the extra memory references: we have a general
aversion against memory references. The memory reference here is arguably
special, it is to the stack so should be in cache and should be pretty fast.
But the code really looks too tricky for its own good.
For example this assumption:
> The trick is that the answer should not actually change as a result
> of the sneaky memory access. I accomplish this by shifting rdx left
> by 32 bits, twice, to generate the number zero. (I can't imagine
> that any CPU can break that dependency.) Then I use "zero" as an
is not particularly future-proof. Yes, i doubt any CPU will bother to figure
out the data dependency across such a sequence, but synthetic CPUs might and
who knows what the far future brings.
Also, do we *really* have RDTSC SMP-coherency guarantees on multi-socket CPUs
today? It now works on multi-core, but on bigger NUMA i strongly doubt it. So
this hack tries to preserve something that we wont be able to offer anyway.
So the much better optimization would be to give up on exact GTOD coherency and
just make sure the same task does not see time going backwards. If user-space
wants precise coherency it can use synchronization primitives itself. By default
it would get the fast and possibly off by a few cycles thing instead. We'd
never seriously jump in time - only small jumps would happen in practice,
depending on CPU parallelism effects.
If we do that then the optimization would be to RDTSC and not use *any* of the
barriers, neither the hardware ones nor your tricky software data-dependency
obfuscation barrier.
Note that doing this will also have other advantages: we wont really need
alternatives patching, thus we could more easily move this code into .S - which
would allow further optimizations, such as the elimination of this GCC
inflicted slowdown:
> + /*
> + * This doesn't multiply 'zero' by anything, which *should*
> + * generate nicer code, except that gcc cleverly embeds the
> + * dereference into the cmp and the cmovae. Oh, well.
> + */
> + last = *( (cycle_t *)
> + ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
> +
> + return ret >= last ? ret : last;
Thanks,
Ingo
On Thu, Apr 7, 2011 at 3:54 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> vread_tsc checks whether rdtsc returns something less than
>> cycle_last, which is an extremely predictable branch.  GCC likes
>> to generate a cmov anyway, which is several cycles slower than
>> a predicted branch.  This saves a couple of nanoseconds.
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>>  arch/x86/kernel/tsc.c |   19 +++++++++++++++----
>>  1 files changed, 15 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>> index 858c084..69ff619 100644
>> --- a/arch/x86/kernel/tsc.c
>> +++ b/arch/x86/kernel/tsc.c
>> @@ -794,14 +794,25 @@ static cycle_t __vsyscall_fn vread_tsc(void)
>>        */
>>
>>       /*
>> -      * This doesn't multiply 'zero' by anything, which *should*
>> -      * generate nicer code, except that gcc cleverly embeds the
>> -      * dereference into the cmp and the cmovae.  Oh, well.
>> +      * This doesn't multiply 'zero' by anything, which generates
>> +      * very slightly nicer code than multiplying it by 8.
>>        */
>>       last = *( (cycle_t *)
>>                 ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
>>
>> -     return ret >= last ? ret : last;
>> +     if (likely(ret >= last))
>> +             return ret;
>> +
>> +     /*
>> +      * GCC likes to generate cmov here, but this branch is extremely
>> +      * predictable (it's just a function of time and the likely is
>> +      * very likely) and there's a data dependence, so force GCC
>> +      * to generate a branch instead.  I don't barrier() because
>> +      * we don't actually need a barrier, and if this function
>> +      * ever gets inlined it will generate worse code.
>> +      */
>> +     asm volatile ("");
>
> Hm, you have not addressed the review feedback i gave in:
>
>  Message-ID: <[email protected]>
I can change that, but if anyone ever inlines this function (and Andi
suggested that as another future optimization), then they'd want to
undo it, because it will generate worse code. (barrier() has the
unnecessary memory clobber.)
--Andy
>
> Thanks,
>
>        Ingo
>
On Thu, Apr 7, 2011 at 3:57 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> +
>> +     /* wall_time_nsec, vgetns(), and wall_to_monotonic.tv_nsec
>> +      * are all guaranteed to be nonnegative.
>> +      */
>
>> +     /* wall_time_nsec and wall_to_monotonic.tv_nsec are
>> +      * guaranteed to be between 0 and NSEC_PER_SEC.
>> +      */
>
> Nor have you addressed stylistic review feedback i gave before ...
>
> Really, we absolutely must insist on 110% attention to small details for
> patches that modify assembly code.
Crud, I fixed most of those. I'll get these in v3. (I assume you're
talking about comment style.)
--Andy
>
> Thanks,
>
>        Ingo
>
On Thu, Apr 7, 2011 at 4:25 AM, Ingo Molnar <[email protected]> wrote:
>
> (Cc:-ed more memory ordering folks.)
>
>> --- a/arch/x86/kernel/tsc.c
>> +++ b/arch/x86/kernel/tsc.c
>> @@ -767,18 +767,41 @@ static cycle_t read_tsc(struct clocksource *cs)
>>  static cycle_t __vsyscall_fn vread_tsc(void)
>>  {
>>       cycle_t ret;
>> +     u64 zero, last;
>>
>>       /*
>> -      * Surround the RDTSC by barriers, to make sure it's not
>> -      * speculated to outside the seqlock critical section and
>> -      * does not cause time warps:
>> +      * rdtsc is unordered, and we want it to be ordered like
>> +      * a load with respect to other CPUs (and we don't want
>> +      * it to execute absurdly early wrt code on this CPU).
>> +      * rdtsc_barrier() is a barrier that provides this ordering
>> +      * with respect to *earlier* loads.  (Which barrier to use
>> +      * depends on the CPU.)
>>        */
>>       rdtsc_barrier();
>> -     ret = (cycle_t)vget_cycles();
>> -     rdtsc_barrier();
>>
>> -     return ret >= VVAR(vsyscall_gtod_data).clock.cycle_last ?
>> -             ret : VVAR(vsyscall_gtod_data).clock.cycle_last;
>> +     asm volatile ("rdtsc\n\t"
>> +                   "shl $0x20,%%rdx\n\t"
>> +                   "or %%rdx,%%rax\n\t"
>> +                   "shl $0x20,%%rdx"
>> +                   : "=a" (ret), "=d" (zero) : : "cc");
>> +
>> +     /*
>> +      * zero == 0, but as far as the processor is concerned, zero
>> +      * depends on the output of rdtsc.  So we can use it as a
>> +      * load barrier by loading something that depends on it.
>> +      * x86-64 keeps all loads in order wrt each other, so this
>> +      * ensures that rdtsc is ordered wrt all later loads.
>> +      */
>> +
>> +     /*
>> +      * This doesn't multiply 'zero' by anything, which *should*
>> +      * generate nicer code, except that gcc cleverly embeds the
>> +      * dereference into the cmp and the cmovae.  Oh, well.
>> +      */
>> +     last = *( (cycle_t *)
>> +               ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
>> +
>> +     return ret >= last ? ret : last;
>
> Ok, that's like totally sick ;-)
>
> It's a software barrier in essence, using data dependency obfuscation.
>
> First objection would be the extra memory references: we have a general
> aversion against memory references. The memory reference here is arguably
> special, it is to the stack so should be in cache and should be pretty fast.
>
> But the code really looks too tricky for its own good.
>
> For example this assumption:
>
>> The trick is that the answer should not actually change as a result
>> of the sneaky memory access.  I accomplish this by shifting rdx left
>> by 32 bits, twice, to generate the number zero.  (I can't imagine
>> that any CPU can break that dependency.)  Then I use "zero" as an
>
> is not particularly future-proof. Yes, i doubt any CPU will bother to figure
> out the data dependency across such a sequence, but synthetic CPUs might and
> who knows what the far future brings.
That's fixable with a bit more work. Imagine (whitespace damaged):
asm volatile ("rdtsc\n\t"
"shl $0x20,%%rdx\n\t"
"or %%rdx,%%rax\n\t"
"shr $0x3f,%%rdx"
: "=a" (ret), "=d" (zero_or_one) : : "cc");
last = VVAR(vsyscall_gtod_data).clock.cycle_last[zero_or_one];
For this to work, cycle_last would have to be an array containing two
identical values, and we'd want to be a little careful to keep the
whole mess in one cacheline. Now I think it's really safe because
zero_or_one isn't constant at all, so the CPU has to wait for its
value no matter how clever it is.
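As a sketch (struct and field placement purely illustrative, not a real
patch), the data side would become something like:

        struct clock_sketch {
                /* ... */
                cycle_t cycle_last[2];  /* both slots always written with the
                                         * same value, kept in one cacheline */
        };

        /* update path */
        vsyscall_gtod_data.clock.cycle_last[0] = tsc;
        vsyscall_gtod_data.clock.cycle_last[1] = tsc;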
>
> Also, do we *really* have RDTSC SMP-coherency guarantees on multi-socket CPUs
> today? It now works on multi-core, but on bigger NUMA i strongly doubt it. So
> this hack tries to preserve something that we wont be able to offer anyway.
>
> So the much better optimization would be to give up on exact GTOD coherency and
> just make sure the same task does not see time going backwards. If user-space
> wants precise coherency it can use synchronization primitives itself. By default
> it would get the fast and possibly off by a few cycles thing instead. We'd
> never seriously jump in time - only small jumps would happen in practice,
> depending on CPU parallelism effects.
>
> If we do that then the optimization would be to RDTSC and not use *any* of the
> barriers, neither the hardware ones nor your tricky software data-dependency
> obfuscation barrier.
IIRC I measured 200ns time warps on Sandy Bridge when I tried that.
(I think you won't see them that easily with time-warp-test in TSC
mode, because there's a locking instruction before the rdtsc and very
close to it.) It would save about 4ns more, I think, which isn't bad.
Personally, I'm working on this code because I'm migrating a bunch of
code that likes to timestamp itself from Windows to Linux, and one of
the really big attractions to Linux is that it has clock_gettime,
which is fast, pretty much warp-free, and actually shows *wall* time
with high precision. The closest thing that Windows has is
QueryPerformanceCounter, which is a giant PITA because it doesn't
track wall time (although it's still slightly faster than
clock_gettime even with this patch). If I have to re-add
software-enforced clock coherency to the Linux version, I'll be sad.
--Andy
>
> Note that doing this will also have other advantages: we wont really need
> alternatives patching, thus we could more easily move this code into .S - which
> would allow further optimizations, such as the elimination of this GCC
> inflicted slowdown:
>
>> +     /*
>> +      * This doesn't multiply 'zero' by anything, which *should*
>> +      * generate nicer code, except that gcc cleverly embeds the
>> +      * dereference into the cmp and the cmovae.  Oh, well.
>> +      */
>> +     last = *( (cycle_t *)
>> +               ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
>> +
>> +     return ret >= last ? ret : last;
>
> Thanks,
>
>        Ingo
>
> Also, do we *really* have RDTSC SMP-coherency guarantees on multi-socket CPUs
> today? It now works on multi-core, but on bigger NUMA i strongly doubt it. So
> this hack tries to preserve something that we wont be able to offer anyway.
Some larger NUMA systems have explicit TSC consistency in hardware; on those that don't
we disable TSC as a clocksource, so this path should never be taken.
> So the much better optimization would be to give up on exact GTOD coherency and
> just make sure the same task does not see time going backwards. If user-space
> wants precise coherency it can use synchronization primitives itself. By default
> it would get the fast and possibly off by a few cycles thing instead. We'd
> never seriously jump in time - only small jumps would happen in practice,
> depending on CPU parallelism effects.
That would be a big user-visible break in compatibility.
Any small jump can lead to a negative time difference, and negative time differences
are known to break applications.
E.g. the typical case is an app using this as an event time stamp in a buffer written
from multiple CPUs, and then assuming that the time stamp always goes up.
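Roughly this pattern (illustration only, not real code):

        struct record {
                struct timespec ts;     /* taken on whichever CPU ran the event */
                /* event payload ... */
        };

        void log_event(struct record *r)
        {
                clock_gettime(CLOCK_MONOTONIC, &r->ts);
                /* fill in payload; a consumer later walks the buffer and
                 * assumes the timestamps never decrease */
        }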
> If we do that then the optimization would be to RDTSC and not use *any* of the
> barriers, neither the hardware ones nor your tricky software data-dependency
> obfuscation barrier.
The barriers were originally added because a stress test was able to observe
time going backwards without them.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, Apr 6, 2011 at 7:03 PM, Andy Lutomirski <[email protected]> wrote:
>
> We want a stronger guarantee, though: we want the tsc to be read
> before any memory access that occurs after the call to
> vclock_gettime (or vgettimeofday). We currently guarantee that with
> a second lfence or mfence. This sequence is not really supported by
> the manual (AFAICT) and it's also slow.
>
> This patch changes the rdtsc to use implicit memory ordering instead
> of the second fence.
Ok, I have to admit to a horrid fascination with this patch. The
"saves 5ns on Sandybridge" sounds like a quite noticeable win, and I
always hated the stupid barriers (and argued strongly for avoiding
them when we don't care enough - which is why the rdtsc_barrier() is a
separate thing, rather than being inside rdtsc).
And your approach to serialize things is *so* disgusting that it's funny.
That said, I think your commentary is wrong, and I think the
serialization for that reason is at least slightly debatable.
> All x86-64 chips guarantee that no
> memory access after a load moves before that load.
This is not true.
The x86-64 memory ordering semantics are that YOU CANNOT TELL that a
memory access happened before the load, but that's not the same thing
at all as saying that no memory access can happen before it. It just means
that the memory re-ordering logic will have to keep the cachelines
around until loads have completed in-order, and will have to retry if
it turns out that a cacheline was evicted in the wrong order (and thus
the values might have been changed by another CPU or DMA in ways that
make the out-of-order loads visible).
So that's problem #1 with your explanation.
Now, the _reason_ I don't think this is much of a problem is that I
seriously don't think we care. The whole "no moving memory ops around"
thing isn't about some correctness thing, it's really about trying to
keep the rdtsc "sufficiently bounded". And I do think that making the
rdtsc result impact the address calculation for the last_cycle thing
is perfectly fine in that respect.
So I don't think that the "all x86-64 chips guarantee.." argument is
correct at all, but I do think that in practice it's a good approach.
I fundamentally think that "barriers" are crap, and have always
thought that the x86 model of implicit memory barriers and "smart
re-ordering" is much much superior to the crazy model of memory
barriers. I think the same is true of lfence and rdtsc.
Which may be one reason why I like your patch - I think getting rid of
an lfence is simply a fundamental improvement.
> The trick is that the answer should not actually change as a result
> of the sneaky memory access. I accomplish this by shifting rdx left
> by 32 bits, twice, to generate the number zero. (I can't imagine
> that any CPU can break that dependency.) Then I use "zero" as an
> offset to a memory access that we had to do anyway.
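(For reference, a stand-alone user-space rendition of the trick being quoted here
might look like the sketch below; "cycle_last" is only a local stand-in for the
kernel's VVAR(vsyscall_gtod_data).clock.cycle_last, and the lfence is folded into
the asm for brevity.)

#include <stdint.h>

static volatile uint64_t cycle_last;	/* stand-in, updated elsewhere */

static inline uint64_t sketch_vread_tsc(void)
{
	uint64_t ret, zero, last;

	asm volatile("lfence\n\t"
		     "rdtsc\n\t"
		     "shl $0x20, %%rdx\n\t"
		     "or  %%rdx, %%rax\n\t"	/* rax = full 64-bit TSC */
		     "shl $0x20, %%rdx"		/* rdx = 0, but data-dependent on rdtsc */
		     : "=a" (ret), "=d" (zero) : : "cc");

	/* This load cannot issue until rdtsc has produced "zero". */
	last = *(volatile uint64_t *)((volatile char *)&cycle_last + zero);

	return ret >= last ? ret : last;
}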
So this is my other nit-pick: I can _easily_ imagine a CPU breaking
that dependency. In fact, I can pretty much guarantee that the CPU
design I was myself involved with at Transmeta would have broken it if
it had just been 64-bit (because it would first notice that eflags are
dead, then combine the two shifts into one, and then make the
shift-by-64 be a zero).
Now, I agree that it's less likely with a traditional CPU core, but
there are actually other ways to break the dependency too, namely by
doing things like address or data speculation on the load, and simply
do the load much earlier - and then just verify the address/data
afterwards.
Likely? No. But the point I'm trying to make is that there are at
least two quite reasonable ways to avoid the dependency.
Which again I actually think is probably perfectly ok. It's ok exactly because
(a) the dependency avoidance is pretty unlikely in any foreseeable future
(b) we don't even _care_ too much even if it's avoided (ie the
annoying lfence wasn't that important to begin with - the time warps
of the clock aren't disastrous enough to be a major issue, and I bet
this approach - even if not totally water-tight - is still limiting
enough for scheduling that it cuts down the warps a lot)
So again, I mainly object to the _description_, not the approach itself.
The "lfence" was never some kind of absolute barrier. We really don't
care about any one particular instruction, we care about the rdtsc
just not moving around _too_ much. So I'm certainly ok with using
other approaches to limit CPU instruction scheduling.
I dunno. I would suggest:
- marking "cycle_last" as being "volatile"
- not doing the second shift by 32, but instead by 29, leaving the
three high bits "random".
- doing "last = (&__vsyscall_gtod_data.clock.cycle_last)[edx]"
instead, which should generate something like
movq __vsyscall_gtod_data+40(,%rdx,8),%rax
which does the last three bits, and makes it slightly harder for the
CPU to notice that there are no actual bits of real data left (again -
"slightly harder" is not at all "impossible" - if the CPU internally
does the address generation as a shift+add, it could notice that all
the shifts add up to 64).
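In code, that suggestion might look roughly like the sketch below. The access is
kept entirely in asm because the equivalent C pointer arithmetic (an index on the
order of 2^61) would be undefined, and "cycle_last" is again only a stand-in
symbol, not the real vsyscall variable:

#include <stdint.h>

static volatile uint64_t cycle_last;	/* stand-in, as in the earlier sketch */

static inline uint64_t sketch_vread_tsc_indexed(void)
{
	uint64_t ret, last;

	asm volatile("lfence\n\t"
		     "rdtsc\n\t"
		     "shl $0x20, %%rdx\n\t"
		     "or  %%rdx, %%rax\n\t"
		     "shl $0x1d, %%rdx\n\t"	/* 32 + 29: three TSC bits left in rdx[63:61] */
		     "mov (%2,%%rdx,8), %1"	/* the x8 scale pushes them past bit 63 */
		     : "=&a" (ret), "=&r" (last)
		     : "r" (&cycle_last)
		     : "rdx", "cc", "memory");

	/*
	 * The effective address wraps back to &cycle_last, but the load
	 * still depends on the rdtsc result.
	 */
	return ret >= last ? ret : last;
}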
But before that, I would check that the second lfence is needed AT
ALL. I don't know whether we ever tried the "only barrier before"
version, because quite frankly, if you call this thing in a loop, then
a single barrier will still mean that there are barriers in between
iterations. So...
There may well be a reason why Intel says just "lfence ; rdtsc" -
maybe the rdtsc is never really moved down anyway, and it's only the
"move up" barrier that anybody ever really needs.
Hmm?
So before trying to be clever, let's try to be simple. Ok?
Linus
> But before that, I would check that the second lfence is needed AT
> ALL. I don't know whether we ever tried the "only barrier before"
> version, because quite frankly, if you call this thing in a loop, then
> a single barrier will still mean that there are barriers in between
> iterations. So...
The testers will call it in a loop, but real users (e.g. using it for
a logging timestamp) may not.
I'm sure a single barrier would have fixed the testers, as you point out,
but the goal wasn't to only fix testers.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Apr 7, 2011 at 9:42 AM, Andi Kleen <[email protected]> wrote:
>
> I'm sure a single barrier would have fixed the testers, as you point out,
> but the goal wasn't to only fix testers.
You missed two points:
- first off, do we have any reason to believe that the rdtsc would
migrate down _anyway_? As AndyL says, both Intel and AMD seem to
document only the "[lm]fence + rdtsc" thing with a single fence
instruction before.
Instruction scheduling isn't some kind of theoretical game. It's a
very practical issue, and CPU schedulers are constrained to do a good
job quickly and _effectively_. In other words, instructions don't just
schedule randomly. In the presence of the barrier, is there any reason
to believe that the rdtsc would really schedule oddly? There is never
any reason to _delay_ an rdtsc (it can take no cache misses or wait on
any other resources), so when it is not able to move up, where would
it move?
IOW, it's all about "practical vs theoretical". Sure, in theory the
rdtsc could move down arbitrarily. In _practice_, the caller is going
to use the result (and if it doesn't, the value doesn't matter), and
thus the CPU will have data dependencies etc that constrain
scheduling. But more practically, there's no reason to delay
scheduling, because an rdtsc isn't going to be waiting for any
interesting resources, and it's also not going to be holding up any
more important resources (iow, sure, you'd like to schedule subsequent
loads early, but those won't be fighting over the same resource with
the rdtsc anyway, so I don't see any reason that would delay the rdtsc
and move it down).
So I suspect that one lfence (before) is basically the _same_ as the
current two lfences (around). Now, I can't guarantee it, but in the
absence of numbers to the contrary, there really isn't much reason to
believe otherwise. Especially considering the Intel/AMD
_documentation_. So we should at least try it, I think.
- the reason "back-to-back" (with the extreme example being in a
tight loop) matters is that if something isn't in a tight loop, any
jitter we see in the time counting wouldn't be visible anyway. One
random timestamp is meaningless on its own. It's only when you have
multiple ones that you can compare them. No?
So _before_ we try some really clever data dependency trick with new
inline asm and magic "double shifts to create a zero" things, I really
would suggest just trying to remove the second lfence entirely and see
how that works. Maybe it doesn't work, but ...
Linus
* Andi Kleen <[email protected]> wrote:
> > So the much better optimization would be to give up on exact GTOD coherency
> > and just make sure the same task does not see time going backwards. If
> > user-space wants precise coherency it can use synchronization primitives
> > itself. By default it would get the fast and possibly off by a few cycles
> > thing instead. We'd never see a serious jump in time - only small jumps
> > would happen in practice, depending on CPU parallelism effects.
>
> That would be a big user visible break in compatibility.
Those are big scary words, but you really need to think this through:
Firstly, what is the user expectation? That a GTOD timestamp is provided. What
will the user application do with that timestamp? Store it, for later use.
So there's implicit ordering all around these timestamps.
Secondly, x86 hardware never did a good job keeping our GTOD timestamps
coherent, so while there *is* an expectation of certain behavior (Andy's, for
example), it's not widespread at all.
I bet that Linus's single-sided barrier approach will be good enough in practice
to meet Andy's needs. We might not be able to remove both barriers, but the
tricks look really fragile ...
Andy, mind trying out Linus's suggestion? It should bring us more of a speedup
and it would keep this code even simpler.
Thanks,
Ingo
> Instruction scheduling isn't some kind of theoretical game. It's a
> very practical issue, and CPU schedulers are constrained to do a good
> job quickly and _effectively_. In other words, instructions don't just
> schedule randomly. In the presence of the barrier, is there any reason
> to believe that the rdtsc would really schedule oddly? There is never
> any reason to _delay_ an rdtsc (it can take no cache misses or wait on
> any other resources), so when it is not able to move up, where would
> it move?
CPUs are complex beasts and I'm sure there are scheduling
constraints neither you nor I have ever heard of :-)
There are always odd corner cases, e.g. if you have a correctable
error somewhere internally it may add a stall on some unit but not
on others, which may delay an arbitrary uop.
Also there can be reordering against reading xtime and friends.
> - the reason "back-to-back" (with the extreme example being in a
> tight loop) matters is that if something isn't in a tight loop, any
> jitter we see in the time counting wouldn't be visible anyway. One
> random timestamp is meaningless on its own. It's only when you have
> multiple ones that you can compare them. No?
There's also the multiple CPUs logging to a shared buffer case.
I thought Vojtech's original test case was something like that in fact.
> So _before_ we try some really clever data dependency trick with new
> inline asm and magic "double shifts to create a zero" things, I really
> would suggest just trying to remove the second lfence entirely and see
> how that works. Maybe it doesn't work, but ...
I would prefer to be safe than sorry. Also, there are still other things
to optimize anyway (I suggested a few in my earlier mail) which
are 100% safe, unlike this. Maybe those would be enough to offset
the cost of the "paranoid lfence".
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Apr 7, 2011 at 11:15 AM, Andi Kleen <[email protected]> wrote:
>
> I would prefer to be safe than sorry.
There's a difference between "safe" and "making up theoretical
arguments for the sake of an argument".
If Intel _documented_ the "barriers on each side", I think you'd have a point.
As it is, we're not doing the "safe" thing, we're doing the "extra
crap that costs us and nobody has ever shown is actually worth it".
Linus
On Thu, Apr 7, 2011 at 2:30 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 7, 2011 at 11:15 AM, Andi Kleen <[email protected]> wrote:
>>
>> I would prefer to be safe than sorry.
>
> There's a difference between "safe" and "making up theoretical
> arguments for the sake of an argument".
>
> If Intel _documented_ the "barriers on each side", I think you'd have a point.
>
> As it is, we're not doing the "safe" thing, we're doing the "extra
> crap that costs us and nobody has ever shown is actually worth it".
Speaking as both a userspace programmer who wants to use clock_gettime
and as the sucker who has to test this thing, I'd like to agree on
what clock_gettime is *supposed* to do. I propose:
For the purposes of ordering, clock_gettime acts as though there is a
volatile variable that contains the time and is kept up-to-date by
some thread. clock_gettime reads that variable. This means that
clock_gettime is not a barrier but is ordered at least as strongly* as
a read to a volatile variable. If code that calls clock_gettime needs
stronger ordering, it should add additional barriers as appropriate.
* Modulo errata, BIOS bugs, implementation bugs, etc.
This means, for example, that if you do:
volatile int a = 0;
int b;
struct timespec c, d;
Thread 1:
a = 1;
clock_gettime(CLOCK_MONOTONIC, &c);
Thread 2:
clock_gettime(CLOCK_MONOTONIC, &d)
b = a;
you would expect in a fully serialized world that if b == 0 then d <=
c. (That is, if b == 0 then thread 2 finished before thread 1
started.) I think that this does not deserve to work, although it
will by accident on AMD systems. (But unless Intel's lfence is a lot
stronger than advertised, it will probably fail dramatically on Intel,
both in current and proposed code.)
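Spelled out as a compilable pthread-based test (the result check is an addition
of this sketch, and a single run proves nothing - in practice you'd run it in a
tight loop):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

static volatile int a = 0;
static int b;
static struct timespec c, d;

static void *thread1(void *arg)
{
	(void)arg;
	a = 1;
	clock_gettime(CLOCK_MONOTONIC, &c);
	return NULL;
}

static void *thread2(void *arg)
{
	(void)arg;
	clock_gettime(CLOCK_MONOTONIC, &d);
	b = a;
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, thread1, NULL);
	pthread_create(&t2, NULL, thread2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* Full serialization would imply: b == 0  =>  d <= c. */
	if (b == 0 && (d.tv_sec > c.tv_sec ||
		       (d.tv_sec == c.tv_sec && d.tv_nsec > c.tv_nsec)))
		printf("b == 0 but d > c\n");
	return 0;
}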
If you agree with my proposal, I'll try to test it with and without
the magic extra barrier and I'll even write it up for Documentation
and maybe man-pages.
--Andy
>
>                               Linus
>
* On Thu, Apr 07, 2011 at 10:20:31AM -0700, Linus Torvalds <[email protected]> wrote:
>On Thu, Apr 7, 2011 at 9:42 AM, Andi Kleen <[email protected]> wrote:
>> I'm sure a single barrier would have fixed the testers, as you point out,
>> but the goal wasn't to only fix testers.
>
>You missed two points:
>
> - first off, do we have any reason to believe that the rdtsc would
>migrate down _anyway_? As AndyL says, both Intel and AMD seem to
>document only the "[lm]fence + rdtsc" thing with a single fence
>instruction before.
>
> Instruction scheduling isn't some kind of theoretical game. It's a
>very practical issue, and CPU schedulers are constrained to do a good
>job quickly and _effectively_. In other words, instructions don't just
>schedule randomly. In the presence of the barrier, is there any reason
>to believe that the rdtsc would really schedule oddly? There is never
>any reason to _delay_ an rdtsc (it can take no cache misses or wait on
>any other resources), so when it is not able to move up, where would
>it move?
>
>IOW, it's all about "practical vs theoretical". Sure, in theory the
>rdtsc could move down arbitrarily. In _practice_, the caller is going
>to use the result (and if it doesn't, the value doesn't matter), and
>thus the CPU will have data dependencies etc that constrain
>scheduling. But more practically, there's no reason to delay
>scheduling, because an rdtsc isn't going to be waiting for any
>interesting resources, and it's also not going to be holding up any
>more important resources (iow, sure, you'd like to schedule subsequent
>loads early, but those won't be fighting over the same resource with
>the rdtsc anyway, so I don't see any reason that would delay the rdtsc
>and move it down).
>
>So I suspect that one lfence (before) is basically the _same_ as the
>current two lfences (around). Now, I can't guarantee it, but in the
>absence of numbers to the contrary, there really isn't much reason to
>believe otherwise. Especially considering the Intel/AMD
>_documentation_. So we should at least try it, I think.
If only one lfence or serializing instruction is to be used, can't we
just use the RDTSCP instruction (X86_FEATURE_RDTSCP, available only on >=
Core i7 and on AMD), which provides both the TSC and an upward serializing
guarantee? I see that instruction being used elsewhere in the kernel to
obtain the current cpu/node (vgetcpu); is it not possible to use it in
this case as well?
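For reference, reading the counter via RDTSCP looks roughly like this (gcc-style
inline asm sketch; the CPUID feature check is omitted). The ecx output is the
IA32_TSC_AUX value that vgetcpu uses for its cpu/node lookup:

#include <stdint.h>

static inline uint64_t sketch_rdtscp(uint32_t *aux)
{
	uint32_t lo, hi;

	/* rdtscp: TSC in edx:eax, IA32_TSC_AUX in ecx. */
	asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (*aux));
	return ((uint64_t)hi << 32) | lo;
}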
>
> - the reason "back-to-back" (with the extreme example being in a
>tight loop) matters is that if something isn't in a tight loop, any
>jitter we see in the time counting wouldn't be visible anyway. One
>random timestamp is meaningless on its own. It's only when you have
>multiple ones that you can compare them. No?
I was looking at this documentation -
http://download.intel.com/embedded/software/IA/324264.pdf (How to
Benchmark Code Execution Times on Intel IA-32 and IA-64), where they try
to benchmark code execution times precisely, and later switch to using
RDTSCP twice to obtain both upward and downward barrier guarantees.
Now, depending on the context (loop or not), will a second serializing
instruction be needed, or can that too be avoided?
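For completeness, the bracketing pattern that paper describes is roughly the
following (a gcc inline-asm sketch; this is far heavier than anything the
vsyscall would want and is shown only to contrast with the single-fence question):

#include <stdint.h>

/* cpuid is fully serializing, so nothing earlier leaks past the first rdtsc. */
static inline uint64_t bench_start(void)
{
	uint32_t lo, hi;

	asm volatile("cpuid\n\t"
		     "rdtsc\n\t"
		     "mov %%eax, %0\n\t"
		     "mov %%edx, %1"
		     : "=r" (lo), "=r" (hi)
		     : : "rax", "rbx", "rcx", "rdx");
	return ((uint64_t)hi << 32) | lo;
}

/*
 * rdtscp waits for earlier instructions to finish; the trailing cpuid keeps
 * later instructions from creeping into the measured region.
 */
static inline uint64_t bench_end(void)
{
	uint32_t lo, hi;

	asm volatile("rdtscp\n\t"
		     "mov %%eax, %0\n\t"
		     "mov %%edx, %1\n\t"
		     "cpuid"
		     : "=r" (lo), "=r" (hi)
		     : : "rax", "rbx", "rcx", "rdx");
	return ((uint64_t)hi << 32) | lo;
}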
>
>So _before_ we try some really clever data dependency trick with new
>inline asm and magic "double shifts to create a zero" things, I really
>would suggest just trying to remove the second lfence entirely and see
>how that works. Maybe it doesn't work, but ...
>
> Linus
> If only one lfence or serializing instruction is to be used, can't we
> just use the RDTSCP instruction (X86_FEATURE_RDTSCP, available only on >=
> Core i7 and on AMD), which provides both the TSC and an upward serializing
RDTSCP is a full synchronization, much heavier and slower than LFENCE
(or even two of them). Besides, it doesn't exist on older CPUs.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Apr 7, 2011 at 5:26 PM, Andrew Lutomirski <[email protected]> wrote:
> On Thu, Apr 7, 2011 at 2:30 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Apr 7, 2011 at 11:15 AM, Andi Kleen <[email protected]> wrote:
>>>
>>> I would prefer to be safe than sorry.
>>
>> There's a difference between "safe" and "making up theoretical
>> arguments for the sake of an argument".
>>
>> If Intel _documented_ the "barriers on each side", I think you'd have a point.
>>
>> As it is, we're not doing the "safe" thing, we're doing the "extra
>> crap that costs us and nobody has ever shown is actually worth it".
>
> Speaking as both a userspace programmer who wants to use clock_gettime
> and as the sucker who has to test this thing, I'd like to agree on
> what clock_gettime is *supposed* to do. I propose:
>
> For the purposes of ordering, clock_gettime acts as though there is a
> volatile variable that contains the time and is kept up-to-date by
> some thread. clock_gettime reads that variable. This means that
> clock_gettime is not a barrier but is ordered at least as strongly* as
> a read to a volatile variable. If code that calls clock_gettime needs
> stronger ordering, it should add additional barriers as appropriate.
>
> * Modulo errata, BIOS bugs, implementation bugs, etc.
As far as I can tell, on Sandy Bridge and Bloomfield, I can't get the
sequence lfence;rdtsc to violate the rule above. That's the case even
if I stick random arithmetic and branches right before the lfence. If
I remove the lfence, though, it starts to fail. (This is without the
evil fake barrier.)
However, as expected, I can see stores getting reordered after
lfence;rdtsc and rdtscp but not mfence;rdtsc.
So... do you think that the rule is sensible?
I'll post the test case somewhere when it's a little less ugly. I'd
like to see test results on AMD.
--Andy
* Andrew Lutomirski <[email protected]> wrote:
> > * Modulo errata, BIOS bugs, implementation bugs, etc.
>
> As far as I can tell, on Sandy Bridge and Bloomfield, I can't get the
> sequence lfence;rdtsc to violate the rule above. That's the case even if I
> stick random arithmetic and branches right before the lfence. If I remove
> the lfence, though, it starts to fail. (This is without the evil fake
> barrier.)
It's not really evil, just too tricky and hence very vulnerable to entropy ;-)
> However, as expected, I can see stores getting reordered after lfence;rdtsc
> and rdtscp but not mfence;rdtsc.
Is this lfence;rdtsc variant enough for your real use case as well?
Basically, we are free to define whatever sensible semantics we find reasonable
and fast - we are pretty free due to the fact that the whole TSC picture was
such a mess for a decade or so, so apps did not make assumptions (because we
could not make guarantees).
> So... do you think that the rule is sensible?
The barrier properties of this system call are flexible in the same sense, so
your proposal is sensible to me. I'd go for the weakest barrier that still
works fine: that is the fastest one, and it also gives us the most
options for the future.
> I'll post the test case somewhere when it's a little less ugly. I'd like to
> see test results on AMD.
That would be nice - we could test it on various Intel and AMD CPUs.
Thanks,
Ingo