From: Andy Lutomirski
To: x86@kernel.org
Cc: Thomas Gleixner, Ingo Molnar, Andi Kleen, linux-kernel@vger.kernel.org, Andy Lutomirski
Subject: [RFT/PATCH v2 2/6] x86-64: Optimize vread_tsc's barriers
Date: Wed, 6 Apr 2011 22:03:59 -0400
Message-Id: <80b43d57d15f7b141799a7634274ee3bfe5a5855.1302137785.git.luto@mit.edu>
X-Mailer: git-send-email 1.7.4

RDTSC is completely unordered on modern Intel and AMD CPUs.  The
Intel manual says that lfence;rdtsc causes all previous instructions
to complete before the tsc is read, and the AMD manual says to use
mfence;rdtsc to do the same thing.

We want a stronger guarantee, though: we want the tsc to be read
before any memory access that occurs after the call to
vclock_gettime (or vgettimeofday).  We currently guarantee that with
a second lfence or mfence.  This sequence is not really supported by
the manual (AFAICT) and it's also slow.

This patch changes the rdtsc to use implicit memory ordering instead
of the second fence.  The sequence looks like this:

   {l,m}fence
   rdtsc
   mov [something dependent on edx],[tmp]
   return [some function of tmp]

This means that the time stamp has to be read before the load, and
the return value depends on tmp.  All x86-64 chips guarantee that no
memory access after a load moves before that load.  This means that
all memory access after vread_tsc occurs after the time stamp is
read.

The trick is that the answer should not actually change as a result
of the sneaky memory access.  I accomplish this by shifting rdx left
by 32 bits, twice, to generate the number zero.  (I can't imagine
that any CPU can break that dependency.)  Then I use "zero" as an
offset to a memory access that we had to do anyway.

On Sandy Bridge (i7-2600), this improves a loop of
clock_gettime(CLOCK_MONOTONIC) by 5 ns/iter (from ~22.7 to ~17.7).
time-warp-test still passes.
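For anyone who wants to try it, a loop along the lines of the sketch
below is enough to see the difference.  It is only an illustration,
not the actual harness behind the numbers above and not
time-warp-test: it times clock_gettime(CLOCK_MONOTONIC) and checks
that consecutive readings never go backwards.  (glibc older than 2.17
needs -lrt for clock_gettime.)

/*
 * Illustrative userspace sketch only: measure the average cost of
 * clock_gettime(CLOCK_MONOTONIC) and do a weak, single-threaded
 * "no time warp" check.
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t ts_to_ns(const struct timespec *ts)
{
        return (uint64_t)ts->tv_sec * 1000000000ULL + (uint64_t)ts->tv_nsec;
}

int main(void)
{
        const long iters = 10 * 1000 * 1000;
        struct timespec start, prev, now;
        long i;

        clock_gettime(CLOCK_MONOTONIC, &start);
        now = prev = start;

        for (i = 0; i < iters; i++) {
                clock_gettime(CLOCK_MONOTONIC, &now);

                /* The clock must never appear to run backwards. */
                if (ts_to_ns(&now) < ts_to_ns(&prev)) {
                        fprintf(stderr, "clock went backwards!\n");
                        return 1;
                }
                prev = now;
        }

        printf("%.1f ns per clock_gettime() call\n",
               (double)(ts_to_ns(&now) - ts_to_ns(&start)) / iters);
        return 0;
}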
I suspect that it's sufficient to just load something dependent on
edx without using the result, but I don't see any solid evidence in
the manual that CPUs won't eliminate useless loads.  I leave scary
stuff like that to the real experts.

Signed-off-by: Andy Lutomirski
---
 arch/x86/kernel/tsc.c |   37 ++++++++++++++++++++++++++++++-------
 1 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index bc46566..858c084 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -767,18 +767,41 @@ static cycle_t read_tsc(struct clocksource *cs)
 static cycle_t __vsyscall_fn vread_tsc(void)
 {
         cycle_t ret;
+        u64 zero, last;

         /*
-         * Surround the RDTSC by barriers, to make sure it's not
-         * speculated to outside the seqlock critical section and
-         * does not cause time warps:
+         * rdtsc is unordered, and we want it to be ordered like
+         * a load with respect to other CPUs (and we don't want
+         * it to execute absurdly early wrt code on this CPU).
+         * rdtsc_barrier() is a barrier that provides this ordering
+         * with respect to *earlier* loads.  (Which barrier to use
+         * depends on the CPU.)
          */
         rdtsc_barrier();
-        ret = (cycle_t)vget_cycles();
-        rdtsc_barrier();
-        return ret >= VVAR(vsyscall_gtod_data).clock.cycle_last ?
-                ret : VVAR(vsyscall_gtod_data).clock.cycle_last;
+        asm volatile ("rdtsc\n\t"
+                      "shl $0x20,%%rdx\n\t"
+                      "or %%rdx,%%rax\n\t"
+                      "shl $0x20,%%rdx"
+                      : "=a" (ret), "=d" (zero) : : "cc");
+
+        /*
+         * zero == 0, but as far as the processor is concerned, zero
+         * depends on the output of rdtsc.  So we can use it as a
+         * load barrier by loading something that depends on it.
+         * x86-64 keeps all loads in order wrt each other, so this
+         * ensures that rdtsc is ordered wrt all later loads.
+         */
+
+        /*
+         * This doesn't multiply 'zero' by anything, which *should*
+         * generate nicer code, except that gcc cleverly embeds the
+         * dereference into the cmp and the cmovae.  Oh, well.
+         */
+        last = *( (cycle_t *)
+                  ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
+
+        return ret >= last ? ret : last;
 }
 #endif
-- 
1.7.4