Date: Sat, 25 Apr 2015 23:12:06 +0200
From: Borislav Petkov <bp@alien8.de>
To: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
        Andy Lutomirski <luto@amacapital.net>,
        Denys Vlasenko <vda.linux@googlemail.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Brian Gerst <brgerst@gmail.com>, Denys Vlasenko <dvlasenk@redhat.com>,
        Ingo Molnar <mingo@kernel.org>, Steven Rostedt <rostedt@goodmis.org>,
        Oleg Nesterov <oleg@redhat.com>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Alexei Starovoitov <ast@plumgrid.com>, Will Drewry <wad@chromium.org>,
        Kees Cook <keescook@chromium.org>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor
 attribute issue
Message-ID: <20150425211206.GE32099@pd.tnic>
References: <5d120f358612d73fc909f5bfa47e7bd082db0af0.1429841474.git.luto@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <5d120f358612d73fc909f5bfa47e7bd082db0af0.1429841474.git.luto@kernel.org>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 8318
Lines: 193

On Thu, Apr 23, 2015 at 07:15:01PM -0700, Andy Lutomirski wrote:
> AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET
> with SS == 0 results in an invalid usermode state in which SS is
> apparently equal to __USER_DS but causes #SS if used.
> 
> Work around the issue by replacing NULL SS values with __KERNEL_DS
> in __switch_to, thus ensuring that SYSRET never happens with SS set
> to NULL.
> 
> This was exposed by a recent vDSO cleanup.
> 
> Fixes: e7d6eefaaa44 x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> 
> Tested only on Intel, which isn't very interesting.  I'll tidy up
> and send a test case, too, once Borislav confirms that it works.

So I did some benchmarking today. Custom kernel build measured with perf
stat, 10 builds with --pre doing

$ cat pre-build-kernel.sh
make -s clean
echo 3 > /proc/sys/vm/drop_caches

$ cat measure.sh
EVENTS="cpu-clock,task-clock,cycles,instructions,branches,branch-misses,context-switches,migrations"
perf stat -e $EVENTS --sync -a --repeat 10 --pre ~/kernel/pre-build-kernel.sh make -s -j64

I've prepended the perf stat output with markers A:, B: or C: for easier
comparing. The markers mean:

A: Linus' master from a couple of days ago + tip/master + tip/x86/asm
B: With Andy's SYSRET patch ontop
C: Without RCX canonicalness check (see patch at the end).

Numbers are from an AMD F16h box:

A:    2835570.145246      cpu-clock (msec)                                              ( +-  0.02% ) [100.00%]
B:    2833364.074970      cpu-clock (msec)                                              ( +-  0.04% ) [100.00%]
C:    2834708.335431      cpu-clock (msec)                                              ( +-  0.02% ) [100.00%]

This is interesting - The SYSRET SS fix makes it minimally better and
the C-patch is a bit worse again. Net win is 861 msec, almost a second,
oh well.

A:    2835570.099981      task-clock (msec)         #    3.996 CPUs utilized            ( +-  0.02% ) [100.00%]
B:    2833364.073633      task-clock (msec)         #    3.996 CPUs utilized            ( +-  0.04% ) [100.00%]
C:    2834708.350387      task-clock (msec)         #    3.996 CPUs utilized            ( +-  0.02% ) [100.00%]

Similar thing observable here.

A: 5,591,213,166,613      cycles                    #    1.972 GHz                      ( +-  0.03% ) [75.00%]
B: 5,585,023,802,888      cycles                    #    1.971 GHz                      ( +-  0.03% ) [75.00%]
C: 5,587,983,212,758      cycles                    #    1.971 GHz                      ( +-  0.02% ) [75.00%]

net win is 3,229,953,855 cycles drop.

A: 3,106,707,101,530      instructions              #    0.56  insns per cycle          ( +-  0.01% ) [75.00%]
B: 3,106,632,251,528      instructions              #    0.56  insns per cycle          ( +-  0.00% ) [75.00%]
C: 3,106,265,958,142      instructions              #    0.56  insns per cycle          ( +-  0.00% ) [75.00%]

This looks like it would make sense - instruction count drops from A -> B -> C.

A:   683,676,044,429      branches                  #  241.107 M/sec                    ( +-  0.01% ) [75.00%]
B:   683,670,899,595      branches                  #  241.293 M/sec                    ( +-  0.01% ) [75.00%]
C:   683,675,772,858      branches                  #  241.180 M/sec                    ( +-  0.01% ) [75.00%]

Also makes sense - the C patch adds an unconditional JMP over the
RCX-canonicalness check.

A:    43,829,535,008      branch-misses             #    6.41% of all branches          ( +-  0.02% ) [75.00%]
B:    43,844,118,416      branch-misses             #    6.41% of all branches          ( +-  0.03% ) [75.00%]
C:    43,819,871,086      branch-misses             #    6.41% of all branches          ( +-  0.02% ) [75.00%]

And this is nice, branch misses are the smallest with C, cool. It makes
sense again - the C patch adds an unconditional JMP which doesn't miss.

A:         2,030,357      context-switches          #    0.716 K/sec                    ( +-  0.06% ) [100.00%]
B:         2,029,313      context-switches          #    0.716 K/sec                    ( +-  0.05% ) [100.00%]
C:         2,028,566      context-switches          #    0.716 K/sec                    ( +-  0.06% ) [100.00%]

Those look good.

A:            52,421      migrations                #    0.018 K/sec                    ( +-  1.13% )
B:            52,049      migrations                #    0.018 K/sec                    ( +-  1.02% )
C:            51,365      migrations                #    0.018 K/sec                    ( +-  0.92% )

Same here.

A:     709.528485252 seconds time elapsed                                          ( +-  0.02% )
B:     708.976557288 seconds time elapsed                                          ( +-  0.04% )
C:     709.312844791 seconds time elapsed                                          ( +-  0.02% )

Interestingly, the unconditional JMP kinda costs... Btw, I'm not sure if
kernel build is the optimal workload for benchmarking here but I don't
see why not - it does a lot of syscalls so it should exercise the SYSRET
path sufficiently.

Anyway, we can do this below. Or not, I'm sitting on the fence about
that one.

---
From: Borislav Petkov <bp@suse.de>
Date: Sat, 25 Apr 2015 19:30:33 +0200
Subject: [PATCH] x86/entry: Avoid canonical RCX check on AMD

It is not needed on AMD as RCX canonicalness is not checked during
SYSRET there.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/kernel/cpu/intel.c       |  2 ++
 arch/x86/kernel/entry_64.S        | 13 +++++++++----
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7ee9b94d9921..8d555b046fe9 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -265,6 +265,7 @@
 #define X86_BUG_11AP		X86_BUG(5) /* Bad local APIC aka 11AP */
 #define X86_BUG_FXSAVE_LEAK	X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
 #define X86_BUG_CLFLUSH_MONITOR	X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
+#define X86_BUG_CANONICAL_RCX	X86_BUG(8) /* SYSRET #GPs when %RCX non-canonical */
 
 #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
 
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 50163fa9034f..109a51815e92 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -159,6 +159,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 		pr_info("Disabling PGE capability bit\n");
 		setup_clear_cpu_cap(X86_FEATURE_PGE);
 	}
+
+	set_cpu_bug(c, X86_BUG_CANONICAL_RCX);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e952f6bf1d6d..d01fb6c1362f 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -415,16 +415,20 @@ syscall_return:
 	jne opportunistic_sysret_failed
 
 	/*
-	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
-	 * in kernel space.  This essentially lets the user take over
-	 * the kernel, since userspace controls RSP.
-	 *
 	 * If width of "canonical tail" ever becomes variable, this will need
 	 * to be updated to remain correct on both old and new CPUs.
 	 */
 	.ifne __VIRTUAL_MASK_SHIFT - 47
 	.error "virtual address width changed -- SYSRET checks need update"
 	.endif
+
+	/*
+	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
+	 * in kernel space.  This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.
+	 */
+	ALTERNATIVE "jmp 1f", "", X86_BUG_CANONICAL_RCX
+
 	/* Change top 16 bits to be the sign-extension of 47th bit */
 	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
@@ -432,6 +436,7 @@ syscall_return:
 	cmpq	%rcx, %r11
 	jne	opportunistic_sysret_failed
 
+1:
 	cmpq $__USER_CS,CS(%rsp)	/* CS must match SYSRET */
 	jne opportunistic_sysret_failed
 
-- 
2.3.5

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/