This is a reworked version of my older fsgsbase patchkit.
Main changes:
- Ported to new entry/* code, which simplified it somewhat
- Now has a test program
- Fixed ptrace/core dump support
- Better documentation
- Some minor fixes and improvements
This adds kernel support for new Intel instructions that
allow fast access to the 64bit FS and GS base. They need
some changes to entry_64.S because they allow user space
to fully control the GS base.
Advantages:
- NMIs (and other "paranoid" interrupts) avoid
one RDMSR, which makes them faster
- User space can use these instructions, mainly
for efficient context switching with user thread libraries
- Context switches no longer need MSR writes
to switch an FS/GS base above 4GB. This
speeds up applications with enough thread
local data that it does not fit below 4GB.
- User space can use GS efficiently as an additional
global pointer register
I also included one minor (unrelated) optimization to
disable an unneeded old SWAPGS workaround.
-Andi
From: Andi Kleen <[email protected]>
The kernel needs to explicitly enable RD/WRFSBASE to handle context
switches correctly. So the application needs to know if it can safely use
these instructions. Just looking at the CPUID bit is not enough because it
may be running on a kernel that does not enable the instructions.
One way would be for the application to just try the instructions and
catch the SIGILL. But that is difficult to do in libraries, which may
not want to overwrite the signal handlers of the main application.
So we need to provide a way for the application to discover the kernel
capability.
I used AT_HWCAP2 in the ELF aux vector, which is already used by
PPC for similar things. We define a new Linux-specific bitmap
returned in AT_HWCAP2. Currently it has only one bit set,
indicating that the kernel is FSGSBASE capable.
The application can then access it manually or using
the getauxval() function in newer glibc.
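For illustration, a minimal user-space check might look like this
(a sketch assuming glibc's getauxval(); HWCAP2_FSGSBASE as defined
in the new uapi header):
	#include <sys/auxv.h>
	#include <elf.h>
	/* From the new uapi/asm/hwcap.h */
	#define HWCAP2_FSGSBASE (1 << 0)
	/* Non-zero when the kernel has enabled RD/WRFSBASE */
	static int have_fsgsbase(void)
	{
		return (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE) != 0;
	}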
v2: Rename things.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/elf.h | 7 +++++++
arch/x86/include/uapi/asm/hwcap.h | 7 +++++++
arch/x86/kernel/cpu/common.c | 7 ++++++-
3 files changed, 20 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/uapi/asm/hwcap.h
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 15340e3..0df9c95 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -258,6 +258,13 @@ extern int force_personality32;
#define ELF_HWCAP (boot_cpu_data.x86_capability[CPUID_1_EDX])
+extern unsigned elf_hwcap2;
+
+/* HWCAP2 supplies kernel-enabled CPU features, so that applications
+ can know that they can safely use them. The bits are defined in
+ uapi/asm/hwcap.h. */
+#define ELF_HWCAP2 elf_hwcap2
+
/* This yields a string that ld.so will use to load implementation
specific libraries for optimization. This is more specific in
intent than poking at uname or /proc/cpuinfo.
diff --git a/arch/x86/include/uapi/asm/hwcap.h b/arch/x86/include/uapi/asm/hwcap.h
new file mode 100644
index 0000000..d9c54f8
--- /dev/null
+++ b/arch/x86/include/uapi/asm/hwcap.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_HWCAP_H
+#define _ASM_HWCAP_H 1
+
+#define HWCAP2_FSGSBASE (1 << 0) /* Kernel enabled RD/WR FS/GS BASE */
+/* up to bit 31 free */
+
+#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f581cd1..b022f31 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -35,6 +35,7 @@
#include <asm/desc.h>
#include <asm/fpu/internal.h>
#include <asm/mtrr.h>
+#include <asm/hwcap.h>
#include <linux/numa.h>
#include <asm/asm.h>
#include <asm/cpu.h>
@@ -50,6 +51,8 @@
#include "cpu.h"
+unsigned elf_hwcap2 __read_mostly;
+
/* all of these masks are initialized in setup_cpu_local_masks() */
cpumask_var_t cpu_initialized_mask;
cpumask_var_t cpu_callout_mask;
@@ -1019,8 +1022,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
/* The boot/hotplug time assigment got cleared, restore it */
c->logical_proc_id = topology_phys_to_logical_pkg(c->phys_proc_id);
- if (cpu_has(c, X86_FEATURE_FSGSBASE))
+ if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
+ elf_hwcap2 |= HWCAP2_FSGSBASE;
cr4_set_bits(X86_CR4_FSGSBASE);
+ }
}
/*
--
2.5.5
From: Andi Kleen <[email protected]>
Every GS selector/index reload always paid an extra MFENCE
between the two SWAPGS. This was to work around an old
bug in early K8 steppings. All other CPUs don't need the extra
MFENCE, so patch it in only for K8.
v2: Use set_cpu_bug()
v3: Use ALTERNATIVE directly
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/entry/entry_64.S | 2 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/amd.c | 3 +++
3 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c605710..252bce4 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -786,7 +786,7 @@ ENTRY(native_load_gs_index)
SWAPGS
gs_change:
movl %edi, %gs
-2: mfence /* workaround */
+2: ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_MFENCE
SWAPGS
popfq
ret
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 074b760..f3b3ff8 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -288,6 +288,7 @@
#define X86_BUG_FXSAVE_LEAK X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
#define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
#define X86_BUG_SYSRET_SS_ATTRS X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
+#define X86_BUG_SWAPGS_MFENCE X86_BUG(9) /* SWAPGS may need MFENCE */
#ifdef CONFIG_X86_32
/*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 97c59fd..9cd932b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -589,6 +589,9 @@ static void init_amd_k8(struct cpuinfo_x86 *c)
if ((level >= 0x0f48 && level < 0x0f50) || level >= 0x0f58)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+ /* Early steppings needed an MFENCE on SWAPGS. */
+ set_cpu_bug(c, X86_BUG_SWAPGS_MFENCE);
+
/*
* Some BIOSes incorrectly force this feature, but only K8 revision D
* (model = 0x14) and later actually support it.
--
2.5.5
From: Andi Kleen <[email protected]>
Add a simple tester. By default it runs 10000 iterations,
but it can also run forever with 'tfsgs_64 0'.
Signed-off-by: Andi Kleen <[email protected]>
---
tools/testing/selftests/x86/Makefile | 1 +
tools/testing/selftests/x86/tfsgs.c | 151 +++++++++++++++++++++++++++++++++++
2 files changed, 152 insertions(+)
create mode 100644 tools/testing/selftests/x86/tfsgs.c
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index d5ce7d7..e4a3ef9 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -9,11 +9,12 @@ TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt ptrace_sysc
TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
+TARGETS_C_64BIT_ONLY := tfsgs
TARGETS_C_32BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_32BIT_ONLY)
TARGETS_C_64BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_64BIT_ONLY)
BINARIES_32 := $(TARGETS_C_32BIT_ALL:%=%_32)
 BINARIES_64 := $(TARGETS_C_64BIT_ALL:%=%_64)
CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
diff --git a/tools/testing/selftests/x86/tfsgs.c b/tools/testing/selftests/x86/tfsgs.c
new file mode 100644
index 0000000..15bb472
--- /dev/null
+++ b/tools/testing/selftests/x86/tfsgs.c
@@ -0,0 +1,151 @@
+/* Test kernel RD/WR FS/GS BASE support
+ * Run 'tfsgs_64 0' to run forever, otherwise pass the number of iterations
+ * (default 10000)
+ * For stress testing run many instances in parallel to also test context
+ * switching
+ *
+ * This program destroys TLS, which means most of normal glibc
+ * doesn't work. So it uses its own libc replacement.
+ *
+ * It also breaks some versions of gdb
+ * (workaround available in https://sourceware.org/bugzilla/show_bug.cgi?id=19684)
+ */
+#include <stdlib.h>
+#include <assert.h>
+#include <asm/prctl.h>
+#include <asm/unistd.h>
+#include <cpuid.h>
+#include <sys/auxv.h>
+#include <elf.h>
+
+#ifndef __always_inline
+#define __always_inline inline __attribute__((always_inline))
+#endif
+
+static __always_inline unsigned long rdgsbase(void)
+{
+ unsigned long gs;
+ asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc8 # rdgsbaseq %%rax"
+ : "=a" (gs)
+ :: "memory");
+ return gs;
+}
+
+static __always_inline unsigned long rdfsbase(void)
+{
+ unsigned long fs;
+ asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc0 # rdfsbaseq %%rax"
+ : "=a" (fs)
+ :: "memory");
+ return fs;
+}
+
+static __always_inline void wrgsbase(unsigned long gs)
+{
+ asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd8 # wrgsbaseq %%rax"
+ :: "a" (gs)
+ : "memory");
+}
+
+static __always_inline void wrfsbase(unsigned long fs)
+{
+ asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd0 # wrfsbaseq %%rax"
+ :: "a" (fs)
+ : "memory");
+}
+
+/* Custom assert because we can't access errno with changed fs */
+
+int my_strlen(char *s)
+{
+ int len = 0;
+ while (*s++)
+ len++;
+ return len;
+}
+
+int arch_prctl(int cmd, unsigned long arg)
+{
+ int ret;
+ asm volatile("syscall" : "=a" (ret)
+ : "0" (__NR_arch_prctl), "D" (cmd), "S" (arg)
+ : "memory", "rcx", "r11");
+ return ret;
+}
+
+__attribute__((noinline)) void my_assert(int flag, char *msg)
+{
+ if (!flag) {
+ int ret;
+ asm volatile("syscall"
+ : "=a" (ret)
+ : "0" (__NR_write),
+ "D" (2), "S" (msg),
+ "d" (my_strlen(msg))
+ : "memory", "rcx", "r11");
+ *(int *)0 = 0;	/* force a crash: abort() would need working TLS */
+ }
+}
+
+long iter = 10000;
+
+#ifndef bit_FSGSBASE
+#define bit_FSGSBASE 1
+#endif
+
+/* Will eventually be in asm/hwcap.h */
+#define HWCAP2_FSGSBASE (1 << 0)
+
+unsigned long nfs, ngs, x;
+
+int main(int ac, char **av)
+{
+ long i;
+ unsigned a, b, c, d;
+
+ if (__get_cpuid_max(0, NULL) < 7)
+ exit(0);
+ __cpuid_count(7, 0, a, b, c, d);
+ if (!(b & bit_FSGSBASE))
+ exit(0);
+
+ /* Kernel support? */
+ if (!(getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE))
+ exit(0);
+
+ if (av[1])
+ iter = strtoul(av[1], NULL, 0);
+
+ srandom(1);
+ unsigned long count = random();
+ unsigned long orig_fs = rdfsbase();
+ for (i = 0; i < iter || iter == 0; i++) {
+ unsigned long x = count++;
+ x = ((long)(x << 16)) >> 16; /* sign extend 48->64 */
+
+ wrgsbase(x);
+ wrfsbase(x);
+
+ int i;
+ for (i = 0; i < 1000; i++)
+ asm volatile("pause" ::: "memory");
+
+ ngs = rdgsbase();
+ nfs = rdfsbase();
+
+ my_assert(ngs == x, "gs check 1 failed\n");
+ my_assert(nfs == x, "fs check 1 failed\n");
+
+ unsigned long n;
+ const unsigned long MASK = 0x7fffffffffff;
+ arch_prctl(ARCH_SET_FS, (x + 1) & MASK);
+ arch_prctl(ARCH_SET_GS, (x - 1) & MASK);
+ n = rdfsbase();
+ my_assert(n == ((x + 1) & MASK), "fs check 2 failed\n");
+
+ for (i = 0; i < 1000; i++)
+ asm volatile("pause" ::: "memory");
+
+ n = rdgsbase();
+ my_assert(n == ((x - 1) & MASK), "gs check 2 failed\n");
+ }
+ wrfsbase(orig_fs);
+}
--
2.5.5
From: Andi Kleen <[email protected]>
v2: Minor updates to documentation requested in review.
v3: Update for new gcc and various improvements.
Signed-off-by: Andi Kleen <[email protected]>
---
Documentation/x86/fsgs.txt | 109 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 109 insertions(+)
create mode 100644 Documentation/x86/fsgs.txt
diff --git a/Documentation/x86/fsgs.txt b/Documentation/x86/fsgs.txt
new file mode 100644
index 0000000..1d4a51f
--- /dev/null
+++ b/Documentation/x86/fsgs.txt
@@ -0,0 +1,109 @@
+
+Using FS and GS prefixes on 64bit x86 Linux
+
+The x86 architecture supports segment prefixes per instruction to add an
+offset to an address. On 64bit x86, these are mostly nops, except for FS
+and GS.
+
+This offers an efficient way to reference a global pointer.
+
+The compiler has to generate special code to use these base registers,
+or they can be accessed with inline assembler.
+
+ mov %gs:offset,%reg
+ mov %fs:offset,%reg
+
+In 64bit code, FS is used to address thread-local storage (TLS), declared using
+__thread. The compiler then automatically generates the correct prefixes and
+relocations to access these values.
+
+FS is normally managed by the runtime code or the threading library.
+Overwriting it can break a lot of things (including syscalls and gdb),
+but it can make sense to save/restore it for threading purposes.
+
+GS is freely available, but may need special (compiler or inline assembler)
+code to use.
+
+Traditionally 64bit FS and GS could be set by the arch_prctl system call
+
+ arch_prctl(ARCH_SET_GS, value)
+ arch_prctl(ARCH_SET_FS, value)
+
+[There was also an older method using modify_ldt(), inherited from 32bit,
+but this is not discussed here.]
+
+However using a syscall is problematic for user space threading libraries
+that want to context switch in user space. The whole point of them
+is avoiding the overhead of a syscall. It's also cleaner for compilers
+wanting to use the extra register to write it with instructions,
+or to read it directly when computing addresses and offsets.
+
+Newer Intel CPUs (Ivy Bridge and later) added new instructions to directly
+access these registers quickly from user context
+
+ RDFSBASE %reg read the FS base (or _readfsbase_u64)
+ RDGSBASE %reg read the GS base (or _readgsbase_u64)
+
+ WRFSBASE %reg write the FS base (or _writefsbase_u64)
+ WRGSBASE %reg write the GS base (or _writegsbase_u64)
+
+To use the intrinsics, include <immintrin.h> and build with the -mfsgsbase option.
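+
+For example, a user level thread switch might save and restore the
+FS base with the intrinsics. This is only a sketch; switch_tls is a
+hypothetical helper, and kernel FSGSBASE support must have been
+verified first (see below):
+
+	#include <immintrin.h>
+
+	static void switch_tls(unsigned long long *save, unsigned long long next)
+	{
+		*save = _readfsbase_u64();	/* save outgoing FS base */
+		_writefsbase_u64(next);		/* install the new FS base */
+	}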
+
+The instructions are supported by the CPU when the "fsgsbase" string is shown in
+/proc/cpuinfo (or retrieved directly through the CPUID instruction:
+leaf 7, subleaf 0, EBX bit 0; the kernel's feature word 9, bit 0).
+
+The instructions are only available to 64bit binaries.
+
+In addition the kernel needs to explicitly enable these instructions, as it
+may otherwise not correctly context switch the state. Newer Linux
+kernels enable this. When the kernel has not enabled the instructions,
+they fault with a #UD exception.
+
+An FSGSBASE enabled kernel can be detected by checking the AT_HWCAP2
+bitmask in the aux vector. When the HWCAP2_FSGSBASE bit is set the
+kernel has enabled RD/WR FS/GS BASE.
+
+ #include <sys/auxv.h>
+ #include <elf.h>
+
+ /* Will eventually be in asm/hwcap.h */
+ #define HWCAP2_FSGSBASE (1 << 0)
+
+ unsigned val = getauxval(AT_HWCAP2);
+ if (val & HWCAP2_FSGSBASE) {
+ asm("wrgsbase %0" :: "r" (ptr));
+ }
+
+No extra CPUID check is needed, as the kernel will not set this bit if the CPU
+does not support the instructions.
+
+Another requirement is that the FS or GS selector has to be zero
+(this is normally true unless changed explicitly). When it is non-zero,
+the context switch assumes the base was loaded through the LDT/GDT,
+and will reload it from there.
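+
+A sketch for checking this from user space (the selector itself can
+be read with a plain move):
+
+	static int gs_selector_is_zero(void)
+	{
+		unsigned short sel;
+		asm("movw %%gs, %0" : "=rm" (sel));
+		return sel == 0;
+	}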
+
+gcc 6 will have special support to directly access data relative
+to fs/gs using the __seg_fs and __seg_gs address space pointer
+modifiers.
+
+#ifndef __SEG_GS
+#error "Need gcc 6 or later"
+#endif
+
+struct gsdata {
+ int a;
+ int b;
+} gsdata = { 1, 2 };
+
+int __seg_gs *valp = 0; /* offset relative to GS */
+
+ /* Check if kernel supports FSGSBASE as above */
+
+ /* Set up new GS */
+ asm("wrgsbase %0" :: "r" (&gsdata));
+
+ /* Now the global pointer can be used normally */
+ printf("gsdata.a = %d\n", valp->a);
+
+Andi Kleen
--
2.5.5
From: Andi Kleen <[email protected]>
Convert arch_prctl to use the new instructions to
change fs/gs if available, instead of using MSRs.
This is merely a small performance optimization,
no new functionality.
With the new instructions the syscall is really obsolete,
as everything can be set directly in ring 3. But the syscall
is widely used by existing software, so we still support it.
The syscall still enforces that the addresses are not
in kernel space, even though that is no longer needed.
This is mainly so that programs written for new CPUs
do not suddenly fail on old CPUs.
v2: Make kprobes safe
v3: Rename things.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/process_64.c | 48 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 40 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 53fa839..5f40517 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -530,20 +530,38 @@ void set_personality_ia32(bool x32)
}
EXPORT_SYMBOL_GPL(set_personality_ia32);
+static noinline __kprobes void reload_user_gs(unsigned long addr)
+{
+ local_irq_disable();
+ swapgs();
+ loadsegment(gs, 0);
+ wrgsbase(addr);
+ swapgs();
+ local_irq_enable();
+}
+
long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
{
int ret = 0;
int doit = task == current;
int cpu;
+ int fast_seg = boot_cpu_has(X86_FEATURE_FSGSBASE);
switch (code) {
case ARCH_SET_GS:
+ /*
+ * With fast_seg we don't need that check anymore,
+ * but keep it so that programs do not suddenly
+ * start failing when run on older CPUs.
+ * If you really want to set an address in kernel space
+ * use WRGSBASE directly.
+ */
if (addr >= TASK_SIZE_OF(task))
return -EPERM;
cpu = get_cpu();
/* handle small bases via the GDT because that's faster to
switch. */
- if (addr <= 0xffffffff) {
+ if (addr <= 0xffffffff && !fast_seg) {
set_32bit_tls(task, GS_TLS, addr);
if (doit) {
load_TLS(&task->thread, cpu);
@@ -555,8 +573,12 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
task->thread.gsindex = 0;
task->thread.gs = addr;
if (doit) {
- load_gs_index(0);
- ret = wrmsrl_safe(MSR_KERNEL_GS_BASE, addr);
+ if (fast_seg) {
+ reload_user_gs(addr);
+ } else {
+ load_gs_index(0);
+ ret = wrmsrl_safe(MSR_KERNEL_GS_BASE, addr);
+ }
}
}
put_cpu();
@@ -569,7 +591,7 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
cpu = get_cpu();
/* handle small bases via the GDT because that's faster to
switch. */
- if (addr <= 0xffffffff) {
+ if (addr <= 0xffffffff && !fast_seg) {
set_32bit_tls(task, FS_TLS, addr);
if (doit) {
load_TLS(&task->thread, cpu);
@@ -584,7 +606,10 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
/* set the selector to 0 to not confuse
__switch_to */
loadsegment(fs, 0);
- ret = wrmsrl_safe(MSR_FS_BASE, addr);
+ if (fast_seg)
+ wrfsbase(addr);
+ else
+ ret = wrmsrl_safe(MSR_FS_BASE, addr);
}
}
put_cpu();
@@ -593,6 +618,8 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
unsigned long base;
if (task->thread.fsindex == FS_TLS_SEL)
base = read_32bit_tls(task, FS_TLS);
+ else if (doit && fast_seg)
+ base = rdfsbase();
else if (doit)
rdmsrl(MSR_FS_BASE, base);
else
@@ -607,9 +634,14 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
base = read_32bit_tls(task, GS_TLS);
else if (doit) {
savesegment(gs, gsindex);
- if (gsindex)
- rdmsrl(MSR_KERNEL_GS_BASE, base);
- else
+ if (gsindex) {
+ if (fast_seg) {
+ local_irq_disable();
+ base = read_user_gsbase();
+ local_irq_enable();
+ } else
+ rdmsrl(MSR_KERNEL_GS_BASE, base);
+ } else
base = task->thread.gs;
} else
base = task->thread.gs;
--
2.5.5
From: Andi Kleen <[email protected]>
Introduction:
IvyBridge added four new instructions to directly write the fs and gs
64bit base registers. Previously this had to be done with a system
call to write to MSRs. The main use case is fast user space threading
and switching the fs/gs registers quickly there. Another use
case is having (relatively) cheap access to a new address
register per thread.
The instructions are opt-in and have to be explicitly enabled
by the OS.
For more details on how to use the instructions see
Documentation/x86/fsgs.txt added in a follow-on patch.
Paranoid exception path changes:
===============================
The paranoid entry/exit code is used for any NMI-like
exception.
Previously Linux couldn't support the new instructions
because the paranoid entry code relied on the GS base never being
negative outside the kernel to decide when to use SWAPGS. It would
check the GS MSR value and assume it was already running in the
kernel if it was negative.
To get rid of this assumption we have to revamp the paranoid exception
path to not rely on this. We can use the new instructions
to get (relatively) quick access to the GS value, and use
it directly to save/restore the GSBASE instead of using
SWAPGS.
This is also significantly faster than an MSR read, so it will
speed up NMIs (useful for profiling).
The kernel gs for the paranoid path is now stored at the
bottom of the IST stack (so that it can be derived from RSP).
The original patch compared the gs with the kernel gs and
assumed that if it was identical, swapgs was not needed
(and no user space processing was needed). This
was nice and simple and didn't need a lot of changes.
But this had the side effect that if a user process set its
GS to the same value as the kernel's, it could lose rescheduling
checks (so a racing reschedule IPI would only have been acted
upon at the next non-paranoid interrupt).
This version now switches to a full save/restore of the GS.
When SWAPGS used to be needed, but we have the new
instructions, we restore the original GS value in the exit
path instead.
Context switch changes:
======================
Then after these changes we need to also use the new instructions
to save/restore fs and gs, so that the new values set by the
users won't disappear. This is also significantly
faster for the case when the 64bit base has to be switched
(that is, when GS is larger than 4GB), as we can replace
the slow MSR write with a faster wr[fg]sbase execution.
This in turn enables fast switching when there are
enough threads that their TLS segment does not fit below 4GB
(or with some newer systems which don't properly hint the
stack limit); alternatively, programs that use fs as an additional base
register will not pay a significant context switch penalty.
It is all done in a single patch because there was no
simple way to do it in pieces without leaving crash
holes in between.
v2: Change to save/restore GS instead of using swapgs
based on the value. Large scale changes.
v3: Fix wrong flag initialization in fallback path.
Thanks 0day!
v4: Make swapgs code paths kprobes safe.
Port to new base line code which now switches indexes.
v5: Port to new kernel which avoids paranoid entry for ring 3.
Removed some code that handled this previously.
v6: Remove obsolete code. Use macro for ALTERNATIVE. Use
ALTERNATIVE for exit path, eliminating the DO_RESTORE_G15 flag.
Various cleanups. Improve description.
v7: Port to new entry code. Some fixes/cleanups.
v8: Lots of changes.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/entry/entry_64.S | 31 +++++++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 9 ++++++++
arch/x86/kernel/process_64.c | 51 ++++++++++++++++++++++++++++++++++++++------
3 files changed, 85 insertions(+), 6 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 858b555..c605710 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -35,6 +35,8 @@
#include <asm/asm.h>
#include <asm/smap.h>
#include <asm/pgtable_types.h>
+#include <asm/alternative-asm.h>
+#include <asm/fsgs.h>
#include <linux/err.h>
/* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */
@@ -678,6 +680,7 @@ ENTRY(\sym)
jnz 1f
.endif
call paranoid_entry
+ /* r15: previous gs if FSGSBASE, otherwise %ebx: swapgs flag */
.else
call error_entry
.endif
@@ -933,6 +936,7 @@ ENTRY(paranoid_entry)
cld
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
+ ALTERNATIVE "", "jmp paranoid_save_gs", X86_FEATURE_FSGSBASE
movl $1, %ebx
movl $MSR_GS_BASE, %ecx
rdmsr
@@ -943,6 +947,25 @@ ENTRY(paranoid_entry)
1: ret
END(paranoid_entry)
+ /*
+ * Faster version not using RDMSR, and also not assuming
+ * anything about the previous GS value.
+ * This allows the user to arbitrarily change GS using
+ * WRGSBASE. The kernel GS is at the bottom of the
+ * IST stack.
+ *
+ * We don't use the %ebx flag in this case: GS is
+ * always saved/restored in R15 instead.
+ */
+ENTRY(paranoid_save_gs)
+ RDGSBASE_R15 # read previous gs
+ movq $~(EXCEPTION_STKSZ-1), %rax # get ist stack mask
+ andq %rsp,%rax # get bottom of stack
+ movq (%rax),%rdi # get expected GS
+ WRGSBASE_RDI # set gs for kernel
+ ret
+END(paranoid_save_gs)
+
/*
* "Paranoid" exit path from exception stack. This is invoked
* only on return from non-NMI IST interrupts that came
@@ -958,11 +981,14 @@ END(paranoid_entry)
ENTRY(paranoid_exit)
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF_DEBUG
+ ALTERNATIVE "", "jmp paranoid_gsrestore", X86_FEATURE_FSGSBASE
testl %ebx, %ebx /* swapgs needed? */
jnz paranoid_exit_no_swapgs
TRACE_IRQS_IRETQ
SWAPGS_UNSAFE_STACK
jmp paranoid_exit_restore
+paranoid_gsrestore:
+ WRGSBASE_R15
paranoid_exit_no_swapgs:
TRACE_IRQS_IRETQ_DEBUG
paranoid_exit_restore:
@@ -1380,16 +1406,21 @@ end_repeat_nmi:
* exceptions might do.
*/
call paranoid_entry
+ /* r15: previous gs if FSGSBASE, otherwise %ebx swapgs flag */
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
movq %rsp, %rdi
movq $-1, %rsi
call do_nmi
+ ALTERNATIVE "", "jmp nmi_gsrestore", X86_FEATURE_FSGSBASE
testl %ebx, %ebx /* swapgs needed? */
jnz nmi_restore
nmi_swapgs:
SWAPGS_UNSAFE_STACK
+ jmp nmi_restore
+nmi_gsrestore:
+ WRGSBASE_R15
nmi_restore:
RESTORE_EXTRA_REGS
RESTORE_C_REGS
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 249461f..f581cd1 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1018,6 +1018,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
#endif
/* The boot/hotplug time assigment got cleared, restore it */
c->logical_proc_id = topology_phys_to_logical_pkg(c->phys_proc_id);
+
+ if (cpu_has(c, X86_FEATURE_FSGSBASE))
+ cr4_set_bits(X86_CR4_FSGSBASE);
}
/*
@@ -1422,8 +1425,14 @@ void cpu_init(void)
*/
if (!oist->ist[0]) {
char *estacks = per_cpu(exception_stacks, cpu);
+ void *gs = per_cpu(irq_stack_union.gs_base, cpu);
for (v = 0; v < N_EXCEPTION_STACKS; v++) {
+ /* Store GS at bottom of stack for bootstrap access */
+ *(void **)estacks = gs;
+ /* Put it on every 4K entry */
+ if (exception_stack_sizes[v] > EXCEPTION_STKSZ)
+ *(void **)(estacks + EXCEPTION_STKSZ) = gs;
estacks += exception_stack_sizes[v];
oist->ist[v] = t->x86_tss.ist[v] =
(unsigned long)estacks;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b9d99e0..53fa839 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -48,6 +48,7 @@
#include <asm/syscalls.h>
#include <asm/debugreg.h>
#include <asm/switch_to.h>
+#include <asm/fsgs.h>
asmlinkage extern void ret_from_fork(void);
@@ -260,6 +261,27 @@ void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp)
}
#endif
+/* Out of line to be protected from kprobes. */
+
+/* Interrupts are disabled here. */
+static noinline __kprobes void switch_gs_base(unsigned long gs)
+{
+ swapgs();
+ wrgsbase(gs);
+ swapgs();
+}
+
+/* Interrupts are disabled here. */
+static noinline __kprobes unsigned long read_user_gsbase(void)
+{
+ unsigned long gs;
+
+ swapgs();
+ gs = rdgsbase();
+ swapgs();
+ return gs;
+}
+
/*
* switch_to(x,y) should switch tasks from x to y.
*
@@ -291,6 +313,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
*/
savesegment(fs, fsindex);
savesegment(gs, gsindex);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ prev->fs = rdfsbase();
+ prev->gs = read_user_gsbase();
+ }
/*
* Load TLS before restoring any segments so that segment loads
@@ -330,6 +356,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
loadsegment(ds, next->ds);
/*
+ * Description of code path without FSGSBASE:
+ *
* Switch FS and GS.
*
* These are even more complicated than DS and ES: they have
@@ -361,8 +389,11 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
* base address.
*
* Note: This all depends on arch_prctl being the only way that
- * user code can override the segment base. Once wrfsbase and
- * wrgsbase are enabled, most of this code will need to change.
+ * user code can override the segment base.
+ *
+ * Description with FSGSBASE:
+ * We simply save/restore the bases, and the indexes.
+ *
*/
if (unlikely(fsindex | next->fsindex | prev->fs)) {
loadsegment(fs, next->fsindex);
@@ -379,8 +410,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (fsindex)
prev->fs = 0;
}
- if (next->fs)
- wrmsrl(MSR_FS_BASE, next->fs);
+ if (next->fs) {
+ if (static_cpu_has(X86_FEATURE_FSGSBASE))
+ wrfsbase(next->fs);
+ else
+ wrmsrl(MSR_FS_BASE, next->fs);
+ }
prev->fsindex = fsindex;
if (unlikely(gsindex | next->gsindex | prev->gs)) {
@@ -390,8 +425,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (gsindex)
prev->gs = 0;
}
- if (next->gs)
- wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+ if (next->gs) {
+ if (static_cpu_has(X86_FEATURE_FSGSBASE))
+ switch_gs_base(next->gs);
+ else
+ wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+ }
prev->gsindex = gsindex;
switch_fpu_finish(next_fpu, fpu_switch);
--
2.5.5
From: Andi Kleen <[email protected]>
Add C intrinsics and assembler macros for the new rd/wr fs/gs base
instructions and for swapgs.
Very straightforward. Used in a follow-on patch.
For the assembler macros, only the few registers used by entry_64.S
are defined.
v2: Use __always_inline
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/fsgs.h | 54 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)
create mode 100644 arch/x86/include/asm/fsgs.h
diff --git a/arch/x86/include/asm/fsgs.h b/arch/x86/include/asm/fsgs.h
new file mode 100644
index 0000000..8a9f900
--- /dev/null
+++ b/arch/x86/include/asm/fsgs.h
@@ -0,0 +1,54 @@
+#ifndef _ASM_FSGS_H
+#define _ASM_FSGS_H 1
+
+#ifndef __ASSEMBLY__
+
+static __always_inline void swapgs(void)
+{
+ asm volatile("swapgs" ::: "memory");
+}
+
+/* Must be protected by X86_FEATURE_FSGSBASE check. */
+
+static __always_inline unsigned long rdgsbase(void)
+{
+ unsigned long gs;
+ asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc8 # rdgsbaseq %%rax"
+ : "=a" (gs)
+ :: "memory");
+ return gs;
+}
+
+static __always_inline unsigned long rdfsbase(void)
+{
+ unsigned long fs;
+ asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc0 # rdfsbaseq %%rax"
+ : "=a" (fs)
+ :: "memory");
+ return fs;
+}
+
+static __always_inline void wrgsbase(unsigned long gs)
+{
+ asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd8 # wrgsbaseq %%rax"
+ :: "a" (gs)
+ : "memory");
+}
+
+static __always_inline void wrfsbase(unsigned long fs)
+{
+ asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd0 # wrfsbaseq %%rax"
+ :: "a" (fs)
+ : "memory");
+}
+
+#else
+
+/* Open-coded byte sequences for old assemblers that lack the instructions. */
+#define RDGSBASE_R15 .byte 0xf3,0x49,0x0f,0xae,0xcf
+#define WRGSBASE_RDI .byte 0xf3,0x48,0x0f,0xae,0xdf
+#define WRGSBASE_R15 .byte 0xf3,0x49,0x0f,0xae,0xdf
+
+#endif /* __ASSEMBLY__ */
+
+#endif
--
2.5.5
From: Andi Kleen <[email protected]>
The ptrace code for fs/gs base made some assumptions on
the state of fs/gs which are not true anymore on kernels
running with FSGSBASE.
With the new instructions it is very easy to access
the values, and they are always stored in the thread
struct. So just implement the straightforward code
to access it directly.
Note the direct access code path is only used for core dumps,
as with real ptrace the process is always blocked
and the state can be read from memory.
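For illustration only (not part of this patch), a debugger reads
these values through the usual user area offsets, e.g.:
	#include <stddef.h>
	#include <sys/ptrace.h>
	#include <sys/types.h>
	#include <sys/user.h>
	/* Sketch: read a stopped tracee's FS base via the user area.
	   Note a -1 return requires checking errno. */
	static long peek_fs_base(pid_t pid)
	{
		return ptrace(PTRACE_PEEKUSER, pid,
			      offsetof(struct user_regs_struct, fs_base), 0);
	}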
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/proto.h | 1 +
arch/x86/kernel/process_64.c | 15 +++++++++++++--
arch/x86/kernel/ptrace.c | 15 ++++++++++++++-
3 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index 9b9b30b..9f235e0 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -31,5 +31,6 @@ void x86_report_nx(void);
extern int reboot_force;
long do_arch_prctl(struct task_struct *task, int code, unsigned long addr);
+unsigned long read_user_gsbase(void);
#endif /* _ASM_X86_PROTO_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 5f40517..d7674d9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -272,7 +272,7 @@ static noinline __kprobes void switch_gs_base(unsigned long gs)
}
/* Interrupts are disabled here. */
-static noinline __kprobes unsigned long read_user_gsbase(void)
+static noinline __kprobes unsigned long __read_user_gsbase(void)
{
unsigned long gs;
@@ -282,6 +282,17 @@ static noinline __kprobes unsigned long read_user_gsbase(void)
return gs;
}
+unsigned long read_user_gsbase(void)
+{
+ unsigned long flags;
+ unsigned long gs;
+
+ local_irq_save(flags);
+ gs = __read_user_gsbase();
+ local_irq_restore(flags);
+ return gs;
+}
+
/*
* switch_to(x,y) should switch tasks from x to y.
*
@@ -315,7 +326,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
savesegment(gs, gsindex);
if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
prev->fs = rdfsbase();
- prev->gs = read_user_gsbase();
+ prev->gs = __read_user_gsbase();
}
/*
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 32e9d9c..b68b15b 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -38,6 +38,7 @@
#include <asm/hw_breakpoint.h>
#include <asm/traps.h>
#include <asm/syscall.h>
+#include <asm/fsgs.h>
#include "tls.h"
@@ -452,12 +453,18 @@ static unsigned long getreg(struct task_struct *task, unsigned long offset)
#ifdef CONFIG_X86_64
case offsetof(struct user_regs_struct, fs_base): {
+ unsigned int seg = task->thread.fsindex;
+ if (boot_cpu_has(X86_FEATURE_FSGSBASE)) {
+ if (task == current)
+ return rdfsbase();
+ else
+ return task->thread.fs;
+ }
/*
* do_arch_prctl may have used a GDT slot instead of
* the MSR. To userland, it appears the same either
* way, except the %fs segment selector might not be 0.
*/
- unsigned int seg = task->thread.fsindex;
if (task->thread.fs != 0)
return task->thread.fs;
if (task == current)
@@ -471,6 +478,12 @@ static unsigned long getreg(struct task_struct *task, unsigned long offset)
* Exactly the same here as the %fs handling above.
*/
unsigned int seg = task->thread.gsindex;
+ if (boot_cpu_has(X86_FEATURE_FSGSBASE)) {
+ if (task == current)
+ return read_user_gsbase();
+ else
+ return task->thread.gs;
+ }
if (task->thread.gs != 0)
return task->thread.gs;
if (task == current)
--
2.5.5
From: Andi Kleen <[email protected]>
Add FS/GS base dumping to the standard ELF_CORE_COPY_REGS macro.
I think this is only used in some special cases; the majority
of core dumps seem to go through the getregs interface also
used by ptrace.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/elf.h | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 0df9c95..f57cc17 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -9,6 +9,8 @@
#include <asm/ptrace.h>
#include <asm/user.h>
#include <asm/auxvec.h>
+#include <asm/proto.h>
+#include <asm/fsgs.h>
typedef unsigned long elf_greg_t;
@@ -226,8 +228,13 @@ do { \
(pr_reg)[18] = (regs)->flags; \
(pr_reg)[19] = (regs)->sp; \
(pr_reg)[20] = (regs)->ss; \
- (pr_reg)[21] = current->thread.fs; \
- (pr_reg)[22] = current->thread.gs; \
+ if (boot_cpu_has(X86_FEATURE_FSGSBASE)) { \
+ (pr_reg)[21] = rdfsbase(); \
+ (pr_reg)[22] = read_user_gsbase(); \
+ } else { \
+ (pr_reg)[21] = current->thread.fs; \
+ (pr_reg)[22] = current->thread.gs; \
+ } \
asm("movl %%ds,%0" : "=r" (v)); (pr_reg)[23] = v; \
asm("movl %%es,%0" : "=r" (v)); (pr_reg)[24] = v; \
asm("movl %%fs,%0" : "=r" (v)); (pr_reg)[25] = v; \
--
2.5.5
On Mar 21, 2016 9:16 AM, "Andi Kleen" <[email protected]> wrote:
>
> From: Andi Kleen <[email protected]>
>
> [...]
> @@ -678,6 +680,7 @@ ENTRY(\sym)
> jnz 1f
> .endif
> call paranoid_entry
> + /* r15: previous gs if FSGSBASE, otherwise %ebx: swapgs flag */
[...]
The asm looks generally correct.
> @@ -1422,8 +1425,14 @@ void cpu_init(void)
> */
> if (!oist->ist[0]) {
> char *estacks = per_cpu(exception_stacks, cpu);
> + void *gs = per_cpu(irq_stack_union.gs_base, cpu);
>
> for (v = 0; v < N_EXCEPTION_STACKS; v++) {
> + /* Store GS at bottom of stack for bootstrap access */
> + *(void **)estacks = gs;
> + /* Put it on every 4K entry */
> + if (exception_stack_sizes[v] > EXCEPTION_STKSZ)
> + *(void **)(estacks + EXCEPTION_STKSZ) = gs;
What if it's more than 2x the normal size?
(The debug stack should just be deleted entirely, but that's a separate issue.)
> estacks += exception_stack_sizes[v];
> oist->ist[v] = t->x86_tss.ist[v] =
> (unsigned long)estacks;
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index b9d99e0..53fa839 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -48,6 +48,7 @@
> #include <asm/syscalls.h>
> #include <asm/debugreg.h>
> #include <asm/switch_to.h>
> +#include <asm/fsgs.h>
>
> asmlinkage extern void ret_from_fork(void);
>
> @@ -260,6 +261,27 @@ void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp)
> }
> #endif
>
> +/* Out of line to be protected from kprobes. */
> +
> +/* Interrupts are disabled here. */
> +static noinline __kprobes void switch_gs_base(unsigned long gs)
> +{
> + swapgs();
> + wrgsbase(gs);
> + swapgs();
> +}
Can we call this write_user_gsbase(unsigned long gsbase) for consistency?
> +
> +/* Interrupts are disabled here. */
> +static noinline __kprobes unsigned long read_user_gsbase(void)
> +{
> + unsigned long gs;
> +
> + swapgs();
> + gs = rdgsbase();
> + swapgs();
> + return gs;
> +}
> +
> /*
> * switch_to(x,y) should switch tasks from x to y.
> *
> @@ -291,6 +313,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> */
> savesegment(fs, fsindex);
> savesegment(gs, gsindex);
> + if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
> + prev->fs = rdfsbase();
> + prev->gs = read_user_gsbase();
> + }
Please add a patch before this one that renames gs to gsbase. This is
unreadable as is.
Also, the existing code is wrong on AMD CPUs, but that's more or less
independent of your series.
--Andy
On Mon, Mar 21, 2016 at 9:16 AM, Andi Kleen <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> [...]
> +static __always_inline void wrgsbase(unsigned long gs)
> +{
> + asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd8 # wrgsbaseq %%rax"
> + :: "a" (gs)
> + : "memory");
> +}
> +
"unsigned long gsbase", perhaps?
--Andy
On Mon, Mar 21, 2016 at 9:16 AM, Andi Kleen <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> [...]
> +static noinline __kprobes void reload_user_gs(unsigned long addr)
> +{
> + local_irq_disable();
> + swapgs();
> + loadsegment(gs, 0);
> + wrgsbase(addr);
> + swapgs();
> + local_irq_enable();
> +}
The actual operation this does is to set the selector to zero and the
base to the specified value. Can you give it a name that makes it
clear (e.g. zero_user_gs_and_set_base)?
I'm also wondering whether it would make sense to move the cpu_has
into these helpers rather than putting it in the callers.
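Something like this untested sketch (helper name illustrative only;
error handling omitted):
	static void write_user_gsbase(unsigned long gsbase)
	{
		if (boot_cpu_has(X86_FEATURE_FSGSBASE)) {
			reload_user_gs(gsbase);
		} else {
			load_gs_index(0);
			wrmsrl_safe(MSR_KERNEL_GS_BASE, gsbase);
		}
	}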
--Andy
On Mon, Mar 21, 2016 at 11:39:07AM -0700, Andy Lutomirski wrote:
> 4. Does the sigcontext format need to change?
I don't think it needs to change. fs/gs are global state and the
signal handlers are not likely to change it.
Also there is no default that could be restored.
>
> For maximum safety, comprehensibility, and sanity, there's an argument
> to be made that 1a and 2a should leave the state exactly as it started
> and that 1b and 2b should leave it alone unless percpu bases are in
> use. For maximum simplicity of implementation, there's an argument
> that, if the fs or gs selector is nonzero and the base doesn't match
> the in-memory descriptor, then the kernel can do whatever it wants.
>
> I propose the following semantics:
So you want to change the existing semantics. We had this discussion
before. I think it is out of scope of my patch, which just extends the
existing semantics to support the instructions.
(what happened in the system call before is now possible in ring 3)
If you want to invent some new overengineered semantics you can do it in some
follow-on patch.
Personally I think it is pointless. The existing semantics are fine.
> Does all this make sense? Do people agree with me?
I think you are overcomplicating something fundamentally simple.
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Mar 21, 2016 at 12:05 PM, Andi Kleen <[email protected]> wrote:
> On Mon, Mar 21, 2016 at 11:13:05AM -0700, Andy Lutomirski wrote:
>> On Mar 21, 2016 9:16 AM, "Andi Kleen" <[email protected]> wrote:
>> >
>> > From: Andi Kleen <[email protected]>
>> >
>> > [...]
>> > @@ -1422,8 +1425,14 @@ void cpu_init(void)
>> > */
>> > if (!oist->ist[0]) {
>> > char *estacks = per_cpu(exception_stacks, cpu);
>> > + void *gs = per_cpu(irq_stack_union.gs_base, cpu);
>> >
>> > for (v = 0; v < N_EXCEPTION_STACKS; v++) {
>> > + /* Store GS at bottom of stack for bootstrap access */
>> > + *(void **)estacks = gs;
>> > + /* Put it on every 4K entry */
>> > + if (exception_stack_sizes[v] > EXCEPTION_STKSZ)
>> > + *(void **)(estacks + EXCEPTION_STKSZ) = gs;
>>
>> What if it's more than 2x the normal size?
>
> Well it is not and cannot be. Is that a trick question?
It isn't, but I had to look at the header to find that out.
Presumably either the code should work no matter what the stack sizes
are or it should assert that the sizes are always either
EXCEPTION_STKSZ or 2*EXCEPTION_STKSZ.
--Andy
On Mon, Mar 21, 2016 at 11:13:05AM -0700, Andy Lutomirski wrote:
> On Mar 21, 2016 9:16 AM, "Andi Kleen" <[email protected]> wrote:
> >
> > From: Andi Kleen <[email protected]>
> >
> > [...]
> > @@ -1422,8 +1425,14 @@ void cpu_init(void)
> > */
> > if (!oist->ist[0]) {
> > char *estacks = per_cpu(exception_stacks, cpu);
> > + void *gs = per_cpu(irq_stack_union.gs_base, cpu);
> >
> > for (v = 0; v < N_EXCEPTION_STACKS; v++) {
> > + /* Store GS at bottom of stack for bootstrap access */
> > + *(void **)estacks = gs;
> > + /* Put it on every 4K entry */
> > + if (exception_stack_sizes[v] > EXCEPTION_STKSZ)
> > + *(void **)(estacks + EXCEPTION_STKSZ) = gs;
>
> What if it's more than 2x the normal size?
Well it is not and cannot be. Is that a trick question?
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Mar 21, 2016 at 12:03 PM, Andi Kleen <[email protected]> wrote:
> On Mon, Mar 21, 2016 at 11:39:07AM -0700, Andy Lutomirski wrote:
>> 4. Does the sigcontext format need to change?
>
> I don't think it needs to change. fs/gs are global state and
> signal handlers are not likely to change them.
>
> Also there is no default that could be restored.
>
>>
>> For maximum safety, comprehensibility, and sanity, there's an argument
>> to be made that 1a and 2a should leave the state exactly as it started
>> and that 1b and 2b should leave it alone unless percpu bases are in
>> use. For maximum simplicity of implementation, there's an argument
>> that, if the fs or gs selector is nonzero and the base doesn't match
>> the in-memory descriptor, then the kernel can do whatever it wants.
>>
>> I propose the following semantics:
>
> So you want to change the existing semantics. We had this discussion
> before. I think it is out of scope for my patch, which just extends the
> existing semantics to support the instructions.
>
> (what happened in the system call before is now possible in ring 3)
>
> If you want to invent some new overengineered semantics you can do it in some
> follow-on patch.
>
> Personally I think it is pointless. The existing semantics are fine.
I strongly disagree.
You're adding an hwcap bit because you expect user code to use this
thing, which means you're adding an ABI, which means that the
semantics should be given due consideration.
If the goal were just to speed up context switches, then just maybe it
would make sense to enable it in such a way that user code *doesn't*
use it. But Intel wasn't nice enough to let us switch the bit for
CPL0 only, and you added an hwcap bit, so here we are...
--Andy
On Mon, Mar 21, 2016 at 02:49:44PM -0400, Brian Gerst wrote:
> On Mon, Mar 21, 2016 at 12:16 PM, Andi Kleen <[email protected]> wrote:
> > From: Andi Kleen <[email protected]>
> >
> > The kernel needs to explicitly enable RD/WRFSBASE to handle context
> > switch correctly. So the application needs to know if it can safely use
> > these instructions. Just looking at the CPUID bit is not enough because it
> > may be running on a kernel that does not enable the instructions.
> >
> > One way for the application would be to just try and catch the SIGILL.
> > But that is difficult to do in libraries which may not want
> > to overwrite the signal handlers of the main application.
> >
> > So we need to provide a way for the application to discover the kernel
> > capability.
> >
> > I used AT_HWCAP2 in the ELF aux vector, which is already used by
> > PPC for similar things. We define a new Linux-defined bitmap
> > returned in AT_HWCAP2. Currently it has only one bit defined,
> > indicating that the kernel is FSGSBASE capable.
> >
> > The application can then access it manually or use
> > the getauxval() function available in newer glibc.
>
> How about adding a VDSO function instead? The VDSO can use
> alternatives, so it can use the new instructions if supported, or else
> use the old syscall.
What would be the point of that?
It would be a lot more complicated, and I don't see any advantages
over the aux vector. vdso also requires custom assembler
stubs in the C library.
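For reference, the application-side check with the aux vector is
tiny (sketch; getauxval() needs glibc 2.16+, AT_HWCAP2 comes via
<elf.h>, and HWCAP2_FSGSBASE is redefined here from the new uapi
header):

        #include <sys/auxv.h>

        #define HWCAP2_FSGSBASE (1 << 0)

        static int kernel_has_fsgsbase(void)
        {
                return !!(getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE);
        }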
-Andi
On Mon, Mar 21, 2016 at 2:54 PM, Andi Kleen <[email protected]> wrote:
> On Mon, Mar 21, 2016 at 02:49:44PM -0400, Brian Gerst wrote:
>> On Mon, Mar 21, 2016 at 12:16 PM, Andi Kleen <[email protected]> wrote:
>> > From: Andi Kleen <[email protected]>
>> >
>> > The kernel needs to explicitly enable RD/WRFSBASE to handle context
>> > switch correctly. So the application needs to know if it can safely use
>> > these instructions. Just looking at the CPUID bit is not enough because it
>> > may be running on a kernel that does not enable the instructions.
>> >
>> > One way for the application would be to just try and catch the SIGILL.
>> > But that is difficult to do in libraries which may not want
>> > to overwrite the signal handlers of the main application.
>> >
>> > So we need to provide a way for the application to discover the kernel
>> > capability.
>> >
>> > I used AT_HWCAP2 in the ELF aux vector, which is already used by
>> > PPC for similar things. We define a new Linux-defined bitmap
>> > returned in AT_HWCAP2. Currently it has only one bit defined,
>> > indicating that the kernel is FSGSBASE capable.
>> >
>> > The application can then access it manually or use
>> > the getauxval() function available in newer glibc.
>>
>> How about adding a VDSO function instead? The VDSO can use
>> alternatives, so it can use the new instructions if supported, or else
>> use the old syscall.
>
> What would be the point of that?
>
> It would be a lot more complicated, and I don't see any advantages
> over the aux vector. vdso also requires custom assembler
> stubs in the C library.
>
> -Andi
It would be less complicated actually, as normal userspace would just
continue to call arch_prctl() as it does today. Glibc would implement
arch_prctl() just like it does with gettimeofday() -- with an ifunc
selector [1] that calls the VDSO function if it is available, or the
syscall if not. No custom assembly needed.
[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/gettimeofday.c;h=36f7c26ffb0e818709d032c605fec8c4bd22a14e;hb=HEAD
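The same dispatch pattern, sketched here selecting between the new
instruction and the syscall rather than vdso-vs-syscall (my
illustration, not glibc code; ARCH_SET_GS and HWCAP2_FSGSBASE are
redefined locally):

        #include <sys/auxv.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <immintrin.h>

        #define ARCH_SET_GS     0x1001
        #define HWCAP2_FSGSBASE (1 << 0)

        __attribute__((target("fsgsbase")))
        static long set_gsbase_insn(unsigned long base)
        {
                _writegsbase_u64(base);         /* WRGSBASE */
                return 0;
        }

        static long set_gsbase_syscall(unsigned long base)
        {
                return syscall(SYS_arch_prctl, ARCH_SET_GS, base);
        }

        /* ifunc resolver: runs once, at relocation time */
        static void *resolve_set_gsbase(void)
        {
                return (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE)
                        ? (void *)set_gsbase_insn
                        : (void *)set_gsbase_syscall;
        }

        long set_gsbase(unsigned long base)
                __attribute__((ifunc("resolve_set_gsbase")));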
--
Brian Gerst
> You're adding an hwcap bit because you expect user code to use this
> thing, which means you're adding an ABI, which means that the
> semantics should be given due consideration.
Right, I did that and concluded that the existing semantics are fine.
They also worked fine for many years with the system call.
We have two different modes:
- Code uses old FS/GS selector, gs selector is not zero
In this case the selector base in GDT/LDT takes preference.
This is legacy, but still works fine.
- Code uses 64bit base, either through arch_prctl or the new
instructions. In this case FS/GS selector has to be zero.
This is the new expected mode for 64bit code.
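In code that mode is simply (user-space sketch; ARCH_SET_GS is
from <asm/prctl.h>):

        #include <sys/syscall.h>
        #include <unistd.h>
        #include <asm/prctl.h>

        /* sets the 64bit base and zeroes the %gs selector */
        static long set_gsbase_prctl(unsigned long base)
        {
                return syscall(SYS_arch_prctl, ARCH_SET_GS, base);
        }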
With the new instructions the modes can be temporarily
out of sync (GS/FS selector != 0, but a different base loaded),
but they will always be reset on the next context switch.
Your previous objection was that this allows detecting
context switches, but that's already possible in other
ways so I think it's a red herring.
Also if you really want to change it you can do so
in a follow-on patch under your own name.
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Mar 21, 2016 at 9:16 AM, Andi Kleen <[email protected]> wrote:
> This is a reworked version of my older fsgsbase patchkit.
> Main changes:
> - Ported to new entry/* code, which simplified it somewhat
> - Now has a test program
> - Fixed ptrace/core dump support
> - Better documentation
> - Some minor fixes and improvements
I think that the biggest remaining issue is to define the semantics.
As an architectural matter, the relevant user state is (fs selector,
fs base, gs selector, gs base). With FSGSBASE enabled, user code can
more or less independently control all four of those values. (It's
slightly more complicated than that because set_thread_area and
modify_ldt both forget to reload segment registers IIRC, but we can
fix that independently.)
Keeping in mind that we'll probably want to add percpu segment bases
at some point (to allow very fast atomic percpu data access for user
code), the questions I have are:
1a. What happens when a task switches out and back in on the same CPU?
1b. What happens when a task switches out and back in on a different CPU?
2a. What happens when a tracer reads the state out and writes exactly
the same thing back in and the task resumes on the CPU it started on?
2b. What happens when a tracer reads the state out and writes exactly
the same thing back in and the task resumes on a different CPU?
3. What happens if fs or gs points to a real descriptor and that
descriptor changes?
4. Does the sigcontext format need to change?
For maximum safety, comprehensibility, and sanity, there's an argument
to be made that 1a and 2a should leave the state exactly as it started
and that 1b and 2b should leave it alone unless percpu bases are in
use. For maximum simplicity of implementation, there's an argument
that, if the fs or gs selector is nonzero and the base doesn't match
the in-memory descriptor, then the kernel can do whatever it wants.
I propose the following semantics:
- All "save state" or "report state" events unconditionally save the
base and selector as they actually were in the CPU state. (Keep it
simple. Also, with these patches applied, on an FSGSBASE-capable CPU,
selector != 0 is a slow path.)
- When restoring state, if selector == 0, then the base is restored as it was.
- When restoring state, if selector != 0, then the base is restored
to whatever the in-memory descriptor says. (Optionally, down the
road, we could make it so that a save + restore without an intervening
migration, set_thread_area, or modify_ldt would restore the base as it
was. This would make things more predictable.)
- If/when we add percpu bases, they are associated with a nonzero selector.
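In code form, the restore rules above would be roughly (sketch;
the struct and wrgsbase() are mine, load_gs_index() is the
existing kernel helper):

        struct seg_state {
                unsigned short  sel;    /* selector as saved */
                unsigned long   base;   /* base as saved */
        };

        static void restore_gs(const struct seg_state *s)
        {
                if (s->sel == 0)
                        wrgsbase(s->base);      /* restore the raw base */
                else
                        load_gs_index(s->sel);  /* base comes from the
                                                   in-memory descriptor */
        }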
The big open question is: should signal delivery and restore do
anything to the selectors or bases? I think that, by default, it
can't, but maybe we'll want an option to do it some day.
Does all this make sense? Do people agree with me?
> It would be less complicated actually, as normal userspace would just
> continue to call arch_prctl() as it does today. Glibc would implement
We already have that through the system call; there is no advantage
to putting it into a vsyscall.
Also the experience with getcpu and similar so far is that
vsyscalls are too slow for the users who want really fast paths.
So they prefer to use the direct instructions anyway.
> arch_prctl() just like it does with gettimeofday() -- with an ifunc
> selector [1] that calls the VDSO function if it is available, or the
> syscall if not. No custom assembly needed.
vdso always needs custom assembler; see how glibc implements it.
-Andi
On Mon, Mar 21, 2016 at 12:16 PM, Andi Kleen <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> The kernel needs to explicitly enable RD/WRFSBASE to handle context
> switch correctly. So the application needs to know if it can safely use
> these instructions. Just looking at the CPUID bit is not enough because it
> may be running on a kernel that does not enable the instructions.
>
> One way for the application would be to just try and catch the SIGILL.
> But that is difficult to do in libraries which may not want
> to overwrite the signal handlers of the main application.
>
> So we need to provide a way for the application to discover the kernel
> capability.
>
> I used AT_HWCAP2 in the ELF aux vector, which is already used by
> PPC for similar things. We define a new Linux-defined bitmap
> returned in AT_HWCAP2. Currently it has only one bit defined,
> indicating that the kernel is FSGSBASE capable.
>
> The application can then access it manually or use
> the getauxval() function available in newer glibc.
How about adding a VDSO function instead? The VDSO can use
alternatives, so it can use the new instructions if supported, or else
use the old syscall.
--
Brian Gerst
> Please add a patch before this one that renames gs to gsbase. This is
> unreadable as is.
I investigated this now, and it's not straightforward because on 32bit
->gs is actually gsindex, not gsbase. So with a straight rename you
would end up with the index in "base" on 32bit, which would be confusing.
Perhaps this could be cleaned up at some point, switching 32bit
to use ->gsindex. But not right now; it seems somewhat risky,
and I don't want to make it part of this patch kit.
FWIW I think it's readable.
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Mar 21, 2016 at 12:40 PM, Andi Kleen <[email protected]> wrote:
>> You're adding an hwcap bit because you expect user code to use this
>> thing, which means you're adding an ABI, which means that the
>> semantics should be given due consideration.
>
> Right I did that and concluded the existing semantics are fine.
> They also worked fine for many years with the system call.
>
> We have two different modes:
>
> - Code uses old FS/GS selector, gs selector is not zero
> In this case the selector base in GDT/LDT takes preference.
In this case the selector base in GDT/LDT is the whole story because
arch_prctl zeroes the selector.
>
> This is legacy, but still works fine.
>
> - Code uses 64bit base, either through arch_prctl or the new
> instructions. In this case FS/GS selector has to be zero.
>
> This is the new expected mode for 64bit code.
>
> With the new instructions the modes can be temporarily
> out of sync (GS/FS selector != 0, but a different base loaded),
> but they will always be reset on the next context switch.
>
> Your previous objection was that this allows detecting
> context switches, but that's already possible in other
> ways so I think it's a red herring.
>
> Also if you really want to change it you can do so
> in a follow-on patch under your own name.
ARCH_SET_FS and ARCH_SET_GS *zero the selector*. WRFSBASE and
WRGSBASE *do not zero the selector*. This design is, in my mind,
obnoxious and represents an error on Intel's part, but it's what the
docs say the cpu does and I have no reason to doubt the docs.
So a patchset to enable these asinine new instructions needs to take
this into account, and the ABI issue needs to be addressed, even if
the answer is that the proposed code is fine.
(Also, the existing code is fscked up. Guess what xor %eax, %eax; mov
%ax, %gs does to the base on AMD? The existing code is *wrong*, and I
don't want to see it get wronger.)
And no, I don't really care about programs detecting context switches.
I do, however, care about allowing non-determinism in things that
ought to behave deterministically. Writing a nonzero value to %gs and
then doing WRGSBASE is something that user code will be able to do
whether we like it or not, some shitty threading library is likely to
do this just to spite us, and the kernel needs to do *something* when
this happens.
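Concretely, the sequence I mean is (sketch, gcc syntax; 0x2b is
just some nonzero selector):

        #include <immintrin.h>

        __attribute__((target("fsgsbase")))
        static void mismatch(unsigned long base)
        {
                unsigned short sel = 0x2b;

                asm volatile("mov %0, %%gs" : : "r" (sel));
                _writegsbase_u64(base); /* %gs != 0, but the base no longer
                                           matches the descriptor */
        }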
On Mon, Mar 21, 2016 at 3:05 PM, Andi Kleen <[email protected]> wrote:
>> Please add a patch before this one that renames gs to gsbase. This is
>> unreadable as is.
>
> I investigated this now, and it's not straightforward because on 32bit
> ->gs is actually gsindex, not gsbase. So with a straight rename you
> would end up with the index in "base" on 32bit, which would be confusing.
>
I would take this inconsistency as a reason why this needs to be fixed
before any further changes are made.
The variables should be called "gsbase" and "gsindex", full stop.
It's not particularly risky -- we could even compare the generated
code if we cared.
--Andy
On Mar 21, 2016 12:43 PM, "Andi Kleen" <[email protected]> wrote:
>
> > It would be less complicated actually, as normal userspace would just
> > continue to call arch_prctl() as it does today. Glibc would implement
>
> We already have that through the system call; there is no advantage
> to putting it into a vsyscall.
>
> Also the experience with getcpu and similar so far is that
> vsyscalls are too slow for the users who want really fast paths.
> So they prefer to use the direct instructions anyway.
Getcpu is mainly slow because the overcomplicated API requires
branches. I've been tempted to add __vdso_get_cpu as an alternative
that simply returns the CPU number.
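Something like this, say (hypothetical sketch; it relies on the
cpu/node encoding the kernel already writes to TSC_AUX for the
existing vgetcpu):

        /* would live in the vdso; RDTSCP returns TSC_AUX in ECX,
           which the kernel sets to (node << 12) | cpu */
        unsigned int __vdso_get_cpu(void)
        {
                unsigned int aux;

                asm volatile("rdtscp" : "=c" (aux) : : "rax", "rdx");
                return aux & 0xfff;
        }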
>
> > arch_prctl() just like it does with gettimeofday() -- with an ifunc
> > selector [1] that calls the VDSO function if it is available, or the
> > syscall if not. No custom assembly needed.
>
> vdso always needs custom assembler; see how glibc implements it.
This is simply not true.
I haven't checked the glibc implementation, and I wouldn't be remotely
surprised if it is maliciously incomprehensible, but there is no
reason whatsoever that using any vdso mechanism other than AT_SYSINFO
itself requires assembler. AT_SYSINFO, of course, requires assembler
because the calling convention is weird.
--Andy
> So a patchset to enable these asinine new instructions needs to take
> this into account, and the ABI issue needs to be addressed, even if
What's the ABI issue?
AFAIK we're perfectly consistent.
> the answer is that the proposed code is fine.
>
> (Also, the existing code is fscked up. Guess what xor %eax, %eax; mov
> %ax, %gs does to the base on AMD? The existing code is *wrong*, and I
> don't want to see it get wronger.)
I have no idea, but changing it is definitely not in scope for my patches.
>
> And no, I don't really care about programs detecting context switches.
> I do, however, care about allowing non-determinism in things that
> ought to behave deterministically. Writing a nonzero value to %gs and
> then doing WRGSBASE is something that user code will be able to do
> whether we like it or not, some shitty threading library is likely to
> do this just to spite us, and the kernel needs to do *something* when
> this happens.
They will quickly notice it if there is a problem, so I don't think
we need to worry about that.
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Mar 21, 2016 at 03:08:46PM -0700, Andy Lutomirski wrote:
> On Mon, Mar 21, 2016 at 3:05 PM, Andi Kleen <[email protected]> wrote:
> >> Please add a patch before this one that renames gs to gsbase. This is
> >> unreadable as is.
> >
> > I investigated this now, and it's not straightforward because on 32bit
> > ->gs is actually gsindex, not gsbase. So with a straight rename you
> > would end up with the index in "base" on 32bit, which would be confusing.
> >
>
> I would take this inconsistency as a reason why this needs to be fixed
> before any further changes are made.
The patchkit doesn't change anything for 32bit, so any inconsistency
in 32bit is completely orthogonal.
I don't see how adding a bazillion unrelated changes makes this
patchkit any better.
>
> The variables should be called "gsbase" and "gsindex", full stop.
> It's not particularly risky -- we could even compare the generated
> code if we cared.
Iff you can catch all the ifdefs ...
I looked through the grep hits and I don't think it's straightforward.
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Mar 21, 2016 at 3:11 PM, Andi Kleen <[email protected]> wrote:
>> So a patchset to enable these asinine new instructions needs to take
>> this into account, and the ABI issue needs to be addressed, even if
>
> What's the ABI issue?
>
> AFAIK we're perfectly consistent.
>
>> the answer is that the proposed code is fine.
>>
>> (Also, the existing code is fscked up. Guess what xor %eax, %eax; mov
>> %ax, %gs does to the base on AMD? The existing code is *wrong*, and I
>> don't want to see it get wronger.)
>
> I have no idea, but changing it is definitely not in scope for my patches.
>
>>
>> And no, I don't really care about programs detecting context switches.
>> I do, however, care about allowing non-determinism in things that
>> ought to behave deterministically. Writing a nonzero value to %gs and
>> then doing WRGSBASE is something that user code will be able to do
>> whether we like it or not, some shitty threading library is likely to
>> do this just to spite us, and the kernel needs to do *something* when
>> this happens.
>
> They will quickly notice it if there is a problem, so I don't think
> we need to worry about that.
Really?
Imagine that some brilliant lightweight threading library does:
- set GS to nonzero (by whatever means -- arch_prctl(ARCH_SET_GS,
whatever) on a pre-IVB host followed by migration, some modify_ldt
garbage, simple bloody-mindedness, whatever);
- WRGSBASE
- Use GS for a bit
This will work most of the time until it gets unlucky with preemption.
And yes, runtime library authors really do mess up in amazing ways.
It's an issue. It needs conscious design.
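Spelled out in code (sketch; sched_yield() stands in for any
preemption point):

        #include <immintrin.h>
        #include <sched.h>

        __attribute__((target("fsgsbase")))
        static unsigned long unlucky(void)
        {
                unsigned short sel = 0x2b;      /* nonzero %gs, however obtained */

                asm volatile("mov %0, %%gs" : : "r" (sel));
                _writegsbase_u64(0x12345000);
                sched_yield();                  /* maybe a context switch */
                return _readgsbase_u64();       /* 0x12345000 if we got lucky,
                                                   the descriptor base if not */
        }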
> Imagine that some brilliant lightweight threading library does:
>
> - set GS to nonzero (by whatever means -- arch_prctl(ARCH_SET_GS,
> whatever) on a pre-IVB host followed by migration, some modify_ldt
> garbage, simple bloody-mindedness, whatever);
Migration is only possible when the CPUID flags match.
> - WRGSBASE
> - Use GS for a bit
>
> This will work most of the time until it gets unlucky with preemption.
As soon as a kernel thread or something else schedules, the value
will be lost.
> And yes, runtime library authors really do mess up in amazing ways.
>
> It's an issue. It needs conscious design.
Ok. So your only objection is the order of the context switch
updates?
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Mar 21, 2016 at 3:41 PM, Andi Kleen <[email protected]> wrote:
>> Imagine that some brilliant lightweight threading library does:
>>
>> - set GS to nonzero (by whatever means -- arch_prctl(ARCH_SET_GS,
>> whatever) on a pre-IVB host followed by migration, some modify_ldt
>> garbage, simple bloody-mindedness, whatever);
>
> Migration is only possible when the CPUID flags match.
>
>> - WRGSBASE
>> - Use GS for a bit
>>
>> This will work most of the time until it gets unlucky with preemption.
>
> As soon as a kernel thread or something else schedules, the value
> will be lost.
>
>> And yes, runtime library authors really do mess up in amazing ways.
>>
>> It's an issue. It needs conscious design.
>
> Ok. So your only objection is the order of the context switch
> updates?
No. My objection is that there needs to be an explicit statement what
the semantics are. If the agreed-upon semantics are "undefined
behavior if GS != 0 and GSBASE doesn't match the descriptor", so be
it, but this needs to be a conscious decision and needs to be weighed
against the alternatives.
The actual implementation details are just details. They need to
match the intended semantics, of course.
--Andy
> No. My objection is that there needs to be an explicit statement what
> the semantics are. If the agreed-upon semantics are "undefined
> behavior if GS != 0 and GSBASE doesn't match the descriptor", so be
> it, but this needs to be a conscious decision and needs to be weighed
> against the alternatives.
Documentation/x86/fsgs.txt already has this statement:
>>>
Another requirement is that the FS or GS selector has to be zero
(this is normally true unless it is changed explicitly). When it is
non-zero the context switch assumes the bases were loaded through
the LDT/GDT, and will reload them.
<<<
Is that sufficient?
>
> The actual implementation details are just details. They need to
> match the intended semantics, of course.
I believe my implementation matches the paragraph above.
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Mar 21, 2016 at 3:52 PM, Andi Kleen <[email protected]> wrote:
>> No. My objection is that there needs to be an explicit statement what
>> the semantics are. If the agreed-upon semantics are "undefined
>> behavior if GS != 0 and GSBASE doesn't match the descriptor", so be
>> it, but this needs to be a conscious decision and needs to be weighed
>> against the alternatives.
>
> Documentation/x86/fsgs.txt already has this statement:
>
>>>>
> Another requirement is that the FS or GS selector has to be zero
> (this is normally true unless it is changed explicitly). When it is
> non-zero the context switch assumes the bases were loaded through
> the LDT/GDT, and will reload them.
> <<<
>
> Is that sufficient?
>
Maybe. Are there better options? Could we, for example, actually try
to preserve the state if this happens? Would it be worth it?
>>
>> The actual implementation details are just details. They need to
>> match the intended semantics, of course.
>
> I believe my implementation matches the paragraph above.
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.
--
Andy Lutomirski
AMA Capital Management, LLC
> Maybe. Are there better options? Could we, for example, actually try
> to preserve the state if this happens?
We probably could, at the cost of making the context switch a bit
more expensive.
> Would it be worth it?
I doubt it.
I expect that once WR*BASE is widely used, no one will bother with
selectors for 64bit programs anymore.
-Andi
On Mon, 21 Mar 2016, Andi Kleen wrote:
> On Mon, Mar 21, 2016 at 03:08:46PM -0700, Andy Lutomirski wrote:
> > On Mon, Mar 21, 2016 at 3:05 PM, Andi Kleen <[email protected]> wrote:
> > >> Please add a patch before this one that renames gs to gsbase. This is
> > >> unreadable as is.
> > >
> > > I investigated this now, and it's not straightforward because on 32bit
> > > ->gs is actually gsindex, not gsbase. So with a straight rename you
> > > would end up with the index in "base" on 32bit, which would be confusing.
> > >
> >
> > I would take this inconsistency as a reason why this needs to be fixed
> > before any further changes are made.
>
> The patchkit doesn't change anything for 32bit, so any inconsistency
> in 32bit is completely orthogonal.
>
> I don't see how adding a bazillion unrelated changes makes this
> patchkit any better.
The general rule is that we do necessary cleanups before we change or add new
functionality. That's nothing which can be negotiated. Period.
Thanks,
tglx
On Mon, Mar 21, 2016 at 6:15 PM, Andi Kleen <[email protected]> wrote:
> On Mon, Mar 21, 2016 at 03:08:46PM -0700, Andy Lutomirski wrote:
>> On Mon, Mar 21, 2016 at 3:05 PM, Andi Kleen <[email protected]> wrote:
>> >> Please add a patch before this one that renames gs to gsbase. This is
>> >> unreadable as is.
>> >
>> > I investigated this now, and it's not straightforward because on 32bit
>> > ->gs is actually gsindex, not gsbase. So with a straight rename you
>> > would end up with the index in "base" on 32bit, which would be confusing.
>> >
>>
>> I would take this inconsistency as a reason why this needs to be fixed
>> before any further changes are made.
>
> The patchkit doesn't change anything for 32bit, so any inconsistency
> in 32bit is completely orthogonal.
The inconsistency is with the 64-bit code. fs/gs imply the %fs and
%gs registers (the selector index), not the base.
The rename should be:
fs -> fsbase
gs -> gsbase
fsindex -> fs
gsindex -> gs
--
Brian Gerst
On Mon, 21 Mar 2016 09:16:05 -0700, Andi Kleen said:
> From: Andi Kleen <[email protected]>
>
> v2: Minor updates to documentation requested in review.
> +In addition the kernel needs to explicitly enable these instructions, as it
> +may otherwise not correctly context switch the state. Newer Linux
> +kernels enable this. When the kernel does not enable the instructions,
> +they will fault with an #UD exception.
Looks OK for now, since the code hasn't actually landed. However, once
we know for sure which release it lands in, we probably want to replace "Newer" with
"4.N or later" or something - particularly if the man-pages project will
be using this as source material.
On Mon, Mar 21, 2016 at 9:16 AM, Andi Kleen <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> Introduction:
>
> IvyBridge added four new instructions to directly read and write the fs and gs
> 64bit base registers. Previously this had to be done with a system
> call to write to MSRs. The main use case is fast user space threading
> and switching the fs/gs registers quickly there. Another use
> case is having (relatively) cheap access to a new address
> register per thread.
I'm queuing up a variant of this patch. I'll send it out for review
when it's ready.
--Andy