2019-03-15 20:09:27

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 00/12] x86: Enable FSGSBASE instructions

Updates from v5 [5]:
* Drop the new taint flag (TAINT_INSECURE)
* Cleanup copy_thread_tls(), some changelog, and unnecessary comments on
assembly macros
* Rearrange some helper updates appropriately (from patch 4 to 6)

Updates from v4 [4]:
* Remove the FSGSBASE assembly macros

Updates from v3 [3]:
* Raise minimum binutils requirement to use the new instructions directly
* Optimize FIND_PERCPU_BASE macro
* Rename some helper functions, __{rd,wr}gsbase_inactive()
* Use NOKPROBE_SYMBOL instead of __kprobes
* Rebase on top of the helper function fix [7]

Updates from v2 [2]:
* Separate out the preparatory patches [6] (merged as of now)
* Bisect the paranoid_entry update patch
* Edit minor nits

Updates from v1 [1]:
* Update the GSBASE update mechanism on the paranoid entry/exit.
* Exclude ptracer backward compatibility patches.
* Include FSGSBASE documentation and capability enumeration
for user space.
* Add the TAINT_INSECURE flag.

[1] Version 1: https://lore.kernel.org/patchwork/cover/934843
[2] Version 2: https://lore.kernel.org/patchwork/cover/912063
[3] Version 3: https://lore.kernel.org/patchwork/cover/1002725
[4] Version 4: https://lore.kernel.org/patchwork/cover/1032799
[5] Version 5: https://lore.kernel.org/patchwork/cover/1038035
[6] https://lore.kernel.org/patchwork/cover/988180
[7] https://lore.kernel.org/patchwork/patch/1017513

Andi Kleen (3):
x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions
x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
x86/fsgsbase/64: Add documentation for FSGSBASE

Andy Lutomirski (4):
x86/fsgsbase/64: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
x86/fsgsbase/64: Preserve FS/GS state in __switch_to() if FSGSBASE is
on
selftests/x86/fsgsbase: Test WRGSBASE
x86/fsgsbase/64: Enable FSGSBASE by default and add a chicken bit

Chang S. Bae (5):
kbuild: Raise the minimum required binutils version to 2.21
x86/fsgsbase/64: Enable FSGSBASE instructions in the helper functions
x86/fsgsbase/64: When copying a thread, use the FSGSBASE instructions
if available
x86/fsgsbase/64: Introduce the FIND_PERCPU_BASE macro
x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry

.../admin-guide/kernel-parameters.txt | 2 +
Documentation/process/changes.rst | 6 +-
Documentation/x86/fsgs.txt | 104 +++++++++++++++++
arch/x86/entry/entry_64.S | 71 +++++++++---
arch/x86/include/asm/fsgsbase.h | 95 ++++++++++++++--
arch/x86/include/asm/inst.h | 15 +++
arch/x86/include/uapi/asm/hwcap2.h | 3 +
arch/x86/kernel/cpu/common.c | 22 ++++
arch/x86/kernel/process_64.c | 105 ++++++++++++++++--
tools/testing/selftests/x86/fsgsbase.c | 102 ++++++++++++++++-
10 files changed, 484 insertions(+), 41 deletions(-)
create mode 100644 Documentation/x86/fsgs.txt

--
2.19.1



2019-03-15 20:07:57

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 03/12] x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions

From: Andi Kleen <[email protected]>

Add C intrinsics and assembler macros for the new FSBASE and GSBASE
instructions.

Very straightforward. Used in follow-on patches.

[ luto: Rename the variables from FS and GS to FSBASE and GSBASE and
make <asm/fsgsbase.h> safe to include on 32-bit kernels. ]
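
A minimal usage sketch (hypothetical caller, not part of the patch; the
feature check is the caller's responsibility, per the comment in the
header):

        if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
                /* Direct register access, no MSR round trip */
                unsigned long fsbase = rdfsbase();

                wrfsbase(fsbase);
        }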

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index bca4c743de77..fdd1177499b4 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -19,6 +19,36 @@ extern unsigned long x86_gsbase_read_task(struct task_struct *task);
extern void x86_fsbase_write_task(struct task_struct *task, unsigned long fsbase);
extern void x86_gsbase_write_task(struct task_struct *task, unsigned long gsbase);

+/* Must be protected by X86_FEATURE_FSGSBASE check. */
+
+static __always_inline unsigned long rdfsbase(void)
+{
+ unsigned long fsbase;
+
+ asm volatile("rdfsbase %0" : "=r" (fsbase) :: "memory");
+
+ return fsbase;
+}
+
+static __always_inline unsigned long rdgsbase(void)
+{
+ unsigned long gsbase;
+
+ asm volatile("rdgsbase %0" : "=r" (gsbase) :: "memory");
+
+ return gsbase;
+}
+
+static __always_inline void wrfsbase(unsigned long fsbase)
+{
+ asm volatile("wrfsbase %0" :: "r" (fsbase) : "memory");
+}
+
+static __always_inline void wrgsbase(unsigned long gsbase)
+{
+ asm volatile("wrgsbase %0" :: "r" (gsbase) : "memory");
+}
+
/* Helper functions for reading/writing FS/GS base */

static inline unsigned long x86_fsbase_read_cpu(void)
--
2.19.1


2019-03-15 20:08:03

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 07/12] x86/fsgsbase/64: Introduce the FIND_PERCPU_BASE macro

GSBASE is used to find per-CPU data in the kernel. But when it is unknown,
the per-CPU base can be found from the per_cpu_offset table with a CPU NR.
The CPU NR is extracted from the limit field of the CPUNODE entry in GDT,
or by the RDPID instruction.

Also, add the GAS-compatible RDPID macro.

The new macro will be used on a following patch.
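
For illustration, the same CPU number recovery can be written in C with
inline assembly (a sketch only, mirroring the macro below; __CPUNODE_SEG
and VDSO_CPUNODE_MASK are the kernel-defined constants):

        unsigned long p, cpu;

        /*
         * LSL reads the segment limit of the CPUNODE descriptor, which
         * encodes the CPU (and node) number. RDPID returns the same
         * value directly on CPUs that support it.
         */
        asm ("lsl %1, %0" : "=r" (p) : "r" ((unsigned long)__CPUNODE_SEG));
        cpu = p & VDSO_CPUNODE_MASK;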

Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 41 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/inst.h | 15 ++++++++++++
2 files changed, 56 insertions(+)

diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index aefd53767a5d..5e3dfbe8c1bf 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -78,6 +78,47 @@ extern void x86_gsbase_write_cpu_inactive(unsigned long gsbase);

#endif /* CONFIG_X86_64 */

+#else /* __ASSEMBLY__ */
+
+#ifdef CONFIG_X86_64
+
+#include <asm/inst.h>
+
+#ifdef CONFIG_SMP
+
+/*
+ * CPU/node NR is loaded from the limit (size) field of a special segment
+ * descriptor entry in GDT.
+ */
+.macro LOAD_CPU_AND_NODE_SEG_LIMIT reg:req
+ movq $__CPUNODE_SEG, \reg
+ lsl \reg, \reg
+.endm
+
+/*
+ * Fetch the per-CPU GSBASE value for this processor and put it in @reg.
+ * We normally use %gs for accessing per-CPU data, but we are setting up
+ * %gs here and obviously can not use %gs itself to access per-CPU data.
+ */
+.macro FIND_PERCPU_BASE reg:req
+ ALTERNATIVE \
+ "LOAD_CPU_AND_NODE_SEG_LIMIT \reg", \
+ "RDPID \reg", \
+ X86_FEATURE_RDPID
+ andq $VDSO_CPUNODE_MASK, \reg
+ movq __per_cpu_offset(, \reg, 8), \reg
+.endm
+
+#else
+
+.macro FIND_PERCPU_BASE reg:req
+ movq pcpu_unit_offsets(%rip), \reg
+.endm
+
+#endif /* CONFIG_SMP */
+
+#endif /* CONFIG_X86_64 */
+
#endif /* __ASSEMBLY__ */

#endif /* _ASM_FSGSBASE_H */
diff --git a/arch/x86/include/asm/inst.h b/arch/x86/include/asm/inst.h
index f5a796da07f8..d063841a17e3 100644
--- a/arch/x86/include/asm/inst.h
+++ b/arch/x86/include/asm/inst.h
@@ -306,6 +306,21 @@
.endif
MODRM 0xc0 movq_r64_xmm_opd1 movq_r64_xmm_opd2
.endm
+
+.macro RDPID opd
+ REG_TYPE rdpid_opd_type \opd
+ .if rdpid_opd_type == REG_TYPE_R64
+ R64_NUM rdpid_opd \opd
+ .else
+ R32_NUM rdpid_opd \opd
+ .endif
+ .byte 0xf3
+ .if rdpid_opd > 7
+ PFX_REX rdpid_opd 0
+ .endif
+ .byte 0x0f, 0xc7
+ MODRM 0xc0 rdpid_opd 0x7
+.endm
#endif

#endif
--
2.19.1


2019-03-15 20:08:06

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 09/12] selftests/x86/fsgsbase: Test WRGSBASE

From: Andy Lutomirski <[email protected]>

This validates that GS and GSBASE are independently preserved across
context switches.

[ chang: Use FSGSBASE instructions directly instead of .byte ]
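
In outline, each WRGSBASE test case below loads a selector and an
arbitrary base, forces a context switch by waking the futex-blocked
helper thread, and verifies both survived. Condensed (using the helpers
this patch adds):

        asm volatile ("mov %0, %%gs" : : "rm" (index));
        wrgsbase(base);
        /* futex handshake with threadproc() forces a context switch */
        if (rdgsbase() != base)
                nerrs++;        /* the kernel failed to preserve GSBASE */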

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
tools/testing/selftests/x86/fsgsbase.c | 102 ++++++++++++++++++++++++-
1 file changed, 99 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/x86/fsgsbase.c b/tools/testing/selftests/x86/fsgsbase.c
index f249e042b3b5..5956475972f1 100644
--- a/tools/testing/selftests/x86/fsgsbase.c
+++ b/tools/testing/selftests/x86/fsgsbase.c
@@ -23,6 +23,7 @@
#include <pthread.h>
#include <asm/ldt.h>
#include <sys/mman.h>
+#include <setjmp.h>

#ifndef __x86_64__
# error This test is 64-bit only
@@ -71,6 +72,43 @@ static void sigsegv(int sig, siginfo_t *si, void *ctx_void)

}

+static jmp_buf jmpbuf;
+
+static void sigill(int sig, siginfo_t *si, void *ctx_void)
+{
+ siglongjmp(jmpbuf, 1);
+}
+
+static bool have_fsgsbase;
+
+static inline unsigned long rdgsbase(void)
+{
+ unsigned long gsbase;
+
+ asm volatile("rdgsbase %0" : "=r" (gsbase) :: "memory");
+
+ return gsbase;
+}
+
+static inline unsigned long rdfsbase(void)
+{
+ unsigned long fsbase;
+
+ asm volatile("rdfsbase %0" : "=r" (fsbase) :: "memory");
+
+ return fsbase;
+}
+
+static inline void wrgsbase(unsigned long gsbase)
+{
+ asm volatile("wrgsbase %0" :: "r" (gsbase) : "memory");
+}
+
+static inline void wrfsbase(unsigned long fsbase)
+{
+ asm volatile("wrfsbase %0" :: "r" (fsbase) : "memory");
+}
+
enum which_base { FS, GS };

static unsigned long read_base(enum which_base which)
@@ -199,14 +237,16 @@ static void do_remote_base()
to_set, hard_zero ? " and clear gs" : "", sel);
}

-void do_unexpected_base(void)
+static __thread int set_thread_area_entry_number = -1;
+
+static void do_unexpected_base(void)
{
/*
* The goal here is to try to arrange for GS == 0, GSBASE !=
* 0, and for the the kernel the think that GSBASE == 0.
*
* To make the test as reliable as possible, this uses
- * explicit descriptorss. (This is not the only way. This
+ * explicit descriptors. (This is not the only way. This
* could use ARCH_SET_GS with a low, nonzero base, but the
* relevant side effect of ARCH_SET_GS could change.)
*/
@@ -239,7 +279,7 @@ void do_unexpected_base(void)
MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
memcpy(low_desc, &desc, sizeof(desc));

- low_desc->entry_number = -1;
+ low_desc->entry_number = set_thread_area_entry_number;

/* 32-bit set_thread_area */
long ret;
@@ -254,6 +294,8 @@ void do_unexpected_base(void)
return;
}
printf("\tother thread: using GDT slot %d\n", desc.entry_number);
+ set_thread_area_entry_number = desc.entry_number;
+
asm volatile ("mov %0, %%gs" : : "rm" ((unsigned short)((desc.entry_number << 3) | 0x3)));
}

@@ -265,6 +307,34 @@ void do_unexpected_base(void)
asm volatile ("mov %0, %%gs" : : "rm" ((unsigned short)0));
}

+void test_wrbase(unsigned short index, unsigned long base)
+{
+ unsigned short newindex;
+ unsigned long newbase;
+
+ printf("[RUN]\tGS = 0x%hx, GSBASE = 0x%lx\n", index, base);
+
+ asm volatile ("mov %0, %%gs" : : "rm" (index));
+ wrgsbase(base);
+
+ remote_base = 0;
+ ftx = 1;
+ syscall(SYS_futex, &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);
+ while (ftx != 0)
+ syscall(SYS_futex, &ftx, FUTEX_WAIT, 1, NULL, NULL, 0);
+
+ asm volatile ("mov %%gs, %0" : "=rm" (newindex));
+ newbase = rdgsbase();
+
+ if (newindex == index && newbase == base) {
+ printf("[OK]\tIndex and base were preserved\n");
+ } else {
+ printf("[FAIL]\tAfter switch, GS = 0x%hx and GSBASE = 0x%lx\n",
+ newindex, newbase);
+ nerrs++;
+ }
+}
+
static void *threadproc(void *ctx)
{
while (1) {
@@ -371,6 +441,17 @@ int main()
{
pthread_t thread;

+ /* Probe FSGSBASE */
+ sethandler(SIGILL, sigill, 0);
+ if (sigsetjmp(jmpbuf, 1) == 0) {
+ rdfsbase();
+ have_fsgsbase = true;
+ printf("\tFSGSBASE instructions are enabled\n");
+ } else {
+ printf("\tFSGSBASE instructions are disabled\n");
+ }
+ clearhandler(SIGILL);
+
sethandler(SIGSEGV, sigsegv, 0);

check_gs_value(0);
@@ -417,6 +498,21 @@ int main()

test_unexpected_base();

+ if (have_fsgsbase) {
+ unsigned short ss;
+
+ asm volatile ("mov %%ss, %0" : "=rm" (ss));
+
+ test_wrbase(0, 0);
+ test_wrbase(0, 1);
+ test_wrbase(0, 0x200000000);
+ test_wrbase(0, 0xffffffffffffffff);
+ test_wrbase(ss, 0);
+ test_wrbase(ss, 1);
+ test_wrbase(ss, 0x200000000);
+ test_wrbase(ss, 0xffffffffffffffff);
+ }
+
ftx = 3; /* Kill the thread. */
syscall(SYS_futex, &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);

--
2.19.1


2019-03-15 20:08:17

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 12/12] x86/fsgsbase/64: Add documentation for FSGSBASE

From: Andi Kleen <[email protected]>

v2: Minor updates to documentation requested in review.
v3: Update for new gcc and various improvements.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
Documentation/x86/fsgs.txt | 104 +++++++++++++++++++++++++++++++++++++
1 file changed, 104 insertions(+)
create mode 100644 Documentation/x86/fsgs.txt

diff --git a/Documentation/x86/fsgs.txt b/Documentation/x86/fsgs.txt
new file mode 100644
index 000000000000..7a973a5c1767
--- /dev/null
+++ b/Documentation/x86/fsgs.txt
@@ -0,0 +1,104 @@
+
+Using FS and GS prefixes on 64bit x86 linux
+
+The x86 architecture supports segment prefixes per instruction to add an
+offset to an address. On 64bit x86, these are mostly nops, except for FS
+and GS.
+
+This offers an efficient way to reference a global pointer.
+
+The compiler has to generate special code to use these base registers,
+or they can be accessed with inline assembler.
+
+ mov %gs:offset,%reg
+ mov %fs:offset,%reg
+
+On 64bit code, FS is used to address thread local storage (TLS), declared using
+__thread. The compiler then automatically generates the correct prefixes and
+relocations to access these values.
+
+FS is normally managed by the runtime code or the threading library.
+Overwriting it can break a lot of things (including syscalls and gdb),
+but it can make sense to save/restore it for threading purposes.
+
+GS is freely available, but may need special (compiler or inline assembler)
+code to use.
+
+Traditionally 64bit FS and GS could be set by the arch_prctl system call
+
+ arch_prctl(ARCH_SET_GS, value)
+ arch_prctl(ARCH_SET_FS, value)
+
+[There was also an older method using modify_ldt(), inherited from 32bit,
+but this is not discussed here.]
+
+However using a syscall is problematic for user space threading libraries
+that want to context switch in user space. The whole point of them
+is avoiding the overhead of a syscall. It's also cleaner for compilers
+wanting to use the extra register to use instructions to write
+it, or read it directly to compute addresses and offsets.
+
+Newer Intel CPUs (Ivy Bridge and later) added new instructions to directly
+access these registers quickly from user context
+
+ RDFSBASE %reg read the FS base (or _readfsbase_u64)
+ RDGSBASE %reg read the GS base (or _readgsbase_u64)
+
+ WRFSBASE %reg write the FS base (or _writefsbase_u64)
+ WRGSBASE %reg write the GS base (or _writegsbase_u64)
+
+If you use the intrinsics include <immintrin.h> and set the -mfsgsbase option.
+
+The instructions are supported by the CPU when the "fsgsbase" string is shown in
+/proc/cpuinfo (or directly retrieved through the CPUID instruction,
+7:0 (ebx), word 9, bit 0)
+
+The instructions are only available to 64bit binaries.
+
+In addition the kernel needs to explicitly enable these instructions, as it
+may otherwise not correctly context switch the state. Newer Linux
+kernels enable this. If the kernel does not enable the instructions,
+they will fault with a #UD exception.
+
+An FSGSBASE enabled kernel can be detected by checking the AT_HWCAP2
+bitmask in the aux vector. When the HWCAP2_FSGSBASE bit is set the
+kernel supports FSGSBASE.
+
+ #include <sys/auxv.h>
+ #include <elf.h>
+
+ /* Will be eventually in asm/hwcap.h */
+ #define HWCAP2_FSGSBASE (1 << 1)
+
+ unsigned val = getauxval(AT_HWCAP2);
+ if (val & HWCAP2_FSGSBASE) {
+ asm("wrgsbase %0" :: "r" (ptr));
+ }
+
+No extra CPUID check needed as the kernel will not set this bit if the CPU
+does not support it.
+
+gcc 6 will have special support to directly access data relative
+to fs/gs using the __seg_fs and __seg_gs address space pointer
+modifiers.
+
+#ifndef __SEG_GS
+#error "Need gcc 6 or later"
+#endif
+
+struct gsdata {
+ int a;
+ int b;
+} gsdata = { 1, 2 };
+
+int __seg_gs *valp = 0; /* offset relative to GS */
+
+ /* Check if kernel supports FSGSBASE as above */
+
+ /* Set up new GS */
+ asm("wrgsbase %0" :: "r" (&gsdata));
+
+ /* Now the global pointer can be used normally */
+ printf("gsdata.a = %d\n", *valp);
+
+Andi Kleen
--
2.19.1


2019-03-15 20:08:31

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 11/12] x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2

From: Andi Kleen <[email protected]>

The kernel needs to explicitly enable FSGSBASE. So, the application needs
to know if it can safely use these instructions. Just looking at the CPUID
bit is not enough because it may be running in a kernel that does not
enable the instructions.

One way for the application would be to just try and catch the SIGILL.
But that is difficult to do in libraries which may not want to overwrite
the signal handlers of the main application.

So we need to provide a way for the application to discover the kernel
capability.

I used AT_HWCAP2 in the ELF aux vector, which is already used by PPC for
similar things. We define a new Linux-defined bitmap returned in AT_HWCAP2.
Next to the MONITOR/MWAIT bit, bit 1 is reserved for FSGSBASE capability checks.

The application can then access it manually or using the getauxval()
function in newer glibc.
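
For illustration, detection might look like this (a sketch; essentially
the same example appears in the documentation patch later in this
series):

        #include <sys/auxv.h>
        #include <elf.h>

        /* Will eventually be in asm/hwcap.h */
        #define HWCAP2_FSGSBASE (1 << 1)

        if (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE) {
                /*
                 * The kernel context switches the base registers, so
                 * RDFSBASE/WRFSBASE and friends are safe to use.
                 */
        }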

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/uapi/asm/hwcap2.h | 3 +++
arch/x86/kernel/cpu/common.c | 4 +++-
2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/hwcap2.h b/arch/x86/include/uapi/asm/hwcap2.h
index 6ebaae90e207..c5ce54e749f6 100644
--- a/arch/x86/include/uapi/asm/hwcap2.h
+++ b/arch/x86/include/uapi/asm/hwcap2.h
@@ -5,4 +5,7 @@
/* MONITOR/MWAIT enabled in Ring 3 */
#define HWCAP2_RING3MWAIT (1 << 0)

+/* Kernel allows FSGSBASE instructions available in Ring 3 */
+#define HWCAP2_FSGSBASE BIT(1)
+
#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3d7d4ca1a29e..3bdac91316c9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1369,8 +1369,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_umip(c);

/* Enable FSGSBASE instructions if available. */
- if (cpu_has(c, X86_FEATURE_FSGSBASE))
+ if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
cr4_set_bits(X86_CR4_FSGSBASE);
+ elf_hwcap2 |= HWCAP2_FSGSBASE;
+ }

/*
* The vendor-specific functions might have changed features.
--
2.19.1


2019-03-15 20:08:42

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 10/12] x86/fsgsbase/64: Enable FSGSBASE by default and add a chicken bit

From: Andy Lutomirski <[email protected]>

Now that FSGSBASE is fully supported, remove unsafe_fsgsbase, enable
FSGSBASE by default, and add nofsgsbase to disable it.

Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 3 +-
arch/x86/kernel/cpu/common.c | 32 ++++++++-----------
2 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b6ed956a78ab..29e6924ac957 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2765,8 +2765,7 @@
no5lvl [X86-64] Disable 5-level paging mode. Forces
kernel to use 4-level paging instead.

- unsafe_fsgsbase [X86] Allow FSGSBASE instructions. This will be
- replaced with a nofsgsbase flag.
+ nofsgsbase [X86] Disables FSGSBASE instructions.

no_console_suspend
[HW] Never suspend the console
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 40a2f60e7251..3d7d4ca1a29e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -365,21 +365,21 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
cr4_clear_bits(X86_CR4_UMIP);
}

-/*
- * Temporary hack: FSGSBASE is unsafe until a few kernel code paths are
- * updated. This allows us to get the kernel ready incrementally.
- *
- * Once all the pieces are in place, these will go away and be replaced with
- * a nofsgsbase chicken flag.
- */
-static bool unsafe_fsgsbase;
-
-static __init int setup_unsafe_fsgsbase(char *arg)
+static __init int x86_nofsgsbase_setup(char *arg)
{
- unsafe_fsgsbase = true;
+ /* Require an exact match without trailing characters. */
+ if (strlen(arg))
+ return 0;
+
+ /* Do not emit a message if the feature is not present. */
+ if (!boot_cpu_has(X86_FEATURE_FSGSBASE))
+ return 1;
+
+ setup_clear_cpu_cap(X86_FEATURE_FSGSBASE);
+ pr_info("nofsgsbase: FSGSBASE disabled\n");
return 1;
}
-__setup("unsafe_fsgsbase", setup_unsafe_fsgsbase);
+__setup("nofsgsbase", x86_nofsgsbase_setup);

/*
* Protection Keys are not available in 32-bit mode.
@@ -1369,12 +1369,8 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_umip(c);

/* Enable FSGSBASE instructions if available. */
- if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
- if (unsafe_fsgsbase)
- cr4_set_bits(X86_CR4_FSGSBASE);
- else
- clear_cpu_cap(c, X86_FEATURE_FSGSBASE);
- }
+ if (cpu_has(c, X86_FEATURE_FSGSBASE))
+ cr4_set_bits(X86_CR4_FSGSBASE);

/*
* The vendor-specific functions might have changed features.
--
2.19.1


2019-03-15 20:08:56

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 02/12] kbuild: Raise the minimum required binutils version to 2.21

Raising the minimum binutils version allows new instructions, such as the
FSGSBASE instructions, to be used directly in assembly code rather than
being open-coded as byte sequences.

Suggested-by: Andi Kleen <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Acked-by: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
Documentation/process/changes.rst | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/Documentation/process/changes.rst b/Documentation/process/changes.rst
index 18735dc460a0..0a18075c485e 100644
--- a/Documentation/process/changes.rst
+++ b/Documentation/process/changes.rst
@@ -31,7 +31,7 @@ you probably needn't concern yourself with isdn4k-utils.
====================== =============== ========================================
GNU C 4.6 gcc --version
GNU make 3.81 make --version
-binutils 2.20 ld -v
+binutils 2.21 ld -v
flex 2.5.35 flex --version
bison 2.0 bison --version
util-linux 2.10o fdformat --version
@@ -77,9 +77,7 @@ You will need GNU make 3.81 or later to build the kernel.
Binutils
--------

-The build system has, as of 4.13, switched to using thin archives (`ar T`)
-rather than incremental linking (`ld -r`) for built-in.a intermediate steps.
-This requires binutils 2.20 or newer.
+Binutils 2.21 or newer is needed to build the kernel.

pkg-config
----------
--
2.19.1


2019-03-15 20:09:06

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry

The FSGSBASE instructions allow fast accesses on GSBASE. Now, at the
paranoid_entry, the per-CPU base value can be always copied to GSBASE.
And the original GSBASE value will be restored at the exit.

So far, GSBASE modification has not been directly allowed from userspace.
So, swapping GSBASE has been conditionally executed according to the
kernel-enforced convention that a negative GSBASE indicates a kernel value.
But when FSGSBASE is enabled, userspace can put an arbitrary value in
GSBASE. The change will secure a correct GSBASE value with FSGSBASE.

Also, factor out the RDMSR-based GSBASE read into a new macro,
READ_MSR_GSBASE.

Suggested-by: H. Peter Anvin <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
---
arch/x86/entry/entry_64.S | 71 +++++++++++++++++++++++++++------
arch/x86/include/asm/fsgsbase.h | 9 +++++
2 files changed, 67 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..9df528565e40 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -38,6 +38,7 @@
#include <asm/export.h>
#include <asm/frame.h>
#include <asm/nospec-branch.h>
+#include <asm/fsgsbase.h>
#include <linux/err.h>

#include "calling.h"
@@ -934,10 +935,14 @@ ENTRY(\sym)
addq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif

- /* these procedures expect "no swapgs" flag in ebx */
.if \paranoid
+ /*
+ * With FSGSBASE, original GSBASE is stored in %rbx
+ * Without FSGSBASE, expect "no swapgs" flag in %ebx
+ */
jmp paranoid_exit
.else
+ /* Expect "no swapgs" flag in %ebx */
jmp error_exit
.endif

@@ -1151,22 +1156,24 @@ idtentry machine_check do_mce has_error_code=0 paranoid=1
#endif

/*
- * Save all registers in pt_regs, and switch gs if needed.
- * Use slow, but surefire "are we in kernel?" check.
- * Return: ebx=0: need swapgs on exit, ebx=1: otherwise
+ * Save all registers in pt_regs.
+ *
+ * When FSGSBASE enabled, current GSBASE is always copied to %rbx.
+ *
+ * Without FSGSBASE, SWAPGS is needed when entering from userspace.
+ * A positive GSBASE means it is a user value and a negative GSBASE
+ * means it is a kernel value.
+ *
+ * Return:
+ * With FSGSBASE, %rbx has current GSBASE.
+ * Without that,
+ * %ebx=0: need SWAPGS on exit, %ebx=1: otherwise
*/
ENTRY(paranoid_entry)
UNWIND_HINT_FUNC
cld
PUSH_AND_CLEAR_REGS save_ret=1
ENCODE_FRAME_POINTER 8
- movl $1, %ebx
- movl $MSR_GS_BASE, %ecx
- rdmsr
- testl %edx, %edx
- js 1f /* negative -> in kernel */
- SWAPGS
- xorl %ebx, %ebx

1:
/*
@@ -1178,9 +1185,38 @@ ENTRY(paranoid_entry)
* This is also why CS (stashed in the "iret frame" by the
* hardware at entry) can not be used: this may be a return
* to kernel code, but with a user CR3 value.
+ *
+ * As long as this PTI macro doesn't depend on kernel GSBASE,
+ * we can do it early. This is because FIND_PERCPU_BASE
+ * references data in kernel space.
*/
SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14

+ /*
+ * Read GSBASE by RDGSBASE. Kernel GSBASE is found
+ * from the per-CPU offset table with a CPU NR.
+ */
+ ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", "",\
+ X86_FEATURE_FSGSBASE
+ rdgsbase %rbx
+ FIND_PERCPU_BASE %rax
+ wrgsbase %rax
+ ret
+
+.Lparanoid_entry_no_fsgsbase:
+ movl $1, %ebx
+ /*
+ * FSGSBASE is not in use, so depend on the kernel-enforced
+ * convention that a negative GSBASE indicates a kernel value.
+ */
+ READ_MSR_GSBASE save_reg=%edx
+ testl %edx, %edx /* Negative -> in kernel */
+ jns .Lparanoid_entry_swapgs
+ ret
+
+.Lparanoid_entry_swapgs:
+ SWAPGS
+ xorl %ebx, %ebx
ret
END(paranoid_entry)

@@ -1194,12 +1230,21 @@ END(paranoid_entry)
* be complicated. Fortunately, we there's no good reason
* to try to handle preemption here.
*
- * On entry, ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
+ * On entry,
+ * With FSGSBASE,
+ * %rbx is original GSBASE that needs to be restored on the exit
+ * Without that,
+ * %ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
*/
ENTRY(paranoid_exit)
UNWIND_HINT_REGS
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_OFF_DEBUG
+ ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
+ X86_FEATURE_FSGSBASE
+ wrgsbase %rbx
+ jmp .Lparanoid_exit_no_swapgs;
+.Lparanoid_exit_no_fsgsbase:
testl %ebx, %ebx /* swapgs needed? */
jnz .Lparanoid_exit_no_swapgs
TRACE_IRQS_IRETQ
@@ -1212,7 +1257,7 @@ ENTRY(paranoid_exit)
/* Always restore stashed CR3 value (see paranoid_entry) */
RESTORE_CR3 scratch_reg=%rbx save_reg=%r14
.Lparanoid_exit_restore:
- jmp restore_regs_and_return_to_kernel
+ jmp restore_regs_and_return_to_kernel
END(paranoid_exit)

/*
diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index 5e3dfbe8c1bf..ba7a444ab5c8 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -117,6 +117,15 @@ extern void x86_gsbase_write_cpu_inactive(unsigned long gsbase);

#endif /* CONFIG_SMP */

+.macro READ_MSR_GSBASE save_reg:req
+ movl $MSR_GS_BASE, %ecx
+ /* Read MSR specified by %ecx into %edx:%eax */
+ rdmsr
+ .ifnc \save_reg, %edx
+ movl %edx, \save_reg
+ .endif
+.endm
+
#endif /* CONFIG_X86_64 */

#endif /* __ASSEMBLY__ */
--
2.19.1


2019-03-15 20:09:24

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 04/12] x86/fsgsbase/64: Enable FSGSBASE instructions in the helper functions

The helper functions will switch on faster accesses to FSBASE and GSBASE
when the FSGSBASE feature is enabled.

Accessing user GSBASE needs a couple of SWAPGS operations. It is avoidable
if the user GSBASE is saved at kernel entry, being updated as changes, and
restored back at kernel exit. However, it seems to spend more cycles for
savings and restorations. Little or no benefit was measured from
experiments.

Also, introduce __{rd,wr}gsbase_inactive() as helpers to access user GSBASE
with SWAPGS. Note, for Xen PV, paravirt hooks can be added, since it may
allow a very efficient but different implementation.

Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Cooper <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 27 +++++++---------
arch/x86/kernel/process_64.c | 56 +++++++++++++++++++++++++++++++++
2 files changed, 68 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index fdd1177499b4..aefd53767a5d 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -49,35 +49,32 @@ static __always_inline void wrgsbase(unsigned long gsbase)
asm volatile("wrgsbase %0" :: "r" (gsbase) : "memory");
}

+#include <asm/cpufeature.h>
+
/* Helper functions for reading/writing FS/GS base */

static inline unsigned long x86_fsbase_read_cpu(void)
{
unsigned long fsbase;

- rdmsrl(MSR_FS_BASE, fsbase);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE))
+ fsbase = rdfsbase();
+ else
+ rdmsrl(MSR_FS_BASE, fsbase);

return fsbase;
}

-static inline unsigned long x86_gsbase_read_cpu_inactive(void)
-{
- unsigned long gsbase;
-
- rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
-
- return gsbase;
-}
-
static inline void x86_fsbase_write_cpu(unsigned long fsbase)
{
- wrmsrl(MSR_FS_BASE, fsbase);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE))
+ wrfsbase(fsbase);
+ else
+ wrmsrl(MSR_FS_BASE, fsbase);
}

-static inline void x86_gsbase_write_cpu_inactive(unsigned long gsbase)
-{
- wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
-}
+extern unsigned long x86_gsbase_read_cpu_inactive(void);
+extern void x86_gsbase_write_cpu_inactive(unsigned long gsbase);

#endif /* CONFIG_X86_64 */

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6a62f4af9fcf..87b5ffce2a47 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -160,6 +160,42 @@ enum which_selector {
GS
};

+/*
+ * Interrupts are disabled here. Out of line to be protected
+ * from kprobes. It is not used on Xen paravirt. When paravirt
+ * support is needed, it needs to be renamed with native_ prefix.
+ */
+static noinline unsigned long __rdgsbase_inactive(void)
+{
+ unsigned long gsbase, flags;
+
+ local_irq_save(flags);
+ native_swapgs();
+ gsbase = rdgsbase();
+ native_swapgs();
+ local_irq_restore(flags);
+
+ return gsbase;
+}
+NOKPROBE_SYMBOL(__rdgsbase_inactive);
+
+/*
+ * Interrupts are disabled here. Out of line to be protected
+ * from kprobes. It is not used on Xen paravirt. When paravirt
+ * support is needed, it needs to be renamed with native_ prefix.
+ */
+static noinline void __wrgsbase_inactive(unsigned long gsbase)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ native_swapgs();
+ wrgsbase(gsbase);
+ native_swapgs();
+ local_irq_restore(flags);
+}
+NOKPROBE_SYMBOL(__wrgsbase_inactive);
+
/*
* Saves the FS or GS base for an outgoing thread if FSGSBASE extensions are
* not available. The goal is to be reasonably fast on non-FSGSBASE systems.
@@ -338,6 +374,26 @@ static unsigned long x86_fsgsbase_read_task(struct task_struct *task,
return base;
}

+unsigned long x86_gsbase_read_cpu_inactive(void)
+{
+ unsigned long gsbase;
+
+ if (static_cpu_has(X86_FEATURE_FSGSBASE))
+ gsbase = __rdgsbase_inactive();
+ else
+ rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
+
+ return gsbase;
+}
+
+void x86_gsbase_write_cpu_inactive(unsigned long gsbase)
+{
+ if (static_cpu_has(X86_FEATURE_FSGSBASE))
+ __wrgsbase_inactive(gsbase);
+ else
+ wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
+}
+
unsigned long x86_fsbase_read_task(struct task_struct *task)
{
unsigned long fsbase;
--
2.19.1


2019-03-15 20:09:37

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 01/12] x86/fsgsbase/64: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE

From: Andy Lutomirski <[email protected]>

This is temporary. It will allow the next few patches to be tested
incrementally.

Setting unsafe_fsgsbase is a root hole. Don't do it.

Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Randy Dunlap <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 3 +++
arch/x86/kernel/cpu/common.c | 24 +++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index cf8f5877d85f..b6ed956a78ab 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2765,6 +2765,9 @@
no5lvl [X86-64] Disable 5-level paging mode. Forces
kernel to use 4-level paging instead.

+ unsafe_fsgsbase [X86] Allow FSGSBASE instructions. This will be
+ replaced with a nofsgsbase flag.
+
no_console_suspend
[HW] Never suspend the console
Disable suspending of consoles during suspend and
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cb28e98a0659..40a2f60e7251 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -365,6 +365,22 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
cr4_clear_bits(X86_CR4_UMIP);
}

+/*
+ * Temporary hack: FSGSBASE is unsafe until a few kernel code paths are
+ * updated. This allows us to get the kernel ready incrementally.
+ *
+ * Once all the pieces are in place, these will go away and be replaced with
+ * a nofsgsbase chicken flag.
+ */
+static bool unsafe_fsgsbase;
+
+static __init int setup_unsafe_fsgsbase(char *arg)
+{
+ unsafe_fsgsbase = true;
+ return 1;
+}
+__setup("unsafe_fsgsbase", setup_unsafe_fsgsbase);
+
/*
* Protection Keys are not available in 32-bit mode.
*/
@@ -1352,6 +1368,14 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_smap(c);
setup_umip(c);

+ /* Enable FSGSBASE instructions if available. */
+ if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
+ if (unsafe_fsgsbase)
+ cr4_set_bits(X86_CR4_FSGSBASE);
+ else
+ clear_cpu_cap(c, X86_FEATURE_FSGSBASE);
+ }
+
/*
* The vendor-specific functions might have changed features.
* Now we do "generic changes."
--
2.19.1


2019-03-15 20:09:39

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 05/12] x86/fsgsbase/64: Preserve FS/GS state in __switch_to() if FSGSBASE is on

From: Andy Lutomirski <[email protected]>

With the new FSGSBASE instructions, we can efficiently read and write
the FSBASE and GSBASE in __switch_to(). Use that capability to preserve
the full state.

This will enable user code to do whatever it wants with the new
instructions without any kernel-induced gotchas. (There can still be
architectural gotchas: movl %gs,%eax; movl %eax,%gs may change GSBASE
if WRGSBASE was used, but users are expected to read the CPU manual
before doing things like that.)

This is a considerable speedup. It seems to save about 100 cycles
per context switch compared to the baseline 4.6-rc1 behavior on my
Skylake laptop.

[ chang: 5~10% performance improvements were seen in a context switch
benchmark that ran threads with different FS/GSBASE values, relative to
the 4.16 baseline. Minor edits to the changelog. ]

Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/kernel/process_64.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 87b5ffce2a47..d3abe4748d8b 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -245,8 +245,18 @@ static __always_inline void save_fsgs(struct task_struct *task)
{
savesegment(fs, task->thread.fsindex);
savesegment(gs, task->thread.gsindex);
- save_base_legacy(task, task->thread.fsindex, FS);
- save_base_legacy(task, task->thread.gsindex, GS);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /*
+ * If FSGSBASE is enabled, we can't make any useful guesses
+ * about the base, and user code expects us to save the current
+ * value. Fortunately, reading the base directly is efficient.
+ */
+ task->thread.fsbase = rdfsbase();
+ task->thread.gsbase = __rdgsbase_inactive();
+ } else {
+ save_base_legacy(task, task->thread.fsindex, FS);
+ save_base_legacy(task, task->thread.gsindex, GS);
+ }
}

#if IS_ENABLED(CONFIG_KVM)
@@ -325,10 +335,22 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
static __always_inline void x86_fsgsbase_load(struct thread_struct *prev,
struct thread_struct *next)
{
- load_seg_legacy(prev->fsindex, prev->fsbase,
- next->fsindex, next->fsbase, FS);
- load_seg_legacy(prev->gsindex, prev->gsbase,
- next->gsindex, next->gsbase, GS);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /* Update the FS and GS selectors if they could have changed. */
+ if (unlikely(prev->fsindex || next->fsindex))
+ loadseg(FS, next->fsindex);
+ if (unlikely(prev->gsindex || next->gsindex))
+ loadseg(GS, next->gsindex);
+
+ /* Update the bases. */
+ wrfsbase(next->fsbase);
+ __wrgsbase_inactive(next->gsbase);
+ } else {
+ load_seg_legacy(prev->fsindex, prev->fsbase,
+ next->fsindex, next->fsbase, FS);
+ load_seg_legacy(prev->gsindex, prev->gsbase,
+ next->gsindex, next->gsbase, GS);
+ }
}

static unsigned long x86_fsgsbase_read_task(struct task_struct *task,
--
2.19.1


2019-03-15 20:10:29

by Chang S. Bae

[permalink] [raw]
Subject: [RESEND PATCH v6 06/12] x86/fsgsbase/64: When copying a thread, use the FSGSBASE instructions if available

Copy the real FS/GSBASE values instead of an approximation when FSGSBASE
is enabled.

Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andi Kleen <[email protected]>
---
arch/x86/kernel/process_64.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d3abe4748d8b..c4de0907d909 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -422,7 +422,8 @@ unsigned long x86_fsbase_read_task(struct task_struct *task)

if (task == current)
fsbase = x86_fsbase_read_cpu();
- else if (task->thread.fsindex == 0)
+ else if (static_cpu_has(X86_FEATURE_FSGSBASE) ||
+ (task->thread.fsindex == 0))
fsbase = task->thread.fsbase;
else
fsbase = x86_fsgsbase_read_task(task, task->thread.fsindex);
@@ -436,7 +437,8 @@ unsigned long x86_gsbase_read_task(struct task_struct *task)

if (task == current)
gsbase = x86_gsbase_read_cpu_inactive();
- else if (task->thread.gsindex == 0)
+ else if (static_cpu_has(X86_FEATURE_FSGSBASE) ||
+ (task->thread.gsindex == 0))
gsbase = task->thread.gsbase;
else
gsbase = x86_fsgsbase_read_task(task, task->thread.gsindex);
@@ -475,10 +477,11 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
p->thread.sp = (unsigned long) fork_frame;
p->thread.io_bitmap_ptr = NULL;

- savesegment(gs, p->thread.gsindex);
- p->thread.gsbase = p->thread.gsindex ? 0 : me->thread.gsbase;
- savesegment(fs, p->thread.fsindex);
- p->thread.fsbase = p->thread.fsindex ? 0 : me->thread.fsbase;
+ save_fsgs(me);
+ p->thread.fsindex = me->thread.fsindex;
+ p->thread.fsbase = me->thread.fsbase;
+ p->thread.gsindex = me->thread.gsindex;
+ p->thread.gsbase = me->thread.gsbase;
savesegment(es, p->thread.es);
savesegment(ds, p->thread.ds);
memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
--
2.19.1


2019-03-25 09:04:19

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 07/12] x86/fsgsbase/64: Introduce the FIND_PERCPU_BASE macro

On Fri, 15 Mar 2019, Chang S. Bae wrote:

> GSBASE is used to find per-CPU data in the kernel.

It's not used to find per cpu data. per cpu data access is using GS based
addressing.

> But when it is unknown,

What is unknown?

> the per-CPU base can be found from the per_cpu_offset table with a CPU NR.
> The CPU NR is extracted from the limit field of the CPUNODE entry in GDT,
> or by the RDPID instruction.
>
> Also, add the GAS-compatible RDPID macro.
>
> The new macro will be used on a following patch.

So this is yet another changelog which describes the WHAT and not the WHY
and lacks any form of sensible context.

> diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
> index aefd53767a5d..5e3dfbe8c1bf 100644
> --- a/arch/x86/include/asm/fsgsbase.h
> +++ b/arch/x86/include/asm/fsgsbase.h
> @@ -78,6 +78,47 @@ extern void x86_gsbase_write_cpu_inactive(unsigned long gsbase);
>
> #endif /* CONFIG_X86_64 */
>
> +#else /* __ASSEMBLY__ */
> +
> +#ifdef CONFIG_X86_64
> +
> +#include <asm/inst.h>

Is there a good reason why this ASM MACRO maze needs to be in a globally
visible header? AFAICT this is only used in the entry code. So why can't it
be added to the other entry code macros in calling.h or some other sensible
place?

> +#ifdef CONFIG_SMP
> +
> +/*
> + * CPU/node NR is loaded from the limit (size) field of a special segment
> + * descriptor entry in GDT.
> + */
> +.macro LOAD_CPU_AND_NODE_SEG_LIMIT reg:req
> + movq $__CPUNODE_SEG, \reg
> + lsl \reg, \reg
> +.endm
> +
> +/*
> + * Fetch the per-CPU GSBASE value for this processor and put it in @reg.
> + * We normally use %gs for accessing per-CPU data, but we are setting up
> + * %gs here and obviously can not use %gs itself to access per-CPU data.
> + */
> +.macro FIND_PERCPU_BASE reg:req

This is a complete misnomer. It's not searching for the per cpu base, it's
retrieving the per cpu base from a known place. So something like
GET_PERCPU_BASE would be appropriate.

> + ALTERNATIVE \
> + "LOAD_CPU_AND_NODE_SEG_LIMIT \reg", \
> + "RDPID \reg", \
> + X86_FEATURE_RDPID
> + andq $VDSO_CPUNODE_MASK, \reg
> + movq __per_cpu_offset(, \reg, 8), \reg
> +.endm

> diff --git a/arch/x86/include/asm/inst.h b/arch/x86/include/asm/inst.h
> index f5a796da07f8..d063841a17e3 100644
> --- a/arch/x86/include/asm/inst.h
> +++ b/arch/x86/include/asm/inst.h
> @@ -306,6 +306,21 @@
> .endif
> MODRM 0xc0 movq_r64_xmm_opd1 movq_r64_xmm_opd2
> .endm
> +
> +.macro RDPID opd

So the update to require binutils >= 2.21 does not cover RDPID?

Thanks,

tglx

2019-03-25 09:46:22

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry

On Fri, 15 Mar 2019, Chang S. Bae wrote:

> The FSGSBASE instructions allow fast accesses on GSBASE. Now, at the
> paranoid_entry, the per-CPU base value can be always copied to GSBASE.
> And the original GSBASE value will be restored at the exit.

Again you are describing WHAT but not the WHY.

> So far, GSBASE modification has not been directly allowed from userspace.
> So, swapping GSBASE has been conditionally executed according to the
> kernel-enforced convention that a negative GSBASE indicates a kernel value.
> But when FSGSBASE is enabled, userspace can put an arbitrary value in
> GSBASE. The change will secure a correct GSBASE value with FSGSBASE.

So that's some WHY, but it should be explained _BEFORE_ explaining the
change. This changelog style is as bad as top posting. Why?

1) FSGSBASE is fast

2) Copy GSBASE always on paranoid exit and restore on entry

3) Explain the context

No. You want to explain context first and then explain why this needs a
change when FSGSBASE is enabled and how that change looks like at the
conceptual level.

> Also, factor out the RDMSR-based GSBASE read into a new macro,
> READ_MSR_GSBASE.

This new macro is related to this change in what way? None AFAICT. I'm fine
with the macro itself, but the benefit for a single usage site is dubious.

Adding this macro and using it should be done with a separate patch before
this one, so this patch becomes simpler to review.

> /*
> @@ -1178,9 +1185,38 @@ ENTRY(paranoid_entry)
> * This is also why CS (stashed in the "iret frame" by the
> * hardware at entry) can not be used: this may be a return
> * to kernel code, but with a user CR3 value.
> + *
> + * As long as this PTI macro doesn't depend on kernel GSBASE,
> + * we can do it early. This is because FIND_PERCPU_BASE
> + * references data in kernel space.

It's not about 'can do it early'. The FSGSBASE handling requires that the
kernel page tables are switched in.

And for review and bisectability sake moving the CR3 switch in front of the
GS handling should be done as a separate preparatory patch.

> */
> SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
>
> + /*
> + * Read GSBASE by RDGSBASE. Kernel GSBASE is found
> + * from the per-CPU offset table with a CPU NR.

That CPU NR comes out of thin air, right? This code is complex enough by
itself and does not need further confusion by comments which need a crystal
ball for decoding.

> + */

Sigh. I can't see how that comment explains the ALTERNATIVE jump.

> + ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", "",\
> + X86_FEATURE_FSGSBASE

Please separate the above from the below with a new line for readability
sake.

> + rdgsbase %rbx
> + FIND_PERCPU_BASE %rax
> + wrgsbase %rax

So this really should be wrapped in a macro like:

SAVE_AND_SET_GSBASE %rbx, %rax

which makes it entirely clear what this is about.
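
A minimal sketch of such a macro, reusing the names from this series
(illustrative only, not tglx's wording):

        .macro SAVE_AND_SET_GSBASE scratch_reg:req save_reg:req
                rdgsbase \save_reg
                FIND_PERCPU_BASE \scratch_reg
                wrgsbase \scratch_reg
        .endm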

> + ret
> +

> @@ -1194,12 +1230,21 @@ END(paranoid_entry)
> * be complicated. Fortunately, we there's no good reason
> * to try to handle preemption here.
> *
> - * On entry, ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
> + * On entry,
> + * With FSGSBASE,
> + * %rbx is original GSBASE that needs to be restored on the exit
> + * Without that,
> + * %ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
> */
> ENTRY(paranoid_exit)
> UNWIND_HINT_REGS
> DISABLE_INTERRUPTS(CLBR_ANY)
> TRACE_IRQS_OFF_DEBUG
> + ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
> + X86_FEATURE_FSGSBASE
> + wrgsbase %rbx
> + jmp .Lparanoid_exit_no_swapgs;

Again. A few newlines would make it more readable.

This modifies the semantics of paranoid_entry and paranoid_exit. Looking at
the usage sites there is the following code in the nmi maze:

/*
* Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
* as we should not be calling schedule in NMI context.
* Even with normal interrupts enabled. An NMI should not be
* setting NEED_RESCHED or anything that normal interrupts and
* exceptions might do.
*/
call paranoid_entry
UNWIND_HINT_REGS

/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
movq %rsp, %rdi
movq $-1, %rsi
call do_nmi

/* Always restore stashed CR3 value (see paranoid_entry) */
RESTORE_CR3 scratch_reg=%r15 save_reg=%r14

testl %ebx, %ebx /* swapgs needed? */
jnz nmi_restore
nmi_swapgs:
SWAPGS_UNSAFE_STACK
nmi_restore:
POP_REGS

I might be missing something, but how is that supposed to work when
paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
there is a big fat comment missing explaining why.

Thanks,

tglx


2019-03-25 11:39:56

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 04/12] x86/fsgsbase/64: Enable FSGSBASE instructions in the helper functions

On Fri, 15 Mar 2019, Chang S. Bae wrote:

> The helper functions will switch on faster accesses to FSBASE and GSBASE
> when the FSGSBASE feature is enabled.
>
> Accessing user GSBASE needs a couple of SWAPGS operations. It is avoidable
> if the user GSBASE is saved at kernel entry, being updated as changes, and
> restored back at kernel exit. However, it seems to spend more cycles for
> savings and restorations. Little or no benefit was measured from
> experiments.

This smells fishy and looking at the end result of this series just
confirms it. This ends up being a mixture of SWAPGS and FSGSBASE usage and
as already pointed out in the other reply, it causes inconsistencies.

Let's look at the big picture.

For both variants GS needs to be swapped on kernel entry and on kernel
exit.

1) SWAPGS

MSR_KERNEL_GS_BASE contains the user space GS when running in the kernel
and the kernel GS when running in user space.

SWAPGS is used to swap the content of GS and MSR_KERNEL_GS_BASE on the
transitions from and to user space.

On context switch MSR_KERNEL_GS_BASE has to be updated when switching
between processes.

User space cannot change GS other than through the PRCTL which updates
MSR_KERNEL_GS_BASE.

2) FSGSBASE

User space can set GS without kernel interaction.

So on user space to kernel space transitions swapping in kernel GS should
simply do:

userGS = RDGSBASE()
WRGSBASE(kernelGS)

and on the way out:

WRGSBASE(userGS)

instead of SWAPGS all over the place.

userGS is stored in thread_struct, except for the few paranoid
exceptions which return straight to user space, e.g. NMI. Those can just
keep it on stack or in a register.

Context switch does not have to do anything at all vs. GS because
thread_struct contains the correct value already.

The PRCTL is straightforward to support. Instead of fiddling with
MSR_KERNEL_GS_BASE it just updates thread struct.
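
A compact sketch of that scheme in C (hypothetical names; kernel_gsbase
stands for the per-CPU kernel value found via RDPID/LSL):

        /* kernel entry: stash the user value, swap in the kernel GSBASE */
        current->thread.gsbase = rdgsbase();
        wrgsbase(kernel_gsbase);

        /* kernel exit: restore whatever user space had */
        wrgsbase(current->thread.gsbase);

        /* context switch: nothing to do, thread_struct is authoritative */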

I don't see how that's NOT going to be an advantage and I don't see
either how this seems to cause more cycles for save and restore.

Making it consistently FSGSBASE avoids especially this piece of art in the
context switch path:

local_irq_save(flags);
native_swapgs();
gsbase = rdgsbase();
native_swapgs();
local_irq_restore(flags);

along with it's write counterpart.

The whole point of FSGSBASE support is performance, right?

So can please someone explain why having the following in the context
switch path when it can be completely avoided is enhancing performance:

- 4 x SWAPGS
- 1 x RDMSR
- 1 x WRMSR
- 2 x local_irq_save()
- 2 x local_irq_restore()

Of course the local_irq_save/restore() pairs are utterly pointless because
switch_to() runs with interrupts disabled already.

SWAPGS instead needs:

1 x WRMSR

and nothing else.

So trading the single WRMSR against the above in the context switch path is
gaining performance, right?

The only thing which gains performance is user space switching GS. And this
user space performance gain is achieved by:

- Inconsistent and fragile code with a guarantee for subtle and hard to
diagnose bugs

- Pointless overhead in the context switch code

Sorry, not going to happen ever.

Get your act together and make this consistent. Either SWAPGS or FSGSBASE,
but not a mix of it.

Thanks,

tglx

2019-03-25 12:47:46

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 04/12] x86/fsgsbase/64: Enable FSGSBASE instructions in the helper functions

On Mon, 25 Mar 2019, Thomas Gleixner wrote:
> The whole point of FSGSBASE support is performance, right?
>
> So can please someone explain why having the following in the context
> switch path when it can be completely avoided is enhancing performance:
>
> - 4 x SWAPGS
> - 1 x RDMSR
> - 1 x WRMSR

Correcting myself. That should be:

RDGSBASE
WRGSBASE

obviously. Still the point remains.

> - 2 x local_irq_save()
> - 2 x local_irq_restore()
>
> Of course the local_irq_save/restore() pairs are utterly pointless because
> switch_to() runs with interrupts disabled already.
>
> SWAPGS instead needs:
>
> 1 x WRMSR
>
> and nothing else.
>
> So trading the single WRMSR against the above in the context switch path is
> gaining performance, right?
>
> The only thing which gains performance is user space switching GS. And this
> user space performance gain is achieved by:
>
> - Inconsistent and fragile code with a guarantee for subtle and hard to
> diagnose bugs
>
> - Pointless overhead in the context switch code
>
> Sorry, not going to happen ever.
>
> Get your act together and make this consistent. Either SWAPGS or FSGSBASE,
> but not a mix of it.
>
> Thanks,
>
> tglx
>

2019-03-25 13:08:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 04/12] x86/fsgsbase/64: Enable FSGSBASE instructions in the helper functions

On Mon, 25 Mar 2019, Thomas Gleixner wrote:

> On Mon, 25 Mar 2019, Thomas Gleixner wrote:
> > The whole point of FSGSBASE support is performance, right?
> >
> > So can please someone explain why having the following in the context
> > switch path when it can be completely avoided is enhancing performance:
> >
> > - 4 x SWAPGS
> > - 1 x RDMSR
> > - 1 x WRMSR
>
> Corrrecting myself. That should be:
>
> RDGSBASE
> WRGSBASE
>
> obviously. Still the point remains.
>
> > - 2 x local_irq_save()
> > - 2 x local_irq_restore()
> >
> > Of course the local_irq_save/restore() pairs are utterly pointless because
> > switch_to() runs with interrupts disabled already.
> >
> > SWAPGS instead needs:
> >
> > 1 x WRMSR
> >
> > and nothing else.
> >
> > So trading the single WRMSR against the above in the context switch path is
> > gaining performance, right?

And even IF the sequences are faster than the single WRMSR, this does not
justify the mixed bag of SWAPGS/FSGSBASE usage at all.

Thanks,

tglx

2019-03-26 00:40:09

by Andi Kleen

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 04/12] x86/fsgsbase/64: Enable FSGSBASE instructions in the helper functions

> So on user space to kernel space transitions swapping in kernel GS should
> simply do:
> userGS = RDGSBASE()
> WRGSBASE(kernelGS)

This would also need to find kernelGS first, by doing RDPID and then
reading it from memory in the right index
(which might be a full cache miss if you're unlucky)

SWAPGS will be a lot faster, especially in these rare worst cases
because it has all its state inside the CPU.

-Andi

BTW you managed to only review after Chang went on a long vacation.

<rant>
I don't understand why it takes that long to review these changes
It's one of the largest performance improvements for the context
switch and the NMI in many years plus gives a new free register
to user space, but it only makes progress at a glacial pace.
The original patches for this were posted in 2016.
</rant>


2019-03-26 00:45:05

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 00/12] x86: Enable FSGSBASE instructions

On Fri, Mar 15, 2019 at 1:07 PM Chang S. Bae <[email protected]> wrote:
>
> Updates from v5 [5]:
> * Drop the new tain flag (TAINT_INSECURE)
> * Cleanup copy_thread_tls(), some changelog, and unnecessary comments on
> assembly macros
> * Rearrange some helper updates appropriately (from patch 4 to 6)

I think this stuff is in generally decent shape, but I have two big
broad comments:

What's the status of the stuff hpa was working on to make the behavior
of modify_ldt() predictable?

Can we please have a test case that explicitly exercises the way that
ptrace reads and writes the base registers?

2019-03-26 15:04:16

by Thomas Gleixner

[permalink] [raw]
Subject: New feature/ABI review process [was Re: [RESEND PATCH v6 04/12] x86/fsgsbase/64:..]

Andi,

On Mon, 25 Mar 2019, Andi Kleen wrote:

> > So on user space to kernel space transitions swapping in kernel GS should
> > simply do:
> > userGS = RDGSBASE()
> > WRGSBASE(kernelGS)
>
> This would also need to find kernelGS first, by doing RDPID and then
> reading it from memory at the right index
> (which might be a full cache miss if you're unlucky)

I'm well aware of that.

> SWAPGS will be a lot faster, especially in these rare worst cases
> because it has all its state inside the CPU.

The well known 'will/might/could/should' word weaseling is not solving
anything.

If you want to advocate the more complex design of mixed SWAPGS/FSGSBASE
then provide numbers and not hand-waving. Numbers of real-world workloads,
not numbers of artificial test cases which exercise the rare worst case.

Yes, it's extra work and it's well spent. If the numbers are not
significantly different then the simpler and consistent design is a clear
win.

According to the changelog on which I reacted you seem to have investigated
that already. Let me cite it again:

> Accessing user GSBASE needs a couple of SWAPGS operations. It is
> avoidable if the user GSBASE is saved at kernel entry, being updated as
> changes, and restored back at kernel exit. However, it seems to spend
> more cycles for savings and restorations. Little or no benefit was
> measured from experiments.

So little or no benefit was measured. I don't see how that maps to your
'SWAPGS will be a lot faster' claim. One of those claims is obviously
wrong.

Aside of this, it needs more than numbers:

1) Proper documentation of how the mixed bag is managed.

2) Extensive comments explaining the subtle inner workings and caveats.

3) Proper changelogs.

You have a track record of not caring much about any of these, but I
very much care for good reasons. I've been bitten by glued on and half
baked patches from Intel in the past 10 years so many times, that I'm
simply refusing to take anything which is not properly structured and
documented.

Especially not when it is touching sensitive areas like this and also has
an impact on the user space ABI.

> BTW you managed to only review after Chang went on a long vacation.

I'm terribly sorry about that. I'll try to adjust my own schedules and
workflow to Intel employees' vacation plans in the future.

> <rant>
> I don't understand why it takes that long to review these changes.
> It's one of the largest performance improvements for the context
> switch and the NMI in many years plus gives a new free register
> to user space, but it only makes progress at a glacial pace.
> The original patches for this were posted in 2016.
> </rant>

Care to look at the real history of this:

11/2015: First patch-set posted by you, which was rejected on technical grounds

So this so important feature was in limbo for 20 months until Luto picked it
up again. That's surely the fault of the x86 maintainers, right?

07/2017: Discussion about ABI considerations initiated by Andy Lutomirski.

And it takes another 8 months until patches come around:

03/19/2018: V1 from Chang. Reviewed within days

2 month gap caused by Intel:

05/31/2018: V2 Request from Andy to split the set

06/04/2018: Base-V1 The first chunk of changes.

06/06/2018: Base-V2 Slight modifications

06/07/2018: Base-V3 Slight modifications. Review on 08/18

06/20/2018: Base-V4 Review on 06/22

06/27/2018: Base-V5

2 month gap caused by busy maintainers. You know what they were busy with
at that time, right? Chasing subtle bugs in the so perfect L1TF patches
which were thrown over the fence by you and dealing with the Intel induced
speculation crap to have a consistent and maintainable mitigation including
proper documentation.

08/23/2018: Base-V5 Resend. Review on 9/14

09/18/2018: Base-V6. Merged 10/08

10/23/2018: Full-V3. Review immediate

10/24/2018: Regression detected caused by Base-V6

The so perfect base patch set caused a regression and it takes more than a
month to fix it properly:

10/30/2018: Fix-V1. Broken
10/31/2018: Fix-V2. Broken
11/01/2018: Fix-V3. Broken
11/14/2018: Fix-V4. Broken
11/15/2018: Fix-V5. Broken
11/26/2018: Fix-V6. Finally

2 months to address the Full-V3 feedback:

01/16/2019: Full-V4. Change request

02/01/2019: Full-V5. Review immediate

02/13/2019: Full-V6.

1 month gap caused by busy maintainers. Ash on my head...

03/15/2019: Full-V6 resend

So just to put this straight:

Out of 40 months since the first post in 11/2015:

20 months nothing happened from Intel side
8 months consumed to produce the next set
1 month to fix a regression
2 months consumed to react on review feedback
----------------------------------------------
31 months

versus:

2 months maintainers dealing with Intel crap
1 month maintainers being busy

The rest is the usual review/re-post ping pong delay, which adds up, but
of the larger gaps more than 75% are Intel induced and 7% maintainer
induced.

It's pretty obvious why it takes that long, right?

Back to the current version of patches:

Putting the design question aside: even if the mixed SWAPGS/FSGSBASE thing
is the right thing to do, the patch set is not acceptable in the current
form. Again, for the record:

1) Lack of documentation.

2) Lack of proper justification and explanation of the design.

3) Patches doing different things at once.

4) A yet to be explained inconsistency in the NMI code.

5) Pointless and mindless local_irq_save/restore() in switch_to() which
this performance important patch set tries to optimize.

I as a maintainer don't have to decode all of the above from a jumble of
complex patches, right?

Just for the record:

You can rant until you're blue in the face, it's not going to change
the fact that this stuff is not ready. It's neither changing the fact
that all of the above could have been addressed by Intel _before_
posting V6.

You very well know the expectations and it's not my personal pet peeve,
it's clearly documented in Documentation/process/*.

I'm dead tired of your unfounded complaints and of your permanent refusal to
collaborate. Keep that up and the last x86 maintainer who was willing to
deal with you in the past 10 years will finally open up a reserved slot
in his email filters to /dev/null.

Thanks,

Thomas

2019-03-26 22:57:29

by Andi Kleen

[permalink] [raw]
Subject: Re: New feature/ABI review process [was Re: [RESEND PATCH v6 04/12] x86/fsgsbase/64:..]

>
> If you want to advocate the more complex design of mixed SWAPGS/FSGSBASE
> then provide numbers and not hand-waving. Numbers of real-world workloads,
> not numbers of artificial test cases which exercise the rare worst case.

Well you're proposing the much more complicated solution, not me.

SWAPGS is simple and it works everywhere except for paranoid.

> Yes, it's extra work and it's well spent. If the numbers are not
> significantly different then the simpler and consistent design is a clear
> win.

As long as everything is cache hot it's likely only a couple
of cycles difference (as Intel CPUs are very good at executing
crappy code too), but if it's not then you end up with a huge cache miss
cost, causing jitter. That's a problem for real time for example.

> > Accessing user GSBASE needs a couple of SWAPGS operations. It is
> > avoidable if the user GSBASE is saved at kernel entry, being updated as
> > changes, and restored back at kernel exit. However, it seems to spend
> > more cycles for savings and restorations. Little or no benefit was
> > measured from experiments.
>
> So little or no benefit was measured. I don't see how that maps to your
> 'SWAPGS will be a lot faster' claim. One of those claims is obviously
> wrong.

If everything is cache hot it won't make much difference,
but if you have a cache miss you end up eating the cost.

>
> Aside of this, it needs more than numbers:
>
> 1) Proper documentation of how the mixed bag is managed.

How SWAPGS is managed?

Like it always has been in the 20+ years since the x86_64
port was originally born.

The only case which has to do two SWAPGS operations is the
context switch, when it switches the base. Everything else
just does SWAPGS at the edges for kernel entries.

> You have a track record of not caring much about any of these, but I
> very much care for good reasons. I've been bitten by glued on and half
> baked patches from Intel in the past 10 years so many times, that I'm
> simply refusing to take anything which is not properly structured and
> documented.

In this case you're proposing the change; the Intel patch just leaves
SWAPGS alone. So you have to describe why it's a good idea.
At least what you proposed on this wasn't convincing
and would be rejected by a proper code review.

-Andi


2019-03-27 21:17:05

by Thomas Gleixner

[permalink] [raw]
Subject: Re: New feature/ABI review process [was Re: [RESEND PATCH v6 04/12] x86/fsgsbase/64:..]

On Tue, 26 Mar 2019, Andi Kleen wrote:
> As long as everything is cache hot it's likely only a couple
> of cycles difference (as Intel CPUs are very good at executing
> crappy code too), but if it's not then you end up with a huge cache miss
> cost, causing jitter. That's a problem for real time for example.

That extra cache miss is really not the worst issue for realtime. The
inherent latencies of contemporary systems have way worse to offer than
that. Any realtime system has to cope with the worst case and an extra
cache miss is not the end of the world.

> > > Accessing user GSBASE needs a couple of SWAPGS operations. It is
> > > avoidable if the user GSBASE is saved at kernel entry, being updated as
> > > changes, and restored back at kernel exit. However, it seems to spend
> > > more cycles for savings and restorations. Little or no benefit was
> > > measured from experiments.
> >
> > So little or no benefit was measured. I don't see how that maps to your
> > 'SWAPGS will be a lot faster' claim. One of those claims is obviously
> > wrong.
>
> If everything is cache hot it won't make much difference,
> but if you have a cache miss you end up eating the cost.
>
> >
> > Aside of this, it needs more than numbers:
> >
> > 1) Proper documentation of how the mixed bag is managed.
>
> How SWAPGS is managed?
>
> Like it always has been in the 20+ years since the x86_64
> port was originally born.

I know how SWAPGS works.

> The only case which has to do two SWAPGS operations is the
> context switch, when it switches the base. Everything else
> just does SWAPGS at the edges for kernel entries.

And exactly here is the problem. You are not even describing it correctly
now:

You cannot do SWAPGS on _all_ edges.

You cannot do SWAPGS in the paranoid entry when FSGSBASE is in use, because
user space can write arbitrary values into GS. Which breaks the existing
differentiation of kernel/user GS. That's why you have the FSGSBASE variant
there. Is that documented?

The changelog has some convoluted description of it:

"The FSGSBASE instructions allow fast accesses on GSBASE. Now, at the
paranoid_entry, the per-CPU base value can be always copied to GSBASE.
And the original GSBASE value will be restored at the exit."

So that part blurbs about fast access and comes first. Really useful.

"So far, GSBASE modification has not been directly allowed from userspace.
So, swapping GSBASE has been conditionally executed according to the
kernel-enforced convention that a negative GSBASE indicates a kernel value.
But when FSGSBASE is enabled, userspace can put an arbitrary value in
GSBASE. The change will secure a correct GSBASE value with FSGSBASE."

I can decode that because I'm familiar with the inner workings of the
paranoid entry code. But that changelog is just not providing properly
structured information and the full context.

What's worse is the comment in the code itself:

+ * When FSGSBASE enabled, current GSBASE is always copied to %rbx.

Where is the documentation that FSGSBASE is required to be used here and
why? I can bloody well see from the code that the FSGSBASE path does this
unconditionally. But that does not explain why, and it does not explain why
FSGSBASE is used only here rather than everywhere in place of SWAPGS.

+ * Without FSGSBASE, SWAPGS is needed when entering from userspace.
+ * A positive GSBASE means it is a user value and a negative GSBASE
+ * means it is a kernel value.

So this has more explanation about the SWAPGS mode than about the
subtleties of FSGSBASE.

This stuff wants to be documented at great length for everyone's sake,
including yourself when you have to stare into that code a year from now. I
don't care about your headache, but I care about mine and that of people
who might end up debugging some subtle bug in that area.

Thanks,

tglx

2019-03-30 16:18:08

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 12/12] x86/fsgsbase/64: Add documentation for FSGSBASE

On 3/15/19 1:06 PM, Chang S. Bae wrote:
> From: Andi Kleen <[email protected]>
>
> v2: Minor updates to documentation requested in review.
> v3: Update for new gcc and various improvements.
>
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Chang S. Bae <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
> Documentation/x86/fsgs.txt | 104 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 104 insertions(+)
> create mode 100644 Documentation/x86/fsgs.txt
>

Globally s/64bit/64-bit/ and s/32bit/32-bit/.

More comments below.

> diff --git a/Documentation/x86/fsgs.txt b/Documentation/x86/fsgs.txt
> new file mode 100644
> index 000000000000..7a973a5c1767
> --- /dev/null
> +++ b/Documentation/x86/fsgs.txt
> @@ -0,0 +1,104 @@
> +
> +Using FS and GS prefixes on 64bit x86 linux
> +
> +The x86 architecture supports segment prefixes per instruction to add an
> +offset to an address. On 64bit x86, these are mostly nops, except for FS
> +and GS.
> +
> +This offers an efficient way to reference a global pointer.
> +
> +The compiler has to generate special code to use these base registers,
> +or they can be accessed with inline assembler.
> +
> + mov %gs:offset,%reg
> + mov %fs:offset,%reg
> +
> +On 64bit code, FS is used to address the thread local segment (TLS), declared using
> +__thread. The compiler then automatically generates the correct prefixes and
> +relocations to access these values.
> +
> +FS is normally managed by the runtime code or the threading library

library.

> +Overwriting it can break a lot of things (including syscalls and gdb),
> +but it can make sense to save/restore it for threading purposes.
> +
> +GS is freely available, but may need special (compiler or inline assembler)
> +code to use.
> +
> +Traditionally 64bit FS and GS could be set by the arch_prctl system call
> +
> + arch_prctl(ARCH_SET_GS, value)
> + arch_prctl(ARCH_SET_FS, value)
> +
> +[There was also an older method using modify_ldt(), inherited from 32bit,
> +but this is not discussed here.]
> +
> +However using a syscall is problematic for user space threading libraries

However,

> +that want to context switch in user space. The whole point of them
> +is avoiding the overhead of a syscall. It's also cleaner for compilers
> +wanting to use the extra register to use instructions to write
> +it, or read it directly to compute addresses and offsets.
> +
> +Newer Intel CPUs (Ivy Bridge and later) added new instructions to directly
> +access these registers quickly from user context

context. {or context:}

> +
> + RDFSBASE %reg read the FS base (or _readfsbase_u64)
> + RDGSBASE %reg read the GS base (or _readgsbase_u64)
> +
> + WRFSBASE %reg write the FS base (or _writefsbase_u64)
> + WRGSBASE %reg write the GS base (or _writegsbase_u64)
> +
> +If you use the intrinsics include <immintrin.h> and set the -mfsgsbase option.

intrinsics,

> +
> +The instructions are supported by the CPU when the "fsgsbase" string is shown in
> +/proc/cpuinfo (or directly retrieved through the CPUID instruction,
> +7:0 (ebx), word 9, bit 0)

add ending '.' above.

> +
> +The instructions are only available to 64bit binaries.
> +
> +In addition the kernel needs to explicitly enable these instructions, as it
> +may otherwise not correctly context switch the state. Newer Linux
> +kernels enable this. When the kernel did not enable the instruction

does not

> +they will fault with an #UD exception.

well, I would say "with a #UD exception."

> +
> +An FSGSBASE enabled kernel can be detected by checking the AT_HWCAP2

FSGSBASE-enabled

> +bitmask in the aux vector. When the HWCAP2_FSGSBASE bit is set the
> +kernel supports FSGSBASE.
> +
> + #include <sys/auxv.h>
> + #include <elf.h>
> +
> + /* Will be eventually in asm/hwcap.h */
> + #define HWCAP2_FSGSBASE (1 << 1)
> +
> + unsigned val = getauxval(AT_HWCAP2);
> + if (val & HWCAP2_FSGSBASE) {
> + asm("wrgsbase %0" :: "r" (ptr));
> + }
> +
> +No extra CPUID check needed as the kernel will not set this bit if the CPU

CPUID check is needed

> +does not support it.
> +
> +gcc 6 will have special support to directly access data relative

"will have"? future? or:

gcc 6 has special support

> +to fs/gs using the __seg_fs and __seg_gs address space pointer
> +modifiers.
> +
> +#ifndef __SEG_GS
> +#error "Need gcc 6 or later"
> +#endif
> +
> +struct gsdata {
> + int a;
> + int b;
> +} gsdata = { 1, 2 };
> +
> +int __seg_gs *valp = 0; /* offset relative to GS */
> +
> + /* Check if kernel supports FSGSBASE as above */
> +
> + /* Set up new GS */
> + asm("wrgsbase %0" :: "r" (&gsdata));
> +
> + /* Now the global pointer can be used normally */
> + printf("gsdata.a = %d\n", *valp);
> +
> +Andi Kleen
>


--
~Randy

2019-04-05 08:38:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry

On Mon, 25 Mar 2019, Thomas Gleixner wrote:
> On Fri, 15 Mar 2019, Chang S. Bae wrote:
> > ENTRY(paranoid_exit)
> > UNWIND_HINT_REGS
> > DISABLE_INTERRUPTS(CLBR_ANY)
> > TRACE_IRQS_OFF_DEBUG
> > + ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
> > + X86_FEATURE_FSGSBASE
> > + wrgsbase %rbx
> > + jmp .Lparanoid_exit_no_swapgs;
>
> Again. A few newlines would make it more readable.
>
> This modifies the semantics of paranoid_entry and paranoid_exit. Looking at
> the usage sites there is the following code in the nmi maze:
>
> /*
> * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
> * as we should not be calling schedule in NMI context.
> * Even with normal interrupts enabled. An NMI should not be
> * setting NEED_RESCHED or anything that normal interrupts and
> * exceptions might do.
> */
> call paranoid_entry
> UNWIND_HINT_REGS
>
> /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
> movq %rsp, %rdi
> movq $-1, %rsi
> call do_nmi
>
> /* Always restore stashed CR3 value (see paranoid_entry) */
> RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
>
> testl %ebx, %ebx /* swapgs needed? */
> jnz nmi_restore
> nmi_swapgs:
> SWAPGS_UNSAFE_STACK
> nmi_restore:
> POP_REGS
>
> I might be missing something, but how is that supposed to work when
> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
> there is a big fat comment missing explaining why.

So this _is_ broken.

On entry:

rbx = rdgsbase()
wrgsbase(KERNEL_GS)

On exit:

if (ebx == 0)
swapgs

The resulting matrix:

| ENTRY GS | RBX | EXIT | GS on IRET | RESULT
| | | | |
1 | KERNEL_GS | KERNEL_GS | EBX == 0 | USER_GS | FAIL
| | | | |
2 | KERNEL_GS | KERNEL_GS | EBX != 0 | KERNEL_GS | ok
| | | | |
3 | USER_GS | USER_GS | EBX == 0 | USER_GS | ok
| | | | |
4 | USER_GS | USER_GS | EBX != 0 | KERNEL_GS | FAIL


#1 Just works by chance because it's unlikely that the lower 32bits of a
per CPU kernel GS are all 0.

But it's just a question of probability that this turns into a
non-debuggable once per year crash (think KASLR).

#4 This can happen when the NMI hits the kernel in some other entry code
_BEFORE_ or _AFTER_ swapgs.

User space using GS addressing with GS[31:0] != 0 will crash and burn.

IIRC FSGSBASE is about fast user space GS switching with (almost) no
limits on the value ...

Oh well.

tglx

2019-04-05 13:51:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry



> On Apr 5, 2019, at 2:35 AM, Thomas Gleixner <[email protected]> wrote:
>
>> On Mon, 25 Mar 2019, Thomas Gleixner wrote:
>>> On Fri, 15 Mar 2019, Chang S. Bae wrote:
>>> ENTRY(paranoid_exit)
>>> UNWIND_HINT_REGS
>>> DISABLE_INTERRUPTS(CLBR_ANY)
>>> TRACE_IRQS_OFF_DEBUG
>>> + ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
>>> + X86_FEATURE_FSGSBASE
>>> + wrgsbase %rbx
>>> + jmp .Lparanoid_exit_no_swapgs;
>>
>> Again. A few newlines would make it more readable.
>>
>> This modifies the semantics of paranoid_entry and paranoid_exit. Looking at
>> the usage sites there is the following code in the nmi maze:
>>
>> /*
>> * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
>> * as we should not be calling schedule in NMI context.
>> * Even with normal interrupts enabled. An NMI should not be
>> * setting NEED_RESCHED or anything that normal interrupts and
>> * exceptions might do.
>> */
>> call paranoid_entry
>> UNWIND_HINT_REGS
>>
>> /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
>> movq %rsp, %rdi
>> movq $-1, %rsi
>> call do_nmi
>>
>> /* Always restore stashed CR3 value (see paranoid_entry) */
>> RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
>>
>> testl %ebx, %ebx /* swapgs needed? */
>> jnz nmi_restore
>> nmi_swapgs:
>> SWAPGS_UNSAFE_STACK
>> nmi_restore:
>> POP_REGS
>>
>> I might be missing something, but how is that supposed to work when
>> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
>> there is a big fat comment missing explaining why.
>
> So this _is_ broken.
>
> On entry:
>
> rbx = rdgsbase()
> wrgsbase(KERNEL_GS)
>
> On exit:
>
> if (ebx == 0)
> swapgs
>
> The resulting matrix:
>
> | ENTRY GS | RBX | EXIT | GS on IRET | RESULT
> | | | | |
> 1 | KERNEL_GS | KERNEL_GS | EBX == 0 | USER_GS | FAIL
> | | | | |
> 2 | KERNEL_GS | KERNEL_GS | EBX != 0 | KERNEL_GS | ok
> | | | | |
> 3 | USER_GS | USER_GS | EBX == 0 | USER_GS | ok
> | | | | |
> 4 | USER_GS | USER_GS | EBX != 0 | KERNEL_GS | FAIL
>
>
> #1 Just works by chance because it's unlikely that the lower 32bits of a
> per CPU kernel GS are all 0.
>
> But it's just a question of probability that this turns into a
> non-debuggable once per year crash (think KASLR).
>
> #4 This can happen when the NMI hits the kernel in some other entry code
> _BEFORE_ or _AFTER_ swapgs.
>
> User space using GS addressing with GS[31:0] != 0 will crash and burn.
>
>

Hi all-

In a previous incarnation of these patches, I complained about the use of SWAPGS in the paranoid path. Now I’m putting my maintainer foot down. On a non-FSGSBASE system, the paranoid path knows, definitively, which GS is where, so SWAPGS is annoying. With FSGSBASE, unless you start looking at the RIP that you interrupted, you cannot know whether you have user or kernel GSBASE live, since they can have literally the same value. One of the numerous versions of this patch compared the values and just said “well, it’s harmless to SWAPGS if user code happens to use the same value as the kernel”. I complained that it was far too fragile.

So I’m putting my foot down. If you all want my ack, you’re going to save the old GS, load the new one with WRGSBASE, and, on return, you’re going to restore the old one with WRGSBASE. You will not use SWAPGS in the paranoid path.

Obviously, for the non-paranoid path, it all keeps working exactly like it does now.
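
In C-level pseudocode, a minimal sketch of that save/restore discipline
(illustrative names only; the real code is assembly in entry_64.S, and
this_cpu_kernel_gsbase() is a made-up helper):

	/* Save whatever GSBASE is live and load the per-CPU kernel base. */
	unsigned long paranoid_entry_gsbase(void)
	{
		unsigned long saved = rdgsbase();	/* user or kernel value */

		wrgsbase(this_cpu_kernel_gsbase());
		return saved;	/* stashed for paranoid_exit */
	}

	/* Restore exactly what was found on entry: no SWAPGS, no guessing. */
	void paranoid_exit_gsbase(unsigned long saved)
	{
		wrgsbase(saved);
	}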

Furthermore, if you folks even want me to review this series, the ptrace tests need to be in place. On inspection of the current code (after the debacle a few releases back), it appears the SETREGSET’s effect depends on the current values in the registers — it does not actually seem to reliably load the whole state. So my confidence will be greatly increased if your series first adds a test that detects that bug (and fails!), then fixes the bug in a tiny little patch, then adds FSGSBASE, and keeps the test working.

—Andy

2019-05-01 13:54:04

by Chang S. Bae

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry


> On Apr 5, 2019, at 06:50, Andy Lutomirski <[email protected]> wrote:
>
>
>
>> On Apr 5, 2019, at 2:35 AM, Thomas Gleixner <[email protected]> wrote:
>>
>>> On Mon, 25 Mar 2019, Thomas Gleixner wrote:
>>>> On Fri, 15 Mar 2019, Chang S. Bae wrote:
>>>> ENTRY(paranoid_exit)
>>>> UNWIND_HINT_REGS
>>>> DISABLE_INTERRUPTS(CLBR_ANY)
>>>> TRACE_IRQS_OFF_DEBUG
>>>> + ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
>>>> + X86_FEATURE_FSGSBASE
>>>> + wrgsbase %rbx
>>>> + jmp .Lparanoid_exit_no_swapgs;
>>>
>>> Again. A few newlines would make it more readable.
>>>
>>> This modifies the semantics of paranoid_entry and paranoid_exit. Looking at
>>> the usage sites there is the following code in the nmi maze:
>>>
>>> /*
>>> * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
>>> * as we should not be calling schedule in NMI context.
>>> * Even with normal interrupts enabled. An NMI should not be
>>> * setting NEED_RESCHED or anything that normal interrupts and
>>> * exceptions might do.
>>> */
>>> call paranoid_entry
>>> UNWIND_HINT_REGS
>>>
>>> /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
>>> movq %rsp, %rdi
>>> movq $-1, %rsi
>>> call do_nmi
>>>
>>> /* Always restore stashed CR3 value (see paranoid_entry) */
>>> RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
>>>
>>> testl %ebx, %ebx /* swapgs needed? */
>>> jnz nmi_restore
>>> nmi_swapgs:
>>> SWAPGS_UNSAFE_STACK
>>> nmi_restore:
>>> POP_REGS
>>>
>>> I might be missing something, but how is that supposed to work when
>>> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
>>> there is a big fat comment missing explaining why.
>>
>> So this _is_ broken.
>>
>> On entry:
>>
>> rbx = rdgsbase()
>> wrgsbase(KERNEL_GS)
>>
>> On exit:
>>
>> if (ebx == 0)
>> swapgs
>>
>> The resulting matrix:
>>
>> | ENTRY GS | RBX | EXIT | GS on IRET | RESULT
>> | | | | |
>> 1 | KERNEL_GS | KERNEL_GS | EBX == 0 | USER_GS | FAIL
>> | | | | |
>> 2 | KERNEL_GS | KERNEL_GS | EBX != 0 | KERNEL_GS | ok
>> | | | | |
>> 3 | USER_GS | USER_GS | EBX == 0 | USER_GS | ok
>> | | | | |
>> 4 | USER_GS | USER_GS | EBX != 0 | KERNEL_GS | FAIL
>>
>>
>> #1 Just works by chance because it's unlikely that the lower 32bits of a
>> per CPU kernel GS are all 0.
>>
>> But it's just a question of probability that this turns into a
>> non-debuggable once per year crash (think KASLR).
>>
>> #4 This can happen when the NMI hits the kernel in some other entry code
>> _BEFORE_ or _AFTER_ swapgs.
>>
>> User space using GS addressing with GS[31:0] != 0 will crash and burn.
>>
>>
>
> Hi all-
>
> In a previous incarnation of these patches, I complained about the use of SWAPGS in the paranoid path. Now I’m putting my maintainer foot down. On a non-FSGSBASE system, the paranoid path knows, definitively, which GS is where, so SWAPGS is annoying. With FSGSBASE, unless you start looking at the RIP that you interrupted, you cannot know whether you have user or kernel GSBASE live, since they can have literally the same value. One of the numerous versions of this patch compared the values and just said “well, it’s harmless to SWAPGS if user code happens to use the same value as the kernel”. I complained that it was far too fragile.
>
> So I’m putting my foot down. If you all want my ack, you’re going to save the old GS, load the new one with WRGSBASE, and, on return, you’re going to restore the old one with WRGSBASE. You will not use SWAPGS in the paranoid path.
>
> Obviously, for the non-paranoid path, it all keeps working exactly like it does now.

Although I can see some other concerns with this, it looks like it is still worth pursuing.

>
> Furthermore, if you folks even want me to review this series, the ptrace tests need to be in place. On inspection of the current code (after the debacle a few releases back), it appears the SETREGSET’s effect depends on the current values in the registers — it does not actually seem to reliably load the whole state. So my confidence will be greatly increased if your series first adds a test that detects that bug (and fails!), then fixes the bug in a tiny little patch, then adds FSGSBASE, and keeps the test working.
>

I think I need to understand the issue. Appreciate if you can elaborate a little bit.

> —Andy

2019-05-01 13:55:57

by Chang S. Bae

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 07/12] x86/fsgsbase/64: Introduce the FIND_PERCPU_BASE macro


> On Mar 25, 2019, at 02:02, Thomas Gleixner <[email protected]> wrote:
>
> On Fri, 15 Mar 2019, Chang S. Bae wrote:
>
>> diff --git a/arch/x86/include/asm/inst.h b/arch/x86/include/asm/inst.h
>> index f5a796da07f8..d063841a17e3 100644
>> --- a/arch/x86/include/asm/inst.h
>> +++ b/arch/x86/include/asm/inst.h
>> @@ -306,6 +306,21 @@
>> .endif
>> MODRM 0xc0 movq_r64_xmm_opd1 movq_r64_xmm_opd2
>> .endm
>> +
>> +.macro RDPID opd
>
> So the update to require binutils >= 2.21 does not cover RDPID?
>

I can see RDPID support in the 2.27 release. I wonder if we can even require >= 2.27
right now.

> Thanks,
>
> tglx

2019-05-01 17:44:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry

On Wed, May 1, 2019 at 6:52 AM Bae, Chang Seok <[email protected]> wrote:
>
>
> > On Apr 5, 2019, at 06:50, Andy Lutomirski <[email protected]> wrote:
> >
> >
> >
> >> On Apr 5, 2019, at 2:35 AM, Thomas Gleixner <[email protected]> wrote:
> >>
> >>> On Mon, 25 Mar 2019, Thomas Gleixner wrote:
> >>>> On Fri, 15 Mar 2019, Chang S. Bae wrote:
> >>>> ENTRY(paranoid_exit)
> >>>> UNWIND_HINT_REGS
> >>>> DISABLE_INTERRUPTS(CLBR_ANY)
> >>>> TRACE_IRQS_OFF_DEBUG
> >>>> + ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
> >>>> + X86_FEATURE_FSGSBASE
> >>>> + wrgsbase %rbx
> >>>> + jmp .Lparanoid_exit_no_swapgs;
> >>>
> >>> Again. A few newlines would make it more readable.
> >>>
> >>> This modifies the semantics of paranoid_entry and paranoid_exit. Looking at
> >>> the usage sites there is the following code in the nmi maze:
> >>>
> >>> /*
> >>> * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
> >>> * as we should not be calling schedule in NMI context.
> >>> * Even with normal interrupts enabled. An NMI should not be
> >>> * setting NEED_RESCHED or anything that normal interrupts and
> >>> * exceptions might do.
> >>> */
> >>> call paranoid_entry
> >>> UNWIND_HINT_REGS
> >>>
> >>> /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
> >>> movq %rsp, %rdi
> >>> movq $-1, %rsi
> >>> call do_nmi
> >>>
> >>> /* Always restore stashed CR3 value (see paranoid_entry) */
> >>> RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
> >>>
> >>> testl %ebx, %ebx /* swapgs needed? */
> >>> jnz nmi_restore
> >>> nmi_swapgs:
> >>> SWAPGS_UNSAFE_STACK
> >>> nmi_restore:
> >>> POP_REGS
> >>>
> >>> I might be missing something, but how is that supposed to work when
> >>> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
> >>> there is a big fat comment missing explaining why.
> >>
> >> So this _is_ broken.
> >>
> >> On entry:
> >>
> >> rbx = rdgsbase()
> >> wrgsbase(KERNEL_GS)
> >>
> >> On exit:
> >>
> >> if (ebx == 0)
> >> swapgs
> >>
> >> The resulting matrix:
> >>
> >> | ENTRY GS | RBX | EXIT | GS on IRET | RESULT
> >> | | | | |
> >> 1 | KERNEL_GS | KERNEL_GS | EBX == 0 | USER_GS | FAIL
> >> | | | | |
> >> 2 | KERNEL_GS | KERNEL_GS | EBX != 0 | KERNEL_GS | ok
> >> | | | | |
> >> 3 | USER_GS | USER_GS | EBX == 0 | USER_GS | ok
> >> | | | | |
> >> 4 | USER_GS | USER_GS | EBX != 0 | KERNEL_GS | FAIL
> >>
> >>
> >> #1 Just works by chance because it's unlikely that the lower 32bits of a
> >> per CPU kernel GS are all 0.
> >>
> >> But it's just a question of probability that this turns into a
> >> non-debuggable once per year crash (think KASLR).
> >>
> >> #4 This can happen when the NMI hits the kernel in some other entry code
> >> _BEFORE_ or _AFTER_ swapgs.
> >>
> >> User space using GS addressing with GS[31:0] != 0 will crash and burn.
> >>
> >>
> >
> > Hi all-
> >
> > In a previous incarnation of these patches, I complained about the use of SWAPGS in the paranoid path. Now I’m putting my maintainer foot down. On a non-FSGSBASE system, the paranoid path knows, definitively, which GS is where, so SWAPGS is annoying. With FSGSBASE, unless you start looking at the RIP that you interrupted, you cannot know whether you have user or kernel GSBASE live, since they can have literally the same value. One of the numerous versions of this patch compared the values and just said “well, it’s harmless to SWAPGS if user code happens to use the same value as the kernel”. I complained that it was far too fragile.
> >
> > So I’m putting my foot down. If you all want my ack, you’re going to save the old GS, load the new one with WRGSBASE, and, on return, you’re going to restore the old one with WRGSBASE. You will not use SWAPGS in the paranoid path.
> >
> > Obviously, for the non-paranoid path, it all keeps working exactly like it does now.
>
> Although I can see some other concerns with this, it looks like it is still worth pursuing.
>
> >
> > Furthermore, if you folks even want me to review this series, the ptrace tests need to be in place. On inspection of the current code (after the debacle a few releases back), it appears the SETREGSET’s effect depends on the current values in the registers — it does not actually seem to reliably load the whole state. So my confidence will be greatly increased if your series first adds a test that detects that bug (and fails!), then fixes the bug in a tiny little patch, then adds FSGSBASE, and keeps the test working.
> >
>
> I think I need to understand the issue. Appreciate if you can elaborate a little bit.
>

This patch series gives a particular behavior to PTRACE_SETREGS and
PTRACE_POKEUSER. There should be a test case that validates that
behavior, including testing the weird cases where gs != 0 and gsbase
contains unusual values. Some existing tests might be pretty close to
doing what's needed.

Beyond that, the current putreg() code does this:

case offsetof(struct user_regs_struct,gs_base):
/*
* Exactly the same here as the %fs handling above.
*/
if (value >= TASK_SIZE_MAX)
return -EIO;
if (child->thread.gsbase != value)
return do_arch_prctl_64(child, ARCH_SET_GS, value);
return 0;

and do_arch_prctl_64(), in turn, does this:

case ARCH_SET_GS: {
if (unlikely(arg2 >= TASK_SIZE_MAX))
return -EPERM;

preempt_disable();
/*
* ARCH_SET_GS has always overwritten the index
* and the base. Zero is the most sensible value
* to put in the index, and is the only value that
* makes any sense if FSGSBASE is unavailable.
*/
if (task == current) {
[not used for ptrace]
} else {
task->thread.gsindex = 0;
x86_gsbase_write_task(task, arg2);
}

...

So writing the value that was already there to gsbase via putreg()
does nothing, but writing a *different* value implicitly clears gs.

This behavior is, AFAICT, complete nonsense. It happens to work
because usually gdb writes the same value back, and, in any case, gs
comes *after* gsbase in user_regs_struct, so gs gets replaced anyway.
But I think that this behavior should be fixed up and probably tested.
Certainly the behavior should *not* be the same on a fsgsbase kernel,
and the fsgsbase behavior definitely needs a selftest.
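
As a rough illustration of the kind of selftest being asked for, here is a
minimal user-space sketch (error handling elided, value arbitrary; this is
not the test that eventually landed) that pokes gs_base into a stopped
tracee and reads it back:

	#include <errno.h>
	#include <signal.h>
	#include <stddef.h>
	#include <stdio.h>
	#include <sys/ptrace.h>
	#include <sys/user.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		pid_t child = fork();

		if (child == 0) {
			/* Tracee: stop so the parent can poke at us. */
			ptrace(PTRACE_TRACEME, 0, NULL, NULL);
			raise(SIGSTOP);
			_exit(0);
		}
		waitpid(child, NULL, 0);

		/* Write an unusual base via POKEUSER, then read it back. */
		unsigned long base = 0x123456;

		ptrace(PTRACE_POKEUSER, child,
		       offsetof(struct user_regs_struct, gs_base), base);
		errno = 0;	/* PEEKUSER returns -1 on error with errno set */
		long got = ptrace(PTRACE_PEEKUSER, child,
				  offsetof(struct user_regs_struct, gs_base), NULL);
		printf("gs_base: wrote %#lx, read back %#lx\n",
		       base, (unsigned long)got);

		kill(child, SIGKILL);
		return 0;
	}

A real test would additionally load a nonzero %gs selector in the tracee
and check which of the selector and the base survives, per the cases
discussed here.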

2019-05-01 18:03:18

by Chang S. Bae

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry


> On May 1, 2019, at 10:40, Andy Lutomirski <[email protected]> wrote:
>
> On Wed, May 1, 2019 at 6:52 AM Bae, Chang Seok <[email protected]> wrote:
>>
>>
>>> On Apr 5, 2019, at 06:50, Andy Lutomirski <[email protected]> wrote:
>>>
>>>
>>>
>>>> On Apr 5, 2019, at 2:35 AM, Thomas Gleixner <[email protected]> wrote:
>>>>
>>>>> On Mon, 25 Mar 2019, Thomas Gleixner wrote:
>>>>>> On Fri, 15 Mar 2019, Chang S. Bae wrote:
>>>>>> ENTRY(paranoid_exit)
>>>>>> UNWIND_HINT_REGS
>>>>>> DISABLE_INTERRUPTS(CLBR_ANY)
>>>>>> TRACE_IRQS_OFF_DEBUG
>>>>>> + ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
>>>>>> + X86_FEATURE_FSGSBASE
>>>>>> + wrgsbase %rbx
>>>>>> + jmp .Lparanoid_exit_no_swapgs;
>>>>>
>>>>> Again. A few newlines would make it more readable.
>>>>>
>>>>> This modifies the semantics of paranoid_entry and paranoid_exit. Looking at
>>>>> the usage sites there is the following code in the nmi maze:
>>>>>
>>>>> /*
>>>>> * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
>>>>> * as we should not be calling schedule in NMI context.
>>>>> * Even with normal interrupts enabled. An NMI should not be
>>>>> * setting NEED_RESCHED or anything that normal interrupts and
>>>>> * exceptions might do.
>>>>> */
>>>>> call paranoid_entry
>>>>> UNWIND_HINT_REGS
>>>>>
>>>>> /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
>>>>> movq %rsp, %rdi
>>>>> movq $-1, %rsi
>>>>> call do_nmi
>>>>>
>>>>> /* Always restore stashed CR3 value (see paranoid_entry) */
>>>>> RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
>>>>>
>>>>> testl %ebx, %ebx /* swapgs needed? */
>>>>> jnz nmi_restore
>>>>> nmi_swapgs:
>>>>> SWAPGS_UNSAFE_STACK
>>>>> nmi_restore:
>>>>> POP_REGS
>>>>>
>>>>> I might be missing something, but how is that supposed to work when
>>>>> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
>>>>> there is a big fat comment missing explaining why.
>>>>
>>>> So this _is_ broken.
>>>>
>>>> On entry:
>>>>
>>>> rbx = rdgsbase()
>>>> wrgsbase(KERNEL_GS)
>>>>
>>>> On exit:
>>>>
>>>> if (ebx == 0)
>>>> swapgs
>>>>
>>>> The resulting matrix:
>>>>
>>>> | ENTRY GS | RBX | EXIT | GS on IRET | RESULT
>>>> | | | | |
>>>> 1 | KERNEL_GS | KERNEL_GS | EBX == 0 | USER_GS | FAIL
>>>> | | | | |
>>>> 2 | KERNEL_GS | KERNEL_GS | EBX != 0 | KERNEL_GS | ok
>>>> | | | | |
>>>> 3 | USER_GS | USER_GS | EBX == 0 | USER_GS | ok
>>>> | | | | |
>>>> 4 | USER_GS | USER_GS | EBX != 0 | KERNEL_GS | FAIL
>>>>
>>>>
>>>> #1 Just works by chance because it's unlikely that the lower 32bits of a
>>>> per CPU kernel GS are all 0.
>>>>
>>>> But it's just a question of probability that this turns into a
>>>> non-debuggable once per year crash (think KASLR).
>>>>
>>>> #4 This can happen when the NMI hits the kernel in some other entry code
>>>> _BEFORE_ or _AFTER_ swapgs.
>>>>
>>>> User space using GS addressing with GS[31:0] != 0 will crash and burn.
>>>>
>>>>
>>>
>>> Hi all-
>>>
>>> In a previous incarnation of these patches, I complained about the use of SWAPGS in the paranoid path. Now I’m putting my maintainer foot down. On a non-FSGSBASE system, the paranoid path knows, definitively, which GS is where, so SWAPGS is annoying. With FSGSBASE, unless you start looking at the RIP that you interrupted, you cannot know whether you have user or kernel GSBASE live, since they can have literally the same value. One of the numerous versions of this patch compared the values and just said “well, it’s harmless to SWAPGS if user code happens to use the same value as the kernel”. I complained that it was far too fragile.
>>>
>>> So I’m putting my foot down. If you all want my ack, you’re going to save the old GS, load the new one with WRGSBASE, and, on return, you’re going to restore the old one with WRGSBASE. You will not use SWAPGS in the paranoid path.
>>>
>>> Obviously, for the non-paranoid path, it all keeps working exactly like it does now.
>>
>> Although I can see some other concerns with this, it looks like it is still worth pursuing.
>>
>>>
>>> Furthermore, if you folks even want me to review this series, the ptrace tests need to be in place. On inspection of the current code (after the debacle a few releases back), it appears the SETREGSET’s effect depends on the current values in the registers — it does not actually seem to reliably load the whole state. So my confidence will be greatly increased if your series first adds a test that detects that bug (and fails!), then fixes the bug in a tiny little patch, then adds FSGSBASE, and keeps the test working.
>>>
>>
>> I think I need to understand the issue. Appreciate if you can elaborate a little bit.
>>
>
> This patch series gives a particular behavior to PTRACE_SETREGS and
> PTRACE_POKEUSER. There should be a test case that validates that
> behavior, including testing the weird cases where gs != 0 and gsbase
> contains unusual values. Some existing tests might be pretty close to
> doing what's needed.
>
> Beyond that, the current putreg() code does this:
>
> case offsetof(struct user_regs_struct,gs_base):
> /*
> * Exactly the same here as the %fs handling above.
> */
> if (value >= TASK_SIZE_MAX)
> return -EIO;
> if (child->thread.gsbase != value)
> return do_arch_prctl_64(child, ARCH_SET_GS, value);
> return 0;
>
> and do_arch_prctl_64(), in turn, does this:
>
> case ARCH_SET_GS: {
> if (unlikely(arg2 >= TASK_SIZE_MAX))
> return -EPERM;
>
> preempt_disable();
> /*
> * ARCH_SET_GS has always overwritten the index
> * and the base. Zero is the most sensible value
> * to put in the index, and is the only value that
> * makes any sense if FSGSBASE is unavailable.
> */
> if (task == current) {
> [not used for ptrace]
> } else {
> task->thread.gsindex = 0;
> x86_gsbase_write_task(task, arg2);
> }
>
> ...
>
> So writing the value that was already there to gsbase via putreg()
> does nothing, but writing a *different* value implicitly clears gs.
>
> This behavior is, AFAICT, complete nonsense. It happens to work
> because usually gdb writes the same value back, and, in any case, gs
> comes *after* gsbase in user_regs_struct, so gs gets replaced anyway.
> But I think that this behavior should be fixed up and probably tested.
> Certainly the behavior should *not* be the same on a fsgsbase kernel,
> and the fsgsbase behavior definitely needs a selftest.

Okay, got the point; now crystal clear.

I have my own test case for that though, need to find a very simple and
acceptable solution.

Thanks,
Chang

2019-05-01 20:28:43

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry



> On May 1, 2019, at 1:21 PM, Bae, Chang Seok <[email protected]> wrote:
>
>
>>> On May 1, 2019, at 11:01, Bae, Chang Seok <[email protected]> wrote:
>>>
>>> On May 1, 2019, at 10:40, Andy Lutomirski <[email protected]> wrote:
>>>
>>> On Wed, May 1, 2019 at 6:52 AM Bae, Chang Seok <[email protected]> wrote:
>>>>
>>>>
>>>>> On Apr 5, 2019, at 06:50, Andy Lutomirski <[email protected]> wrote:
>>>>>
>>>>> Furthermore, if you folks even want me to review this series, the ptrace tests need to be in place. On inspection of the current code (after the debacle a few releases back), it appears the SETREGSET’s effect depends on the current values in the registers — it does not actually seem to reliably load the whole state. So my confidence will be greatly increased if your series first adds a test that detects that bug (and fails!), then fixes the bug in a tiny little patch, then adds FSGSBASE, and keeps the test working.
>>>>>
>>>>
>>>> I think I need to understand the issue. Appreciate if you can elaborate a little bit.
>>>>
>>>
>>> This patch series gives a particular behavior to PTRACE_SETREGS and
>>> PTRACE_POKEUSER. There should be a test case that validates that
>>> behavior, including testing the weird cases where gs != 0 and gsbase
>>> contains unusual values. Some existing tests might be pretty close to
>>> doing what's needed.
>>>
>>> Beyond that, the current putreg() code does this:
>>>
>>> case offsetof(struct user_regs_struct,gs_base):
>>> /*
>>> * Exactly the same here as the %fs handling above.
>>> */
>>> if (value >= TASK_SIZE_MAX)
>>> return -EIO;
>>> if (child->thread.gsbase != value)
>>> return do_arch_prctl_64(child, ARCH_SET_GS, value);
>>> return 0;
>>>
>>> and do_arch_prctl_64(), in turn, does this:
>>>
>>> case ARCH_SET_GS: {
>>> if (unlikely(arg2 >= TASK_SIZE_MAX))
>>> return -EPERM;
>>>
>>> preempt_disable();
>>> /*
>>> * ARCH_SET_GS has always overwritten the index
>>> * and the base. Zero is the most sensible value
>>> * to put in the index, and is the only value that
>>> * makes any sense if FSGSBASE is unavailable.
>>> */
>>> if (task == current) {
>>> [not used for ptrace]
>>> } else {
>>> task->thread.gsindex = 0;
>>> x86_gsbase_write_task(task, arg2);
>>> }
>>>
>>> ...
>>>
>>> So writing the value that was already there to gsbase via putreg()
>>> does nothing, but writing a *different* value implicitly clears gs,
>>> but writing a different value will clear gs.
>>>
>>> This behavior is, AFAICT, complete nonsense. It happens to work
>>> because usually gdb writes the same value back, and, in any case, gs
>>> comes *after* gsbase in user_regs_struct, so gs gets replaced anyway.
>>> But I think that this behavior should be fixed up and probably tested.
>>> Certainly the behavior should *not* be the same on a fsgsbase kernel,
>>> and and the fsgsbase behavior definitely needs a selftest.
>>
>> Okay, got the point; now crystal clear.
>>
>> I have my own test case for that though, need to find a very simple and
>> acceptable solution.
>>
>
> One solution that I recall, HPA once suggested, is:
> Write registers in a reverse order from user_regs_struct, for SETREGS
>
> Assuming these for clarification, first:
> * old and new index != 0
> * taking GS as an example though, should be the same with FS
>
> Then, interesting cases would be something like these, without FSGSBASE:
> Case (a), when index only changed to (new index):
> (Then, the result after SETREGS would be)
> GS = (new index), GSBASE = the base fetched from (new index)
> Case (b), when base only changed to (new base):
> Case (c), when both are changed:
> GS = 0, GSBASE = (new base)
>
> Now, with FSGSBASE:
> Case (a):
> GS = (new index), GSBASE = (old base)
> Case (b):
> GS = (old index), GSBASE = (new base)
> Case (c):
> GS = (new index), GSBASE = (new base)
>
> As a reference, today's kernel behavior, without FSGSBASE:
> Case (a):
> GS = (new index), GSBASE = the base fetched from (new index)
> Case (b):
> GS = (old index), GSBASE = (old base)
> Case (c):
> GS = (new index), GSBASE = the base fetched from (new index)
>
> Now, with that reverse ordering and taking that "GSBASE is important" [1],
> it looks like it works in terms of its base value:
> Cases (b) and (c) will behave the same as with FSGSBASE.
> Case (a) still differs between w/ and w/o FSGSBASE.
> Well, I'd say this bit comes from the 'new model' vs. the 'legacy
> model'. So, then okay with that. Any thoughts?
>
>
>

This seems more complicated than needed. How about we just remove all the magic and make putreg on the base registers never change the selector.

As far as I can tell, the only downside is that, on a non-FSGSBASE kernel, setting only the base if the selector already has a nonzero value won’t work, but I would be quite surprised if this breaks anything.
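
For concreteness, a sketch of what that simplified putreg() case might look
like (hypothetical, modeled on the fragment quoted earlier in this thread
rather than on any merged code):

	case offsetof(struct user_regs_struct, gs_base):
		/*
		 * Write the base only and never touch the selector. On a
		 * non-FSGSBASE kernel a nonzero selector then keeps
		 * overriding the written base, which is the one downside
		 * noted above.
		 */
		if (value >= TASK_SIZE_MAX)
			return -EIO;
		x86_gsbase_write_task(child, value);
		return 0;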

2019-05-01 21:05:50

by Chang S. Bae

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry


> On May 1, 2019, at 13:25, Andy Lutomirski <[email protected]> wrote:
>
>
>
>> On May 1, 2019, at 1:21 PM, Bae, Chang Seok <[email protected]> wrote:
>>
>>
>>>> On May 1, 2019, at 11:01, Bae, Chang Seok <[email protected]> wrote:
>>>>
>>>> On May 1, 2019, at 10:40, Andy Lutomirski <[email protected]> wrote:
>>>>
>>>> On Wed, May 1, 2019 at 6:52 AM Bae, Chang Seok <[email protected]> wrote:
>>>>>
>>>>>
>>>>>> On Apr 5, 2019, at 06:50, Andy Lutomirski <[email protected]> wrote:
>>>>>>
>>>>>> Furthermore, if you folks even want me to review this series, the ptrace tests need to be in place. On inspection of the current code (after the debacle a few releases back), it appears the SETREGSET’s effect depends on the current values in the registers — it does not actually seem to reliably load the whole state. So my confidence will be greatly increased if your series first adds a test that detects that bug (and fails!), then fixes the bug in a tiny little patch, then adds FSGSBASE, and keeps the test working.
>>>>>>
>>>>>
>>>>> I think I need to understand the issue. Appreciate if you can elaborate a little bit.
>>>>>
>>>>
>>>> This patch series gives a particular behavior to PTRACE_SETREGS and
>>>> PTRACE_POKEUSER. There should be a test case that validates that
>>>> behavior, including testing the weird cases where gs != 0 and gsbase
>>>> contains unusual values. Some existing tests might be pretty close to
>>>> doing what's needed.
>>>>
>>>> Beyond that, the current putreg() code does this:
>>>>
>>>> case offsetof(struct user_regs_struct,gs_base):
>>>> /*
>>>> * Exactly the same here as the %fs handling above.
>>>> */
>>>> if (value >= TASK_SIZE_MAX)
>>>> return -EIO;
>>>> if (child->thread.gsbase != value)
>>>> return do_arch_prctl_64(child, ARCH_SET_GS, value);
>>>> return 0;
>>>>
>>>> and do_arch_prctl_64(), in turn, does this:
>>>>
>>>> case ARCH_SET_GS: {
>>>> if (unlikely(arg2 >= TASK_SIZE_MAX))
>>>> return -EPERM;
>>>>
>>>> preempt_disable();
>>>> /*
>>>> * ARCH_SET_GS has always overwritten the index
>>>> * and the base. Zero is the most sensible value
>>>> * to put in the index, and is the only value that
>>>> * makes any sense if FSGSBASE is unavailable.
>>>> */
>>>> if (task == current) {
>>>> [not used for ptrace]
>>>> } else {
>>>> task->thread.gsindex = 0;
>>>> x86_gsbase_write_task(task, arg2);
>>>> }
>>>>
>>>> ...
>>>>
>>>> So writing the value that was already there to gsbase via putreg()
>>>> does nothing, but writing a *different* value implicitly clears gs.
>>>>
>>>> This behavior is, AFAICT, complete nonsense. It happens to work
>>>> because usually gdb writes the same value back, and, in any case, gs
>>>> comes *after* gsbase in user_regs_struct, so gs gets replaced anyway.
>>>> But I think that this behavior should be fixed up and probably tested.
>>>> Certainly the behavior should *not* be the same on a fsgsbase kernel,
>>>> and the fsgsbase behavior definitely needs a selftest.
>>>
>>> Okay, got the point; now crystal clear.
>>>
>>> I have my own test case for that though, need to find a very simple and
>>> acceptable solution.
>>>
>>
>> One solution that I recall, HPA once suggested, is:
>> Write registers in a reverse order from user_regs_struct, for SETREGS
>>
>> Assuming these for clarification, first:
>> * old and new index != 0
>> * taking GS as an example though, should be the same with FS
>>
>> Then, interesting cases would be something like these, without FSGSBASE:
>> Case (a), when index only changed to (new index):
>> (Then, the result after SETREGS would be)
>> GS = (new index), GSBASE = the base fetched from (new index)
>> Case (b), when base only changed to (new base):
>> Case (c), when both are changed:
>> GS = 0, GSBASE = (new base)
>>
>> Now, with FSGSBASE:
>> Case (a):
>> GS = (new index), GSBASE = (old base)
>> Case (b):
>> GS = (old index), GSBASE = (new base)
>> Case (c):
>> GS = (new index), GSBASE = (new base)
>>
>> As a reference, today's kernel behavior, without FSGSBASE:
>> Case (a):
>> GS = (new index), GSBASE = the base fetched from (new index)
>> Case (b):
>> GS = (old index), GSBASE = (old base)
>> Case (c):
>> GS = (new index), GSBASE = the base fetched from (new index)
>>
>> Now, with that reverse ordering and taking that "GSBASE is important" [1],
>> it looks like it works in terms of its base value:
>> Cases (b) and (c) will behave the same as with FSGSBASE.
>> Case (a) still differs between w/ and w/o FSGSBASE.
>> Well, I'd say this bit comes from the 'new model' vs. the 'legacy
>> model'. So, then okay with that. Any thoughts?
>>
>>
>>
>
> This seems more complicated than needed. How about we just remove all the magic and make putreg on the base registers never change the selector.
>

Hmm, I just wonder what the benefit is in terms of making a non-FSGSBASE system
behave more similarly to one with FSGSBASE (although I would buy that removal).
Well, if we're okay with such divergence, maybe that's it.

> As far as I can tell, the only downside is that, on a non-FSGSBASE kernel, setting only the base if the selector already has a nonzero value won’t work, but I would be quite surprised if this breaks anything.

2019-05-02 00:31:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry



> On May 1, 2019, at 2:04 PM, Bae, Chang Seok <[email protected]> wrote:
>
>
>> On May 1, 2019, at 13:25, Andy Lutomirski <[email protected]> wrote:
>>
>>
>>
>>> On May 1, 2019, at 1:21 PM, Bae, Chang Seok <[email protected]> wrote:
>>>
>>>
>>>>> On May 1, 2019, at 11:01, Bae, Chang Seok <[email protected]> wrote:
>>>>>
>>>>> On May 1, 2019, at 10:40, Andy Lutomirski <[email protected]> wrote:
>>>>>
>>>>>> On Wed, May 1, 2019 at 6:52 AM Bae, Chang Seok <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>> On Apr 5, 2019, at 06:50, Andy Lutomirski <[email protected]> wrote:
>>>>>>>
>>>>>>> Furthermore, if you folks even want me to review this series, the ptrace tests need to be in place. On inspection of the current code (after the debacle a few releases back), it appears the SETREGSET’s effect depends on the current values in the registers — it does not actually seem to reliably load the whole state. So my confidence will be greatly increased if your series first adds a test that detects that bug (and fails!), then fixes the bug in a tiny little patch, then adds FSGSBASE, and keeps the test working.
>>>>>>>
>>>>>>
>>>>>> I think I need to understand the issue. Appreciate if you can elaborate a little bit.
>>>>>>
>>>>>
>>>>> This patch series gives a particular behavior to PTRACE_SETREGS and
>>>>> PTRACE_POKEUSER. There should be a test case that validates that
>>>>> behavior, including testing the weird cases where gs != 0 and gsbase
>>>>> contains unusual values. Some existing tests might be pretty close to
>>>>> doing what's needed.
>>>>>
>>>>> Beyond that, the current putreg() code does this:
>>>>>
>>>>> case offsetof(struct user_regs_struct,gs_base):
>>>>> /*
>>>>> * Exactly the same here as the %fs handling above.
>>>>> */
>>>>> if (value >= TASK_SIZE_MAX)
>>>>> return -EIO;
>>>>> if (child->thread.gsbase != value)
>>>>> return do_arch_prctl_64(child, ARCH_SET_GS, value);
>>>>> return 0;
>>>>>
>>>>> and do_arch_prctl_64(), in turn, does this:
>>>>>
>>>>> case ARCH_SET_GS: {
>>>>> if (unlikely(arg2 >= TASK_SIZE_MAX))
>>>>> return -EPERM;
>>>>>
>>>>> preempt_disable();
>>>>> /*
>>>>> * ARCH_SET_GS has always overwritten the index
>>>>> * and the base. Zero is the most sensible value
>>>>> * to put in the index, and is the only value that
>>>>> * makes any sense if FSGSBASE is unavailable.
>>>>> */
>>>>> if (task == current) {
>>>>> [not used for ptrace]
>>>>> } else {
>>>>> task->thread.gsindex = 0;
>>>>> x86_gsbase_write_task(task, arg2);
>>>>> }
>>>>>
>>>>> ...
>>>>>
>>>>> So writing the value that was already there to gsbase via putreg()
>>>>> does nothing, but writing a *different* value implicitly clears gs.
>>>>>
>>>>> This behavior is, AFAICT, complete nonsense. It happens to work
>>>>> because usually gdb writes the same value back, and, in any case, gs
>>>>> comes *after* gsbase in user_regs_struct, so gs gets replaced anyway.
>>>>> But I think that this behavior should be fixed up and probably tested.
>>>>> Certainly the behavior should *not* be the same on a fsgsbase kernel,
>>>>> and the fsgsbase behavior definitely needs a selftest.
>>>>
>>>> Okay, got the point; now crystal clear.
>>>>
>>>> I have my own test case for that though, need to find a very simple and
>>>> acceptable solution.
>>>>
>>>
>>> One solution that I recall, HPA once suggested, is:
>>> Write registers in a reverse order from user_regs_struct, for SETREGS
>>>
>>> Assuming these for clarification, first:
>>> * old and new index != 0
>>> * taking GS as an example though, should be the same with FS
>>>
>>> Then, with the reverse ordering but without FSGSBASE, the interesting
>>> cases would be:
>>> Case (a), when only the index is changed to (new index):
>>> (the result after SETREGS would be)
>>> GS = (new index), GSBASE = the base fetched from (new index)
>>> Case (b), when only the base is changed to (new base):
>>> GS = 0, GSBASE = (new base)
>>> Case (c), when both are changed:
>>> GS = 0, GSBASE = (new base)
>>>
>>> Now, with FSGSBASE:
>>> Case (a):
>>> GS = (new index), GSBASE = (old base)
>>> Case (b):
>>> GS = (old index), GSBASE = (new base)
>>> Case (c):
>>> GS = (new index), GSBASE = (new base)
>>>
>>> As a reference, today's kernel behavior, without FSGSBASE:
>>> Case (a):
>>> GS = (new index), GSBASE = the base fetched from (new index)
>>> Case (b):
>>> GS = (old index), GSBASE = (old base)
>>> Case (c):
>>> GS = (new index), GSBASE = the base fetched from (new index)
>>>
>>> Now, with that reverse ordering, and given that "GSBASE is important" [1],
>>> it looks like this works in terms of the base value:
>>> Case (b) and (c) will behave the same as with FSGSBASE.
>>> Case (a) still differs between w/ and w/o FSGSBASE.
>>> Well, I'd say that difference comes down to the 'new model' vs. the
>>> 'legacy model', so I am okay with it. Any thoughts?
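>>>
>>> For concreteness, the reverse-order write would be something like
>>> this (a sketch only; the helper name is hypothetical and this is
>>> not the actual genregs_set()):
>>>
>>> /* Walk user_regs_struct back to front, so the gs_base/fs_base
>>>  * writes land after the gs/fs selector writes. */
>>> static int putregs_reversed(struct task_struct *child,
>>> 			    const unsigned long *regs)
>>> {
>>> 	int offset = sizeof(struct user_regs_struct) - sizeof(long);
>>>
>>> 	for (; offset >= 0; offset -= sizeof(long)) {
>>> 		int ret = putreg(child, offset, regs[offset / sizeof(long)]);
>>>
>>> 		if (ret)
>>> 			return ret;
>>> 	}
>>> 	return 0;
>>> }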
>>>
>>
>> This seems more complicated than needed. How about we just remove all the magic and make putreg on the base registers never change the selector?
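>>
>> Roughly (a sketch, assuming x86_gsbase_write_task() can be called
>> directly here):
>>
>> case offsetof(struct user_regs_struct, gs_base):
>> 	if (value >= TASK_SIZE_MAX)
>> 		return -EIO;
>> 	/* Set the base only; leave the gs selector untouched. */
>> 	x86_gsbase_write_task(child, value);
>> 	return 0;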
>>
>
> Hmm, I just wonder what the benefit is of making a non-FSGSBASE system
> behave more like one with FSGSBASE (although I would buy that removal).

Simplicity. The current behavior is IMO nuts.

> Well, if we're okay with such divergence, maybe that's it.
>
>> As far as I can tell, the only downside is that, on a non-FSGSBASE kernel, setting only the base if the selector already has a nonzero value won’t work, but I would be quite surprised if this breaks anything.
>

2019-05-06 22:58:02

by Chang S. Bae

[permalink] [raw]
Subject: Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry



> On Mar 25, 2019, at 02:44, Thomas Gleixner <[email protected]> wrote:
>
> On Fri, 15 Mar 2019, Chang S. Bae wrote:
>
>> The FSGSBASE instructions allow fast accesses on GSBASE. Now, at the
>> paranoid_entry, the per-CPU base value can be always copied to GSBASE.
>> And the original GSBASE value will be restored at the exit.
>
> Again you are describing WHAT but not the WHY.
>
>> So far, GSBASE modification has not been directly allowed from userspace.
>> So, swapping GSBASE has been conditionally executed according to the
>> kernel-enforced convention that a negative GSBASE indicates a kernel value.
>> But when FSGSBASE is enabled, userspace can put an arbitrary value in
>> GSBASE. The change will secure a correct GSBASE value with FSGSBASE.
>
> So that's some WHY, but it should be explained _BEFORE_ explaining the
> change. This changelog style is as bad as top posting. Why?
>
> 1) FSGSBASE is fast
>
> 2) Copy GSBASE always on paranoid exit and restore on entry
>
> 3) Explain the context
>
> No. You want to explain context first and then explain why this needs a
> change when FSGSBASE is enabled and how that change looks like at the
> conceptual level.
>
>> Also, factor out the RDMSR-based GSBASE read into a new macro,
>> READ_MSR_GSBASE.
>
> This new macro is related to this change in what way? None AFAICT. I'm fine
> with the macro itself, but the benefit for a single usage site is dubious.
>
> Adding this macro and using it should be done with a separate patch before
> this one, so this patch becomes simpler to review.
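>
> Something like this is presumably all it needs to wrap (a sketch;
> RDMSR returns the base split across EDX:EAX):
>
> .macro READ_MSR_GSBASE save_reg:req
> 	movl	$MSR_GS_BASE, %ecx
> 	rdmsr
> 	shlq	$32, %rdx
> 	orq	%rdx, %rax
> 	movq	%rax, \save_reg
> .endm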
>
>> /*
>> @@ -1178,9 +1185,38 @@ ENTRY(paranoid_entry)
>> * This is also why CS (stashed in the "iret frame" by the
>> * hardware at entry) can not be used: this may be a return
>> * to kernel code, but with a user CR3 value.
>> + *
>> + * As long as this PTI macro doesn't depend on kernel GSBASE,
>> + * we can do it early. This is because FIND_PERCPU_BASE
>> + * references data in kernel space.
>
> It's not about 'can do it early'. The FSGSBASE handling requires that the
> kernel page tables are switched in.
>
> And for review and bisectability sake moving the CR3 switch in front of the
> GS handling should be done as a separate preparatory patch.
>
>> */
>> SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
>>
>> + /*
>> + * Read GSBASE by RDGSBASE. Kernel GSBASE is found
>> + * from the per-CPU offset table with a CPU NR.
>
> That CPU NR comes out of thin air, right? This code is complex enough by
> itself and does not need further confusion by comments which need a crystal
> ball for decoding.
>
>> + */
>
> Sigh. I can't see how that comment explains the ALTERNATIVE jump.
>
>> + ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", "",\
>> + X86_FEATURE_FSGSBASE
>
> Please separate the above from the below with a new line for readability
> sake.
>
>> + rdgsbase %rbx
>> + FIND_PERCPU_BASE %rax
>> + wrgsbase %rax
>
> So this really should be wrapped in a macro like:
>
> SAVE_AND_SET_GSBASE %rbx, %rax
>
> which makes it entirely clear what this is about.
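>
> I.e., built from the three instructions above, something like:
>
> .macro SAVE_AND_SET_GSBASE save_reg:req scratch_reg:req
> 	rdgsbase \save_reg
> 	FIND_PERCPU_BASE \scratch_reg
> 	wrgsbase \scratch_reg
> .endm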
>
>> + ret
>> +
>
>> @@ -1194,12 +1230,21 @@ END(paranoid_entry)
>> * be complicated. Fortunately, we there's no good reason
>> * to try to handle preemption here.
>> *
>> - * On entry, ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
>> + * On entry,
>> + * With FSGSBASE,
>> + * %rbx is original GSBASE that needs to be restored on the exit
>> + * Without that,
>> + * %ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
>> */
>> ENTRY(paranoid_exit)
>> UNWIND_HINT_REGS
>> DISABLE_INTERRUPTS(CLBR_ANY)
>> TRACE_IRQS_OFF_DEBUG
>> + ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
>> + X86_FEATURE_FSGSBASE
>> + wrgsbase %rbx
>> + jmp .Lparanoid_exit_no_swapgs;
>
> Again. A few newlines would make it more readable.
>
> This modifies the semantics of paranoid_entry and paranoid_exit. Looking at
> the usage sites there is the following code in the nmi maze:
>
> /*
> * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
> * as we should not be calling schedule in NMI context.
> * Even with normal interrupts enabled. An NMI should not be
> * setting NEED_RESCHED or anything that normal interrupts and
> * exceptions might do.
> */
> call paranoid_entry
> UNWIND_HINT_REGS
>
> /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
> movq %rsp, %rdi
> movq $-1, %rsi
> call do_nmi
>
> /* Always restore stashed CR3 value (see paranoid_entry) */
> RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
>
> testl %ebx, %ebx /* swapgs needed? */
> jnz nmi_restore
> nmi_swapgs:
> SWAPGS_UNSAFE_STACK
> nmi_restore:
> POP_REGS
>

Oh! I almost missed this bit. It would be terrifying to leave it like this.

> I might be missing something, but how is that supposed to work when
> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
> there is a big fat comment missing explaining why.
>

You will see a revision shortly. Thanks
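For the record, I'd expect the NMI exit to need something along these
lines (a sketch only; the new label name is hypothetical):

	/*
	 * With FSGSBASE, paranoid_entry left the original GSBASE in
	 * %rbx; restore it instead of testing the swapgs flag.
	 */
	ALTERNATIVE "jmp .Lnmi_no_fsgsbase", "", X86_FEATURE_FSGSBASE

	wrgsbase	%rbx
	jmp	nmi_restore

.Lnmi_no_fsgsbase:
	testl	%ebx, %ebx	/* swapgs needed? */
	jnz	nmi_restore
	SWAPGS_UNSAFE_STACK
nmi_restore:
	POP_REGS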

Chang