2018-05-31 18:05:00

by Chang S. Bae

Subject: [PATCH V2 00/15] x86: Enable FSGSBASE instructions

FSGSBASE is a 64-bit instruction set that allows the FS/GS
base to be read and written from any privilege level. It has
been available since Ivy Bridge, but the enabling effort has
dragged on for quite a while [2,3,4] for various reasons.
After extended discussions [1], this patchset is proposed to
introduce new ABIs for customizing the FS/GS base (separately
from its selector).

An FSGSBASE-enabled VM can run on hosts with either HW
virtualization or SW emulation. KVM advertises FSGSBASE when
the physical CPU has it, and emulation is supported in
QEMU/TCG [5]. In a pool of mixed systems, a VMM may disable
FSGSBASE for seamless VM migration [6].

A couple of major benefits are expected. The kernel gains a
performance improvement on context switch by skipping the MSR
write for the GS base. User-level programs (such as Java-based
ones) benefit from avoiding system calls to edit the FS/GS
base.
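
For illustration, a user-space thread library could replace
the arch_prctl() round trip with a direct register write. This
is a minimal sketch, not part of the series: it assumes a
compiler with -mfsgsbase, and `tls` is a hypothetical pointer
to the new thread's TLS block.

    #include <immintrin.h>
    #include <asm/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* before: FS base update via syscall */
    syscall(SYS_arch_prctl, ARCH_SET_FS, (unsigned long)tls);

    /* after: unprivileged write, no kernel entry */
    _writefsbase_u64((unsigned long)tls);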

Changes when FSGSBASE is enabled:
(1) On context switch, a thread's FS/GS base is preserved
regardless of its selector, based on the discussion [1].
(2) (Subsequently) a ptracer should expect the FS/GS index
and base values to diverge. There was a controversial
debate on the resulting backward incompatibility
(cases involving GDB as well as other toolchains [7,8]).
The current patchset, as a baseline version, does not
contain support for backward compatibility.
(3) On paranoid entry, the GS base is updated to the per-CPU
base, and the original base is restored on exit.

Updates from V1 [9]:
* Do (3) instead of comparing the current and kernel GS bases.
* For (2), do not include the ptracer backward compatibility
patches.
* With (3), the CPU number is stored as early as possible
(before hitting the IST stack) rather than at vDSO
initialization.
* Include FSGSBASE documentation and enumerate the capability
for user space.
* Add the TAINT_INSECURE flag and a vDSO cleanup for the (CPU
number) initialization.

[1] Recent discussion on LKML:
https://marc.info/?t=150147053700001&r=1&w=2
[2] Andy Lutomirski's rebase work:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fsgsbase
[3] Patch set shown in year 2016:
https://marc.info/?t=145857711900001&r=1&w=2
[4] First patch set: https://lkml.org/lkml/2015/4/10/573
[5] QEMU with FSGSBASE emulation:
https://github.com/qemu/qemu/blob/026aaf47c02b79036feb830206cfebb2a726510d/target/i386/translate.c#L8186
[6] 5-level EPT:
http://lkml.kernel.org/r/[email protected]
[7] RR/FSGSBASE:
https://mail.mozilla.org/pipermail/rr-dev/2018-March/000616.html
[8] CRIU/FSGSBASE:
https://lists.openvz.org/pipermail/criu/2018-March/040654.html
[9] V1:
https://lkml.org/lkml/2018/3/19/1699

Andi Kleen (3):
x86/fsgsbase/64: Add intrinsics/macros for FSGSBASE instructions
x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
x86/fsgsbase/64: Add documentation for FSGSBASE

Andy Lutomirski (4):
x86/fsgsbase/64: Make ptrace read FS/GS base accurately
x86/fsgsbase/64: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
x86/fsgsbase/64: Preserve FS/GS state in __switch_to if FSGSBASE is on
x86/fsgsbase/64: Enable FSGSBASE by default and add a chicken bit

Chang S. Bae (8):
x86/fsgsbase/64: Introduce FS/GS base helper functions
x86/fsgsbase/64: Use FS/GS base helpers in core dump
x86/fsgsbase/64: Factor out load FS/GS segments from __switch_to
x86/vdso: Move out the CPU number store
taint: Add taint for insecure
x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
x86/fsgsbase/64: When copying a thread, use FSGSBASE if enabled
x86/fsgsbase/64: Use per-CPU base as GS base on paranoid_entry

Documentation/admin-guide/kernel-parameters.txt | 2 +
Documentation/sysctl/kernel.txt | 1 +
Documentation/x86/entry_64.txt | 9 +
Documentation/x86/fsgs.txt | 104 +++++++++
arch/x86/entry/entry_64.S | 74 +++++--
arch/x86/entry/vdso/vgetcpu.c | 2 +-
arch/x86/entry/vdso/vma.c | 38 +---
arch/x86/include/asm/elf.h | 6 +-
arch/x86/include/asm/fsgsbase.h | 171 +++++++++++++++
arch/x86/include/asm/inst.h | 15 ++
arch/x86/include/asm/segment.h | 4 +
arch/x86/include/asm/vgtod.h | 2 -
arch/x86/include/uapi/asm/hwcap2.h | 3 +
arch/x86/kernel/cpu/common.c | 39 ++++
arch/x86/kernel/process_64.c | 274 ++++++++++++++++++++----
arch/x86/kernel/ptrace.c | 28 +--
arch/x86/kernel/setup_percpu.c | 30 ++-
include/linux/kernel.h | 3 +-
kernel/panic.c | 1 +
19 files changed, 681 insertions(+), 125 deletions(-)
create mode 100644 Documentation/x86/fsgs.txt
create mode 100644 arch/x86/include/asm/fsgsbase.h

--
2.7.4



2018-05-31 18:00:05

by Chang S. Bae

Subject: [PATCH V2 03/15] x86/fsgsbase/64: Use FS/GS base helpers in core dump

When the new FSGSBASE instructions are enabled, these reads
will be switched to the faster instruction-based path.

Based-on-code-from: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/elf.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 0d157d2..49acefb 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -10,6 +10,7 @@
#include <asm/ptrace.h>
#include <asm/user.h>
#include <asm/auxvec.h>
+#include <asm/fsgsbase.h>

typedef unsigned long elf_greg_t;

@@ -205,7 +206,6 @@ void set_personality_ia32(bool);

#define ELF_CORE_COPY_REGS(pr_reg, regs) \
do { \
- unsigned long base; \
unsigned v; \
(pr_reg)[0] = (regs)->r15; \
(pr_reg)[1] = (regs)->r14; \
@@ -228,8 +228,8 @@ do { \
(pr_reg)[18] = (regs)->flags; \
(pr_reg)[19] = (regs)->sp; \
(pr_reg)[20] = (regs)->ss; \
- rdmsrl(MSR_FS_BASE, base); (pr_reg)[21] = base; \
- rdmsrl(MSR_KERNEL_GS_BASE, base); (pr_reg)[22] = base; \
+ (pr_reg)[21] = read_fsbase(); \
+ (pr_reg)[22] = read_inactive_gsbase(); \
asm("movl %%ds,%0" : "=r" (v)); (pr_reg)[23] = v; \
asm("movl %%es,%0" : "=r" (v)); (pr_reg)[24] = v; \
asm("movl %%fs,%0" : "=r" (v)); (pr_reg)[25] = v; \
--
2.7.4


2018-05-31 18:00:25

by Chang S. Bae

Subject: [PATCH V2 15/15] x86/fsgsbase/64: Add documentation for FSGSBASE

From: Andi Kleen <[email protected]>

v2: Minor updates to documentation requested in review.
v3: Update for new gcc and various improvements.

Signed-off-by: Andi Kleen <[email protected]>
[chang: Minor edit and include descriptions for entry
changes by FSGSBASE]
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
Documentation/x86/entry_64.txt | 9 ++++
Documentation/x86/fsgs.txt | 104 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 113 insertions(+)
create mode 100644 Documentation/x86/fsgs.txt

diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
index c1df8eb..9ff38d9 100644
--- a/Documentation/x86/entry_64.txt
+++ b/Documentation/x86/entry_64.txt
@@ -102,3 +102,12 @@ We try to only use IST entries and the paranoid entry code for vectors
that absolutely need the more expensive check for the GS base - and we
generate all 'normal' entry points with the regular (faster) paranoid=0
variant.
+
+When FSGSBASE is enabled, an arbitrary GS base (including a negative
+value) is possible in user space. Thus, the current GS base no longer
+offers any useful guess about its origin, whether kernel or user space.
+At the same time, finding the proper kernel GS base is cheap: using
+RDPID/LSL to get the CPU number and indexing it into a table yields the
+per-CPU base. Therefore, the paranoid entry with FSGSBASE always
+overwrites the GS base with the per-CPU base, while the original GS
+base is saved and restored back on the paranoid exit.
diff --git a/Documentation/x86/fsgs.txt b/Documentation/x86/fsgs.txt
new file mode 100644
index 0000000..ebe15f3
--- /dev/null
+++ b/Documentation/x86/fsgs.txt
@@ -0,0 +1,104 @@
+
+Using FS and GS prefixes on 64bit x86 Linux
+
+The x86 architecture supports segment prefixes per instruction to add an
+offset to an address. On 64bit x86, these are mostly nops, except for FS
+and GS.
+
+This offers an efficient way to reference a global pointer.
+
+The compiler has to generate special code to use these base registers,
+or they can be accessed with inline assembler.
+
+ mov %gs:offset,%reg
+ mov %fs:offset,%reg
+
+In 64bit code, FS is used to address thread local storage (TLS),
+declared using __thread. The compiler then automatically generates the
+correct prefixes and relocations to access these values.
+
+FS is normally managed by the runtime code or the threading library.
+Overwriting it can break a lot of things (including syscalls and gdb),
+but it can make sense to save/restore it for threading purposes.
+
+GS is freely available, but may need special (compiler or inline assembler)
+code to use.
+
+Traditionally 64bit FS and GS could be set by the arch_prctl system call
+
+ arch_prctl(ARCH_SET_GS, value)
+ arch_prctl(ARCH_SET_FS, value)
+
+[There was also an older method using modify_ldt(), inherited from 32bit,
+but this is not discussed here.]
+
+However using a syscall is problematic for user space threading libraries
+that want to context switch in user space. The whole point of them
+is avoiding the overhead of a syscall. It's also cleaner for compilers
+wanting to use the extra register to use instructions to write
+it, or read it directly to compute addresses and offsets.
+
+Newer Intel CPUs (Ivy Bridge and later) added new instructions to directly
+access these registers quickly from user context
+
+ RDFSBASE %reg read the FS base (or _readfsbase_u64)
+ RDGSBASE %reg read the GS base (or _readgsbase_u64)
+
+ WRFSBASE %reg write the FS base (or _writefsbase_u64)
+ WRGSBASE %reg write the GS base (or _writegsbase_u64)
+
+To use the intrinsics, include <immintrin.h> and set the -mfsgsbase option.
+
+The instructions are supported by the CPU when the "fsgsbase" string is shown
+in /proc/cpuinfo (or can be retrieved directly through the CPUID instruction:
+leaf 7, subleaf 0, EBX bit 0).
+
+The instructions are only available to 64bit binaries.
+
+In addition, the kernel needs to explicitly enable these instructions, as it
+may otherwise not correctly context switch the state. Newer Linux
+kernels enable this. If the kernel does not enable the instructions,
+they will fault with a #UD exception.
+
+An FSGSBASE enabled kernel can be detected by checking the AT_HWCAP2
+bitmask in the aux vector. When the HWCAP2_FSGSBASE bit is set the
+kernel supports RDFSGSBASE.
+
+ #include <sys/auxv.h>
+ #include <elf.h>
+
+ /* Will be eventually in asm/hwcap.h */
+ #define HWCAP2_FSGSBASE (1 << 1)
+
+ unsigned val = getauxval(AT_HWCAP2);
+ if (val & HWCAP2_FSGSBASE) {
+ asm("wrgsbase %0" :: "r" (ptr));
+ }
+
+No extra CPUID check is needed, as the kernel will not set this bit if the
+CPU does not support the instructions.
+
+gcc 6 and later have special support for directly accessing data
+relative to fs/gs using the __seg_fs and __seg_gs address space
+pointer modifiers.
+
+#ifndef __SEG_GS
+#error "Need gcc 6 or later"
+#endif
+
+struct gsdata {
+ int a;
+ int b;
+} gsdata = { 1, 2 };
+
+int __seg_gs *valp = 0; /* offset relative to GS */
+
+ /* Check if kernel supports FSGSBASE as above */
+
+ /* Set up new GS */
+ asm("wrgsbase %0" :: "r" (&gsdata));
+
+ /* Now the global pointer can be used normally */
+ printf("gsdata.a = %d\n", valp->a);
+
+Andi Kleen
--
2.7.4


2018-05-31 18:00:48

by Chang S. Bae

Subject: [PATCH V2 04/15] x86/fsgsbase/64: Factor out load FS/GS segments from __switch_to

Instead of open coding, the new load_fsgs() helper cleans up
__switch_to and is symmetric with the FS/GS segment save path.
When FSGSBASE is enabled, an X86_FEATURE_FSGSBASE check will
be incorporated here.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/kernel/process_64.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e498671..cebf240 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -279,6 +279,15 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
}
}

+static __always_inline void load_fsgs(struct thread_struct *prev,
+ struct thread_struct *next)
+{
+ load_seg_legacy(prev->fsindex, prev->fsbase,
+ next->fsindex, next->fsbase, FS);
+ load_seg_legacy(prev->gsindex, prev->gsbase,
+ next->gsindex, next->gsbase, GS);
+}
+
static unsigned long task_seg_base(struct task_struct *task,
unsigned short selector)
{
@@ -588,10 +597,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (unlikely(next->ds | prev->ds))
loadsegment(ds, next->ds);

- load_seg_legacy(prev->fsindex, prev->fsbase,
- next->fsindex, next->fsbase, FS);
- load_seg_legacy(prev->gsindex, prev->gsbase,
- next->gsindex, next->gsbase, GS);
+ load_fsgs(prev, next);

switch_fpu_finish(next_fpu, cpu);

--
2.7.4


2018-05-31 18:01:35

by Chang S. Bae

Subject: [PATCH V2 12/15] x86/fsgsbase/64: Use per-CPU base as GS base on paranoid_entry

FSGSBASE allows fast access to the GS base. With it, the
per-CPU base is unconditionally copied to the GS base on
paranoid entry, and the original GS base value is restored
on exit.

Currently, userspace can't modify the GS base, and the
kernel's convention is that a negative GS base means a kernel
value while a positive GS base means a user value. But with
FSGSBASE enabled, userspace can put arbitrary data in there,
so the sign no longer says anything about the origin. The
non-FSGSBASE behavior is unchanged by this patch.

The per-CPU base is found in the __per_cpu_offset table,
indexed by the CPU number, which is read either from the
(per-CPU) segment limit or via the RDPID instruction; see the
sketch below.

A GAS-compatible RDPID macro is included.
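
In C-like pseudocode, the lookup that the new FIND_PERCPU_BASE
macro performs amounts to the following. This is only an
illustrative sketch; rdpid_or_lsl() is a hypothetical stand-in
for the RDPID/LSL alternative in the actual assembly macro.

    /* hypothetical C rendering of FIND_PERCPU_BASE */
    static unsigned long find_percpu_base(void)
    {
            unsigned long cpu = rdpid_or_lsl(); /* RDPID, or LSL on __PER_CPU_SEG */

            cpu &= PERCPU_CPU_MASK;             /* low 12 bits hold the CPU number */
            return __per_cpu_offset[cpu];       /* kernel GS base for this CPU */
    }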

Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/entry/entry_64.S | 74 +++++++++++++++++++++++++++++++++--------
arch/x86/include/asm/fsgsbase.h | 57 +++++++++++++++++++++++++++++++
arch/x86/include/asm/inst.h | 15 +++++++++
3 files changed, 132 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3166b96..cfac4c0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -38,6 +38,8 @@
#include <asm/export.h>
#include <asm/frame.h>
#include <asm/nospec-branch.h>
+#include <asm/vdso.h>
+#include <asm/fsgsbase.h>
#include <linux/err.h>

#include "calling.h"
@@ -954,10 +956,14 @@ ENTRY(\sym)
addq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif

- /* these procedures expect "no swapgs" flag in ebx */
.if \paranoid
+ /*
+ * With FSGSBASE, original GS base is stored in rbx
+ * Without FSGSBASE, expect "no swapgs" flag in ebx
+ */
jmp paranoid_exit
.else
+ /* expect "no swapgs" flag in ebx */
jmp error_exit
.endif

@@ -1168,26 +1174,57 @@ idtentry machine_check do_mce has_error_code=0 paranoid=1
#endif

/*
- * Save all registers in pt_regs, and switch gs if needed.
- * Use slow, but surefire "are we in kernel?" check.
- * Return: ebx=0: need swapgs on exit, ebx=1: otherwise
+ * Save all registers in pt_regs.
+ *
+ * When FSGSBASE enabled, current GS base is always copied to rbx.
+ *
+ * Without FSGSBASE, SWAPGS is needed when entering from userspace.
+ * A positive GS base means it is a user value and a negative GS
+ * base means it is a kernel value.
+ *
+ * Return:
+ * With FSGSBASE, rbx has current GS base.
+ * Without that,
+ * ebx=0: need SWAPGS on exit, ebx=1: otherwise
*/
ENTRY(paranoid_entry)
UNWIND_HINT_FUNC
cld
PUSH_AND_CLEAR_REGS save_ret=1
ENCODE_FRAME_POINTER 8
- movl $1, %ebx
- movl $MSR_GS_BASE, %ecx
- rdmsr
- testl %edx, %edx
- js 1f /* negative -> in kernel */
- SWAPGS
- xorl %ebx, %ebx

-1:
+ /*
+ * As long as this PTI macro doesn't depend on kernel GS base,
+ * we can do it early. This is because FIND_PERCPU_BASE
+ * references data in kernel space.
+ */
SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14

+ /*
+ * Read GS base by RDGSBASE. Kernel GS base is found
+ * from the per-CPU offset table with CPU number.
+ */
+ ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", "",\
+ X86_FEATURE_FSGSBASE
+ RDGSBASE %rbx
+ FIND_PERCPU_BASE %rax
+ WRGSBASE %rax
+ ret
+
+.Lparanoid_entry_no_fsgsbase:
+ movl $1, %ebx
+ /*
+ * FSGSBASE is not in use, so depend on the kernel-enforced
+ * convention that a negative GS base indicates a kernel value.
+ */
+ READ_MSR_GSBASE save_reg=%edx
+ testl %edx, %edx /* negative -> in kernel */
+ jns .Lparanoid_entry_swapgs
+ ret
+
+.Lparanoid_entry_swapgs:
+ SWAPGS
+ xorl %ebx, %ebx
ret
END(paranoid_entry)

@@ -1201,12 +1238,21 @@ END(paranoid_entry)
* be complicated. Fortunately, we there's no good reason
* to try to handle preemption here.
*
- * On entry, ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
+ * On entry,
+ * With FSGSBASE,
+ * rbx is original GS base that needs to be restored on the exit
+ * Without that,
+ * ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
*/
ENTRY(paranoid_exit)
UNWIND_HINT_REGS
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_OFF_DEBUG
+ ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
+ X86_FEATURE_FSGSBASE
+ WRGSBASE %rbx
+ jmp .Lparanoid_exit_no_swapgs;
+.Lparanoid_exit_no_fsgsbase:
testl %ebx, %ebx /* swapgs needed? */
jnz .Lparanoid_exit_no_swapgs
TRACE_IRQS_IRETQ
@@ -1217,7 +1263,7 @@ ENTRY(paranoid_exit)
TRACE_IRQS_IRETQ_DEBUG
RESTORE_CR3 scratch_reg=%rbx save_reg=%r14
.Lparanoid_exit_restore:
- jmp restore_regs_and_return_to_kernel
+ jmp restore_regs_and_return_to_kernel
END(paranoid_exit)

/*
diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index 903c7a0..3a5e1ec 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -107,6 +107,63 @@ void write_inactive_gsbase(unsigned long gsbase);
MODRM 0xd0 wrgsbase_opd 1
.endm

+#ifdef CONFIG_SMP
+
+/*
+ * Fetch the per-CPU GSBASE value for this processor and put it in @reg.
+ * We normally use %GS for accessing per-CPU data, but we are setting up
+ * %GS here and obviously can not use %GS itself to access per-CPU data.
+ */
+.macro FIND_PERCPU_BASE_RDPID reg:req
+ RDPID \reg
+
+ /*
+ * CPU number is written before IST initialization. Later,
+ * processor id is (also) written during vDSO initialization,
+ * with 12 bits for the CPU and 8 bits for the node.
+ */
+ andq $PERCPU_CPU_MASK, \reg
+ /*
+ * Kernel GS base is looked up from the __per_cpu_offset list with
+ * the CPU number (processor id).
+ */
+ movq __per_cpu_offset(, \reg, 8), \reg
+.endm
+
+.macro FIND_PERCPU_BASE_SEG_LIMIT reg:req
+ /* CPU number is found from the limit of PER_CPU entry in GDT */
+ movq $__PER_CPU_SEG, \reg
+ lsl \reg, \reg
+
+ /* Same as FIND_PERCPU_BASE_RDPID */
+ andq $PERCPU_CPU_MASK, \reg
+ movq __per_cpu_offset(, \reg, 8), \reg
+.endm
+
+.macro FIND_PERCPU_BASE reg:req
+ ALTERNATIVE \
+ "FIND_PERCPU_BASE_SEG_LIMIT \reg", \
+ "FIND_PERCPU_BASE_RDPID \reg", \
+ X86_FEATURE_RDPID
+.endm
+
+#else
+
+.macro FIND_PERCPU_BASE reg:req
+ /* Tracking the base offset value */
+ movq pcpu_unit_offsets(%rip), \reg
+.endm
+
+#endif /* CONFIG_SMP */
+
+.macro READ_MSR_GSBASE save_reg:req
+ movl $MSR_GS_BASE, %ecx
+ /* Read MSR specified by %ecx into %edx:%eax */
+ rdmsr
+ .ifnc \save_reg, %edx
+ movl %edx, \save_reg
+ .endif
+.endm
#endif /* CONFIG_X86_64 */

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/inst.h b/arch/x86/include/asm/inst.h
index f5a796d..d063841 100644
--- a/arch/x86/include/asm/inst.h
+++ b/arch/x86/include/asm/inst.h
@@ -306,6 +306,21 @@
.endif
MODRM 0xc0 movq_r64_xmm_opd1 movq_r64_xmm_opd2
.endm
+
+.macro RDPID opd
+ REG_TYPE rdpid_opd_type \opd
+ .if rdpid_opd_type == REG_TYPE_R64
+ R64_NUM rdpid_opd \opd
+ .else
+ R32_NUM rdpid_opd \opd
+ .endif
+ .byte 0xf3
+ .if rdpid_opd > 7
+ PFX_REX rdpid_opd 0
+ .endif
+ .byte 0x0f, 0xc7
+ MODRM 0xc0 rdpid_opd 0x7
+.endm
#endif

#endif
--
2.7.4


2018-05-31 18:01:35

by Chang S. Bae

Subject: [PATCH V2 01/15] x86/fsgsbase/64: Introduce FS/GS base helper functions

With the new helpers, FS/GS base access is centralized.
Eventually, when the FSGSBASE instructions are enabled, it
will also become faster.

The helpers are used by the ptrace APIs (PTRACE_ARCH_PRCTL,
PTRACE_SETREG, PTRACE_GETREG, etc). The idea is to keep the
FS/GS update mechanism organized.

The notions of "active" and "inactive" are used to distinguish
the "kernel" and "user" GS bases. The "inactive" GS base is
the GS base of the inactive (user) context, backed up at
kernel entry.
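
For illustration, a typical caller looks like this (a minimal
sketch of the API introduced below; `task` stands for any
task_struct pointer, including current):

    /* the FS base the task would have if it were resumed */
    unsigned long fsbase = read_task_fsbase(task);

    /* set the user GS base; returns -EPERM if the base is too large */
    int err = write_task_gsbase(task, gsbase);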

Based-on-code-from: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 47 +++++++++++++++
arch/x86/kernel/process_64.c | 128 +++++++++++++++++++++++++++++-----------
arch/x86/kernel/ptrace.c | 28 +++------
3 files changed, 149 insertions(+), 54 deletions(-)
create mode 100644 arch/x86/include/asm/fsgsbase.h

diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
new file mode 100644
index 0000000..0d4fbef
--- /dev/null
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_FSGSBASE_H
+#define _ASM_FSGSBASE_H 1
+
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_X86_64
+
+#include <asm/msr-index.h>
+
+/*
+ * Read/write an (inactive) task's fsbase or gsbase. This returns
+ * the value that the FS/GS base would have (if the task were to be
+ * resumed). The current task is also supported.
+ */
+unsigned long read_task_fsbase(struct task_struct *task);
+unsigned long read_task_gsbase(struct task_struct *task);
+int write_task_fsbase(struct task_struct *task, unsigned long fsbase);
+int write_task_gsbase(struct task_struct *task, unsigned long gsbase);
+
+/* Helper functions for reading/writing FS/GS base */
+
+static inline unsigned long read_fsbase(void)
+{
+ unsigned long fsbase;
+
+ rdmsrl(MSR_FS_BASE, fsbase);
+ return fsbase;
+}
+
+void write_fsbase(unsigned long fsbase);
+
+static inline unsigned long read_inactive_gsbase(void)
+{
+ unsigned long gsbase;
+
+ rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
+ return gsbase;
+}
+
+void write_inactive_gsbase(unsigned long gsbase);
+
+#endif /* CONFIG_X86_64 */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_FSGSBASE_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 12bb445..5506e1b 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -54,6 +54,7 @@
#include <asm/vdso.h>
#include <asm/intel_rdt_sched.h>
#include <asm/unistd.h>
+#include <asm/fsgsbase.h>
#ifdef CONFIG_IA32_EMULATION
/* Not included via unistd.h */
#include <asm/unistd_32_ia32.h>
@@ -278,6 +279,90 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
}
}

+void write_fsbase(unsigned long fsbase)
+{
+ /* set the selector to 0 to not confuse __switch_to */
+ loadseg(FS, 0);
+ wrmsrl(MSR_FS_BASE, fsbase);
+}
+
+void write_inactive_gsbase(unsigned long gsbase)
+{
+ /* set the selector to 0 to not confuse __switch_to */
+ loadseg(GS, 0);
+ wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
+}
+
+unsigned long read_task_fsbase(struct task_struct *task)
+{
+ unsigned long fsbase;
+
+ if (task == current)
+ fsbase = read_fsbase();
+ else
+ /*
+ * XXX: This will not behave as expected if called
+ * if fsindex != 0
+ */
+ fsbase = task->thread.fsbase;
+
+ return fsbase;
+}
+
+unsigned long read_task_gsbase(struct task_struct *task)
+{
+ unsigned long gsbase;
+
+ if (task == current)
+ gsbase = read_inactive_gsbase();
+ else
+ /*
+ * XXX: This will not behave as expected if called
+ * if gsindex != 0
+ */
+ gsbase = task->thread.gsbase;
+
+ return gsbase;
+}
+
+int write_task_fsbase(struct task_struct *task, unsigned long fsbase)
+{
+ int cpu;
+
+ /*
+ * Not strictly needed for fs, but do it for symmetry
+ * with gs
+ */
+ if (unlikely(fsbase >= TASK_SIZE_MAX))
+ return -EPERM;
+
+ cpu = get_cpu();
+ task->thread.fsbase = fsbase;
+ if (task == current)
+ write_fsbase(fsbase);
+ task->thread.fsindex = 0;
+ put_cpu();
+
+ return 0;
+}
+
+int write_task_gsbase(struct task_struct *task, unsigned long gsbase)
+{
+ int cpu;
+
+ if (unlikely(gsbase >= TASK_SIZE_MAX))
+ return -EPERM;
+
+ cpu = get_cpu();
+ task->thread.gsbase = gsbase;
+ if (task == current)
+ write_inactive_gsbase(gsbase);
+ task->thread.gsindex = 0;
+ put_cpu();
+
+ return 0;
+}
+
int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
unsigned long arg, struct task_struct *p, unsigned long tls)
{
@@ -618,54 +703,27 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
{
int ret = 0;
- int doit = task == current;
- int cpu;

switch (option) {
- case ARCH_SET_GS:
- if (arg2 >= TASK_SIZE_MAX)
- return -EPERM;
- cpu = get_cpu();
- task->thread.gsindex = 0;
- task->thread.gsbase = arg2;
- if (doit) {
- load_gs_index(0);
- ret = wrmsrl_safe(MSR_KERNEL_GS_BASE, arg2);
- }
- put_cpu();
+ case ARCH_SET_GS: {
+ ret = write_task_gsbase(task, arg2);
break;
- case ARCH_SET_FS:
- /* Not strictly needed for fs, but do it for symmetry
- with gs */
- if (arg2 >= TASK_SIZE_MAX)
- return -EPERM;
- cpu = get_cpu();
- task->thread.fsindex = 0;
- task->thread.fsbase = arg2;
- if (doit) {
- /* set the selector to 0 to not confuse __switch_to */
- loadsegment(fs, 0);
- ret = wrmsrl_safe(MSR_FS_BASE, arg2);
- }
- put_cpu();
+ }
+ case ARCH_SET_FS: {
+ ret = write_task_fsbase(task, arg2);
break;
+ }
case ARCH_GET_FS: {
unsigned long base;

- if (doit)
- rdmsrl(MSR_FS_BASE, base);
- else
- base = task->thread.fsbase;
+ base = read_task_fsbase(task);
ret = put_user(base, (unsigned long __user *)arg2);
break;
}
case ARCH_GET_GS: {
unsigned long base;

- if (doit)
- rdmsrl(MSR_KERNEL_GS_BASE, base);
- else
- base = task->thread.gsbase;
+ base = read_task_gsbase(task);
ret = put_user(base, (unsigned long __user *)arg2);
break;
}
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index ed5c4cd..b2f0beb 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -39,6 +39,7 @@
#include <asm/hw_breakpoint.h>
#include <asm/traps.h>
#include <asm/syscall.h>
+#include <asm/fsgsbase.h>

#include "tls.h"

@@ -396,12 +397,11 @@ static int putreg(struct task_struct *child,
if (value >= TASK_SIZE_MAX)
return -EIO;
/*
- * When changing the segment base, use do_arch_prctl_64
- * to set either thread.fs or thread.fsindex and the
- * corresponding GDT slot.
+ * When changing the FS base, use the same
+ * mechanism as for do_arch_prctl_64
*/
if (child->thread.fsbase != value)
- return do_arch_prctl_64(child, ARCH_SET_FS, value);
+ return write_task_fsbase(child, value);
return 0;
case offsetof(struct user_regs_struct,gs_base):
/*
@@ -410,7 +410,7 @@ static int putreg(struct task_struct *child,
if (value >= TASK_SIZE_MAX)
return -EIO;
if (child->thread.gsbase != value)
- return do_arch_prctl_64(child, ARCH_SET_GS, value);
+ return write_task_gsbase(child, value);
return 0;
#endif
}
@@ -434,20 +434,10 @@ static unsigned long getreg(struct task_struct *task, unsigned long offset)
return get_flags(task);

#ifdef CONFIG_X86_64
- case offsetof(struct user_regs_struct, fs_base): {
- /*
- * XXX: This will not behave as expected if called on
- * current or if fsindex != 0.
- */
- return task->thread.fsbase;
- }
- case offsetof(struct user_regs_struct, gs_base): {
- /*
- * XXX: This will not behave as expected if called on
- * current or if fsindex != 0.
- */
- return task->thread.gsbase;
- }
+ case offsetof(struct user_regs_struct, fs_base):
+ return read_task_fsbase(task);
+ case offsetof(struct user_regs_struct, gs_base):
+ return read_task_gsbase(task);
#endif
}

--
2.7.4


2018-05-31 18:01:41

by Chang S. Bae

Subject: [PATCH V2 14/15] x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2

From: Andi Kleen <[email protected]>

The kernel needs to explicitly enable FSGSBASE. So, the application
needs to know if it can safely use these instructions. Just looking
at the CPUID bit is not enough because it may be running in a kernel
that does not enable the instructions.

One way for the application would be to just try and catch the SIGILL.
But that is difficult to do in libraries which may not want
to overwrite the signal handlers of the main application.

So we need to provide a way for the application to discover the kernel
capability.

I used AT_HWCAP2 in the ELF aux vector, which is already used
by PPC for similar things. We define a new Linux-defined bitmap
returned in AT_HWCAP2. Next to the MONITOR/MWAIT bit, bit 1 is
reserved for the FSGSBASE capability check.

The application can then access it manually or via the
getauxval() function available in newer glibc.
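
A minimal sketch of such a check (HWCAP2_FSGSBASE is introduced
by this series and may not yet be in installed headers, hence
the fallback define):

    #include <sys/auxv.h>
    #include <elf.h>

    #ifndef HWCAP2_FSGSBASE
    #define HWCAP2_FSGSBASE (1 << 1)
    #endif

    if (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE) {
            /* safe to use RDFSBASE/WRFSBASE/RDGSBASE/WRGSBASE */
    }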

Signed-off-by: Andi Kleen <[email protected]>
[chang: Rebase and edit the patch note accordingly]
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/uapi/asm/hwcap2.h | 3 +++
arch/x86/kernel/cpu/common.c | 4 +++-
2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/hwcap2.h b/arch/x86/include/uapi/asm/hwcap2.h
index 6ebaae9..c5ce54e 100644
--- a/arch/x86/include/uapi/asm/hwcap2.h
+++ b/arch/x86/include/uapi/asm/hwcap2.h
@@ -5,4 +5,7 @@
/* MONITOR/MWAIT enabled in Ring 3 */
#define HWCAP2_RING3MWAIT (1 << 0)

+/* Kernel allows FSGSBASE instructions available in Ring 3 */
+#define HWCAP2_FSGSBASE (1 << 1)
+
#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0339bb3..5e6b46c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1303,8 +1303,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_umip(c);

/* Enable FSGSBASE instructions if available. */
- if (cpu_has(c, X86_FEATURE_FSGSBASE))
+ if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
cr4_set_bits(X86_CR4_FSGSBASE);
+ elf_hwcap2 |= HWCAP2_FSGSBASE;
+ }

/*
* The vendor-specific functions might have changed features.
--
2.7.4


2018-05-31 18:02:10

by Chang S. Bae

Subject: [PATCH V2 10/15] x86/fsgsbase/64: Preserve FS/GS state in __switch_to if FSGSBASE is on

From: Andy Lutomirski <[email protected]>

With the new FSGSBASE instructions, we can efficiently read and write
the FS and GS bases in __switch_to. Use that capability to preserve
the full state.

This will enable user code to do whatever it wants with the new
instructions without any kernel-induced gotchas. (There can still be
architectural gotchas: movl %gs,%eax; movl %eax,%gs may change GSBASE
if WRGSBASE was used, but users are expected to read the CPU manual
before doing things like that.)

This is a considerable speedup. It seems to save about 100 cycles
per context switch compared to the baseline 4.6-rc1 behavior on my
Skylake laptop.

Signed-off-by: Andy Lutomirski <[email protected]>
[chang: 5~10% performance improvement on context switch micro-
benchmark, when FS/GS base is actively switched.]
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/kernel/process_64.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 8ba947f..55af719 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -234,8 +234,18 @@ static __always_inline void save_fsgs(struct task_struct *task)
{
savesegment(fs, task->thread.fsindex);
savesegment(gs, task->thread.gsindex);
- save_base_legacy(task, task->thread.fsindex, FS);
- save_base_legacy(task, task->thread.gsindex, GS);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /*
+ * If FSGSBASE is enabled, we can't make any useful guesses
+ * about the base, and user code expects us to save the current
+ * value. Fortunately, reading the base directly is efficient.
+ */
+ task->thread.fsbase = rdfsbase();
+ task->thread.gsbase = rd_inactive_gsbase();
+ } else {
+ save_base_legacy(task, task->thread.fsindex, FS);
+ save_base_legacy(task, task->thread.gsindex, GS);
+ }
}

#if IS_ENABLED(CONFIG_KVM)
@@ -314,10 +324,22 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
static __always_inline void load_fsgs(struct thread_struct *prev,
struct thread_struct *next)
{
- load_seg_legacy(prev->fsindex, prev->fsbase,
- next->fsindex, next->fsbase, FS);
- load_seg_legacy(prev->gsindex, prev->gsbase,
- next->gsindex, next->gsbase, GS);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /* Update the FS and GS selectors if they could have changed. */
+ if (unlikely(prev->fsindex || next->fsindex))
+ loadseg(FS, next->fsindex);
+ if (unlikely(prev->gsindex || next->gsindex))
+ loadseg(GS, next->gsindex);
+
+ /* Update the bases. */
+ wrfsbase(next->fsbase);
+ wr_inactive_gsbase(next->gsbase);
+ } else {
+ load_seg_legacy(prev->fsindex, prev->fsbase,
+ next->fsindex, next->fsbase, FS);
+ load_seg_legacy(prev->gsindex, prev->gsbase,
+ next->gsindex, next->gsbase, GS);
+ }
}

static unsigned long task_seg_base(struct task_struct *task,
--
2.7.4


2018-05-31 18:02:21

by Chang S. Bae

Subject: [PATCH V2 13/15] x86/fsgsbase/64: Enable FSGSBASE by default and add a chicken bit

From: Andy Lutomirski <[email protected]>

Now that FSGSBASE is fully supported, remove unsafe_fsgsbase, enable
FSGSBASE by default, and add nofsgsbase to disable it.

Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 3 +--
arch/x86/kernel/cpu/common.c | 34 ++++++++++---------------
2 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b92c4a0..a62eeb3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2600,8 +2600,7 @@
emulation library even if a 387 maths coprocessor
is present.

- unsafe_fsgsbase [X86] Allow FSGSBASE instructions. This will be
- replaced with a nofsgsbase flag.
+ nofsgsbase [X86] Disables FSGSBASE instructions.

no_console_suspend
[HW] Never suspend the console
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 5412d62..0339bb3 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -355,23 +355,21 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
cr4_clear_bits(X86_CR4_UMIP);
}

-/*
- * Temporary hack: FSGSBASE is unsafe until a few kernel code paths are updated.
- * This allows us to get the kernel ready incrementally. Setting
- * unsafe_fsgsbase will allow the series to be bisected if necessary.
- *
- * Once all the pieces are in place, this will go away and be replaced with
- * a nofsgsbase chicken flag.
- */
-static bool unsafe_fsgsbase;
-
-static __init int setup_unsafe_fsgsbase(char *arg)
+static __init int x86_nofsgsbase_setup(char *arg)
{
- unsafe_fsgsbase = true;
- add_taint(TAINT_INSECURE, LOCKDEP_STILL_OK);
+ /* require an exact match without trailing characters */
+ if (strlen(arg))
+ return 0;
+
+ /* do not emit a message if the feature is not present */
+ if (!boot_cpu_has(X86_FEATURE_FSGSBASE))
+ return 1;
+
+ setup_clear_cpu_cap(X86_FEATURE_FSGSBASE);
+ pr_info("nofsgsbase: FSGSBASE disabled\n");
return 1;
}
-__setup("unsafe_fsgsbase", setup_unsafe_fsgsbase);
+__setup("nofsgsbase", x86_nofsgsbase_setup);

/*
* Protection Keys are not available in 32-bit mode.
@@ -1305,12 +1303,8 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_umip(c);

/* Enable FSGSBASE instructions if available. */
- if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
- if (unsafe_fsgsbase)
- cr4_set_bits(X86_CR4_FSGSBASE);
- else
- clear_cpu_cap(c, X86_FEATURE_FSGSBASE);
- }
+ if (cpu_has(c, X86_FEATURE_FSGSBASE))
+ cr4_set_bits(X86_CR4_FSGSBASE);

/*
* The vendor-specific functions might have changed features.
--
2.7.4


2018-05-31 18:02:32

by Chang S. Bae

Subject: [PATCH V2 11/15] x86/fsgsbase/64: When copying a thread, use FSGSBASE if enabled

When FSGSBASE is enabled, copy the real FS/GS base values
instead of an approximation.

Factoring this out into save_fsgs() would not yield exactly
the same behavior, because save_base_legacy() does not copy
the FS/GS base when the index is zero.

Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/kernel/process_64.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 55af719..8f9f2f9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -505,10 +505,16 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
p->thread.sp = (unsigned long) fork_frame;
p->thread.io_bitmap_ptr = NULL;

- savesegment(gs, p->thread.gsindex);
- p->thread.gsbase = p->thread.gsindex ? 0 : me->thread.gsbase;
savesegment(fs, p->thread.fsindex);
- p->thread.fsbase = p->thread.fsindex ? 0 : me->thread.fsbase;
+ savesegment(gs, p->thread.gsindex);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ p->thread.fsbase = rdfsbase();
+ p->thread.gsbase = rd_inactive_gsbase();
+ } else {
+ /* save_base_legacy() does not set base when index is zero. */
+ p->thread.fsbase = p->thread.fsindex ? 0 : me->thread.fsbase;
+ p->thread.gsbase = p->thread.gsindex ? 0 : me->thread.gsbase;
+ }
savesegment(es, p->thread.es);
savesegment(ds, p->thread.ds);
memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
--
2.7.4


2018-05-31 18:02:57

by Chang S. Bae

Subject: [PATCH V2 02/15] x86/fsgsbase/64: Make ptrace read FS/GS base accurately

From: Andy Lutomirski <[email protected]>

ptrace can read FS/GS base using the register access API
(PTRACE_PEEKUSER, etc) or PTRACE_ARCH_PRCTL. Make both of these
mechanisms return the actual FS/GS base.

This improves debuggability by providing the correct
information to ptracers such as GDB.
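
For illustration, a debugger can read the FS base of a stopped
tracee like this (a minimal sketch; `pid` is assumed to be an
attached, stopped tracee):

    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <stddef.h>
    #include <errno.h>

    errno = 0;
    unsigned long fsbase = ptrace(PTRACE_PEEKUSER, pid,
                                  offsetof(struct user_regs_struct, fs_base), 0);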

Signed-off-by: Andy Lutomirski <[email protected]>
[chang: Rebase and revise patch description]
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/kernel/process_64.c | 59 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 49 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 5506e1b..e498671 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -279,6 +279,49 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
}
}

+static unsigned long task_seg_base(struct task_struct *task,
+ unsigned short selector)
+{
+ unsigned short idx = selector >> 3;
+ unsigned long base;
+
+ if (likely((selector & SEGMENT_TI_MASK) == 0)) {
+ if (unlikely(idx >= GDT_ENTRIES))
+ return 0;
+
+ /*
+ * There are no user segments in the GDT with nonzero bases
+ * other than the TLS segments.
+ */
+ if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX)
+ return 0;
+
+ idx -= GDT_ENTRY_TLS_MIN;
+ base = get_desc_base(&task->thread.tls_array[idx]);
+ } else {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ struct ldt_struct *ldt;
+
+ /*
+ * If performance here mattered, we could protect the LDT
+ * with RCU. This is a slow path, though, so we can just
+ * take the mutex.
+ */
+ mutex_lock(&task->mm->context.lock);
+ ldt = task->mm->context.ldt;
+ if (unlikely(idx >= ldt->nr_entries))
+ base = 0;
+ else
+ base = get_desc_base(ldt->entries + idx);
+ mutex_unlock(&task->mm->context.lock);
+#else
+ base = 0;
+#endif
+ }
+
+ return base;
+}
+
void write_fsbase(unsigned long fsbase)
{
/* set the selector to 0 to not confuse __switch_to */
@@ -299,12 +342,10 @@ unsigned long read_task_fsbase(struct task_struct *task)

if (task == current)
fsbase = read_fsbase();
- else
- /*
- * XXX: This will not behave as expected if called
- * if fsindex != 0
- */
+ else if (task->thread.fsindex == 0)
fsbase = task->thread.fsbase;
+ else
+ fsbase = task_seg_base(task, task->thread.fsindex);

return fsbase;
}
@@ -315,12 +356,10 @@ unsigned long read_task_gsbase(struct task_struct *task)

if (task == current)
gsbase = read_inactive_gsbase();
- else
- /*
- * XXX: This will not behave as expected if called
- * if gsindex != 0
- */
+ else if (task->thread.gsindex == 0)
gsbase = task->thread.gsbase;
+ else
+ gsbase = task_seg_base(task, task->thread.gsindex);

return gsbase;
}
--
2.7.4


2018-05-31 18:03:15

by Chang S. Bae

Subject: [PATCH V2 07/15] x86/fsgsbase/64: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE

From: Andy Lutomirski <[email protected]>

This is temporary. It will allow the next few patches to be tested
incrementally.

Setting unsafe_fsgsbase is a root hole. Don't do it.

Signed-off-by: Andy Lutomirski <[email protected]>
[chang: Fix the deactivated flag. Add TAINT_INSECURE flag.]
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 3 +++
arch/x86/kernel/cpu/common.c | 26 +++++++++++++++++++++++++
2 files changed, 29 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f2040d4..b92c4a0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2600,6 +2600,9 @@
emulation library even if a 387 maths coprocessor
is present.

+ unsafe_fsgsbase [X86] Allow FSGSBASE instructions. This will be
+ replaced with a nofsgsbase flag.
+
no_console_suspend
[HW] Never suspend the console
Disable suspending of consoles during suspend and
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0be333f..5412d62 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -356,6 +356,24 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
}

/*
+ * Temporary hack: FSGSBASE is unsafe until a few kernel code paths are updated.
+ * This allows us to get the kernel ready incrementally. Setting
+ * unsafe_fsgsbase will allow the series to be bisected if necessary.
+ *
+ * Once all the pieces are in place, this will go away and be replaced with
+ * a nofsgsbase chicken flag.
+ */
+static bool unsafe_fsgsbase;
+
+static __init int setup_unsafe_fsgsbase(char *arg)
+{
+ unsafe_fsgsbase = true;
+ add_taint(TAINT_INSECURE, LOCKDEP_STILL_OK);
+ return 1;
+}
+__setup("unsafe_fsgsbase", setup_unsafe_fsgsbase);
+
+/*
* Protection Keys are not available in 32-bit mode.
*/
static bool pku_disabled;
@@ -1286,6 +1304,14 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_smap(c);
setup_umip(c);

+ /* Enable FSGSBASE instructions if available. */
+ if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
+ if (unsafe_fsgsbase)
+ cr4_set_bits(X86_CR4_FSGSBASE);
+ else
+ clear_cpu_cap(c, X86_FEATURE_FSGSBASE);
+ }
+
/*
* The vendor-specific functions might have changed features.
* Now we do "generic changes."
--
2.7.4


2018-05-31 18:03:37

by Chang S. Bae

Subject: [PATCH V2 08/15] x86/fsgsbase/64: Add intrinsics/macros for FSGSBASE instructions

From: Andi Kleen <[email protected]>

Add C intrinsics and assembler macros for the new RD/WR
FS/GS BASE instructions.

Very straightforward. Used in follow-on patches.
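
For reference, user space has equivalent compiler builtins.
This sketch is not part of the patch; it assumes -mfsgsbase
and a kernel that has set CR4.FSGSBASE:

    #include <immintrin.h>

    unsigned long gsbase = _readgsbase_u64();  /* RDGSBASE */
    _writegsbase_u64(gsbase);                  /* WRGSBASE */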

[luto: rename the variables from fs and gs to fsbase and gsbase and
make fsgs.h safe to include on 32-bit kernels.]

v2: Use __always_inline

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
[chang: Replace the new instruction macros with GAS-compatible
ones and rename them. Note: if GCC supports it, we can add
-mfsgsbase to CFLAGS and use the builtins here for extra
performance.]
Signed-off-by: Chang S. Bae <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 70 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 70 insertions(+)

diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index 0d4fbef..ed42015 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -18,6 +18,42 @@ unsigned long read_task_gsbase(struct task_struct *task);
int write_task_fsbase(struct task_struct *task, unsigned long fsbase);
int write_task_gsbase(struct task_struct *task, unsigned long gsbase);

+/* Must be protected by X86_FEATURE_FSGSBASE check. */
+
+static __always_inline unsigned long rdfsbase(void)
+{
+ unsigned long fsbase;
+
+ asm volatile(".byte 0xf3, 0x48, 0x0f, 0xae, 0xc0 # rdfsbaseq %%rax"
+ : "=a" (fsbase)
+ :: "memory");
+ return fsbase;
+}
+
+static __always_inline unsigned long rdgsbase(void)
+{
+ unsigned long gsbase;
+
+ asm volatile(".byte 0xf3, 0x48, 0x0f, 0xae, 0xc8 # rdgsbaseq %%rax;"
+ : "=a" (gsbase)
+ :: "memory");
+ return gsbase;
+}
+
+static __always_inline void wrfsbase(unsigned long fsbase)
+{
+ asm volatile(".byte 0xf3, 0x48, 0x0f, 0xae, 0xd0 # wrfsbaseq %%rax"
+ :: "a" (fsbase)
+ : "memory");
+}
+
+static __always_inline void wrgsbase(unsigned long gsbase)
+{
+ asm volatile(".byte 0xf3, 0x48, 0x0f, 0xae, 0xd8 # wrgsbaseq %%rax;"
+ :: "a" (gsbase)
+ : "memory");
+}
+
/* Helper functions for reading/writing FS/GS base */

static inline unsigned long read_fsbase(void)
@@ -42,6 +78,40 @@ void write_inactive_gsbase(unsigned long gsbase);

#endif /* CONFIG_X86_64 */

+#else /* __ASSEMBLY__ */
+
+#ifdef CONFIG_X86_64
+
+#include <asm/inst.h>
+
+.macro RDGSBASE opd
+ REG_TYPE rdgsbase_opd_type \opd
+ .if rdgsbase_opd_type == REG_TYPE_R64
+ R64_NUM rdgsbase_opd \opd
+ .byte 0xf3
+ PFX_REX rdgsbase_opd 0 W = 1
+ .else
+ .error "RDGSBASE: only for 64-bit value"
+ .endif
+ .byte 0xf, 0xae
+ MODRM 0xc0 rdgsbase_opd 1
+.endm
+
+.macro WRGSBASE opd
+ REG_TYPE wrgsbase_opd_type \opd
+ .if wrgsbase_opd_type == REG_TYPE_R64
+ R64_NUM wrgsbase_opd \opd
+ .byte 0xf3
+ PFX_REX wrgsbase_opd 0 W = 1
+ .else
+ .error "WRGSBASE: only for 64-bit value"
+ .endif
+ .byte 0xf, 0xae
+ MODRM 0xd0 wrgsbase_opd 1
+.endm
+
+#endif /* CONFIG_X86_64 */
+
#endif /* __ASSEMBLY__ */

#endif /* _ASM_FSGSBASE_H */
--
2.7.4


2018-05-31 18:03:52

by Chang S. Bae

Subject: [PATCH V2 09/15] x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions

The helper functions switch to the faster FS/GS access path
when FSGSBASE is enabled.

Accessing the user GS base takes a couple of SWAPGS
operations. This would be avoidable if the user GS base were
copied at kernel entry, updated as it changes, and written
back to the GS base at kernel exit. However, that costs more
cycles; the measured overhead (almost) offset the benefit.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 17 ++++------
arch/x86/kernel/process_64.c | 75 +++++++++++++++++++++++++++++++++++------
2 files changed, 72 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index ed42015..903c7a0 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -54,26 +54,23 @@ static __always_inline void wrgsbase(unsigned long gsbase)
: "memory");
}

+#include <asm/cpufeature.h>
+
/* Helper functions for reading/writing FS/GS base */

static inline unsigned long read_fsbase(void)
{
unsigned long fsbase;

- rdmsrl(MSR_FS_BASE, fsbase);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE))
+ fsbase = rdfsbase();
+ else
+ rdmsrl(MSR_FS_BASE, fsbase);
return fsbase;
}

void write_fsbase(unsigned long fsbase);
-
-static inline unsigned long read_inactive_gsbase(void)
-{
- unsigned long gsbase;
-
- rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
- return gsbase;
-}
-
+unsigned long read_inactive_gsbase(void);
void write_inactive_gsbase(unsigned long gsbase);

#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index cebf240..8ba947f 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -154,6 +154,38 @@ enum which_selector {
};

/*
+ * Interrupts are disabled here.
+ * Out of line to be protected from kprobes.
+ */
+static noinline __kprobes unsigned long rd_inactive_gsbase(void)
+{
+ unsigned long gsbase, flags;
+
+ local_irq_save(flags);
+ native_swapgs();
+ gsbase = rdgsbase();
+ native_swapgs();
+ local_irq_restore(flags);
+
+ return gsbase;
+}
+
+/*
+ * Interrupts are disabled here.
+ * Out of line to be protected from kprobes.
+ */
+static noinline __kprobes void wr_inactive_gsbase(unsigned long gsbase)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ native_swapgs();
+ wrgsbase(gsbase);
+ native_swapgs();
+ local_irq_restore(flags);
+}
+
+/*
* Saves the FS or GS base for an outgoing thread if FSGSBASE extensions are
* not available. The goal is to be reasonably fast on non-FSGSBASE systems.
* It's forcibly inlined because it'll generate better code and this function
@@ -333,16 +365,35 @@ static unsigned long task_seg_base(struct task_struct *task,

void write_fsbase(unsigned long fsbase)
{
- /* set the selector to 0 to not confuse __switch_to */
- loadseg(FS, 0);
- wrmsrl(MSR_FS_BASE, fsbase);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ wrfsbase(fsbase);
+ } else {
+ /* set the selector to 0 to not confuse __switch_to */
+ loadseg(FS, 0);
+ wrmsrl(MSR_FS_BASE, fsbase);
+ }
+}
+
+unsigned long read_inactive_gsbase(void)
+{
+ unsigned long gsbase;
+
+ if (static_cpu_has(X86_FEATURE_FSGSBASE))
+ gsbase = rd_inactive_gsbase();
+ else
+ rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
+ return gsbase;
}

void write_inactive_gsbase(unsigned long gsbase)
{
- /* set the selector to 0 to not confuse __switch_to */
- loadseg(GS, 0);
- wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ wr_inactive_gsbase(gsbase);
+ } else {
+ /* set the selector to 0 to not confuse __switch_to */
+ loadseg(GS, 0);
+ wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
+ }
}

unsigned long read_task_fsbase(struct task_struct *task)
@@ -351,7 +402,8 @@ unsigned long read_task_fsbase(struct task_struct *task)

if (task == current)
fsbase = read_fsbase();
- else if (task->thread.fsindex == 0)
+ else if (static_cpu_has(X86_FEATURE_FSGSBASE) ||
+ (task->thread.fsindex == 0))
fsbase = task->thread.fsbase;
else
fsbase = task_seg_base(task, task->thread.fsindex);
@@ -365,7 +417,8 @@ unsigned long read_task_gsbase(struct task_struct *task)

if (task == current)
gsbase = read_inactive_gsbase();
- else if (task->thread.gsindex == 0)
+ else if (static_cpu_has(X86_FEATURE_FSGSBASE) ||
+ (task->thread.gsindex == 0))
gsbase = task->thread.gsbase;
else
gsbase = task_seg_base(task, task->thread.gsindex);
@@ -388,7 +441,8 @@ int write_task_fsbase(struct task_struct *task, unsigned long fsbase)
task->thread.fsbase = fsbase;
if (task == current)
write_fsbase(fsbase);
- task->thread.fsindex = 0;
+ if (!static_cpu_has(X86_FEATURE_FSGSBASE))
+ task->thread.fsindex = 0;
put_cpu();

return 0;
@@ -405,7 +459,8 @@ int write_task_gsbase(struct task_struct *task, unsigned long gsbase)
task->thread.gsbase = gsbase;
if (task == current)
write_inactive_gsbase(gsbase);
- task->thread.gsindex = 0;
+ if (!static_cpu_has(X86_FEATURE_FSGSBASE))
+ task->thread.gsindex = 0;
put_cpu();

return 0;
--
2.7.4


2018-05-31 18:04:25

by Chang S. Bae

Subject: [PATCH V2 05/15] x86/vdso: Move out the CPU number store

The CPU (and node) number is now written, early enough, to
the segment limit of the per-CPU data entry and to the
TSC_AUX MSR. This information has been retrieved by vgetcpu
in user space and, when FSGSBASE is enabled, will also be
loaded on the paranoid entry. So, the store is moved out of
the vDSO into the CPU initialization path, where the IST
setup is serialized; see the sketch after this description.

With that, the redundant setup of the segment in
entry/vdso/vma.c is removed; a substantial code removal. It
drops a hotplug notifier, makes a facility that is useful to
both the kernel and userspace unconditionally available much
sooner, and unifies the code with i386. (Thanks to HPA for
suggesting the cleanup.)
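
For illustration, user space decodes the stored value like
this (a minimal sketch of what __vdso_getcpu() does with the
result of LSL, or of RDTSCP's TSC_AUX):

    unsigned int p = __getcpu();               /* LSL on the per-CPU GDT entry */
    unsigned int cpu  = p & PERCPU_CPU_MASK;   /* low 12 bits */
    unsigned int node = p >> PERCPU_CPU_SIZE;  /* remaining upper bits */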

Signed-off-by: Chang S. Bae <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/entry/vdso/vgetcpu.c | 2 +-
arch/x86/entry/vdso/vma.c | 38 +-------------------------------------
arch/x86/include/asm/segment.h | 4 ++++
arch/x86/include/asm/vgtod.h | 2 --
arch/x86/kernel/cpu/common.c | 17 +++++++++++++++++
arch/x86/kernel/setup_percpu.c | 30 ++++++++++++++++++++++++++----
6 files changed, 49 insertions(+), 44 deletions(-)

diff --git a/arch/x86/entry/vdso/vgetcpu.c b/arch/x86/entry/vdso/vgetcpu.c
index 8ec3d1f..1373281 100644
--- a/arch/x86/entry/vdso/vgetcpu.c
+++ b/arch/x86/entry/vdso/vgetcpu.c
@@ -18,7 +18,7 @@ __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
p = __getcpu();

if (cpu)
- *cpu = p & VGETCPU_CPU_MASK;
+ *cpu = p & PERCPU_CPU_MASK;
if (node)
*node = p >> 12;
return 0;
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 5b8b556..3f9d43f 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -332,40 +332,6 @@ static __init int vdso_setup(char *s)
return 0;
}
__setup("vdso=", vdso_setup);
-#endif
-
-#ifdef CONFIG_X86_64
-static void vgetcpu_cpu_init(void *arg)
-{
- int cpu = smp_processor_id();
- struct desc_struct d = { };
- unsigned long node = 0;
-#ifdef CONFIG_NUMA
- node = cpu_to_node(cpu);
-#endif
- if (static_cpu_has(X86_FEATURE_RDTSCP))
- write_rdtscp_aux((node << 12) | cpu);
-
- /*
- * Store cpu number in limit so that it can be loaded
- * quickly in user space in vgetcpu. (12 bits for the CPU
- * and 8 bits for the node)
- */
- d.limit0 = cpu | ((node & 0xf) << 12);
- d.limit1 = node >> 4;
- d.type = 5; /* RO data, expand down, accessed */
- d.dpl = 3; /* Visible to user code */
- d.s = 1; /* Not a system segment */
- d.p = 1; /* Present */
- d.d = 1; /* 32-bit */
-
- write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_PER_CPU, &d, DESCTYPE_S);
-}
-
-static int vgetcpu_online(unsigned int cpu)
-{
- return smp_call_function_single(cpu, vgetcpu_cpu_init, NULL, 1);
-}

static int __init init_vdso(void)
{
@@ -375,9 +341,7 @@ static int __init init_vdso(void)
init_vdso_image(&vdso_image_x32);
#endif

- /* notifier priority > KVM */
- return cpuhp_setup_state(CPUHP_AP_X86_VDSO_VMA_ONLINE,
- "x86/vdso/vma:online", vgetcpu_online, NULL);
+ return 0;
}
subsys_initcall(init_vdso);
#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index 8f09012b..648d301 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -225,6 +225,10 @@
#define GDT_ENTRY_TLS_ENTRIES 3
#define TLS_SIZE (GDT_ENTRY_TLS_ENTRIES* 8)

+/* Bit size and mask of CPU number stored in the per CPU data */
+#define PERCPU_CPU_SIZE 12
+#define PERCPU_CPU_MASK 0xfff
+
#ifdef __KERNEL__

/*
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index fb856c9..1cc9d30 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -79,8 +79,6 @@ static inline void gtod_write_end(struct vsyscall_gtod_data *s)

#ifdef CONFIG_X86_64

-#define VGETCPU_CPU_MASK 0xfff
-
static inline unsigned int __getcpu(void)
{
unsigned int p;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 38276f5..0be333f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1665,6 +1665,23 @@ void cpu_init(void)

wrmsrl(MSR_FS_BASE, 0);
wrmsrl(MSR_KERNEL_GS_BASE, 0);
+
+ if (static_cpu_has(X86_FEATURE_RDTSCP)) {
+ unsigned long node = 0;
+
+#ifdef CONFIG_NUMA
+ node = early_cpu_to_node(cpu);
+#endif
+
+ /*
+ * Store the CPU number in TSC_AUX (12 bits for the CPU
+ * and the remaining upper bits for the node number).
+ * It will be loaded in user space by vgetcpu (vDSO)
+ * and on the paranoid entry path to find the per-CPU base.
+ */
+ write_rdtscp_aux((node << PERCPU_CPU_SIZE) | cpu);
+ }
+
barrier();

x86_configure_nx();
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index ea554f8..b26202e 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -155,12 +155,34 @@ static void __init pcpup_populate_pte(unsigned long addr)

static inline void setup_percpu_segment(int cpu)
{
-#ifdef CONFIG_X86_32
- struct desc_struct d = GDT_ENTRY_INIT(0x8092, per_cpu_offset(cpu),
- 0xFFFFF);
+ unsigned long node = 0;
+ struct desc_struct d = { };
+
+#ifdef CONFIG_NUMA
+ node = early_cpu_to_node(cpu);
+#endif

- write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_PERCPU, &d, DESCTYPE_S);
+ /*
+ * Store the CPU number in the segment limit
+ * (12 bits for the CPU and 8 bits for the node number).
+ * It will be loaded in user space by vgetcpu (vDSO)
+ * and on the paranoid entry path to find the per-CPU base.
+ */
+ d.limit0 = cpu | ((node & 0xf) << PERCPU_CPU_SIZE);
+ d.limit1 = node >> 4;
+ d.type = 5; /* RO data, expand down, accessed */
+ d.dpl = 3; /* Visible to user code */
+ d.s = 1; /* Not a system segment */
+ d.p = 1; /* Present */
+ d.d = 1; /* 32-bit */
+
+ write_gdt_entry(get_cpu_gdt_rw(cpu),
+#ifdef CONFIG_X86_32
+ GDT_ENTRY_PERCPU,
+#else /* 64 bit */
+ GDT_ENTRY_PER_CPU,
#endif
+ &d, DESCTYPE_S);
}

void __init setup_per_cpu_areas(void)
--
2.7.4


2018-05-31 18:04:50

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH V2 06/15] taint: Add taint for insecure

When adding support for a new feature, patches need to be
applied and tested incrementally, with temporary parameters.
For such testing (or root-only) purposes, the new flag
serves to tag the kernel's taint state properly.
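
As a hedged illustration (my sketch, not part of this patch),
a feature patch elsewhere in the series could mark such a
configuration through the existing add_taint() interface, for
example when the 'unsafe_fsgsbase' chicken bit is used:

	/*
	 * Hypothetical call site; 'unsafe_fsgsbase' is the boot
	 * parameter proposed later in this series.
	 */
	if (unsafe_fsgsbase)
		add_taint(TAINT_INSECURE, LOCKDEP_STILL_OK);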

Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
Documentation/sysctl/kernel.txt | 1 +
include/linux/kernel.h | 3 ++-
kernel/panic.c | 1 +
3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index eded671d..06c4009 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -992,6 +992,7 @@ ORed together. The letters are seen in "Tainted" line of Oops reports.
32768 (K): The kernel has been live patched.
65536 (X): Auxiliary taint, defined for and used by distros.
131072 (T): The kernel was built with the struct randomization plugin.
+262144 (Z): The kernel is running in a known insecure configuration.

==============================================================

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 6a1eb0b..7051415 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -563,7 +563,8 @@ extern enum system_states {
#define TAINT_LIVEPATCH 15
#define TAINT_AUX 16
#define TAINT_RANDSTRUCT 17
-#define TAINT_FLAGS_COUNT 18
+#define TAINT_INSECURE 18
+#define TAINT_FLAGS_COUNT 19

struct taint_flag {
char c_true; /* character printed when tainted */
diff --git a/kernel/panic.c b/kernel/panic.c
index 42e4874..53ea36a 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -327,6 +327,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
[ TAINT_LIVEPATCH ] = { 'K', ' ', true },
[ TAINT_AUX ] = { 'X', ' ', true },
[ TAINT_RANDSTRUCT ] = { 'T', ' ', true },
+ [ TAINT_INSECURE ] = { 'Z', ' ', false },
};

/**
--
2.7.4


2018-05-31 20:15:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH V2 01/15] x86/fsgsbase/64: Introduce FS/GS base helper functions

On Thu, May 31, 2018 at 10:58 AM Chang S. Bae <[email protected]> wrote:
>
> With the new helpers, FS/GS base access is centralized.
> Eventually, when the FSGSBASE instructions are enabled, it
> will be faster.
>
> The helpers are used in the ptrace APIs (PTRACE_ARCH_PRCTL,
> PTRACE_SETREG, PTRACE_GETREG, etc.). The idea is to keep
> the FS/GS-update mechanism organized.
>
> The notions of "active" and "inactive" are used to
> distinguish GS bases between "kernel" and "user". The
> "inactive" GS base is the GS base of the inactive (user)
> task, backed up at kernel entry.

I'm fine with the code, but the changelog entry is confusing. A bunch
of the active helpers don't contain the term "active".

> +/*
> + * Read/write an (inactive) task's fsbase or gsbase. This returns
> + * the value that the FS/GS base would have (if the task were to be
> + * resumed). The current task is also supported.
> + */

Please change to "Read/write a task's fsbase or gsbase. ... These work
on current or on a different non-running task."

> +
> +unsigned long read_task_fsbase(struct task_struct *task)
> +{
> + unsigned long fsbase;
> +
> + if (task == current)
> + fsbase = read_fsbase();
> + else
> + /*
> + * XXX: This will not behave as expected if called
> + * if fsindex != 0
> + */
> + fsbase = task->thread.fsbase;
> +

Please put braces around the if and else blocks whenever either of
them spans multiple lines. Also, maybe change add a note to the
comment and/or the changelog that this is preserving an existing bug
and that it's fixed later in the series.
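
For clarity, the braced form being asked for would look
roughly like this (a sketch, with the comment reworded per
the note above):

unsigned long read_task_fsbase(struct task_struct *task)
{
	unsigned long fsbase;

	if (task == current) {
		fsbase = read_fsbase();
	} else {
		/*
		 * XXX: This will not behave as expected if called
		 * with fsindex != 0. This preserves an existing
		 * bug that is fixed later in the series.
		 */
		fsbase = task->thread.fsbase;
	}

	return fsbase;
}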

2018-05-31 20:16:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH V2 02/15] x86/fsgsbase/64: Make ptrace read FS/GS base accurately

On Thu, May 31, 2018 at 10:58 AM Chang S. Bae <[email protected]> wrote:
>
> From: Andy Lutomirski <[email protected]>
>
> ptrace can read FS/GS base using the register access API
> (PTRACE_PEEKUSER, etc) or PTRACE_ARCH_PRCTL. Make both of these
> mechanisms return the actual FS/GS base.
>
> This will improve debuggability by providing the correct information
> to ptracer (GDB and etc).

LGTM.

2018-05-31 20:17:01

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH V2 03/15] x86/fsgsbase/64: Use FS/GS base helpers in core dump

On Thu, May 31, 2018 at 10:58 AM Chang S. Bae <[email protected]> wrote:
>
> When new FSGSBASE instructions enabled, this read will be
> switched to be faster.

s/switched to be/become/

Otherwise:
Reviewed-by: Andy Lutomirski <[email protected]>

2018-05-31 20:17:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH V2 04/15] x86/fsgsbase/64: Factor out load FS/GS segments from __switch_to

On Thu, May 31, 2018 at 10:58 AM Chang S. Bae <[email protected]> wrote:
>
> Instead of open-coding, load_fsgs() will clean up
> __switch_to() and be symmetric with the FS/GS segment save.
> When FSGSBASE is enabled, an X86_FEATURE_FSGSBASE check
> will be incorporated.

Reviewed-by: Andy Lutomirski <[email protected]>

2018-05-31 20:26:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH V2 06/15] taint: Add taint for insecure

On Thu, May 31, 2018 at 10:58 AM Chang S. Bae <[email protected]> wrote:
>
> When adding support for a new feature, patches need to be
> applied and tested incrementally, with temporary parameters.
> For such testing (or root-only) purposes, the new flag
> serves to tag the kernel's taint state properly.

I'm okay with this, I guess, but I'm not at all convinced we need it.

2018-05-31 20:27:39

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH V2 05/15] x86/vdso: Move out the CPU number store

On Thu, May 31, 2018 at 10:58 AM Chang S. Bae <[email protected]> wrote:
>
> The CPU (and node) number will be written, early enough, to
> the segment limit of the per-CPU data entry and to the
> TSC_AUX MSR. This information is retrieved by vgetcpu in
> user space and will also be loaded on the paranoid entry
> path once FSGSBASE is enabled. The store is therefore moved
> out of the vDSO and into the CPU initialization path, where
> IST setup is serialized.
>
> The now-redundant segment setup in entry/vdso/vma.c is
> removed, a substantial cleanup: it drops a hotplug notifier,
> makes a facility useful to both the kernel and userspace
> unconditionally available much sooner, and unifies the code
> with i386. (Thanks to HPA for suggesting the cleanup.)
>
> Signed-off-by: Chang S. Bae <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>

> +/* Bit size and mask of CPU number stored in the per CPU data */
> +#define PERCPU_CPU_SIZE 12
> +#define PERCPU_CPU_MASK 0xfff

This name is confusing. Maybe LSL_TSCP_CPU_MASK?

Can you also add a helpers like:

static inline unsigned long make_lsl_tscp(unsigned int cpu, unsigned
int node) { ... }
static inline unsigned int lsl_tscp_to_cpu(unsigned long x) { ... }
static inline unsigned int lsl_tscp_to_node(unsigned long x) { ... }

and use them everywhere? This will make it lot easier to understand the code.
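
A sketch of what those helpers might look like, assuming the
12-bit CPU field from the patch (the LSL_TSCP_* names here
just follow the suggestion and are hypothetical):

#define LSL_TSCP_CPU_SIZE 12
#define LSL_TSCP_CPU_MASK 0xfff

static inline unsigned long make_lsl_tscp(unsigned int cpu,
					  unsigned int node)
{
	/* Pack node above the 12-bit CPU field */
	return ((unsigned long)node << LSL_TSCP_CPU_SIZE) | cpu;
}

static inline unsigned int lsl_tscp_to_cpu(unsigned long x)
{
	return x & LSL_TSCP_CPU_MASK;
}

static inline unsigned int lsl_tscp_to_node(unsigned long x)
{
	return x >> LSL_TSCP_CPU_SIZE;
}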

> +#ifdef CONFIG_X86_32
> + GDT_ENTRY_PERCPU,
> +#else /* 64 bit */
> + GDT_ENTRY_PER_CPU,
> #endif

Please just rename one of these to match the other one.

2018-05-31 20:33:07

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH V2 02/15] x86/fsgsbase/64: Make ptrace read FS/GS base accurately

On May 31, 2018 1:14:59 PM PDT, Andy Lutomirski <[email protected]> wrote:
>On Thu, May 31, 2018 at 10:58 AM Chang S. Bae
><[email protected]> wrote:
>>
>> From: Andy Lutomirski <[email protected]>
>>
>> ptrace can read FS/GS base using the register access API
>> (PTRACE_PEEKUSER, etc) or PTRACE_ARCH_PRCTL. Make both of these
>> mechanisms return the actual FS/GS base.
>>
>> This will improve debuggability by providing the correct information
>> to ptracer (GDB and etc).
>
>LGTM.

LGTM?
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2018-05-31 20:40:05

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH V2 00/15] x86: Enable FSGSBASE instructions

On Thu, May 31, 2018 at 10:58 AM Chang S. Bae <[email protected]> wrote:
>
> FSGSBASE is 64-bit instruction set to allow read/write
> FS/GS base from any privilege. As introduced from
> Ivybridge, enabling effort has been revolving quite long
> [2,3,4] for various reasons. After extended discussions [1],
> this patchset is proposed to introduce new ABIs of
> customizing FS/GS base (separate from its selector).


Thanks!

I have a few general comments:

1. Can you try and generate a new version of patches 1-5 quickly? I
think it would be nice to get them merged this cycle.

2. I spoke to hpa, and he said that, after further investigation of
how gdb works, a command like 'p $gs = 0x7' results in
PTRACE_POKEUSER. He further suggested that it would therefore be
reasonable to have POKEUSER on gs refresh gsindex (assuming the poked
value is nonzero, sigh) and to make PTRACE_SETREGS iterate over the
registers in reverse order so that it behaves sanely. Is this indeed
the case?

3. The ptrace behavior is sufficiently subtle that I think it needs a
test case. Can you add a new selftest (or extend the existing
fsgsbase selftest) to do something like this:

- Create an LDT entry in slot zero with base == 1.
- Read out the hwcap bit indicating whether we have the new instructions on.
- MOV 0x7 to %gs and use ptrace to read gsbase. Confirm that the result is 1.
- MOV 0x7 to %gs, do wrgsbase to change the base to 2 (if supported),
and use ptrace to read gsbase. Confirm that the result is 2.
- Same as previous test, but with 0x0 instead of 0x7.
- Allocate a TLS segment with base == 3. Load it into %gs. Use
ptrace to read gsbase. Confirm that the result is 3.
- Use ptrace to toggle %gs (using POKEUSER) back and forth between
0x0, 0x7, and the TLS segment. In each case, immediately use ptrace
to read the base and confirm that you get the expected result. Then
resume the tracee and read the base directly, confirming that you get
the expected result.
- Use PTRACE_SETREGS to load gs = 0, gsbase = 4. Confirm that
GETREGS returns those values back and confirm that they are in fact
loaded into the tracee.
- Use PTRACE_SETREGS to load gs = 0x7, gsbase = 4. Confirm that
GETREGS returns those values back and confirm that they have the
expected values (which will depend on the hwcap bit). Also confirm
that the expected values are loaded into the tracee.

Does this seem reasonable? The mov_ss_trap testcase has a nice bit of
code you can borrow to invoke ptrace operations on yourself.
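
For reference, one building block such a selftest would need
(a sketch under the assumption of an x86-64 tracee already
stopped under ptrace; the gs_base register offset comes from
the user_regs_struct layout in <sys/user.h>):

#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/*
 * Read the tracee's GS base via PTRACE_PEEKUSER. struct user
 * places the register file at offset 0, so the offset into
 * user_regs_struct can be used directly.
 */
static long peek_gsbase(pid_t tracee)
{
	return ptrace(PTRACE_PEEKUSER, tracee,
		      offsetof(struct user_regs_struct, gs_base),
		      NULL);
}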

2018-05-31 20:40:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH V2 02/15] x86/fsgsbase/64: Make ptrace read FS/GS base accurately

On Thu, May 31, 2018 at 1:31 PM <[email protected]> wrote:
>
> On May 31, 2018 1:14:59 PM PDT, Andy Lutomirski <[email protected]> wrote:
> >On Thu, May 31, 2018 at 10:58 AM Chang S. Bae
> ><[email protected]> wrote:
> >>
> >> From: Andy Lutomirski <[email protected]>
> >>
> >> ptrace can read FS/GS base using the register access API
> >> (PTRACE_PEEKUSER, etc) or PTRACE_ARCH_PRCTL. Make both of these
> >> mechanisms return the actual FS/GS base.
> >>
> >> This will improve debuggability by providing the correct information
> >> to ptracer (GDB and etc).
> >
> >LGTM.
>
> LGTM?

Looks good to me.

2018-05-31 20:53:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH V2 06/15] taint: Add taint for insecure

On May 31, 2018 1:25:39 PM PDT, Andy Lutomirski <[email protected]> wrote:
>On Thu, May 31, 2018 at 10:58 AM Chang S. Bae
><[email protected]> wrote:
>>
>> When adding support for a new feature, patches need to be
>> applied and tested incrementally, with temporary parameters.
>> For such testing (or root-only) purposes, the new flag
>> serves to tag the kernel's taint state properly.
>
>I'm okay with this, I guess, but I'm not at all convinced we need it.

This was my idea. It isn't the only thing that may want it, and I think it is critical that we give the system a way to flag that it contains experimental code known to break security. Sometimes that kind of experimental code is useful (I have written some myself, e.g. to test SMAP), but it is a good idea to be able to flag such a kernel permanently, even if the code is in a module that can be removed.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2018-05-31 21:04:07

by Chang S. Bae

[permalink] [raw]
Subject: RE: [PATCH V2 01/15] x86/fsgsbase/64: Introduce FS/GS base helper functions

>> The notions of "active" and "inactive" are used to
>> distinguish GS bases between "kernel" and "user". The
>> "inactive" GS base is the GS base of the inactive (user)
>> task, backed up at kernel entry.

> I'm fine with the code, but the changelog entry is confusing. A bunch
> of the active helpers don't contain the term "active".

Okay, will take "active" out from the note.

>> +/*
>> + * Read/write an (inactive) task's fsbase or gsbase. This returns
>> + * the value that the FS/GS base would have (if the task were to be
>> + * resumed). The current task is also supported.
>> + */

> Please change to "Read/write a task's fsbase or gsbase. ... These work
> on current or on a different non-running task."

Will do that.

>> +
>> +unsigned long read_task_fsbase(struct task_struct *task)
>> +{
>> + unsigned long fsbase;
>> +
>> + if (task == current)
>> + fsbase = read_fsbase();
>> + else
>> + /*
>> + * XXX: This will not behave as expected if called
>> + * if fsindex != 0
>> + */
>> + fsbase = task->thread.fsbase;
>> +

> Please put braces around the if and else blocks whenever either of
> them spans multiple lines. Also, maybe change add a note to the
> comment and/or the changelog that this is preserving an existing bug
> and that it's fixed later in the series.

Okay, thanks.

2018-05-31 21:04:26

by Chang S. Bae

[permalink] [raw]
Subject: RE: [PATCH V2 03/15] x86/fsgsbase/64: Use FS/GS base helpers in core dump

Okay, will fix. Thanks

-----Original Message-----
From: Andy Lutomirski [mailto:[email protected]]
Sent: Thursday, May 31, 2018 1:16 PM

On Thu, May 31, 2018 at 10:58 AM Chang S. Bae <[email protected]> wrote:
>
> When new FSGSBASE instructions enabled, this read will be switched to
> be faster.

s/switched to be/become/

Otherwise:
Reviewed-by: Andy Lutomirski <[email protected]>

2018-05-31 21:07:39

by Chang S. Bae

[permalink] [raw]
Subject: RE: [PATCH V2 05/15] x86/vdso: Move out the CPU number store

>> +/* Bit size and mask of CPU number stored in the per CPU data */
>> +#define PERCPU_CPU_SIZE 12
>> +#define PERCPU_CPU_MASK 0xfff

> This name is confusing. Maybe LSL_TSCP_CPU_MASK?

> Can you also add a helpers like:

> static inline unsigned long make_lsl_tscp(unsigned int cpu, unsigned
> int node) { ... }
> static inline unsigned int lsl_tscp_to_cpu(unsigned long x) { ... }
> static inline unsigned int lsl_tscp_to_node(unsigned long x) { ... }

> and use them everywhere? This will make it lot easier to understand the code.

Will rename and include the helpers as suggested.

> +#ifdef CONFIG_X86_32
> + GDT_ENTRY_PERCPU,
> +#else /* 64 bit */
> + GDT_ENTRY_PER_CPU,
> #endif

> Please just rename one of these to match the other one.

Okay, I will try.

2018-05-31 21:12:24

by Chang S. Bae

[permalink] [raw]
Subject: RE: [PATCH V2 00/15] x86: Enable FSGSBASE instructions

> 1. Can you try and generate a new version of patches 1-5 quickly? I think it would be nice to get them merged this cycle.

Okay, let me first reply with what I can do now. I will submit revised patches as soon as I can.

2018-06-05 07:06:41

by kernel test robot

[permalink] [raw]
Subject: [lkp-robot] [x86/vdso] f52001961d: BUG:kernel_hang_in_early-boot_stage,last_printk:Probing_EDD(edd=off_to_disable)...ok


FYI, we noticed the following commit (built with gcc-5):

commit: f52001961d6e5c397e40c4d440103288cdce9a79 ("x86/vdso: Move out the CPU number store")
url: https://github.com/0day-ci/linux/commits/Chang-S-Bae/x86-Enable-FSGSBASE-instructions/20180602-125452


in testcase: boot

on test machine: qemu-system-i386 -enable-kvm -m 256M

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+--------------------------------------------------------------------------------------+------------+------------+
| | 38a5d7ab4a | f52001961d |
+--------------------------------------------------------------------------------------+------------+------------+
| boot_successes | 0 | 0 |
| boot_failures | 14 | 36 |
| WARNING:at_lib/debugobjects.c:#__debug_object_init | 14 | |
| EIP:__debug_object_init | 14 | |
| BUG:kernel_hang_in_early-boot_stage,last_printk:Probing_EDD(edd=off_to_disable)...ok | 0 | 36 |
+--------------------------------------------------------------------------------------+------------+------------+



early console in setup code
Probing EDD (edd=off to disable)... ok
BUG: kernel hang in early-boot stage, last printk: Probing EDD (edd=off to disable)... ok
Linux version 4.17.0-rc6-00163-gf520019 #1
Command line: ip=::::vm-lkp-nhm-dp1-yocto-i386-16::dhcp root=/dev/ram0 user=lkp job=/lkp/scheduled/vm-lkp-nhm-dp1-yocto-i386-16/boot-1-yocto-tiny-i386-2016-04-22.cgz-f52001961d6e5c397e40c4d440103288cdce9a79-20180604-59345-q3lbrc-0.yaml ARCH=i386 kconfig=i386-randconfig-x0-06021157 branch=linux-devel/devel-catchup-201806021429 commit=f52001961d6e5c397e40c4d440103288cdce9a79 BOOT_IMAGE=/pkg/linux/i386-randconfig-x0-06021157/gcc-5/f52001961d6e5c397e40c4d440103288cdce9a79/vmlinuz-4.17.0-rc6-00163-gf520019 max_uptime=600 RESULT_ROOT=/result/boot/1/vm-lkp-nhm-dp1-yocto-i386/yocto-tiny-i386-2016-04-22.cgz/i386-randconfig-x0-06021157/gcc-5/f52001961d6e5c397e40c4d440103288cdce9a79/0 LKP_SERVER=inn debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err ignore_loglevel console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw drbd.minor_count=8 rcuperf.shutdown=0

Elapsed time: 310



To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Xiaolong


Attachments:
(No filename) (2.92 kB)
config-4.17.0-rc6-00163-gf520019 (111.51 kB)
job-script (4.11 kB)
dmesg.xz (1.40 kB)