FSGSBASE is a 64-bit instruction set extension that allows reading and
writing the FS/GS base registers from any privilege level. Although it
has been available since Ivy Bridge, the effort to enable it has been
dragging on for a long time [2,3,4] for various reasons. After
extended discussions [1], this patch set is proposed to introduce new
ABIs for customizing the FS/GS base separately from its selector.
An FSGSBASE-enabled VM can be hosted with either hardware
virtualization or software emulation. KVM advertises FSGSBASE when the
physical CPU has it, and emulation is supported in QEMU/TCG [5]. In a
pool of mixed systems, the VMM may disable FSGSBASE to allow seamless
VM migration [6].
A couple of major benefits are expected. The kernel gains a
context-switch performance improvement by skipping the MSR write for
the GS base. User-level programs (such as Java-based runtimes) benefit
from avoiding system calls to modify the FS/GS base.
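As an illustration only (not part of this series): a user-level
runtime could replace the arch_prctl(ARCH_SET_FS) system call with the
unprivileged FSGSBASE builtins. The sketch below assumes GCC/Clang's
<immintrin.h> intrinsics, a build with -mfsgsbase, and a kernel that
has set CR4.FSGSBASE.

  /*
   * Illustrative user-space sketch: switch a thread's FS base with a
   * single unprivileged instruction instead of arch_prctl().
   */
  #include <immintrin.h>
  #include <stdint.h>

  static inline void switch_user_fsbase(void *tls_block)
  {
          _writefsbase_u64((uint64_t)tls_block);  /* WRFSBASE, no syscall */
  }

  static inline void *current_user_fsbase(void)
  {
          return (void *)_readfsbase_u64();       /* RDFSBASE */
  }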
Changes when FSGSBASE is enabled:
* On context switch, a thread's FS/GS base is preserved regardless of
its selector [1].
* Consequently, when a ptracer updates the FS/GS base, the new value
is always carried over to the tracee. This is the outcome of
discussions with our GDB contact; it was also checked with other
toolchains [7, 8].
* On paranoid entry, the GS base is compared with the kernel's to
decide whether SWAPGS is needed.
[1] Recent discussion on LKML:
https://marc.info/?t=150147053700001&r=1&w=2
[2] Andy Lutomirski’s rebase work:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fsgsbase
[3] Patch set posted in 2016:
https://marc.info/?t=145857711900001&r=1&w=2
[4] First patch set: https://lkml.org/lkml/2015/4/10/573
[5] QEMU with FSGSBASE emulation:
https://github.com/qemu/qemu/blob/026aaf47c02b79036feb830206cfebb2a726510d/target/i386/translate.c#L8186
[6] 5-level EPT:
http://lkml.kernel.org/r/[email protected]
[7] RR/FSGSBASE:
https://mail.mozilla.org/pipermail/rr-dev/2018-March/000616.html
[8] CRIU/FSGSBASE:
https://lists.openvz.org/pipermail/criu/2018-March/040654.html
Andi Kleen (1):
x86/fsgsbase/64: Add intrinsics/macros for FSGSBASE instructions
Andy Lutomirski (4):
x86/fsgsbase/64: Make ptrace read FS/GS base accurately
x86/fsgsbase/64: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
x86/fsgsbase/64: Preserve FS/GS state in __switch_to if FSGSBASE is on
x86/fsgsbase/64: Enable FSGSBASE by default and add a chicken bit
Chang S. Bae (10):
x86/fsgsbase/64: Introduce FS/GS base helper functions
x86/fsgsbase/64: Use FS/GS base helpers in core dump
x86/fsgsbase/64: Factor out load FS/GS segments from __switch_to
x86/ptrace: A new macro to get an offset of user_regs_struct
x86/fsgsbase/64: Add putregs() to handle multiple elements' setting
x86/fsgsbase/64: putregs() in a reverse order
x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
x86/fsgsbase/64: When copying a thread, use FSGSBASE if enabled
x86/fsgsbase/64: With FSGSBASE, compare GS bases on paranoid_entry
x86/fsgsbase/64: Support legacy behavior when FS/GS updated by ptracer
Documentation/admin-guide/kernel-parameters.txt | 2 +
arch/x86/entry/entry_64.S | 54 ++++-
arch/x86/include/asm/elf.h | 6 +-
arch/x86/include/asm/fsgsbase.h | 200 +++++++++++++++++
arch/x86/include/asm/inst.h | 15 ++
arch/x86/kernel/cpu/common.c | 20 ++
arch/x86/kernel/process_64.c | 274 ++++++++++++++++++++----
arch/x86/kernel/ptrace.c | 253 ++++++++++++++++------
8 files changed, 696 insertions(+), 128 deletions(-)
create mode 100644 arch/x86/include/asm/fsgsbase.h
--
2.7.4
From: Andy Lutomirski <[email protected]>
With the new FSGSBASE instructions, we can efficiently read and write
the FS and GS bases in __switch_to. Use that capability to preserve
the full state.
This will enable user code to do whatever it wants with the new
instructions without any kernel-induced gotchas. (There can still be
architectural gotchas: movl %gs,%eax; movl %eax,%gs may change GSBASE
if WRGSBASE was used, but users are expected to read the CPU manual
before doing things like that.)
This is a considerable speedup. It seems to save about 100 cycles
per context switch compared to the baseline 4.6-rc1 behavior on my
Skylake laptop.
Signed-off-by: Andy Lutomirski <[email protected]>
[chang: 5-10% performance improvement on a context-switch micro-
benchmark when the FS/GS base is actively switched.]
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/process_64.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 877636a..7249a54 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -234,8 +234,18 @@ static __always_inline void save_fsgs(struct task_struct *task)
{
savesegment(fs, task->thread.fsindex);
savesegment(gs, task->thread.gsindex);
- save_base_legacy(task, task->thread.fsindex, FS);
- save_base_legacy(task, task->thread.gsindex, GS);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /*
+ * If FSGSBASE is enabled, we can't make any useful guesses
+ * about the base, and user code expects us to save the current
+ * value. Fortunately, reading the base directly is efficient.
+ */
+ task->thread.fsbase = rdfsbase();
+ task->thread.gsbase = rd_shadow_gsbase();
+ } else {
+ save_base_legacy(task, task->thread.fsindex, FS);
+ save_base_legacy(task, task->thread.gsindex, GS);
+ }
}
static __always_inline void loadseg(enum which_selector which,
@@ -300,10 +310,22 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
static __always_inline void load_fsgs(struct thread_struct *prev,
struct thread_struct *next)
{
- load_seg_legacy(prev->fsindex, prev->fsbase,
- next->fsindex, next->fsbase, FS);
- load_seg_legacy(prev->gsindex, prev->gsbase,
- next->gsindex, next->gsbase, GS);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /* Update the FS and GS selectors if they could have changed. */
+ if (unlikely(prev->fsindex || next->fsindex))
+ loadseg(FS, next->fsindex);
+ if (unlikely(prev->gsindex || next->gsindex))
+ loadseg(GS, next->gsindex);
+
+ /* Update the bases. */
+ wrfsbase(next->fsbase);
+ wr_shadow_gsbase(next->gsbase);
+ } else {
+ load_seg_legacy(prev->fsindex, prev->fsbase,
+ next->fsindex, next->fsbase, FS);
+ load_seg_legacy(prev->gsindex, prev->gsbase,
+ next->gsindex, next->gsbase, GS);
+ }
}
static unsigned long task_seg_base(struct task_struct *task,
--
2.7.4
putregs() can handle multiple elements flexibly. It is useful when
there are interdependencies in updating a group of register entries,
as will be the case with FSGSBASE.
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Markus T. Metzger <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/ptrace.c | 51 +++++++++++++++++++++++++++++++++++++-----------
1 file changed, 40 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index d8a1e1b..9c09bf0 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -421,6 +421,22 @@ static int putreg(struct task_struct *child,
return 0;
}
+static int putregs(struct task_struct *child,
+ unsigned int offset,
+ unsigned int count,
+ const unsigned long *values)
+{
+ const unsigned long *v = values;
+ int ret = 0;
+
+ while (count >= sizeof(*v) && !ret) {
+ ret = putreg(child, offset, *v++);
+ count -= sizeof(*v);
+ offset += sizeof(*v);
+ }
+ return ret;
+}
+
static unsigned long getreg(struct task_struct *task, unsigned long offset)
{
switch (offset) {
@@ -477,24 +493,37 @@ static int genregs_set(struct task_struct *target,
const void *kbuf, const void __user *ubuf)
{
int ret = 0;
+
if (kbuf) {
- const unsigned long *k = kbuf;
- while (count >= sizeof(*k) && !ret) {
- ret = putreg(target, pos, *k++);
- count -= sizeof(*k);
- pos += sizeof(*k);
- }
+ ret = putregs(target, pos, count, kbuf);
} else {
const unsigned long __user *u = ubuf;
- while (count >= sizeof(*u) && !ret) {
+ const unsigned long *genregs = NULL;
+ unsigned long *buf = NULL;
+ unsigned int remains = 0;
+
+ buf = kmalloc(count, GFP_KERNEL);
+
+ if (unlikely(!buf))
+ return -ENOMEM;
+
+ genregs = buf;
+ remains = count;
+
+ while (remains >= sizeof(*u) && !ret) {
unsigned long word;
+
ret = __get_user(word, u++);
- if (ret)
+ if (unlikely(ret))
break;
- ret = putreg(target, pos, word);
- count -= sizeof(*u);
- pos += sizeof(*u);
+ memcpy(buf++, &word, sizeof(*u));
+ remains -= sizeof(*u);
}
+
+ if (likely(!ret))
+ /* Allows to handle multiple elements accordingly. */
+ ret = putregs(target, pos, count, genregs);
+ kfree(genregs);
}
return ret;
}
--
2.7.4
This patch walks user_regs_struct in reverse order. The main reason
for doing this is to apply the FS/GS base setting after the selector.
Today, each element is set independently. When (only) the FS/GS base
is updated, its index is reset to zero. In putregs(), the index is not
reset when both the FS/GS base and the selector are covered by the
write.
When FSGSBASE is enabled, an arbitrary base value is possible anyway,
so it is reasonable to write the base last.
Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Markus T. Metzger <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/ptrace.c | 48 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 45 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9c09bf0..ee37e28 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -426,14 +426,56 @@ static int putregs(struct task_struct *child,
unsigned int count,
const unsigned long *values)
{
- const unsigned long *v = values;
+ const unsigned long *v = values + count / sizeof(unsigned long);
int ret = 0;
+#ifdef CONFIG_X86_64
+ bool fs_fully_covered = (offset <= USER_REGS_OFFSET(fs_base)) &&
+ ((offset + count) >= USER_REGS_OFFSET(fs));
+ bool gs_fully_covered = (offset <= USER_REGS_OFFSET(gs_base)) &&
+ ((offset + count) >= USER_REGS_OFFSET(gs));
+
+ offset += count - sizeof(*v);
+
+ while (count >= sizeof(*v) && !ret) {
+ v--;
+ switch (offset) {
+ case USER_REGS_OFFSET(fs_base):
+ if (fs_fully_covered) {
+ if (unlikely(*v >= TASK_SIZE_MAX))
+ return -EIO;
+ /*
+ * When changing both %fs (index) and %fsbase
+ * write_task_fsbase() tends to overwrite
+ * task's %fs. Simply setting base only here.
+ */
+ if (child->thread.fsbase != *v)
+ child->thread.fsbase = *v;
+ break;
+ }
+ case USER_REGS_OFFSET(gs_base):
+ if (gs_fully_covered) {
+ if (unlikely(*v >= TASK_SIZE_MAX))
+ return -EIO;
+ /* Same here as the %fs handling above */
+ if (child->thread.gsbase != *v)
+ child->thread.gsbase = *v;
+ break;
+ }
+ default:
+ ret = putreg(child, offset, *v);
+ }
+ count -= sizeof(*v);
+ offset -= sizeof(*v);
+ }
+#else
+ offset += count - sizeof(*v);
while (count >= sizeof(*v) && !ret) {
- ret = putreg(child, offset, *v++);
+ ret = putreg(child, offset, *(--v));
count -= sizeof(*v);
- offset += sizeof(*v);
+ offset -= sizeof(*v);
}
+#endif /* CONFIG_X86_64 */
return ret;
}
--
2.7.4
From: Andy Lutomirski <[email protected]>
Now that FSGSBASE is fully supported, remove unsafe_fsgsbase, enable
FSGSBASE by default, and add nofsgsbase to disable it.
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 3 +--
arch/x86/kernel/cpu/common.c | 33 +++++++++++--------------
2 files changed, 15 insertions(+), 21 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 87c2260..433dd17 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2567,8 +2567,7 @@
emulation library even if a 387 maths coprocessor
is present.
- unsafe_fsgsbase [X86] Allow FSGSBASE instructions. This will be
- replaced with a nofsgsbase flag.
+ nofsgsbase [X86] Disables FSGSBASE instructions.
no_console_suspend
[HW] Never suspend the console
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 019bdc3..f9b7eba 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -355,22 +355,21 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
cr4_clear_bits(X86_CR4_UMIP);
}
-/*
- * Temporary hack: FSGSBASE is unsafe until a few kernel code paths are updated.
- * This allows us to get the kernel ready incrementally. Setting
- * unsafe_fsgsbase will allow the series to be bisected if necessary.
- *
- * Once all the pieces are in place, this will go away and be replaced with
- * a nofsgsbase chicken flag.
- */
-static bool unsafe_fsgsbase;
-
-static __init int setup_unsafe_fsgsbase(char *arg)
+static __init int x86_nofsgsbase_setup(char *arg)
{
- unsafe_fsgsbase = true;
+ /* require an exact match without trailing characters */
+ if (strlen(arg))
+ return 0;
+
+ /* do not emit a message if the feature is not present */
+ if (!boot_cpu_has(X86_FEATURE_FSGSBASE))
+ return 1;
+
+ setup_clear_cpu_cap(X86_FEATURE_FSGSBASE);
+ pr_info("nofsgsbase: FSGSBASE disabled\n");
return 1;
}
-__setup("unsafe_fsgsbase", setup_unsafe_fsgsbase);
+__setup("nofsgsbase", x86_nofsgsbase_setup);
/*
* Protection Keys are not available in 32-bit mode.
@@ -1256,12 +1255,8 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_umip(c);
/* Enable FSGSBASE instructions if available. */
- if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
- if (unsafe_fsgsbase)
- cr4_set_bits(X86_CR4_FSGSBASE);
- else
- clear_cpu_cap(c, X86_FEATURE_FSGSBASE);
- }
+ if (cpu_has(c, X86_FEATURE_FSGSBASE))
+ cr4_set_bits(X86_CR4_FSGSBASE);
/*
* The vendor-specific functions might have changed features.
--
2.7.4
The proliferation of offsetof(struct user_regs_struct, ...) is trimmed
down with the new USER_REGS_OFFSET() macro.
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/ptrace.c | 78 +++++++++++++++++++++++++-----------------------
1 file changed, 40 insertions(+), 38 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index b2f0beb..d8a1e1b 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -149,6 +149,8 @@ static inline bool invalid_selector(u16 value)
return unlikely(value != 0 && (value & SEGMENT_RPL_MASK) != USER_RPL);
}
+#define USER_REGS_OFFSET(r) offsetof(struct user_regs_struct, r)
+
#ifdef CONFIG_X86_32
#define FLAG_MASK FLAG_MASK_32
@@ -194,7 +196,7 @@ static u16 get_segment_reg(struct task_struct *task, unsigned long offset)
* Returning the value truncates it to 16 bits.
*/
unsigned int retval;
- if (offset != offsetof(struct user_regs_struct, gs))
+ if (offset != USER_REGS_OFFSET(gs))
retval = *pt_regs_access(task_pt_regs(task), offset);
else {
if (task == current)
@@ -224,8 +226,8 @@ static int set_segment_reg(struct task_struct *task,
* safely use invalid selectors) from a kernel trap frame.
*/
switch (offset) {
- case offsetof(struct user_regs_struct, cs):
- case offsetof(struct user_regs_struct, ss):
+ case USER_REGS_OFFSET(cs):
+ case USER_REGS_OFFSET(ss):
if (unlikely(value == 0))
return -EIO;
@@ -233,7 +235,7 @@ static int set_segment_reg(struct task_struct *task,
*pt_regs_access(task_pt_regs(task), offset) = value;
break;
- case offsetof(struct user_regs_struct, gs):
+ case USER_REGS_OFFSET(gs):
if (task == current)
set_user_gs(task_pt_regs(task), value);
else
@@ -261,34 +263,34 @@ static u16 get_segment_reg(struct task_struct *task, unsigned long offset)
unsigned int seg;
switch (offset) {
- case offsetof(struct user_regs_struct, fs):
+ case USER_REGS_OFFSET(fs):
if (task == current) {
/* Older gas can't assemble movq %?s,%r?? */
asm("movl %%fs,%0" : "=r" (seg));
return seg;
}
return task->thread.fsindex;
- case offsetof(struct user_regs_struct, gs):
+ case USER_REGS_OFFSET(gs):
if (task == current) {
asm("movl %%gs,%0" : "=r" (seg));
return seg;
}
return task->thread.gsindex;
- case offsetof(struct user_regs_struct, ds):
+ case USER_REGS_OFFSET(ds):
if (task == current) {
asm("movl %%ds,%0" : "=r" (seg));
return seg;
}
return task->thread.ds;
- case offsetof(struct user_regs_struct, es):
+ case USER_REGS_OFFSET(es):
if (task == current) {
asm("movl %%es,%0" : "=r" (seg));
return seg;
}
return task->thread.es;
- case offsetof(struct user_regs_struct, cs):
- case offsetof(struct user_regs_struct, ss):
+ case USER_REGS_OFFSET(cs):
+ case USER_REGS_OFFSET(ss):
break;
}
return *pt_regs_access(task_pt_regs(task), offset);
@@ -304,22 +306,22 @@ static int set_segment_reg(struct task_struct *task,
return -EIO;
switch (offset) {
- case offsetof(struct user_regs_struct,fs):
+ case USER_REGS_OFFSET(fs):
task->thread.fsindex = value;
if (task == current)
loadsegment(fs, task->thread.fsindex);
break;
- case offsetof(struct user_regs_struct,gs):
+ case USER_REGS_OFFSET(gs):
task->thread.gsindex = value;
if (task == current)
load_gs_index(task->thread.gsindex);
break;
- case offsetof(struct user_regs_struct,ds):
+ case USER_REGS_OFFSET(ds):
task->thread.ds = value;
if (task == current)
loadsegment(ds, task->thread.ds);
break;
- case offsetof(struct user_regs_struct,es):
+ case USER_REGS_OFFSET(es):
task->thread.es = value;
if (task == current)
loadsegment(es, task->thread.es);
@@ -328,12 +330,12 @@ static int set_segment_reg(struct task_struct *task,
/*
* Can't actually change these in 64-bit mode.
*/
- case offsetof(struct user_regs_struct,cs):
+ case USER_REGS_OFFSET(cs):
if (unlikely(value == 0))
return -EIO;
task_pt_regs(task)->cs = value;
break;
- case offsetof(struct user_regs_struct,ss):
+ case USER_REGS_OFFSET(ss):
if (unlikely(value == 0))
return -EIO;
task_pt_regs(task)->ss = value;
@@ -381,19 +383,19 @@ static int putreg(struct task_struct *child,
unsigned long offset, unsigned long value)
{
switch (offset) {
- case offsetof(struct user_regs_struct, cs):
- case offsetof(struct user_regs_struct, ds):
- case offsetof(struct user_regs_struct, es):
- case offsetof(struct user_regs_struct, fs):
- case offsetof(struct user_regs_struct, gs):
- case offsetof(struct user_regs_struct, ss):
+ case USER_REGS_OFFSET(cs):
+ case USER_REGS_OFFSET(ds):
+ case USER_REGS_OFFSET(es):
+ case USER_REGS_OFFSET(fs):
+ case USER_REGS_OFFSET(gs):
+ case USER_REGS_OFFSET(ss):
return set_segment_reg(child, offset, value);
- case offsetof(struct user_regs_struct, flags):
+ case USER_REGS_OFFSET(flags):
return set_flags(child, value);
#ifdef CONFIG_X86_64
- case offsetof(struct user_regs_struct,fs_base):
+ case USER_REGS_OFFSET(fs_base):
if (value >= TASK_SIZE_MAX)
return -EIO;
/*
@@ -403,7 +405,7 @@ static int putreg(struct task_struct *child,
if (child->thread.fsbase != value)
return write_task_fsbase(child, value);
return 0;
- case offsetof(struct user_regs_struct,gs_base):
+ case USER_REGS_OFFSET(gs_base):
/*
* Exactly the same here as the %fs handling above.
*/
@@ -422,21 +424,21 @@ static int putreg(struct task_struct *child,
static unsigned long getreg(struct task_struct *task, unsigned long offset)
{
switch (offset) {
- case offsetof(struct user_regs_struct, cs):
- case offsetof(struct user_regs_struct, ds):
- case offsetof(struct user_regs_struct, es):
- case offsetof(struct user_regs_struct, fs):
- case offsetof(struct user_regs_struct, gs):
- case offsetof(struct user_regs_struct, ss):
+ case USER_REGS_OFFSET(cs):
+ case USER_REGS_OFFSET(ds):
+ case USER_REGS_OFFSET(es):
+ case USER_REGS_OFFSET(fs):
+ case USER_REGS_OFFSET(gs):
+ case USER_REGS_OFFSET(ss):
return get_segment_reg(task, offset);
- case offsetof(struct user_regs_struct, flags):
+ case USER_REGS_OFFSET(flags):
return get_flags(task);
#ifdef CONFIG_X86_64
- case offsetof(struct user_regs_struct, fs_base):
+ case USER_REGS_OFFSET(fs_base):
return read_task_fsbase(task);
- case offsetof(struct user_regs_struct, gs_base):
+ case USER_REGS_OFFSET(gs_base):
return read_task_gsbase(task);
#endif
}
@@ -885,7 +887,7 @@ long arch_ptrace(struct task_struct *child, long request,
#define SEG32(rs) \
case offsetof(struct user32, regs.rs): \
return set_segment_reg(child, \
- offsetof(struct user_regs_struct, rs), \
+ USER_REGS_OFFSET(rs), \
value); \
break
@@ -959,7 +961,7 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
#define SEG32(rs) \
case offsetof(struct user32, regs.rs): \
*val = get_segment_reg(child, \
- offsetof(struct user_regs_struct, rs)); \
+ USER_REGS_OFFSET(rs)); \
break
static int getreg32(struct task_struct *child, unsigned regno, u32 *val)
@@ -1153,7 +1155,7 @@ static long x32_arch_ptrace(struct task_struct *child,
ret = -EIO;
if ((addr & (sizeof(data) - 1)) || addr >= sizeof(struct user) ||
- addr < offsetof(struct user_regs_struct, cs))
+ addr < USER_REGS_OFFSET(cs))
break;
tmp = 0; /* Default return condition */
@@ -1174,7 +1176,7 @@ static long x32_arch_ptrace(struct task_struct *child,
case PTRACE_POKEUSR:
ret = -EIO;
if ((addr & (sizeof(data) - 1)) || addr >= sizeof(struct user) ||
- addr < offsetof(struct user_regs_struct, cs))
+ addr < USER_REGS_OFFSET(cs))
break;
if (addr < sizeof(struct user_regs_struct))
--
2.7.4
When FSGSBASE is enabled, a ptracer's update of the FS/GS selector
fetches the FS/GS base from the GDT/LDT. This emulation of FS/GS
segment loading provides backward compatibility for legacy ptracers.
Without FSGSBASE, when a ptracer sets the FS/GS selector, the base is
reloaded accordingly as the tracee resumes. With FSGSBASE, the FS/GS
base is preserved regardless of its selector. Thus, emulating the
FS/GS load in ptrace is needed to stay compatible with the established
FS/GS setting behavior.
Additionally, whenever a new base value is written, the
FSGSBASE-enabled kernel lets the tracee carry it forward as given.
This also means that when both the selector and the base are changed,
the base is not fetched from the GDT/LDT, but preserved as written.
In summary, a ptracer's update of the FS/GS selector and base yields
the following results for the tracee's base (see the sketch below):
- When only the FS/GS selector is changed (to a nonzero value), the
base is fetched from the GDT/LDT (legacy behavior)
- When the FS/GS base is changed (regardless of the selector), the
tracee keeps that base
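As a purely hypothetical illustration (not part of these patches), a
ptracer could update only a stopped tracee's FS base with
PTRACE_POKEUSER; with FSGSBASE, the tracee then resumes with exactly
this base:

  /*
   * Hypothetical ptracer snippet: write a new FS base for a stopped
   * tracee. With FSGSBASE, the tracee resumes with exactly this base,
   * independent of its FS selector.
   */
  #include <stddef.h>
  #include <stdio.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/user.h>

  static int set_tracee_fsbase(pid_t pid, unsigned long new_base)
  {
          if (ptrace(PTRACE_POKEUSER, pid,
                     (void *)offsetof(struct user, regs.fs_base),
                     (void *)new_base) == -1) {
                  perror("PTRACE_POKEUSER fs_base");
                  return -1;
          }
          return 0;
  }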
Suggested-by: Markus T. Metzger <[email protected]>
Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 4 +++
arch/x86/kernel/process_64.c | 4 +--
arch/x86/kernel/ptrace.c | 68 +++++++++++++++++++++++++++++++++++------
3 files changed, 65 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index 76d3457..430ae40 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -17,6 +17,10 @@ unsigned long read_task_gsbase(struct task_struct *task);
int write_task_fsbase(struct task_struct *task, unsigned long fsbase);
int write_task_gsbase(struct task_struct *task, unsigned long gsbase);
+/* Read (FS/GS) base from GDT/LDT */
+unsigned long task_seg_base(struct task_struct *task,
+ unsigned short selector);
+
/*
* Must be protected by X86_FEATURE_FSGSBASE check.
*/
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 5aae132..ef32f75 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -328,8 +328,8 @@ static __always_inline void load_fsgs(struct thread_struct *prev,
}
}
-static unsigned long task_seg_base(struct task_struct *task,
- unsigned short selector)
+unsigned long task_seg_base(struct task_struct *task,
+ unsigned short selector)
{
unsigned short idx = selector >> 3;
unsigned long base;
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index ee37e28..be3e022 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -307,14 +307,26 @@ static int set_segment_reg(struct task_struct *task,
switch (offset) {
case USER_REGS_OFFSET(fs):
- task->thread.fsindex = value;
if (task == current)
- loadsegment(fs, task->thread.fsindex);
+ loadsegment(fs, value);
+ /*
+ * %fs setting goes to reload its base, when tracee
+ * resumes without FSGSBASE (legacy). Here with FSGSBASE
+ * FS base is (manually) fetched from GDT/LDT when needed.
+ */
+ else if (static_cpu_has(X86_FEATURE_FSGSBASE) &&
+ (value != 0) && (task->thread.fsindex != value))
+ task->thread.fsbase = task_seg_base(task, value);
+ task->thread.fsindex = value;
break;
case USER_REGS_OFFSET(gs):
- task->thread.gsindex = value;
if (task == current)
- load_gs_index(task->thread.gsindex);
+ load_gs_index(value);
+ /* Same as %fs handling above */
+ else if (static_cpu_has(X86_FEATURE_FSGSBASE) &&
+ (value != 0) && (task->thread.gsindex != value))
+ task->thread.gsbase = task_seg_base(task, value);
+ task->thread.gsindex = value;
break;
case USER_REGS_OFFSET(ds):
task->thread.ds = value;
@@ -433,14 +445,31 @@ static int putregs(struct task_struct *child,
((offset + count) >= USER_REGS_OFFSET(fs));
bool gs_fully_covered = (offset <= USER_REGS_OFFSET(gs_base)) &&
((offset + count) >= USER_REGS_OFFSET(gs));
+ bool fs_updated = false, gs_updated = false;
offset += count - sizeof(*v);
while (count >= sizeof(*v) && !ret) {
v--;
switch (offset) {
+ case USER_REGS_OFFSET(fs):
+ if (fs_fully_covered &&
+ static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ if (invalid_selector(*v))
+ return -EIO;
+ /*
+ * Set the flag to fetch fsbase from GDT/LDT
+ * with FSGSBASE
+ */
+ fs_updated = (*v != 0) &&
+ (child->thread.fsindex != *v);
+ child->thread.fsindex = *v;
+ break;
+ }
case USER_REGS_OFFSET(fs_base):
if (fs_fully_covered) {
+ struct thread_struct *thread = &child->thread;
+
if (unlikely(*v >= TASK_SIZE_MAX))
return -EIO;
/*
@@ -448,17 +477,38 @@ static int putregs(struct task_struct *child,
* write_task_fsbase() tends to overwrite
* task's %fs. Simply setting base only here.
*/
- if (child->thread.fsbase != *v)
- child->thread.fsbase = *v;
+ if (thread->fsbase != *v)
+ thread->fsbase = *v;
+ else if (fs_updated)
+ thread->fsbase =
+ task_seg_base(child,
+ thread->fsindex);
+ break;
+ }
+ case USER_REGS_OFFSET(gs):
+ if (gs_fully_covered &&
+ static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ if (invalid_selector(*v))
+ return -EIO;
+ /* Same here as the %fs handling above */
+ gs_updated = (*v != 0) &&
+ (child->thread.gsindex != *v);
+ child->thread.gsindex = *v;
break;
}
case USER_REGS_OFFSET(gs_base):
if (gs_fully_covered) {
+ struct thread_struct *thread = &child->thread;
+
if (unlikely(*v >= TASK_SIZE_MAX))
return -EIO;
- /* Same here as the %fs handling above */
- if (child->thread.gsbase != *v)
- child->thread.gsbase = *v;
+ /* Same here as the %fs_base handling above */
+ if (thread->gsbase != *v)
+ thread->gsbase = *v;
+ else if (gs_updated)
+ thread->gsbase =
+ task_seg_base(child,
+ thread->gsindex);
break;
}
default:
--
2.7.4
The helper functions switch to faster FS/GS access when FSGSBASE is
enabled.
Accessing the user GS base needs a pair of SWAPGS instructions. This
could be avoided if the user GS base were copied at kernel entry,
updated whenever it changes, and the actual GS base written back at
kernel exit. However, that costs more cycles; the measured overhead
(almost) offset the benefit.
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 17 ++++------
arch/x86/kernel/process_64.c | 75 +++++++++++++++++++++++++++++++++++------
2 files changed, 72 insertions(+), 20 deletions(-)
diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index cad2831..8936b7f 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -55,6 +55,8 @@ static __always_inline void wrgsbase(unsigned long gsbase)
: "memory");
}
+#include <asm/cpufeature.h>
+
/*
* Helper functions for reading/writing FS/GS base
*/
@@ -63,20 +65,15 @@ static inline unsigned long read_fsbase(void)
{
unsigned long fsbase;
- rdmsrl(MSR_FS_BASE, fsbase);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE))
+ fsbase = rdfsbase();
+ else
+ rdmsrl(MSR_FS_BASE, fsbase);
return fsbase;
}
void write_fsbase(unsigned long fsbase);
-
-static inline unsigned long read_shadow_gsbase(void)
-{
- unsigned long gsbase;
-
- rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
- return gsbase;
-}
-
+unsigned long read_shadow_gsbase(void);
void write_shadow_gsbase(unsigned long gsbase);
#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4488f07..877636a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -154,6 +154,38 @@ enum which_selector {
};
/*
+ * Interrupts are disabled here.
+ * Out of line to be protected from kprobes.
+ */
+static noinline __kprobes unsigned long rd_shadow_gsbase(void)
+{
+ unsigned long gsbase, flags;
+
+ local_irq_save(flags);
+ native_swapgs();
+ gsbase = rdgsbase();
+ native_swapgs();
+ local_irq_restore(flags);
+
+ return gsbase;
+}
+
+/*
+ * Interrupts are disabled here.
+ * Out of line to be protected from kprobes.
+ */
+static noinline __kprobes void wr_shadow_gsbase(unsigned long gsbase)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ native_swapgs();
+ wrgsbase(gsbase);
+ native_swapgs();
+ local_irq_restore(flags);
+}
+
+/*
* Saves the FS or GS base for an outgoing thread if FSGSBASE extensions are
* not available. The goal is to be reasonably fast on non-FSGSBASE systems.
* It's forcibly inlined because it'll generate better code and this function
@@ -319,16 +351,35 @@ static unsigned long task_seg_base(struct task_struct *task,
void write_fsbase(unsigned long fsbase)
{
- /* set the selector to 0 to not confuse __switch_to */
- loadseg(FS, 0);
- wrmsrl(MSR_FS_BASE, fsbase);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ wrfsbase(fsbase);
+ } else {
+ /* set the selector to 0 to not confuse __switch_to */
+ loadseg(FS, 0);
+ wrmsrl(MSR_FS_BASE, fsbase);
+ }
+}
+
+unsigned long read_shadow_gsbase(void)
+{
+ unsigned long gsbase;
+
+ if (static_cpu_has(X86_FEATURE_FSGSBASE))
+ gsbase = rd_shadow_gsbase();
+ else
+ rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
+ return gsbase;
}
void write_shadow_gsbase(unsigned long gsbase)
{
- /* set the selector to 0 to not confuse __switch_to */
- loadseg(GS, 0);
- wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ wr_shadow_gsbase(gsbase);
+ } else {
+ /* set the selector to 0 to not confuse __switch_to */
+ loadseg(GS, 0);
+ wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
+ }
}
unsigned long read_task_fsbase(struct task_struct *task)
@@ -337,7 +388,8 @@ unsigned long read_task_fsbase(struct task_struct *task)
if (task == current)
fsbase = read_fsbase();
- else if (task->thread.fsindex == 0)
+ else if (static_cpu_has(X86_FEATURE_FSGSBASE) ||
+ (task->thread.fsindex == 0))
fsbase = task->thread.fsbase;
else
fsbase = task_seg_base(task, task->thread.fsindex);
@@ -351,7 +403,8 @@ unsigned long read_task_gsbase(struct task_struct *task)
if (task == current)
gsbase = read_shadow_gsbase();
- else if (task->thread.gsindex == 0)
+ else if (static_cpu_has(X86_FEATURE_FSGSBASE) ||
+ (task->thread.gsindex == 0))
gsbase = task->thread.gsbase;
else
gsbase = task_seg_base(task, task->thread.gsindex);
@@ -374,7 +427,8 @@ int write_task_fsbase(struct task_struct *task, unsigned long fsbase)
task->thread.fsbase = fsbase;
if (task == current)
write_fsbase(fsbase);
- task->thread.fsindex = 0;
+ if (!static_cpu_has(X86_FEATURE_FSGSBASE))
+ task->thread.fsindex = 0;
put_cpu();
return 0;
@@ -391,7 +445,8 @@ int write_task_gsbase(struct task_struct *task, unsigned long gsbase)
task->thread.gsbase = gsbase;
if (task == current)
write_shadow_gsbase(gsbase);
- task->thread.gsindex = 0;
+ if (!static_cpu_has(X86_FEATURE_FSGSBASE))
+ task->thread.gsindex = 0;
put_cpu();
return 0;
--
2.7.4
From: Andy Lutomirski <[email protected]>
This is temporary. It will allow the next few patches to be tested
incrementally.
Setting unsafe_fsgsbase is a root hole. Don't do it.
Signed-off-by: Andy Lutomirski <[email protected]>
[chang: Fix the deactivated flag]
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 3 +++
arch/x86/kernel/cpu/common.c | 25 +++++++++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1d1d53f..87c2260 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2567,6 +2567,9 @@
emulation library even if a 387 maths coprocessor
is present.
+ unsafe_fsgsbase [X86] Allow FSGSBASE instructions. This will be
+ replaced with a nofsgsbase flag.
+
no_console_suspend
[HW] Never suspend the console
Disable suspending of consoles during suspend and
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 348cf48..019bdc3 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -356,6 +356,23 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
}
/*
+ * Temporary hack: FSGSBASE is unsafe until a few kernel code paths are updated.
+ * This allows us to get the kernel ready incrementally. Setting
+ * unsafe_fsgsbase will allow the series to be bisected if necessary.
+ *
+ * Once all the pieces are in place, this will go away and be replaced with
+ * a nofsgsbase chicken flag.
+ */
+static bool unsafe_fsgsbase;
+
+static __init int setup_unsafe_fsgsbase(char *arg)
+{
+ unsafe_fsgsbase = true;
+ return 1;
+}
+__setup("unsafe_fsgsbase", setup_unsafe_fsgsbase);
+
+/*
* Protection Keys are not available in 32-bit mode.
*/
static bool pku_disabled;
@@ -1238,6 +1255,14 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_smap(c);
setup_umip(c);
+ /* Enable FSGSBASE instructions if available. */
+ if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
+ if (unsafe_fsgsbase)
+ cr4_set_bits(X86_CR4_FSGSBASE);
+ else
+ clear_cpu_cap(c, X86_FEATURE_FSGSBASE);
+ }
+
/*
* The vendor-specific functions might have changed features.
* Now we do "generic changes."
--
2.7.4
From: Andy Lutomirski <[email protected]>
ptrace can read the FS/GS base using the register access APIs
(PTRACE_PEEKUSER, etc.) or PTRACE_ARCH_PRCTL. Make both of these
mechanisms return the actual FS/GS base.
This will improve debuggability by providing the correct information
to the ptracer (GDB, etc.).
Signed-off-by: Andy Lutomirski <[email protected]>
[chang: Rebase and revise patch description]
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/process_64.c | 59 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 49 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 65be0a6..2375f10 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -265,6 +265,49 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
}
}
+static unsigned long task_seg_base(struct task_struct *task,
+ unsigned short selector)
+{
+ unsigned short idx = selector >> 3;
+ unsigned long base;
+
+ if (likely((selector & SEGMENT_TI_MASK) == 0)) {
+ if (unlikely(idx >= GDT_ENTRIES))
+ return 0;
+
+ /*
+ * There are no user segments in the GDT with nonzero bases
+ * other than the TLS segments.
+ */
+ if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX)
+ return 0;
+
+ idx -= GDT_ENTRY_TLS_MIN;
+ base = get_desc_base(&task->thread.tls_array[idx]);
+ } else {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ struct ldt_struct *ldt;
+
+ /*
+ * If performance here mattered, we could protect the LDT
+ * with RCU. This is a slow path, though, so we can just
+ * take the mutex.
+ */
+ mutex_lock(&task->mm->context.lock);
+ ldt = task->mm->context.ldt;
+ if (unlikely(idx >= ldt->nr_entries))
+ base = 0;
+ else
+ base = get_desc_base(ldt->entries + idx);
+ mutex_unlock(&task->mm->context.lock);
+#else
+ base = 0;
+#endif
+ }
+
+ return base;
+}
+
void write_fsbase(unsigned long fsbase)
{
/* set the selector to 0 to not confuse __switch_to */
@@ -285,12 +328,10 @@ unsigned long read_task_fsbase(struct task_struct *task)
if (task == current)
fsbase = read_fsbase();
- else
- /*
- * XXX: This will not behave as expected if called
- * if fsindex != 0
- */
+ else if (task->thread.fsindex == 0)
fsbase = task->thread.fsbase;
+ else
+ fsbase = task_seg_base(task, task->thread.fsindex);
return fsbase;
}
@@ -301,12 +342,10 @@ unsigned long read_task_gsbase(struct task_struct *task)
if (task == current)
gsbase = read_shadow_gsbase();
- else
- /*
- * XXX: This will not behave as expected if called
- * if gsindex != 0
- */
+ else if (task->thread.gsindex == 0)
gsbase = task->thread.gsbase;
+ else
+ gsbase = task_seg_base(task, task->thread.gsindex);
return gsbase;
}
--
2.7.4
On 03/19/2018 10:49 AM, Chang S. Bae wrote:
> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
> GS base is not the kernel's.
Do you mean "SWAPGS is needed..."?
> FSGSBASE instructions allow user to write any value on GS base;
> even negative. Sign check on the current GS base is not
> sufficient. Fortunately, reading GS base is fast. Kernel GS
> base is also known from the offset table with the CPU number.
>
> GS-compatible RDPID macro is included.
This description could use some work. I think you're trying to say
that, currently, userspace can't modify GS base and the kernel's
conventions are that a negative GS base means it is a kernel value and a
positive GS base means it is a user value. But, with your new patches,
userspace can put arbitrary data in there, breaking the existing assumption.
Correct?
This also needs to explain a bit of the theory about how we go finding
the kernel GS base value.
Also, this is expected to improve paranoid_entry speed, right? The
rdmsr goes away so this should be faster. Should that be mentioned?
> /*
> - * Save all registers in pt_regs, and switch gs if needed.
> - * Use slow, but surefire "are we in kernel?" check.
> - * Return: ebx=0: need swapgs on exit, ebx=1: otherwise
> + * Save all registers in pt_regs.
> + *
> + * SWAPGS needs when it comes from user space.
The grammar here needs some work.
"SWAPGS is needed when entering from userspace".
> To check where-it-from,
> + * read GS base from RDMSR/MSR_GS_BASE and check if negative or not.
> + * This works without FSGSBASE.
> + * When FSGSBASE enabled, arbitrary GS base can be put by a user-level
> + * task, which means negative value is possible. Direct comparison
> + * between the current and kernel GS bases determines the necessity of
> + * SWAPGS; do if and only if unmatched.
> + *
> + * Return: ebx=0: need SWAPGS on exit, ebx=1: otherwise
> */
I don't think this really belongs in a comment above the function. I'd
just explain overall that we are trying to determine if we interrupted
userspace or not.
> ENTRY(paranoid_entry)
> UNWIND_HINT_FUNC
> cld
> PUSH_AND_CLEAR_REGS save_ret=1
> ENCODE_FRAME_POINTER 8
> - movl $1, %ebx
> - movl $MSR_GS_BASE, %ecx
> - rdmsr
> - testl %edx, %edx
> - js 1f /* negative -> in kernel */
> - SWAPGS
> - xorl %ebx, %ebx
>
> -1:
> + /*
> + * As long as this PTI macro doesn't depend on kernel GS base,
> + * we can do it early. This is because READ_KERNEL_GSBASE
> + * references data in kernel space.
> + */
> SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
We used to depend on some per-cpu data in here, but we no longer do. So
I think this is OK.
> + movl $1, %ebx
I know it wasn't commented before, but please add a comment about what
this is doing.
> + /*
> + * Read current GS base with RDGSBASE. Kernel GS base is found
> + * by CPU number and its offset value.
> + */
> + ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", \
> + "RDGSBASE %rdx", X86_FEATURE_FSGSBASE
I'd rather this be:
ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", "nop",\
X86_FEATURE_FSGSBASE
RDGSBASE %rdx
READ_KERNEL_GSBASE %rax
/* See if the kernel GS_BASE value is in GS base register */
cmpq %rdx, %rax
...
It's a lot more readable than what you have.
...
> + jne .Lparanoid_entry_swapgs
> + ret
> +
> +.Lparanoid_entry_no_fsgsbase:
> + /*
> + * A (slow) RDMSR is surefire without FSGSBASE.
I'd say:
FSGSBASE is not in use, so depend on the kernel-enforced
convention that a negative GS base indicates a kernel value.
> + * The READ_MSR_GSBASE macro scratches %ecx, %eax, and %edx.
"clobbers" is the normal terminology for this, not "scratches".
> + READ_MSR_GSBASE save_reg=%edx
> + testl %edx, %edx /* negative -> in kernel */
> + jns .Lparanoid_entry_swapgs
> + ret
> +
> +.Lparanoid_entry_swapgs:
> + SWAPGS
> + xorl %ebx, %ebx
> ret
> END(paranoid_entry)
^^ Please comment the xorl, especially now that it's farther away from
the comment explaining what it is for.
> diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
> index 8936b7f..76d3457 100644
> --- a/arch/x86/include/asm/fsgsbase.h
> +++ b/arch/x86/include/asm/fsgsbase.h
> @@ -140,6 +140,55 @@ void write_shadow_gsbase(unsigned long gsbase);
> MODRM 0xd0 wrgsbase_opd 1
> .endm
>
> +#if CONFIG_SMP
> +
> +.macro READ_KERNEL_GSBASE_RDPID reg:req
This needs some explanation. Maybe:
/*
* Fetch the per-cpu GSBASE value for this processor and
* put it in @reg. We normally use %GS for accessing per-cpu
* data, but we are setting up %GS here and obviously can not
* use %GS itself to access per-cpu data.
*/
> + RDPID \reg
> +
> + /*
> + * processor id is written during vDSO (virtual dynamic shared object)
> + * initialization. 12 bits for the CPU and 8 bits for the node.
> + */
> + andq $0xFFF, \reg
This begs the question: when do we initialize the VDSO? Is that before
or after the first paranoid_entry interrupt? Also, why the magic
number? Shouldn't this come out of a macro somewhere rather than having
to hard-code and spell out the convention?
> + /*
> + * Kernel GS base is looked up from the __per_cpu_offset list with
> + * the CPU number (processor id).
> + */
> + movq __per_cpu_offset(, \reg, 8), \reg
> +.endm
> +
> +.macro READ_KERNEL_GSBASE_CPU_SEG_LIMIT reg:req
> + /* CPU number is found from the limit of PER_CPU entry in GDT */
> + movq $__PER_CPU_SEG, \reg
> + lsl \reg, \reg
> +
> + /* Same as READ_KERNEL_GSBASE_RDPID */
> + andq $0xFFF, \reg
> + movq __per_cpu_offset(, \reg, 8), \reg
> +.endm
> +
> +.macro READ_KERNEL_GSBASE reg:req
> + ALTERNATIVE "READ_KERNEL_GSBASE_CPU_SEG_LIMIT \reg", \
> + "READ_KERNEL_GSBASE_RDPID \reg", X86_FEATURE_RDPID
> +.endm
Can I suggest a different indentation?
ALTERNATIVE \
"READ_KERNEL_GSBASE_CPU_SEG_LIMIT \reg", \
"READ_KERNEL_GSBASE_RDPID \reg",
X86_FEATURE_RDPID
When FSGSBASE is enabled, SWAPGS needs if and only if (current)
GS base is not the kernel's.
FSGSBASE instructions allow user to write any value on GS base;
even negative. Sign check on the current GS base is not
sufficient. Fortunately, reading GS base is fast. Kernel GS
base is also known from the offset table with the CPU number.
GS-compatible RDPID macro is included.
Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
---
arch/x86/entry/entry_64.S | 54 ++++++++++++++++++++++++++++++++---------
arch/x86/include/asm/fsgsbase.h | 49 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/inst.h | 15 ++++++++++++
3 files changed, 107 insertions(+), 11 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 805f527..51ad17e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -38,6 +38,7 @@
#include <asm/export.h>
#include <asm/frame.h>
#include <asm/nospec-branch.h>
+#include <asm/fsgsbase.h>
#include <linux/err.h>
#include "calling.h"
@@ -1159,26 +1160,57 @@ idtentry machine_check do_mce has_error_code=0 paranoid=1
#endif
/*
- * Save all registers in pt_regs, and switch gs if needed.
- * Use slow, but surefire "are we in kernel?" check.
- * Return: ebx=0: need swapgs on exit, ebx=1: otherwise
+ * Save all registers in pt_regs.
+ *
+ * SWAPGS needs when it comes from user space. To check where-it-from,
+ * read GS base from RDMSR/MSR_GS_BASE and check if negative or not.
+ * This works without FSGSBASE.
+ *
+ * When FSGSBASE enabled, arbitrary GS base can be put by a user-level
+ * task, which means negative value is possible. Direct comparison
+ * between the current and kernel GS bases determines the necessity of
+ * SWAPGS; do if and only if unmatched.
+ *
+ * Return: ebx=0: need SWAPGS on exit, ebx=1: otherwise
*/
ENTRY(paranoid_entry)
UNWIND_HINT_FUNC
cld
PUSH_AND_CLEAR_REGS save_ret=1
ENCODE_FRAME_POINTER 8
- movl $1, %ebx
- movl $MSR_GS_BASE, %ecx
- rdmsr
- testl %edx, %edx
- js 1f /* negative -> in kernel */
- SWAPGS
- xorl %ebx, %ebx
-1:
+ /*
+ * As long as this PTI macro doesn't depend on kernel GS base,
+ * we can do it early. This is because READ_KERNEL_GSBASE
+ * references data in kernel space.
+ */
SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
+ movl $1, %ebx
+ /*
+ * Read current GS base with RDGSBASE. Kernel GS base is found
+ * by CPU number and its offset value.
+ */
+ ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", \
+ "RDGSBASE %rdx", X86_FEATURE_FSGSBASE
+ READ_KERNEL_GSBASE %rax
+ cmpq %rdx, %rax
+ jne .Lparanoid_entry_swapgs
+ ret
+
+.Lparanoid_entry_no_fsgsbase:
+ /*
+ * A (slow) RDMSR is surefire without FSGSBASE.
+ * The READ_MSR_GSBASE macro scratches %ecx, %eax, and %edx.
+ */
+ READ_MSR_GSBASE save_reg=%edx
+ testl %edx, %edx /* negative -> in kernel */
+ jns .Lparanoid_entry_swapgs
+ ret
+
+.Lparanoid_entry_swapgs:
+ SWAPGS
+ xorl %ebx, %ebx
ret
END(paranoid_entry)
diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index 8936b7f..76d3457 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -140,6 +140,55 @@ void write_shadow_gsbase(unsigned long gsbase);
MODRM 0xd0 wrgsbase_opd 1
.endm
+#if CONFIG_SMP
+
+.macro READ_KERNEL_GSBASE_RDPID reg:req
+ RDPID \reg
+
+ /*
+ * processor id is written during vDSO (virtual dynamic shared object)
+ * initialization. 12 bits for the CPU and 8 bits for the node.
+ */
+ andq $0xFFF, \reg
+ /*
+ * Kernel GS base is looked up from the __per_cpu_offset list with
+ * the CPU number (processor id).
+ */
+ movq __per_cpu_offset(, \reg, 8), \reg
+.endm
+
+.macro READ_KERNEL_GSBASE_CPU_SEG_LIMIT reg:req
+ /* CPU number is found from the limit of PER_CPU entry in GDT */
+ movq $__PER_CPU_SEG, \reg
+ lsl \reg, \reg
+
+ /* Same as READ_KERNEL_GSBASE_RDPID */
+ andq $0xFFF, \reg
+ movq __per_cpu_offset(, \reg, 8), \reg
+.endm
+
+.macro READ_KERNEL_GSBASE reg:req
+ ALTERNATIVE "READ_KERNEL_GSBASE_CPU_SEG_LIMIT \reg", \
+ "READ_KERNEL_GSBASE_RDPID \reg", X86_FEATURE_RDPID
+.endm
+
+#else
+
+.macro READ_KERNEL_GSBASE reg:req
+ /* Tracking the base offset value */
+ movq pcpu_unit_offsets(%rip), \reg
+.endm
+
+#endif /* CONFIG_SMP */
+
+.macro READ_MSR_GSBASE save_reg:req
+ movl $MSR_GS_BASE, %ecx
+ /* Read MSR specified by %ecx into %edx:%eax */
+ rdmsr
+ .ifnc \save_reg, %edx
+ movl %edx, \save_reg
+ .endif
+.endm
#endif /* CONFIG_X86_64 */
#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/inst.h b/arch/x86/include/asm/inst.h
index f5a796d..d063841 100644
--- a/arch/x86/include/asm/inst.h
+++ b/arch/x86/include/asm/inst.h
@@ -306,6 +306,21 @@
.endif
MODRM 0xc0 movq_r64_xmm_opd1 movq_r64_xmm_opd2
.endm
+
+.macro RDPID opd
+ REG_TYPE rdpid_opd_type \opd
+ .if rdpid_opd_type == REG_TYPE_R64
+ R64_NUM rdpid_opd \opd
+ .else
+ R32_NUM rdpid_opd \opd
+ .endif
+ .byte 0xf3
+ .if rdpid_opd > 7
+ PFX_REX rdpid_opd 0
+ .endif
+ .byte 0x0f, 0xc7
+ MODRM 0xc0 rdpid_opd 0x7
+.endm
#endif
#endif
--
2.7.4
Instead of open-coding it, use load_fsgs() to clean up __switch_to and
make it symmetric with the FS/GS segment save. When FSGSBASE is
enabled, the X86_FEATURE_FSGSBASE check will be incorporated here.
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/process_64.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 2375f10..4488f07 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -265,6 +265,15 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
}
}
+static __always_inline void load_fsgs(struct thread_struct *prev,
+ struct thread_struct *next)
+{
+ load_seg_legacy(prev->fsindex, prev->fsbase,
+ next->fsindex, next->fsbase, FS);
+ load_seg_legacy(prev->gsindex, prev->gsbase,
+ next->gsindex, next->gsbase, GS);
+}
+
static unsigned long task_seg_base(struct task_struct *task,
unsigned short selector)
{
@@ -574,10 +583,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (unlikely(next->ds | prev->ds))
loadsegment(ds, next->ds);
- load_seg_legacy(prev->fsindex, prev->fsbase,
- next->fsindex, next->fsbase, FS);
- load_seg_legacy(prev->gsindex, prev->gsbase,
- next->gsindex, next->gsbase, GS);
+ load_fsgs(prev, next);
switch_fpu_finish(next_fpu, cpu);
--
2.7.4
Once the new FSGSBASE instructions are enabled, this read will be
switched over to the faster method.
Based-on-code-from: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/elf.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 0d157d2..2bdd38f 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -10,6 +10,7 @@
#include <asm/ptrace.h>
#include <asm/user.h>
#include <asm/auxvec.h>
+#include <asm/fsgsbase.h>
typedef unsigned long elf_greg_t;
@@ -205,7 +206,6 @@ void set_personality_ia32(bool);
#define ELF_CORE_COPY_REGS(pr_reg, regs) \
do { \
- unsigned long base; \
unsigned v; \
(pr_reg)[0] = (regs)->r15; \
(pr_reg)[1] = (regs)->r14; \
@@ -228,8 +228,8 @@ do { \
(pr_reg)[18] = (regs)->flags; \
(pr_reg)[19] = (regs)->sp; \
(pr_reg)[20] = (regs)->ss; \
- rdmsrl(MSR_FS_BASE, base); (pr_reg)[21] = base; \
- rdmsrl(MSR_KERNEL_GS_BASE, base); (pr_reg)[22] = base; \
+ (pr_reg)[21] = read_fsbase(); \
+ (pr_reg)[22] = read_shadow_gsbase(); \
asm("movl %%ds,%0" : "=r" (v)); (pr_reg)[23] = v; \
asm("movl %%es,%0" : "=r" (v)); (pr_reg)[24] = v; \
asm("movl %%fs,%0" : "=r" (v)); (pr_reg)[25] = v; \
--
2.7.4
With the new helpers, FS/GS base access is centralized. Eventually,
when the FSGSBASE instructions are enabled, it will also be faster.
The helpers are used by the ptrace APIs (PTRACE_ARCH_PRCTL,
PTRACE_SETREGS, PTRACE_GETREGS, etc.). The idea is to keep the FS/GS
update mechanism organized.
The notions of "active" and "shadow" are used to distinguish the
"kernel" and "user" GS bases. The "shadow" GS base refers to the GS
base backed up at kernel entry, i.e. the inactive (user) task's GS
base.
Based-on-code-from: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 48 +++++++++++++++
arch/x86/kernel/process_64.c | 128 +++++++++++++++++++++++++++++-----------
arch/x86/kernel/ptrace.c | 28 +++------
3 files changed, 150 insertions(+), 54 deletions(-)
create mode 100644 arch/x86/include/asm/fsgsbase.h
diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
new file mode 100644
index 0000000..29249f6
--- /dev/null
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -0,0 +1,48 @@
+#ifndef _ASM_FSGSBASE_H
+#define _ASM_FSGSBASE_H 1
+
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_X86_64
+
+#include <asm/msr-index.h>
+
+/*
+ * Read/write an (inactive) task's fsbase or gsbase. This returns
+ * the value that the FS/GS base would have (if the task were to be
+ * resumed). The current task is also supported.
+ */
+unsigned long read_task_fsbase(struct task_struct *task);
+unsigned long read_task_gsbase(struct task_struct *task);
+int write_task_fsbase(struct task_struct *task, unsigned long fsbase);
+int write_task_gsbase(struct task_struct *task, unsigned long gsbase);
+
+/*
+ * Helper functions for reading/writing FS/GS base
+ */
+
+static inline unsigned long read_fsbase(void)
+{
+ unsigned long fsbase;
+
+ rdmsrl(MSR_FS_BASE, fsbase);
+ return fsbase;
+}
+
+void write_fsbase(unsigned long fsbase);
+
+static inline unsigned long read_shadow_gsbase(void)
+{
+ unsigned long gsbase;
+
+ rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
+ return gsbase;
+}
+
+void write_shadow_gsbase(unsigned long gsbase);
+
+#endif /* CONFIG_X86_64 */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_FSGSBASE_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9eb448c..65be0a6 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -54,6 +54,7 @@
#include <asm/vdso.h>
#include <asm/intel_rdt_sched.h>
#include <asm/unistd.h>
+#include <asm/fsgsbase.h>
#ifdef CONFIG_IA32_EMULATION
/* Not included via unistd.h */
#include <asm/unistd_32_ia32.h>
@@ -264,6 +265,90 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
}
}
+void write_fsbase(unsigned long fsbase)
+{
+ /* set the selector to 0 to not confuse __switch_to */
+ loadseg(FS, 0);
+ wrmsrl(MSR_FS_BASE, fsbase);
+}
+
+void write_shadow_gsbase(unsigned long gsbase)
+{
+ /* set the selector to 0 to not confuse __switch_to */
+ loadseg(GS, 0);
+ wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
+}
+
+unsigned long read_task_fsbase(struct task_struct *task)
+{
+ unsigned long fsbase;
+
+ if (task == current)
+ fsbase = read_fsbase();
+ else
+ /*
+ * XXX: This will not behave as expected if called
+ * if fsindex != 0
+ */
+ fsbase = task->thread.fsbase;
+
+ return fsbase;
+}
+
+unsigned long read_task_gsbase(struct task_struct *task)
+{
+ unsigned long gsbase;
+
+ if (task == current)
+ gsbase = read_shadow_gsbase();
+ else
+ /*
+ * XXX: This will not behave as expected if called
+ * if gsindex != 0
+ */
+ gsbase = task->thread.gsbase;
+
+ return gsbase;
+}
+
+int write_task_fsbase(struct task_struct *task, unsigned long fsbase)
+{
+ int cpu;
+
+ /*
+ * Not strictly needed for fs, but do it for symmetry
+ * with gs
+ */
+ if (unlikely(fsbase >= TASK_SIZE_MAX))
+ return -EPERM;
+
+ cpu = get_cpu();
+ task->thread.fsbase = fsbase;
+ if (task == current)
+ write_fsbase(fsbase);
+ task->thread.fsindex = 0;
+ put_cpu();
+
+ return 0;
+}
+
+int write_task_gsbase(struct task_struct *task, unsigned long gsbase)
+{
+ int cpu;
+
+ if (unlikely(gsbase >= TASK_SIZE_MAX))
+ return -EPERM;
+
+ cpu = get_cpu();
+ task->thread.gsbase = gsbase;
+ if (task == current)
+ write_shadow_gsbase(gsbase);
+ task->thread.gsindex = 0;
+ put_cpu();
+
+ return 0;
+}
+
int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
unsigned long arg, struct task_struct *p, unsigned long tls)
{
@@ -603,54 +688,27 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
{
int ret = 0;
- int doit = task == current;
- int cpu;
switch (option) {
- case ARCH_SET_GS:
- if (arg2 >= TASK_SIZE_MAX)
- return -EPERM;
- cpu = get_cpu();
- task->thread.gsindex = 0;
- task->thread.gsbase = arg2;
- if (doit) {
- load_gs_index(0);
- ret = wrmsrl_safe(MSR_KERNEL_GS_BASE, arg2);
- }
- put_cpu();
+ case ARCH_SET_GS: {
+ ret = write_task_gsbase(task, arg2);
break;
- case ARCH_SET_FS:
- /* Not strictly needed for fs, but do it for symmetry
- with gs */
- if (arg2 >= TASK_SIZE_MAX)
- return -EPERM;
- cpu = get_cpu();
- task->thread.fsindex = 0;
- task->thread.fsbase = arg2;
- if (doit) {
- /* set the selector to 0 to not confuse __switch_to */
- loadsegment(fs, 0);
- ret = wrmsrl_safe(MSR_FS_BASE, arg2);
- }
- put_cpu();
+ }
+ case ARCH_SET_FS: {
+ ret = write_task_fsbase(task, arg2);
break;
+ }
case ARCH_GET_FS: {
unsigned long base;
- if (doit)
- rdmsrl(MSR_FS_BASE, base);
- else
- base = task->thread.fsbase;
+ base = read_task_fsbase(task);
ret = put_user(base, (unsigned long __user *)arg2);
break;
}
case ARCH_GET_GS: {
unsigned long base;
- if (doit)
- rdmsrl(MSR_KERNEL_GS_BASE, base);
- else
- base = task->thread.gsbase;
+ base = read_task_gsbase(task);
ret = put_user(base, (unsigned long __user *)arg2);
break;
}
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index ed5c4cd..b2f0beb 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -39,6 +39,7 @@
#include <asm/hw_breakpoint.h>
#include <asm/traps.h>
#include <asm/syscall.h>
+#include <asm/fsgsbase.h>
#include "tls.h"
@@ -396,12 +397,11 @@ static int putreg(struct task_struct *child,
if (value >= TASK_SIZE_MAX)
return -EIO;
/*
- * When changing the segment base, use do_arch_prctl_64
- * to set either thread.fs or thread.fsindex and the
- * corresponding GDT slot.
+ * When changing the FS base, use the same
+ * mechanism as for do_arch_prctl_64
*/
if (child->thread.fsbase != value)
- return do_arch_prctl_64(child, ARCH_SET_FS, value);
+ return write_task_fsbase(child, value);
return 0;
case offsetof(struct user_regs_struct,gs_base):
/*
@@ -410,7 +410,7 @@ static int putreg(struct task_struct *child,
if (value >= TASK_SIZE_MAX)
return -EIO;
if (child->thread.gsbase != value)
- return do_arch_prctl_64(child, ARCH_SET_GS, value);
+ return write_task_gsbase(child, value);
return 0;
#endif
}
@@ -434,20 +434,10 @@ static unsigned long getreg(struct task_struct *task, unsigned long offset)
return get_flags(task);
#ifdef CONFIG_X86_64
- case offsetof(struct user_regs_struct, fs_base): {
- /*
- * XXX: This will not behave as expected if called on
- * current or if fsindex != 0.
- */
- return task->thread.fsbase;
- }
- case offsetof(struct user_regs_struct, gs_base): {
- /*
- * XXX: This will not behave as expected if called on
- * current or if fsindex != 0.
- */
- return task->thread.gsbase;
- }
+ case offsetof(struct user_regs_struct, fs_base):
+ return read_task_fsbase(task);
+ case offsetof(struct user_regs_struct, gs_base):
+ return read_task_gsbase(task);
#endif
}
--
2.7.4
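For reference, a minimal user-space sketch of the existing arch_prctl ABI that
the refactored do_arch_prctl_64() above now routes through
write_task_gsbase()/read_task_gsbase(). This is not part of the series; it
assumes <asm/prctl.h> defines ARCH_SET_GS/ARCH_GET_GS and <sys/syscall.h>
defines SYS_arch_prctl (arch_prctl has no glibc wrapper, so syscall(2) is used
directly):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/prctl.h>		/* ARCH_SET_GS, ARCH_GET_GS */

int main(void)
{
	static char tls_area[4096];	/* pretend per-thread data */
	unsigned long base = 0;

	/* Set the GS base without touching the GS selector. */
	if (syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)tls_area))
		perror("ARCH_SET_GS");

	/* Read it back; the kernel stores the result through the pointer. */
	if (syscall(SYS_arch_prctl, ARCH_GET_GS, &base))
		perror("ARCH_GET_GS");

	printf("GS base = %#lx (expected %#lx)\n",
	       base, (unsigned long)tls_area);
	return 0;
}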
When FSGSBASE is enabled, copy the real FS/GS base values instead
of an approximation.
Factoring this out into save_fsgs() would not yield exactly the same
behavior, because save_base_legacy() does not copy the FS/GS base
when the index is zero.
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andi Kleen <[email protected]>
---
arch/x86/kernel/process_64.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 7249a54..5aae132 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -491,10 +491,16 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
p->thread.sp = (unsigned long) fork_frame;
p->thread.io_bitmap_ptr = NULL;
- savesegment(gs, p->thread.gsindex);
- p->thread.gsbase = p->thread.gsindex ? 0 : me->thread.gsbase;
savesegment(fs, p->thread.fsindex);
- p->thread.fsbase = p->thread.fsindex ? 0 : me->thread.fsbase;
+ savesegment(gs, p->thread.gsindex);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ p->thread.fsbase = rdfsbase();
+ p->thread.gsbase = rd_shadow_gsbase();
+ } else {
+ /* save_base_legacy() does not set base when index is zero. */
+ p->thread.fsbase = p->thread.fsindex ? 0 : me->thread.fsbase;
+ p->thread.gsbase = p->thread.gsindex ? 0 : me->thread.gsbase;
+ }
savesegment(es, p->thread.es);
savesegment(ds, p->thread.ds);
memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
--
2.7.4
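A hedged user-space sketch of the case this patch addresses (not part of the
series; assumes an FSGSBASE-enabled CPU and kernel, a build with -mfsgsbase,
and the _writegsbase_u64()/_readgsbase_u64() intrinsics from <immintrin.h>):
the parent writes the GS base directly with WRGSBASE, so the selector and the
saved thread.gsbase stay stale, and only reading the live register at fork
time lets the child inherit the real value.

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <immintrin.h>		/* _writegsbase_u64(), _readgsbase_u64() */

int main(void)
{
	static char area[4096];

	/* Set the GS base with WRGSBASE; the GS selector stays 0. */
	_writegsbase_u64((uint64_t)(uintptr_t)area);

	pid_t pid = fork();
	if (pid == 0) {
		/* With this patch, the child sees the parent's live base. */
		uint64_t child_base = _readgsbase_u64();

		printf("child GS base %s\n",
		       child_base == (uint64_t)(uintptr_t)area ?
		       "inherited" : "lost");
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	return 0;
}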
From: Andi Kleen <[email protected]>
Add C intrinsics and assembler macros for the new RD/WR FS/GS BASE
instructions.
Very straightforward. Used in a follow-on patch.
[luto: rename the variables from fs and gs to fsbase and gsbase and
make fsgs.h safe to include on 32-bit kernels.]
v2: Use __always_inline
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
[chang: Replace new instruction macros with GAS-compatible ones and
rename them]
Signed-off-by: Chang S. Bae <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 102 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 102 insertions(+)
diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
index 29249f6..cad2831 100644
--- a/arch/x86/include/asm/fsgsbase.h
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -18,6 +18,44 @@ int write_task_fsbase(struct task_struct *task, unsigned long fsbase);
int write_task_gsbase(struct task_struct *task, unsigned long gsbase);
/*
+ * Must be protected by X86_FEATURE_FSGSBASE check.
+ */
+
+static __always_inline unsigned long rdfsbase(void)
+{
+ unsigned long fsbase;
+
+ asm volatile(".byte 0xf3, 0x48, 0x0f, 0xae, 0xc0 # rdfsbaseq %%rax"
+ : "=a" (fsbase)
+ :: "memory");
+ return fsbase;
+}
+
+static __always_inline unsigned long rdgsbase(void)
+{
+ unsigned long gsbase;
+
+ asm volatile(".byte 0xf3, 0x48, 0x0f, 0xae, 0xc8 # rdgsbaseq %%rax;"
+ : "=a" (gsbase)
+ :: "memory");
+ return gsbase;
+}
+
+static __always_inline void wrfsbase(unsigned long fsbase)
+{
+ asm volatile(".byte 0xf3, 0x48, 0x0f, 0xae, 0xd0 # wrfsbaseq %%rax"
+ :: "a" (fsbase)
+ : "memory");
+}
+
+static __always_inline void wrgsbase(unsigned long gsbase)
+{
+ asm volatile(".byte 0xf3, 0x48, 0x0f, 0xae, 0xd8 # wrgsbaseq %%rax;"
+ :: "a" (gsbase)
+ : "memory");
+}
+
+/*
* Helper functions for reading/writing FS/GS base
*/
@@ -43,6 +81,70 @@ void write_shadow_gsbase(unsigned long gsbase);
#endif /* CONFIG_X86_64 */
+#else /* __ASSEMBLY__ */
+
+#ifdef CONFIG_X86_64
+
+#include <asm/inst.h>
+
+.macro RDFSBASE opd
+ REG_TYPE rdfsbase_opd_type \opd
+ .if rdfsbase_opd_type == REG_TYPE_R64
+ R64_NUM rdfsbase_opd \opd
+ .byte 0xf3
+ PFX_REX rdfsbase_opd 0 W = 1
+ .else
+ R32_NUM rdfsbase_opd \opd
+ .byte 0xf3
+ .endif
+ .byte 0xf, 0xae
+ MODRM 0xc0 rdfsbase_opd 0
+.endm
+
+.macro WRFSBASE opd
+ REG_TYPE wrfsbase_opd_type \opd
+ .if wrfsbase_opd_type == REG_TYPE_R64
+ R64_NUM wrfsbase_opd \opd
+ .byte 0xf3
+ PFX_REX wrfsbase_opd 0 W = 1
+ .else
+ R32_NUM wrfsbase_opd \opd
+ .byte 0xf3
+ .endif
+ .byte 0xf, 0xae
+ MODRM 0xd0 wrfsbase_opd 0
+.endm
+
+.macro RDGSBASE opd
+ REG_TYPE rdgsbase_opd_type \opd
+ .if rdgsbase_opd_type == REG_TYPE_R64
+ R64_NUM rdgsbase_opd \opd
+ .byte 0xf3
+ PFX_REX rdgsbase_opd 0 W = 1
+ .else
+ R32_NUM rdgsbase_opd \opd
+ .byte 0xf3
+ .endif
+ .byte 0xf, 0xae
+ MODRM 0xc0 rdgsbase_opd 1
+.endm
+
+.macro WRGSBASE opd
+ REG_TYPE wrgsbase_opd_type \opd
+ .if wrgsbase_opd_type == REG_TYPE_R64
+ R64_NUM wrgsbase_opd \opd
+ .byte 0xf3
+ PFX_REX wrgsbase_opd 0 W = 1
+ .else
+ R32_NUM wrgsbase_opd \opd
+ .byte 0xf3
+ .endif
+ .byte 0xf, 0xae
+ MODRM 0xd0 wrgsbase_opd 1
+.endm
+
+#endif /* CONFIG_X86_64 */
+
#endif /* __ASSEMBLY__ */
#endif /* _ASM_FSGSBASE_H */
--
2.7.4
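A hedged sketch of the call pattern these intrinsics expect (not from the
patch; the function name is illustrative only, and the snippet assumes the
usual kernel headers for static_cpu_has(), rdmsrl() and MSR_FS_BASE): since
RDFSBASE/WRFSBASE and friends raise #UD unless CR4.FSGSBASE is set, callers
guard them with the feature check and fall back to the MSR path otherwise.

/* Illustrative kernel-context sketch only. */
static unsigned long example_read_current_fsbase(void)
{
	unsigned long fsbase;

	if (static_cpu_has(X86_FEATURE_FSGSBASE))
		fsbase = rdfsbase();		/* fast: no MSR access */
	else
		rdmsrl(MSR_FS_BASE, fsbase);	/* legacy: slow RDMSR */

	return fsbase;
}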
On 3/19/2018 01:05 PM, Dave Hansen wrote:
>> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
>> GS base is not the kernel's.
> Do you mean "SWAPGS is needed..."?
Yes, will change.
>> FSGSBASE instructions allow user to write any value on GS base;
>> even negative. Sign check on the current GS base is not
>> sufficient. Fortunately, reading GS base is fast. Kernel GS
>> base is also known from the offset table with the CPU number.
>>
>> GS-compatible RDPID macro is included.
> This description could use some work. I think you're trying to say
> that, currently, userspace can't modify GS base and the kernel's
> conventions are that a negative GS base means it is a kernel value and a
> positive GS base means it is a user value. But, with your new patches,
> userspace can put arbitrary data in there, breaking the existing assumption.
> Correct?
Yes.
> This also needs to explain a bit of the theory about how we go finding
> the kernel GS base value.
> Also, this is expected to improve paranoid_entry speed, right? The
> rdmsr goes away so this should be faster. Should that be mentioned?
I'm a bit reluctant to claim any performance benefit here yet (without having any strong empirical evidence).
>> + * SWAPGS needs when it comes from user space.
> The grammar here needs some work.
> "SWAPGS is needed when entering from userspace".
Okay
>> To check where-it-from,
>> + * read GS base from RDMSR/MSR_GS_BASE and check if negative or not.
>> + * This works without FSGSBASE.
>> + * When FSGSBASE enabled, arbitrary GS base can be put by a user-level
>> + * task, which means negative value is possible. Direct comparison
>> + * between the current and kernel GS bases determines the necessity of
>> + * SWAPGS; do if and only if unmatched.
>> + *
>> + * Return: ebx=0: need SWAPGS on exit, ebx=1: otherwise
>> */
> I don't think this really belongs in a comment above the function. I'd
> just explain overall that we are trying to determine if we interrupted
> userspace or not.
I'd rather sync up with you on the comments before posting a revision.
>> + movl $1, %ebx
> I know it wasn't commented before, but please add a comment about what
> this is doing.
Okay, as long as the comment doesn't grow too long.
>> + /*
>> + * Read current GS base with RDGSBASE. Kernel GS base is found
>> + * by CPU number and its offset value.
>> + */
>> + ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", \
>> + "RDGSBASE %rdx", X86_FEATURE_FSGSBASE
>I'd rather this be:
> ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", "nop",\
> X86_FEATURE_FSGSBASE
> RDGSBASE %rdx
> READ_KERNEL_GSBASE %rax
> /* See if the kernel GS_BASE value is in GS base register */
> cmpq %rdx, %rax
> ...
>It's a lot more readable than what you have.
Yes, if it helps readability.
>> + jne .Lparanoid_entry_swapgs
>> + ret
>> +
>> +.Lparanoid_entry_no_fsgsbase:
>> + /*
>> + * A (slow) RDMSR is surefire without FSGSBASE.
>> I'd say:
>> FSGSBASE is not in use, so depend on the kernel-enforced
>> convention that a negative GS base indicates a kernel value.
Okay, the detailed comment at the entry should be reworked like that.
>> + * The READ_MSR_GSBASE macro scratches %ecx, %eax, and %edx.
> "clobbers" is the normal terminology for this, not "scratches".
Got it.
>> +.Lparanoid_entry_swapgs:
>> + SWAPGS
>> + xorl %ebx, %ebx
>> ret
>> END(paranoid_entry)
> ^^ Please comment the xorl, especially now that it's farther away from
> the comment explaining what it is for.
Okay (as long as it doesn't add too much).
> +.macro READ_KERNEL_GSBASE_RDPID reg:req
> This needs some explanation. Maybe:
> /*
> * Fetch the per-cpu GSBASE value for this processor and
> * put it in @reg. We normally use %GS for accessing per-cpu
> * data, but we are setting up %GS here and obviously can not
> * use %GS itself to access per-cpu data.
> */
Let me think.
>> + /*
>> + * processor id is written during vDSO (virtual dynamic shared object)
>> + * initialization. 12 bits for the CPU and 8 bits for the node.
>> + */
>> + andq $0xFFF, \reg
> This begs the question: when do we initialize the VDSO? Is that before
> or after the first paranoid_entry interrupt? Also, why the magic
> number? Shouldn't this come out of a macro somewhere rather than having
> to hard-code and spell out the convention?
I hoped the comment would be explanatory, but I agree that the hard-coded value is bad anyway.
>> +.macro READ_KERNEL_GSBASE reg:req
>> + ALTERNATIVE "READ_KERNEL_GSBASE_CPU_SEG_LIMIT \reg", \
>> + "READ_KERNEL_GSBASE_RDPID \reg", X86_FEATURE_RDPID
>> +.endm
> Can I suggest a different indentation?
> ALTERNATIVE \
> "READ_KERNEL_GSBASE_CPU_SEG_LIMIT \reg", \
> "READ_KERNEL_GSBASE_RDPID \reg",
> X86_FEATURE_RDPID
Sure, you can.
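For reference, a hedged C rendering of what the READ_KERNEL_GSBASE macro
discussed above needs to compute (illustrative only:
cpu_number_from_rdpid_or_lsl() is a hypothetical stand-in for the RDPID/LSL
sequence, and the sketch assumes the kernel GS base for a CPU equals its entry
in __per_cpu_offset[]):

static unsigned long example_find_kernel_gsbase(void)
{
	unsigned int cpu;

	/*
	 * The value read back by RDPID (or the LSL fallback) packs the
	 * CPU number into its low 12 bits and the node into the next 8.
	 */
	cpu = cpu_number_from_rdpid_or_lsl() & 0xfff;	/* hypothetical helper */

	/* Per-CPU area base == kernel GS base for that CPU. */
	return __per_cpu_offset[cpu];
}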
From: Chang S. Bae
> Sent: 19 March 2018 17:49
...
> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
> GS base is not the kernel's.
>
> FSGSBASE instructions allow user to write any value on GS base;
> even negative. Sign check on the current GS base is not
> sufficient. Fortunately, reading GS base is fast. Kernel GS
> base is also known from the offset table with the CPU number.
...
User code might want to put a negative value into GSBASE.
While it is normal to put a valid address into GSBASE, there
is no reason why the code can't put an offset into GSBASE,
in which case it might be negative.
Yes, I know you can't put arbitrary 64-bit values into GSBASE.
But the difference between two user pointers will always be valid.
David
> On Mar 19, 2018, at 10:49 AM, Chang S. Bae <[email protected]> wrote:
>
> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
> GS base is not the kernel's.
>
> FSGSBASE instructions allow user to write any value on GS base;
> even negative. Sign check on the current GS base is not
> sufficient. Fortunately, reading GS base is fast. Kernel GS
> base is also known from the offset table with the CPU number.
The original version of these patches (mine and Andi’s) didn’t have
this comparison, didn’t need RDMSR, and didn’t allow malicious user
programs to cause the kernel to run decently large chunks of code with
the reverse of the expected GS convention. Why did you change it?
I really really don't like having a corner case like this that can and
will be triggered by malicious user code but that is hard to write a
self-test for because it involves guessing a 64-bit magic number.
Untestable corner cases in the x86 entry code are bad.
On Mon, Mar 19, 2018 at 5:49 PM, Chang S. Bae <[email protected]> wrote:
> When FSGSBASE is enabled, a ptracer's FS/GS selector update
> fetches the FS/GS base from the GDT/LDT. This emulation of FS/GS
> segment loading provides backward compatibility for legacy
> ptracers.
>
> When the ptracer sets the FS/GS selector, the base is reloaded
> accordingly as the tracee resumes. That is the behavior without
> FSGSBASE. With FSGSBASE, the FS/GS base is preserved regardless
> of the selector. Thus, ptrace emulates the FS/GS load to stay
> compatible with the established FS/GS-setting behavior.
>
> Additionally, whenever a new base value is written, the
> FSGSBASE-enabled kernel lets the tracee carry it along. This also
> means that when both the selector and the base are changed, the
> base is not fetched from the GDT/LDT but preserved as given.
>
>
> Suggested-by: Markus T. Metzger <[email protected]>
> Suggested-by: H. Peter Anvin <[email protected]>
I've also suggested something like this myself, but this approach is
far more complicated than the older approach. Was there something
that the old approach would break? If so, what?
> + /*
> + * %fs setting goes to reload its base, when tracee
> + * resumes without FSGSBASE (legacy). Here with FSGSBASE
> + * FS base is (manually) fetched from GDT/LDT when needed.
> + */
> + else if (static_cpu_has(X86_FEATURE_FSGSBASE) &&
> + (value != 0) && (task->thread.fsindex != value))
> + task->thread.fsbase = task_seg_base(task, value);
The comment above should explain why you're checking this particular
condition. I find the fsindex != value check to be *very* surprising.
On a real CPU, writing some nonzero value to %fs does the same thing
regardless of what the old value of %fs was.
> + case USER_REGS_OFFSET(fs):
> + if (fs_fully_covered &&
This is_fully_covered thing is IMO overcomplicated. Why not just make
a separate helper set_fsgs_index_and_base() and have putregs() call it
when both are set at once?
> + static_cpu_has(X86_FEATURE_FSGSBASE)) {
> + if (invalid_selector(*v))
> + return -EIO;
> + /*
> + * Set the flag to fetch fsbase from GDT/LDT
> + * with FSGSBASE
> + */
> + fs_updated = (*v != 0) &&
> + (child->thread.fsindex != *v);
Same here. Why do you care if fs was changed?
On Mon, Mar 19, 2018 at 5:49 PM, Chang S. Bae <[email protected]> wrote:
> FSGSBASE is 64-bit instruction set to allow read/write
> FS/GS base from any privilege. As introduced from
> Ivybridge, enabling effort has been revolving quite long
> [2,3,4] for various reasons. After extended discussions [1],
> this patchset is proposed to introduce new ABIs of
> customizing FS/GS base (separate from its selector).
>
> FSGSBASE-enabled VM can be located on hosts with
> either HW virtualization or SW emulation. KVM advertises
> FSGSBASE when physical CPU has and emulation is
> supported in QEMU/TCG [5]. In a pool of mixed systems, VMM
> may disable FSGSBASE for seamless VM migrations [6].
>
> A couple of major benefits are expected. Kernel will have
> performance improvement in context switch by skipping MSR
> write for GS base. User-level programs (such as JAVA-based)
> benefit from avoiding system calls to edit FS/GS base.
Can you describe what changed since the last submission? It looks
like a lot has changed and this series is much more complicated and
much more fragile than it used to be. Why?
On Tue, Mar 20, 2018 at 3:05 PM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Mar 19, 2018 at 5:49 PM, Chang S. Bae <[email protected]> wrote:
>> FSGSBASE is 64-bit instruction set to allow read/write
>> FS/GS base from any privilege. As introduced from
>> Ivybridge, enabling effort has been revolving quite long
>> [2,3,4] for various reasons. After extended discussions [1],
>> this patchset is proposed to introduce new ABIs of
>> customizing FS/GS base (separate from its selector).
>>
>> FSGSBASE-enabled VM can be located on hosts with
>> either HW virtualization or SW emulation. KVM advertises
>> FSGSBASE when physical CPU has and emulation is
>> supported in QEMU/TCG [5]. In a pool of mixed systems, VMM
>> may disable FSGSBASE for seamless VM migrations [6].
>>
>> A couple of major benefits are expected. Kernel will have
>> performance improvement in context switch by skipping MSR
>> write for GS base. User-level programs (such as JAVA-based)
>> benefit from avoiding system calls to edit FS/GS base.
>
> Can you describe what changed since the last submission? It looks
> like a lot has changed and this series is much more complicated and
> much more fragile than it used to be. Why?
Also, I've been getting lots of kbuild bot complaints about something
that looks like these patches from your git tree as recently as March
17. Have you fixed those problems?
On 3/20/18, 07:58, "Andy Lutomirski" <[email protected]> wrote:
>> On Mar 19, 2018, at 10:49 AM, Chang S. Bae <[email protected]> wrote:
>>
>> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
>> GS base is not the kernel's.
>>
>> FSGSBASE instructions allow user to write any value on GS base;
>> even negative. Sign check on the current GS base is not
>> sufficient. Fortunately, reading GS base is fast. Kernel GS
>> base is also known from the offset table with the CPU number.
> The original version of these patches (mine and Andi’s) didn’t have
> this comparison, didn’t need RDMSR, and didn’t allow malicious user
> programs to cause the kernel to run decently large chunks of code with
> the reverse of the expected GS convention. Why did you change it?
> I really really don't like having a corner case like this that can and
> will be triggered by malicious user code but that is hard to write a
> self-test for because it involves guessing a 64-bit magic number.
> Untestable corner cases in the x86 entry code are bad.
Originally, I took that approach. But since it keeps the kernel GS base on the
(IST) stack, it was considered fragile [1] AFAIK. If you really don't like this
"comparison", then the GS base can be written (unconditionally) while the
original GS base is stashed away (like the original approach did) IMO.
[1] https://marc.info/?l=linux-kernel&m=151725088506036&w=3
On 3/20/18, 08:07, "Andy Lutomirski" <[email protected]> wrote:
> Can you describe what changed since the last submission? It looks
> like a lot has changed and this series is much more complicated and
> much more fragile than it used to be. Why?
According to your overall comments, the parts you find most concerning right now
are the paranoid entry and the ptrace legacy support (correct me if I'm wrong).
Design-wise, I think they're the major changes -- I can follow up with
a summary if you want. I responded about the paranoid entry. The ptrace
legacy support is simply new; it didn't exist before AFAIK.
> Also, I've been getting lots of kbuild bot complaints about something
> that looks like these patches from your git tree as recently as March
> 17. Have you fixed those problems?
As of last week, I didn't see any major issue at all. I don't know what major
issues you saw. Most likely it was just a silly rebase issue, I bet.
On 3/20/18, 08:05, "Andy Lutomirski" <[email protected]> wrote:
> I've also suggested something like this myself, but this approach is
> far more complicated than the older approach. Was there something
> that the old approach would break? If so, what?
Sorry, I'm not aware of your suggestion. Can you elaborate?
>> + /*
>> + * %fs setting goes to reload its base, when tracee
>> + * resumes without FSGSBASE (legacy). Here with FSGSBASE
>> + * FS base is (manually) fetched from GDT/LDT when needed.
>> + */
>> + else if (static_cpu_has(X86_FEATURE_FSGSBASE) &&
>> + (value != 0) && (task->thread.fsindex != value))
>> + task->thread.fsbase = task_seg_base(task, value);
> The comment above should explain why you're checking this particular
> condition. I find the fsindex != value check to be *very* surprising.
> On a real CPU, writing some nonzero value to %fs does the same thing
> regardless of what the old value of %fs was.
With FSGSBASE, when neither the index nor the base is changed, the base
would (always) be fetched from the GDT/LDT. That is not considered legacy
behavior we need to support, AFAIK.
> This is_fully_covered thing is IMO overcomplicated. Why not just make
> a separate helper set_fsgs_index_and_base() and have putregs() call it
> when both are set at once?
Using a helper function here is exactly what I did at first. I thought this
flag was simple and straightforward enough in the end. But I'm open to
factoring it out.
On 3/20/18, 07:58, "Andy Lutomirski" <[email protected]> wrote:
> The original version of these patches (mine and Andi’s) didn’t have
> this comparison, didn’t need RDMSR, and didn’t allow malicious user
To be clear, RDMSR is not needed (with FSGSBASE).
On 03/20/18 03:16, David Laight wrote:
> From: Chang S. Bae
>> Sent: 19 March 2018 17:49
> ...
>> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
>> GS base is not the kernel's.
>>
>> FSGSBASE instructions allow user to write any value on GS base;
>> even negative. Sign check on the current GS base is not
>> sufficient. Fortunately, reading GS base is fast. Kernel GS
>> base is also known from the offset table with the CPU number.
> ...
>
>> User code might want to put a negative value into GSBASE.
> While it is normal to put a valid address into GSBASE there
> is no reason why the code can't put an offset into GSBASE,
> in which case it might be negative.
>
> Yes, I know you can't put arbitrary 64bit values into GSBASE.
> But the difference between 2 user pointers will always be valid.
>
You don't have a choice: you can't control what userspace puts in there.
Anything that depends on a specific value is inherently unsafe.
But we also don't need swapgs when we have rdgsbase/wrgsbase available.
We can indeed just unconditionally save it (via rdgsbase) into the stack
frame and wrgsbase the correct percpu value. In that case it might be
necessary in order to avoid insane complexity to also save/restore the
gs selector.
Is it going to be faster? *Probably* not as swapgs is designed to be
fast; it does, however, eliminate the need to RDMSR/WRMSR inside the
kernel task switch as the user space gsbase will simply live on the
stack. (This is assuming we do this unconditionally on every method of
kernel entry, including non-paranoid. I'm not sure if we ever care
about the userspace GS/GSBASE inside a paranoid handler, but if we do, it
would be rather messy to find out if we do this conditionally.)
Now...
+ ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", \
+ "RDGSBASE %rdx", X86_FEATURE_FSGSBASE
+ READ_KERNEL_GSBASE %rax
READ_KERNEL_GSBASE here seems like a Really Bad Name[TM] for this macro,
since it seems to imply reading MSR_KERNEL_GS_BASE, rather than finding
the current percpu offset. I would prefer calling it something like
FIND_PERCPU_BASE or something like that.
-hpa
> I would prefer calling it something like FIND_PERCPU_BASE or something like that.
Thanks for the suggestion.
Chang
On Tue, Mar 20, 2018 at 8:07 PM, H. Peter Anvin <[email protected]> wrote:
> On 03/20/18 03:16, David Laight wrote:
>> From: Chang S. Bae
>>> Sent: 19 March 2018 17:49
>> ...
>>> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
>>> GS base is not the kernel's.
>>>
>>> FSGSBASE instructions allow user to write any value on GS base;
>>> even negative. Sign check on the current GS base is not
>>> sufficient. Fortunately, reading GS base is fast. Kernel GS
>>> base is also known from the offset table with the CPU number.
>> ...
>>
>> User code might want to put a negative value into GSBASE.
>> While it is normal to put a valid address into GSBASE there
>> is no reason why the code can't put an offset into GSBASE,
>> in which case it might be negative.
>>
>> Yes, I know you can't put arbitrary 64bit values into GSBASE.
>> But the difference between 2 user pointers will always be valid.
>>
>
> You don't have a choice: you can't control what userspace puts in there.
> Anything that depends on a specific value is inherently unsafe.
>
> But we also don't need swapgs when we have rdgsbase/wrgsbase available.
> We can indeed just unconditionally save it (via rdgsbase) into the stack
> frame and wrgsbase the correct percpu value. In that case it might be
> necessary in order to avoid insane complexity to also save/restore the
> gs selector.
This is exactly what the old code did. I liked the old code better.
>
> Is it going to be faster? *Probably* not as swapgs is designed to be
> fast; it does, however, eliminate the need to RDMSR/WRMSR inside the
> kernel task switch as the user space gsbase will simply live on the
> stack. (This is assuming we do this unconditionally on every method of
> kernel entry, including non-paranoid. I'm not sure if we ever care
> about the userspace GS/GSBASE inside a paranoid handler, but if we do it
> would be rather messy to find if we do this conditionally.
>
> Now...
>
> + ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", \
> + "RDGSBASE %rdx", X86_FEATURE_FSGSBASE
> + READ_KERNEL_GSBASE %rax
>
> READ_KERNEL_GSBASE here seems like a Really Bad Name[TM] for this macro,
> since it seems to imply reading MSR_KERNEL_GS_BASE, rather than finding
> the current percpu offset. I would prefer calling it something like
> FIND_PERCPU_BASE or something like that.
I think we should revert to what the old patches did here.
On Tue, Mar 20, 2018 at 4:33 PM, Bae, Chang Seok
<[email protected]> wrote:
> On 3/20/18, 08:07, "Andy Lutomirski" <[email protected]> wrote:
>> Can you describe what changed since the last submission? It looks
>> like a lot has changed and this series is much more complicated and
>> much more fragile than it used to be. Why?
>
> According to your overall comments, most concerning part you have right now
> is paranoid entry and ptrace legacy support. (correct me if I'm wrong)
> Design-wise, I think they're major changes -- I can follow-up with
> a summary if you want. I responded about paranoid entry. ptrace
> legacy support is simply new that never before AFAIK.
Summary please.
On Tue, Mar 20, 2018 at 4:33 PM, Bae, Chang Seok
<[email protected]> wrote:
> On 3/20/18, 08:05, "Andy Lutomirski" <[email protected]> wrote:
>> I've also suggested something like this myself, but this approach is
>> far more complicated than the older approach. Was there something
>> that the old approach would break? If so, what?
> Sorry, I don't know your suggestion. Can you elaborate your suggestion?
What the old code did.
If I've understood all your emails right, when you looked at existing
ptrace users, you found that all of them that write to gs and/or
gs_base do it as part of a putregs call that writes them at the same
time. If so, then your patch does exactly the same thing that my old
patches did, but your patch is much more complicated. So why did you
add all that complexity?
>
>>> + /*
>>> + * %fs setting goes to reload its base, when tracee
>>> + * resumes without FSGSBASE (legacy). Here with FSGSBASE
>>> + * FS base is (manually) fetched from GDT/LDT when needed.
>>> + */
>>> + else if (static_cpu_has(X86_FEATURE_FSGSBASE) &&
>>> + (value != 0) && (task->thread.fsindex != value))
>>> + task->thread.fsbase = task_seg_base(task, value);
>
>> The comment above should explain why you're checking this particular
>> condition. I find the fsindex != value check to be *very* surprising.
>> On a real CPU, writing some nonzero value to %fs does the same thing
>> regardless of what the old value of %fs was.
>
> With FSGSBASE, when both index and base are not changed, base will
> be (always) fetched from GDT/LDT. This is not thought as legacy behavior
> we need to support, AFAIK.
>
>> This is_fully_covered thing is IMO overcomplicated. Why not just make
>> a separate helper set_fsgs_index_and_base() and have putregs() call it
>> when both are set at once?
>
> Using helper function here is exactly what I did at first. I thought this
> tag is simple enough and straightforward at the end. But I'm open to
> factor it out.
>
>
I retract this particular comment. But I still think that all this
complexity needs to be more clearly justified. My objection to the
old approach wasn't that I thought it was obviously wrong -- I thought
that someone needed to survey existing ptrace() users and see if
anyone needed the fancier code that you're adding. Did you find
something that needs this fancy code?
> -----Original Message-----
> From: Andy Lutomirski [mailto:[email protected]]
> Sent: 21 March 2018 01:47
Hello Andy,
> I retract this particular comment. But I still think that all this complexity needs to
> be more clearly justified. My objection to the old approach wasn't that I thought
> it was obviously wrong -- I thought that someone needed to survey existing
> ptrace() users and see if anyone needed the fancier code that you're adding. Did
> you find something that needs this fancy code?
There are 3 cases:
- only FS changed, e.g. "p $fs = ..."
- only FS_BASE changed, e.g. "p $fs_base = ..."
- both change, e.g. "p foo()" when restoring the original register state on return
from the inferior call
The ptracer may use SETREGS in all 3 cases, even though only a single register changed.
For case 1, it might make sense to change FS_BASE as a side-effect.
For case 2, we'd only want to change FS_BASE and leave FS.
For case 3, we'd want both FS and FS_BASE to be set to the ptracer-provided values.
Does that make sense?
Thanks,
Markus.
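For reference, a hedged sketch of case 3 above (illustrative only; it assumes
the x86-64 struct user_regs_struct from <sys/user.h>, which carries fs/gs and
fs_base/gs_base): the tracer snapshots all registers before the inferior call
and restores them with a single PTRACE_SETREGS, so the selector and the base
reach the kernel in the same putregs() pass.

#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/types.h>

/* Snapshot the tracee's registers before making an inferior call. */
static long save_regs(pid_t pid, struct user_regs_struct *saved)
{
	return ptrace(PTRACE_GETREGS, pid, NULL, saved);
}

/*
 * Restore everything afterwards: fs, gs, fs_base and gs_base all arrive
 * in one PTRACE_SETREGS, which is the case the series wants to keep
 * working unchanged.
 */
static long restore_regs(pid_t pid, const struct user_regs_struct *saved)
{
	return ptrace(PTRACE_SETREGS, pid, NULL, (void *)saved);
}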
On 3/20/18, 17:47, "Andy Lutomirski" <[email protected]> wrote:
> If I've understood all your emails right, when you looked at existing
> ptrace users, you found that all of them that write to gs and/or
> gs_base do it as part of a putregs call that writes them at the same
> time. If so, then your patch does exactly the same thing that my old
> patches did, but your patch is much more complicated. So why did you
> add all that complexity?
What I tried to provide is backward compatibility: emulate
“mov gs (fs) …” when only the index is changed, and preserve a (given) base value
in the other cases. If the ptracer changes the GS index between PTRACE_GETREGS
and PTRACE_SETREGS, the tracee does not get its GS base from the GDT/LDT
when it resumes, with the patch [1]. A task’s GS base is preserved at schedule-in
when FSGSBASE is enabled.
Code-wise, no clean implementation was found within the existing putreg().
putreg() handles each register in isolation, while the backward
compatibility needs to derive one value from the other.
I would like to know a better way without introducing putregs() (and
setting the registers in reverse order).
[1] https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fsgsbase&id=cfaf9911b88930ca6e4d0173fe8a58d2ea4ee6fb
> Summary please.
These are the deltas I found when comparing with your old code [4]:
* Paranoid entry changes to avoid the vulnerability
* Ptrace changes to support the legacy behavior (conditionally)
* Dropped [1] to align with what was concluded in [3]
* Dropped [2] since save_fsgs() doesn’t copy the base when the index is zero; now open-coded with FSGSBASE instructions
* Rewrote the assembly macros and dropped the (redundant) SWAPGS intrinsic
* Revised the helper functions -- dropped unused helpers and factored out write_task_fsbase()/write_task_gsbase()
* Introduced all of the helper functions at the beginning of the patch series
* Factored out load_fsgs()
* One of the test cases will be included later (as 15 patches are already published)
[1] Refresh FS and GS when modify_ldt changes an entry
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fsgsbase&id=fbe2c2a0ba5b879f8229ca7dfc0d434dbe31a805
[2] When copying a thread, safe FS and GS for real
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fsgsbase&id=e51eceeefbd587b514ca357b745b46e8e2746d34
[3] FSGSBASE ABI discussion
https://marc.info/?l=linux-kernel&m=150212735223392&w=2
[4] https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fsgsbase
On 03/20/18 17:36, Andy Lutomirski wrote:
>>
>> But we also don't need swapgs when we have rdgsbase/wrgsbase available.
>> We can indeed just unconditionally save it (via rdgsbase) into the stack
>> frame and wrgsbase the correct percpu value. In that case it might be
>> necessary in order to avoid insane complexity to also save/restore the
>> gs selector.
>
> This is exactly what the old code did. I liked the old code better.
>
>>
>> Is it going to be faster? *Probably* not as swapgs is designed to be
>> fast; it does, however, eliminate the need to RDMSR/WRMSR inside the
>> kernel task switch as the user space gsbase will simply live on the
>> stack. (This is assuming we do this unconditionally on every method of
>> kernel entry, including non-paranoid. I'm not sure if we ever care
>> about the userspace GS/GSBASE inside a paranoid handler, but if we do it
>> would be rather messy to find if we do this conditionally.
>>
>> Now...
>>
>> + ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", \
>> + "RDGSBASE %rdx", X86_FEATURE_FSGSBASE
>> + READ_KERNEL_GSBASE %rax
>>
>> READ_KERNEL_GSBASE here seems like a Really Bad Name[TM] for this macro,
>> since it seems to imply reading MSR_KERNEL_GS_BASE, rather than finding
>> the current percpu offset. I would prefer calling it something like
>> FIND_PERCPU_BASE or something like that.
>
> I think we should revert to what the old patches did here.
>
I don't really understand why you want to do it this way.
1. It means that the location of the user space gs_base is ill-defined,
whereas with the SWAPGS it is *always* in the same place.
2. It is most likely slower, although I obviously haven't benchmarked
it.
-hpa
On 03/20/18 07:58, Andy Lutomirski wrote:
>> On Mar 19, 2018, at 10:49 AM, Chang S. Bae <[email protected]> wrote:
>>
>> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
>> GS base is not the kernel's.
>>
>> FSGSBASE instructions allow user to write any value on GS base;
>> even negative. Sign check on the current GS base is not
>> sufficient. Fortunately, reading GS base is fast. Kernel GS
>> base is also known from the offset table with the CPU number.
>
> The original version of these patches (mine and Andi’s) didn’t have
> this comparison, didn’t need RDMSR, and didn’t allow malicious user
> programs to cause the kernel to run decently large chunks of code with
> the reverse of the expected GS convention. Why did you change it?
>
> I really really don't like having a corner case like this that can and
> will be triggered by malicious user code but that is hard to write a
> self-test for because it involves guessing a 64-bit magic number.
> Untestable corner cases in the x86 entry code are bad.
>
What corner case are you talking about?
If user GS_BASE and kernel GS_BASE happen to be identical, then SWAPGS
is a nop and it does not matter one iota which one is user space and
which is kernel space. They are just numbers, and swapping one number
with itself doesn't do anything (in fact, opcode 0x90 was used for NOP
because it aliased to xchg [e]ax,[e]ax on pre-64-bit hardware.)
As far as running "decently large chunks" goes, this is *exactly* the same
amount of code that is run on non-FSGSBASE hardware, so it Has To
Work[TM] anyway. If anything this may help catch problems sooner.
This minimizes the difference between the FSGSBASE and non-FSGSBASE
flows, which I would consider a significant win.
-hpa
On Wed, Mar 21, 2018 at 10:03 PM, H. Peter Anvin <[email protected]> wrote:
> On 03/20/18 07:58, Andy Lutomirski wrote:
>>> On Mar 19, 2018, at 10:49 AM, Chang S. Bae <[email protected]> wrote:
>>>
>>> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
>>> GS base is not the kernel's.
>>>
>>> FSGSBASE instructions allow user to write any value on GS base;
>>> even negative. Sign check on the current GS base is not
>>> sufficient. Fortunately, reading GS base is fast. Kernel GS
>>> base is also known from the offset table with the CPU number.
>>
>> The original version of these patches (mine and Andi’s) didn’t have
>> this comparison, didn’t need RDMSR, and didn’t allow malicious user
>> programs to cause the kernel to run decently large chunks of code with
>> the reverse of the expected GS convention. Why did you change it?
>>
>> I really really don't like having a corner case like this that can and
>> will be triggered by malicious user code but that is hard to write a
>> self-test for because it involves guessing a 64-bit magic number.
>> Untestable corner cases in the x86 entry code are bad.
>>
>
> What corner case are you talking about?
>
> If user GS_BASE and kernel GS_BASE happen to be identical, then SWAPGS
> is a nop and it does not matter one iota which one is user space and
> which is kernel space. They are just numbers, and swapping one number
> with itself doesn't do anything (in fact, opcode 0x90 was used for NOP
> because it aliased to xchg [e]ax,[e]ax on pre-64-bit hardware.)
>
On current kernels, MSR_GS_BASE points to kernel percpu data and
MSR_KERNEL_GS_BASE is the user's GSBASE value. If you *write* to
MSR_KERNEL_GS_BASE, you modify the user's value.
With Andi's/my patches, it works exactly the same way on !FSGSBASE
and, if FSGSBASE is on, then, when we're in a paranoid entry,
MSR_GS_BASE is the live kernel value, the value upon return is stashed
in an entry asm register, and MSR_KERNEL_GS_BASE is whatever it was
before we entered. Sure, it's more complicated, but if something
starts writing to MSR_KERNEL_GS_BASE in the context of a paranoid
entry, the behavior will at least be consistently screwy.
With these patches, MSR_GS_BASE points to the kernel percpu data and
MSR_KERNEL_GS_BASE is the user's value, but writing to
MSR_KERNEL_GS_BASE will change the *kernel* value if we happen to be
in a paranoid context while running malicious user code. But only
when running malicious user code.
In the absence of some compelling reason why #3 is better than #2, I
don't like it.
If you want to argue for using rdpid or lsl to find the kernel gs base
and then load it unconditionally with wrgsbase, I'd be fine with that.
But this compare-and-swapgs just seems unnecessarily subject to
manipulation.
On Wed, Mar 21, 2018 at 7:01 AM, Metzger, Markus T
<[email protected]> wrote:
>> -----Original Message-----
>> From: Andy Lutomirski [mailto:[email protected]]
>> Sent: 21 March 2018 01:47
>
> Hello Andy,
>
>> I retract this particular comment. But I still think that all this complexity needs to
>> be more clearly justified. My objection to the old approach wasn't that I thought
>> it was obviously wrong -- I thought that someone needed to survey existing
>> ptrace() users and see if anyone needed the fancier code that you're adding. Did
>> you find something that needs this fancy code?
>
> There are 3 cases:
> - only FS changed, e.g. "p $fs = ..."
If a person literally types that, I have no problem if they get
different behavior on different systems. If it's some existing
program that needs to keep working, then we may be out of luck, since
gdb seems to use SETREGS.
> - only FS_BASE changed, e.g. "p $fs_base = ..."
That's fine no matter what, I think.
> - both change, e.g. "p foo()" when restoring the original register state on return
> from the inferior call
This is the interesting case. If everything that matters uses
SETREGS, then I think we're in good shape, and we should probably just
make writing to FS using ptrace on FSGSBASE systems *not* change
FSBASE, since that way is much simpler.
On Wed, Mar 21, 2018 at 3:11 PM, Bae, Chang Seok
<[email protected]> wrote:
> On 3/20/18, 17:47, "Andy Lutomirski" <[email protected]> wrote:
>> If I've understood all your emails right, when you looked at existing
>> ptrace users, you found that all of them that write to gs and/or
>> gs_base do it as part of a putregs call that writes them at the same
>> time. If so, then your patch does exactly the same thing that my old
>> patches did, but your patch is much more complicated. So why did you
>> add all that complexity?
>
> What is tried to be provided is backward compatibility by emulating
> “mov gs (fs) …” when index is only changed and preserve a (given) base value
> in other cases.
mov to gs changes GSBASE even if GS was unchanged.
But it's not clear to me that you've identified any case where
emulating this behavior is useful.
On 3/21/18, 18:41, "Andy Lutomirski" <[email protected]> wrote:
> mov to gs changes GSBASE even if GS was unchanged.
In GDB, the ptrace syscall doesn't happen when FS/GS is unchanged, as
its (register context) cache seems to be checked first. This does not allow
preserving GSBASE, as you know.
> But it's not clear to me that you've identified any case where
> emulating this behavior is useful.
One argument I heard is that (when debugging a legacy application) a user
might want to (indirectly) access the LDT during an inferior call, and this
mov to fs/gs has been useful (maybe needed).
On Thu, Mar 22, 2018 at 3:45 PM, Bae, Chang Seok
<[email protected]> wrote:
> On 3/21/18, 18:41, "Andy Lutomirski" <[email protected]> wrote:
>> mov to gs changes GSBASE even if GS was unchanged.
> In GDB, ptrace (syscall) doesn't happen when FS/GS unchanged as
> its (context) cache seems to be first checked. This does not allow to
> preserve GSBASE as you know.
>
>> But it's not clear to me that you've identified any case where
>> emulating this behavior is useful.
> One argument I heard is (if debugging a legacy application) user
> might want to (indirectly) access LDT during inferior call and this
> mov to fs/gs has been useful (maybe needed).
>
>
But your patch doesn't actually do this, since gdb will just do
SETREGS anyway, right?
> But your patch doesn't actually do this, since gdb will just do
> SETREGS anyway, right?
GDB does SETREGS for any exclusive (FS/GS) update in an inferior call.
On Thu, Mar 22, 2018 at 4:46 PM, Bae, Chang Seok
<[email protected]> wrote:
>> But your patch doesn't actually do this, since gdb will just do
>> SETREGS anyway, right?
> GDB does SETREGS on any exclusive (FS/GS) updates in inferior call.
>
>
>
This means that your patch has exactly the same effect as the code in
my git tree, right? Then let's stick with something like what's in my
git tree, since it's far simpler.
________________________________________
From: Andy Lutomirski [[email protected]]
Sent: Thursday, March 22, 2018 09:53
>>> But your patch doesn't actually do this, since gdb will just do
>>> SETREGS anyway, right?
>> GDB does SETREGS on any exclusive (FS/GS) updates in inferior call.
> This means that your patch has exactly the same effect as the code in
> my git tree, right? Then let's stick with something like what's in my
> git tree, since it's far simpler.
The difference is that, when flipping FS/GS multiple times, the user may check
the base from the LDT. But I don't have a strong will to keep arguing this;
Markus or somebody else might want to say something.
The whole point, as I understand it, is to avoid any regression for legacy
ptracers. If there is strong confidence in the simple version, let me know.
My first thought was the simple version, in fact.
Thanks,
Chang
On Thu, Mar 22, 2018 at 9:17 PM, Bae, Chang Seok
<[email protected]> wrote:
>
> ________________________________________
> From: Andy Lutomirski [[email protected]]
> Sent: Thursday, March 22, 2018 09:53
>>>> But your patch doesn't actually do this, since gdb will just do
>>>> SETREGS anyway, right?
>>> GDB does SETREGS on any exclusive (FS/GS) updates in inferior call.
>
>> This means that your patch has exactly the same effect as the code in
>> my git tree, right? Then let's stick with something like what's in my
>> git tree, since it's far simpler.
>
> Difference is if flipping FS/GS multiple times, user may check the base from LDT.
> But I don't have strong will to keep arguing this; Markus or somebody might
> want to say something.
>
> The whole point as I understand is to avoid any regression on legacy ptracers.
> If a strong confidence lies on the simple version, let me. My first thought bought
> this in fact.
I agree that we want to avoid regressions, but you seem to have
discovered that basically all ptracers use SETREGS. Your patch
behaves the same as the simple patch if SETREGS is used.
>
> Thanks,
> Chang