2016-03-06 05:52:31

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH 00/10] x86: Various SYSENTER/SYSEXIT/#DB fixes and cleanups

hpa asked me to get rid of the ASM_CLAC at the beginning of the SYSENTER
path. Little did he know...

This series makes the observed behavior of SYSENTER wrt flags the same
for all sane flags and kernel bitnesses. That is, SYSENTER preserves
flags now unless you do a syscall that explicitly changes flags, and
the HW flags that the syscall executes with are sanitized. This
includes NT, TF, AC and all arithmetic flags. Prior to this series,
32-bit kernels clobbered TF and the arithmetic flags and behaved
highly erratically if NT was set. (If IF is cleared by evil userspace
when SYSENTER starts, IF will be set again on return. There's nothing
the kernel can do about this -- SYSENTER inherently forgets the state
of IF.)

This series speeds up SYSENTER on all kernels by a surprisingly large
amount on Skylake because it eliminates an unconditional CLAC.

While SYSENTER used to handle TF correctly as far as I can tell on
64-bit kernels, the means by which it did so was heavily tangled up in
the ptrace single-step logic. It now works just like all the other
kernel entries except insofar as do_debug has a simple special case
for it. Relatedly, the bizarre and poorly explained old fixup in
do_debug is now hidden behind a WARN_ON_ONCE in preparation for
deleting it at some point.

The code that fixed up NMI and #DB early in SYSENTER in 32-bit kernels
used to be both terrifying and incorrect. (It doesn't appear to have
been exploitably bad, but the reason for that is subtle, and the code
was certainy more fragile than it deserved to me.) We still need a
special fixup, but it's much simpler now.

While I was doing all this, I also noticed that DR6 and BTF handling
in do_debug was a bit off. Two of the patches in here try to fix it
up.

Have fun!

tl;dr: Cleanups and sanity fixes here, but no security fixes, and I
don't think anything needs to be backported or put in x86/urgent.

This series applies to the result of merging tip:x86/asm and
tip:x86/urgent. I've been testing on a somewhat bastardized base,
because tip currently doesn't work on my laptop in 32-bit mode. (That
bug is fixed in Linus' tree.)

Changes from v1:
- s/Sysenter/SYSENTER in two places (Borislav)
- int main() -> int main(void) (Borislav)

Andy Lutomirski (10):
selftests/x86: In syscall_nt, test NT|TF as well
x86/entry/compat: In SYSENTER, sink AC clearing below the existing
FLAGS test
x86/entry/32: Filter NT and speed up AC filtering in SYSENTER
x86/entry/32: Restore FLAGS on SYSEXIT
x86/traps: Clear TIF_BLOCKSTEP on all debug exceptions
x86/traps: Clear DR6 early in do_debug and improve the comment
x86/entry: Vastly simplify SYSENTER TF handling
x86/entry: Only allocate space for SYSENTER_stack if needed
x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup
x86/entry/32: Add and check a stack canary for the SYSENTER stack

arch/x86/entry/entry_32.S | 182 ++++++++++++++++++-------------
arch/x86/entry/entry_64_compat.S | 15 ++-
arch/x86/include/asm/processor.h | 5 +-
arch/x86/include/asm/proto.h | 15 ++-
arch/x86/kernel/asm-offsets_32.c | 5 +
arch/x86/kernel/process.c | 3 +
arch/x86/kernel/traps.c | 87 ++++++++++++---
tools/testing/selftests/x86/syscall_nt.c | 57 ++++++++--
8 files changed, 263 insertions(+), 106 deletions(-)

--
2.5.0


2016-03-06 05:52:35

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2 02/10] x86/entry/compat: In SYSENTER, sink AC clearing below the existing FLAGS test

CLAC is slow, and the SYSENTER code already has an unlikely path
that runs if unusual flags are set. Drop the CLAC and instead rely
on the unlikely path to clear AC.

This seems to save ~24 cycles on my Skylake laptop. (Hey, Intel,
make this faster please!)

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64_compat.S | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 89bcb4979e7a..8d3728131809 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -66,8 +66,6 @@ ENTRY(entry_SYSENTER_compat)
*/
pushfq /* pt_regs->flags (except IF = 0) */
orl $X86_EFLAGS_IF, (%rsp) /* Fix saved flags */
- ASM_CLAC /* Clear AC after saving FLAGS */
-
pushq $__USER32_CS /* pt_regs->cs */
xorq %r8,%r8
pushq %r8 /* pt_regs->ip = 0 (placeholder) */
@@ -90,9 +88,9 @@ ENTRY(entry_SYSENTER_compat)
cld

/*
- * Sysenter doesn't filter flags, so we need to clear NT
+ * SYSENTER doesn't filter flags, so we need to clear NT and AC
* ourselves. To save a few cycles, we can check whether
- * NT was set instead of doing an unconditional popfq.
+ * either was set instead of doing an unconditional popfq.
* This needs to happen before enabling interrupts so that
* we don't get preempted with NT set.
*
@@ -102,7 +100,7 @@ ENTRY(entry_SYSENTER_compat)
* we're keeping that code behind a branch which will predict as
* not-taken and therefore its instructions won't be fetched.
*/
- testl $X86_EFLAGS_NT, EFLAGS(%rsp)
+ testl $X86_EFLAGS_NT|X86_EFLAGS_AC, EFLAGS(%rsp)
jnz .Lsysenter_fix_flags
.Lsysenter_flags_fixed:

--
2.5.0

2016-03-06 05:52:53

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2 03/10] x86/entry/32: Filter NT and speed up AC filtering in SYSENTER

This makes the 32-bit code work just like the 64-bit code. It should
speed up syscalls on 32-bit kernels on Skylake by something like 20
cycles (by analogy to the 64-bit compat case).

It also cleans up NT just like we do for the 64-bit case.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_32.S | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index ab710eee4308..289a17bf0c71 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -294,7 +294,6 @@ sysenter_past_esp:
pushl $__USER_DS /* pt_regs->ss */
pushl %ebp /* pt_regs->sp (stashed in bp) */
pushfl /* pt_regs->flags (except IF = 0) */
- ASM_CLAC /* Clear AC after saving FLAGS */
orl $X86_EFLAGS_IF, (%esp) /* Fix IF */
pushl $__USER_CS /* pt_regs->cs */
pushl $0 /* pt_regs->ip = 0 (placeholder) */
@@ -302,6 +301,23 @@ sysenter_past_esp:
SAVE_ALL pt_regs_ax=$-ENOSYS /* save rest */

/*
+ * SYSENTER doesn't filter flags, so we need to clear NT and AC
+ * ourselves. To save a few cycles, we can check whether
+ * either was set instead of doing an unconditional popfq.
+ * This needs to happen before enabling interrupts so that
+ * we don't get preempted with NT set.
+ *
+ * NB.: .Lsysenter_fix_flags is a label with the code under it moved
+ * out-of-line as an optimization: NT is unlikely to be set in the
+ * majority of the cases and instead of polluting the I$ unnecessarily,
+ * we're keeping that code behind a branch which will predict as
+ * not-taken and therefore its instructions won't be fetched.
+ */
+ testl $X86_EFLAGS_NT|X86_EFLAGS_AC, PT_EFLAGS(%esp)
+ jnz .Lsysenter_fix_flags
+.Lsysenter_flags_fixed:
+
+ /*
* User mode is traced as though IRQs are on, and SYSENTER
* turned them off.
*/
@@ -339,6 +355,11 @@ sysenter_past_esp:
.popsection
_ASM_EXTABLE(1b, 2b)
PTGS_TO_GS_EX
+
+.Lsysenter_fix_flags:
+ pushl $X86_EFLAGS_FIXED
+ popfl
+ jmp .Lsysenter_flags_fixed
ENDPROC(entry_SYSENTER_32)

# system call handler stub
--
2.5.0

2016-03-06 05:53:00

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2 06/10] x86/traps: Clear DR6 early in do_debug and improve the comment

Leaving any bits set in DR6 on return from a debug exception is
asking for trouble. Prevent it by writing zero right away and
clarify the comment.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/traps.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 19e6cfa501e3..6dddc220e3ed 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -593,6 +593,18 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
ist_enter(regs);

get_debugreg(dr6, 6);
+ /*
+ * The Intel SDM says:
+ *
+ * Certain debug exceptions may clear bits 0-3. The remaining
+ * contents of the DR6 register are never cleared by the
+ * processor. To avoid confusion in identifying debug
+ * exceptions, debug handlers should clear the register before
+ * returning to the interrupted task.
+ *
+ * Keep it simple: clear DR6 immediately.
+ */
+ set_debugreg(0, 6);

/* Filter out all the reserved bits which are preset to 1 */
dr6 &= ~DR6_RESERVED;
@@ -616,9 +628,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
goto exit;

- /* DR6 may or may not be cleared by the CPU */
- set_debugreg(0, 6);
-
/* Store the virtualized DR6 value */
tsk->thread.debugreg6 = dr6;

--
2.5.0

2016-03-06 05:53:09

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

Due to a blatant design error, SYSENTER doesn't clear TF. As a result,
if a user does SYSENTER with TF set, we will single-step through the
kernel until something clears TF. There is absolutely nothing we can
do to prevent this short of turning off SYSENTER [1].

Simplify the handling considerably with two changes:

1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
add TF to that list of flags to sanitize with no overhead whatsoever.

2. Teach do_debug to ignore single-step traps in the SYSENTER prologue.

That's all we need to do.

Don't get too excited -- our handling is still buggy on 32-bit
kernels. There's nothing wrong with the SYSENTER code itself, but
the #DB prologue has a clever fixup for traps on the very first
instruction of entry_SYSENTER_32, and the fixup doesn't work quite
correctly. The next two patches will fix that.

[1] We could probably prevent it by forcing BTF on at all times and
making sure we clear TF before any branches in the SYSENTER
code. Needless to say, this is a bad idea.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_32.S | 42 ++++++++++++++++++++++----------
arch/x86/entry/entry_64_compat.S | 9 ++++++-
arch/x86/include/asm/proto.h | 15 ++++++++++--
arch/x86/kernel/traps.c | 52 +++++++++++++++++++++++++++++++++-------
4 files changed, 94 insertions(+), 24 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index a8c3424c3392..7700cf695d8c 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -287,7 +287,26 @@ need_resched:
END(resume_kernel)
#endif

- # SYSENTER call handler stub
+GLOBAL(__begin_SYSENTER_singlestep_region)
+/*
+ * All code from here through __end_SYSENTER_singlestep_region is subject
+ * to being single-stepped if a user program sets TF and executes SYSENTER.
+ * There is absolutely nothing that we can do to prevent this from happening
+ * (thanks Intel!). To keep our handling of this situation as simple as
+ * possible, we handle TF just like AC and NT, except that our #DB handler
+ * will ignore all of the single-step traps generated in this range.
+ */
+
+#ifdef CONFIG_XEN
+/*
+ * Xen doesn't set %esp to be precisely what the normal SYSENTER
+ * entry point expects, so fix it up before using the normal path.
+ */
+ENTRY(xen_sysenter_target)
+ addl $5*4, %esp /* remove xen-provided frame */
+ jmp sysenter_past_esp
+#endif
+
ENTRY(entry_SYSENTER_32)
movl TSS_sysenter_sp0(%esp), %esp
sysenter_past_esp:
@@ -301,19 +320,25 @@ sysenter_past_esp:
SAVE_ALL pt_regs_ax=$-ENOSYS /* save rest */

/*
- * SYSENTER doesn't filter flags, so we need to clear NT and AC
- * ourselves. To save a few cycles, we can check whether
+ * SYSENTER doesn't filter flags, so we need to clear NT, AC
+ * and TF ourselves. To save a few cycles, we can check whether
* either was set instead of doing an unconditional popfq.
* This needs to happen before enabling interrupts so that
* we don't get preempted with NT set.
*
+ * If TF is set, we will single-step all the way to here -- do_debug
+ * will ignore all the traps. (Yes, this is slow, but so is
+ * single-stepping in general. This allows us to avoid having
+ * a more complicated code to handle the case where a user program
+ * forces us to single-step through the SYSENTER entry code.)
+ *
* NB.: .Lsysenter_fix_flags is a label with the code under it moved
* out-of-line as an optimization: NT is unlikely to be set in the
* majority of the cases and instead of polluting the I$ unnecessarily,
* we're keeping that code behind a branch which will predict as
* not-taken and therefore its instructions won't be fetched.
*/
- testl $X86_EFLAGS_NT|X86_EFLAGS_AC, PT_EFLAGS(%esp)
+ testl $X86_EFLAGS_NT|X86_EFLAGS_AC|X86_EFLAGS_TF, PT_EFLAGS(%esp)
jnz .Lsysenter_fix_flags
.Lsysenter_flags_fixed:

@@ -369,6 +394,7 @@ sysenter_past_esp:
pushl $X86_EFLAGS_FIXED
popfl
jmp .Lsysenter_flags_fixed
+GLOBAL(__end_SYSENTER_singlestep_region)
ENDPROC(entry_SYSENTER_32)

# system call handler stub
@@ -662,14 +688,6 @@ ENTRY(spurious_interrupt_bug)
END(spurious_interrupt_bug)

#ifdef CONFIG_XEN
-/*
- * Xen doesn't set %esp to be precisely what the normal SYSENTER
- * entry point expects, so fix it up before using the normal path.
- */
-ENTRY(xen_sysenter_target)
- addl $5*4, %esp /* remove xen-provided frame */
- jmp sysenter_past_esp
-
ENTRY(xen_hypervisor_callback)
pushl $-1 /* orig_ax = -1 => not a system call */
SAVE_ALL
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 8d3728131809..26469c11796d 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -94,13 +94,19 @@ ENTRY(entry_SYSENTER_compat)
* This needs to happen before enabling interrupts so that
* we don't get preempted with NT set.
*
+ * If TF is set, we will single-step all the way to here -- do_debug
+ * will ignore all the traps. (Yes, this is slow, but so is
+ * single-stepping in general. This allows us to avoid having
+ * a more complicated code to handle the case where a user program
+ * forces us to single-step through the SYSENTER entry code.)
+ *
* NB.: .Lsysenter_fix_flags is a label with the code under it moved
* out-of-line as an optimization: NT is unlikely to be set in the
* majority of the cases and instead of polluting the I$ unnecessarily,
* we're keeping that code behind a branch which will predict as
* not-taken and therefore its instructions won't be fetched.
*/
- testl $X86_EFLAGS_NT|X86_EFLAGS_AC, EFLAGS(%rsp)
+ testl $X86_EFLAGS_NT|X86_EFLAGS_AC|X86_EFLAGS_TF, EFLAGS(%rsp)
jnz .Lsysenter_fix_flags
.Lsysenter_flags_fixed:

@@ -121,6 +127,7 @@ ENTRY(entry_SYSENTER_compat)
pushq $X86_EFLAGS_FIXED
popfq
jmp .Lsysenter_flags_fixed
+GLOBAL(__end_entry_SYSENTER_compat)
ENDPROC(entry_SYSENTER_compat)

/*
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index a4a77286cb1d..9b9b30b19441 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -7,12 +7,23 @@

void syscall_init(void);

+#ifdef CONFIG_X86_64
void entry_SYSCALL_64(void);
-void entry_SYSCALL_compat(void);
+#endif
+
+#ifdef CONFIG_X86_32
void entry_INT80_32(void);
-void entry_INT80_compat(void);
void entry_SYSENTER_32(void);
+void __begin_SYSENTER_singlestep_region(void);
+void __end_SYSENTER_singlestep_region(void);
+#endif
+
+#ifdef CONFIG_IA32_EMULATION
void entry_SYSENTER_compat(void);
+void __end_entry_SYSENTER_compat(void);
+void entry_SYSCALL_compat(void);
+void entry_INT80_compat(void);
+#endif

void x86_configure_nx(void);
void x86_report_nx(void);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6dddc220e3ed..80928ea78373 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -559,6 +559,29 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
NOKPROBE_SYMBOL(fixup_bad_iret);
#endif

+static bool is_sysenter_singlestep(struct pt_regs *regs)
+{
+ /*
+ * We don't try for precision here. If we're anywhere in the region of
+ * code that can be single-stepped in the SYSENTER entry path, then
+ * assume that this is a useless single-step trap due to SYSENTER
+ * being invoked with TF set. (We don't know in advance exactly
+ * which instructions will be hit because BTF could plausibly
+ * be set.
+ */
+#ifdef CONFIG_X86_32
+ return (regs->ip - (unsigned long)__begin_SYSENTER_singlestep_region) <
+ (unsigned long)__end_SYSENTER_singlestep_region -
+ (unsigned long)__begin_SYSENTER_singlestep_region;
+#elif defined(CONFIG_IA32_EMULATION)
+ return (regs->ip - (unsigned long)entry_SYSENTER_compat) <
+ (unsigned long)__end_entry_SYSENTER_compat -
+ (unsigned long)entry_SYSENTER_compat;
+#else
+ return false;
+#endif
+}
+
/*
* Our handling of the processor debug registers is non-trivial.
* We do not clear them on entry and exit from the kernel. Therefore
@@ -616,6 +639,18 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
*/
clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);

+ if (unlikely(!user_mode(regs) && (dr6 & DR_STEP) &&
+ is_sysenter_singlestep(regs))) {
+ dr6 &= ~DR_STEP;
+ if (!dr6)
+ goto exit;
+ /*
+ * else we might have gotten a single-step trap and hit a
+ * watchpoint at the same time, in which case we should fall
+ * through and handle the watchpoint.
+ */
+ }
+
/*
* If dr6 has no reason to give us about the origin of this trap,
* then it's very likely the result of an icebp/int01 trap.
@@ -624,7 +659,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
if (!dr6 && user_mode(regs))
user_icebp = 1;

- /* Catch kmemcheck conditions first of all! */
+ /* Catch kmemcheck conditions! */
if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
goto exit;

@@ -659,14 +694,13 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
goto exit;
}

- /*
- * Single-stepping through system calls: ignore any exceptions in
- * kernel space, but re-enable TF when returning to user mode.
- *
- * We already checked v86 mode above, so we can check for kernel mode
- * by just checking the CPL of CS.
- */
- if ((dr6 & DR_STEP) && !user_mode(regs)) {
+ if (WARN_ON_ONCE((dr6 & DR_STEP) && !user_mode(regs))) {
+ /*
+ * Historical junk that used to handle SYSENTER single-stepping.
+ * This should be unreachable now. If we survive for a while
+ * without anyone hitting this warning, we'll turn this into
+ * an oops.
+ */
tsk->thread.debugreg6 &= ~DR_STEP;
set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
regs->flags &= ~X86_EFLAGS_TF;
--
2.5.0

2016-03-06 05:53:16

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2 08/10] x86/entry: Only allocate space for SYSENTER_stack if needed

The SYSENTER stack is only used on 32-bit kernels. Remove it in
64-bit kernels.

(We may end up using it down the road on 64-bit kernels. If so,
we'll re-enable it for CONFIG_IA32_EMULATION.)

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/processor.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index ecb410310e70..7cd01b71b5bd 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -297,10 +297,12 @@ struct tss_struct {
*/
unsigned long io_bitmap[IO_BITMAP_LONGS + 1];

+#ifdef CONFIG_X86_32
/*
* Space for the temporary SYSENTER stack:
*/
unsigned long SYSENTER_stack[64];
+#endif

} ____cacheline_aligned;

--
2.5.0

2016-03-06 05:53:22

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2 05/10] x86/traps: Clear TIF_BLOCKSTEP on all debug exceptions

The SDM says that debug exceptions clear BTF, and we need to keep
TIF_BLOCKSTEP in sync with BTF. Clear it unconditionally and improve
the comment.

I suspect that the fact that kmemcheck could cause TIF_BLOCKSTEP not
to be cleared was just an oversight.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/traps.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index dd2c2e66c2e1..19e6cfa501e3 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -598,6 +598,13 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
dr6 &= ~DR6_RESERVED;

/*
+ * The SDM says "The processor clears the BTF flag when it
+ * generates a debug exception." Clear TIF_BLOCKSTEP to keep
+ * TIF_BLOCKSTEP in sync with the hardware BTF flag.
+ */
+ clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);
+
+ /*
* If dr6 has no reason to give us about the origin of this trap,
* then it's very likely the result of an icebp/int01 trap.
* User wants a sigtrap for that.
@@ -612,11 +619,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
/* DR6 may or may not be cleared by the CPU */
set_debugreg(0, 6);

- /*
- * The processor cleared BTF, so don't mark that we need it set.
- */
- clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);
-
/* Store the virtualized DR6 value */
tsk->thread.debugreg6 = dr6;

--
2.5.0

2016-03-06 05:53:29

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2 09/10] x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup

Right after SYSENTER, we can get a #DB or NMI. On x86_32, there's no IST,
so the exception handler is invoked on the temporary SYSENTER stack.

Because the SYSENTER stack is very small, we have a fixup to switch
off the stack quickly when this happens. The old fixup had several issues:

1. It checked the interrupt frame's CS and EIP. This wasn't
obviously correct on Xen or if vm86 mode was in use [1].

2. In the NMI handler, it did some frightening digging into the
stack frame. I'm not convinced this digging was correct.

3. The fixup didn't switch stacks and then switch back. Instead, it
synthesized a brand new stack frame that would redirect the IRET
back to the SYSENTER code. That frame was highly questionable.
For one thing, if NMI nested inside #DB, we would effectively
abort the #DB prologue, which was probably safe but was
frightening. For another, the code used PUSHFL to write the
FLAGS portion of the frame, which was simply bogus -- by the time
PUSHFL was called, at least TF, NT, VM, and all of the arithmetic
flags were clobbered.

Simplify this considerably. Instead of looking at the saved frame
to see where we came from, check the hardware ESP register against
the SYSENTER stack directly. Malicious user code cannot spoof the
kernel ESP register, and by moving the check after SAVE_ALL, we can
use normal PER_CPU accesses to find all the relevant addresses.

With this patch applied, the improved syscall_nt_32 test finally
passes on 32-bit kernels.

[1] It isn't obviously correct, but it is nonetheless safe from vm86
shenanigans as far as I can tell. A user can't point EIP at
entry_SYSENTER_32 while in vm86 mode because entry_SYSENTER_32,
like all kernel addresses, is greater than 0xffff and would thus
violate the CS segment limit.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_32.S | 114 ++++++++++++++++++---------------------
arch/x86/kernel/asm-offsets_32.c | 5 ++
2 files changed, 56 insertions(+), 63 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 7700cf695d8c..ad9a85e62128 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -987,51 +987,48 @@ error_code:
jmp ret_from_exception
END(page_fault)

-/*
- * Debug traps and NMI can happen at the one SYSENTER instruction
- * that sets up the real kernel stack. Check here, since we can't
- * allow the wrong stack to be used.
- *
- * "TSS_sysenter_sp0+12" is because the NMI/debug handler will have
- * already pushed 3 words if it hits on the sysenter instruction:
- * eflags, cs and eip.
- *
- * We just load the right stack, and push the three (known) values
- * by hand onto the new stack - while updating the return eip past
- * the instruction that would have done it for sysenter.
- */
-.macro FIX_STACK offset ok label
- cmpw $__KERNEL_CS, 4(%esp)
- jne \ok
-\label:
- movl TSS_sysenter_sp0 + \offset(%esp), %esp
- pushfl
- pushl $__KERNEL_CS
- pushl $sysenter_past_esp
-.endm
-
ENTRY(debug)
+ /*
+ * #DB can happen at the first instruction of
+ * entry_SYSENTER_32 or in Xen's SYSENTER prologue. If this
+ * happens, then we will be running on a very small stack. We
+ * need to detect this condition and switch to the thread
+ * stack before calling any C code at all.
+ *
+ * If you edit this code, keep in mind that NMIs can happen in here.
+ */
ASM_CLAC
- cmpl $entry_SYSENTER_32, (%esp)
- jne debug_stack_correct
- FIX_STACK 12, debug_stack_correct, debug_esp_fix_insn
-debug_stack_correct:
pushl $-1 # mark this as an int
SAVE_ALL
- TRACE_IRQS_OFF
xorl %edx, %edx # error code 0
movl %esp, %eax # pt_regs pointer
+
+ /* Are we currently on the SYSENTER stack? */
+ PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+ subl %eax, %ecx /* ecx = (end of SYENTER_stack) - esp */
+ cmpl $SIZEOF_SYSENTER_stack, %ecx
+ jb .Ldebug_from_sysenter_stack
+
+ TRACE_IRQS_OFF
+ call do_debug
+ jmp ret_from_exception
+
+.Ldebug_from_sysenter_stack:
+ /* We're on the SYSENTER stack. Switch off. */
+ movl %esp, %ebp
+ movl PER_CPU_VAR(cpu_current_top_of_stack), %esp
+ TRACE_IRQS_OFF
call do_debug
+ movl %ebp, %esp
jmp ret_from_exception
END(debug)

/*
- * NMI is doubly nasty. It can happen _while_ we're handling
- * a debug fault, and the debug fault hasn't yet been able to
- * clear up the stack. So we first check whether we got an
- * NMI on the sysenter entry path, but after that we need to
- * check whether we got an NMI on the debug path where the debug
- * fault happened on the sysenter path.
+ * NMI is doubly nasty. It can happen on the first instruction of
+ * entry_SYSENTER_32 (just like #DB), but it can also interrupt the beginning
+ * of the #DB handler even if that #DB in turn hit before entry_SYSENTER_32
+ * switched stacks. We handle both conditions by simply checking whether we
+ * interrupted kernel code running on the SYSENTER stack.
*/
ENTRY(nmi)
ASM_CLAC
@@ -1042,41 +1039,32 @@ ENTRY(nmi)
popl %eax
je nmi_espfix_stack
#endif
- cmpl $entry_SYSENTER_32, (%esp)
- je nmi_stack_fixup
- pushl %eax
- movl %esp, %eax
- /*
- * Do not access memory above the end of our stack page,
- * it might not exist.
- */
- andl $(THREAD_SIZE-1), %eax
- cmpl $(THREAD_SIZE-20), %eax
- popl %eax
- jae nmi_stack_correct
- cmpl $entry_SYSENTER_32, 12(%esp)
- je nmi_debug_stack_check
-nmi_stack_correct:
- pushl %eax
+
+ pushl %eax # pt_regs->orig_ax
SAVE_ALL
xorl %edx, %edx # zero error code
movl %esp, %eax # pt_regs pointer
+
+ /* Are we currently on the SYSENTER stack? */
+ PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+ subl %eax, %ecx /* ecx = (end of SYENTER_stack) - esp */
+ cmpl $SIZEOF_SYSENTER_stack, %ecx
+ jb .Lnmi_from_sysenter_stack
+
+ /* Not on SYSENTER stack. */
call do_nmi
jmp restore_all_notrace

-nmi_stack_fixup:
- FIX_STACK 12, nmi_stack_correct, 1
- jmp nmi_stack_correct
-
-nmi_debug_stack_check:
- cmpw $__KERNEL_CS, 16(%esp)
- jne nmi_stack_correct
- cmpl $debug, (%esp)
- jb nmi_stack_correct
- cmpl $debug_esp_fix_insn, (%esp)
- ja nmi_stack_correct
- FIX_STACK 24, nmi_stack_correct, 1
- jmp nmi_stack_correct
+.Lnmi_from_sysenter_stack:
+ /*
+ * We're on the SYSENTER stack. Switch off. No one (not even debug)
+ * is using the thread stack right now, so it's safe for us to use it.
+ */
+ movl %esp, %ebp
+ movl PER_CPU_VAR(cpu_current_top_of_stack), %esp
+ call do_nmi
+ movl %ebp, %esp
+ jmp restore_all_notrace

#ifdef CONFIG_X86_ESPFIX32
nmi_espfix_stack:
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index fdeb0ce07c16..ecdc1d217dc0 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -52,6 +52,11 @@ void foo(void)
DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
offsetofend(struct tss_struct, SYSENTER_stack));

+ /* Offset from cpu_tss to SYSENTER_stack */
+ OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
+ /* Size of SYSENTER_stack */
+ DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+
#if defined(CONFIG_LGUEST) || defined(CONFIG_LGUEST_GUEST) || defined(CONFIG_LGUEST_MODULE)
BLANK();
OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
--
2.5.0

2016-03-06 05:54:17

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2 10/10] x86/entry/32: Add and check a stack canary for the SYSENTER stack

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/processor.h | 3 ++-
arch/x86/kernel/process.c | 3 +++
arch/x86/kernel/traps.c | 8 ++++++++
3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7cd01b71b5bd..50a6dc871cc0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -299,8 +299,9 @@ struct tss_struct {

#ifdef CONFIG_X86_32
/*
- * Space for the temporary SYSENTER stack:
+ * Space for the temporary SYSENTER stack.
*/
+ unsigned long SYSENTER_stack_canary;
unsigned long SYSENTER_stack[64];
#endif

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9f7c21c22477..ee9a9792caeb 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -57,6 +57,9 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
*/
.io_bitmap = { [0 ... IO_BITMAP_LONGS] = ~0 },
#endif
+#ifdef CONFIG_X86_32
+ .SYSENTER_stack_canary = STACK_END_MAGIC,
+#endif
};
EXPORT_PER_CPU_SYMBOL(cpu_tss);

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 80928ea78373..590110119e6a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -713,6 +713,14 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
debug_stack_usage_dec();

exit:
+#if defined(CONFIG_X86_32)
+ /*
+ * This is the most likely code path that involves non-trivial use
+ * of the SYSENTER stack. Check that we haven't overrun it.
+ */
+ WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
+ "Overran or corrupted SYSENTER stack\n");
+#endif
ist_exit(regs);
}
NOKPROBE_SYMBOL(do_debug);
--
2.5.0

2016-03-06 05:54:28

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2 04/10] x86/entry/32: Restore FLAGS on SYSEXIT

We weren't restoring FLAGS at all on SYSEXIT. Apparently no one cared.

With this patch applied, native kernels should always honor
task_pt_regs()->flags, which opens the door for some sys_iopl
cleanups. I'll do those as a separate series, though, since getting
it right will involve tweaking some paravirt ops.

(The short version is that, before this patch, sys_iopl, invoked via
SYSENTER, wasn't guaranteed to ever transfer the updated
regs->flags, so sys_iopl had to change the hardware flags register
as well.)

Reported-by: Brian Gerst <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_32.S | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 289a17bf0c71..a8c3424c3392 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -343,6 +343,15 @@ sysenter_past_esp:
popl %eax /* pt_regs->ax */

/*
+ * Restore all flags except IF (we restore IF separately because
+ * STI gives a one-instruction window in which we won't be interrupted,
+ * whereas POPF does not.
+ */
+ addl $PT_EFLAGS-PT_DS, %esp /* point esp at pt_regs->flags */
+ btr $X86_EFLAGS_IF_BIT, (%esp)
+ popfl
+
+ /*
* Return back to the vDSO, which will pop ecx and edx.
* Don't bother with DS and ES (they already contain __USER_DS).
*/
--
2.5.0

2016-03-06 05:54:39

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v2 01/10] selftests/x86: In syscall_nt, test NT|TF as well

Setting TF prevents fastpath returns in most cases, which causes the
test to fail on 32-bit kernels because 32-bit kernels do not, in
fact, handle NT correctly on SYSENTER entries.

The next patch will fix 32-bit kernels.

Signed-off-by: Andy Lutomirski <[email protected]>
---
tools/testing/selftests/x86/syscall_nt.c | 57 +++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/x86/syscall_nt.c b/tools/testing/selftests/x86/syscall_nt.c
index 60c06af4646a..43fcab367fb0 100644
--- a/tools/testing/selftests/x86/syscall_nt.c
+++ b/tools/testing/selftests/x86/syscall_nt.c
@@ -17,6 +17,9 @@

#include <stdio.h>
#include <unistd.h>
+#include <string.h>
+#include <signal.h>
+#include <err.h>
#include <sys/syscall.h>
#include <asm/processor-flags.h>

@@ -26,6 +29,8 @@
# define WIDTH "l"
#endif

+static unsigned int nerrs;
+
static unsigned long get_eflags(void)
{
unsigned long eflags;
@@ -39,16 +44,52 @@ static void set_eflags(unsigned long eflags)
: : "rm" (eflags) : "flags");
}

-int main()
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+ int flags)
{
- printf("[RUN]\tSet NT and issue a syscall\n");
- set_eflags(get_eflags() | X86_EFLAGS_NT);
+ struct sigaction sa;
+ memset(&sa, 0, sizeof(sa));
+ sa.sa_sigaction = handler;
+ sa.sa_flags = SA_SIGINFO | flags;
+ sigemptyset(&sa.sa_mask);
+ if (sigaction(sig, &sa, 0))
+ err(1, "sigaction");
+}
+
+static void sigtrap(int sig, siginfo_t *si, void *ctx_void)
+{
+}
+
+static void do_it(unsigned long extraflags)
+{
+ unsigned long flags;
+
+ set_eflags(get_eflags() | extraflags);
syscall(SYS_getpid);
- if (get_eflags() & X86_EFLAGS_NT) {
- printf("[OK]\tThe syscall worked and NT is still set\n");
- return 0;
+ flags = get_eflags();
+ if ((flags & extraflags) == extraflags) {
+ printf("[OK]\tThe syscall worked and flags are still set\n");
} else {
- printf("[FAIL]\tThe syscall worked but NT was cleared\n");
- return 1;
+ printf("[FAIL]\tThe syscall worked but flags were cleared (flags = 0x%lx but expected 0x%lx set)\n",
+ flags, extraflags);
+ nerrs++;
}
}
+
+int main(void)
+{
+ printf("[RUN]\tSet NT and issue a syscall\n");
+ do_it(X86_EFLAGS_NT);
+
+ /*
+ * Now try it again with TF set -- TF forces returns via IRET in all
+ * cases except non-ptregs-using 64-bit full fast path syscalls.
+ */
+
+ sethandler(SIGTRAP, sigtrap, 0);
+
+ printf("[RUN]\tSet NT|TF and issue a syscall\n");
+ do_it(X86_EFLAGS_NT | X86_EFLAGS_TF);
+
+ return nerrs == 0 ? 0 : 1;
+}
--
2.5.0

2016-03-06 08:22:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/10] x86: Various SYSENTER/SYSEXIT/#DB fixes and cleanups


* Andy Lutomirski <[email protected]> wrote:

> This series applies to the result of merging tip:x86/asm and
> tip:x86/urgent. I've been testing on a somewhat bastardized base,
> because tip currently doesn't work on my laptop in 32-bit mode. (That
> bug is fixed in Linus' tree.)

Hm, which fix is that?

Generally, tip:master closely tracks Linus's latest (so it should work as-is), but
individual sub-trees are only updated when necessary - but a refresh can be done
anytime if requested.

Thanks,

Ingo

2016-03-06 08:31:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/10] x86: Various SYSENTER/SYSEXIT/#DB fixes and cleanups


* Andy Lutomirski <[email protected]> wrote:

> hpa asked me to get rid of the ASM_CLAC at the beginning of the SYSENTER
> path. Little did he know...

Btw., before we further change this code, something else I think would be very
useful. We have countless system call entry points on x86 CPUs, and they are now
consistently named and are very easy to grep for:

triton:~/tip> git grep 'ENTRY(entry_' arch/x86/entry/
arch/x86/entry/entry_32.S:ENTRY(entry_SYSENTER_32)
arch/x86/entry/entry_32.S:ENTRY(entry_INT80_32)
arch/x86/entry/entry_64.S:ENTRY(entry_SYSCALL_64)
arch/x86/entry/entry_64_compat.S:ENTRY(entry_SYSENTER_compat)
arch/x86/entry/entry_64_compat.S:ENTRY(entry_SYSCALL_compat)
arch/x86/entry/entry_64_compat.S:ENTRY(entry_INT80_compat)

Furthermore, each entry point has extensive comments, except one important detail:
none of the comments really explains the circumstances under which the entry
points are _used_ by user-space.

I'd like to see something like:

arch/x86/entry/entry_64.S:ENTRY(entry_SYSCALL_64)

*
* The 64-bit SYSCALL instruction is used by all modern 64-bit user-space
* code to execute most system calls: this instruction is the fastest and
* sanest implementation on modern Intel and AMD CPUs.
*

... and we should add similar explanations for all of the 6 entry points, with
caveats and limitations listed generously.

Especially valuable would be to list eventual 'strange' usages of the various
syscall instructions, used by rare packages, compatibility layers, emulators,
embedded libraries, etc. (To the extent we know about them, obviously.)

I.e. it would be very nice to do a full documentation of our current system call
usage patterns, as utilized by user-space. Beyond the documentation value this
will also help people prioritize optimizations between the various entry points -
which should be optimized more, which entry point matters less, etc.

Thanks,

Ingo

2016-03-06 16:16:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/10] x86: Various SYSENTER/SYSEXIT/#DB fixes and cleanups

On Mar 6, 2016 12:31 AM, "Ingo Molnar" <[email protected]> wrote:
>
>
> * Andy Lutomirski <[email protected]> wrote:
>
> > hpa asked me to get rid of the ASM_CLAC at the beginning of the SYSENTER
> > path. Little did he know...
>
> Btw., before we further change this code, something else I think would be very
> useful. We have countless system call entry points on x86 CPUs, and they are now
> consistently named and are very easy to grep for:
>
> triton:~/tip> git grep 'ENTRY(entry_' arch/x86/entry/
> arch/x86/entry/entry_32.S:ENTRY(entry_SYSENTER_32)
> arch/x86/entry/entry_32.S:ENTRY(entry_INT80_32)
> arch/x86/entry/entry_64.S:ENTRY(entry_SYSCALL_64)
> arch/x86/entry/entry_64_compat.S:ENTRY(entry_SYSENTER_compat)
> arch/x86/entry/entry_64_compat.S:ENTRY(entry_SYSCALL_compat)
> arch/x86/entry/entry_64_compat.S:ENTRY(entry_INT80_compat)
>
> Furthermore, each entry point has extensive comments, except one important detail:
> none of the comments really explains the circumstances under which the entry
> points are _used_ by user-space.
>
> I'd like to see something like:
>
> arch/x86/entry/entry_64.S:ENTRY(entry_SYSCALL_64)
>
> *
> * The 64-bit SYSCALL instruction is used by all modern 64-bit user-space
> * code to execute most system calls: this instruction is the fastest and
> * sanest implementation on modern Intel and AMD CPUs.
> *
>
> ... and we should add similar explanations for all of the 6 entry points, with
> caveats and limitations listed generously.
>
> Especially valuable would be to list eventual 'strange' usages of the various
> syscall instructions, used by rare packages, compatibility layers, emulators,
> embedded libraries, etc. (To the extent we know about them, obviously.)

I'll send a follow-up patch.

--Andy

2016-03-06 16:17:02

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/10] x86: Various SYSENTER/SYSEXIT/#DB fixes and cleanups

On Mar 6, 2016 12:22 AM, "Ingo Molnar" <[email protected]> wrote:
>
>
> * Andy Lutomirski <[email protected]> wrote:
>
> > This series applies to the result of merging tip:x86/asm and
> > tip:x86/urgent. I've been testing on a somewhat bastardized base,
> > because tip currently doesn't work on my laptop in 32-bit mode. (That
> > bug is fixed in Linus' tree.)
>
> Hm, which fix is that?

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=de9e478b9d49f3a0214310d921450cf5bb4a21e6

2016-03-06 16:56:07

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

On Sat, Mar 05, 2016 at 09:52:20PM -0800, Andy Lutomirski wrote:
> Due to a blatant design error, SYSENTER doesn't clear TF. As a result,
> if a user does SYSENTER with TF set, we will single-step through the
> kernel until something clears TF. There is absolutely nothing we can
> do to prevent this short of turning off SYSENTER [1].
>
> Simplify the handling considerably with two changes:
>
> 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
> add TF to that list of flags to sanitize with no overhead whatsoever.
>
> 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue.
>
> That's all we need to do.
>
> Don't get too excited -- our handling is still buggy on 32-bit
> kernels. There's nothing wrong with the SYSENTER code itself, but
> the #DB prologue has a clever fixup for traps on the very first
> instruction of entry_SYSENTER_32, and the fixup doesn't work quite
> correctly. The next two patches will fix that.
>
> [1] We could probably prevent it by forcing BTF on at all times and
> making sure we clear TF before any branches in the SYSENTER
> code. Needless to say, this is a bad idea.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/entry/entry_32.S | 42 ++++++++++++++++++++++----------
> arch/x86/entry/entry_64_compat.S | 9 ++++++-
> arch/x86/include/asm/proto.h | 15 ++++++++++--
> arch/x86/kernel/traps.c | 52 +++++++++++++++++++++++++++++++++-------
> 4 files changed, 94 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index a8c3424c3392..7700cf695d8c 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -287,7 +287,26 @@ need_resched:
> END(resume_kernel)
> #endif
>
> - # SYSENTER call handler stub
> +GLOBAL(__begin_SYSENTER_singlestep_region)
> +/*
> + * All code from here through __end_SYSENTER_singlestep_region is subject
> + * to being single-stepped if a user program sets TF and executes SYSENTER.
> + * There is absolutely nothing that we can do to prevent this from happening
> + * (thanks Intel!). To keep our handling of this situation as simple as
> + * possible, we handle TF just like AC and NT, except that our #DB handler
> + * will ignore all of the single-step traps generated in this range.
> + */
> +
> +#ifdef CONFIG_XEN
> +/*
> + * Xen doesn't set %esp to be precisely what the normal SYSENTER
> + * entry point expects, so fix it up before using the normal path.
> + */
> +ENTRY(xen_sysenter_target)
> + addl $5*4, %esp /* remove xen-provided frame */
> + jmp sysenter_past_esp
> +#endif
> +
> ENTRY(entry_SYSENTER_32)
> movl TSS_sysenter_sp0(%esp), %esp
> sysenter_past_esp:

Can you do this below (ontop of your diff) and get rid of those
__{begin,end}_SYSENTER_singlestep_region and __end_entry_SYSENTER_compat
globals and use the entry_SYSENTER_{32|compat} symbols instead?

>From a quick scan, kallsyms can give you the symbol size too so that you
can compute where it ends:

readelf -a vmlinux | grep entry_SYSENTER
55454: ffffffff8170aff0 99 FUNC GLOBAL DEFAULT 1 entry_SYSENTER_compat
62596: ffffffff8170b053 0 NOTYPE GLOBAL DEFAULT 1 __end_entry_SYSENTER_comp

0xffffffff8170aff0 + 99 = 0xffffffff8170b053

---
Index: b/arch/x86/entry/entry_32.S
===================================================================
--- a/arch/x86/entry/entry_32.S 2016-03-06 17:47:10.059733163 +0100
+++ b/arch/x86/entry/entry_32.S 2016-03-06 17:46:52.235733410 +0100
@@ -297,18 +297,13 @@ GLOBAL(__begin_SYSENTER_singlestep_regio
* will ignore all of the single-step traps generated in this range.
*/

+ENTRY(entry_SYSENTER_32)
#ifdef CONFIG_XEN
-/*
- * Xen doesn't set %esp to be precisely what the normal SYSENTER
- * entry point expects, so fix it up before using the normal path.
- */
-ENTRY(xen_sysenter_target)
addl $5*4, %esp /* remove xen-provided frame */
jmp sysenter_past_esp
-#endif
-
-ENTRY(entry_SYSENTER_32)
+#else
movl TSS_sysenter_sp0(%esp), %esp
+#endif
sysenter_past_esp:
pushl $__USER_DS /* pt_regs->ss */
pushl %ebp /* pt_regs->sp (stashed in bp) */

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-03-06 17:30:37

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

On Sun, Mar 6, 2016 at 11:55 AM, Borislav Petkov <[email protected]> wrote:
> On Sat, Mar 05, 2016 at 09:52:20PM -0800, Andy Lutomirski wrote:
>> Due to a blatant design error, SYSENTER doesn't clear TF. As a result,
>> if a user does SYSENTER with TF set, we will single-step through the
>> kernel until something clears TF. There is absolutely nothing we can
>> do to prevent this short of turning off SYSENTER [1].
>>
>> Simplify the handling considerably with two changes:
>>
>> 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
>> add TF to that list of flags to sanitize with no overhead whatsoever.
>>
>> 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue.
>>
>> That's all we need to do.
>>
>> Don't get too excited -- our handling is still buggy on 32-bit
>> kernels. There's nothing wrong with the SYSENTER code itself, but
>> the #DB prologue has a clever fixup for traps on the very first
>> instruction of entry_SYSENTER_32, and the fixup doesn't work quite
>> correctly. The next two patches will fix that.
>>
>> [1] We could probably prevent it by forcing BTF on at all times and
>> making sure we clear TF before any branches in the SYSENTER
>> code. Needless to say, this is a bad idea.
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> arch/x86/entry/entry_32.S | 42 ++++++++++++++++++++++----------
>> arch/x86/entry/entry_64_compat.S | 9 ++++++-
>> arch/x86/include/asm/proto.h | 15 ++++++++++--
>> arch/x86/kernel/traps.c | 52 +++++++++++++++++++++++++++++++++-------
>> 4 files changed, 94 insertions(+), 24 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
>> index a8c3424c3392..7700cf695d8c 100644
>> --- a/arch/x86/entry/entry_32.S
>> +++ b/arch/x86/entry/entry_32.S
>> @@ -287,7 +287,26 @@ need_resched:
>> END(resume_kernel)
>> #endif
>>
>> - # SYSENTER call handler stub
>> +GLOBAL(__begin_SYSENTER_singlestep_region)
>> +/*
>> + * All code from here through __end_SYSENTER_singlestep_region is subject
>> + * to being single-stepped if a user program sets TF and executes SYSENTER.
>> + * There is absolutely nothing that we can do to prevent this from happening
>> + * (thanks Intel!). To keep our handling of this situation as simple as
>> + * possible, we handle TF just like AC and NT, except that our #DB handler
>> + * will ignore all of the single-step traps generated in this range.
>> + */
>> +
>> +#ifdef CONFIG_XEN
>> +/*
>> + * Xen doesn't set %esp to be precisely what the normal SYSENTER
>> + * entry point expects, so fix it up before using the normal path.
>> + */
>> +ENTRY(xen_sysenter_target)
>> + addl $5*4, %esp /* remove xen-provided frame */
>> + jmp sysenter_past_esp
>> +#endif
>> +
>> ENTRY(entry_SYSENTER_32)
>> movl TSS_sysenter_sp0(%esp), %esp
>> sysenter_past_esp:
>
> Can you do this below (ontop of your diff) and get rid of those
> __{begin,end}_SYSENTER_singlestep_region and __end_entry_SYSENTER_compat
> globals and use the entry_SYSENTER_{32|compat} symbols instead?
>
> From a quick scan, kallsyms can give you the symbol size too so that you
> can compute where it ends:
>
> readelf -a vmlinux | grep entry_SYSENTER
> 55454: ffffffff8170aff0 99 FUNC GLOBAL DEFAULT 1 entry_SYSENTER_compat
> 62596: ffffffff8170b053 0 NOTYPE GLOBAL DEFAULT 1 __end_entry_SYSENTER_comp
>
> 0xffffffff8170aff0 + 99 = 0xffffffff8170b053
>
> ---
> Index: b/arch/x86/entry/entry_32.S
> ===================================================================
> --- a/arch/x86/entry/entry_32.S 2016-03-06 17:47:10.059733163 +0100
> +++ b/arch/x86/entry/entry_32.S 2016-03-06 17:46:52.235733410 +0100
> @@ -297,18 +297,13 @@ GLOBAL(__begin_SYSENTER_singlestep_regio
> * will ignore all of the single-step traps generated in this range.
> */
>
> +ENTRY(entry_SYSENTER_32)
> #ifdef CONFIG_XEN
> -/*
> - * Xen doesn't set %esp to be precisely what the normal SYSENTER
> - * entry point expects, so fix it up before using the normal path.
> - */
> -ENTRY(xen_sysenter_target)
> addl $5*4, %esp /* remove xen-provided frame */
> jmp sysenter_past_esp
> -#endif
> -
> -ENTRY(entry_SYSENTER_32)
> +#else
> movl TSS_sysenter_sp0(%esp), %esp
> +#endif
> sysenter_past_esp:
> pushl $__USER_DS /* pt_regs->ss */
> pushl %ebp /* pt_regs->sp (stashed in bp) */


This won't work if you run a Xen enabled kernel on bare metal. This
would work though:

ALTERNATIVE "movl TSS_sysenter_sp0(%esp), %esp", "addl $5*4, %esp",
X86_FEATURE_XENPV

I haven't read the Xen hypervisor code, but what are those 5 words
that were pushed on the stack by the hypervisor? It suspiciously is
the size of an IRET frame. Considering that we don't use SYSEXIT on
Xen anymore, can we just redirect SYSENTER to the INT80 handler?
Perhaps even just disable SYSENTER support in the VDSO on Xen. I
can't imagine SYSENTER is any faster than INT80 on Xen, because it has
to trap to the hypervisor first.

--
Brian Gerst

2016-03-06 17:37:11

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

On Sun, Mar 6, 2016 at 9:30 AM, Brian Gerst <[email protected]> wrote:
> On Sun, Mar 6, 2016 at 11:55 AM, Borislav Petkov <[email protected]> wrote:
>> On Sat, Mar 05, 2016 at 09:52:20PM -0800, Andy Lutomirski wrote:
>>> Due to a blatant design error, SYSENTER doesn't clear TF. As a result,
>>> if a user does SYSENTER with TF set, we will single-step through the
>>> kernel until something clears TF. There is absolutely nothing we can
>>> do to prevent this short of turning off SYSENTER [1].
>>>
>>> Simplify the handling considerably with two changes:
>>>
>>> 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
>>> add TF to that list of flags to sanitize with no overhead whatsoever.
>>>
>>> 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue.
>>>
>>> That's all we need to do.
>>>
>>> Don't get too excited -- our handling is still buggy on 32-bit
>>> kernels. There's nothing wrong with the SYSENTER code itself, but
>>> the #DB prologue has a clever fixup for traps on the very first
>>> instruction of entry_SYSENTER_32, and the fixup doesn't work quite
>>> correctly. The next two patches will fix that.
>>>
>>> [1] We could probably prevent it by forcing BTF on at all times and
>>> making sure we clear TF before any branches in the SYSENTER
>>> code. Needless to say, this is a bad idea.
>>>
>>> Signed-off-by: Andy Lutomirski <[email protected]>
>>> ---
>>> arch/x86/entry/entry_32.S | 42 ++++++++++++++++++++++----------
>>> arch/x86/entry/entry_64_compat.S | 9 ++++++-
>>> arch/x86/include/asm/proto.h | 15 ++++++++++--
>>> arch/x86/kernel/traps.c | 52 +++++++++++++++++++++++++++++++++-------
>>> 4 files changed, 94 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
>>> index a8c3424c3392..7700cf695d8c 100644
>>> --- a/arch/x86/entry/entry_32.S
>>> +++ b/arch/x86/entry/entry_32.S
>>> @@ -287,7 +287,26 @@ need_resched:
>>> END(resume_kernel)
>>> #endif
>>>
>>> - # SYSENTER call handler stub
>>> +GLOBAL(__begin_SYSENTER_singlestep_region)
>>> +/*
>>> + * All code from here through __end_SYSENTER_singlestep_region is subject
>>> + * to being single-stepped if a user program sets TF and executes SYSENTER.
>>> + * There is absolutely nothing that we can do to prevent this from happening
>>> + * (thanks Intel!). To keep our handling of this situation as simple as
>>> + * possible, we handle TF just like AC and NT, except that our #DB handler
>>> + * will ignore all of the single-step traps generated in this range.
>>> + */
>>> +
>>> +#ifdef CONFIG_XEN
>>> +/*
>>> + * Xen doesn't set %esp to be precisely what the normal SYSENTER
>>> + * entry point expects, so fix it up before using the normal path.
>>> + */
>>> +ENTRY(xen_sysenter_target)
>>> + addl $5*4, %esp /* remove xen-provided frame */
>>> + jmp sysenter_past_esp
>>> +#endif
>>> +
>>> ENTRY(entry_SYSENTER_32)
>>> movl TSS_sysenter_sp0(%esp), %esp
>>> sysenter_past_esp:
>>
>> Can you do this below (ontop of your diff) and get rid of those
>> __{begin,end}_SYSENTER_singlestep_region and __end_entry_SYSENTER_compat
>> globals and use the entry_SYSENTER_{32|compat} symbols instead?
>>
>> From a quick scan, kallsyms can give you the symbol size too so that you
>> can compute where it ends:
>>
>> readelf -a vmlinux | grep entry_SYSENTER
>> 55454: ffffffff8170aff0 99 FUNC GLOBAL DEFAULT 1 entry_SYSENTER_compat
>> 62596: ffffffff8170b053 0 NOTYPE GLOBAL DEFAULT 1 __end_entry_SYSENTER_comp
>>
>> 0xffffffff8170aff0 + 99 = 0xffffffff8170b053
>>
>> ---
>> Index: b/arch/x86/entry/entry_32.S
>> ===================================================================
>> --- a/arch/x86/entry/entry_32.S 2016-03-06 17:47:10.059733163 +0100
>> +++ b/arch/x86/entry/entry_32.S 2016-03-06 17:46:52.235733410 +0100
>> @@ -297,18 +297,13 @@ GLOBAL(__begin_SYSENTER_singlestep_regio
>> * will ignore all of the single-step traps generated in this range.
>> */
>>
>> +ENTRY(entry_SYSENTER_32)
>> #ifdef CONFIG_XEN
>> -/*
>> - * Xen doesn't set %esp to be precisely what the normal SYSENTER
>> - * entry point expects, so fix it up before using the normal path.
>> - */
>> -ENTRY(xen_sysenter_target)
>> addl $5*4, %esp /* remove xen-provided frame */
>> jmp sysenter_past_esp
>> -#endif
>> -
>> -ENTRY(entry_SYSENTER_32)
>> +#else
>> movl TSS_sysenter_sp0(%esp), %esp
>> +#endif
>> sysenter_past_esp:
>> pushl $__USER_DS /* pt_regs->ss */
>> pushl %ebp /* pt_regs->sp (stashed in bp) */
>
>
> This won't work if you run a Xen enabled kernel on bare metal. This
> would work though:
>
> ALTERNATIVE "movl TSS_sysenter_sp0(%esp), %esp", "addl $5*4, %esp",
> X86_FEATURE_XENPV

That might work.

>
> I haven't read the Xen hypervisor code, but what are those 5 words
> that were pushed on the stack by the hypervisor? It suspiciously is
> the size of an IRET frame.

I think it is nominally an IRET frame. One might wonder what's in it,
given that the hypervisor doesn't know what the old IP or SP was.

> Considering that we don't use SYSEXIT on
> Xen anymore, can we just redirect SYSENTER to the INT80 handler?
> Perhaps even just disable SYSENTER support in the VDSO on Xen. I
> can't imagine SYSENTER is any faster than INT80 on Xen, because it has
> to trap to the hypervisor first.
>

I think we should leave it enabled -- having user programs on Xen PV
trap into the hypervisor for a system call using SYSENTER will still
be much faster than using INT $0x80 as long as the hypervisor does
something reasonable.

I think that Andrew Cooper has some patches in the works to clean a
bunch of the Xen PV entry prologue stuff up.

--Andy

2016-03-06 18:01:58

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

On Sun, Mar 06, 2016 at 09:36:42AM -0800, Andy Lutomirski wrote:
> > ALTERNATIVE "movl TSS_sysenter_sp0(%esp), %esp", "addl $5*4, %esp",
> > X86_FEATURE_XENPV
>
> That might work.

Yap, and as we said on IRC, KALLSYMS can be off - even if it is "if
EXPERT" so we need to think of a better way to compute start address and
size of entry_SYSENTER_32 and the _compat one.

We probably could reuse some of the ELF parsing machinery of relocs.c or
so...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-03-06 18:20:53

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

On 06/03/16 17:36, Andy Lutomirski wrote:
>
>> I haven't read the Xen hypervisor code, but what are those 5 words
>> that were pushed on the stack by the hypervisor? It suspiciously is
>> the size of an IRET frame.
> I think it is nominally an IRET frame.

It is a notminal IRET frame. Even to this day, Xen doesn't support
anything other "making it appear as if an interrupt/exception occurred",
even with the syscall/sysenter and artificial entrypoints.

The Xen entrypoint logic predates the introduction of the
syscall/sysenter support in Linux. At the point where your hammer is
already iret shaped and you have a forked version of Linux for running
as a guest, modifying sysenter to be iret shaped is an easy option. For
better or worse, this is now the ABI.

> One might wonder what's in it, given that the hypervisor doesn't know what the old IP or SP was.

Looking at the code which synthesizes the iret frame

pushq $FLAT_USER_SS
pushq $0
pushfq
pushq $3 /* ring 3 null cs */
pushq $0 /* null rip */

Completely ignoring it definitely the best course of action.

>
>> Considering that we don't use SYSEXIT on
>> Xen anymore, can we just redirect SYSENTER to the INT80 handler?
>> Perhaps even just disable SYSENTER support in the VDSO on Xen. I
>> can't imagine SYSENTER is any faster than INT80 on Xen, because it has
>> to trap to the hypervisor first.
>>
> I think we should leave it enabled -- having user programs on Xen PV
> trap into the hypervisor for a system call using SYSENTER will still
> be much faster than using INT $0x80 as long as the hypervisor does
> something reasonable.

The trap into Xen has to happen either way (even for the INT $0x80
path). There is almost certainly room for improvement in both paths,
but in principle the sysenter path will be faster.

~Andrew

2016-03-07 08:29:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/10] x86: Various SYSENTER/SYSEXIT/#DB fixes and cleanups


* Andy Lutomirski <[email protected]> wrote:

> On Mar 6, 2016 12:22 AM, "Ingo Molnar" <[email protected]> wrote:
> >
> >
> > * Andy Lutomirski <[email protected]> wrote:
> >
> > > This series applies to the result of merging tip:x86/asm and
> > > tip:x86/urgent. I've been testing on a somewhat bastardized base,
> > > because tip currently doesn't work on my laptop in 32-bit mode. (That
> > > bug is fixed in Linus' tree.)
> >
> > Hm, which fix is that?
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=de9e478b9d49f3a0214310d921450cf5bb4a21e6

ah, indeed.

I've merged v4.5-rc7 into tip:x86/asm, which includes this fix, to make the topic
branch bisectable on SkyLake as well.

Thanks,

Ingo

2016-03-07 17:17:56

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

On Sun, Mar 6, 2016 at 12:52 AM, Andy Lutomirski <[email protected]> wrote:
> Due to a blatant design error, SYSENTER doesn't clear TF. As a result,
> if a user does SYSENTER with TF set, we will single-step through the
> kernel until something clears TF. There is absolutely nothing we can
> do to prevent this short of turning off SYSENTER [1].
>
> Simplify the handling considerably with two changes:
>
> 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
> add TF to that list of flags to sanitize with no overhead whatsoever.
>
> 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue.

What is wrong with the current method of clearing TF and setting
TIF_SINGLESTEP on the first debug trap? This patch actually increases
complexity because it has to check for a range of addresses rather
than just the first instruction, plus it has to singlestep all the way
through the SYSENTER prologue.

Unless there is an actual issue with TIF_SINGLESTEP, I don't think
this patch is an improvement.

--
Brian Gerst

2016-03-07 18:03:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

On Mon, Mar 7, 2016 at 9:17 AM, Brian Gerst <[email protected]> wrote:
> On Sun, Mar 6, 2016 at 12:52 AM, Andy Lutomirski <[email protected]> wrote:
>> Due to a blatant design error, SYSENTER doesn't clear TF. As a result,
>> if a user does SYSENTER with TF set, we will single-step through the
>> kernel until something clears TF. There is absolutely nothing we can
>> do to prevent this short of turning off SYSENTER [1].
>>
>> Simplify the handling considerably with two changes:
>>
>> 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
>> add TF to that list of flags to sanitize with no overhead whatsoever.
>>
>> 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue.
>
> What is wrong with the current method of clearing TF and setting
> TIF_SINGLESTEP on the first debug trap? This patch actually increases
> complexity because it has to check for a range of addresses rather
> than just the first instruction, plus it has to singlestep all the way
> through the SYSENTER prologue.
>
> Unless there is an actual issue with TIF_SINGLESTEP, I don't think
> this patch is an improvement.

TIF_SINGLESTEP has bizarrely overloaded semantics.

There's this:

/*
* If we stepped into a sysenter/syscall insn, it trapped in
* kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
* If user-mode had set TF itself, then it's still clear from
* do_debug() and we need to set it again to restore the user
* state so we don't wrongly set TIF_FORCED_TF below.
* If enable_single_step() was used last and that is what
* set TIF_SINGLESTEP, then both TF and TIF_FORCED_TF are
* already set and our bookkeeping is fine.
*/
if (unlikely(test_tsk_thread_flag(child, TIF_SINGLESTEP)))
regs->flags |= X86_EFLAGS_TF;

but TIF_SINGLESTEP is also used for other things. (And I need to
follow up with a patch to remove that comment.) This results in
incomprehensible behavior: if a user program sets TF and does
SYSENTER, then TIF_SINGLESTEP gets set (and stays set!). This does
not happen if a user sets TF and does INT80 or SYSCALL. There was at
least one bug in here that Oleg fixed a while back, and I wouldn't be
at all surprised if there were others.

I don't know what would happen if TF were set and SYSENTER were used
to do a syscall where the __get_user to load the syscall nr failed.
That happens before the TIF_SINGLESTEP fixup.

Basically, the overloaded use of TIF_SINGLESTEP was complicated and
hard to understand, and the new behavior is straightforward and
consistent with other entries, even if it's a bit slower.

We could introduce a new TIF_SYSENTER_TF and use it directly, or we
could accelerate the TF fixup in regs->flags by switching to a
different entry path (I had a draft patch to do that), but I tend to
favor simplicity for things that aren't performance-critical.

2016-03-07 18:41:40

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

On Mon, Mar 7, 2016 at 1:03 PM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Mar 7, 2016 at 9:17 AM, Brian Gerst <[email protected]> wrote:
>> On Sun, Mar 6, 2016 at 12:52 AM, Andy Lutomirski <[email protected]> wrote:
>>> Due to a blatant design error, SYSENTER doesn't clear TF. As a result,
>>> if a user does SYSENTER with TF set, we will single-step through the
>>> kernel until something clears TF. There is absolutely nothing we can
>>> do to prevent this short of turning off SYSENTER [1].
>>>
>>> Simplify the handling considerably with two changes:
>>>
>>> 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
>>> add TF to that list of flags to sanitize with no overhead whatsoever.
>>>
>>> 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue.
>>
>> What is wrong with the current method of clearing TF and setting
>> TIF_SINGLESTEP on the first debug trap? This patch actually increases
>> complexity because it has to check for a range of addresses rather
>> than just the first instruction, plus it has to singlestep all the way
>> through the SYSENTER prologue.
>>
>> Unless there is an actual issue with TIF_SINGLESTEP, I don't think
>> this patch is an improvement.
>
> TIF_SINGLESTEP has bizarrely overloaded semantics.
>
> There's this:
>
> /*
> * If we stepped into a sysenter/syscall insn, it trapped in
> * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
> * If user-mode had set TF itself, then it's still clear from
> * do_debug() and we need to set it again to restore the user
> * state so we don't wrongly set TIF_FORCED_TF below.
> * If enable_single_step() was used last and that is what
> * set TIF_SINGLESTEP, then both TF and TIF_FORCED_TF are
> * already set and our bookkeeping is fine.
> */
> if (unlikely(test_tsk_thread_flag(child, TIF_SINGLESTEP)))
> regs->flags |= X86_EFLAGS_TF;
>
> but TIF_SINGLESTEP is also used for other things. (And I need to
> follow up with a patch to remove that comment.) This results in
> incomprehensible behavior: if a user program sets TF and does
> SYSENTER, then TIF_SINGLESTEP gets set (and stays set!). This does
> not happen if a user sets TF and does INT80 or SYSCALL. There was at
> least one bug in here that Oleg fixed a while back, and I wouldn't be
> at all surprised if there were others.
>
> I don't know what would happen if TF were set and SYSENTER were used
> to do a syscall where the __get_user to load the syscall nr failed.
> That happens before the TIF_SINGLESTEP fixup.
>
> Basically, the overloaded use of TIF_SINGLESTEP was complicated and
> hard to understand, and the new behavior is straightforward and
> consistent with other entries, even if it's a bit slower.
>
> We could introduce a new TIF_SYSENTER_TF and use it directly, or we
> could accelerate the TF fixup in regs->flags by switching to a
> different entry path (I had a draft patch to do that), but I tend to
> favor simplicity for things that aren't performance-critical.

The alternate entry path wouldn't be very big, just 5 instructions
with the OR being the only difference.

Another option is to use a per-cpu var instead of a TIF flag.

--
Brian Gerst

2016-03-07 18:47:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

On Mon, Mar 7, 2016 at 10:41 AM, Brian Gerst <[email protected]> wrote:
> On Mon, Mar 7, 2016 at 1:03 PM, Andy Lutomirski <[email protected]> wrote:
>> On Mon, Mar 7, 2016 at 9:17 AM, Brian Gerst <[email protected]> wrote:
>>> On Sun, Mar 6, 2016 at 12:52 AM, Andy Lutomirski <[email protected]> wrote:
>>>> Due to a blatant design error, SYSENTER doesn't clear TF. As a result,
>>>> if a user does SYSENTER with TF set, we will single-step through the
>>>> kernel until something clears TF. There is absolutely nothing we can
>>>> do to prevent this short of turning off SYSENTER [1].
>>>>
>>>> Simplify the handling considerably with two changes:
>>>>
>>>> 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
>>>> add TF to that list of flags to sanitize with no overhead whatsoever.
>>>>
>>>> 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue.
>>>
>>> What is wrong with the current method of clearing TF and setting
>>> TIF_SINGLESTEP on the first debug trap? This patch actually increases
>>> complexity because it has to check for a range of addresses rather
>>> than just the first instruction, plus it has to singlestep all the way
>>> through the SYSENTER prologue.
>>>
>>> Unless there is an actual issue with TIF_SINGLESTEP, I don't think
>>> this patch is an improvement.
>>
>> TIF_SINGLESTEP has bizarrely overloaded semantics.
>>
>> There's this:
>>
>> /*
>> * If we stepped into a sysenter/syscall insn, it trapped in
>> * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
>> * If user-mode had set TF itself, then it's still clear from
>> * do_debug() and we need to set it again to restore the user
>> * state so we don't wrongly set TIF_FORCED_TF below.
>> * If enable_single_step() was used last and that is what
>> * set TIF_SINGLESTEP, then both TF and TIF_FORCED_TF are
>> * already set and our bookkeeping is fine.
>> */
>> if (unlikely(test_tsk_thread_flag(child, TIF_SINGLESTEP)))
>> regs->flags |= X86_EFLAGS_TF;
>>
>> but TIF_SINGLESTEP is also used for other things. (And I need to
>> follow up with a patch to remove that comment.) This results in
>> incomprehensible behavior: if a user program sets TF and does
>> SYSENTER, then TIF_SINGLESTEP gets set (and stays set!). This does
>> not happen if a user sets TF and does INT80 or SYSCALL. There was at
>> least one bug in here that Oleg fixed a while back, and I wouldn't be
>> at all surprised if there were others.
>>
>> I don't know what would happen if TF were set and SYSENTER were used
>> to do a syscall where the __get_user to load the syscall nr failed.
>> That happens before the TIF_SINGLESTEP fixup.
>>
>> Basically, the overloaded use of TIF_SINGLESTEP was complicated and
>> hard to understand, and the new behavior is straightforward and
>> consistent with other entries, even if it's a bit slower.
>>
>> We could introduce a new TIF_SYSENTER_TF and use it directly, or we
>> could accelerate the TF fixup in regs->flags by switching to a
>> different entry path (I had a draft patch to do that), but I tend to
>> favor simplicity for things that aren't performance-critical.
>
> The alternate entry path wouldn't be very big, just 5 instructions
> with the OR being the only difference.

Agreed. It's just more code to maintain and test. I can certainly do it.

I scrapped it the first time because it was messy on Xen PV. It was
extra messy because I did this part of the series before I cleaned up
the garbage ENTRY(debug) code, and that got in the way as well.

>
> Another option is to use a per-cpu var instead of a TIF flag.

Then we'd need to add a branch in the TF-not-set path, which would be
a bit unfortunate.

As a third option, we could force BTF on, which would reduce the
number of useless traps.

In any case, I'd be glad to speed this up as a follow-up patch if
anyone cares about performance.

--Andy

2016-03-08 06:37:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 09/10] x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup

On Sat, Mar 05, 2016 at 09:52:22PM -0800, Andy Lutomirski wrote:
> Right after SYSENTER, we can get a #DB or NMI. On x86_32, there's no IST,
> so the exception handler is invoked on the temporary SYSENTER stack.
>
> Because the SYSENTER stack is very small, we have a fixup to switch
> off the stack quickly when this happens. The old fixup had several issues:
>
> 1. It checked the interrupt frame's CS and EIP. This wasn't
> obviously correct on Xen or if vm86 mode was in use [1].
>
> 2. In the NMI handler, it did some frightening digging into the
> stack frame. I'm not convinced this digging was correct.
>
> 3. The fixup didn't switch stacks and then switch back. Instead, it
> synthesized a brand new stack frame that would redirect the IRET
> back to the SYSENTER code. That frame was highly questionable.
> For one thing, if NMI nested inside #DB, we would effectively
> abort the #DB prologue, which was probably safe but was
> frightening. For another, the code used PUSHFL to write the
> FLAGS portion of the frame, which was simply bogus -- by the time
> PUSHFL was called, at least TF, NT, VM, and all of the arithmetic
> flags were clobbered.
>
> Simplify this considerably. Instead of looking at the saved frame
> to see where we came from, check the hardware ESP register against
> the SYSENTER stack directly. Malicious user code cannot spoof the
> kernel ESP register, and by moving the check after SAVE_ALL, we can
> use normal PER_CPU accesses to find all the relevant addresses.
>
> With this patch applied, the improved syscall_nt_32 test finally
> passes on 32-bit kernels.
>
> [1] It isn't obviously correct, but it is nonetheless safe from vm86
> shenanigans as far as I can tell. A user can't point EIP at
> entry_SYSENTER_32 while in vm86 mode because entry_SYSENTER_32,
> like all kernel addresses, is greater than 0xffff and would thus
> violate the CS segment limit.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/entry/entry_32.S | 114 ++++++++++++++++++---------------------
> arch/x86/kernel/asm-offsets_32.c | 5 ++
> 2 files changed, 56 insertions(+), 63 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 7700cf695d8c..ad9a85e62128 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -987,51 +987,48 @@ error_code:
> jmp ret_from_exception
> END(page_fault)
>
> -/*
> - * Debug traps and NMI can happen at the one SYSENTER instruction
> - * that sets up the real kernel stack. Check here, since we can't
> - * allow the wrong stack to be used.
> - *
> - * "TSS_sysenter_sp0+12" is because the NMI/debug handler will have
> - * already pushed 3 words if it hits on the sysenter instruction:
> - * eflags, cs and eip.
> - *
> - * We just load the right stack, and push the three (known) values
> - * by hand onto the new stack - while updating the return eip past
> - * the instruction that would have done it for sysenter.
> - */
> -.macro FIX_STACK offset ok label
> - cmpw $__KERNEL_CS, 4(%esp)
> - jne \ok
> -\label:
> - movl TSS_sysenter_sp0 + \offset(%esp), %esp
> - pushfl
> - pushl $__KERNEL_CS
> - pushl $sysenter_past_esp
> -.endm
> -
> ENTRY(debug)
> + /*
> + * #DB can happen at the first instruction of
> + * entry_SYSENTER_32 or in Xen's SYSENTER prologue. If this
> + * happens, then we will be running on a very small stack. We
> + * need to detect this condition and switch to the thread
> + * stack before calling any C code at all.
> + *
> + * If you edit this code, keep in mind that NMIs can happen in here.
> + */

Btw, I think it is a bit more readable if this comment is over ENTRY(debug) like
the one over ENTRY(nmi) below.

> ASM_CLAC
> - cmpl $entry_SYSENTER_32, (%esp)
> - jne debug_stack_correct
> - FIX_STACK 12, debug_stack_correct, debug_esp_fix_insn
> -debug_stack_correct:
> pushl $-1 # mark this as an int
> SAVE_ALL
> - TRACE_IRQS_OFF
> xorl %edx, %edx # error code 0
> movl %esp, %eax # pt_regs pointer
> +
> + /* Are we currently on the SYSENTER stack? */
> + PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
> + subl %eax, %ecx /* ecx = (end of SYENTER_stack) - esp */
^^^^^^^
Just a typo for the committer of this patch to fix: SYSENTER_stack

> + cmpl $SIZEOF_SYSENTER_stack, %ecx
> + jb .Ldebug_from_sysenter_stack
> +
> + TRACE_IRQS_OFF
> + call do_debug
> + jmp ret_from_exception
> +
> +.Ldebug_from_sysenter_stack:
> + /* We're on the SYSENTER stack. Switch off. */
> + movl %esp, %ebp
> + movl PER_CPU_VAR(cpu_current_top_of_stack), %esp
> + TRACE_IRQS_OFF
> call do_debug
> + movl %ebp, %esp
> jmp ret_from_exception
> END(debug)
>
> /*
> - * NMI is doubly nasty. It can happen _while_ we're handling
> - * a debug fault, and the debug fault hasn't yet been able to
> - * clear up the stack. So we first check whether we got an
> - * NMI on the sysenter entry path, but after that we need to
> - * check whether we got an NMI on the debug path where the debug
> - * fault happened on the sysenter path.
> + * NMI is doubly nasty. It can happen on the first instruction of
> + * entry_SYSENTER_32 (just like #DB), but it can also interrupt the beginning
> + * of the #DB handler even if that #DB in turn hit before entry_SYSENTER_32
> + * switched stacks. We handle both conditions by simply checking whether we
> + * interrupted kernel code running on the SYSENTER stack.
> */
> ENTRY(nmi)
> ASM_CLAC
> @@ -1042,41 +1039,32 @@ ENTRY(nmi)
> popl %eax
> je nmi_espfix_stack
> #endif
> - cmpl $entry_SYSENTER_32, (%esp)
> - je nmi_stack_fixup
> - pushl %eax
> - movl %esp, %eax
> - /*
> - * Do not access memory above the end of our stack page,
> - * it might not exist.
> - */
> - andl $(THREAD_SIZE-1), %eax
> - cmpl $(THREAD_SIZE-20), %eax
> - popl %eax
> - jae nmi_stack_correct
> - cmpl $entry_SYSENTER_32, 12(%esp)
> - je nmi_debug_stack_check
> -nmi_stack_correct:
> - pushl %eax
> +
> + pushl %eax # pt_regs->orig_ax
> SAVE_ALL
> xorl %edx, %edx # zero error code
> movl %esp, %eax # pt_regs pointer
> +
> + /* Are we currently on the SYSENTER stack? */
> + PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
> + subl %eax, %ecx /* ecx = (end of SYENTER_stack) - esp */

Typo copied :)

> + cmpl $SIZEOF_SYSENTER_stack, %ecx
> + jb .Lnmi_from_sysenter_stack
> +
> + /* Not on SYSENTER stack. */
> call do_nmi
> jmp restore_all_notrace
>
> -nmi_stack_fixup:
> - FIX_STACK 12, nmi_stack_correct, 1
> - jmp nmi_stack_correct
> -
> -nmi_debug_stack_check:
> - cmpw $__KERNEL_CS, 16(%esp)
> - jne nmi_stack_correct
> - cmpl $debug, (%esp)
> - jb nmi_stack_correct
> - cmpl $debug_esp_fix_insn, (%esp)
> - ja nmi_stack_correct
> - FIX_STACK 24, nmi_stack_correct, 1
> - jmp nmi_stack_correct
> +.Lnmi_from_sysenter_stack:
> + /*
> + * We're on the SYSENTER stack. Switch off. No one (not even debug)
> + * is using the thread stack right now, so it's safe for us to use it.
> + */
> + movl %esp, %ebp
> + movl PER_CPU_VAR(cpu_current_top_of_stack), %esp
> + call do_nmi
> + movl %ebp, %esp
> + jmp restore_all_notrace
>
> #ifdef CONFIG_X86_ESPFIX32
> nmi_espfix_stack:

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-03-08 06:40:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] x86/entry/32: Add and check a stack canary for the SYSENTER stack

On Sat, Mar 05, 2016 at 09:52:23PM -0800, Andy Lutomirski wrote:

<--- ?

So all patches before are nice and verbose in explaining what they do.
Where did the commit message of this one disappear?

:-)

Or is it that obvious that we should have a stack canary? How about some
text *why* we need it?

> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/processor.h | 3 ++-
> arch/x86/kernel/process.c | 3 +++
> arch/x86/kernel/traps.c | 8 ++++++++
> 3 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 7cd01b71b5bd..50a6dc871cc0 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -299,8 +299,9 @@ struct tss_struct {
>
> #ifdef CONFIG_X86_32
> /*
> - * Space for the temporary SYSENTER stack:
> + * Space for the temporary SYSENTER stack.
> */
> + unsigned long SYSENTER_stack_canary;
> unsigned long SYSENTER_stack[64];
> #endif
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 9f7c21c22477..ee9a9792caeb 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -57,6 +57,9 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
> */
> .io_bitmap = { [0 ... IO_BITMAP_LONGS] = ~0 },
> #endif
> +#ifdef CONFIG_X86_32
> + .SYSENTER_stack_canary = STACK_END_MAGIC,
> +#endif
> };
> EXPORT_PER_CPU_SYMBOL(cpu_tss);
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 80928ea78373..590110119e6a 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -713,6 +713,14 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
> debug_stack_usage_dec();
>
> exit:
> +#if defined(CONFIG_X86_32)
> + /*
> + * This is the most likely code path that involves non-trivial use
> + * of the SYSENTER stack. Check that we haven't overrun it.
> + */
> + WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
> + "Overran or corrupted SYSENTER stack\n");
> +#endif
> ist_exit(regs);
> }
> NOKPROBE_SYMBOL(do_debug);
> --
> 2.5.0
>
>

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-03-09 21:22:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] x86/entry/32: Add and check a stack canary for the SYSENTER stack

On Mon, Mar 7, 2016 at 10:40 PM, Borislav Petkov <[email protected]> wrote:
> On Sat, Mar 05, 2016 at 09:52:23PM -0800, Andy Lutomirski wrote:
>
> <--- ?
>
> So all patches before are nice and verbose in explaining what they do.
> Where did the commit message of this one disappear?
>
> :-)
>
> Or is it that obvious that we should have a stack canary? How about some
> text *why* we need it?

I'll send a v3 with this and the other changes you requested.

--Andy