2015-07-03 19:44:48

by Andy Lutomirski

Subject: [PATCH v5 00/17] x86: Rewrite exit-to-userspace code

This is the first big batch of x86 asm-to-C conversion patches.

The exit-to-usermode code is copied in several places and is written
in a nasty combination of asm and C. It's not at all clear what
it's supposed to do, and the way it's structured makes it very hard
to work with. For example, it's not even clear why syscall exit
hooks are called only once per syscall right now. (It seems to be a
side effect of the way that rdi and rdx are handled in the asm loop,
and it seems reliable, but it's still pointlessly complicated.) The
existing code also makes context tracking overly complicated and
hard to understand. Finally, it's nearly impossible for anyone to
change what happens on exit to usermode, since the existing code is
so fragile.

I tried to clean it up incrementally, but I decided it was too hard.
Instead, this series just replaces the code. It seems to work.

Context tracking in particular works very differently now. The
low-level entry code checks that we're in CONTEXT_USER and switches
to CONTEXT_KERNEL. The exit code does the reverse. There is no
need to track what CONTEXT_XYZ state we came from, because we
already know. Similarly, SCHEDULE_USER is gone, since we can
reschedule if needed by simply calling schedule() from C code.
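
To illustrate (a rough sketch of the idea, not the literal patch
code -- the real helpers are added in patches 4, 8, and 9):

	/* on entry from user mode, with IRQs off */
	__visible void enter_from_user_mode(void)
	{
		CT_WARN_ON(ct_state() != CONTEXT_USER);
		user_exit();	/* switch to CONTEXT_KERNEL */
	}

	/* ... kernel work happens in CONTEXT_KERNEL ... */

	/* last step before actually returning to user mode */
	user_enter();		/* switch back to CONTEXT_USER */

Each transition asserts the state it expects instead of saving and
restoring it.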

The main things that are missing are that I haven't done the 32-bit
parts (anyone want to help?) and therefore I haven't deleted the old
C code. I also think this may break UML for trivial reasons.

IRQ context tracking is still messy. Once the cleanup progresses
to the point that we can enter CONTEXT_KERNEL in syscalls before
enabling interrupts, we can fully clean up IRQ context tracking.

Once these land, I'll send some more :)

Note: we might want to backport patches 1 and 2.

Changes from v4:
- Remove now-unused SAVE_EXTRA_REGS_RBP macro
- Fix comment at the top of common.c
- Decorate internal labels in error_entry with .L
- Fix two mis-formatted asm lines
- Undo inadvertent removal of R11 initialization in the sysexit path

I didn't rename the error_entry labels.

Changes from v3:
- Add the syscall_arg_fault_32 test.
- Fix a pre-existing bad syscall arg buglet.
- Fix an asm glitch due to a bad rebase.
- Fix a CONFIG_PROVE_LOCKDEP warning.
Borislav: the end result of this series differs from the v3.91 that I sent
you only in the removal of a single trailing tab. The badarg patch is in
a different place now, though, since we might want to backport it.

Changes from v2: Misplaced the actual list -- sorry.

Changes from v1:
- Fix bisection failure by squashing the 64-bit native and compat syscall
conversions together. The intermediate state didn't build, and fixing
it isn't worthwhile (the results will be harder to understand).
- Replace context_tracking_assert_state with CT_WARN_ON and ct_state.
- The last two patches are new. I incorrectly thought that we weren't
ready for them yet on 32-bit kernels, but I was wrong.

Andy Lutomirski (16):
selftests/x86: Add a test for 32-bit fast syscall arg faults
x86/entry/64/compat: Fix bad fast syscall arg failure path
context_tracking: Add ct_state and CT_WARN_ON
notifiers: Assert that RCU is watching in notify_die
x86: Move C entry and exit code to arch/x86/entry/common.c
x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries
x86/entry: Add enter_from_user_mode and use it in syscalls
x86/entry: Add new, comprehensible entry and exit hooks
x86/entry/64: Really create an error-entry-from-usermode code path
x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks
x86/asm/entry/64: Save all regs on interrupt entry
x86/asm/entry/64: Simplify irq stack pt_regs handling
x86/asm/entry/64: Migrate error and interrupt exit work to C
x86/entry: Remove exception_enter from most trap handlers
x86/entry: Remove SCHEDULE_USER and asm/context-tracking.h
x86/irq: Document how IRQ context tracking works and add an assertion

Ingo Molnar (1):
uml: Fix do_signal() prototype

arch/um/include/shared/kern_util.h | 3 +-
arch/um/kernel/process.c | 6 +-
arch/um/kernel/signal.c | 8 +-
arch/um/kernel/tlb.c | 2 +-
arch/um/kernel/trap.c | 2 +-
arch/x86/entry/Makefile | 1 +
arch/x86/entry/calling.h | 3 -
arch/x86/entry/common.c | 374 ++++++++++++++++++++++++
arch/x86/entry/entry_64.S | 197 ++++---------
arch/x86/entry/entry_64_compat.S | 46 ++-
arch/x86/include/asm/context_tracking.h | 10 -
arch/x86/include/asm/signal.h | 1 +
arch/x86/include/asm/traps.h | 4 +-
arch/x86/kernel/cpu/mcheck/mce.c | 5 +-
arch/x86/kernel/cpu/mcheck/p5.c | 5 +-
arch/x86/kernel/cpu/mcheck/winchip.c | 4 +-
arch/x86/kernel/irq.c | 15 +
arch/x86/kernel/ptrace.c | 202 +------------
arch/x86/kernel/signal.c | 28 +-
arch/x86/kernel/traps.c | 87 ++----
include/linux/context_tracking.h | 15 +
include/linux/context_tracking_state.h | 1 +
kernel/notifier.c | 2 +
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/syscall_arg_fault.c | 130 ++++++++
25 files changed, 681 insertions(+), 472 deletions(-)
create mode 100644 arch/x86/entry/common.c
delete mode 100644 arch/x86/include/asm/context_tracking.h
create mode 100644 tools/testing/selftests/x86/syscall_arg_fault.c

--
2.4.3


2015-07-03 19:45:15

by Andy Lutomirski

Subject: [PATCH v5 01/17] selftests/x86: Add a test for 32-bit fast syscall arg faults

This test passes on 4.0 and fails on some newer kernels. Fortunately,
the failure is likely not a big deal. This test will make sure that
we don't break it further (e.g. by OOPSing) as we clean up the entry
code and that we eventually fix the regression.

There's arguably no need to preserve the old ABI here -- anything
that makes it into a fast (vDSO) syscall with a bad stack is about
to crash no matter what we do.

Signed-off-by: Andy Lutomirski <[email protected]>
---
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/syscall_arg_fault.c | 130 ++++++++++++++++++++++++
2 files changed, 131 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/syscall_arg_fault.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index caa60d56d7d1..e8df47e6326c 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -5,7 +5,7 @@ include ../lib.mk
.PHONY: all all_32 all_64 warn_32bit_failure clean

TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs
-TARGETS_C_32BIT_ONLY := entry_from_vm86
+TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault

TARGETS_C_32BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_32BIT_ONLY)
BINARIES_32 := $(TARGETS_C_32BIT_ALL:%=%_32)
diff --git a/tools/testing/selftests/x86/syscall_arg_fault.c b/tools/testing/selftests/x86/syscall_arg_fault.c
new file mode 100644
index 000000000000..7db4fc9fa09f
--- /dev/null
+++ b/tools/testing/selftests/x86/syscall_arg_fault.c
@@ -0,0 +1,130 @@
+/*
+ * syscall_arg_fault.c - tests faults in 32-bit fast syscall stack args
+ * Copyright (c) 2015 Andrew Lutomirski
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/signal.h>
+#include <sys/ucontext.h>
+#include <err.h>
+#include <setjmp.h>
+#include <errno.h>
+
+/* Our sigaltstack scratch space. */
+static unsigned char altstack_data[SIGSTKSZ];
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+ int flags)
+{
+ struct sigaction sa;
+ memset(&sa, 0, sizeof(sa));
+ sa.sa_sigaction = handler;
+ sa.sa_flags = SA_SIGINFO | flags;
+ sigemptyset(&sa.sa_mask);
+ if (sigaction(sig, &sa, 0))
+ err(1, "sigaction");
+}
+
+static volatile sig_atomic_t sig_traps;
+static sigjmp_buf jmpbuf;
+
+static volatile sig_atomic_t n_errs;
+
+static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
+{
+ ucontext_t *ctx = (ucontext_t*)ctx_void;
+
+ if (ctx->uc_mcontext.gregs[REG_EAX] != -EFAULT) {
+ printf("[FAIL]\tAX had the wrong value: 0x%x\n",
+ ctx->uc_mcontext.gregs[REG_EAX]);
+ n_errs++;
+ } else {
+ printf("[OK]\tSeems okay\n");
+ }
+
+ siglongjmp(jmpbuf, 1);
+}
+
+static void sigill(int sig, siginfo_t *info, void *ctx_void)
+{
+ printf("[SKIP]\tIllegal instruction\n");
+ siglongjmp(jmpbuf, 1);
+}
+
+int main()
+{
+ stack_t stack = {
+ .ss_sp = altstack_data,
+ .ss_size = SIGSTKSZ,
+ };
+ if (sigaltstack(&stack, NULL) != 0)
+ err(1, "sigaltstack");
+
+ sethandler(SIGSEGV, sigsegv, SA_ONSTACK);
+ sethandler(SIGILL, sigill, SA_ONSTACK);
+
+ /*
+ * Exercise another nasty special case. The 32-bit SYSCALL
+ * and SYSENTER instructions (even in compat mode) each
+ * clobber one register. A Linux system call has a syscall
+ * number and six arguments, and the user stack pointer
+ * needs to live in some register on return. That means
+ * that we need eight registers, but SYSCALL and SYSENTER
+ * only preserve seven registers. As a result, one argument
+ * ends up on the stack. The stack is user memory, which
+ * means that the kernel can fail to read it.
+ *
+ * The 32-bit fast system calls don't have a defined ABI:
+ * we're supposed to invoke them through the vDSO. So we'll
+ * fudge it: we set all regs to invalid pointer values and
+ * invoke the entry instruction. The return will fail no
+ * matter what, and we completely lose our program state,
+ * but we can fix it up with a signal handler.
+ */
+
+ printf("[RUN]\tSYSENTER with invalid state\n");
+ if (sigsetjmp(jmpbuf, 1) == 0) {
+ asm volatile (
+ "movl $-1, %%eax\n\t"
+ "movl $-1, %%ebx\n\t"
+ "movl $-1, %%ecx\n\t"
+ "movl $-1, %%edx\n\t"
+ "movl $-1, %%esi\n\t"
+ "movl $-1, %%edi\n\t"
+ "movl $-1, %%ebp\n\t"
+ "movl $-1, %%esp\n\t"
+ "sysenter"
+ : : : "memory", "flags");
+ }
+
+ printf("[RUN]\tSYSCALL with invalid state\n");
+ if (sigsetjmp(jmpbuf, 1) == 0) {
+ asm volatile (
+ "movl $-1, %%eax\n\t"
+ "movl $-1, %%ebx\n\t"
+ "movl $-1, %%ecx\n\t"
+ "movl $-1, %%edx\n\t"
+ "movl $-1, %%esi\n\t"
+ "movl $-1, %%edi\n\t"
+ "movl $-1, %%ebp\n\t"
+ "movl $-1, %%esp\n\t"
+ "syscall\n\t"
+ "pushl $0" /* make sure we segfault cleanly */
+ : : : "memory", "flags");
+ }
+
+ return 0;
+}
--
2.4.3

2015-07-03 19:45:09

by Andy Lutomirski

Subject: [PATCH v5 02/17] x86/entry/64/compat: Fix bad fast syscall arg failure path

If user code does SYSCALL32 or SYSENTER without a valid stack, then
our attempt to determine the syscall args will result in a failed
uaccess fault. Previously, we would try to recover by jumping to
the syscall exit code, but we'd run the syscall exit work even
though we never made it to the syscall entry work.

Clean it up by treating the failure path as a non-syscall entry and
exit pair.

This fixes strace's output when running the syscall_arg_fault test.
Without this fix, strace would get out of sync and would fail to
associate syscall entries with syscall exits.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 2 +-
arch/x86/entry/entry_64_compat.S | 35 +++++++++++++++++++++++++++++++++--
2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3bb2c4302df1..141a5d49dddc 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -613,7 +613,7 @@ ret_from_intr:
testb $3, CS(%rsp)
jz retint_kernel
/* Interrupt came from user space */
-retint_user:
+GLOBAL(retint_user)
GET_THREAD_INFO(%rcx)

/* %rcx: thread info. Interrupts are off. */
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index bb187a6a877c..efe0b1e499fa 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -425,8 +425,39 @@ cstar_tracesys:
END(entry_SYSCALL_compat)

ia32_badarg:
- ASM_CLAC
- movq $-EFAULT, RAX(%rsp)
+ /*
+ * So far, we've entered kernel mode, set AC, turned on IRQs, and
+ * saved C regs except r8-r11. We haven't done any of the other
+ * standard entry work, though. We want to bail, but we shouldn't
+ * treat this as a syscall entry since we don't even know what the
+ * args are. Instead, treat this as a non-syscall entry, finish
+ * the entry work, and immediately exit after setting AX = -EFAULT.
+ *
+ * We're really just being polite here. Killing the task outright
+ * would be a reasonable action, too. Given that the only valid
+ * way to have gotten here is through the vDSO, and we already know
+ * that the stack pointer is bad, the task isn't going to survive
+ * for long no matter what we do.
+ */
+
+ ASM_CLAC /* undo STAC */
+ movq $-EFAULT, RAX(%rsp) /* return -EFAULT if possible */
+
+ /* Fill in the rest of pt_regs */
+ xorl %eax, %eax
+ movq %rax, R11(%rsp)
+ movq %rax, R10(%rsp)
+ movq %rax, R9(%rsp)
+ movq %rax, R8(%rsp)
+ SAVE_EXTRA_REGS
+
+ /* Turn IRQs back off. */
+ DISABLE_INTERRUPTS(CLBR_NONE)
+ TRACE_IRQS_OFF
+
+ /* And exit again. */
+ jmp retint_user
+
ia32_ret_from_sys_call:
xorl %eax, %eax /* Do not leak kernel information */
movq %rax, R11(%rsp)
--
2.4.3

2015-07-03 19:45:00

by Andy Lutomirski

Subject: [PATCH v5 03/17] uml: Fix do_signal() prototype

From: Ingo Molnar <[email protected]>

Once x86 exports its do_signal(), the prototypes will clash.

Fix the clash and also improve the code a bit: remove the unnecessary
kern_do_signal() indirection. This allows interrupt_end() to share
the 'regs' parameter calculation.

Also remove the unused return code to match x86.

Minimally build and boot tested.

Cc: Richard Weinberger <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
[Adjusted the commit message because I reordered the patch. --Andy]
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/um/include/shared/kern_util.h | 3 ++-
arch/um/kernel/process.c | 6 ++++--
arch/um/kernel/signal.c | 8 +-------
arch/um/kernel/tlb.c | 2 +-
arch/um/kernel/trap.c | 2 +-
5 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index 83a91f976330..35ab97e4bb9b 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -22,7 +22,8 @@ extern int kmalloc_ok;
extern unsigned long alloc_stack(int order, int atomic);
extern void free_stack(unsigned long stack, int order);

-extern int do_signal(void);
+struct pt_regs;
+extern void do_signal(struct pt_regs *regs);
extern void interrupt_end(void);
extern void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs);

diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 68b9119841cd..a6d922672b9f 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -90,12 +90,14 @@ void *__switch_to(struct task_struct *from, struct task_struct *to)

void interrupt_end(void)
{
+ struct pt_regs *regs = &current->thread.regs;
+
if (need_resched())
schedule();
if (test_thread_flag(TIF_SIGPENDING))
- do_signal();
+ do_signal(regs);
if (test_and_clear_thread_flag(TIF_NOTIFY_RESUME))
- tracehook_notify_resume(&current->thread.regs);
+ tracehook_notify_resume(regs);
}

void exit_thread(void)
diff --git a/arch/um/kernel/signal.c b/arch/um/kernel/signal.c
index 4f60e4aad790..57acbd67d85d 100644
--- a/arch/um/kernel/signal.c
+++ b/arch/um/kernel/signal.c
@@ -64,7 +64,7 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
signal_setup_done(err, ksig, singlestep);
}

-static int kern_do_signal(struct pt_regs *regs)
+void do_signal(struct pt_regs *regs)
{
struct ksignal ksig;
int handled_sig = 0;
@@ -110,10 +110,4 @@ static int kern_do_signal(struct pt_regs *regs)
*/
if (!handled_sig)
restore_saved_sigmask();
- return handled_sig;
-}
-
-int do_signal(void)
-{
- return kern_do_signal(&current->thread.regs);
}
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index f1b3eb14b855..2077248e8a72 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -291,7 +291,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
/* We are under mmap_sem, release it such that current can terminate */
up_write(&current->mm->mmap_sem);
force_sig(SIGKILL, current);
- do_signal();
+ do_signal(&current->thread.regs);
}
}

diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 47ff9b7f3e5d..1b0f5c59d522 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -173,7 +173,7 @@ static void bad_segv(struct faultinfo fi, unsigned long ip)
void fatal_sigsegv(void)
{
force_sigsegv(SIGSEGV, current);
- do_signal();
+ do_signal(&current->thread.regs);
/*
* This is to tell gcc that we're not returning - do_signal
* can, in general, return, but in this case, it's not, since
--
2.4.3

2015-07-03 19:45:27

by Andy Lutomirski

Subject: [PATCH v5 04/17] context_tracking: Add ct_state and CT_WARN_ON

This will let us sprinkle sanity checks around the kernel without
making too much of a mess.
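
For example (this is how later patches in this series use it), a
handler that must only run after we've switched to kernel context
can assert:

	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);

which warns only if context tracking is enabled and the state is
wrong.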

Signed-off-by: Andy Lutomirski <[email protected]>
---
include/linux/context_tracking.h | 15 +++++++++++++++
include/linux/context_tracking_state.h | 1 +
2 files changed, 16 insertions(+)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..008fc67d0d96 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -49,13 +49,28 @@ static inline void exception_exit(enum ctx_state prev_ctx)
}
}

+
+/**
+ * ct_state() - return the current context tracking state if known
+ *
+ * Returns the current cpu's context tracking state if context tracking
+ * is enabled. If context tracking is disabled, returns
+ * CONTEXT_DISABLED. This should be used primarily for debugging.
+ */
+static inline enum ctx_state ct_state(void)
+{
+ return context_tracking_is_enabled() ?
+ this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
+}
#else
static inline void user_enter(void) { }
static inline void user_exit(void) { }
static inline enum ctx_state exception_enter(void) { return 0; }
static inline void exception_exit(enum ctx_state prev_ctx) { }
+static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
#endif /* !CONFIG_CONTEXT_TRACKING */

+#define CT_WARN_ON(cond) WARN_ON(context_tracking_is_enabled() && (cond))

#ifdef CONFIG_CONTEXT_TRACKING_FORCE
extern void context_tracking_init(void);
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 678ecdf90cf6..ee956c528fab 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -14,6 +14,7 @@ struct context_tracking {
bool active;
int recursion;
enum ctx_state {
+ CONTEXT_DISABLED = -1, /* returned by ct_state() if unknown */
CONTEXT_KERNEL = 0,
CONTEXT_USER,
CONTEXT_GUEST,
--
2.4.3

2015-07-03 19:49:12

by Andy Lutomirski

Subject: [PATCH v5 05/17] notifiers: Assert that RCU is watching in notify_die

Low-level arch entries often call notify_die, and it's easy for arch
code to fail to exit an RCU quiescent state first. Assert that
we're not quiescent in notify_die.

Signed-off-by: Andy Lutomirski <[email protected]>
---
kernel/notifier.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/notifier.c b/kernel/notifier.c
index ae9fc7cc360e..980e4330fb59 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -544,6 +544,8 @@ int notrace notify_die(enum die_val val, const char *str,
.signr = sig,

};
+ rcu_lockdep_assert(rcu_is_watching(),
+ "notify_die called but RCU thinks we're quiescent");
return atomic_notifier_call_chain(&die_chain, val, &args);
}
NOKPROBE_SYMBOL(notify_die);
--
2.4.3

2015-07-03 19:48:40

by Andy Lutomirski

Subject: [PATCH v5 06/17] x86: Move C entry and exit code to arch/x86/entry/common.c

The entry and exit C helpers were confusingly scattered between
ptrace.c and signal.c, even though they aren't specific to ptrace or
signal handling. Move them together in a new file.

This change just moves code around. It doesn't change anything.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/Makefile | 1 +
arch/x86/entry/common.c | 253 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/signal.h | 1 +
arch/x86/kernel/ptrace.c | 202 +--------------------------------
arch/x86/kernel/signal.c | 28 +----
5 files changed, 257 insertions(+), 228 deletions(-)
create mode 100644 arch/x86/entry/common.c

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 7a144971db79..bd55dedd7614 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -2,6 +2,7 @@
# Makefile for the x86 low level entry code
#
obj-y := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
+obj-y += common.o

obj-y += vdso/
obj-y += vsyscall/
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
new file mode 100644
index 000000000000..917d0c3cb851
--- /dev/null
+++ b/arch/x86/entry/common.c
@@ -0,0 +1,253 @@
+/*
+ * common.c - C code for kernel entry and exit
+ * Copyright (c) 2015 Andrew Lutomirski
+ * GPL v2
+ *
+ * Based on asm and ptrace code by many authors. The code here originated
+ * in ptrace.c and signal.c.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/errno.h>
+#include <linux/ptrace.h>
+#include <linux/tracehook.h>
+#include <linux/audit.h>
+#include <linux/seccomp.h>
+#include <linux/signal.h>
+#include <linux/export.h>
+#include <linux/context_tracking.h>
+#include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>
+
+#include <asm/desc.h>
+#include <asm/traps.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/syscalls.h>
+
+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+#ifdef CONFIG_X86_64
+ if (arch == AUDIT_ARCH_X86_64) {
+ audit_syscall_entry(regs->orig_ax, regs->di,
+ regs->si, regs->dx, regs->r10);
+ } else
+#endif
+ {
+ audit_syscall_entry(regs->orig_ax, regs->bx,
+ regs->cx, regs->dx, regs->si);
+ }
+}
+
+/*
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2. If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0: resume the syscall
+ * 1: go to phase 2; no seccomp phase 2 needed
+ * anything else: go to phase 2; pass return value to seccomp
+ */
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
+{
+ unsigned long ret = 0;
+ u32 work;
+
+ BUG_ON(regs != task_pt_regs(current));
+
+ work = ACCESS_ONCE(current_thread_info()->flags) &
+ _TIF_WORK_SYSCALL_ENTRY;
+
+ /*
+ * If TIF_NOHZ is set, we are required to call user_exit() before
+ * doing anything that could touch RCU.
+ */
+ if (work & _TIF_NOHZ) {
+ user_exit();
+ work &= ~_TIF_NOHZ;
+ }
+
+#ifdef CONFIG_SECCOMP
+ /*
+ * Do seccomp first -- it should minimize exposure of other
+ * code, and keeping seccomp fast is probably more valuable
+ * than the rest of this.
+ */
+ if (work & _TIF_SECCOMP) {
+ struct seccomp_data sd;
+
+ sd.arch = arch;
+ sd.nr = regs->orig_ax;
+ sd.instruction_pointer = regs->ip;
+#ifdef CONFIG_X86_64
+ if (arch == AUDIT_ARCH_X86_64) {
+ sd.args[0] = regs->di;
+ sd.args[1] = regs->si;
+ sd.args[2] = regs->dx;
+ sd.args[3] = regs->r10;
+ sd.args[4] = regs->r8;
+ sd.args[5] = regs->r9;
+ } else
+#endif
+ {
+ sd.args[0] = regs->bx;
+ sd.args[1] = regs->cx;
+ sd.args[2] = regs->dx;
+ sd.args[3] = regs->si;
+ sd.args[4] = regs->di;
+ sd.args[5] = regs->bp;
+ }
+
+ BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+ BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+ ret = seccomp_phase1(&sd);
+ if (ret == SECCOMP_PHASE1_SKIP) {
+ regs->orig_ax = -1;
+ ret = 0;
+ } else if (ret != SECCOMP_PHASE1_OK) {
+ return ret; /* Go directly to phase 2 */
+ }
+
+ work &= ~_TIF_SECCOMP;
+ }
+#endif
+
+ /* Do our best to finish without phase 2. */
+ if (work == 0)
+ return ret; /* seccomp and/or nohz only (ret == 0 here) */
+
+#ifdef CONFIG_AUDITSYSCALL
+ if (work == _TIF_SYSCALL_AUDIT) {
+ /*
+ * If there is no more work to be done except auditing,
+ * then audit in phase 1. Phase 2 always audits, so, if
+ * we audit here, then we can't go on to phase 2.
+ */
+ do_audit_syscall_entry(regs, arch);
+ return 0;
+ }
+#endif
+
+ return 1; /* Something is enabled that we can't handle in phase 1 */
+}
+
+/* Returns the syscall nr to run (which should match regs->orig_ax). */
+long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
+ unsigned long phase1_result)
+{
+ long ret = 0;
+ u32 work = ACCESS_ONCE(current_thread_info()->flags) &
+ _TIF_WORK_SYSCALL_ENTRY;
+
+ BUG_ON(regs != task_pt_regs(current));
+
+ /*
+ * If we stepped into a sysenter/syscall insn, it trapped in
+ * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
+ * If user-mode had set TF itself, then it's still clear from
+ * do_debug() and we need to set it again to restore the user
+ * state. If we entered on the slow path, TF was already set.
+ */
+ if (work & _TIF_SINGLESTEP)
+ regs->flags |= X86_EFLAGS_TF;
+
+#ifdef CONFIG_SECCOMP
+ /*
+ * Call seccomp_phase2 before running the other hooks so that
+ * they can see any changes made by a seccomp tracer.
+ */
+ if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
+ /* seccomp failures shouldn't expose any additional code. */
+ return -1;
+ }
+#endif
+
+ if (unlikely(work & _TIF_SYSCALL_EMU))
+ ret = -1L;
+
+ if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
+ tracehook_report_syscall_entry(regs))
+ ret = -1L;
+
+ if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+ trace_sys_enter(regs, regs->orig_ax);
+
+ do_audit_syscall_entry(regs, arch);
+
+ return ret ?: regs->orig_ax;
+}
+
+long syscall_trace_enter(struct pt_regs *regs)
+{
+ u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
+ unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
+
+ if (phase1_result == 0)
+ return regs->orig_ax;
+ else
+ return syscall_trace_enter_phase2(regs, arch, phase1_result);
+}
+
+void syscall_trace_leave(struct pt_regs *regs)
+{
+ bool step;
+
+ /*
+ * We may come here right after calling schedule_user()
+ * or do_notify_resume(), in which case we can be in RCU
+ * user mode.
+ */
+ user_exit();
+
+ audit_syscall_exit(regs);
+
+ if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+ trace_sys_exit(regs, regs->ax);
+
+ /*
+ * If TIF_SYSCALL_EMU is set, we only get here because of
+ * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
+ * We already reported this syscall instruction in
+ * syscall_trace_enter().
+ */
+ step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
+ !test_thread_flag(TIF_SYSCALL_EMU);
+ if (step || test_thread_flag(TIF_SYSCALL_TRACE))
+ tracehook_report_syscall_exit(regs, step);
+
+ user_enter();
+}
+
+/*
+ * notification of userspace execution resumption
+ * - triggered by the TIF_WORK_MASK flags
+ */
+__visible void
+do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
+{
+ user_exit();
+
+ if (thread_info_flags & _TIF_UPROBE)
+ uprobe_notify_resume(regs);
+
+ /* deal with pending signal delivery */
+ if (thread_info_flags & _TIF_SIGPENDING)
+ do_signal(regs);
+
+ if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ }
+ if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
+ fire_user_return_notifiers();
+
+ user_enter();
+}
diff --git a/arch/x86/include/asm/signal.h b/arch/x86/include/asm/signal.h
index 31eab867e6d3..b42408bcf6b5 100644
--- a/arch/x86/include/asm/signal.h
+++ b/arch/x86/include/asm/signal.h
@@ -30,6 +30,7 @@ typedef sigset_t compat_sigset_t;
#endif /* __ASSEMBLY__ */
#include <uapi/asm/signal.h>
#ifndef __ASSEMBLY__
+extern void do_signal(struct pt_regs *regs);
extern void do_notify_resume(struct pt_regs *, void *, __u32);

#define __ARCH_HAS_SA_RESTORER
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..4aa1ab6435d3 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -37,12 +37,10 @@
#include <asm/proto.h>
#include <asm/hw_breakpoint.h>
#include <asm/traps.h>
+#include <asm/syscall.h>

#include "tls.h"

-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
enum x86_regset {
REGSET_GENERAL,
REGSET_FP,
@@ -1434,201 +1432,3 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
/* Send us the fake SIGTRAP */
force_sig_info(SIGTRAP, &info, tsk);
}
-
-static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
-{
-#ifdef CONFIG_X86_64
- if (arch == AUDIT_ARCH_X86_64) {
- audit_syscall_entry(regs->orig_ax, regs->di,
- regs->si, regs->dx, regs->r10);
- } else
-#endif
- {
- audit_syscall_entry(regs->orig_ax, regs->bx,
- regs->cx, regs->dx, regs->si);
- }
-}
-
-/*
- * We can return 0 to resume the syscall or anything else to go to phase
- * 2. If we resume the syscall, we need to put something appropriate in
- * regs->orig_ax.
- *
- * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
- * are fully functional.
- *
- * For phase 2's benefit, our return value is:
- * 0: resume the syscall
- * 1: go to phase 2; no seccomp phase 2 needed
- * anything else: go to phase 2; pass return value to seccomp
- */
-unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
-{
- unsigned long ret = 0;
- u32 work;
-
- BUG_ON(regs != task_pt_regs(current));
-
- work = ACCESS_ONCE(current_thread_info()->flags) &
- _TIF_WORK_SYSCALL_ENTRY;
-
- /*
- * If TIF_NOHZ is set, we are required to call user_exit() before
- * doing anything that could touch RCU.
- */
- if (work & _TIF_NOHZ) {
- user_exit();
- work &= ~_TIF_NOHZ;
- }
-
-#ifdef CONFIG_SECCOMP
- /*
- * Do seccomp first -- it should minimize exposure of other
- * code, and keeping seccomp fast is probably more valuable
- * than the rest of this.
- */
- if (work & _TIF_SECCOMP) {
- struct seccomp_data sd;
-
- sd.arch = arch;
- sd.nr = regs->orig_ax;
- sd.instruction_pointer = regs->ip;
-#ifdef CONFIG_X86_64
- if (arch == AUDIT_ARCH_X86_64) {
- sd.args[0] = regs->di;
- sd.args[1] = regs->si;
- sd.args[2] = regs->dx;
- sd.args[3] = regs->r10;
- sd.args[4] = regs->r8;
- sd.args[5] = regs->r9;
- } else
-#endif
- {
- sd.args[0] = regs->bx;
- sd.args[1] = regs->cx;
- sd.args[2] = regs->dx;
- sd.args[3] = regs->si;
- sd.args[4] = regs->di;
- sd.args[5] = regs->bp;
- }
-
- BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
- BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
-
- ret = seccomp_phase1(&sd);
- if (ret == SECCOMP_PHASE1_SKIP) {
- regs->orig_ax = -1;
- ret = 0;
- } else if (ret != SECCOMP_PHASE1_OK) {
- return ret; /* Go directly to phase 2 */
- }
-
- work &= ~_TIF_SECCOMP;
- }
-#endif
-
- /* Do our best to finish without phase 2. */
- if (work == 0)
- return ret; /* seccomp and/or nohz only (ret == 0 here) */
-
-#ifdef CONFIG_AUDITSYSCALL
- if (work == _TIF_SYSCALL_AUDIT) {
- /*
- * If there is no more work to be done except auditing,
- * then audit in phase 1. Phase 2 always audits, so, if
- * we audit here, then we can't go on to phase 2.
- */
- do_audit_syscall_entry(regs, arch);
- return 0;
- }
-#endif
-
- return 1; /* Something is enabled that we can't handle in phase 1 */
-}
-
-/* Returns the syscall nr to run (which should match regs->orig_ax). */
-long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
- unsigned long phase1_result)
-{
- long ret = 0;
- u32 work = ACCESS_ONCE(current_thread_info()->flags) &
- _TIF_WORK_SYSCALL_ENTRY;
-
- BUG_ON(regs != task_pt_regs(current));
-
- /*
- * If we stepped into a sysenter/syscall insn, it trapped in
- * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
- * If user-mode had set TF itself, then it's still clear from
- * do_debug() and we need to set it again to restore the user
- * state. If we entered on the slow path, TF was already set.
- */
- if (work & _TIF_SINGLESTEP)
- regs->flags |= X86_EFLAGS_TF;
-
-#ifdef CONFIG_SECCOMP
- /*
- * Call seccomp_phase2 before running the other hooks so that
- * they can see any changes made by a seccomp tracer.
- */
- if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
- /* seccomp failures shouldn't expose any additional code. */
- return -1;
- }
-#endif
-
- if (unlikely(work & _TIF_SYSCALL_EMU))
- ret = -1L;
-
- if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
- tracehook_report_syscall_entry(regs))
- ret = -1L;
-
- if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
- trace_sys_enter(regs, regs->orig_ax);
-
- do_audit_syscall_entry(regs, arch);
-
- return ret ?: regs->orig_ax;
-}
-
-long syscall_trace_enter(struct pt_regs *regs)
-{
- u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
- unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
-
- if (phase1_result == 0)
- return regs->orig_ax;
- else
- return syscall_trace_enter_phase2(regs, arch, phase1_result);
-}
-
-void syscall_trace_leave(struct pt_regs *regs)
-{
- bool step;
-
- /*
- * We may come here right after calling schedule_user()
- * or do_notify_resume(), in which case we can be in RCU
- * user mode.
- */
- user_exit();
-
- audit_syscall_exit(regs);
-
- if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
- trace_sys_exit(regs, regs->ax);
-
- /*
- * If TIF_SYSCALL_EMU is set, we only get here because of
- * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
- * We already reported this syscall instruction in
- * syscall_trace_enter().
- */
- step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
- !test_thread_flag(TIF_SYSCALL_EMU);
- if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
-
- user_enter();
-}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 206996c1669d..197c44e8ff8b 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -701,7 +701,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
* want to handle. Thus you cannot kill init even with a SIGKILL even by
* mistake.
*/
-static void do_signal(struct pt_regs *regs)
+void do_signal(struct pt_regs *regs)
{
struct ksignal ksig;

@@ -736,32 +736,6 @@ static void do_signal(struct pt_regs *regs)
restore_saved_sigmask();
}

-/*
- * notification of userspace execution resumption
- * - triggered by the TIF_WORK_MASK flags
- */
-__visible void
-do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
-{
- user_exit();
-
- if (thread_info_flags & _TIF_UPROBE)
- uprobe_notify_resume(regs);
-
- /* deal with pending signal delivery */
- if (thread_info_flags & _TIF_SIGPENDING)
- do_signal(regs);
-
- if (thread_info_flags & _TIF_NOTIFY_RESUME) {
- clear_thread_flag(TIF_NOTIFY_RESUME);
- tracehook_notify_resume(regs);
- }
- if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
- fire_user_return_notifiers();
-
- user_enter();
-}
-
void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
{
struct task_struct *me = current;
--
2.4.3

2015-07-03 19:48:30

by Andy Lutomirski

Subject: [PATCH v5 07/17] x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries

Other than the super-atomic exception entries, all exception entries
are supposed to switch our context tracking state to CONTEXT_KERNEL.
Assert that they do. These assertions appear trivial at this point,
as exception_enter is the function responsible for switching
context, but I'm planning on reworking x86's exception context
tracking, and these assertions will help make sure that all of this
code keeps working.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/traps.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index f5791927aa64..2a783c4fe0e9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,6 +292,8 @@ static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
enum ctx_state prev_state = exception_enter();
siginfo_t info;

+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
+
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
NOTIFY_STOP) {
conditional_sti(regs);
@@ -376,6 +378,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
siginfo_t *info;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
if (notify_die(DIE_TRAP, "bounds", regs, error_code,
X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
goto exit;
@@ -457,6 +460,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
conditional_sti(regs);

if (v8086_mode(regs)) {
@@ -514,6 +518,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
return;

prev_state = ist_enter(regs);
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
SIGTRAP) == NOTIFY_STOP)
@@ -750,6 +755,7 @@ dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
math_error(regs, error_code, X86_TRAP_MF);
exception_exit(prev_state);
}
@@ -760,6 +766,7 @@ do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
math_error(regs, error_code, X86_TRAP_XF);
exception_exit(prev_state);
}
@@ -776,6 +783,7 @@ do_device_not_available(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
BUG_ON(use_eager_fpu());

#ifdef CONFIG_MATH_EMULATION
@@ -805,6 +813,7 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
local_irq_enable();

info.si_signo = SIGILL;
--
2.4.3

2015-07-03 19:48:10

by Andy Lutomirski

Subject: [PATCH v5 08/17] x86/entry: Add enter_from_user_mode and use it in syscalls

Changing the x86 context tracking hooks is dangerous because there
are no good checks that we track our context correctly. Add a
helper to check that we're actually in CONTEXT_USER when we enter
from user mode and wire it up for syscall entries.

Subsequent patches will wire this up for all non-NMI entries as
well. NMIs are their own special beast and cannot currently switch
overall context tracking state. Instead, they have their own
special RCU hooks.

This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
branch) and a tiny slowdown if CONFIG_CONTEXT_TRACKING (adds a layer
of indirection). Eventually, we should fix up the core context
tracking code to supply a function that does what we want (and can
be much simpler than user_exit), which will enable us to get rid of
the extra call.
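
For reference, later patches in this series end up invoking the new
helper straight from the asm entry paths, guarded the same way:

	#ifdef CONFIG_CONTEXT_TRACKING
		call enter_from_user_mode
	#endif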

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/common.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 917d0c3cb851..9a327ee24eef 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -28,6 +28,15 @@
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>

+#ifdef CONFIG_CONTEXT_TRACKING
+/* Called on entry from user mode with IRQs off. */
+__visible void enter_from_user_mode(void)
+{
+ CT_WARN_ON(ct_state() != CONTEXT_USER);
+ user_exit();
+}
+#endif
+
static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
{
#ifdef CONFIG_X86_64
@@ -65,14 +74,16 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
work = ACCESS_ONCE(current_thread_info()->flags) &
_TIF_WORK_SYSCALL_ENTRY;

+#ifdef CONFIG_CONTEXT_TRACKING
/*
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
if (work & _TIF_NOHZ) {
- user_exit();
+ enter_from_user_mode();
work &= ~_TIF_NOHZ;
}
+#endif

#ifdef CONFIG_SECCOMP
/*
--
2.4.3

2015-07-03 19:48:04

by Andy Lutomirski

Subject: [PATCH v5 09/17] x86/entry: Add new, comprehensible entry and exit hooks

The current entry and exit code is incomprehensible, appears to work
primarily by luck, and is very difficult to incrementally improve. Add
new code in preparation for simply deleting the old code.

prepare_exit_to_usermode is a new function that will handle all slow
path exits to user mode. It is called with IRQs disabled and it
leaves us in a state in which it is safe to immediately return to
user mode. IRQs must not be re-enabled at any point between the time
prepare_exit_to_usermode returns and the time user mode is actually entered.
(We can, of course, fail to enter user mode and treat that failure
as a fresh entry to kernel mode.) All callers of do_notify_resume
will be migrated to call prepare_exit_to_usermode instead;
prepare_exit_to_usermode needs to do everything that
do_notify_resume does, but it also takes care of scheduling and
context tracking. Unlike do_notify_resume, it does not need to be
called in a loop.
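
For reference, once the migration patches later in this series land,
the asm caller ends up looking roughly like:

	GLOBAL(retint_user)
		mov	%rsp, %rdi
		call	prepare_exit_to_usermode
		TRACE_IRQS_IRETQ
		SWAPGS
		jmp	restore_regs_and_iret

with no flag-checking loop left in asm.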

syscall_return_slowpath is exactly what it sounds like. It will be
called on any syscall exit slow path. It replaces
syscall_trace_leave and calls prepare_exit_to_usermode on the way
out.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/common.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 111 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 9a327ee24eef..febc53086a69 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -207,6 +207,7 @@ long syscall_trace_enter(struct pt_regs *regs)
return syscall_trace_enter_phase2(regs, arch, phase1_result);
}

+/* Deprecated. */
void syscall_trace_leave(struct pt_regs *regs)
{
bool step;
@@ -237,8 +238,117 @@ void syscall_trace_leave(struct pt_regs *regs)
user_enter();
}

+static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
+{
+ unsigned long top_of_stack =
+ (unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
+ return (struct thread_info *)(top_of_stack - THREAD_SIZE);
+}
+
+/* Called with IRQs disabled. */
+__visible void prepare_exit_to_usermode(struct pt_regs *regs)
+{
+ if (WARN_ON(!irqs_disabled()))
+ local_irq_disable();
+
+ /*
+ * In order to return to user mode, we need to have IRQs off with
+ * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
+ * _TIF_UPROBE, or _TIF_NEED_RESCHED set. Several of these flags
+ * can be set at any time on preemptable kernels if we have IRQs on,
+ * so we need to loop. Disabling preemption wouldn't help: doing the
+ * work to clear some of the flags can sleep.
+ */
+ while (true) {
+ u32 cached_flags =
+ READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+
+ if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
+ _TIF_UPROBE | _TIF_NEED_RESCHED)))
+ break;
+
+ /* We have work to do. */
+ local_irq_enable();
+
+ if (cached_flags & _TIF_NEED_RESCHED)
+ schedule();
+
+ if (cached_flags & _TIF_UPROBE)
+ uprobe_notify_resume(regs);
+
+ /* deal with pending signal delivery */
+ if (cached_flags & _TIF_SIGPENDING)
+ do_signal(regs);
+
+ if (cached_flags & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ }
+
+ if (cached_flags & _TIF_USER_RETURN_NOTIFY)
+ fire_user_return_notifiers();
+
+ /* Disable IRQs and retry */
+ local_irq_disable();
+ }
+
+ user_enter();
+}
+
+/*
+ * Called with IRQs on and fully valid regs. Returns with IRQs off in a
+ * state such that we can immediately switch to user mode.
+ */
+__visible void syscall_return_slowpath(struct pt_regs *regs)
+{
+ struct thread_info *ti = pt_regs_to_thread_info(regs);
+ u32 cached_flags = READ_ONCE(ti->flags);
+ bool step;
+
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
+
+ if (WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
+ regs->orig_ax))
+ local_irq_enable();
+
+ /*
+ * First do one-time work. If these work items are enabled, we
+ * want to run them exactly once per syscall exit with IRQs on.
+ */
+ if (cached_flags & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |
+ _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)) {
+ audit_syscall_exit(regs);
+
+ if (cached_flags & _TIF_SYSCALL_TRACEPOINT)
+ trace_sys_exit(regs, regs->ax);
+
+ /*
+ * If TIF_SYSCALL_EMU is set, we only get here because of
+ * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
+ * We already reported this syscall instruction in
+ * syscall_trace_enter().
+ */
+ step = unlikely(
+ (cached_flags & (_TIF_SINGLESTEP | _TIF_SYSCALL_EMU))
+ == _TIF_SINGLESTEP);
+ if (step || cached_flags & _TIF_SYSCALL_TRACE)
+ tracehook_report_syscall_exit(regs, step);
+ }
+
+#ifdef CONFIG_COMPAT
+ /*
+ * Compat syscalls set TS_COMPAT. Make sure we clear it before
+ * returning to user mode.
+ */
+ ti->status &= ~TS_COMPAT;
+#endif
+
+ local_irq_disable();
+ prepare_exit_to_usermode(regs);
+}
+
/*
- * notification of userspace execution resumption
+ * Deprecated notification of userspace execution resumption
* - triggered by the TIF_WORK_MASK flags
*/
__visible void
--
2.4.3

2015-07-03 19:47:47

by Andy Lutomirski

Subject: [PATCH v5 10/17] x86/entry/64: Really create an error-entry-from-usermode code path

In 539f51136500 ("x86/asm/entry/64: Disentangle error_entry/exit
gsbase/ebx/usermode code"), I arranged the code slightly wrong --
IRET faults would skip the code path that was intended to execute on
all error entries from user mode. Fix it up.

While we're at it, make all the labels in error_entry local.

This does not fix a bug, but we'll need it, and it slightly shrinks
the code.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 141a5d49dddc..ccfcba90de6e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1143,12 +1143,17 @@ ENTRY(error_entry)
SAVE_EXTRA_REGS 8
xorl %ebx, %ebx
testb $3, CS+8(%rsp)
- jz error_kernelspace
+ jz .Lerror_kernelspace

- /* We entered from user mode */
+.Lerror_entry_from_usermode_swapgs:
+ /*
+ * We entered from user mode or we're pretending to have entered
+ * from user mode due to an IRET fault.
+ */
SWAPGS

-error_entry_done:
+.Lerror_entry_from_usermode_after_swapgs:
+.Lerror_entry_done:
TRACE_IRQS_OFF
ret

@@ -1158,31 +1163,30 @@ error_entry_done:
* truncated RIP for IRET exceptions returning to compat mode. Check
* for these here too.
*/
-error_kernelspace:
+.Lerror_kernelspace:
incl %ebx
leaq native_irq_return_iret(%rip), %rcx
cmpq %rcx, RIP+8(%rsp)
- je error_bad_iret
+ je .Lerror_bad_iret
movl %ecx, %eax /* zero extend */
cmpq %rax, RIP+8(%rsp)
- je bstep_iret
+ je .Lbstep_iret
cmpq $gs_change, RIP+8(%rsp)
- jne error_entry_done
+ jne .Lerror_entry_done

/*
* hack: gs_change can fail with user gsbase. If this happens, fix up
* gsbase and proceed. We'll fix up the exception and land in
* gs_change's error handler with kernel gsbase.
*/
- SWAPGS
- jmp error_entry_done
+ jmp .Lerror_entry_from_usermode_swapgs

-bstep_iret:
+.Lbstep_iret:
/* Fix truncated RIP */
movq %rcx, RIP+8(%rsp)
/* fall through */

-error_bad_iret:
+.Lerror_bad_iret:
/*
* We came from an IRET to user mode, so we have user gsbase.
* Switch to kernel gsbase:
@@ -1198,7 +1202,7 @@ error_bad_iret:
call fixup_bad_iret
mov %rax, %rsp
decl %ebx
- jmp error_entry_done
+ jmp .Lerror_entry_from_usermode_after_swapgs
END(error_entry)


--
2.4.3

2015-07-03 19:47:23

by Andy Lutomirski

Subject: [PATCH v5 11/17] x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks

These need to be migrated together, as the compat case used to jump
into the middle of the 64-bit exit code.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 69 +++++-----------------------------------
arch/x86/entry/entry_64_compat.S | 6 ++--
2 files changed, 11 insertions(+), 64 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ccfcba90de6e..4ca5b782ed70 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
*/
USERGS_SYSRET64

+GLOBAL(int_ret_from_sys_call_irqs_off)
+ TRACE_IRQS_ON
+ ENABLE_INTERRUPTS(CLBR_NONE)
+ jmp int_ret_from_sys_call
+
/* Do syscall entry tracing */
tracesys:
movq %rsp, %rdi
@@ -272,69 +277,11 @@ tracesys_phase2:
* Has correct iret frame.
*/
GLOBAL(int_ret_from_sys_call)
- DISABLE_INTERRUPTS(CLBR_NONE)
-int_ret_from_sys_call_irqs_off: /* jumps come here from the irqs-off SYSRET path */
- TRACE_IRQS_OFF
- movl $_TIF_ALLWORK_MASK, %edi
- /* edi: mask to check */
-GLOBAL(int_with_check)
- LOCKDEP_SYS_EXIT_IRQ
- GET_THREAD_INFO(%rcx)
- movl TI_flags(%rcx), %edx
- andl %edi, %edx
- jnz int_careful
- andl $~TS_COMPAT, TI_status(%rcx)
- jmp syscall_return
-
- /*
- * Either reschedule or signal or syscall exit tracking needed.
- * First do a reschedule test.
- * edx: work, edi: workmask
- */
-int_careful:
- bt $TIF_NEED_RESCHED, %edx
- jnc int_very_careful
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq %rdi
- SCHEDULE_USER
- popq %rdi
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp int_with_check
-
- /* handle signals and tracing -- both require a full pt_regs */
-int_very_careful:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_EXTRA_REGS
- /* Check for syscall exit trace */
- testl $_TIF_WORK_SYSCALL_EXIT, %edx
- jz int_signal
- pushq %rdi
- leaq 8(%rsp), %rdi /* &ptregs -> arg1 */
- call syscall_trace_leave
- popq %rdi
- andl $~(_TIF_WORK_SYSCALL_EXIT|_TIF_SYSCALL_EMU), %edi
- jmp int_restore_rest
-
-int_signal:
- testl $_TIF_DO_NOTIFY_MASK, %edx
- jz 1f
- movq %rsp, %rdi /* &ptregs -> arg1 */
- xorl %esi, %esi /* oldset -> arg2 */
- call do_notify_resume
-1: movl $_TIF_WORK_MASK, %edi
-int_restore_rest:
+ movq %rsp, %rdi
+ call syscall_return_slowpath /* returns with IRQs disabled */
RESTORE_EXTRA_REGS
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp int_with_check
-
-syscall_return:
- /* The IRETQ could re-enable interrupts: */
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_IRETQ
+ TRACE_IRQS_IRETQ /* we're about to change IF */

/*
* Try to use SYSRET instead of IRET if we're returning to
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index efe0b1e499fa..204528cf4359 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -209,10 +209,10 @@ sysexit_from_sys_call:
.endm

.macro auditsys_exit exit
- testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
- jnz ia32_ret_from_sys_call
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
+ testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+ jnz ia32_ret_from_sys_call
movl %eax, %esi /* second arg, syscall return value */
cmpl $-MAX_ERRNO, %eax /* is it an error ? */
jbe 1f
@@ -231,7 +231,7 @@ sysexit_from_sys_call:
movq %rax, R10(%rsp)
movq %rax, R9(%rsp)
movq %rax, R8(%rsp)
- jmp int_with_check
+ jmp int_ret_from_sys_call_irqs_off
.endm

sysenter_auditsys:
--
2.4.3

2015-07-03 19:47:14

by Andy Lutomirski

Subject: [PATCH v5 12/17] x86/asm/entry/64: Save all regs on interrupt entry

To prepare for the big rewrite of the error and interrupt exit
paths, we will need pt_regs completely filled in. It's already
completely filled in when error_exit runs, so rearrange interrupt
handling to match it. This will slow down interrupt handling very
slightly (eight instructions), but the simplification it enables
will be more than worth it.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/calling.h | 3 ---
arch/x86/entry/entry_64.S | 29 +++++++++--------------------
2 files changed, 9 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index f4e6308c4200..f5eda6ecbca3 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -135,9 +135,6 @@ For 32-bit we have the following conventions - kernel is built with
movq %rbp, 4*8+\offset(%rsp)
movq %rbx, 5*8+\offset(%rsp)
.endm
- .macro SAVE_EXTRA_REGS_RBP offset=0
- movq %rbp, 4*8+\offset(%rsp)
- .endm

.macro RESTORE_EXTRA_REGS offset=0
movq 0*8+\offset(%rsp), %r15
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 4ca5b782ed70..65029f48bcc4 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -502,21 +502,13 @@ END(irq_entries_start)
/* 0(%rsp): ~(interrupt number) */
.macro interrupt func
cld
- /*
- * Since nothing in interrupt handling code touches r12...r15 members
- * of "struct pt_regs", and since interrupts can nest, we can save
- * four stack slots and simultaneously provide
- * an unwind-friendly stack layout by saving "truncated" pt_regs
- * exactly up to rbp slot, without these members.
- */
- ALLOC_PT_GPREGS_ON_STACK -RBP
- SAVE_C_REGS -RBP
- /* this goes to 0(%rsp) for unwinder, not for saving the value: */
- SAVE_EXTRA_REGS_RBP -RBP
+ ALLOC_PT_GPREGS_ON_STACK
+ SAVE_C_REGS
+ SAVE_EXTRA_REGS

- leaq -RBP(%rsp), %rdi /* arg1 for \func (pointer to pt_regs) */
+ movq %rsp,%rdi /* arg1 for \func (pointer to pt_regs) */

- testb $3, CS-RBP(%rsp)
+ testb $3, CS(%rsp)
jz 1f
SWAPGS
1:
@@ -553,9 +545,7 @@ ret_from_intr:
decl PER_CPU_VAR(irq_count)

/* Restore saved previous stack */
- popq %rsi
- /* return code expects complete pt_regs - adjust rsp accordingly: */
- leaq -RBP(%rsi), %rsp
+ popq %rsp

testb $3, CS(%rsp)
jz retint_kernel
@@ -580,7 +570,7 @@ retint_swapgs: /* return to user-space */
TRACE_IRQS_IRETQ

SWAPGS
- jmp restore_c_regs_and_iret
+ jmp restore_regs_and_iret

/* Returning to kernel space */
retint_kernel:
@@ -604,6 +594,8 @@ retint_kernel:
* At this label, code paths which return to kernel and to user,
* which come from interrupts/exception and from syscalls, merge.
*/
+restore_regs_and_iret:
+ RESTORE_EXTRA_REGS
restore_c_regs_and_iret:
RESTORE_C_REGS
REMOVE_PT_GPREGS_FROM_STACK 8
@@ -674,12 +666,10 @@ retint_signal:
jz retint_swapgs
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
- SAVE_EXTRA_REGS
movq $-1, ORIG_RAX(%rsp)
xorl %esi, %esi /* oldset */
movq %rsp, %rdi /* &pt_regs */
call do_notify_resume
- RESTORE_EXTRA_REGS
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
GET_THREAD_INFO(%rcx)
@@ -1160,7 +1150,6 @@ END(error_entry)
*/
ENTRY(error_exit)
movl %ebx, %eax
- RESTORE_EXTRA_REGS
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
testl %eax, %eax
--
2.4.3

2015-07-03 19:46:51

by Andy Lutomirski

Subject: [PATCH v5 13/17] x86/asm/entry/64: Simplify irq stack pt_regs handling

There's no need for both rsi and rdi to point to the original stack.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 65029f48bcc4..83eb63d31da4 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -506,8 +506,6 @@ END(irq_entries_start)
SAVE_C_REGS
SAVE_EXTRA_REGS

- movq %rsp,%rdi /* arg1 for \func (pointer to pt_regs) */
-
testb $3, CS(%rsp)
jz 1f
SWAPGS
@@ -519,14 +517,14 @@ END(irq_entries_start)
* a little cheaper to use a separate counter in the PDA (short of
* moving irq_enter into assembly, which would be too much work)
*/
- movq %rsp, %rsi
+ movq %rsp, %rdi
incl PER_CPU_VAR(irq_count)
cmovzq PER_CPU_VAR(irq_stack_ptr), %rsp
- pushq %rsi
+ pushq %rdi
/* We entered an interrupt context - irqs are off: */
TRACE_IRQS_OFF

- call \func
+ call \func /* rdi points to pt_regs */
.endm

/*
--
2.4.3

2015-07-03 19:46:43

by Andy Lutomirski

Subject: [PATCH v5 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 64 +++++++++++-----------------------------
arch/x86/entry/entry_64_compat.S | 5 ++++
2 files changed, 23 insertions(+), 46 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 83eb63d31da4..168ee264c345 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -508,7 +508,16 @@ END(irq_entries_start)

testb $3, CS(%rsp)
jz 1f
+
+ /*
+ * IRQ from user mode. Switch to kernel gsbase and inform context
+ * tracking that we're in kernel mode.
+ */
SWAPGS
+#ifdef CONFIG_CONTEXT_TRACKING
+ call enter_from_user_mode
+#endif
+
1:
/*
* Save previous stack pointer, optionally switch to interrupt stack.
@@ -547,26 +556,13 @@ ret_from_intr:

testb $3, CS(%rsp)
jz retint_kernel
- /* Interrupt came from user space */
-GLOBAL(retint_user)
- GET_THREAD_INFO(%rcx)

- /* %rcx: thread info. Interrupts are off. */
-retint_with_reschedule:
- movl $_TIF_WORK_MASK, %edi
-retint_check:
+ /* Interrupt came from user space */
LOCKDEP_SYS_EXIT_IRQ
- movl TI_flags(%rcx), %edx
- andl %edi, %edx
- jnz retint_careful
-
-retint_swapgs: /* return to user-space */
- /*
- * The iretq could re-enable interrupts:
- */
- DISABLE_INTERRUPTS(CLBR_ANY)
+GLOBAL(retint_user)
+ mov %rsp,%rdi
+ call prepare_exit_to_usermode
TRACE_IRQS_IRETQ
-
SWAPGS
jmp restore_regs_and_iret

@@ -644,35 +640,6 @@ native_irq_return_ldt:
popq %rax
jmp native_irq_return_iret
#endif
-
- /* edi: workmask, edx: work */
-retint_careful:
- bt $TIF_NEED_RESCHED, %edx
- jnc retint_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq %rdi
- SCHEDULE_USER
- popq %rdi
- GET_THREAD_INFO(%rcx)
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp retint_check
-
-retint_signal:
- testl $_TIF_DO_NOTIFY_MASK, %edx
- jz retint_swapgs
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- movq $-1, ORIG_RAX(%rsp)
- xorl %esi, %esi /* oldset */
- movq %rsp, %rdi /* &pt_regs */
- call do_notify_resume
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- GET_THREAD_INFO(%rcx)
- jmp retint_with_reschedule
-
END(common_interrupt)

/*
@@ -1088,7 +1055,12 @@ ENTRY(error_entry)
SWAPGS

.Lerror_entry_from_usermode_after_swapgs:
+#ifdef CONFIG_CONTEXT_TRACKING
+ call enter_from_user_mode
+#endif
+
.Lerror_entry_done:
+
TRACE_IRQS_OFF
ret

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 204528cf4359..55fa85837da2 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -455,6 +455,11 @@ ia32_badarg:
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF

+ /* Now finish entering normal kernel mode. */
+#ifdef CONFIG_CONTEXT_TRACKING
+ call enter_from_user_mode
+#endif
+
/* And exit again. */
jmp retint_user

--
2.4.3

2015-07-03 19:46:10

by Andy Lutomirski

Subject: [PATCH v5 15/17] x86/entry: Remove exception_enter from most trap handlers

On 64-bit kernels, we don't need it any more: we handle context
tracking directly on entry from user mode and exit to user mode. On
32-bit kernels, we don't support context tracking at all, so these
hooks had no effect.

This doesn't change do_page_fault. Before we do that, we need to
make sure that there is no code that can page fault from kernel mode
with CONTEXT_USER. The 32-bit fast system call stack argument code
is the only offender I'm aware of right now.
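
For illustration only -- a sketch of the resulting pattern, with a
hypothetical handler name, not code from this patch -- an IST handler
now pairs the hooks like so:

	dotraplinkage void do_example_ist_trap(struct pt_regs *regs, long error_code)
	{
		/* From user mode this asserts CONTEXT_KERNEL (the entry
		 * asm already switched us); from kernel mode it calls
		 * rcu_nmi_enter() instead. */
		ist_enter(regs);

		/* ... handle the trap; we are atomic here unless
		 * ist_begin_non_atomic()/ist_end_non_atomic() is used ... */

		ist_exit(regs);
	}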

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/traps.h | 4 +-
arch/x86/kernel/cpu/mcheck/mce.c | 5 +--
arch/x86/kernel/cpu/mcheck/p5.c | 5 +--
arch/x86/kernel/cpu/mcheck/winchip.c | 4 +-
arch/x86/kernel/traps.c | 78 +++++++++---------------------------
5 files changed, 27 insertions(+), 69 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c5380bea2a36..c3496619740a 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -112,8 +112,8 @@ asmlinkage void smp_threshold_interrupt(void);
asmlinkage void smp_deferred_error_interrupt(void);
#endif

-extern enum ctx_state ist_enter(struct pt_regs *regs);
-extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
+extern void ist_enter(struct pt_regs *regs);
+extern void ist_exit(struct pt_regs *regs);
extern void ist_begin_non_atomic(struct pt_regs *regs);
extern void ist_end_non_atomic(void);

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index df919ff103c3..dc87973098dc 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1029,7 +1029,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
{
struct mca_config *cfg = &mca_cfg;
struct mce m, *final;
- enum ctx_state prev_state;
int i;
int worst = 0;
int severity;
@@ -1055,7 +1054,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
int flags = MF_ACTION_REQUIRED;
int lmce = 0;

- prev_state = ist_enter(regs);
+ ist_enter(regs);

this_cpu_inc(mce_exception_count);

@@ -1227,7 +1226,7 @@ out:
local_irq_disable();
ist_end_non_atomic();
done:
- ist_exit(regs, prev_state);
+ ist_exit(regs);
}
EXPORT_SYMBOL_GPL(do_machine_check);

diff --git a/arch/x86/kernel/cpu/mcheck/p5.c b/arch/x86/kernel/cpu/mcheck/p5.c
index 737b0ad4e61a..12402e10aeff 100644
--- a/arch/x86/kernel/cpu/mcheck/p5.c
+++ b/arch/x86/kernel/cpu/mcheck/p5.c
@@ -19,10 +19,9 @@ int mce_p5_enabled __read_mostly;
/* Machine check handler for Pentium class Intel CPUs: */
static void pentium_machine_check(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
u32 loaddr, hi, lotype;

- prev_state = ist_enter(regs);
+ ist_enter(regs);

rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);
@@ -39,7 +38,7 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code)

add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);

- ist_exit(regs, prev_state);
+ ist_exit(regs);
}

/* Set up machine check reporting for processors with Intel style MCE: */
diff --git a/arch/x86/kernel/cpu/mcheck/winchip.c b/arch/x86/kernel/cpu/mcheck/winchip.c
index 44f138296fbe..01dd8702880b 100644
--- a/arch/x86/kernel/cpu/mcheck/winchip.c
+++ b/arch/x86/kernel/cpu/mcheck/winchip.c
@@ -15,12 +15,12 @@
/* Machine check handler for WinChip C6: */
static void winchip_machine_check(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state = ist_enter(regs);
+ ist_enter(regs);

printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);

- ist_exit(regs, prev_state);
+ ist_exit(regs);
}

/* Set up machine check reporting on the Winchip C6 series */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 2a783c4fe0e9..8e65d8a9b8db 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -108,13 +108,10 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
preempt_count_dec();
}

-enum ctx_state ist_enter(struct pt_regs *regs)
+void ist_enter(struct pt_regs *regs)
{
- enum ctx_state prev_state;
-
if (user_mode(regs)) {
- /* Other than that, we're just an exception. */
- prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
} else {
/*
* We might have interrupted pretty much anything. In
@@ -123,32 +120,25 @@ enum ctx_state ist_enter(struct pt_regs *regs)
* but we need to notify RCU.
*/
rcu_nmi_enter();
- prev_state = CONTEXT_KERNEL; /* the value is irrelevant. */
}

/*
- * We are atomic because we're on the IST stack (or we're on x86_32,
- * in which case we still shouldn't schedule).
- *
- * This must be after exception_enter(), because exception_enter()
- * won't do anything if in_interrupt() returns true.
+ * We are atomic because we're on the IST stack; or we're on
+ * x86_32, in which case we still shouldn't schedule; or we're
+ * on x86_64 and entered from user mode, in which case we're
+ * still atomic unless ist_begin_non_atomic is called.
*/
preempt_count_add(HARDIRQ_OFFSET);

/* This code is a bit fragile. Test it. */
rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");
-
- return prev_state;
}

-void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
+void ist_exit(struct pt_regs *regs)
{
- /* Must be before exception_exit. */
preempt_count_sub(HARDIRQ_OFFSET);

- if (user_mode(regs))
- return exception_exit(prev_state);
- else
+ if (!user_mode(regs))
rcu_nmi_exit();
}

@@ -162,7 +152,7 @@ void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
* a double fault, it can be safe to schedule. ist_begin_non_atomic()
* begins a non-atomic section within an ist_enter()/ist_exit() region.
* Callers are responsible for enabling interrupts themselves inside
- * the non-atomic section, and callers must call is_end_non_atomic()
+ * the non-atomic section, and callers must call ist_end_non_atomic()
* before ist_exit().
*/
void ist_begin_non_atomic(struct pt_regs *regs)
@@ -289,7 +279,6 @@ NOKPROBE_SYMBOL(do_trap);
static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
unsigned long trapnr, int signr)
{
- enum ctx_state prev_state = exception_enter();
siginfo_t info;

CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
@@ -300,8 +289,6 @@ static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
do_trap(trapnr, signr, str, regs, error_code,
fill_trap_info(regs, signr, trapnr, &info));
}
-
- exception_exit(prev_state);
}

#define DO_ERROR(trapnr, signr, str, name) \
@@ -353,7 +340,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
}
#endif

- ist_enter(regs); /* Discard prev_state because we won't return. */
+ ist_enter(regs);
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

tsk->thread.error_code = error_code;
@@ -373,15 +360,13 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)

dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
const struct bndcsr *bndcsr;
siginfo_t *info;

- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
if (notify_die(DIE_TRAP, "bounds", regs, error_code,
X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
- goto exit;
+ return;
conditional_sti(regs);

if (!user_mode(regs))
@@ -438,9 +423,8 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
die("bounds", regs, error_code);
}

-exit:
- exception_exit(prev_state);
return;
+
exit_trap:
/*
* This path out is for all the cases where we could not
@@ -450,36 +434,33 @@ exit_trap:
* time..
*/
do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
- exception_exit(prev_state);
}

dotraplinkage void
do_general_protection(struct pt_regs *regs, long error_code)
{
struct task_struct *tsk;
- enum ctx_state prev_state;

- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
conditional_sti(regs);

if (v8086_mode(regs)) {
local_irq_enable();
handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
- goto exit;
+ return;
}

tsk = current;
if (!user_mode(regs)) {
if (fixup_exception(regs))
- goto exit;
+ return;

tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_GP;
if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
die("general protection fault", regs, error_code);
- goto exit;
+ return;
}

tsk->thread.error_code = error_code;
@@ -495,16 +476,12 @@ do_general_protection(struct pt_regs *regs, long error_code)
}

force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
-exit:
- exception_exit(prev_state);
}
NOKPROBE_SYMBOL(do_general_protection);

/* May run on IST stack. */
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
-
#ifdef CONFIG_DYNAMIC_FTRACE
/*
* ftrace must be first, everything else may cause a recursive crash.
@@ -517,7 +494,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
if (poke_int3_handler(regs))
return;

- prev_state = ist_enter(regs);
+ ist_enter(regs);
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
@@ -544,7 +521,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
preempt_conditional_cli(regs);
debug_stack_usage_dec();
exit:
- ist_exit(regs, prev_state);
+ ist_exit(regs);
}
NOKPROBE_SYMBOL(do_int3);

@@ -620,12 +597,11 @@ NOKPROBE_SYMBOL(fixup_bad_iret);
dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
{
struct task_struct *tsk = current;
- enum ctx_state prev_state;
int user_icebp = 0;
unsigned long dr6;
int si_code;

- prev_state = ist_enter(regs);
+ ist_enter(regs);

get_debugreg(dr6, 6);

@@ -700,7 +676,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
debug_stack_usage_dec();

exit:
- ist_exit(regs, prev_state);
+ ist_exit(regs);
}
NOKPROBE_SYMBOL(do_debug);

@@ -752,23 +728,15 @@ static void math_error(struct pt_regs *regs, int error_code, int trapnr)

dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
-
- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
math_error(regs, error_code, X86_TRAP_MF);
- exception_exit(prev_state);
}

dotraplinkage void
do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
-
- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
math_error(regs, error_code, X86_TRAP_XF);
- exception_exit(prev_state);
}

dotraplinkage void
@@ -780,9 +748,6 @@ do_spurious_interrupt_bug(struct pt_regs *regs, long error_code)
dotraplinkage void
do_device_not_available(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
-
- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
BUG_ON(use_eager_fpu());

@@ -794,7 +759,6 @@ do_device_not_available(struct pt_regs *regs, long error_code)

info.regs = regs;
math_emulate(&info);
- exception_exit(prev_state);
return;
}
#endif
@@ -802,7 +766,6 @@ do_device_not_available(struct pt_regs *regs, long error_code)
#ifdef CONFIG_X86_32
conditional_sti(regs);
#endif
- exception_exit(prev_state);
}
NOKPROBE_SYMBOL(do_device_not_available);

@@ -810,9 +773,7 @@ NOKPROBE_SYMBOL(do_device_not_available);
dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
{
siginfo_t info;
- enum ctx_state prev_state;

- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
local_irq_enable();

@@ -825,7 +786,6 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
do_trap(X86_TRAP_IRET, SIGILL, "iret exception", regs, error_code,
&info);
}
- exception_exit(prev_state);
}
#endif

--
2.4.3

2015-07-03 19:46:05

by Andy Lutomirski

Subject: [PATCH v5 16/17] x86/entry: Remove SCHEDULE_USER and asm/context_tracking.h

SCHEDULE_USER is no longer used, and asm/context_tracking.h
contained nothing else. Remove the header entirely.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 1 -
arch/x86/include/asm/context_tracking.h | 10 ----------
2 files changed, 11 deletions(-)
delete mode 100644 arch/x86/include/asm/context_tracking.h

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 168ee264c345..041a37a643e1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -33,7 +33,6 @@
#include <asm/paravirt.h>
#include <asm/percpu.h>
#include <asm/asm.h>
-#include <asm/context_tracking.h>
#include <asm/smap.h>
#include <asm/pgtable_types.h>
#include <linux/err.h>
diff --git a/arch/x86/include/asm/context_tracking.h b/arch/x86/include/asm/context_tracking.h
deleted file mode 100644
index 1fe49704b146..000000000000
--- a/arch/x86/include/asm/context_tracking.h
+++ /dev/null
@@ -1,10 +0,0 @@
-#ifndef _ASM_X86_CONTEXT_TRACKING_H
-#define _ASM_X86_CONTEXT_TRACKING_H
-
-#ifdef CONFIG_CONTEXT_TRACKING
-# define SCHEDULE_USER call schedule_user
-#else
-# define SCHEDULE_USER call schedule
-#endif
-
-#endif
--
2.4.3

2015-07-03 19:45:56

by Andy Lutomirski

Subject: [PATCH v5 17/17] x86/irq: Document how IRQ context tracking works and add an assertion

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/irq.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 88b366487b0e..6233de046c08 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -216,8 +216,23 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
unsigned vector = ~regs->orig_ax;
unsigned irq;

+ /*
+ * NB: Unlike exception entries, IRQ entries do not reliably
+ * handle context tracking in the low-level entry code. This is
+ * because syscall entries execute briefly with IRQs on before
+ * updating context tracking state, so we can take an IRQ from
+ * kernel mode with CONTEXT_USER. The low-level entry code only
+ * updates the context if we came from user mode, so we won't
+ * switch to CONTEXT_KERNEL. We'll fix that once the syscall
+ * code is cleaned up enough that we can cleanly defer enabling
+ * IRQs.
+ */
+
entering_irq();

+ /* entering_irq() tells RCU that we're not quiescent. Check it. */
+ rcu_lockdep_assert(rcu_is_watching(), "IRQ failed to wake up RCU");
+
irq = __this_cpu_read(vector_irq[vector]);

if (!handle_irq(irq, regs)) {
--
2.4.3

Subject: [tip:x86/asm] x86/entry, selftests/x86: Add a test for 32-bit fast syscall arg faults

Commit-ID: 5e5c684a2c78b98dcba3d6fce56773a375f63980
Gitweb: http://git.kernel.org/tip/5e5c684a2c78b98dcba3d6fce56773a375f63980
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:18 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:58:30 +0200

x86/entry, selftests/x86: Add a test for 32-bit fast syscall arg faults

This test passes on 4.0 and fails on some newer kernels.
Fortunately, the failure is likely not a big deal.

This test will make sure that we don't break it further (e.g. by
OOPSing) as we clean up the entry code, and that we eventually fix the
regression.

There's arguably no need to preserve the old ABI here --
anything that makes it into a fast (vDSO) syscall with a bad
stack is about to crash no matter what we do.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/9cfcc51005168cb1b06b31991931214d770fc59a.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/syscall_arg_fault.c | 130 ++++++++++++++++++++++++
2 files changed, 131 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index caa60d5..e8df47e 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -5,7 +5,7 @@ include ../lib.mk
.PHONY: all all_32 all_64 warn_32bit_failure clean

TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs
-TARGETS_C_32BIT_ONLY := entry_from_vm86
+TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault

TARGETS_C_32BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_32BIT_ONLY)
BINARIES_32 := $(TARGETS_C_32BIT_ALL:%=%_32)
diff --git a/tools/testing/selftests/x86/syscall_arg_fault.c b/tools/testing/selftests/x86/syscall_arg_fault.c
new file mode 100644
index 0000000..7db4fc9
--- /dev/null
+++ b/tools/testing/selftests/x86/syscall_arg_fault.c
@@ -0,0 +1,130 @@
+/*
+ * syscall_arg_fault.c - tests faults in 32-bit fast syscall stack args
+ * Copyright (c) 2015 Andrew Lutomirski
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/signal.h>
+#include <sys/ucontext.h>
+#include <err.h>
+#include <setjmp.h>
+#include <errno.h>
+
+/* Our sigaltstack scratch space. */
+static unsigned char altstack_data[SIGSTKSZ];
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+ int flags)
+{
+ struct sigaction sa;
+ memset(&sa, 0, sizeof(sa));
+ sa.sa_sigaction = handler;
+ sa.sa_flags = SA_SIGINFO | flags;
+ sigemptyset(&sa.sa_mask);
+ if (sigaction(sig, &sa, 0))
+ err(1, "sigaction");
+}
+
+static volatile sig_atomic_t sig_traps;
+static sigjmp_buf jmpbuf;
+
+static volatile sig_atomic_t n_errs;
+
+static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
+{
+ ucontext_t *ctx = (ucontext_t*)ctx_void;
+
+ if (ctx->uc_mcontext.gregs[REG_EAX] != -EFAULT) {
+ printf("[FAIL]\tAX had the wrong value: 0x%x\n",
+ ctx->uc_mcontext.gregs[REG_EAX]);
+ n_errs++;
+ } else {
+ printf("[OK]\tSeems okay\n");
+ }
+
+ siglongjmp(jmpbuf, 1);
+}
+
+static void sigill(int sig, siginfo_t *info, void *ctx_void)
+{
+ printf("[SKIP]\tIllegal instruction\n");
+ siglongjmp(jmpbuf, 1);
+}
+
+int main()
+{
+ stack_t stack = {
+ .ss_sp = altstack_data,
+ .ss_size = SIGSTKSZ,
+ };
+ if (sigaltstack(&stack, NULL) != 0)
+ err(1, "sigaltstack");
+
+ sethandler(SIGSEGV, sigsegv, SA_ONSTACK);
+ sethandler(SIGILL, sigill, SA_ONSTACK);
+
+ /*
+ * Exercise another nasty special case. The 32-bit SYSCALL
+ * and SYSENTER instructions (even in compat mode) each
+ * clobber one register. A Linux system call has a syscall
+ * number and six arguments, and the user stack pointer
+ * needs to live in some register on return. That means
+ * that we need eight registers, but SYSCALL and SYSENTER
+ * only preserve seven registers. As a result, one argument
+ * ends up on the stack. The stack is user memory, which
+ * means that the kernel can fail to read it.
+ *
+ * The 32-bit fast system calls don't have a defined ABI:
+ * we're supposed to invoke them through the vDSO. So we'll
+ * fudge it: we set all regs to invalid pointer values and
+ * invoke the entry instruction. The return will fail no
+ * matter what, and we completely lose our program state,
+ * but we can fix it up with a signal handler.
+ */
+
+ printf("[RUN]\tSYSENTER with invalid state\n");
+ if (sigsetjmp(jmpbuf, 1) == 0) {
+ asm volatile (
+ "movl $-1, %%eax\n\t"
+ "movl $-1, %%ebx\n\t"
+ "movl $-1, %%ecx\n\t"
+ "movl $-1, %%edx\n\t"
+ "movl $-1, %%esi\n\t"
+ "movl $-1, %%edi\n\t"
+ "movl $-1, %%ebp\n\t"
+ "movl $-1, %%esp\n\t"
+ "sysenter"
+ : : : "memory", "flags");
+ }
+
+ printf("[RUN]\tSYSCALL with invalid state\n");
+ if (sigsetjmp(jmpbuf, 1) == 0) {
+ asm volatile (
+ "movl $-1, %%eax\n\t"
+ "movl $-1, %%ebx\n\t"
+ "movl $-1, %%ecx\n\t"
+ "movl $-1, %%edx\n\t"
+ "movl $-1, %%esi\n\t"
+ "movl $-1, %%edi\n\t"
+ "movl $-1, %%ebp\n\t"
+ "movl $-1, %%esp\n\t"
+ "syscall\n\t"
+ "pushl $0" /* make sure we segfault cleanly */
+ : : : "memory", "flags");
+ }
+
+ return 0;
+}

Subject: [tip:x86/asm] x86/entry/64/compat: Fix bad fast syscall arg failure path

Commit-ID: 5e99cb7c35ca0580da8e892f91c655d35ecf8798
Gitweb: http://git.kernel.org/tip/5e99cb7c35ca0580da8e892f91c655d35ecf8798
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:19 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:58:30 +0200

x86/entry/64/compat: Fix bad fast syscall arg failure path

If user code does SYSCALL32 or SYSENTER without a valid stack,
then our attempt to determine the syscall args will result in a
failed uaccess fault. Previously, we would try to recover by
jumping to the syscall exit code, but we'd run the syscall exit
work even though we never made it to the syscall entry work.

Clean it up by treating the failure path as a non-syscall entry
and exit pair.

This fixes strace's output when running the syscall_arg_fault
test. Without this fix, strace would get out of sync and would
fail to associate syscall entries with syscall exits.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/903010762c07a3d67df914fea2da84b52b0f8f1d.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/entry_64.S | 2 +-
arch/x86/entry/entry_64_compat.S | 35 +++++++++++++++++++++++++++++++++--
2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3bb2c43..141a5d4 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -613,7 +613,7 @@ ret_from_intr:
testb $3, CS(%rsp)
jz retint_kernel
/* Interrupt came from user space */
-retint_user:
+GLOBAL(retint_user)
GET_THREAD_INFO(%rcx)

/* %rcx: thread info. Interrupts are off. */
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index b868cfc..e5ebdd9 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -428,8 +428,39 @@ cstar_tracesys:
END(entry_SYSCALL_compat)

ia32_badarg:
- ASM_CLAC
- movq $-EFAULT, RAX(%rsp)
+ /*
+ * So far, we've entered kernel mode, set AC, turned on IRQs, and
+ * saved C regs except r8-r11. We haven't done any of the other
+ * standard entry work, though. We want to bail, but we shouldn't
+ * treat this as a syscall entry since we don't even know what the
+ * args are. Instead, treat this as a non-syscall entry, finish
+ * the entry work, and immediately exit after setting AX = -EFAULT.
+ *
+ * We're really just being polite here. Killing the task outright
+ * would be a reasonable action, too. Given that the only valid
+ * way to have gotten here is through the vDSO, and we already know
+ * that the stack pointer is bad, the task isn't going to survive
+ * for long no matter what we do.
+ */
+
+ ASM_CLAC /* undo STAC */
+ movq $-EFAULT, RAX(%rsp) /* return -EFAULT if possible */
+
+ /* Fill in the rest of pt_regs */
+ xorl %eax, %eax
+ movq %rax, R11(%rsp)
+ movq %rax, R10(%rsp)
+ movq %rax, R9(%rsp)
+ movq %rax, R8(%rsp)
+ SAVE_EXTRA_REGS
+
+ /* Turn IRQs back off. */
+ DISABLE_INTERRUPTS(CLBR_NONE)
+ TRACE_IRQS_OFF
+
+ /* And exit again. */
+ jmp retint_user
+
ia32_ret_from_sys_call:
xorl %eax, %eax /* Do not leak kernel information */
movq %rax, R11(%rsp)

Subject: [tip:x86/asm] um: Fix do_signal() prototype

Commit-ID: ccaee5f851470dec6894a6835b6fadffc2bb7514
Gitweb: http://git.kernel.org/tip/ccaee5f851470dec6894a6835b6fadffc2bb7514
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:20 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:58:54 +0200

um: Fix do_signal() prototype

Once x86 exports its do_signal(), the prototypes will clash.

Fix the clash and also improve the code a bit: remove the
unnecessary kern_do_signal() indirection. This allows
interrupt_end() to share the 'regs' parameter calculation.

Also remove the unused return code to match x86.

Minimally build and boot tested.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/67c57eac09a589bac3c6c5ff22f9623ec55a184a.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/um/include/shared/kern_util.h | 3 ++-
arch/um/kernel/process.c | 6 ++++--
arch/um/kernel/signal.c | 8 +-------
arch/um/kernel/tlb.c | 2 +-
arch/um/kernel/trap.c | 2 +-
5 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index 83a91f9..35ab97e 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -22,7 +22,8 @@ extern int kmalloc_ok;
extern unsigned long alloc_stack(int order, int atomic);
extern void free_stack(unsigned long stack, int order);

-extern int do_signal(void);
+struct pt_regs;
+extern void do_signal(struct pt_regs *regs);
extern void interrupt_end(void);
extern void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs);

diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 68b9119..a6d9226 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -90,12 +90,14 @@ void *__switch_to(struct task_struct *from, struct task_struct *to)

void interrupt_end(void)
{
+ struct pt_regs *regs = &current->thread.regs;
+
if (need_resched())
schedule();
if (test_thread_flag(TIF_SIGPENDING))
- do_signal();
+ do_signal(regs);
if (test_and_clear_thread_flag(TIF_NOTIFY_RESUME))
- tracehook_notify_resume(&current->thread.regs);
+ tracehook_notify_resume(regs);
}

void exit_thread(void)
diff --git a/arch/um/kernel/signal.c b/arch/um/kernel/signal.c
index 4f60e4a..57acbd6 100644
--- a/arch/um/kernel/signal.c
+++ b/arch/um/kernel/signal.c
@@ -64,7 +64,7 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
signal_setup_done(err, ksig, singlestep);
}

-static int kern_do_signal(struct pt_regs *regs)
+void do_signal(struct pt_regs *regs)
{
struct ksignal ksig;
int handled_sig = 0;
@@ -110,10 +110,4 @@ static int kern_do_signal(struct pt_regs *regs)
*/
if (!handled_sig)
restore_saved_sigmask();
- return handled_sig;
-}
-
-int do_signal(void)
-{
- return kern_do_signal(&current->thread.regs);
}
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index f1b3eb1..2077248 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -291,7 +291,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
/* We are under mmap_sem, release it such that current can terminate */
up_write(&current->mm->mmap_sem);
force_sig(SIGKILL, current);
- do_signal();
+ do_signal(&current->thread.regs);
}
}

diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 557232f..d8a9fce 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -173,7 +173,7 @@ static void bad_segv(struct faultinfo fi, unsigned long ip)
void fatal_sigsegv(void)
{
force_sigsegv(SIGSEGV, current);
- do_signal();
+ do_signal(&current->thread.regs);
/*
* This is to tell gcc that we're not returning - do_signal
* can, in general, return, but in this case, it's not, since

Subject: [tip:x86/asm] context_tracking: Add ct_state() and CT_WARN_ON()

Commit-ID: f9281648ecd5081803bb2da84b9ccb0cf48436cd
Gitweb: http://git.kernel.org/tip/f9281648ecd5081803bb2da84b9ccb0cf48436cd
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:21 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:04 +0200

context_tracking: Add ct_state() and CT_WARN_ON()

This will let us sprinkle sanity checks around the kernel
without making too much of a mess.
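
For example, the sanity checks added later in this series (e.g. in
traps.c) take this form; the snippet below is illustrative usage, not
part of this patch:

	/* Warn if the entry code failed to switch us to kernel context: */
	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);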

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/5da41fb2ceb29eac671f427c67040401ba2a1fa0.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/context_tracking.h | 15 +++++++++++++++
include/linux/context_tracking_state.h | 1 +
2 files changed, 16 insertions(+)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd29..008fc67 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -49,13 +49,28 @@ static inline void exception_exit(enum ctx_state prev_ctx)
}
}

+
+/**
+ * ct_state() - return the current context tracking state if known
+ *
+ * Returns the current cpu's context tracking state if context tracking
+ * is enabled. If context tracking is disabled, returns
+ * CONTEXT_DISABLED. This should be used primarily for debugging.
+ */
+static inline enum ctx_state ct_state(void)
+{
+ return context_tracking_is_enabled() ?
+ this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
+}
#else
static inline void user_enter(void) { }
static inline void user_exit(void) { }
static inline enum ctx_state exception_enter(void) { return 0; }
static inline void exception_exit(enum ctx_state prev_ctx) { }
+static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
#endif /* !CONFIG_CONTEXT_TRACKING */

+#define CT_WARN_ON(cond) WARN_ON(context_tracking_is_enabled() && (cond))

#ifdef CONFIG_CONTEXT_TRACKING_FORCE
extern void context_tracking_init(void);
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 678ecdf..ee956c5 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -14,6 +14,7 @@ struct context_tracking {
bool active;
int recursion;
enum ctx_state {
+ CONTEXT_DISABLED = -1, /* returned by ct_state() if unknown */
CONTEXT_KERNEL = 0,
CONTEXT_USER,
CONTEXT_GUEST,

Subject: [tip:x86/asm] notifiers, RCU: Assert that RCU is watching in notify_die()

Commit-ID: e727c7d7a11e109849582e9165d54b254eb181d7
Gitweb: http://git.kernel.org/tip/e727c7d7a11e109849582e9165d54b254eb181d7
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:22 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:04 +0200

notifiers, RCU: Assert that RCU is watching in notify_die()

Low-level arch entries often call notify_die(), and it's easy for
arch code to fail to exit an RCU quiescent state first. Assert
that we're not quiescent in notify_die().
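
For illustration, a sketch of the kind of bug this assertion catches
(the handler below is hypothetical, not part of this patch):

	/* BAD: calls notify_die() without exception_enter()/ist_enter().
	 * If we interrupted an RCU-idle context, the new assertion in
	 * notify_die() warns instead of silently letting the notifier
	 * chain touch RCU-protected data. */
	dotraplinkage void do_buggy_trap(struct pt_regs *regs, long error_code)
	{
		notify_die(DIE_TRAP, "buggy trap", regs, error_code,
			   X86_TRAP_UD, SIGILL);
	}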

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/1f5fe6c23d5b432a23267102f2d72b787d80fdd8.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/notifier.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/notifier.c b/kernel/notifier.c
index ae9fc7c..980e433 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -544,6 +544,8 @@ int notrace notify_die(enum die_val val, const char *str,
.signr = sig,

};
+ rcu_lockdep_assert(rcu_is_watching(),
+ "notify_die called but RCU thinks we're quiescent");
return atomic_notifier_call_chain(&die_chain, val, &args);
}
NOKPROBE_SYMBOL(notify_die);

Subject: [tip:x86/asm] x86/entry: Move C entry and exit code to arch/x86/entry/common.c

Commit-ID: 1f484aa6904697f390027c12fba130fa94b20831
Gitweb: http://git.kernel.org/tip/1f484aa6904697f390027c12fba130fa94b20831
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:23 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:05 +0200

x86/entry: Move C entry and exit code to arch/x86/entry/common.c

The entry and exit C helpers were confusingly scattered between
ptrace.c and signal.c, even though they aren't specific to
ptrace or signal handling. Move them together in a new file.

This change just moves code around; it doesn't change any behavior.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/324d686821266544d8572423cc281f961da445f4.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/Makefile | 1 +
arch/x86/entry/common.c | 253 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/signal.h | 1 +
arch/x86/kernel/ptrace.c | 202 +--------------------------------
arch/x86/kernel/signal.c | 28 +----
5 files changed, 257 insertions(+), 228 deletions(-)

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 7a14497..bd55ded 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -2,6 +2,7 @@
# Makefile for the x86 low level entry code
#
obj-y := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
+obj-y += common.o

obj-y += vdso/
obj-y += vsyscall/
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
new file mode 100644
index 0000000..917d0c3
--- /dev/null
+++ b/arch/x86/entry/common.c
@@ -0,0 +1,253 @@
+/*
+ * common.c - C code for kernel entry and exit
+ * Copyright (c) 2015 Andrew Lutomirski
+ * GPL v2
+ *
+ * Based on asm and ptrace code by many authors. The code here originated
+ * in ptrace.c and signal.c.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/errno.h>
+#include <linux/ptrace.h>
+#include <linux/tracehook.h>
+#include <linux/audit.h>
+#include <linux/seccomp.h>
+#include <linux/signal.h>
+#include <linux/export.h>
+#include <linux/context_tracking.h>
+#include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>
+
+#include <asm/desc.h>
+#include <asm/traps.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/syscalls.h>
+
+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+#ifdef CONFIG_X86_64
+ if (arch == AUDIT_ARCH_X86_64) {
+ audit_syscall_entry(regs->orig_ax, regs->di,
+ regs->si, regs->dx, regs->r10);
+ } else
+#endif
+ {
+ audit_syscall_entry(regs->orig_ax, regs->bx,
+ regs->cx, regs->dx, regs->si);
+ }
+}
+
+/*
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2. If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0: resume the syscall
+ * 1: go to phase 2; no seccomp phase 2 needed
+ * anything else: go to phase 2; pass return value to seccomp
+ */
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
+{
+ unsigned long ret = 0;
+ u32 work;
+
+ BUG_ON(regs != task_pt_regs(current));
+
+ work = ACCESS_ONCE(current_thread_info()->flags) &
+ _TIF_WORK_SYSCALL_ENTRY;
+
+ /*
+ * If TIF_NOHZ is set, we are required to call user_exit() before
+ * doing anything that could touch RCU.
+ */
+ if (work & _TIF_NOHZ) {
+ user_exit();
+ work &= ~_TIF_NOHZ;
+ }
+
+#ifdef CONFIG_SECCOMP
+ /*
+ * Do seccomp first -- it should minimize exposure of other
+ * code, and keeping seccomp fast is probably more valuable
+ * than the rest of this.
+ */
+ if (work & _TIF_SECCOMP) {
+ struct seccomp_data sd;
+
+ sd.arch = arch;
+ sd.nr = regs->orig_ax;
+ sd.instruction_pointer = regs->ip;
+#ifdef CONFIG_X86_64
+ if (arch == AUDIT_ARCH_X86_64) {
+ sd.args[0] = regs->di;
+ sd.args[1] = regs->si;
+ sd.args[2] = regs->dx;
+ sd.args[3] = regs->r10;
+ sd.args[4] = regs->r8;
+ sd.args[5] = regs->r9;
+ } else
+#endif
+ {
+ sd.args[0] = regs->bx;
+ sd.args[1] = regs->cx;
+ sd.args[2] = regs->dx;
+ sd.args[3] = regs->si;
+ sd.args[4] = regs->di;
+ sd.args[5] = regs->bp;
+ }
+
+ BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+ BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+ ret = seccomp_phase1(&sd);
+ if (ret == SECCOMP_PHASE1_SKIP) {
+ regs->orig_ax = -1;
+ ret = 0;
+ } else if (ret != SECCOMP_PHASE1_OK) {
+ return ret; /* Go directly to phase 2 */
+ }
+
+ work &= ~_TIF_SECCOMP;
+ }
+#endif
+
+ /* Do our best to finish without phase 2. */
+ if (work == 0)
+ return ret; /* seccomp and/or nohz only (ret == 0 here) */
+
+#ifdef CONFIG_AUDITSYSCALL
+ if (work == _TIF_SYSCALL_AUDIT) {
+ /*
+ * If there is no more work to be done except auditing,
+ * then audit in phase 1. Phase 2 always audits, so, if
+ * we audit here, then we can't go on to phase 2.
+ */
+ do_audit_syscall_entry(regs, arch);
+ return 0;
+ }
+#endif
+
+ return 1; /* Something is enabled that we can't handle in phase 1 */
+}
+
+/* Returns the syscall nr to run (which should match regs->orig_ax). */
+long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
+ unsigned long phase1_result)
+{
+ long ret = 0;
+ u32 work = ACCESS_ONCE(current_thread_info()->flags) &
+ _TIF_WORK_SYSCALL_ENTRY;
+
+ BUG_ON(regs != task_pt_regs(current));
+
+ /*
+ * If we stepped into a sysenter/syscall insn, it trapped in
+ * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
+ * If user-mode had set TF itself, then it's still clear from
+ * do_debug() and we need to set it again to restore the user
+ * state. If we entered on the slow path, TF was already set.
+ */
+ if (work & _TIF_SINGLESTEP)
+ regs->flags |= X86_EFLAGS_TF;
+
+#ifdef CONFIG_SECCOMP
+ /*
+ * Call seccomp_phase2 before running the other hooks so that
+ * they can see any changes made by a seccomp tracer.
+ */
+ if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
+ /* seccomp failures shouldn't expose any additional code. */
+ return -1;
+ }
+#endif
+
+ if (unlikely(work & _TIF_SYSCALL_EMU))
+ ret = -1L;
+
+ if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
+ tracehook_report_syscall_entry(regs))
+ ret = -1L;
+
+ if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+ trace_sys_enter(regs, regs->orig_ax);
+
+ do_audit_syscall_entry(regs, arch);
+
+ return ret ?: regs->orig_ax;
+}
+
+long syscall_trace_enter(struct pt_regs *regs)
+{
+ u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
+ unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
+
+ if (phase1_result == 0)
+ return regs->orig_ax;
+ else
+ return syscall_trace_enter_phase2(regs, arch, phase1_result);
+}
+
+void syscall_trace_leave(struct pt_regs *regs)
+{
+ bool step;
+
+ /*
+ * We may come here right after calling schedule_user()
+ * or do_notify_resume(), in which case we can be in RCU
+ * user mode.
+ */
+ user_exit();
+
+ audit_syscall_exit(regs);
+
+ if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+ trace_sys_exit(regs, regs->ax);
+
+ /*
+ * If TIF_SYSCALL_EMU is set, we only get here because of
+ * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
+ * We already reported this syscall instruction in
+ * syscall_trace_enter().
+ */
+ step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
+ !test_thread_flag(TIF_SYSCALL_EMU);
+ if (step || test_thread_flag(TIF_SYSCALL_TRACE))
+ tracehook_report_syscall_exit(regs, step);
+
+ user_enter();
+}
+
+/*
+ * notification of userspace execution resumption
+ * - triggered by the TIF_WORK_MASK flags
+ */
+__visible void
+do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
+{
+ user_exit();
+
+ if (thread_info_flags & _TIF_UPROBE)
+ uprobe_notify_resume(regs);
+
+ /* deal with pending signal delivery */
+ if (thread_info_flags & _TIF_SIGPENDING)
+ do_signal(regs);
+
+ if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ }
+ if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
+ fire_user_return_notifiers();
+
+ user_enter();
+}
diff --git a/arch/x86/include/asm/signal.h b/arch/x86/include/asm/signal.h
index 31eab86..b42408b 100644
--- a/arch/x86/include/asm/signal.h
+++ b/arch/x86/include/asm/signal.h
@@ -30,6 +30,7 @@ typedef sigset_t compat_sigset_t;
#endif /* __ASSEMBLY__ */
#include <uapi/asm/signal.h>
#ifndef __ASSEMBLY__
+extern void do_signal(struct pt_regs *regs);
extern void do_notify_resume(struct pt_regs *, void *, __u32);

#define __ARCH_HAS_SA_RESTORER
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 7155957..558f50e 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -37,12 +37,10 @@
#include <asm/proto.h>
#include <asm/hw_breakpoint.h>
#include <asm/traps.h>
+#include <asm/syscall.h>

#include "tls.h"

-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
enum x86_regset {
REGSET_GENERAL,
REGSET_FP,
@@ -1444,201 +1442,3 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
/* Send us the fake SIGTRAP */
force_sig_info(SIGTRAP, &info, tsk);
}
-
-static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
-{
-#ifdef CONFIG_X86_64
- if (arch == AUDIT_ARCH_X86_64) {
- audit_syscall_entry(regs->orig_ax, regs->di,
- regs->si, regs->dx, regs->r10);
- } else
-#endif
- {
- audit_syscall_entry(regs->orig_ax, regs->bx,
- regs->cx, regs->dx, regs->si);
- }
-}
-
-/*
- * We can return 0 to resume the syscall or anything else to go to phase
- * 2. If we resume the syscall, we need to put something appropriate in
- * regs->orig_ax.
- *
- * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
- * are fully functional.
- *
- * For phase 2's benefit, our return value is:
- * 0: resume the syscall
- * 1: go to phase 2; no seccomp phase 2 needed
- * anything else: go to phase 2; pass return value to seccomp
- */
-unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
-{
- unsigned long ret = 0;
- u32 work;
-
- BUG_ON(regs != task_pt_regs(current));
-
- work = ACCESS_ONCE(current_thread_info()->flags) &
- _TIF_WORK_SYSCALL_ENTRY;
-
- /*
- * If TIF_NOHZ is set, we are required to call user_exit() before
- * doing anything that could touch RCU.
- */
- if (work & _TIF_NOHZ) {
- user_exit();
- work &= ~_TIF_NOHZ;
- }
-
-#ifdef CONFIG_SECCOMP
- /*
- * Do seccomp first -- it should minimize exposure of other
- * code, and keeping seccomp fast is probably more valuable
- * than the rest of this.
- */
- if (work & _TIF_SECCOMP) {
- struct seccomp_data sd;
-
- sd.arch = arch;
- sd.nr = regs->orig_ax;
- sd.instruction_pointer = regs->ip;
-#ifdef CONFIG_X86_64
- if (arch == AUDIT_ARCH_X86_64) {
- sd.args[0] = regs->di;
- sd.args[1] = regs->si;
- sd.args[2] = regs->dx;
- sd.args[3] = regs->r10;
- sd.args[4] = regs->r8;
- sd.args[5] = regs->r9;
- } else
-#endif
- {
- sd.args[0] = regs->bx;
- sd.args[1] = regs->cx;
- sd.args[2] = regs->dx;
- sd.args[3] = regs->si;
- sd.args[4] = regs->di;
- sd.args[5] = regs->bp;
- }
-
- BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
- BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
-
- ret = seccomp_phase1(&sd);
- if (ret == SECCOMP_PHASE1_SKIP) {
- regs->orig_ax = -1;
- ret = 0;
- } else if (ret != SECCOMP_PHASE1_OK) {
- return ret; /* Go directly to phase 2 */
- }
-
- work &= ~_TIF_SECCOMP;
- }
-#endif
-
- /* Do our best to finish without phase 2. */
- if (work == 0)
- return ret; /* seccomp and/or nohz only (ret == 0 here) */
-
-#ifdef CONFIG_AUDITSYSCALL
- if (work == _TIF_SYSCALL_AUDIT) {
- /*
- * If there is no more work to be done except auditing,
- * then audit in phase 1. Phase 2 always audits, so, if
- * we audit here, then we can't go on to phase 2.
- */
- do_audit_syscall_entry(regs, arch);
- return 0;
- }
-#endif
-
- return 1; /* Something is enabled that we can't handle in phase 1 */
-}
-
-/* Returns the syscall nr to run (which should match regs->orig_ax). */
-long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
- unsigned long phase1_result)
-{
- long ret = 0;
- u32 work = ACCESS_ONCE(current_thread_info()->flags) &
- _TIF_WORK_SYSCALL_ENTRY;
-
- BUG_ON(regs != task_pt_regs(current));
-
- /*
- * If we stepped into a sysenter/syscall insn, it trapped in
- * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
- * If user-mode had set TF itself, then it's still clear from
- * do_debug() and we need to set it again to restore the user
- * state. If we entered on the slow path, TF was already set.
- */
- if (work & _TIF_SINGLESTEP)
- regs->flags |= X86_EFLAGS_TF;
-
-#ifdef CONFIG_SECCOMP
- /*
- * Call seccomp_phase2 before running the other hooks so that
- * they can see any changes made by a seccomp tracer.
- */
- if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
- /* seccomp failures shouldn't expose any additional code. */
- return -1;
- }
-#endif
-
- if (unlikely(work & _TIF_SYSCALL_EMU))
- ret = -1L;
-
- if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
- tracehook_report_syscall_entry(regs))
- ret = -1L;
-
- if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
- trace_sys_enter(regs, regs->orig_ax);
-
- do_audit_syscall_entry(regs, arch);
-
- return ret ?: regs->orig_ax;
-}
-
-long syscall_trace_enter(struct pt_regs *regs)
-{
- u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
- unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
-
- if (phase1_result == 0)
- return regs->orig_ax;
- else
- return syscall_trace_enter_phase2(regs, arch, phase1_result);
-}
-
-void syscall_trace_leave(struct pt_regs *regs)
-{
- bool step;
-
- /*
- * We may come here right after calling schedule_user()
- * or do_notify_resume(), in which case we can be in RCU
- * user mode.
- */
- user_exit();
-
- audit_syscall_exit(regs);
-
- if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
- trace_sys_exit(regs, regs->ax);
-
- /*
- * If TIF_SYSCALL_EMU is set, we only get here because of
- * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
- * We already reported this syscall instruction in
- * syscall_trace_enter().
- */
- step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
- !test_thread_flag(TIF_SYSCALL_EMU);
- if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
-
- user_enter();
-}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 6c22aad..7e88cc7 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -700,7 +700,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
* want to handle. Thus you cannot kill init even with a SIGKILL even by
* mistake.
*/
-static void do_signal(struct pt_regs *regs)
+void do_signal(struct pt_regs *regs)
{
struct ksignal ksig;

@@ -735,32 +735,6 @@ static void do_signal(struct pt_regs *regs)
restore_saved_sigmask();
}

-/*
- * notification of userspace execution resumption
- * - triggered by the TIF_WORK_MASK flags
- */
-__visible void
-do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
-{
- user_exit();
-
- if (thread_info_flags & _TIF_UPROBE)
- uprobe_notify_resume(regs);
-
- /* deal with pending signal delivery */
- if (thread_info_flags & _TIF_SIGPENDING)
- do_signal(regs);
-
- if (thread_info_flags & _TIF_NOTIFY_RESUME) {
- clear_thread_flag(TIF_NOTIFY_RESUME);
- tracehook_notify_resume(regs);
- }
- if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
- fire_user_return_notifiers();
-
- user_enter();
-}
-
void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
{
struct task_struct *me = current;

Subject: [tip:x86/asm] x86/traps, context_tracking: Assert that we're in CONTEXT_KERNEL in exception entries

Commit-ID: 02fdcd5eac9d653d1addbd69b0c58d73650e1c00
Gitweb: http://git.kernel.org/tip/02fdcd5eac9d653d1addbd69b0c58d73650e1c00
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:24 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:05 +0200

x86/traps, context_tracking: Assert that we're in CONTEXT_KERNEL in exception entries

Other than the super-atomic exception entries, all exception
entries are supposed to switch our context tracking state to
CONTEXT_KERNEL. Assert that they do. These assertions appear
trivial at this point, as exception_enter() is the function
responsible for switching context, but I'm planning on reworking
x86's exception context tracking, and these assertions will help
make sure that all of this code keeps working.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/20fa1ee2d943233a184aaf96ff75394d3b34dfba.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/traps.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index f579192..2a783c4 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,6 +292,8 @@ static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
enum ctx_state prev_state = exception_enter();
siginfo_t info;

+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
+
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
NOTIFY_STOP) {
conditional_sti(regs);
@@ -376,6 +378,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
siginfo_t *info;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
if (notify_die(DIE_TRAP, "bounds", regs, error_code,
X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
goto exit;
@@ -457,6 +460,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
conditional_sti(regs);

if (v8086_mode(regs)) {
@@ -514,6 +518,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
return;

prev_state = ist_enter(regs);
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
SIGTRAP) == NOTIFY_STOP)
@@ -750,6 +755,7 @@ dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
math_error(regs, error_code, X86_TRAP_MF);
exception_exit(prev_state);
}
@@ -760,6 +766,7 @@ do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
math_error(regs, error_code, X86_TRAP_XF);
exception_exit(prev_state);
}
@@ -776,6 +783,7 @@ do_device_not_available(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
BUG_ON(use_eager_fpu());

#ifdef CONFIG_MATH_EMULATION
@@ -805,6 +813,7 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
local_irq_enable();

info.si_signo = SIGILL;

Subject: [tip:x86/asm] x86/entry: Add enter_from_user_mode() and use it in syscalls

Commit-ID: feed36cde0a10adb957445a37e48f957f30b2273
Gitweb: http://git.kernel.org/tip/feed36cde0a10adb957445a37e48f957f30b2273
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:25 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:06 +0200

x86/entry: Add enter_from_user_mode() and use it in syscalls

Changing the x86 context tracking hooks is dangerous because
there are no good checks that we track our context correctly.
Add a helper to check that we're actually in CONTEXT_USER when
we enter from user mode and wire it up for syscall entries.

Subsequent patches will wire this up for all non-NMI entries as
well. NMIs are their own special beast and cannot currently
switch overall context tracking state. Instead, they have their
own special RCU hooks.
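
For contrast, the NMI-side pattern is roughly the following (a
minimal sketch, not code from this series; the real do_nmi() path
also handles nesting and the IST stack):

#include <linux/ptrace.h>
#include <linux/rcupdate.h>

/*
 * Sketch: NMIs bracket themselves with the RCU-only hooks instead
 * of switching the context tracking state.
 */
void nmi_handler_skeleton(struct pt_regs *regs)
{
	rcu_nmi_enter();	/* RCU is watching from here on */

	/* ... handle the NMI ... */

	rcu_nmi_exit();		/* restore RCU's prior state */
}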

This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
branch) and a tiny slowdown if CONFIG_CONTEXT_TRACKING (adds a
layer of indirection). Eventually, we should fix up the core
context tracking code to supply a function that does what we
want (and can be much simpler than user_exit), which will enable
us to get rid of the extra call.
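
To illustrate the kind of core helper that last paragraph envisions
(hypothetical; calling __context_tracking_exit() directly as the
leaner primitive is an assumption, not part of this series):

/*
 * Hypothetical sketch: a leaner entry-side helper. Because the
 * caller guarantees we came straight from CONTEXT_USER with IRQs
 * off, we can skip user_exit()'s extra bookkeeping and indirection.
 */
static inline void context_tracking_enter_kernel(void)
{
	CT_WARN_ON(ct_state() != CONTEXT_USER);
	__context_tracking_exit(CONTEXT_USER);	/* assumed core primitive */
}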

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/853b42420066ec3fb856779cdc223a6dcb5d355b.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/common.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 917d0c3..9a327ee 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -28,6 +28,15 @@
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>

+#ifdef CONFIG_CONTEXT_TRACKING
+/* Called on entry from user mode with IRQs off. */
+__visible void enter_from_user_mode(void)
+{
+ CT_WARN_ON(ct_state() != CONTEXT_USER);
+ user_exit();
+}
+#endif
+
static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
{
#ifdef CONFIG_X86_64
@@ -65,14 +74,16 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
work = ACCESS_ONCE(current_thread_info()->flags) &
_TIF_WORK_SYSCALL_ENTRY;

+#ifdef CONFIG_CONTEXT_TRACKING
/*
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
if (work & _TIF_NOHZ) {
- user_exit();
+ enter_from_user_mode();
work &= ~_TIF_NOHZ;
}
+#endif

#ifdef CONFIG_SECCOMP
/*

Subject: [tip:x86/asm] x86/entry: Add new, comprehensible entry and exit handlers written in C

Commit-ID: c5c46f59e4e7c1ab244b8d38f2b61d317df90bba
Gitweb: http://git.kernel.org/tip/c5c46f59e4e7c1ab244b8d38f2b61d317df90bba
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:26 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:06 +0200

x86/entry: Add new, comprehensible entry and exit handlers written in C

The current x86 entry and exit code, written in a mixture of assembly and
C code, is incomprehensible due to being open-coded in a lot of places
without coherent documentation.

It appears to work primarily by luck and duct tape: i.e. obvious runtime
failures were fixed on-demand, without re-thinking the design.

For those reasons our confidence in that code is low, and it is
very difficult to improve incrementally.

Add new code written in C, in preparation for simply deleting the old
entry code.

prepare_exit_to_usermode() is a new function that will handle all
slow path exits to user mode. It is called with IRQs disabled
and it leaves us in a state in which it is safe to immediately
return to user mode. IRQs must not be re-enabled at any point
after prepare_exit_to_usermode() returns and user mode is actually
entered. (We can, of course, fail to enter user mode and treat
that failure as a fresh entry to kernel mode.)
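
In C-ish pseudocode, the contract looks roughly like this (a sketch
only; the real call sites are in assembly, and
restore_regs_and_return_to_user() is a hypothetical stand-in for the
iret/sysret path):

	local_irq_disable();
	prepare_exit_to_usermode(regs);

	/*
	 * From here until the actual return instruction, IRQs must
	 * stay off; re-enabling them could make the work flags stale.
	 */
	restore_regs_and_return_to_user(regs);	/* hypothetical stub */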

All callers of do_notify_resume() will be migrated to call
prepare_exit_to_usermode() instead; prepare_exit_to_usermode() needs
to do everything that do_notify_resume() does today, but it also
takes care of scheduling and context tracking. Unlike
do_notify_resume(), it does not need to be called in a loop.

syscall_return_slowpath() is exactly what it sounds like: it will
be called on any syscall exit slow path. It will replace
syscall_trace_leave() and it calls prepare_exit_to_usermode() on the
way out.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/c57c8b87661a4152801d7d3786eac2d1a2f209dd.1435952415.git.luto@kernel.org
[ Improved the changelog a bit. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/common.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 111 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 9a327ee..febc530 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -207,6 +207,7 @@ long syscall_trace_enter(struct pt_regs *regs)
return syscall_trace_enter_phase2(regs, arch, phase1_result);
}

+/* Deprecated. */
void syscall_trace_leave(struct pt_regs *regs)
{
bool step;
@@ -237,8 +238,117 @@ void syscall_trace_leave(struct pt_regs *regs)
user_enter();
}

+static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
+{
+ unsigned long top_of_stack =
+ (unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
+ return (struct thread_info *)(top_of_stack - THREAD_SIZE);
+}
+
+/* Called with IRQs disabled. */
+__visible void prepare_exit_to_usermode(struct pt_regs *regs)
+{
+ if (WARN_ON(!irqs_disabled()))
+ local_irq_disable();
+
+ /*
+ * In order to return to user mode, we need to have IRQs off with
+ * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
+ * _TIF_UPROBE, or _TIF_NEED_RESCHED set. Several of these flags
+ * can be set at any time on preemptable kernels if we have IRQs on,
+ * so we need to loop. Disabling preemption wouldn't help: doing the
+ * work to clear some of the flags can sleep.
+ */
+ while (true) {
+ u32 cached_flags =
+ READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+
+ if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
+ _TIF_UPROBE | _TIF_NEED_RESCHED)))
+ break;
+
+ /* We have work to do. */
+ local_irq_enable();
+
+ if (cached_flags & _TIF_NEED_RESCHED)
+ schedule();
+
+ if (cached_flags & _TIF_UPROBE)
+ uprobe_notify_resume(regs);
+
+ /* deal with pending signal delivery */
+ if (cached_flags & _TIF_SIGPENDING)
+ do_signal(regs);
+
+ if (cached_flags & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ }
+
+ if (cached_flags & _TIF_USER_RETURN_NOTIFY)
+ fire_user_return_notifiers();
+
+ /* Disable IRQs and retry */
+ local_irq_disable();
+ }
+
+ user_enter();
+}
+
+/*
+ * Called with IRQs on and fully valid regs. Returns with IRQs off in a
+ * state such that we can immediately switch to user mode.
+ */
+__visible void syscall_return_slowpath(struct pt_regs *regs)
+{
+ struct thread_info *ti = pt_regs_to_thread_info(regs);
+ u32 cached_flags = READ_ONCE(ti->flags);
+ bool step;
+
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
+
+ if (WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
+ regs->orig_ax))
+ local_irq_enable();
+
+ /*
+ * First do one-time work. If these work items are enabled, we
+ * want to run them exactly once per syscall exit with IRQs on.
+ */
+ if (cached_flags & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |
+ _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)) {
+ audit_syscall_exit(regs);
+
+ if (cached_flags & _TIF_SYSCALL_TRACEPOINT)
+ trace_sys_exit(regs, regs->ax);
+
+ /*
+ * If TIF_SYSCALL_EMU is set, we only get here because of
+ * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
+ * We already reported this syscall instruction in
+ * syscall_trace_enter().
+ */
+ step = unlikely(
+ (cached_flags & (_TIF_SINGLESTEP | _TIF_SYSCALL_EMU))
+ == _TIF_SINGLESTEP);
+ if (step || cached_flags & _TIF_SYSCALL_TRACE)
+ tracehook_report_syscall_exit(regs, step);
+ }
+
+#ifdef CONFIG_COMPAT
+ /*
+ * Compat syscalls set TS_COMPAT. Make sure we clear it before
+ * returning to user mode.
+ */
+ ti->status &= ~TS_COMPAT;
+#endif
+
+ local_irq_disable();
+ prepare_exit_to_usermode(regs);
+}
+
/*
- * notification of userspace execution resumption
+ * Deprecated notification of userspace execution resumption
* - triggered by the TIF_WORK_MASK flags
*/
__visible void

Subject: [tip:x86/asm] x86/entry/64: Really create an error-entry-from-usermode code path

Commit-ID: cb6f64ed5a04036eef07e70b57dd5dd78f2fbcef
Gitweb: http://git.kernel.org/tip/cb6f64ed5a04036eef07e70b57dd5dd78f2fbcef
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:27 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:07 +0200

x86/entry/64: Really create an error-entry-from-usermode code path

In 539f51136500 ("x86/asm/entry/64: Disentangle error_entry/exit
gsbase/ebx/usermode code"), I arranged the code slightly wrong
-- IRET faults would skip the code path that was intended to
execute on all error entries from user mode. Fix it up.

While we're at it, make all the labels in error_entry local.

This does not fix a visible bug, but we'll need the separate code
path in upcoming patches, and it slightly shrinks the code.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/91e17891e49fa3d61357eadc451529ad48143ee1.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/entry_64.S | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 141a5d4..ccfcba9 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1143,12 +1143,17 @@ ENTRY(error_entry)
SAVE_EXTRA_REGS 8
xorl %ebx, %ebx
testb $3, CS+8(%rsp)
- jz error_kernelspace
+ jz .Lerror_kernelspace

- /* We entered from user mode */
+.Lerror_entry_from_usermode_swapgs:
+ /*
+ * We entered from user mode or we're pretending to have entered
+ * from user mode due to an IRET fault.
+ */
SWAPGS

-error_entry_done:
+.Lerror_entry_from_usermode_after_swapgs:
+.Lerror_entry_done:
TRACE_IRQS_OFF
ret

@@ -1158,31 +1163,30 @@ error_entry_done:
* truncated RIP for IRET exceptions returning to compat mode. Check
* for these here too.
*/
-error_kernelspace:
+.Lerror_kernelspace:
incl %ebx
leaq native_irq_return_iret(%rip), %rcx
cmpq %rcx, RIP+8(%rsp)
- je error_bad_iret
+ je .Lerror_bad_iret
movl %ecx, %eax /* zero extend */
cmpq %rax, RIP+8(%rsp)
- je bstep_iret
+ je .Lbstep_iret
cmpq $gs_change, RIP+8(%rsp)
- jne error_entry_done
+ jne .Lerror_entry_done

/*
* hack: gs_change can fail with user gsbase. If this happens, fix up
* gsbase and proceed. We'll fix up the exception and land in
* gs_change's error handler with kernel gsbase.
*/
- SWAPGS
- jmp error_entry_done
+ jmp .Lerror_entry_from_usermode_swapgs

-bstep_iret:
+.Lbstep_iret:
/* Fix truncated RIP */
movq %rcx, RIP+8(%rsp)
/* fall through */

-error_bad_iret:
+.Lerror_bad_iret:
/*
* We came from an IRET to user mode, so we have user gsbase.
* Switch to kernel gsbase:
@@ -1198,7 +1202,7 @@ error_bad_iret:
call fixup_bad_iret
mov %rax, %rsp
decl %ebx
- jmp error_entry_done
+ jmp .Lerror_entry_from_usermode_after_swapgs
END(error_entry)

Subject: [tip:x86/asm] x86/entry/64: Migrate 64-bit and compat syscalls to the new exit handlers and remove old assembly code

Commit-ID: 29ea1b258b98a862e59d72556714b75051ae93fb
Gitweb: http://git.kernel.org/tip/29ea1b258b98a862e59d72556714b75051ae93fb
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:28 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:07 +0200

x86/entry/64: Migrate 64-bit and compat syscalls to the new exit handlers and remove old assembly code

These need to be migrated together, as the compat case used to
jump into the middle of the 64-bit exit code.

Remove the old assembly code.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/d4d1d70de08ac3640badf50048a9e8f18fe2497f.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/entry_64.S | 69 +++++-----------------------------------
arch/x86/entry/entry_64_compat.S | 6 ++--
2 files changed, 11 insertions(+), 64 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ccfcba9..4ca5b78 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
*/
USERGS_SYSRET64

+GLOBAL(int_ret_from_sys_call_irqs_off)
+ TRACE_IRQS_ON
+ ENABLE_INTERRUPTS(CLBR_NONE)
+ jmp int_ret_from_sys_call
+
/* Do syscall entry tracing */
tracesys:
movq %rsp, %rdi
@@ -272,69 +277,11 @@ tracesys_phase2:
* Has correct iret frame.
*/
GLOBAL(int_ret_from_sys_call)
- DISABLE_INTERRUPTS(CLBR_NONE)
-int_ret_from_sys_call_irqs_off: /* jumps come here from the irqs-off SYSRET path */
- TRACE_IRQS_OFF
- movl $_TIF_ALLWORK_MASK, %edi
- /* edi: mask to check */
-GLOBAL(int_with_check)
- LOCKDEP_SYS_EXIT_IRQ
- GET_THREAD_INFO(%rcx)
- movl TI_flags(%rcx), %edx
- andl %edi, %edx
- jnz int_careful
- andl $~TS_COMPAT, TI_status(%rcx)
- jmp syscall_return
-
- /*
- * Either reschedule or signal or syscall exit tracking needed.
- * First do a reschedule test.
- * edx: work, edi: workmask
- */
-int_careful:
- bt $TIF_NEED_RESCHED, %edx
- jnc int_very_careful
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq %rdi
- SCHEDULE_USER
- popq %rdi
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp int_with_check
-
- /* handle signals and tracing -- both require a full pt_regs */
-int_very_careful:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_EXTRA_REGS
- /* Check for syscall exit trace */
- testl $_TIF_WORK_SYSCALL_EXIT, %edx
- jz int_signal
- pushq %rdi
- leaq 8(%rsp), %rdi /* &ptregs -> arg1 */
- call syscall_trace_leave
- popq %rdi
- andl $~(_TIF_WORK_SYSCALL_EXIT|_TIF_SYSCALL_EMU), %edi
- jmp int_restore_rest
-
-int_signal:
- testl $_TIF_DO_NOTIFY_MASK, %edx
- jz 1f
- movq %rsp, %rdi /* &ptregs -> arg1 */
- xorl %esi, %esi /* oldset -> arg2 */
- call do_notify_resume
-1: movl $_TIF_WORK_MASK, %edi
-int_restore_rest:
+ movq %rsp, %rdi
+ call syscall_return_slowpath /* returns with IRQs disabled */
RESTORE_EXTRA_REGS
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp int_with_check
-
-syscall_return:
- /* The IRETQ could re-enable interrupts: */
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_IRETQ
+ TRACE_IRQS_IRETQ /* we're about to change IF */

/*
* Try to use SYSRET instead of IRET if we're returning to
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index e5ebdd9..d9bbd31 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -210,10 +210,10 @@ sysexit_from_sys_call:
.endm

.macro auditsys_exit exit
- testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
- jnz ia32_ret_from_sys_call
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
+ testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+ jnz ia32_ret_from_sys_call
movl %eax, %esi /* second arg, syscall return value */
cmpl $-MAX_ERRNO, %eax /* is it an error ? */
jbe 1f
@@ -232,7 +232,7 @@ sysexit_from_sys_call:
movq %rax, R10(%rsp)
movq %rax, R9(%rsp)
movq %rax, R8(%rsp)
- jmp int_with_check
+ jmp int_ret_from_sys_call_irqs_off
.endm

sysenter_auditsys:

Subject: [tip:x86/asm] x86/asm/entry/64: Save all regs on interrupt entry

Commit-ID: ff467594f2a4be01a0fa5e9ffc223fa930d232dd
Gitweb: http://git.kernel.org/tip/ff467594f2a4be01a0fa5e9ffc223fa930d232dd
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:29 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:07 +0200

x86/asm/entry/64: Save all regs on interrupt entry

To prepare for the big rewrite of the error and interrupt exit
paths, we will need pt_regs completely filled in.

It's already completely filled in when error_exit runs, so rearrange
interrupt handling to match it. This will slow down interrupt
handling very slightly (eight instructions), but the
simplification it enables will be more than worth it.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/d8a766a7f558b30e6e01352854628a2d9943460c.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/calling.h | 3 ---
arch/x86/entry/entry_64.S | 29 +++++++++--------------------
2 files changed, 9 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 519207f..3c71dd9 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -135,9 +135,6 @@ For 32-bit we have the following conventions - kernel is built with
movq %rbp, 4*8+\offset(%rsp)
movq %rbx, 5*8+\offset(%rsp)
.endm
- .macro SAVE_EXTRA_REGS_RBP offset=0
- movq %rbp, 4*8+\offset(%rsp)
- .endm

.macro RESTORE_EXTRA_REGS offset=0
movq 0*8+\offset(%rsp), %r15
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 4ca5b78..65029f4 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -502,21 +502,13 @@ END(irq_entries_start)
/* 0(%rsp): ~(interrupt number) */
.macro interrupt func
cld
- /*
- * Since nothing in interrupt handling code touches r12...r15 members
- * of "struct pt_regs", and since interrupts can nest, we can save
- * four stack slots and simultaneously provide
- * an unwind-friendly stack layout by saving "truncated" pt_regs
- * exactly up to rbp slot, without these members.
- */
- ALLOC_PT_GPREGS_ON_STACK -RBP
- SAVE_C_REGS -RBP
- /* this goes to 0(%rsp) for unwinder, not for saving the value: */
- SAVE_EXTRA_REGS_RBP -RBP
+ ALLOC_PT_GPREGS_ON_STACK
+ SAVE_C_REGS
+ SAVE_EXTRA_REGS

- leaq -RBP(%rsp), %rdi /* arg1 for \func (pointer to pt_regs) */
+ movq %rsp,%rdi /* arg1 for \func (pointer to pt_regs) */

- testb $3, CS-RBP(%rsp)
+ testb $3, CS(%rsp)
jz 1f
SWAPGS
1:
@@ -553,9 +545,7 @@ ret_from_intr:
decl PER_CPU_VAR(irq_count)

/* Restore saved previous stack */
- popq %rsi
- /* return code expects complete pt_regs - adjust rsp accordingly: */
- leaq -RBP(%rsi), %rsp
+ popq %rsp

testb $3, CS(%rsp)
jz retint_kernel
@@ -580,7 +570,7 @@ retint_swapgs: /* return to user-space */
TRACE_IRQS_IRETQ

SWAPGS
- jmp restore_c_regs_and_iret
+ jmp restore_regs_and_iret

/* Returning to kernel space */
retint_kernel:
@@ -604,6 +594,8 @@ retint_kernel:
* At this label, code paths which return to kernel and to user,
* which come from interrupts/exception and from syscalls, merge.
*/
+restore_regs_and_iret:
+ RESTORE_EXTRA_REGS
restore_c_regs_and_iret:
RESTORE_C_REGS
REMOVE_PT_GPREGS_FROM_STACK 8
@@ -674,12 +666,10 @@ retint_signal:
jz retint_swapgs
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
- SAVE_EXTRA_REGS
movq $-1, ORIG_RAX(%rsp)
xorl %esi, %esi /* oldset */
movq %rsp, %rdi /* &pt_regs */
call do_notify_resume
- RESTORE_EXTRA_REGS
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
GET_THREAD_INFO(%rcx)
@@ -1160,7 +1150,6 @@ END(error_entry)
*/
ENTRY(error_exit)
movl %ebx, %eax
- RESTORE_EXTRA_REGS
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
testl %eax, %eax

Subject: [tip:x86/asm] x86/asm/entry/64: Simplify IRQ stack pt_regs handling

Commit-ID: a586f98e9767fb0dfdb989002866b4024f00ce08
Gitweb: http://git.kernel.org/tip/a586f98e9767fb0dfdb989002866b4024f00ce08
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:30 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:08 +0200

x86/asm/entry/64: Simplify IRQ stack pt_regs handling

There's no need for both RSI and RDI to point to the original stack.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/3a0481f809dd340c7d3f54ce3fd6d66ef2a578cd.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/entry_64.S | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 65029f4..83eb63d 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -506,8 +506,6 @@ END(irq_entries_start)
SAVE_C_REGS
SAVE_EXTRA_REGS

- movq %rsp,%rdi /* arg1 for \func (pointer to pt_regs) */
-
testb $3, CS(%rsp)
jz 1f
SWAPGS
@@ -519,14 +517,14 @@ END(irq_entries_start)
* a little cheaper to use a separate counter in the PDA (short of
* moving irq_enter into assembly, which would be too much work)
*/
- movq %rsp, %rsi
+ movq %rsp, %rdi
incl PER_CPU_VAR(irq_count)
cmovzq PER_CPU_VAR(irq_stack_ptr), %rsp
- pushq %rsi
+ pushq %rdi
/* We entered an interrupt context - irqs are off: */
TRACE_IRQS_OFF

- call \func
+ call \func /* rdi points to pt_regs */
.endm

/*

Subject: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

Commit-ID: 02bc7768fe447ae305e924b931fa629073a4a1b9
Gitweb: http://git.kernel.org/tip/02bc7768fe447ae305e924b931fa629073a4a1b9
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:31 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:08 +0200

x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/60e90901eee611e59e958bfdbbe39969b4f88fe5.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/entry_64.S | 64 +++++++++++-----------------------------
arch/x86/entry/entry_64_compat.S | 5 ++++
2 files changed, 23 insertions(+), 46 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 83eb63d..168ee26 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -508,7 +508,16 @@ END(irq_entries_start)

testb $3, CS(%rsp)
jz 1f
+
+ /*
+ * IRQ from user mode. Switch to kernel gsbase and inform context
+ * tracking that we're in kernel mode.
+ */
SWAPGS
+#ifdef CONFIG_CONTEXT_TRACKING
+ call enter_from_user_mode
+#endif
+
1:
/*
* Save previous stack pointer, optionally switch to interrupt stack.
@@ -547,26 +556,13 @@ ret_from_intr:

testb $3, CS(%rsp)
jz retint_kernel
- /* Interrupt came from user space */
-GLOBAL(retint_user)
- GET_THREAD_INFO(%rcx)

- /* %rcx: thread info. Interrupts are off. */
-retint_with_reschedule:
- movl $_TIF_WORK_MASK, %edi
-retint_check:
+ /* Interrupt came from user space */
LOCKDEP_SYS_EXIT_IRQ
- movl TI_flags(%rcx), %edx
- andl %edi, %edx
- jnz retint_careful
-
-retint_swapgs: /* return to user-space */
- /*
- * The iretq could re-enable interrupts:
- */
- DISABLE_INTERRUPTS(CLBR_ANY)
+GLOBAL(retint_user)
+ mov %rsp,%rdi
+ call prepare_exit_to_usermode
TRACE_IRQS_IRETQ
-
SWAPGS
jmp restore_regs_and_iret

@@ -644,35 +640,6 @@ native_irq_return_ldt:
popq %rax
jmp native_irq_return_iret
#endif
-
- /* edi: workmask, edx: work */
-retint_careful:
- bt $TIF_NEED_RESCHED, %edx
- jnc retint_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq %rdi
- SCHEDULE_USER
- popq %rdi
- GET_THREAD_INFO(%rcx)
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp retint_check
-
-retint_signal:
- testl $_TIF_DO_NOTIFY_MASK, %edx
- jz retint_swapgs
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- movq $-1, ORIG_RAX(%rsp)
- xorl %esi, %esi /* oldset */
- movq %rsp, %rdi /* &pt_regs */
- call do_notify_resume
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- GET_THREAD_INFO(%rcx)
- jmp retint_with_reschedule
-
END(common_interrupt)

/*
@@ -1088,7 +1055,12 @@ ENTRY(error_entry)
SWAPGS

.Lerror_entry_from_usermode_after_swapgs:
+#ifdef CONFIG_CONTEXT_TRACKING
+ call enter_from_user_mode
+#endif
+
.Lerror_entry_done:
+
TRACE_IRQS_OFF
ret

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index d9bbd31..25aca51 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -458,6 +458,11 @@ ia32_badarg:
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF

+ /* Now finish entering normal kernel mode. */
+#ifdef CONFIG_CONTEXT_TRACKING
+ call enter_from_user_mode
+#endif
+
/* And exit again. */
jmp retint_user

Subject: [tip:x86/asm] x86/entry: Remove exception_enter() from most trap handlers

Commit-ID: 8c84014f3bbb112d07e73f30a10ac8a3a72f8649
Gitweb: http://git.kernel.org/tip/8c84014f3bbb112d07e73f30a10ac8a3a72f8649
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:32 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:09 +0200

x86/entry: Remove exception_enter() from most trap handlers

On 64-bit kernels, we don't need it any more: we handle context
tracking directly on entry from user mode and exit to user mode.

On 32-bit kernels, we don't support context tracking at all, so
these callbacks had no effect.

Note: this doesn't change do_page_fault(). Before we do that,
we need to make sure that there is no code that can page fault
from kernel mode with CONTEXT_USER. The 32-bit fast system call
stack argument code is the only offender I'm aware of right now.
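
For illustration only, the eventual change would presumably let
do_page_fault() carry the same assertion as the handlers below (a
sketch, not part of this patch):

dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	/*
	 * Sketch: only valid once nothing can fault from kernel mode
	 * while context tracking still says CONTEXT_USER.
	 */
	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);

	/* ... existing fault handling ... */
}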

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/ae22f4dfebd799c916574089964592be218151f9.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/traps.h | 4 +-
arch/x86/kernel/cpu/mcheck/mce.c | 5 +--
arch/x86/kernel/cpu/mcheck/p5.c | 5 +--
arch/x86/kernel/cpu/mcheck/winchip.c | 4 +-
arch/x86/kernel/traps.c | 78 +++++++++---------------------------
5 files changed, 27 insertions(+), 69 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c5380be..c3496619 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -112,8 +112,8 @@ asmlinkage void smp_threshold_interrupt(void);
asmlinkage void smp_deferred_error_interrupt(void);
#endif

-extern enum ctx_state ist_enter(struct pt_regs *regs);
-extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
+extern void ist_enter(struct pt_regs *regs);
+extern void ist_exit(struct pt_regs *regs);
extern void ist_begin_non_atomic(struct pt_regs *regs);
extern void ist_end_non_atomic(void);

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 96ccecc..99940d1 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1029,7 +1029,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
{
struct mca_config *cfg = &mca_cfg;
struct mce m, *final;
- enum ctx_state prev_state;
int i;
int worst = 0;
int severity;
@@ -1055,7 +1054,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
int flags = MF_ACTION_REQUIRED;
int lmce = 0;

- prev_state = ist_enter(regs);
+ ist_enter(regs);

this_cpu_inc(mce_exception_count);

@@ -1227,7 +1226,7 @@ out:
local_irq_disable();
ist_end_non_atomic();
done:
- ist_exit(regs, prev_state);
+ ist_exit(regs);
}
EXPORT_SYMBOL_GPL(do_machine_check);

diff --git a/arch/x86/kernel/cpu/mcheck/p5.c b/arch/x86/kernel/cpu/mcheck/p5.c
index 737b0ad..12402e1 100644
--- a/arch/x86/kernel/cpu/mcheck/p5.c
+++ b/arch/x86/kernel/cpu/mcheck/p5.c
@@ -19,10 +19,9 @@ int mce_p5_enabled __read_mostly;
/* Machine check handler for Pentium class Intel CPUs: */
static void pentium_machine_check(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
u32 loaddr, hi, lotype;

- prev_state = ist_enter(regs);
+ ist_enter(regs);

rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);
@@ -39,7 +38,7 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code)

add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);

- ist_exit(regs, prev_state);
+ ist_exit(regs);
}

/* Set up machine check reporting for processors with Intel style MCE: */
diff --git a/arch/x86/kernel/cpu/mcheck/winchip.c b/arch/x86/kernel/cpu/mcheck/winchip.c
index 44f1382..01dd870 100644
--- a/arch/x86/kernel/cpu/mcheck/winchip.c
+++ b/arch/x86/kernel/cpu/mcheck/winchip.c
@@ -15,12 +15,12 @@
/* Machine check handler for WinChip C6: */
static void winchip_machine_check(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state = ist_enter(regs);
+ ist_enter(regs);

printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);

- ist_exit(regs, prev_state);
+ ist_exit(regs);
}

/* Set up machine check reporting on the Winchip C6 series */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 2a783c4..8e65d8a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -108,13 +108,10 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
preempt_count_dec();
}

-enum ctx_state ist_enter(struct pt_regs *regs)
+void ist_enter(struct pt_regs *regs)
{
- enum ctx_state prev_state;
-
if (user_mode(regs)) {
- /* Other than that, we're just an exception. */
- prev_state = exception_enter();
+ CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
} else {
/*
* We might have interrupted pretty much anything. In
@@ -123,32 +120,25 @@ enum ctx_state ist_enter(struct pt_regs *regs)
* but we need to notify RCU.
*/
rcu_nmi_enter();
- prev_state = CONTEXT_KERNEL; /* the value is irrelevant. */
}

/*
- * We are atomic because we're on the IST stack (or we're on x86_32,
- * in which case we still shouldn't schedule).
- *
- * This must be after exception_enter(), because exception_enter()
- * won't do anything if in_interrupt() returns true.
+ * We are atomic because we're on the IST stack; or we're on
+ * x86_32, in which case we still shouldn't schedule; or we're
+ * on x86_64 and entered from user mode, in which case we're
+ * still atomic unless ist_begin_non_atomic is called.
*/
preempt_count_add(HARDIRQ_OFFSET);

/* This code is a bit fragile. Test it. */
rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");
-
- return prev_state;
}

-void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
+void ist_exit(struct pt_regs *regs)
{
- /* Must be before exception_exit. */
preempt_count_sub(HARDIRQ_OFFSET);

- if (user_mode(regs))
- return exception_exit(prev_state);
- else
+ if (!user_mode(regs))
rcu_nmi_exit();
}

@@ -162,7 +152,7 @@ void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
* a double fault, it can be safe to schedule. ist_begin_non_atomic()
* begins a non-atomic section within an ist_enter()/ist_exit() region.
* Callers are responsible for enabling interrupts themselves inside
- * the non-atomic section, and callers must call is_end_non_atomic()
+ * the non-atomic section, and callers must call ist_end_non_atomic()
* before ist_exit().
*/
void ist_begin_non_atomic(struct pt_regs *regs)
@@ -289,7 +279,6 @@ NOKPROBE_SYMBOL(do_trap);
static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
unsigned long trapnr, int signr)
{
- enum ctx_state prev_state = exception_enter();
siginfo_t info;

CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
@@ -300,8 +289,6 @@ static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
do_trap(trapnr, signr, str, regs, error_code,
fill_trap_info(regs, signr, trapnr, &info));
}
-
- exception_exit(prev_state);
}

#define DO_ERROR(trapnr, signr, str, name) \
@@ -353,7 +340,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
}
#endif

- ist_enter(regs); /* Discard prev_state because we won't return. */
+ ist_enter(regs);
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

tsk->thread.error_code = error_code;
@@ -373,15 +360,13 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)

dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
const struct bndcsr *bndcsr;
siginfo_t *info;

- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
if (notify_die(DIE_TRAP, "bounds", regs, error_code,
X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
- goto exit;
+ return;
conditional_sti(regs);

if (!user_mode(regs))
@@ -438,9 +423,8 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
die("bounds", regs, error_code);
}

-exit:
- exception_exit(prev_state);
return;
+
exit_trap:
/*
* This path out is for all the cases where we could not
@@ -450,36 +434,33 @@ exit_trap:
* time..
*/
do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
- exception_exit(prev_state);
}

dotraplinkage void
do_general_protection(struct pt_regs *regs, long error_code)
{
struct task_struct *tsk;
- enum ctx_state prev_state;

- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
conditional_sti(regs);

if (v8086_mode(regs)) {
local_irq_enable();
handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
- goto exit;
+ return;
}

tsk = current;
if (!user_mode(regs)) {
if (fixup_exception(regs))
- goto exit;
+ return;

tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_GP;
if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
die("general protection fault", regs, error_code);
- goto exit;
+ return;
}

tsk->thread.error_code = error_code;
@@ -495,16 +476,12 @@ do_general_protection(struct pt_regs *regs, long error_code)
}

force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
-exit:
- exception_exit(prev_state);
}
NOKPROBE_SYMBOL(do_general_protection);

/* May run on IST stack. */
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
-
#ifdef CONFIG_DYNAMIC_FTRACE
/*
* ftrace must be first, everything else may cause a recursive crash.
@@ -517,7 +494,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
if (poke_int3_handler(regs))
return;

- prev_state = ist_enter(regs);
+ ist_enter(regs);
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
@@ -544,7 +521,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
preempt_conditional_cli(regs);
debug_stack_usage_dec();
exit:
- ist_exit(regs, prev_state);
+ ist_exit(regs);
}
NOKPROBE_SYMBOL(do_int3);

@@ -620,12 +597,11 @@ NOKPROBE_SYMBOL(fixup_bad_iret);
dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
{
struct task_struct *tsk = current;
- enum ctx_state prev_state;
int user_icebp = 0;
unsigned long dr6;
int si_code;

- prev_state = ist_enter(regs);
+ ist_enter(regs);

get_debugreg(dr6, 6);

@@ -700,7 +676,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
debug_stack_usage_dec();

exit:
- ist_exit(regs, prev_state);
+ ist_exit(regs);
}
NOKPROBE_SYMBOL(do_debug);

@@ -752,23 +728,15 @@ static void math_error(struct pt_regs *regs, int error_code, int trapnr)

dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
-
- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
math_error(regs, error_code, X86_TRAP_MF);
- exception_exit(prev_state);
}

dotraplinkage void
do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
-
- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
math_error(regs, error_code, X86_TRAP_XF);
- exception_exit(prev_state);
}

dotraplinkage void
@@ -780,9 +748,6 @@ do_spurious_interrupt_bug(struct pt_regs *regs, long error_code)
dotraplinkage void
do_device_not_available(struct pt_regs *regs, long error_code)
{
- enum ctx_state prev_state;
-
- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
BUG_ON(use_eager_fpu());

@@ -794,7 +759,6 @@ do_device_not_available(struct pt_regs *regs, long error_code)

info.regs = regs;
math_emulate(&info);
- exception_exit(prev_state);
return;
}
#endif
@@ -802,7 +766,6 @@ do_device_not_available(struct pt_regs *regs, long error_code)
#ifdef CONFIG_X86_32
conditional_sti(regs);
#endif
- exception_exit(prev_state);
}
NOKPROBE_SYMBOL(do_device_not_available);

@@ -810,9 +773,7 @@ NOKPROBE_SYMBOL(do_device_not_available);
dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
{
siginfo_t info;
- enum ctx_state prev_state;

- prev_state = exception_enter();
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
local_irq_enable();

@@ -825,7 +786,6 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
do_trap(X86_TRAP_IRET, SIGILL, "iret exception", regs, error_code,
&info);
}
- exception_exit(prev_state);
}
#endif

Subject: [tip:x86/asm] x86/entry: Remove SCHEDULE_USER and asm/context_tracking.h

Commit-ID: 06a7b36c7bd932e60997bedbae32b3d8e6722281
Gitweb: http://git.kernel.org/tip/06a7b36c7bd932e60997bedbae32b3d8e6722281
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:33 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:09 +0200

x86/entry: Remove SCHEDULE_USER and asm/context_tracking.h

SCHEDULE_USER is no longer used, and asm/context_tracking.h
contained nothing else. Remove the header entirely.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/854e9b45f69af20e26c47099eb236321563ebcee.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/entry_64.S | 1 -
arch/x86/include/asm/context_tracking.h | 10 ----------
2 files changed, 11 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 168ee26..041a37a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -33,7 +33,6 @@
#include <asm/paravirt.h>
#include <asm/percpu.h>
#include <asm/asm.h>
-#include <asm/context_tracking.h>
#include <asm/smap.h>
#include <asm/pgtable_types.h>
#include <linux/err.h>
diff --git a/arch/x86/include/asm/context_tracking.h b/arch/x86/include/asm/context_tracking.h
deleted file mode 100644
index 1fe4970..0000000
--- a/arch/x86/include/asm/context_tracking.h
+++ /dev/null
@@ -1,10 +0,0 @@
-#ifndef _ASM_X86_CONTEXT_TRACKING_H
-#define _ASM_X86_CONTEXT_TRACKING_H
-
-#ifdef CONFIG_CONTEXT_TRACKING
-# define SCHEDULE_USER call schedule_user
-#else
-# define SCHEDULE_USER call schedule
-#endif
-
-#endif

Subject: [tip:x86/asm] x86/irq, context_tracking: Document how IRQ context tracking works and add an RCU assertion

Commit-ID: 0333a209cbf600e980fc55c24878a56f25f48b65
Gitweb: http://git.kernel.org/tip/0333a209cbf600e980fc55c24878a56f25f48b65
Author: Andy Lutomirski <[email protected]>
AuthorDate: Fri, 3 Jul 2015 12:44:34 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 7 Jul 2015 10:59:10 +0200

x86/irq, context_tracking: Document how IRQ context tracking works and add an RCU assertion

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/e8bdc4ed0193fb2fd130f3d6b7b8023e2ec1ab62.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/irq.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 88b36648..6233de0 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -216,8 +216,23 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
unsigned vector = ~regs->orig_ax;
unsigned irq;

+ /*
+ * NB: Unlike exception entries, IRQ entries do not reliably
+ * handle context tracking in the low-level entry code. This is
+ * because syscall entries execute briefly with IRQs on before
+ * updating context tracking state, so we can take an IRQ from
+ * kernel mode with CONTEXT_USER. The low-level entry code only
+ * updates the context if we came from user mode, so we won't
+ * switch to CONTEXT_KERNEL. We'll fix that once the syscall
+ * code is cleaned up enough that we can cleanly defer enabling
+ * IRQs.
+ */
+
entering_irq();

+ /* entering_irq() tells RCU that we're not quiescent. Check it. */
+ rcu_lockdep_assert(rcu_is_watching(), "IRQ failed to wake up RCU");
+
irq = __this_cpu_read(vector_irq[vector]);

if (!handle_irq(irq, regs)) {

2015-07-07 11:12:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v5 00/17] x86: Rewrite exit-to-userspace code


So this looks mostly problem-free on my boxen, except this warning triggers:

Adding 3911820k swap on /dev/sda2. Priority:-1 extents:1 across:3911820k
capability: warning: `dbus-daemon' uses 32-bit capabilities (legacy support in use)
------------[ cut here ]------------
WARNING: CPU: 1 PID: 2445 at arch/x86/entry/common.c:311 syscall_return_slowpath+0x4c/0x270()
syscall 6 left IRQs disabled
Modules linked in:
CPU: 1 PID: 2445 Comm: distccd Not tainted 4.2.0-rc1-01597-gaecd781-dirty #18
0000000000000000 00000000776afac2 ffff880035413e58 ffffffff81c8915f
0000000000000000 ffff880035413eb0 ffff880035413e98 ffffffff810a8d82
ffff880035413e78 ffff880035413f58 0000000020020002 ffff880035410000
Call Trace:
[<ffffffff81c8915f>] dump_stack+0x4f/0x7b
[<ffffffff810a8d82>] warn_slowpath_common+0xa2/0xc0
[<ffffffff810a8df5>] warn_slowpath_fmt+0x55/0x70
[<ffffffff81001ddc>] syscall_return_slowpath+0x4c/0x270
[<ffffffff81c96471>] int_ret_from_sys_call+0x25/0x9f
---[ end trace 083efc734e089d37 ]---
device: 'vcs2': device_add
PM: Adding info for No Bus:vcs2
device: 'vcsa2': device_add

with ancient user-space, running the attached .config.

The system booted up fine otherwise. The warning corresponds to:

if (WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
regs->orig_ax))
local_irq_enable();

and this was just the regular startup of the distccd daemon during bootup, nothing
particularly fancy.

Note that 'distccd' is a 32-bit ELF binary - and this is a 64-bit kernel.

Syscall 6 would be:

arch/x86/entry/syscalls/syscall_32.tbl:6 i386 close sys_close

Thanks,

Ingo


Attachments:
(No filename) (1.62 kB)
config (117.75 kB)

2015-07-07 16:04:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 00/17] x86: Rewrite exit-to-userspace code

On Tue, Jul 7, 2015 at 4:12 AM, Ingo Molnar <[email protected]> wrote:
>
> So this looks mostly problem-free on my boxen, except this warning triggers:
>
> Adding 3911820k swap on /dev/sda2. Priority:-1 extents:1 across:3911820k
> capability: warning: `dbus-daemon' uses 32-bit capabilities (legacy support in use)
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 2445 at arch/x86/entry/common.c:311 syscall_return_slowpath+0x4c/0x270()
> syscall 6 left IRQs disabled
> Modules linked in:
> CPU: 1 PID: 2445 Comm: distccd Not tainted 4.2.0-rc1-01597-gaecd781-dirty #18
> 0000000000000000 00000000776afac2 ffff880035413e58 ffffffff81c8915f
> 0000000000000000 ffff880035413eb0 ffff880035413e98 ffffffff810a8d82
> ffff880035413e78 ffff880035413f58 0000000020020002 ffff880035410000
> Call Trace:
> [<ffffffff81c8915f>] dump_stack+0x4f/0x7b
> [<ffffffff810a8d82>] warn_slowpath_common+0xa2/0xc0
> [<ffffffff810a8df5>] warn_slowpath_fmt+0x55/0x70
> [<ffffffff81001ddc>] syscall_return_slowpath+0x4c/0x270
> [<ffffffff81c96471>] int_ret_from_sys_call+0x25/0x9f
> ---[ end trace 083efc734e089d37 ]---
> device: 'vcs2': device_add
> PM: Adding info for No Bus:vcs2
> device: 'vcsa2': device_add
>
> with ancient user-space, running the attached .config.
>
> The system booted up fine otherwise. The warning corresponds to:
>
> if (WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
> regs->orig_ax))
> local_irq_enable();
>
> and this was just the regular startup of the distccd daemon during bootup, nothing
> particularly fancy.
>
> Note that 'distccd' is a 32-bit ELF binary - and this is a 64-bit kernel.
>
> Syscall 6 would be:
>
> arch/x86/entry/syscalls/syscall_32.tbl:6 i386 close sys_close
>
> Thanks,
>
> Ingo

It's IRQ state confusion in these lovely macros:

#ifndef CONFIG_AUDITSYSCALL
# define sysexit_audit ia32_ret_from_sys_call
# define sysretl_audit ia32_ret_from_sys_call
#endif

Frankly, I'm amazed that the old code seems to have worked. I should
have a patch for you later today.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2015-07-07 17:55:39

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH] x86/entry/64: Fix warning on compat syscalls with CONFIG_AUDITSYSCALL=n

int_ret_from_sys_call now expects IRQs to be enabled. I got this right
in the real sysexit_audit and sysretl_audit asm paths, but I missed it
in the #defined-away versions when CONFIG_AUDITSYSCALL=n. This is
a straightforward fix for the CONFIG_AUDITSYSCALL=n case.

Fixes: 29ea1b258b98 ("x86/entry/64: Migrate 64-bit and compat syscalls to the new exit handlers and remove old assembly code")
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64_compat.S | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 25aca51a6324..d7571532e7ce 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -22,8 +22,8 @@
#define __AUDIT_ARCH_LE 0x40000000

#ifndef CONFIG_AUDITSYSCALL
-# define sysexit_audit ia32_ret_from_sys_call
-# define sysretl_audit ia32_ret_from_sys_call
+# define sysexit_audit ia32_ret_from_sys_call_irqs_off
+# define sysretl_audit ia32_ret_from_sys_call_irqs_off
#endif

.section .entry.text, "ax"
@@ -466,6 +466,10 @@ ia32_badarg:
/* And exit again. */
jmp retint_user

+ia32_ret_from_sys_call_irqs_off:
+ TRACE_IRQS_ON
+ ENABLE_INTERRUPTS(CLBR_NONE)
+
ia32_ret_from_sys_call:
xorl %eax, %eax /* Do not leak kernel information */
movq %rax, R11(%rsp)
--
2.4.3

Subject: [tip:x86/asm] x86/entry/64: Fix IRQ state confusion and related warning on compat syscalls with CONFIG_AUDITSYSCALL=n

Commit-ID: 0c6541b605747fc39dc6b1715e1f3a3dca1cace5
Gitweb: http://git.kernel.org/tip/0c6541b605747fc39dc6b1715e1f3a3dca1cace5
Author: Andy Lutomirski <[email protected]>
AuthorDate: Tue, 7 Jul 2015 10:55:28 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 8 Jul 2015 11:53:44 +0200

x86/entry/64: Fix IRQ state confusion and related warning on compat syscalls with CONFIG_AUDITSYSCALL=n

int_ret_from_sys_call now expects IRQs to be enabled. I got
this right in the real sysexit_audit and sysretl_audit asm
paths, but I missed it in the #defined-away versions when
CONFIG_AUDITSYSCALL=n. This is a straightforward fix for
the CONFIG_AUDITSYSCALL=n case.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 29ea1b258b98 ("x86/entry/64: Migrate 64-bit and compat syscalls to the new exit handlers and remove old assembly code")
Link: http://lkml.kernel.org/r/25cf0a01e01c6008118dd8f8d9f043020416700c.1436291493.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/entry_64_compat.S | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 25aca51..d757153 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -22,8 +22,8 @@
#define __AUDIT_ARCH_LE 0x40000000

#ifndef CONFIG_AUDITSYSCALL
-# define sysexit_audit ia32_ret_from_sys_call
-# define sysretl_audit ia32_ret_from_sys_call
+# define sysexit_audit ia32_ret_from_sys_call_irqs_off
+# define sysretl_audit ia32_ret_from_sys_call_irqs_off
#endif

.section .entry.text, "ax"
@@ -466,6 +466,10 @@ ia32_badarg:
/* And exit again. */
jmp retint_user

+ia32_ret_from_sys_call_irqs_off:
+ TRACE_IRQS_ON
+ ENABLE_INTERRUPTS(CLBR_NONE)
+
ia32_ret_from_sys_call:
xorl %eax, %eax /* Do not leak kernel information */
movq %rax, R11(%rsp)

Subject: [tip:x86/asm] x86/entry/64: Fix IRQ state confusion and related warning on compat syscalls with CONFIG_AUDITSYSCALL=n

Commit-ID: 8f7f06b87acd2e017d6c536f59e10045dd8d0578
Gitweb: http://git.kernel.org/tip/8f7f06b87acd2e017d6c536f59e10045dd8d0578
Author: Andy Lutomirski <[email protected]>
AuthorDate: Tue, 7 Jul 2015 10:55:28 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 8 Jul 2015 21:10:25 +0200

x86/entry/64: Fix IRQ state confusion and related warning on compat syscalls with CONFIG_AUDITSYSCALL=n

int_ret_from_sys_call now expects IRQs to be enabled. I got
this right in the real sysexit_audit and sysretl_audit asm
paths, but I missed it in the #defined-away versions when
CONFIG_AUDITSYSCALL=n. This is a straightforward fix for
the CONFIG_AUDITSYSCALL=n case.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 29ea1b258b98 ("x86/entry/64: Migrate 64-bit and compat syscalls to the new exit handlers and remove old assembly code")
Link: http://lkml.kernel.org/r/25cf0a01e01c6008118dd8f8d9f043020416700c.1436291493.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/entry_64_compat.S | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 25aca51..d757153 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -22,8 +22,8 @@
#define __AUDIT_ARCH_LE 0x40000000

#ifndef CONFIG_AUDITSYSCALL
-# define sysexit_audit ia32_ret_from_sys_call
-# define sysretl_audit ia32_ret_from_sys_call
+# define sysexit_audit ia32_ret_from_sys_call_irqs_off
+# define sysretl_audit ia32_ret_from_sys_call_irqs_off
#endif

.section .entry.text, "ax"
@@ -466,6 +466,10 @@ ia32_badarg:
/* And exit again. */
jmp retint_user

+ia32_ret_from_sys_call_irqs_off:
+ TRACE_IRQS_ON
+ ENABLE_INTERRUPTS(CLBR_NONE)
+
ia32_ret_from_sys_call:
xorl %eax, %eax /* Do not leak kernel information */
movq %rax, R11(%rsp)

2015-07-14 23:01:08

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/entry: Add enter_from_user_mode() and use it in syscalls

On Tue, Jul 07, 2015 at 03:51:29AM -0700, tip-bot for Andy Lutomirski wrote:
> Commit-ID: feed36cde0a10adb957445a37e48f957f30b2273
> Gitweb: http://git.kernel.org/tip/feed36cde0a10adb957445a37e48f957f30b2273
> Author: Andy Lutomirski <[email protected]>
> AuthorDate: Fri, 3 Jul 2015 12:44:25 -0700
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 7 Jul 2015 10:59:06 +0200
>
> x86/entry: Add enter_from_user_mode() and use it in syscalls
>
> Changing the x86 context tracking hooks is dangerous because
> there are no good checks that we track our context correctly.
> Add a helper to check that we're actually in CONTEXT_USER when
> we enter from user mode and wire it up for syscall entries.
>
> Subsequent patches will wire this up for all non-NMI entries as
> well. NMIs are their own special beast and cannot currently
> switch overall context tracking state. Instead, they have their
> own special RCU hooks.
>
> This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
> branch) and a tiny slowdown if CONFIG_CONTEXT_TRACKING (adds a
> layer of indirection). Eventually, we should fix up the core
> context tracking code to supply a function that does what we
> want (and can be much simpler than user_exit), which will enable
> us to get rid of the extra call.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Link: http://lkml.kernel.org/r/853b42420066ec3fb856779cdc223a6dcb5d355b.1435952415.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/entry/common.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 917d0c3..9a327ee 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -28,6 +28,15 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/syscalls.h>
>
> +#ifdef CONFIG_CONTEXT_TRACKING
> +/* Called on entry from user mode with IRQs off. */
> +__visible void enter_from_user_mode(void)
> +{
> + CT_WARN_ON(ct_state() != CONTEXT_USER);
> + user_exit();
> +}
> +#endif
> +
> static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
> {
> #ifdef CONFIG_X86_64
> @@ -65,14 +74,16 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
> work = ACCESS_ONCE(current_thread_info()->flags) &
> _TIF_WORK_SYSCALL_ENTRY;
>
> +#ifdef CONFIG_CONTEXT_TRACKING
> /*
> * If TIF_NOHZ is set, we are required to call user_exit() before
> * doing anything that could touch RCU.
> */
> if (work & _TIF_NOHZ) {
> - user_exit();
> + enter_from_user_mode();
> work &= ~_TIF_NOHZ;

We should move the sanity check to user_exit/enter() and use user_exit/enter()
only when we actually enter/exit user mode. Here it's the case, but syscall_trace_leave()
and do_notify_resume() are special cases that should probably use exception_enter/exit(),
unless your patchset has changed things such that there is only one call to user_exit()
once we've completed everything before resuming userspace. I need to review the rest of
the patchset to discover that :-)

> }
> +#endif
>
> #ifdef CONFIG_SECCOMP
> /*

2015-07-14 23:05:11

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/entry: Add enter_from_user_mode() and use it in syscalls

On Tue, Jul 14, 2015 at 4:00 PM, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jul 07, 2015 at 03:51:29AM -0700, tip-bot for Andy Lutomirski wrote:
>> Commit-ID: feed36cde0a10adb957445a37e48f957f30b2273
>> Gitweb: http://git.kernel.org/tip/feed36cde0a10adb957445a37e48f957f30b2273
>> Author: Andy Lutomirski <[email protected]>
>> AuthorDate: Fri, 3 Jul 2015 12:44:25 -0700
>> Committer: Ingo Molnar <[email protected]>
>> CommitDate: Tue, 7 Jul 2015 10:59:06 +0200
>>
>> x86/entry: Add enter_from_user_mode() and use it in syscalls
>>
>> Changing the x86 context tracking hooks is dangerous because
>> there are no good checks that we track our context correctly.
>> Add a helper to check that we're actually in CONTEXT_USER when
>> we enter from user mode and wire it up for syscall entries.
>>
>> Subsequent patches will wire this up for all non-NMI entries as
>> well. NMIs are their own special beast and cannot currently
>> switch overall context tracking state. Instead, they have their
>> own special RCU hooks.
>>
>> This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
>> branch) and a tiny slowdown if CONFIG_CONTEXT_TRACKING (adds a
>> layer of indirection). Eventually, we should fix up the core
>> context tracking code to supply a function that does what we
>> want (and can be much simpler than user_exit), which will enable
>> us to get rid of the extra call.
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Brian Gerst <[email protected]>
>> Cc: Denys Vlasenko <[email protected]>
>> Cc: Denys Vlasenko <[email protected]>
>> Cc: Frederic Weisbecker <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Oleg Nesterov <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: [email protected]
>> Link: http://lkml.kernel.org/r/853b42420066ec3fb856779cdc223a6dcb5d355b.1435952415.git.luto@kernel.org
>> Signed-off-by: Ingo Molnar <[email protected]>
>> ---
>> arch/x86/entry/common.c | 13 ++++++++++++-
>> 1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>> index 917d0c3..9a327ee 100644
>> --- a/arch/x86/entry/common.c
>> +++ b/arch/x86/entry/common.c
>> @@ -28,6 +28,15 @@
>> #define CREATE_TRACE_POINTS
>> #include <trace/events/syscalls.h>
>>
>> +#ifdef CONFIG_CONTEXT_TRACKING
>> +/* Called on entry from user mode with IRQs off. */
>> +__visible void enter_from_user_mode(void)
>> +{
>> + CT_WARN_ON(ct_state() != CONTEXT_USER);
>> + user_exit();
>> +}
>> +#endif
>> +
>> static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
>> {
>> #ifdef CONFIG_X86_64
>> @@ -65,14 +74,16 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>> work = ACCESS_ONCE(current_thread_info()->flags) &
>> _TIF_WORK_SYSCALL_ENTRY;
>>
>> +#ifdef CONFIG_CONTEXT_TRACKING
>> /*
>> * If TIF_NOHZ is set, we are required to call user_exit() before
>> * doing anything that could touch RCU.
>> */
>> if (work & _TIF_NOHZ) {
>> - user_exit();
>> + enter_from_user_mode();
>> work &= ~_TIF_NOHZ;
>
> We should move the sanity check to user_exit/enter() and use user_exit/enter()
> only when we actually enter/exit user mode.

I agree, but I don't know what other arches do.

> Here it's the case, but syscall_trace_leave()
> and do_notify_resume() are special cases that should probably use exception_enter/exit(),
> unless your patchset has changed things such that there is only one call to user_exit()
> once we've completed everything before resuming userspace. I need to review the rest of
> the patchset to discover that :-)

syscall_trace_leave and do_notify_resume may be so screwed up that
your suggestion wouldn't even work. However, the next set of patches
(out for review but currently stalled pending Brian Gerst's vm86 work)
remove those functions entirely.

--Andy
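
[ A sketch of the relocation Frederic suggests, for reference: the
  check would live in the core user_enter()/user_exit() helpers, so
  every arch gets it for free. Hypothetical placement against the
  context tracking API of this era -- not code from this series. ]

static inline void user_exit(void)
{
	if (!context_tracking_is_enabled())
		return;

	/* Entry must come from user mode; catch mistracked callers. */
	CT_WARN_ON(ct_state() != CONTEXT_USER);
	context_tracking_exit(CONTEXT_USER);
}

static inline void user_enter(void)
{
	if (!context_tracking_is_enabled())
		return;

	/* We must be tracked as in-kernel before returning to user. */
	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
	context_tracking_enter(CONTEXT_USER);
}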

2015-07-14 23:07:33

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/entry: Add new, comprehensible entry and exit handlers written in C

On Tue, Jul 07, 2015 at 03:51:48AM -0700, tip-bot for Andy Lutomirski wrote:
> Commit-ID: c5c46f59e4e7c1ab244b8d38f2b61d317df90bba
> Gitweb: http://git.kernel.org/tip/c5c46f59e4e7c1ab244b8d38f2b61d317df90bba
> Author: Andy Lutomirski <[email protected]>
> AuthorDate: Fri, 3 Jul 2015 12:44:26 -0700
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 7 Jul 2015 10:59:06 +0200
>
> x86/entry: Add new, comprehensible entry and exit handlers written in C
>
> The current x86 entry and exit code, written in a mixture of assembly and
> C code, is incomprehensible due to being open-coded in a lot of places
> without coherent documentation.
>
> It appears to work primarily by luck and duct tape: i.e. obvious runtime
> failures were fixed on-demand, without re-thinking the design.
>
> Due to those reasons our confidence level in that code is low, and it is
> very difficult to incrementally improve.
>
> Add new code written in C, in preparation for simply deleting the old
> entry code.
>
> prepare_exit_to_usermode() is a new function that will handle all
> slow path exits to user mode. It is called with IRQs disabled
> and it leaves us in a state in which it is safe to immediately
> return to user mode. IRQs must not be re-enabled at any point
> after prepare_exit_to_usermode() returns and user mode is actually
> entered. (We can, of course, fail to enter user mode and treat
> that failure as a fresh entry to kernel mode.)
>
> All callers of do_notify_resume() will be migrated to call
> prepare_exit_to_usermode() instead; prepare_exit_to_usermode() needs
> to do everything that do_notify_resume() does today, but it also
> takes care of scheduling and context tracking. Unlike
> do_notify_resume(), it does not need to be called in a loop.
>
> syscall_return_slowpath() is exactly what it sounds like: it will
> be called on any syscall exit slow path. It will replace
> syscall_trace_leave() and it calls prepare_exit_to_usermode() on the
> way out.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Link: http://lkml.kernel.org/r/c57c8b87661a4152801d7d3786eac2d1a2f209dd.1435952415.git.luto@kernel.org
> [ Improved the changelog a bit. ]
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/entry/common.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 111 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 9a327ee..febc530 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -207,6 +207,7 @@ long syscall_trace_enter(struct pt_regs *regs)
> return syscall_trace_enter_phase2(regs, arch, phase1_result);
> }
>
> +/* Deprecated. */
> void syscall_trace_leave(struct pt_regs *regs)
> {
> bool step;
> @@ -237,8 +238,117 @@ void syscall_trace_leave(struct pt_regs *regs)
> user_enter();
> }
>
> +static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
> +{
> + unsigned long top_of_stack =
> + (unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
> + return (struct thread_info *)(top_of_stack - THREAD_SIZE);
> +}
> +
> +/* Called with IRQs disabled. */
> +__visible void prepare_exit_to_usermode(struct pt_regs *regs)
> +{
> + if (WARN_ON(!irqs_disabled()))
> + local_irq_disable();
> +
> + /*
> + * In order to return to user mode, we need to have IRQs off with
> + * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
> + * _TIF_UPROBE, or _TIF_NEED_RESCHED set. Several of these flags
> + * can be set at any time on preemptable kernels if we have IRQs on,
> + * so we need to loop. Disabling preemption wouldn't help: doing the
> + * work to clear some of the flags can sleep.
> + */
> + while (true) {
> + u32 cached_flags =
> + READ_ONCE(pt_regs_to_thread_info(regs)->flags);
> +
> + if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
> + _TIF_UPROBE | _TIF_NEED_RESCHED)))
> + break;
> +
> + /* We have work to do. */
> + local_irq_enable();
> +
> + if (cached_flags & _TIF_NEED_RESCHED)
> + schedule();
> +
> + if (cached_flags & _TIF_UPROBE)
> + uprobe_notify_resume(regs);
> +
> + /* deal with pending signal delivery */
> + if (cached_flags & _TIF_SIGPENDING)
> + do_signal(regs);
> +
> + if (cached_flags & _TIF_NOTIFY_RESUME) {
> + clear_thread_flag(TIF_NOTIFY_RESUME);
> + tracehook_notify_resume(regs);
> + }
> +
> + if (cached_flags & _TIF_USER_RETURN_NOTIFY)
> + fire_user_return_notifiers();
> +
> + /* Disable IRQs and retry */
> + local_irq_disable();
> + }

I dreamed so many times about this loop in C!

> +
> + user_enter();

So now we are sure that we have only one call to user_enter() before
resuming userspace, once we've completed everything: rescheduling, signals,
etc... No more hacky context tracking round-trips on signals and rescheduling?

That's great. I need to check if other archs still need schedule_user().

Thanks a lot!

2015-07-14 23:26:26

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/irq, context_tracking: Document how IRQ context tracking works and add an RCU assertion

On Tue, Jul 07, 2015 at 03:54:32AM -0700, tip-bot for Andy Lutomirski wrote:
> Commit-ID: 0333a209cbf600e980fc55c24878a56f25f48b65
> Gitweb: http://git.kernel.org/tip/0333a209cbf600e980fc55c24878a56f25f48b65
> Author: Andy Lutomirski <[email protected]>
> AuthorDate: Fri, 3 Jul 2015 12:44:34 -0700
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 7 Jul 2015 10:59:10 +0200
>
> x86/irq, context_tracking: Document how IRQ context tracking works and add an RCU assertion
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Link: http://lkml.kernel.org/r/e8bdc4ed0193fb2fd130f3d6b7b8023e2ec1ab62.1435952415.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/kernel/irq.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 88b36648..6233de0 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -216,8 +216,23 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
> unsigned vector = ~regs->orig_ax;
> unsigned irq;
>
> + /*
> + * NB: Unlike exception entries, IRQ entries do not reliably
> + * handle context tracking in the low-level entry code. This is
> + * because syscall entries execute briefly with IRQs on before
> + * updating context tracking state, so we can take an IRQ from
> + * kernel mode with CONTEXT_USER. The low-level entry code only
> + * updates the context if we came from user mode, so we won't
> + * switch to CONTEXT_KERNEL. We'll fix that once the syscall
> + * code is cleaned up enough that we can cleanly defer enabling
> + * IRQs.
> + */
> +

Now is it a problem to take interrupts in kernel mode with CONTEXT_USER?
I'm not sure it's worth trying to make it not happen.

> entering_irq();
>
> + /* entering_irq() tells RCU that we're not quiescent. Check it. */
> + rcu_lockdep_assert(rcu_is_watching(), "IRQ failed to wake up RCU");

Why do we need to check that?

> +
> irq = __this_cpu_read(vector_irq[vector]);
>
> if (!handle_irq(irq, regs)) {

Thanks.

2015-07-14 23:28:37

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/entry: Add enter_from_user_mode() and use it in syscalls

On Tue, Jul 14, 2015 at 04:04:47PM -0700, Andy Lutomirski wrote:
> On Tue, Jul 14, 2015 at 4:00 PM, Frederic Weisbecker <[email protected]> wrote:
> > On Tue, Jul 07, 2015 at 03:51:29AM -0700, tip-bot for Andy Lutomirski wrote:
> >> Commit-ID: feed36cde0a10adb957445a37e48f957f30b2273
> >> Gitweb: http://git.kernel.org/tip/feed36cde0a10adb957445a37e48f957f30b2273
> >> Author: Andy Lutomirski <[email protected]>
> >> AuthorDate: Fri, 3 Jul 2015 12:44:25 -0700
> >> Committer: Ingo Molnar <[email protected]>
> >> CommitDate: Tue, 7 Jul 2015 10:59:06 +0200
> >>
> >> x86/entry: Add enter_from_user_mode() and use it in syscalls
> >>
> >> Changing the x86 context tracking hooks is dangerous because
> >> there are no good checks that we track our context correctly.
> >> Add a helper to check that we're actually in CONTEXT_USER when
> >> we enter from user mode and wire it up for syscall entries.
> >>
> >> Subsequent patches will wire this up for all non-NMI entries as
> >> well. NMIs are their own special beast and cannot currently
> >> switch overall context tracking state. Instead, they have their
> >> own special RCU hooks.
> >>
> >> This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
> >> branch) and a tiny slowdown if CONFIG_CONTEXT_TRACKING (adds a
> >> layer of indirection). Eventually, we should fix up the core
> >> context tracking code to supply a function that does what we
> >> want (and can be much simpler than user_exit), which will enable
> >> us to get rid of the extra call.
> >>
> >> Signed-off-by: Andy Lutomirski <[email protected]>
> >> Cc: Andy Lutomirski <[email protected]>
> >> Cc: Borislav Petkov <[email protected]>
> >> Cc: Brian Gerst <[email protected]>
> >> Cc: Denys Vlasenko <[email protected]>
> >> Cc: Denys Vlasenko <[email protected]>
> >> Cc: Frederic Weisbecker <[email protected]>
> >> Cc: H. Peter Anvin <[email protected]>
> >> Cc: Kees Cook <[email protected]>
> >> Cc: Linus Torvalds <[email protected]>
> >> Cc: Oleg Nesterov <[email protected]>
> >> Cc: Peter Zijlstra <[email protected]>
> >> Cc: Rik van Riel <[email protected]>
> >> Cc: Thomas Gleixner <[email protected]>
> >> Cc: [email protected]
> >> Link: http://lkml.kernel.org/r/853b42420066ec3fb856779cdc223a6dcb5d355b.1435952415.git.luto@kernel.org
> >> Signed-off-by: Ingo Molnar <[email protected]>
> >> ---
> >> arch/x86/entry/common.c | 13 ++++++++++++-
> >> 1 file changed, 12 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> >> index 917d0c3..9a327ee 100644
> >> --- a/arch/x86/entry/common.c
> >> +++ b/arch/x86/entry/common.c
> >> @@ -28,6 +28,15 @@
> >> #define CREATE_TRACE_POINTS
> >> #include <trace/events/syscalls.h>
> >>
> >> +#ifdef CONFIG_CONTEXT_TRACKING
> >> +/* Called on entry from user mode with IRQs off. */
> >> +__visible void enter_from_user_mode(void)
> >> +{
> >> + CT_WARN_ON(ct_state() != CONTEXT_USER);
> >> + user_exit();
> >> +}
> >> +#endif
> >> +
> >> static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
> >> {
> >> #ifdef CONFIG_X86_64
> >> @@ -65,14 +74,16 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
> >> work = ACCESS_ONCE(current_thread_info()->flags) &
> >> _TIF_WORK_SYSCALL_ENTRY;
> >>
> >> +#ifdef CONFIG_CONTEXT_TRACKING
> >> /*
> >> * If TIF_NOHZ is set, we are required to call user_exit() before
> >> * doing anything that could touch RCU.
> >> */
> >> if (work & _TIF_NOHZ) {
> >> - user_exit();
> >> + enter_from_user_mode();
> >> work &= ~_TIF_NOHZ;
> >
> > We should move the sanity check to user_exit/enter() and use user_exit/enter()
> > only when we actually enter/exit user mode.
>
> I agree, but I don't know what other arches do.

Right, I'll need to check that carefully, once I fully understand your patchset.

>
> > Here it's the case, but syscall_trace_leave()
> > and do_notify_resume() are special cases that should probably use exception_enter/exit(),
> > unless your patchset has changed things such that there is only one call to user_exit()
> > once we've completed everything before resuming userspace. I need to review the rest of
> > the patchset to discover that :-)
>
> syscall_trace_leave and do_notify_resume may be so screwed up that
> your suggestion wouldn't even work. However, the next set of patches
> (out for review but currently stalled pending Brian Gerst's vm86 work)
> remove those functions entirely.

Ok so I'm probably confused. I need to check the resulting code.

>
> --Andy

2015-07-14 23:34:04

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/irq, context_tracking: Document how IRQ context tracking works and add an RCU assertion

On Tue, Jul 14, 2015 at 4:26 PM, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jul 07, 2015 at 03:54:32AM -0700, tip-bot for Andy Lutomirski wrote:
>> Commit-ID: 0333a209cbf600e980fc55c24878a56f25f48b65
>> Gitweb: http://git.kernel.org/tip/0333a209cbf600e980fc55c24878a56f25f48b65
>> Author: Andy Lutomirski <[email protected]>
>> AuthorDate: Fri, 3 Jul 2015 12:44:34 -0700
>> Committer: Ingo Molnar <[email protected]>
>> CommitDate: Tue, 7 Jul 2015 10:59:10 +0200
>>
>> x86/irq, context_tracking: Document how IRQ context tracking works and add an RCU assertion
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Brian Gerst <[email protected]>
>> Cc: Denys Vlasenko <[email protected]>
>> Cc: Denys Vlasenko <[email protected]>
>> Cc: Frederic Weisbecker <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Oleg Nesterov <[email protected]>
>> Cc: Paul E. McKenney <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: [email protected]
>> Link: http://lkml.kernel.org/r/e8bdc4ed0193fb2fd130f3d6b7b8023e2ec1ab62.1435952415.git.luto@kernel.org
>> Signed-off-by: Ingo Molnar <[email protected]>
>> ---
>> arch/x86/kernel/irq.c | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
>> index 88b36648..6233de0 100644
>> --- a/arch/x86/kernel/irq.c
>> +++ b/arch/x86/kernel/irq.c
>> @@ -216,8 +216,23 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
>> unsigned vector = ~regs->orig_ax;
>> unsigned irq;
>>
>> + /*
>> + * NB: Unlike exception entries, IRQ entries do not reliably
>> + * handle context tracking in the low-level entry code. This is
>> + * because syscall entries execute briefly with IRQs on before
>> + * updating context tracking state, so we can take an IRQ from
>> + * kernel mode with CONTEXT_USER. The low-level entry code only
>> + * updates the context if we came from user mode, so we won't
>> + * switch to CONTEXT_KERNEL. We'll fix that once the syscall
>> + * code is cleaned up enough that we can cleanly defer enabling
>> + * IRQs.
>> + */
>> +
>
> Now is it a problem to take interrupts in kernel mode with CONTEXT_USER?
> I'm not sure it's worth trying to make it not happen.

It's not currently a problem, but it would be nice if we could do the
equivalent of:

if (user_mode(regs)) {
        user_exit(); (or enter_from_user_mode or whatever)
} else {
        // don't bother -- already in CONTEXT_KERNEL
}

i.e. the same thing that do_general_protection, etc do in -tip. That
would get rid of any need to store the previous context.

Currently we can't because of syscalls and maybe because of KVM. KVM
has a weird fake interrupt thing.

>
>> entering_irq();
>>
>> + /* entering_irq() tells RCU that we're not quiescent. Check it. */
>> + rcu_lockdep_assert(rcu_is_watching(), "IRQ failed to wake up RCU");
>
> Why do we need to check that?

Sanity check. If we're changing a bunch of context tracking details,
I want to assert that it actually works.

--Andy

2015-07-15 19:56:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/entry: Add new, comprehensible entry and exit handlers written in C

On Tue, Jul 14, 2015 at 4:07 PM, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jul 07, 2015 at 03:51:48AM -0700, tip-bot for Andy Lutomirski wrote:
>> + while (true) {
>> + u32 cached_flags =
>> + READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>> +
>> + if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
>> + _TIF_UPROBE | _TIF_NEED_RESCHED)))
>> + break;
>> +
>> + /* We have work to do. */
>> + local_irq_enable();
>> +
>> + if (cached_flags & _TIF_NEED_RESCHED)
>> + schedule();
>> +
>> + if (cached_flags & _TIF_UPROBE)
>> + uprobe_notify_resume(regs);
>> +
>> + /* deal with pending signal delivery */
>> + if (cached_flags & _TIF_SIGPENDING)
>> + do_signal(regs);
>> +
>> + if (cached_flags & _TIF_NOTIFY_RESUME) {
>> + clear_thread_flag(TIF_NOTIFY_RESUME);
>> + tracehook_notify_resume(regs);
>> + }
>> +
>> + if (cached_flags & _TIF_USER_RETURN_NOTIFY)
>> + fire_user_return_notifiers();
>> +
>> + /* Disable IRQs and retry */
>> + local_irq_disable();
>> + }
>
> I dreamed so many times about this loop in C!

So this made me look at it again, and now I'm worried.

There's that "early break", but it doesn't check
_TIF_USER_RETURN_NOTIFY. So if *only* USER_RETURN_NOTIFY is set, we're
screwed.

It might be that that doesn't happen for some reason, but I'm not
seeing what that reason would be.

The other thing that worries me is that this depends on all the
handler routines to clear the flags (except for
tracehook_notify_resume()). Which they hopefully do. But that means
that just looking at this locally, it's not at all obvious that it
works right.

So wouldn't it be much nicer to do:

u32 cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);

cached_flags &= _TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
                _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE | _TIF_NEED_RESCHED;

if (!cached_flags)
        break;

atomic_clear_mask(cached_flags, &pt_regs_to_thread_info(regs)->flags);

and then have those bit tests after that?

Linus
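
[ For concreteness, a sketch of the loop restructured along the lines
  Linus suggests -- snapshot the flags, clear them atomically, then act
  on the snapshot. Illustrative only; as Andy notes below, this costs
  an extra atomic unless the generic handlers stop clearing the bits
  themselves. ]

	while (true) {
		u32 cached_flags =
			READ_ONCE(pt_regs_to_thread_info(regs)->flags);

		cached_flags &= _TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
				_TIF_USER_RETURN_NOTIFY | _TIF_UPROBE |
				_TIF_NEED_RESCHED;
		if (!cached_flags)
			break;

		/* Claim all the work bits in one atomic operation. */
		atomic_clear_mask(cached_flags,
				  &pt_regs_to_thread_info(regs)->flags);

		local_irq_enable();

		if (cached_flags & _TIF_NEED_RESCHED)
			schedule();

		/* ... remaining handlers, keyed off the snapshot ... */

		local_irq_disable();
	}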

2015-07-15 20:47:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/entry: Add new, comprehensible entry and exit handlers written in C

On Wed, Jul 15, 2015 at 12:56 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Jul 14, 2015 at 4:07 PM, Frederic Weisbecker <[email protected]> wrote:
>> On Tue, Jul 07, 2015 at 03:51:48AM -0700, tip-bot for Andy Lutomirski wrote:
>>> + while (true) {
>>> + u32 cached_flags =
>>> + READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>>> +
>>> + if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
>>> + _TIF_UPROBE | _TIF_NEED_RESCHED)))
>>> + break;
>>> +
>>> + /* We have work to do. */
>>> + local_irq_enable();
>>> +
>>> + if (cached_flags & _TIF_NEED_RESCHED)
>>> + schedule();
>>> +
>>> + if (cached_flags & _TIF_UPROBE)
>>> + uprobe_notify_resume(regs);
>>> +
>>> + /* deal with pending signal delivery */
>>> + if (cached_flags & _TIF_SIGPENDING)
>>> + do_signal(regs);
>>> +
>>> + if (cached_flags & _TIF_NOTIFY_RESUME) {
>>> + clear_thread_flag(TIF_NOTIFY_RESUME);
>>> + tracehook_notify_resume(regs);
>>> + }
>>> +
>>> + if (cached_flags & _TIF_USER_RETURN_NOTIFY)
>>> + fire_user_return_notifiers();
>>> +
>>> + /* Disable IRQs and retry */
>>> + local_irq_disable();
>>> + }
>>
>> I dreamed so many times about this loop in C!
>
> So this made me look at it again, and now I'm worried.
>
> There's that "early break", but it doesn't check
> _TIF_USER_RETURN_NOTIFY. So if *only* USER_RETURN_NOTIFY is set, we're
> screwed.

Crap, that's a bug. I'll send a patch.

>
> It might be that that doesn't happen for some reason, but I'm not
> seeing what that reason would be.
>
> The other thing that worries me is that this depends on all the
> handler routines to clear the flags (except for
> tracehook_notify_resume()). Which they hopefully do. But that means
> that just looking at this locally, it's not at all obvious that it
> works right.

The old do_notify_resume work loop worked more or less the same way,
so we should be okay here. See below.

>
> So wouldn't it be much nicer to do:
>
> u32 cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>
> cached_flags &= _TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
>                 _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE | _TIF_NEED_RESCHED;
>
> if (!cached_flags)
>         break;
>
> atomic_clear_mask(cached_flags, &pt_regs_to_thread_info(regs)->flags);
>
> and then have those bit tests after that?
>

Yes, but it would be a slowdown unless we converted all the various
handlers to stop clearing the bits separately (two atomics instead of
one). And to do that, we'd probably want to change all the arches.

Signal handling has all the recalc_sigpending stuff. schedule() had
better clear TIF_NEED_RESCHED. fire_user_return_notifiers is totally
absurd but it does clear the bit. uprobes clears the bit directly.

I'd be all for changing this, but coordinating with the generic code
could be annoying.

--Andy

2015-07-15 21:25:21

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH] x86/entry: Fix _TIF_USER_RETURN_NOTIFY check in prepare_exit_to_usermode

Linus noticed that the early return check was missing
_TIF_USER_RETURN_NOTIFY. If the only work flag was
_TIF_USER_RETURN_NOTIFY, we'd skip user return notifiers. Fix it.
(This is the only missing bit.)

This fixes double faults on a KVM host. It's the same issue as last
time, except that this time it's very easy to trigger. Apparently no
one uses -next as a KVM host.

(I'm still not quite sure what it is that KVM does that blows up so
badly if we miss a user return notifier. My best guess is that KVM
lets KERNEL_GS_BASE (i.e. the user's gs base) be negative and fixes
it up in a user return notifier. If we actually end up in user mode
with a negative gs base, we blow up pretty badly.)

Fixes: c5c46f59e4e7 ("x86/entry: Add new, comprehensible entry and exit handlers written in C")
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/common.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index febc53086a69..a3e9c7fa15d9 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,7 +264,8 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
READ_ONCE(pt_regs_to_thread_info(regs)->flags);

if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
- _TIF_UPROBE | _TIF_NEED_RESCHED)))
+ _TIF_UPROBE | _TIF_NEED_RESCHED |
+ _TIF_USER_RETURN_NOTIFY)))
break;

/* We have work to do. */
--
2.4.3

Subject: [tip:x86/asm] x86/entry: Fix _TIF_USER_RETURN_NOTIFY check in prepare_exit_to_usermode

Commit-ID: d132803e6c611d50c19baedc8ae520203a2baca7
Gitweb: http://git.kernel.org/tip/d132803e6c611d50c19baedc8ae520203a2baca7
Author: Andy Lutomirski <[email protected]>
AuthorDate: Wed, 15 Jul 2015 14:25:16 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 17 Jul 2015 16:08:22 +0200

x86/entry: Fix _TIF_USER_RETURN_NOTIFY check in prepare_exit_to_usermode

Linus noticed that the early return check was missing
_TIF_USER_RETURN_NOTIFY. If the only work flag was
_TIF_USER_RETURN_NOTIFY, we'd skip user return notifiers. Fix
it. (This is the only missing bit.)

This fixes double faults on a KVM host. It's the same issue as
last time, except that this time it's very easy to trigger.
Apparently no one uses -next as a KVM host.

( I'm still not quite sure what it is that KVM does that blows up
so badly if we miss a user return notifier. My best guess is that KVM
lets KERNEL_GS_BASE (i.e. the user's gs base) be negative and fixes
it up in a user return notifier. If we actually end up in user mode
with a negative gs base, we blow up pretty badly. )

Reported-by: Linus Torvalds <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: c5c46f59e4e7 ("x86/entry: Add new, comprehensible entry and exit handlers written in C")
Link: http://lkml.kernel.org/r/3f801104d24ee7a6bb1446408d9950777aa63277.1436995419.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/entry/common.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index febc530..a3e9c7f 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,7 +264,8 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
READ_ONCE(pt_regs_to_thread_info(regs)->flags);

if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
- _TIF_UPROBE | _TIF_NEED_RESCHED)))
+ _TIF_UPROBE | _TIF_NEED_RESCHED |
+ _TIF_USER_RETURN_NOTIFY)))
break;

/* We have work to do. */

2015-07-18 13:24:04

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/irq, context_tracking: Document how IRQ context tracking works and add an RCU assertion

On Tue, Jul 14, 2015 at 04:33:39PM -0700, Andy Lutomirski wrote:
> On Tue, Jul 14, 2015 at 4:26 PM, Frederic Weisbecker <[email protected]> wrote:
> > On Tue, Jul 07, 2015 at 03:54:32AM -0700, tip-bot for Andy Lutomirski wrote:
> >> Commit-ID: 0333a209cbf600e980fc55c24878a56f25f48b65
> >> Gitweb: http://git.kernel.org/tip/0333a209cbf600e980fc55c24878a56f25f48b65
> >> Author: Andy Lutomirski <[email protected]>
> >> AuthorDate: Fri, 3 Jul 2015 12:44:34 -0700
> >> Committer: Ingo Molnar <[email protected]>
> >> CommitDate: Tue, 7 Jul 2015 10:59:10 +0200
> >>
> >> x86/irq, context_tracking: Document how IRQ context tracking works and add an RCU assertion
> >>
> >> Signed-off-by: Andy Lutomirski <[email protected]>
> >> Cc: Andy Lutomirski <[email protected]>
> >> Cc: Borislav Petkov <[email protected]>
> >> Cc: Brian Gerst <[email protected]>
> >> Cc: Denys Vlasenko <[email protected]>
> >> Cc: Denys Vlasenko <[email protected]>
> >> Cc: Frederic Weisbecker <[email protected]>
> >> Cc: H. Peter Anvin <[email protected]>
> >> Cc: Kees Cook <[email protected]>
> >> Cc: Linus Torvalds <[email protected]>
> >> Cc: Oleg Nesterov <[email protected]>
> >> Cc: Paul E. McKenney <[email protected]>
> >> Cc: Peter Zijlstra <[email protected]>
> >> Cc: Rik van Riel <[email protected]>
> >> Cc: Thomas Gleixner <[email protected]>
> >> Cc: [email protected]
> >> Link: http://lkml.kernel.org/r/e8bdc4ed0193fb2fd130f3d6b7b8023e2ec1ab62.1435952415.git.luto@kernel.org
> >> Signed-off-by: Ingo Molnar <[email protected]>
> >> ---
> >> arch/x86/kernel/irq.c | 15 +++++++++++++++
> >> 1 file changed, 15 insertions(+)
> >>
> >> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> >> index 88b36648..6233de0 100644
> >> --- a/arch/x86/kernel/irq.c
> >> +++ b/arch/x86/kernel/irq.c
> >> @@ -216,8 +216,23 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
> >> unsigned vector = ~regs->orig_ax;
> >> unsigned irq;
> >>
> >> + /*
> >> + * NB: Unlike exception entries, IRQ entries do not reliably
> >> + * handle context tracking in the low-level entry code. This is
> >> + * because syscall entries execute briefly with IRQs on before
> >> + * updating context tracking state, so we can take an IRQ from
> >> + * kernel mode with CONTEXT_USER. The low-level entry code only
> >> + * updates the context if we came from user mode, so we won't
> >> + * switch to CONTEXT_KERNEL. We'll fix that once the syscall
> >> + * code is cleaned up enough that we can cleanly defer enabling
> >> + * IRQs.
> >> + */
> >> +
> >
> > Now is it a problem to take interrupts in kernel mode with CONTEXT_USER?
> > I'm not sure it's worth trying to make it not happen.
>
> It's not currently a problem, but it would be nice if we could do the
> equivalent of:
>
> if (user_mode(regs)) {
>         user_exit(); (or enter_from_user_mode or whatever)
> } else {
>         // don't bother -- already in CONTEXT_KERNEL
> }

This was the initial implementation of context tracking but it was terribly
buggy. What if we enter the kernel, we haven't yet had a chance to call
context_tracking_user_exit() and we get an exception in the kernel entry
path? user_mode(regs) will return the wrong value and bad things happen.

This is why context tracking needs its own tracking state, because we are always
out of sync with the real processor context anyway.

>
> i.e. the same thing that do_general_protection, etc do in -tip. That
> would get rid of any need to store the previous context.
>
> Currently we can't because of syscalls and maybe because of KVM. KVM
> has a weird fake interrupt thing.
>
> >
> >> entering_irq();
> >>
> >> + /* entering_irq() tells RCU that we're not quiescent. Check it. */
> >> + rcu_lockdep_assert(rcu_is_watching(), "IRQ failed to wake up RCU");
> >
> > Why do we need to check that?
>
> Sanity check. If we're changing a bunch of context tracking details,
> I want to assert that it actually works.

But we call rcu_irq_enter() right before.

It's more or less like doing:

local_irq_disable();
WARN_ON(!irqs_disabled());

2015-07-18 14:10:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/irq, context_tracking: Document how IRQ context tracking works and add an RCU assertion

On Sat, Jul 18, 2015 at 03:23:57PM +0200, Frederic Weisbecker wrote:
> On Tue, Jul 14, 2015 at 04:33:39PM -0700, Andy Lutomirski wrote:
> > On Tue, Jul 14, 2015 at 4:26 PM, Frederic Weisbecker <[email protected]> wrote:
> > > On Tue, Jul 07, 2015 at 03:54:32AM -0700, tip-bot for Andy Lutomirski wrote:
> > >> Commit-ID: 0333a209cbf600e980fc55c24878a56f25f48b65
> > >> Gitweb: http://git.kernel.org/tip/0333a209cbf600e980fc55c24878a56f25f48b65
> > >> Author: Andy Lutomirski <[email protected]>
> > >> AuthorDate: Fri, 3 Jul 2015 12:44:34 -0700
> > >> Committer: Ingo Molnar <[email protected]>
> > >> CommitDate: Tue, 7 Jul 2015 10:59:10 +0200
> > >>
> > >> x86/irq, context_tracking: Document how IRQ context tracking works and add an RCU assertion
> > >>
> > >> Signed-off-by: Andy Lutomirski <[email protected]>
> > >> Cc: Andy Lutomirski <[email protected]>
> > >> Cc: Borislav Petkov <[email protected]>
> > >> Cc: Brian Gerst <[email protected]>
> > >> Cc: Denys Vlasenko <[email protected]>
> > >> Cc: Denys Vlasenko <[email protected]>
> > >> Cc: Frederic Weisbecker <[email protected]>
> > >> Cc: H. Peter Anvin <[email protected]>
> > >> Cc: Kees Cook <[email protected]>
> > >> Cc: Linus Torvalds <[email protected]>
> > >> Cc: Oleg Nesterov <[email protected]>
> > >> Cc: Paul E. McKenney <[email protected]>
> > >> Cc: Peter Zijlstra <[email protected]>
> > >> Cc: Rik van Riel <[email protected]>
> > >> Cc: Thomas Gleixner <[email protected]>
> > >> Cc: [email protected]
> > >> Link: http://lkml.kernel.org/r/e8bdc4ed0193fb2fd130f3d6b7b8023e2ec1ab62.1435952415.git.luto@kernel.org
> > >> Signed-off-by: Ingo Molnar <[email protected]>
> > >> ---
> > >> arch/x86/kernel/irq.c | 15 +++++++++++++++
> > >> 1 file changed, 15 insertions(+)
> > >>
> > >> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> > >> index 88b36648..6233de0 100644
> > >> --- a/arch/x86/kernel/irq.c
> > >> +++ b/arch/x86/kernel/irq.c
> > >> @@ -216,8 +216,23 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
> > >> unsigned vector = ~regs->orig_ax;
> > >> unsigned irq;
> > >>
> > >> + /*
> > >> + * NB: Unlike exception entries, IRQ entries do not reliably
> > >> + * handle context tracking in the low-level entry code. This is
> > >> + * because syscall entries execute briefly with IRQs on before
> > >> + * updating context tracking state, so we can take an IRQ from
> > >> + * kernel mode with CONTEXT_USER. The low-level entry code only
> > >> + * updates the context if we came from user mode, so we won't
> > >> + * switch to CONTEXT_KERNEL. We'll fix that once the syscall
> > >> + * code is cleaned up enough that we can cleanly defer enabling
> > >> + * IRQs.
> > >> + */
> > >> +
> > >
> > > Now is it a problem to take interrupts in kernel mode with CONTEXT_USER?
> > > I'm not sure it's worth trying to make it not happen.
> >
> > It's not currently a problem, but it would be nice if we could do the
> > equivalent of:
> >
> > if (user_mode(regs)) {
> >         user_exit(); (or enter_from_user_mode or whatever)
> > } else {
> >         // don't bother -- already in CONTEXT_KERNEL
> > }
>
> This was the initial implementation of context tracking but it was terribly
> buggy. What if we enter the kernel, we haven't yet had a chance to call
> context_tracking_user_exit() and we get an exception in the kernel entry
> path? user_mode(regs) will return the wrong value and bad things happen.
>
> This is why context tracking needs its own tracking state, because we are always
> out of sync with the real processor context anyway.
>
> >
> > i.e. the same thing that do_general_protection, etc do in -tip. That
> > would get rid of any need to store the previous context.
> >
> > Currently we can't because of syscalls and maybe because of KVM. KVM
> > has a weird fake interrupt thing.
> >
> > >
> > >> entering_irq();
> > >>
> > >> + /* entering_irq() tells RCU that we're not quiescent. Check it. */
> > >> + rcu_lockdep_assert(rcu_is_watching(), "IRQ failed to wake up RCU");
> > >
> > > Why do we need to check that?
> >
> > Sanity check. If we're changing a bunch of context tracking details,
> > I want to assert that it actually works.
>
> But we call rcu_irq_enter() right before.
>
> It's more or less like doing:
>
> local_irq_disable();
> WARN_ON(!irqs_disabled());

If we end up in a world where RCU sometimes uses context-tracking state
and sometimes uses its own state (for example, for architecture that
do not support context tracking), such a check might make more sense.
It would be all too easy for someone to accidentally manage to disable
both somehow, and things would sort of work but have strange undebuggable
failure cases. Sometimes.

Thanx, Paul

2015-08-11 22:18:29

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Jul 07, 2015 at 03:53:29AM -0700, tip-bot for Andy Lutomirski wrote:
> Commit-ID: 02bc7768fe447ae305e924b931fa629073a4a1b9
> Gitweb: http://git.kernel.org/tip/02bc7768fe447ae305e924b931fa629073a4a1b9
> Author: Andy Lutomirski <[email protected]>
> AuthorDate: Fri, 3 Jul 2015 12:44:31 -0700
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 7 Jul 2015 10:59:08 +0200
>
> x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Link: http://lkml.kernel.org/r/60e90901eee611e59e958bfdbbe39969b4f88fe5.1435952415.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/entry/entry_64.S | 64 +++++++++++-----------------------------
> arch/x86/entry/entry_64_compat.S | 5 ++++
> 2 files changed, 23 insertions(+), 46 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 83eb63d..168ee26 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -508,7 +508,16 @@ END(irq_entries_start)
>
> testb $3, CS(%rsp)
> jz 1f
> +
> + /*
> + * IRQ from user mode. Switch to kernel gsbase and inform context
> + * tracking that we're in kernel mode.
> + */
> SWAPGS
> +#ifdef CONFIG_CONTEXT_TRACKING
> + call enter_from_user_mode
> +#endif

There have been a lot of patches going there lately so I couldn't follow
everything, and since you just started a discussion on context tracking, I
just had a look at the latest change.

So it seems we're now calling user_exit() on IRQ entry. This is not something
we want. We already have everything we need with rcu_irq_enter() and
vtime_account_irq_enter(). user_exit() brings a lot of overhead here that we
don't need. Plus this is called unconditionally since CONFIG_CONTEXT_TRACKING=y
on most distros now.

We really want the context tracking code to be called on the syscall slow path only
(and on exceptions with static keys, though an exception slow path would be desirable as well).

2015-08-11 22:25:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Aug 11, 2015 at 3:18 PM, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jul 07, 2015 at 03:53:29AM -0700, tip-bot for Andy Lutomirski wrote:
>> Commit-ID: 02bc7768fe447ae305e924b931fa629073a4a1b9
>> Gitweb: http://git.kernel.org/tip/02bc7768fe447ae305e924b931fa629073a4a1b9
>> Author: Andy Lutomirski <[email protected]>
>> AuthorDate: Fri, 3 Jul 2015 12:44:31 -0700
>> Committer: Ingo Molnar <[email protected]>
>> CommitDate: Tue, 7 Jul 2015 10:59:08 +0200
>>
>> x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Brian Gerst <[email protected]>
>> Cc: Denys Vlasenko <[email protected]>
>> Cc: Denys Vlasenko <[email protected]>
>> Cc: Frederic Weisbecker <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Oleg Nesterov <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: [email protected]
>> Link: http://lkml.kernel.org/r/60e90901eee611e59e958bfdbbe39969b4f88fe5.1435952415.git.luto@kernel.org
>> Signed-off-by: Ingo Molnar <[email protected]>
>> ---
>> arch/x86/entry/entry_64.S | 64 +++++++++++-----------------------------
>> arch/x86/entry/entry_64_compat.S | 5 ++++
>> 2 files changed, 23 insertions(+), 46 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index 83eb63d..168ee26 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -508,7 +508,16 @@ END(irq_entries_start)
>>
>> testb $3, CS(%rsp)
>> jz 1f
>> +
>> + /*
>> + * IRQ from user mode. Switch to kernel gsbase and inform context
>> + * tracking that we're in kernel mode.
>> + */
>> SWAPGS
>> +#ifdef CONFIG_CONTEXT_TRACKING
>> + call enter_from_user_mode
>> +#endif
>
> There have been a lot of patches going there lately so I couldn't follow
> everything, and since you just started a discussion on context tracking, I
> just had a look at the latest change.
>
> So it seems we're now calling user_exit() on IRQ entry. This is not something
> we want. We already have everything we need with rcu_irq_enter() and
> vtime_account_irq_enter(). user_exit() brings a lot of overhead here that we
> don't need. Plus this is called unconditionally since CONFIG_CONTEXT_TRACKING=y
> on most distros now.
>
> We really want the context tracking code to be called on the syscall slow path only
> (and on exceptions with static keys, though an exception slow path would be desirable as well).

Can you explain to me what context tracking does that rcu_irq_enter
and vtime_account_irq_enter don't do that's expensive? Frankly, I'd
rather drop everything except the context tracking callback.

We also need this for the deletion of exception_enter from the trap
entries to be correct.

Like I said in the other thread, there are too many hooks for arch
code to juggle. Grumble.

--Andy

2015-08-11 22:38:33

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Jul 07, 2015 at 03:53:29AM -0700, tip-bot for Andy Lutomirski wrote:
> Commit-ID: 02bc7768fe447ae305e924b931fa629073a4a1b9
> Gitweb: http://git.kernel.org/tip/02bc7768fe447ae305e924b931fa629073a4a1b9
> Author: Andy Lutomirski <[email protected]>
> AuthorDate: Fri, 3 Jul 2015 12:44:31 -0700
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 7 Jul 2015 10:59:08 +0200
>
> x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: Denys Vlasenko <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Link: http://lkml.kernel.org/r/60e90901eee611e59e958bfdbbe39969b4f88fe5.1435952415.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/entry/entry_64.S | 64 +++++++++++-----------------------------
> arch/x86/entry/entry_64_compat.S | 5 ++++
> 2 files changed, 23 insertions(+), 46 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 83eb63d..168ee26 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1088,7 +1055,12 @@ ENTRY(error_entry)
> SWAPGS
>
> .Lerror_entry_from_usermode_after_swapgs:
> +#ifdef CONFIG_CONTEXT_TRACKING
> + call enter_from_user_mode
> +#endif

This makes me very nervous as well!

It means that instead of using the context tracking save/restore model that we had
with exception_enter/exception_exit(), now we rely on the CS register.

I don't think we can do that because our "context tracking" is a soft tracking whereas
CS is hard tracking, and the two are not atomically synchronized.

Imagine this situation: we are running in userspace. Context tracking knows it, everything
is fine. Now we do a syscall: we enter the kernel entry code but trigger an exception
(DEBUG for example) before we get a chance to call user_exit(), which means the context
tracking code still thinks we are in userspace. So we look at CS from the exception entry code
and it says the exception happened in the kernel. Hence we don't call user_exit() before calling
the exception handler. There is the bug: the exception handler may use RCU, which still
thinks we run in userspace.

In early context tracking days we relied on CS. But I changed that because of exactly this
kind of issue. The only reliable source for the soft context tracking state is the soft context tracking itself.
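
[ For reference, the save/restore model Frederic means, condensed from
  the <linux/context_tracking.h> of this era (simplified): the previous
  *soft* state is captured and restored, and CS is never consulted. ]

static inline enum ctx_state exception_enter(void)
{
	enum ctx_state prev_ctx;

	if (!context_tracking_is_enabled())
		return 0;

	/* Read the soft tracking state, not the hardware CS. */
	prev_ctx = this_cpu_read(context_tracking.state);
	if (prev_ctx != CONTEXT_KERNEL)
		context_tracking_exit(prev_ctx);

	return prev_ctx;
}

static inline void exception_exit(enum ctx_state prev_ctx)
{
	if (context_tracking_is_enabled() && prev_ctx != CONTEXT_KERNEL)
		context_tracking_enter(prev_ctx);
}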

2015-08-11 22:49:11

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Aug 11, 2015 at 03:25:04PM -0700, Andy Lutomirski wrote:
> Can you explain to me what context tracking does that rcu_irq_enter
> and vtime_account_irq_enter don't do that's expensive? Frankly, I'd
> rather drop everything except the context tracking callback.

Irqs have their own hooks in the generic code. irq_enter() and irq_exit().
And those take care of RCU and time accounting already. So arch code really
doesn't need to care about that.

context tracking exists for the sole purpose of tracking states that don't
have generic hooks. Those are syscalls and exceptions.

Besides, rcu_user_exit() is more costly than rcu_irq_enter(), which was designed
for the very purpose of providing fast RCU tracking for non-sleepable
code (sleepable code is what needs rcu_user_exit()).
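
[ Roughly what the generic IRQ hooks already cover, condensed from the
  kernel/softirq.c of this era (simplified): ]

void irq_enter(void)
{
	rcu_irq_enter();		/* fast, non-sleepable RCU tracking */
	if (is_idle_task(current) && !in_interrupt()) {
		/* Prevent raise_softirq() from needlessly waking ksoftirqd. */
		local_bh_disable();
		tick_irq_enter();
		_local_bh_enable();
	}

	__irq_enter();			/* preempt count and vtime accounting */
}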

>
> We also need this for the deletion of exception_enter from the trap
> entries to be correct.

I'm not sure we can really delete exception_enter(). See my other email.

> Like I said in the other thread, there are too many hooks for arch
> code to juggle. Grumble.

Well, archs don't need to care about irq hooks. They only need to track
syscalls and exceptions.

I've been thinking about pushing down syscalls and exceptions to generic
handlers. It might work for syscalls, btw. But many exceptions have only
arch handlers, or a significant amount of work is done at the arch level
which might make use of RCU (eg: breakpoint handlers on x86).

2015-08-11 22:51:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Aug 11, 2015 at 3:38 PM, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jul 07, 2015 at 03:53:29AM -0700, tip-bot for Andy Lutomirski wrote:
>> Commit-ID: 02bc7768fe447ae305e924b931fa629073a4a1b9
>> Gitweb: http://git.kernel.org/tip/02bc7768fe447ae305e924b931fa629073a4a1b9
>> Author: Andy Lutomirski <[email protected]>
>> AuthorDate: Fri, 3 Jul 2015 12:44:31 -0700
>> Committer: Ingo Molnar <[email protected]>
>> CommitDate: Tue, 7 Jul 2015 10:59:08 +0200
>>
>> x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Brian Gerst <[email protected]>
>> Cc: Denys Vlasenko <[email protected]>
>> Cc: Denys Vlasenko <[email protected]>
>> Cc: Frederic Weisbecker <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Oleg Nesterov <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: [email protected]
>> Link: http://lkml.kernel.org/r/60e90901eee611e59e958bfdbbe39969b4f88fe5.1435952415.git.luto@kernel.org
>> Signed-off-by: Ingo Molnar <[email protected]>
>> ---
>> arch/x86/entry/entry_64.S | 64 +++++++++++-----------------------------
>> arch/x86/entry/entry_64_compat.S | 5 ++++
>> 2 files changed, 23 insertions(+), 46 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index 83eb63d..168ee26 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -1088,7 +1055,12 @@ ENTRY(error_entry)
>> SWAPGS
>>
>> .Lerror_entry_from_usermode_after_swapgs:
>> +#ifdef CONFIG_CONTEXT_TRACKING
>> + call enter_from_user_mode
>> +#endif
>
> This makes me very nervous as well!
>
> It means that instead of using the context tracking save/restore model that we had
> with exception_enter/exception_exit(), now we rely on the CS register.
>
> I don't think we can do that because our "context tracking" is a soft tracking whereas
> CS is hard tracking, and the two are not atomically synchronized.
>
> Imagine this situation: we are running in userspace. Context tracking knows it, everything
> is fine. Now we do a syscall: we enter the kernel entry code but trigger an exception
> (DEBUG for example) before we get a chance to call user_exit(), which means the context
> tracking code still thinks we are in userspace. So we look at CS from the exception entry code
> and it says the exception happened in the kernel. Hence we don't call user_exit() before calling
> the exception handler. There is the bug: the exception handler may use RCU, which still
> thinks we run in userspace.

#DB doesn't go through this patch -- it uses the paranoid entry path
and ist_enter. But I see your point. I think that, if we have a
problem like this in practice, then we should fix it.

But the old code had the same issue. If we got an exception (the most
likely one is probably a vmalloc fault) during user_exit and we then
hit exception_enter, the result would probably be bad.

>
> In early context tracking days we relied on CS. But I changed that because of exactly this
> kind of issue. The only reliable source for the soft context tracking state is the soft context tracking itself.

I don't see why the soft state is more reliable. The only bad case is
where the entry itself (HW entry up to user_exit) is not atomic
enough, but that path should be at least as atomic as user_exit itself
is.

--Andy

2015-08-11 23:02:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Aug 11, 2015 at 3:49 PM, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Aug 11, 2015 at 03:25:04PM -0700, Andy Lutomirski wrote:
>> Can you explain to me what context tracking does that rcu_irq_enter
>> and vtime_account_irq_enter don't do that's expensive? Frankly, I'd
>> rather drop everything except the context tracking callback.
>
> Irqs have their own hooks in the generic code. irq_enter() and irq_exit().
> And those take care of RCU and time accounting already. So arch code really
> doesn't need to care about that.

I'd love to have irq_enter_from_user and irq_enter_from_kernel instead.

>
> context tracking exists for the sole purpose of tracking states that don't
> have generic hooks. Those are syscalls and exceptions.
>
> Besides, rcu_user_exit() is more costly than rcu_irq_enter() which have been
> designed for the very purpose of providing a fast RCU tracking for non sleepable
> code (which needs rcu_user_exit()).
>

So rcu_user_exit is slower because it's okay to sleep after calling it?

Would it be possible to defer the overhead until we actually try to
sleep rather than doing it on entry? (I have no idea what's going on
under the hood.)

Anyway, irq_enter_from_user would solve this problem completely.
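
[ The API Andy is floating does not exist in the tree; the names and
  shape below are assumed, sketched only to make the idea concrete: ]

void irq_enter_from_user(void)
{
	enter_from_user_mode();		/* CT_WARN_ON + user_exit() */
	irq_enter();
}

void irq_enter_from_kernel(void)
{
	/* Already CONTEXT_KERNEL; only the IRQ-level hooks are needed. */
	irq_enter();
}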

>
> I've been thinking about pushing down syscalls and exceptions to generic
> handlers. It might work for syscalls btw. But many exceptions have only
> arch handlers, or significant amount of work is done on the arch level
> which might make use of RCU (eg: breakpoint handlers on x86).

I'm trying to port the meat of the x86 syscall code to C. Maybe the
result will generalize. The exit code is already in C (in -tip).

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2015-08-11 23:22:40

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code


On Tue, Aug 11, 2015 at 03:51:26PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 11, 2015 at 3:38 PM, Frederic Weisbecker <[email protected]> wrote:
> >
> > This makes me very nervous as well!
> >
> > It means that instead of using the context tracking save/restore model that we had
> > with exception_enter/exception_exit(), now we rely on the CS register.
> >
> > I don't think we can do that, because our "context tracking" is soft tracking whereas
> > CS is hard tracking, and the two are not atomically synchronized.
> >
> > Imagine this situation: we are running in userspace. Context tracking knows it, everything
> > is fine. Now we do a syscall: we enter the kernel entry code but trigger an exception
> > (DEBUG for example) before we get a chance to call user_exit(), which means that the context
> > tracking code still thinks we are in userspace. So we look at CS from the exception entry code
> > and it says the exception happened in the kernel. Hence we don't call user_exit() before calling
> > the exception handler. There is the bug: the exception handler may use RCU, which still
> > thinks we are running in userspace.
>
> #DB doesn't go through this patch -- it uses the paranoid entry path
> and ist_enter. But I see your point. I think that, if we have a
> problem like this in practice, then we should fix it.

Whatever hack we do to prevent exceptions from happening between real kernel entry
and tracked kernel entry is going to be far less robust than relying strictly on soft
context tracking.

The resulting bugs are rare and very hard to reproduce and diagnose.

>
> But the old code had the same issue. If we got an exception (the most
> likely one is probably a vmalloc fault) during user_exit and we then
> hit exception_enter, the result would probably be bad.

We have a recursion protection in context tracking that should protect against
exceptions triggering in the middle of half-set states.

>
> >
> > In the early context tracking days we relied on CS. But I changed that because of exactly
> > this kind of issue. The only reliable source for soft context tracking is the soft context
> > tracking itself.
>
> I don't see why the soft state is more reliable. The only bad case is
> where the entry itself (HW entry up to user_exit) is not atomic
> enough, but that path should be at least as atomic as user_exit itself
> is.

Note it's not only about entry code up to user_exit() but also about
user_enter() up to iret.

Also, as long as there is at least one instruction between entry to the kernel
and context tracking noting it, there is a risk of an exception. Hence entry
code will never be atomic enough to avoid this kind of bug.

Heh if only we had something like local_exception_save()!

2015-08-11 23:33:28

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Aug 11, 2015 at 4:22 PM, Frederic Weisbecker <[email protected]> wrote:
>
> On Tue, Aug 11, 2015 at 03:51:26PM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 11, 2015 at 3:38 PM, Frederic Weisbecker <[email protected]> wrote:
>> >
>> > This makes me very nervous as well!
>> >
>> > It means that instead of using the context tracking save/restore model that we had
>> > with exception_enter/exception_exit(), now we rely on the CS register.
>> >
> >> > I don't think we can do that, because our "context tracking" is soft tracking whereas
> >> > CS is hard tracking, and the two are not atomically synchronized.
> >> >
> >> > Imagine this situation: we are running in userspace. Context tracking knows it, everything
> >> > is fine. Now we do a syscall: we enter the kernel entry code but trigger an exception
> >> > (DEBUG for example) before we get a chance to call user_exit(), which means that the context
> >> > tracking code still thinks we are in userspace. So we look at CS from the exception entry code
> >> > and it says the exception happened in the kernel. Hence we don't call user_exit() before calling
> >> > the exception handler. There is the bug: the exception handler may use RCU, which still
> >> > thinks we are running in userspace.
>>
>> #DB doesn't go through this patch -- it uses the paranoid entry path
>> and ist_enter. But I see your point. I think that, if we have a
>> problem like this in practice, then we should fix it.
>
> Whatever hack we do to prevent exceptions from happening between real kernel entry
> and tracked kernel entry is going to be far less robust than relying strictly on soft
> context tracking.
>

Why?

Any exception that doesn't leave the context tracking state exactly
the way it found it is buggy. That means that we need to make sure
that context tracking itself is safe wrt exceptions and that we need
to make sure that any exception that can happen early in entry is
itself safe.

The latter is annoying, but the entry code needs to deal with it
anyway. For example, any exception early in NMI is currently really
bad. Non-IST exceptions very early in SYSCALL are fatal.
Non-paranoid exceptions outside swapgs are fatal. Etc.

> The resulting bugs are rare and very hard to reproduce and diagnose.

That's why I stuck assertions all over the place. I know of exactly
one case that will trip the assertion, and it's a false positive and I
plan on fixing it soon.
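
For reference, the assertions are basically one-liners built on the new
ct_state()/CT_WARN_ON helpers; roughly (quoting from memory, so treat it
as a sketch of arch/x86/entry/common.c):

    /* Called on entry from user mode, with IRQs off. */
    __visible void enter_from_user_mode(void)
    {
            /* Assert the soft state matches reality, then switch it. */
            CT_WARN_ON(ct_state() != CONTEXT_USER);
            user_exit();
    }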

>
>>
>> But the old code had the same issue. If we got an exception (the most
>> likely one is probably a vmalloc fault) during user_exit and we then
>> hit exception_enter, the result would probably be bad.
>
> We have a recursion protection in context tracking that should protect against
> exceptions triggering in the middle of half-set states.

I sure hope so. It would be nice to mark it with nokprobes, etc.,
if needed, too.

>
>>
>> >
> >> > In the early context tracking days we relied on CS. But I changed that because of exactly
> >> > this kind of issue. The only reliable source for soft context tracking is the soft context
> >> > tracking itself.
>>
>> I don't see why the soft state is more reliable. The only bad case is
>> where the entry itself (HW entry up to user_exit) is not atomic
>> enough, but that path should be at least as atomic as user_exit itself
>> is.
>
> Note it's not only about entry code up to user_exit() but also about
> user_enter() up to iret.
>

We already need to block interrupts there, and the code for exit back
to userspace is very clean in -tip.

> Also, as long as there is at least one instruction between entry to the kernel
> and context tracking noting it, there is a risk of an exception. Hence entry
> code will never be atomic enough to avoid this kind of bug.

By that argument, we're doomed. Non-IST exceptions outside swapgs are fatal.

>
> Heh if only we had something like local_exception_save()!

What would that mean?

Exceptions aren't magic asynchronous things. They happen only when
you do something that can trigger an exception.

--Andy

2015-08-12 01:02:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Aug 11, 2015 at 03:59:37PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 11, 2015 at 3:49 PM, Frederic Weisbecker <[email protected]> wrote:
> > On Tue, Aug 11, 2015 at 03:25:04PM -0700, Andy Lutomirski wrote:
> >> Can you explain to me what context tracking does that rcu_irq_enter
> >> and vtime_account_irq_enter don't do that's expensive? Frankly, I'd
> >> rather drop everything except the context tracking callback.
> >
> > Irqs have their own hooks in the generic code. irq_enter() and irq_exit().
> > And those take care of RCU and time accounting already. So arch code really
> > doesn't need to care about that.
>
> I'd love to have irq_enter_from_user and irq_enter_from_kernel instead.

RCU would need to know about irq_enter_from_user(), but could blithely
ignore irq_enter_from_kernel(). Unless irq_enter_from_kernel() is called
from the idle loop, in which case RCU would need to know. All that aside,
the overhead of rcu_irq_enter() when called from non-idle kernel mode
should be relatively small. So just telling RCU about all the interrupts
is actually not a bad strategy.
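
For reference, the generic hook Frederic mentions is roughly this
(simplified from kernel/softirq.c and quoted from memory, so details may
be off):

    void irq_enter(void)
    {
            rcu_irq_enter();        /* cheap, idle-aware RCU bookkeeping */
            if (is_idle_task(current) && !in_interrupt()) {
                    /* Idle needs extra care for the tick. */
                    tick_irq_enter();
            }
            __irq_enter();          /* preempt count + time accounting */
    }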

> > context tracking exists for the sole purpose of tracking states that don't
> > have generic hooks. Those are syscalls and exceptions.
> >
> > Besides, rcu_user_exit() is more costly than rcu_irq_enter(), which has been
> > designed for the very purpose of providing fast RCU tracking for non-sleepable
> > code (which needs rcu_user_exit()).
>
> So rcu_user_exit is slower because it's okay to sleep after calling it?
>
> Would it be possible to defer the overhead until we actually try to
> sleep rather than doing it on entry? (I have no idea what's going on
> under the hood.)

Nor do I, at least not until someone tells me what .config they are
using. NO_HZ_FULL, NO_HZ_FULL_SYSIDLE, and RCU_FAST_NO_HZ make a
difference in this case.

> Anyway, irq_enter_from_user would solve this problem completely.
>
> >
> > I've been thinking about pushing down syscalls and exceptions to generic
> > handlers. It might work for syscalls, btw. But many exceptions have only
> > arch handlers, or a significant amount of work is done at the arch level
> > that might make use of RCU (eg: breakpoint handlers on x86).
>
> I'm trying to port the meat of the x86 syscall code to C. Maybe the
> result will generalize. The exit code is already in C (in -tip).

That does sound like a good thing!

Thanx, Paul

2015-08-12 13:13:12

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Aug 11, 2015 at 03:59:37PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 11, 2015 at 3:49 PM, Frederic Weisbecker <[email protected]> wrote:
> > On Tue, Aug 11, 2015 at 03:25:04PM -0700, Andy Lutomirski wrote:
> >> Can you explain to me what context tracking does that rcu_irq_enter
> >> and vtime_account_irq_enter don't do that's expensive? Frankly, I'd
> >> rather drop everything except the context tracking callback.
> >
> > Irqs have their own hooks in the generic code. irq_enter() and irq_exit().
> > And those take care of RCU and time accounting already. So arch code really
> > doesn't need to care about that.
>
> I'd love to have irq_enter_from_user and irq_enter_from_kernel instead.

I don't get why we need that. The vtime internals already keep track of where we
are. Again, mixing up hard and soft tracking is asking for trouble.

>
> >
> > context tracking exists for the sole purpose of tracking states that don't
> > have generic hooks. Those are syscalls and exceptions.
> >
> > Besides, rcu_user_exit() is more costly than rcu_irq_enter(), which has been
> > designed for the very purpose of providing fast RCU tracking for non-sleepable
> > code (which needs rcu_user_exit()).
> >
>
> So rcu_user_exit is slower because it's okay to sleep after calling it?
>
> Would it be possible to defer the overhead until we actually try to
> sleep rather than doing it on entry? (I have no idea what's going on
> under the hood.)

That's a question for Paul.

> Anyway, irq_enter_from_user would solve this problem completely.

How?

> >
> > I've been thinking about pushing down syscalls and exceptions to generic
> > handlers. It might work for syscalls, btw. But many exceptions have only
> > arch handlers, or a significant amount of work is done at the arch level
> > that might make use of RCU (eg: breakpoint handlers on x86).
>
> I'm trying to port the meat of the x86 syscall code to C. Maybe the
> result will generalize. The exit code is already in C (in -tip).

But please don't change such semantics along the way; it really doesn't help
review of the x86 low-level changes if they're mixed up with fundamental context
tracking changes.

2015-08-12 13:32:22

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Aug 11, 2015 at 04:33:05PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 11, 2015 at 4:22 PM, Frederic Weisbecker <[email protected]> wrote:
> >
> > On Tue, Aug 11, 2015 at 03:51:26PM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 11, 2015 at 3:38 PM, Frederic Weisbecker <[email protected]> wrote:
> >> >
> >> > This makes me very nervous as well!
> >> >
> >> > It means that instead of using the context tracking save/restore model that we had
> >> > with exception_enter/exception_exit(), now we rely on the CS register.
> >> >
> >> > I don't think we can do that, because our "context tracking" is soft tracking whereas
> >> > CS is hard tracking, and the two are not atomically synchronized.
> >> >
> >> > Imagine this situation: we are running in userspace. Context tracking knows it, everything
> >> > is fine. Now we do a syscall: we enter the kernel entry code but trigger an exception
> >> > (DEBUG for example) before we get a chance to call user_exit(), which means that the context
> >> > tracking code still thinks we are in userspace. So we look at CS from the exception entry code
> >> > and it says the exception happened in the kernel. Hence we don't call user_exit() before calling
> >> > the exception handler. There is the bug: the exception handler may use RCU, which still
> >> > thinks we are running in userspace.
> >>
> >> #DB doesn't go through this patch -- it uses the paranoid entry path
> >> and ist_enter. But I see your point. I think that, if we have a
> >> problem like this in practice, then we should fix it.
> >
> > Whatever hack we do to prevent exceptions from happening between real kernel entry
> > and tracked kernel entry is going to be far less robust than relying strictly on soft
> > context tracking.
> >
>
> Why?
>
> Any exception that doesn't leave the context tracking state exactly
> the way it found it is buggy. That means that we need to make sure
> that context tracking itself is safe wrt exceptions and that we need
> to make sure that any exception that can happen early in entry is
> itself safe.

Right, and doing it the way we did previously was safe wrt. that.

Can't we have an exception slow path, just like the way we do it for syscalls?

Then the exception slow path would just do:

    if TIF_NOHZ
            ctx = exception_enter()
    exception_handler()
    if TIF_NOHZ
            exception_exit(ctx)

Right now we are calling the context tracking code unconditionally, which is
not good.
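
In C, the proposal would look roughly like this (do_some_fault() and
handle_some_fault() are made-up names for illustration; exception_enter(),
exception_exit() and TIF_NOHZ are the real APIs):

    dotraplinkage void do_some_fault(struct pt_regs *regs, long error_code)
    {
            enum ctx_state prev_state = CONTEXT_KERNEL;

            if (test_thread_flag(TIF_NOHZ))
                    prev_state = exception_enter();  /* save + fix up state */

            handle_some_fault(regs, error_code);

            if (test_thread_flag(TIF_NOHZ))
                    exception_exit(prev_state);      /* restore saved state */
    }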

>
> The latter is annoying, but the entry code needs to deal with it
> anyway. For example, any exception early in NMI is currently really
> bad. Non-IST exceptions very early in SYSCALL are fatal.
> Non-paranoid exceptions outside swapgs are fatal. Etc.

Sure, but that doesn't mean I'm happy with introducing new fragile paths
like those. Especially as we have a way to fix this without more overhead.

>
> > The resulting bugs are rare and very hard to reproduce and diagnose.
>
> That's why I stuck assertions all over the place. I know of exactly
> one case that will trip the assertion, and it's a false positive and I
> plan on fixing it soon.
>
> >
> >>
> >> But the old code had the same issue. If we got an exception (the most
> >> likely one is probably a vmalloc fault) during user_exit and we then
> >> hit exception_enter, the result would probably be bad.
> >
> > We have a recursion protection in context tracking that should protect against
> > exceptions triggering in the middle of half-set states.
>
> I sure hope so. It would be nice to mark it with nokprobes, etc.,
> if needed, too.

Sure.

> > Also, as long as there is at least one instruction between entry to the kernel
> > and context tracking noting it, there is a risk of an exception. Hence entry
> > code will never be atomic enough to avoid this kind of bug.
>
> By that argument, we're doomed. Non-IST exceptions outside swapgs are fatal.

Does that concern only error_entry() exceptions?

> >
> > Heh if only we had something like local_exception_save()!
>
> What would that mean?
>
> Exceptions aren't magic asynchronous things. They happen only when
> you do something that can trigger an exception.

Sure, but did you really never wish to have such an API? :-p

2015-08-12 15:00:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Wed, Aug 12, 2015 at 6:32 AM, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Aug 11, 2015 at 04:33:05PM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 11, 2015 at 4:22 PM, Frederic Weisbecker <[email protected]> wrote:
>> >
>> > On Tue, Aug 11, 2015 at 03:51:26PM -0700, Andy Lutomirski wrote:
>> >> On Tue, Aug 11, 2015 at 3:38 PM, Frederic Weisbecker <[email protected]> wrote:
>> >> >
>> >> > This makes me very nervous as well!
>> >> >
>> >> > It means that instead of using the context tracking save/restore model that we had
>> >> > with exception_enter/exception_exit(), now we rely on the CS register.
>> >> >
>> >> > I don't think we can do that, because our "context tracking" is soft tracking whereas
>> >> > CS is hard tracking, and the two are not atomically synchronized.
>> >> >
>> >> > Imagine this situation: we are running in userspace. Context tracking knows it, everything
>> >> > is fine. Now we do a syscall: we enter the kernel entry code but trigger an exception
>> >> > (DEBUG for example) before we get a chance to call user_exit(), which means that the context
>> >> > tracking code still thinks we are in userspace. So we look at CS from the exception entry code
>> >> > and it says the exception happened in the kernel. Hence we don't call user_exit() before calling
>> >> > the exception handler. There is the bug: the exception handler may use RCU, which still
>> >> > thinks we are running in userspace.
>> >>
>> >> #DB doesn't go through this patch -- it uses the paranoid entry path
>> >> and ist_enter. But I see your point. I think that, if we have a
>> >> problem like this in practice, then we should fix it.
>> >
>> > Whatever hack we do to prevent exceptions from happening between real kernel entry
>> > and tracked kernel entry is going to be far less robust than relying strictly on soft
>> > context tracking.
>> >
>>
>> Why?
>>
>> Any exception that doesn't leave the context tracking state exactly
>> the way it found it is buggy. That means that we need to make sure
>> that context tracking itself is safe wrt exceptions and that we need
>> to make sure that any exception that can happen early in entry is
>> itself safe.
>
> Right, and doing it the way we did previously was safe wrt. that.
>
> Can't we have an exception slow path, just like the way we do it for syscalls?
>
> Then the exception slow path would just do:
>
>     if TIF_NOHZ
>             ctx = exception_enter()
>     exception_handler()
>     if TIF_NOHZ
>             exception_exit(ctx)

What's the purpose of TIF_NOHZ right now? For syscalls, it makes
sense, but is there any case in which TIF_NOHZ is set on one CPU but
not on another CPU? It might make sense to get the performance back
using static keys instead of TIF_NOHZ.
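
For reference, the context tracking core already has a static key behind
its enabled check; roughly (from include/linux/context_tracking_state.h,
quoted from memory):

    extern struct static_key context_tracking_enabled;

    static inline bool context_tracking_is_enabled(void)
    {
            return static_key_false(&context_tracking_enabled);
    }

The question is whether the entry code can test that instead of a TIF flag.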

If we switched back to exception_enter, we'd have to remember the
previous state, and, with a single exception right now, I think that's
unnecessary.

I think there are only three states we can be in at exception entry:
user (and user_mode(regs)), kernel (and kernel_mode(regs)), or
NMI-like. In the user case, the new code is correct. In the kernel
case, the new code is also correct. In the NMI case (if we're nested
in an NMI or similar entry), then it is and was the responsibility of
the NMI-like entry to call rcu_nmi_enter(), and things that nest
inside that shouldn't touch context tracking (with the possible
exception of calling rcu_nmi_enter() again).

In current -tip, there's a slight hole in this due to syscalls, and I'll fix it.
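
A sketch of the dispatch implied by those three cases (in_nmi() here is
shorthand for "nested inside an NMI-like entry", not necessarily the
literal test we'd use):

    if (user_mode(regs)) {
            enter_from_user_mode();  /* CONTEXT_USER -> CONTEXT_KERNEL */
    } else if (in_nmi()) {
            /*
             * The outer NMI-like entry already called rcu_nmi_enter();
             * don't touch context tracking state here.
             */
    } else {
            /* Plain kernel context: state is already CONTEXT_KERNEL. */
    }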

>
>>
>> The latter is annoying, but the entry code needs to deal with it
>> anyway. For example, any exception early in NMI is currently really
>> bad. Non-IST exceptions very early in SYSCALL are fatal.
>> Non-paranoid exceptions outside swapgs are fatal. Etc.
>
> Sure, but that doesn't mean I'm happy with introducing new fragile paths
> like those. Especially as we have a way to fix this without more overhead.

I think my approach can work with even less overhead: there are fewer
branches, since we no longer check the previous state.

>> > Also, as long as there is at least one instruction between entry to the kernel
>> > and context tracking noting it, there is a risk of an exception. Hence entry
>> > code will never be atomic enough to avoid this kind of bug.
>>
>> By that argument, we're doomed. Non-IST exceptions outside swapgs are fatal.
>
> Does that concern only error_entry() exceptions?

Yes, but the set of paranoid_entry exceptions is shrinking. In -tip, there are:

NMI: NMI is special and will call rcu_nmi_enter(). Nothing's changing here.

MCE: Once upon a time, MCE was simply buggy. As of 4.0 (IIRC) MCE
from kernel mode calls rcu_nmi_enter().

BP: This is going away, I think. #BP should stop being special by 4.4.

DB: That's the only weird case. Patches to prevent instruction
breakpoints in entry code are already in -tip. The only thing left is
kernel watchpoints, and we need to do something about that.
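
For the paranoid entries, the RCU side lives in ist_enter(); simplified
from arch/x86/kernel/traps.c (from memory, so treat it as a sketch):

    void ist_enter(struct pt_regs *regs)
    {
            if (user_mode(regs)) {
                    /* Normal rules: the entry code already woke RCU. */
                    RCU_LOCKDEP_WARN(!rcu_is_watching(),
                                     "entry code didn't wake RCU");
            } else {
                    /* We might be mid-entry, so use the NMI-safe hook. */
                    rcu_nmi_enter();
            }
            preempt_count_add(HARDIRQ_OFFSET);
    }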

>
>> >
>> > Heh if only we had something like local_exception_save()!
>>
>> What would that mean?
>>
>> Exceptions aren't magic asynchronous things. They happen only when
>> you do something that can trigger an exception.
>
> Sure, but did you really never wish to have such an API? :-p

:)

--
Andy Lutomirski
AMA Capital Management, LLC

2015-08-18 22:34:16

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Wed, Aug 12, 2015 at 07:59:44AM -0700, Andy Lutomirski wrote:
> On Wed, Aug 12, 2015 at 6:32 AM, Frederic Weisbecker <[email protected]> wrote:
> > Right, and doing it the way we did previously was safe wrt. that.
> >
> > Can't we have an exception slow path, just like the way we do it for syscalls?
> >
> > Then the exception slow path would just do:
> >
> >     if TIF_NOHZ
> >             ctx = exception_enter()
> >     exception_handler()
> >     if TIF_NOHZ
> >             exception_exit(ctx)
>
> What's the purpose of TIF_NOHZ right now? For syscalls, it makes
> sense, but is there any case in which TIF_NOHZ is set on one CPU but
> not on another CPU? It might make sense to get the performance back
> using static keys instead of TIF_NOHZ.

Sure if we can manage to do that. The nice thing about TIF flags is that
they are a single check that is always there.

>
> If we switched back to exception_enter, we'd have to remember the
> previous state, and, with a single exception right now, I think that's
> unnecessary.
>
> I think there are only three states we can be in at exception entry:
> user (and user_mode(regs)), kernel (and kernel_mode(regs)), or
> NMI-like.

But we can have user && (!user_mode(regs)) if an exception happens in the
exception entry code.

> In the user case, the new code is correct. In the kernel
> case, the new code is also correct. In the NMI case (if we're nested
> in an NMI or similar entry), then it is and was the responsibility of
> the NMI-like entry to call rcu_nmi_enter(), and things that nest
> inside that shouldn't touch context tracking (with the possible
> exception of calling rcu_nmi_enter() again).
>
> In current -tip, there's a slight hole in this due to syscalls, and I'll fix it.

There must be a check for context tracking enabled anyway. So why can't
we just do this in the exception entry code:

    if (exception_slow_path()) {
            exception_enter()
            exception_handler()
            exception_exit()
    } else {
            normal stuff
    }

Especially if we can manage to implement static keys in ASM, this will sum up to
a single check.

> >> The latter is annoying, but the entry code needs to deal with it
> >> anyway. For example, any exception early in NMI is currently really
> >> bad. Non-IST exceptions very early in SYSCALL are fatal.
> >> Non-paranoid exceptions outside swapgs are fatal. Etc.
> >
> > Sure, but that doesn't mean I'm happy with introducing new fragile paths
> > like those. Especially as we have a way to fix this without more overhead.
>
> I think my approach can work with even less overhead: there are fewer
> branches, since we no longer check the previous state.
>
> >> > Also, as long as there is at least one instruction between entry to the kernel
> >> > and context tracking noting it, there is a risk of an exception. Hence entry
> >> > code will never be atomic enough to avoid this kind of bug.
> >>
> >> By that argument, we're doomed. Non-IST exceptions outside swapgs are fatal.
> >
> > Does that concern only error_entry() exceptions?
>
> Yes, but the set of paranoid_entry exceptions is shrinking. In -tip, there are:
>
> NMI: NMI is special and will call rcu_nmi_enter(). Nothing's changing here.
>
> MCE: Once upon a time, MCE was simply buggy. As of 4.0 (IIRC) MCE
> from kernel mode calls rcu_nmi_enter().
>
> BP: This is going away, I think. #BP should stop being special by 4.4.
>
> DB: That's the only weird case. Patches to prevent instruction
> breakpoints in entry code are already in -tip. The only thing left is
> kernel watchpoints, and we need to do something about that.

So now we can't set a breakpoint on syscall entry anymore?

I'm still nervous about all that.

2015-08-18 22:40:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Aug 18, 2015 at 3:34 PM, Frederic Weisbecker <[email protected]> wrote:
> On Wed, Aug 12, 2015 at 07:59:44AM -0700, Andy Lutomirski wrote:
>> On Wed, Aug 12, 2015 at 6:32 AM, Frederic Weisbecker <[email protected]> wrote:
>> > Right, and doing it the way we did previously was safe wrt. that.
>> >
>> > Can't we have an exception slow path, just like the way we do it for syscalls?
>> >
>> > Then the exception slow path would just do:
>> >
>> >     if TIF_NOHZ
>> >             ctx = exception_enter()
>> >     exception_handler()
>> >     if TIF_NOHZ
>> >             exception_exit(ctx)
>>
>> What's the purpose of TIF_NOHZ right now? For syscalls, it makes
>> sense, but is there any case in which TIF_NOHZ is set on one CPU but
>> not on another CPU? It might make sense to get the performance back
>> using static keys instead of TIF_NOHZ.
>
> Sure if we can manage to do that. The nice thing about TIF flags is that
> they are a single check that is always there.
>

True, although my patch loses that benefit for the fast compat entries
due to the syscall arg fault stuff (what a mess!).

>>
>> If we switched back to exception_enter, we'd have to remember the
>> previous state, and, with a single exception right now, I think that's
>> unnecessary.
>>
>> I think there are only three states we can be in at exception entry:
>> user (and user_mode(regs)), kernel (and kernel_mode(regs)), or
>> NMI-like.
>
> But we can have user && (!user_mode(regs)) if an exception happens in the
> exception entry code.

I sure hope not, unless it nests inside an NMI-like thing. It's
conceivable that this might happen due to perf NMIs causing a failed
MSR read or similar. We might need to relax the assertions to check
that we're either in kernel or NMI context. If so, that's
straightforward. Meanwhile no one has reported this happening.
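
Relaxing it would be something like this hypothetical variant:

    /* Hypothetical: allow NMI-nested entries to pass the existing check. */
    CT_WARN_ON(ct_state() != CONTEXT_KERNEL && !in_nmi());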

>
>> In the user case, the new code is correct. In the kernel
>> case, the new code is also correct. In the NMI case (if we're nested
> >> in an NMI or similar entry), then it is and was the responsibility of
>> the NMI-like entry to call rcu_nmi_enter(), and things that nest
>> inside that shouldn't touch context tracking (with the possible
>> exception of calling rcu_nmi_enter() again).
>>
>> In current -tip, there's a slight hole in this due to syscalls, and I'll fix it.
>
> There must be a check for context tracking enabled anyway. So why can't
> we just do this in the exception entry code:
>
>     if (exception_slow_path()) {
>             exception_enter()
>             exception_handler()
>             exception_exit()
>     } else {
>             normal stuff
>     }
>
> Especially if we can manage to implement static keys in ASM, this will sum up to
> a single check.

There isn't really an exception slow path. There's already a branch
for user vs kernel (in the CPL sense), and with my patches, there's no
additional branch for previous context tracking state.

>
>> >> The latter is annoying, but the entry code needs to deal with it
>> >> anyway. For example, any exception early in NMI is currently really
>> >> bad. Non-IST exceptions very early in SYSCALL are fatal.
>> >> Non-paranoid exceptions outside swapgs are fatal. Etc.
>> >
>> > Sure, but that doesn't mean I'm happy with introducing new fragile paths
>> > like those. Especially as we have a way to fix this without more overhead.
>>
>> I think my approach can work with even less overhead: there are fewer
>> branches, since we no longer check the previous state.
>>
>> >> > Also, as long as there is at least one instruction between entry to the kernel
>> >> > and context tracking noting it, there is a risk of an exception. Hence entry
>> >> > code will never be atomic enough to avoid this kind of bug.
>> >>
>> >> By that argument, we're doomed. Non-IST exceptions outside swapgs are fatal.
>> >
>> > Does that concern only error_entry() exceptions?
>>
>> Yes, but the set of paranoid_entry exceptions is shrinking. In -tip, there are:
>>
>> NMI: NMI is special and will call rcu_nmi_enter(). Nothing's changing here.
>>
>> MCE: Once upon a time, MCE was simply buggy. As of 4.0 (IIRC) MCE
>> from kernel mode calls rcu_nmi_enter().
>>
>> BP: This is going away, I think. #BP should stop being special by 4.4.
>>
>> DB: That's the only weird case. Patches to prevent instruction
>> breakpoints in entry code are already in -tip. The only thing left is
>> kernel watchpoints, and we need to do something about that.
>
> So now we can't set a breakpoint on syscall entry anymore?
>
> I'm still nervous about all that.

We haven't done anything that would make breakpoints on syscall entry
less safe than they were, but we now disallow the breakpoints. In the
future, we might take advantage of that change.

--
Andy Lutomirski
AMA Capital Management, LLC

2015-08-19 17:18:17

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Tue, Aug 18, 2015 at 03:40:20PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 18, 2015 at 3:34 PM, Frederic Weisbecker <[email protected]> wrote:
> >> If we switched back to exception_enter, we'd have to remember the
> >> previous state, and, with a single exception right now, I think that's
> >> unnecessary.
> >>
> >> I think there are only three states we can be in at exception entry:
> >> user (and user_mode(regs)), kernel (and kernel_mode(regs)), or
> >> NMI-like.
> >
> > But we can have user && (!user_mode(regs)) if an exception happens in the
> > exception entry code.
>
> I sure hope not, unless it nests inside an NMI-like thing. It's
> conceivable that this might happen due to perf NMIs causing a failed
> MSR read or similar. We might need to relax the assertions to check
> that we're either in kernel or NMI context. If so, that's
> straightforward. Meanwhile no one has reported this happening.

But we can still have #DB on entry code, right? We blocked breakpoints on entry
code (I still don't get why; it looks to me like overkill), but we still
have watchpoints.

>
> >
> >> In the user case, the new code is correct. In the kernel
> >> case, the new code is also correct. In the NMI case (if we're nested
> >> in an NMI or similar entry)) then it is and was the responsibility of
> >> the NMI-like entry to call rcu_nmi_enter(), and things that nest
> >> inside that shouldn't touch context tracking (with the possible
> >> exception of calling rcu_nmi_enter() again).
> >>
> >> In current -tip, there's a slight hole in this due to syscalls, and I'll fix it.
> >
> > There must be a check for context tracking enabled anyway. So why can't
> > we just do this in the exception entry code:
> >
> >     if (exception_slow_path()) {
> >             exception_enter()
> >             exception_handler()
> >             exception_exit()
> >     } else {
> >             normal stuff
> >     }
> >
> > Especially if we can manage to implement static keys in ASM, this will sum up to
> > a single check.
>
> There isn't really an exception slow path. There's already a branch
> for user vs kernel (in the CPL sense), and with my patches, there's no
> additional branch for previous context tracking state.

But an exception slow path based on a static key would be the most lightweight
thing for the context tracking off-case (which is 99.9999% of use cases), and we
would keep it robust (ie: no need to enumerate every fragile spot in the entry
code where an exception must be impossible for it to be safe).
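
Concretely, something like this sketch (do_the_fault() is just a stand-in
for whatever arch handler gets wrapped; context_tracking_is_enabled() is
the existing static-key test):

    if (context_tracking_is_enabled()) {
            /* Slow path: save/restore the soft state around the handler. */
            enum ctx_state prev = exception_enter();
            do_the_fault(regs, error_code);
            exception_exit(prev);
    } else {
            do_the_fault(regs, error_code);  /* fast path, no tracking */
    }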

> > So now we can't set a breakpoint on syscall entry anymore?
> >
> > I'm still nervous about all that.
>
> We haven't done anything that would make breakpoints on syscall entry
> less safe than they were, but we now disallow the breakpoints. In the
> future, we might take advantage of that change.

I still don't get the reason for that.

2015-08-19 18:03:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86/asm/entry/64: Migrate error and IRQ exit work to C and remove old assembly code

On Wed, Aug 19, 2015 at 10:18 AM, Frederic Weisbecker
<[email protected]> wrote:
> On Tue, Aug 18, 2015 at 03:40:20PM -0700, Andy Lutomirski wrote:
>>
>> I sure hope not, unless it nests inside an NMI-like thing. It's
>> conceivable that this might happen due to perf NMIs causing a failed
>> MSR read or similar. We might need to relax the assertions to check
>> that we're either in kernel or NMI context. If so, that's
>> straightforward. Meanwhile no one has reported this happening.
>
> But we can still have #DB on entry code, right? We blocked breakpoints on entry
> code (I still don't get why; it looks to me like overkill), but we still
> have watchpoints.

The actual reason is buried in the many threads about NMIs.
Basically, we want to start using RET to return from exceptions to
contexts with IF=0, but we can't do that if we need RF to work
correctly, and we need RF to work correctly if we allow breakpoints in
entry asm (otherwise we risk random infinite loops). So we're
disallowing breakpoints in entry asm.

> But an exception slow path based on a static key would be the most lightweight
> thing for the context tracking off-case (which is 99.9999% of use cases), and we
> would keep it robust (ie: no need to enumerate every fragile spot in the entry
> code where an exception must be impossible for it to be safe).
>

IRQs work more or less like this in -tip (restructured, but this gets the gist):

    if (user_mode(regs)) {
            swapgs;
            enter_from_user_mode;
            do_IRQ;
            prepare_exit_to_usermode;
            swapgs;
            iret;
    } else {
            do_IRQ;
            check for preemption;
            iret;
    }

In 4.2 and before, the enter_from_user_mode call wasn't there, and instead
of calling prepare_exit_to_usermode in a known context
(CONTEXT_KERNEL), we went through the maze of retint_user in an
unknown context. That meant that we needed things like SCHEDULE_USER
(which had a bug at some point), do_notify_resume (probably had tons
of bugs), etc, and somehow we still needed to end up in CONTEXT_USER
at the end.

I think the new state of affairs is much nicer. It means that we
finally actually know what state we're in throughout the entry asm.
The only real downsides that I can see are:

1. There's an unnecessary pair of branches due to rcu_irq_enter and
rcu_irq_exit when an IRQ hits user mode.

2. If user_exit is indeed much more expensive than rcu_irq_enter, then
we pay that cost.

If you have suggestions for how to make this faster without making it
uglier, please let me know. :)

--Andy

2015-12-21 20:50:50

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH v5 08/17] x86/entry: Add enter_from_user_mode and use it in syscalls

On 07/03/2015 03:44 PM, Andy Lutomirski wrote:
> Changing the x86 context tracking hooks is dangerous because there
> are no good checks that we track our context correctly. Add a
> helper to check that we're actually in CONTEXT_USER when we enter
> from user mode and wire it up for syscall entries.
>
> Subsequent patches will wire this up for all non-NMI entries as
> well. NMIs are their own special beast and cannot currently switch
> overall context tracking state. Instead, they have their own
> special RCU hooks.
>
> This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
> branch) and a tiny slowdown if CONFIG_CONTEXT_TRACKING (adds a layer
> of indirection). Eventually, we should fix up the core context
> tracking code to supply a function that does what we want (and can
> be much simpler than user_exit), which will enable us to get rid of
> the extra call.

Hey Andy,

I see the following warning in today's -next:

[ 2162.706868] ------------[ cut here ]------------
[ 2162.708021] WARNING: CPU: 4 PID: 28801 at arch/x86/entry/common.c:44 enter_from_user_mode+0x1c/0x50()
[ 2162.709466] Modules linked in:
[ 2162.709998] CPU: 4 PID: 28801 Comm: trinity-c375 Tainted: G B 4.4.0-rc5-next-20151221-sasha-00020-g840272e-dirty #2753
[ 2162.711847] 0000000000000000 00000000f17e6fcd ffff880292d5fe08 ffffffffa4045334
[ 2162.713108] 0000000041b58ab3 ffffffffaf66686b ffffffffa4045289 ffff880292d5fdc0
[ 2162.714544] 0000000000000000 00000000f17e6fcd ffffffffa23cf466 0000000000000004
[ 2162.715793] Call Trace:
[ 2162.716229] dump_stack (lib/dump_stack.c:52)
[ 2162.719021] warn_slowpath_common (kernel/panic.c:484)
[ 2162.721014] warn_slowpath_null (kernel/panic.c:518)
[ 2162.721950] enter_from_user_mode (arch/x86/entry/common.c:44 (discriminator 7) include/linux/context_tracking_state.h:30 (discriminator 7) include/linux/context_tracking.h:30 (discriminator 7) arch/x86/entry/common.c:45 (discriminator 7))
[ 2162.722911] syscall_trace_enter_phase1 (arch/x86/entry/common.c:94)
[ 2162.726914] tracesys (arch/x86/entry/entry_64.S:241)
[ 2162.727704] ---[ end trace 1e5b49c361cbfe8b ]---
[ 2162.728468] BUG: scheduling while atomic: trinity-c375/28801/0x00000401
[ 2162.729517] Modules linked in:
[ 2162.730020] Preemption disabled param_attr_store (kernel/params.c:625)
[ 2162.731304]
[ 2162.731579] CPU: 4 PID: 28801 Comm: trinity-c375 Tainted: G B W 4.4.0-rc5-next-20151221-sasha-00020-g840272e-dirty #2753
[ 2162.733432] 0000000000000000 00000000f17e6fcd ffff880292d5fe20 ffffffffa4045334
[ 2162.734778] 0000000041b58ab3 ffffffffaf66686b ffffffffa4045289 ffff880292d5fde0
[ 2162.736036] fffffffface198f9 00000000f17e6fcd ffff880292d5fe50 0000000000000282
[ 2162.737309] Call Trace:
[ 2162.737718] dump_stack (lib/dump_stack.c:52)
[ 2162.740566] __schedule_bug (kernel/sched/core.c:3102)
[ 2162.741498] __schedule (./arch/x86/include/asm/preempt.h:27 kernel/sched/core.c:3116 kernel/sched/core.c:3225)
[ 2162.742391] schedule (kernel/sched/core.c:3312 (discriminator 1))
[ 2162.743221] exit_to_usermode_loop (arch/x86/entry/common.c:246)
[ 2162.744331] syscall_return_slowpath (arch/x86/entry/common.c:282 include/linux/context_tracking_state.h:30 include/linux/context_tracking.h:24 arch/x86/entry/common.c:284 arch/x86/entry/common.c:344)
[ 2162.745364] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282)


Thanks,
Sasha

2015-12-21 22:44:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 08/17] x86/entry: Add enter_from_user_mode and use it in syscalls

On Mon, Dec 21, 2015 at 12:50 PM, Sasha Levin <[email protected]> wrote:
> On 07/03/2015 03:44 PM, Andy Lutomirski wrote:
>> Changing the x86 context tracking hooks is dangerous because there
>> are no good checks that we track our context correctly. Add a
>> helper to check that we're actually in CONTEXT_USER when we enter
>> from user mode and wire it up for syscall entries.
>>
>> Subsequent patches will wire this up for all non-NMI entries as
>> well. NMIs are their own special beast and cannot currently switch
>> overall context tracking state. Instead, they have their own
>> special RCU hooks.
>>
>> This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
>> branch) and a tiny slowdown if CONFIG_CONTEXT_TRACKING (adds a layer
>> of indirection). Eventually, we should fix up the core context
>> tracking code to supply a function that does what we want (and can
>> be much simpler than user_exit), which will enable us to get rid of
>> the extra call.
>
> Hey Andy,
>
> I see the following warning in today's -next:

Weird. I wonder if you might have hit this while switching context
tracking on at runtime. (Can you even do that?)

--Andy


>
> [ 2162.706868] ------------[ cut here ]------------
> [ 2162.708021] WARNING: CPU: 4 PID: 28801 at arch/x86/entry/common.c:44 enter_from_user_mode+0x1c/0x50()
> [ 2162.709466] Modules linked in:
> [ 2162.709998] CPU: 4 PID: 28801 Comm: trinity-c375 Tainted: G B 4.4.0-rc5-next-20151221-sasha-00020-g840272e-dirty #2753
> [ 2162.711847] 0000000000000000 00000000f17e6fcd ffff880292d5fe08 ffffffffa4045334
> [ 2162.713108] 0000000041b58ab3 ffffffffaf66686b ffffffffa4045289 ffff880292d5fdc0
> [ 2162.714544] 0000000000000000 00000000f17e6fcd ffffffffa23cf466 0000000000000004
> [ 2162.715793] Call Trace:
> [ 2162.716229] dump_stack (lib/dump_stack.c:52)
> [ 2162.719021] warn_slowpath_common (kernel/panic.c:484)
> [ 2162.721014] warn_slowpath_null (kernel/panic.c:518)
> [ 2162.721950] enter_from_user_mode (arch/x86/entry/common.c:44 (discriminator 7) include/linux/context_tracking_state.h:30 (discriminator 7) include/linux/context_tracking.h:30 (discriminator 7) arch/x86/entry/common.c:45 (discriminator 7))
> [ 2162.722911] syscall_trace_enter_phase1 (arch/x86/entry/common.c:94)
> [ 2162.726914] tracesys (arch/x86/entry/entry_64.S:241)
> [ 2162.727704] ---[ end trace 1e5b49c361cbfe8b ]---
> [ 2162.728468] BUG: scheduling while atomic: trinity-c375/28801/0x00000401
> [ 2162.729517] Modules linked in:
> [ 2162.730020] Preemption disabled param_attr_store (kernel/params.c:625)
> [ 2162.731304]
> [ 2162.731579] CPU: 4 PID: 28801 Comm: trinity-c375 Tainted: G B W 4.4.0-rc5-next-20151221-sasha-00020-g840272e-dirty #2753
> [ 2162.733432] 0000000000000000 00000000f17e6fcd ffff880292d5fe20 ffffffffa4045334
> [ 2162.734778] 0000000041b58ab3 ffffffffaf66686b ffffffffa4045289 ffff880292d5fde0
> [ 2162.736036] fffffffface198f9 00000000f17e6fcd ffff880292d5fe50 0000000000000282
> [ 2162.737309] Call Trace:
> [ 2162.737718] dump_stack (lib/dump_stack.c:52)
> [ 2162.740566] __schedule_bug (kernel/sched/core.c:3102)
> [ 2162.741498] __schedule (./arch/x86/include/asm/preempt.h:27 kernel/sched/core.c:3116 kernel/sched/core.c:3225)
> [ 2162.742391] schedule (kernel/sched/core.c:3312 (discriminator 1))
> [ 2162.743221] exit_to_usermode_loop (arch/x86/entry/common.c:246)
> [ 2162.744331] syscall_return_slowpath (arch/x86/entry/common.c:282 include/linux/context_tracking_state.h:30 include/linux/context_tracking.h:24 arch/x86/entry/common.c:284 arch/x86/entry/common.c:344)
> [ 2162.745364] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282)
>
>
> Thanks,
> Sasha



--
Andy Lutomirski
AMA Capital Management, LLC