2015-06-16 20:17:17

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code

This is incomplete, but it's finally good enough that I think it's
time to get other opinions on it. It is a complete rewrite of the
slow path code that handles exits to user mode.

The exit-to-usermode code is copied in several places and is written
in a nasty combination of asm and C. It's not at all clear what
it's supposed to do, and the way it's structured makes it very hard
to work with. For example, it's not even clear why syscall exit
hooks are called only once per syscall right now. (It seems to be a
side effect of the way that rdi and rdx are handled in the asm loop,
and it seems reliable, but it's still pointlessly complicated.) The
existing code also makes context tracking overly complicated and
hard to understand. Finally, it's nearly impossible for anyone to
change what happens on exit to usermode, since the existing code is
so fragile.

I tried to clean it up incrementally, but I decided it was too hard.
Instead, this series just replaces the code. It seems to work.

Context tracking in particular works very differently now. The
low-level entry code checks that we're in CONTEXT_USER and switches
to CONTEXT_KERNEL. The exit code does the reverse. There is no
need to track what CONTEXT_XYZ state we came from, because we
already know. Similarly, SCHEDULE_USER is gone, since we can
reschedule if needed by simply calling schedule() from C code.
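
As a rough sketch of that pairing (the real implementations are in patches
5 and 6 of this series; this is only illustrative):

	/* On entry from user mode, with IRQs off: */
	asmlinkage __visible void enter_from_user_mode(void)
	{
		context_tracking_assert_state(CONTEXT_USER);
		user_exit();		/* CONTEXT_USER -> CONTEXT_KERNEL */
	}

	/* On the way back out, once all exit work is done: */
	asmlinkage __visible void prepare_exit_to_usermode(struct pt_regs *regs)
	{
		/* ... handle TIF work, possibly calling schedule() ... */
		user_enter();		/* CONTEXT_KERNEL -> CONTEXT_USER */
	}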

The main things that are missing are that I haven't done the 32-bit
parts (anyone want to help?) and therefore I haven't deleted the old
C code. I also think this may break UML for trivial reasons.

Because I haven't converted the 32-bit code yet, all of the now-unnecessary
calls to exception_enter are still present in traps.c.

IRQ context tracking is still duplicated. We should probably clean
it up by changing the core code to supply something like
irq_enter_we_are_already_in_context_kernel.

Thoughts?

Andy Lutomirski (13):
context_tracking: Add context_tracking_assert_state
notifiers: Assert that RCU is watching in notify_die
x86: Move C entry and exit code to arch/x86/entry/common.c
x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries
x86/entry: Add enter_from_user_mode and use it in syscalls
x86/entry: Add new, comprehensible entry and exit hooks
x86/entry/64: Really create an error-entry-from-usermode code path
x86/entry/64: Migrate 64-bit syscalls to new exit hooks
x86/entry/compat: Migrate compat syscalls to new exit hooks
x86/asm/entry/64: Save all regs on interrupt entry
x86/asm/entry/64: Simplify irq stack pt_regs handling
x86/asm/entry/64: Migrate error and interrupt exit work to C
x86/entry: Remove SCHEDULE_USER and asm/context_tracking.h

arch/x86/entry/Makefile | 1 +
arch/x86/entry/common.c | 372 ++++++++++++++++++++++++++++++++
arch/x86/entry/entry_64.S | 176 ++++-----------
arch/x86/entry/entry_64_compat.S | 7 +-
arch/x86/include/asm/context_tracking.h | 10 -
arch/x86/include/asm/signal.h | 1 +
arch/x86/kernel/ptrace.c | 202 +----------------
arch/x86/kernel/signal.c | 28 +--
arch/x86/kernel/traps.c | 9 +
include/linux/context_tracking.h | 8 +
kernel/notifier.c | 2 +
11 files changed, 439 insertions(+), 377 deletions(-)
create mode 100644 arch/x86/entry/common.c
delete mode 100644 arch/x86/include/asm/context_tracking.h

--
2.4.3


2015-06-16 20:17:31

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state

This will let us sprinkle sanity checks around the kernel without
making too much of a mess.

Signed-off-by: Andy Lutomirski <[email protected]>
---
include/linux/context_tracking.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838256b4..0fbea4b152e1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
if (context_tracking_is_enabled())
__context_tracking_task_switch(prev, next);
}
+
+static inline void context_tracking_assert_state(enum ctx_state state)
+{
+ rcu_lockdep_assert(!context_tracking_is_enabled() ||
+ this_cpu_read(context_tracking.state) == state,
+ "context tracking state was wrong");
+}
#else
static inline void user_enter(void) { }
static inline void user_exit(void) { }
@@ -64,6 +71,7 @@ static inline enum ctx_state exception_enter(void) { return 0; }
static inline void exception_exit(enum ctx_state prev_ctx) { }
static inline void context_tracking_task_switch(struct task_struct *prev,
struct task_struct *next) { }
+static inline void context_tracking_assert_state(enum ctx_state state) { }
#endif /* !CONFIG_CONTEXT_TRACKING */


--
2.4.3

2015-06-16 20:17:44

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 02/13] notifiers: Assert that RCU is watching in notify_die

Low-level arch entries often call notify_die, and it's easy for arch
code to fail to exit an RCU quiescent state first. Assert that
we're not quiescent in notify_die.

Signed-off-by: Andy Lutomirski <[email protected]>
---
kernel/notifier.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/notifier.c b/kernel/notifier.c
index ae9fc7cc360e..980e4330fb59 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -544,6 +544,8 @@ int notrace notify_die(enum die_val val, const char *str,
.signr = sig,

};
+ rcu_lockdep_assert(rcu_is_watching(),
+ "notify_die called but RCU thinks we're quiescent");
return atomic_notifier_call_chain(&die_chain, val, &args);
}
NOKPROBE_SYMBOL(notify_die);
--
2.4.3

2015-06-16 20:18:22

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 03/13] x86: Move C entry and exit code to arch/x86/entry/common.c

The entry and exit C helpers were confusingly scattered between
ptrace.c and signal.c, even though they aren't specific to ptrace or
signal handling. Move them together in a new file.

This change just moves code around. It doesn't change any behavior.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/Makefile | 1 +
arch/x86/entry/common.c | 253 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/signal.h | 1 +
arch/x86/kernel/ptrace.c | 202 +--------------------------------
arch/x86/kernel/signal.c | 28 +----
5 files changed, 257 insertions(+), 228 deletions(-)
create mode 100644 arch/x86/entry/common.c

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 7a144971db79..bd55dedd7614 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -2,6 +2,7 @@
# Makefile for the x86 low level entry code
#
obj-y := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
+obj-y += common.o

obj-y += vdso/
obj-y += vsyscall/
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
new file mode 100644
index 000000000000..348465473e55
--- /dev/null
+++ b/arch/x86/entry/common.c
@@ -0,0 +1,253 @@
+/*
+ * entry.c - C code for kernel entry and exit
+ * Copyright (c) 2015 Andrew Lutomirski
+ * GPL v2
+ *
+ * Based on asm and ptrace code by many authors. The code here originated
+ * in ptrace.c and signal.c.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/errno.h>
+#include <linux/ptrace.h>
+#include <linux/tracehook.h>
+#include <linux/audit.h>
+#include <linux/seccomp.h>
+#include <linux/signal.h>
+#include <linux/export.h>
+#include <linux/context_tracking.h>
+#include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>
+
+#include <asm/desc.h>
+#include <asm/traps.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/syscalls.h>
+
+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+#ifdef CONFIG_X86_64
+ if (arch == AUDIT_ARCH_X86_64) {
+ audit_syscall_entry(regs->orig_ax, regs->di,
+ regs->si, regs->dx, regs->r10);
+ } else
+#endif
+ {
+ audit_syscall_entry(regs->orig_ax, regs->bx,
+ regs->cx, regs->dx, regs->si);
+ }
+}
+
+/*
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2. If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0: resume the syscall
+ * 1: go to phase 2; no seccomp phase 2 needed
+ * anything else: go to phase 2; pass return value to seccomp
+ */
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
+{
+ unsigned long ret = 0;
+ u32 work;
+
+ BUG_ON(regs != task_pt_regs(current));
+
+ work = ACCESS_ONCE(current_thread_info()->flags) &
+ _TIF_WORK_SYSCALL_ENTRY;
+
+ /*
+ * If TIF_NOHZ is set, we are required to call user_exit() before
+ * doing anything that could touch RCU.
+ */
+ if (work & _TIF_NOHZ) {
+ user_exit();
+ work &= ~_TIF_NOHZ;
+ }
+
+#ifdef CONFIG_SECCOMP
+ /*
+ * Do seccomp first -- it should minimize exposure of other
+ * code, and keeping seccomp fast is probably more valuable
+ * than the rest of this.
+ */
+ if (work & _TIF_SECCOMP) {
+ struct seccomp_data sd;
+
+ sd.arch = arch;
+ sd.nr = regs->orig_ax;
+ sd.instruction_pointer = regs->ip;
+#ifdef CONFIG_X86_64
+ if (arch == AUDIT_ARCH_X86_64) {
+ sd.args[0] = regs->di;
+ sd.args[1] = regs->si;
+ sd.args[2] = regs->dx;
+ sd.args[3] = regs->r10;
+ sd.args[4] = regs->r8;
+ sd.args[5] = regs->r9;
+ } else
+#endif
+ {
+ sd.args[0] = regs->bx;
+ sd.args[1] = regs->cx;
+ sd.args[2] = regs->dx;
+ sd.args[3] = regs->si;
+ sd.args[4] = regs->di;
+ sd.args[5] = regs->bp;
+ }
+
+ BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+ BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+ ret = seccomp_phase1(&sd);
+ if (ret == SECCOMP_PHASE1_SKIP) {
+ regs->orig_ax = -1;
+ ret = 0;
+ } else if (ret != SECCOMP_PHASE1_OK) {
+ return ret; /* Go directly to phase 2 */
+ }
+
+ work &= ~_TIF_SECCOMP;
+ }
+#endif
+
+ /* Do our best to finish without phase 2. */
+ if (work == 0)
+ return ret; /* seccomp and/or nohz only (ret == 0 here) */
+
+#ifdef CONFIG_AUDITSYSCALL
+ if (work == _TIF_SYSCALL_AUDIT) {
+ /*
+ * If there is no more work to be done except auditing,
+ * then audit in phase 1. Phase 2 always audits, so, if
+ * we audit here, then we can't go on to phase 2.
+ */
+ do_audit_syscall_entry(regs, arch);
+ return 0;
+ }
+#endif
+
+ return 1; /* Something is enabled that we can't handle in phase 1 */
+}
+
+/* Returns the syscall nr to run (which should match regs->orig_ax). */
+long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
+ unsigned long phase1_result)
+{
+ long ret = 0;
+ u32 work = ACCESS_ONCE(current_thread_info()->flags) &
+ _TIF_WORK_SYSCALL_ENTRY;
+
+ BUG_ON(regs != task_pt_regs(current));
+
+ /*
+ * If we stepped into a sysenter/syscall insn, it trapped in
+ * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
+ * If user-mode had set TF itself, then it's still clear from
+ * do_debug() and we need to set it again to restore the user
+ * state. If we entered on the slow path, TF was already set.
+ */
+ if (work & _TIF_SINGLESTEP)
+ regs->flags |= X86_EFLAGS_TF;
+
+#ifdef CONFIG_SECCOMP
+ /*
+ * Call seccomp_phase2 before running the other hooks so that
+ * they can see any changes made by a seccomp tracer.
+ */
+ if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
+ /* seccomp failures shouldn't expose any additional code. */
+ return -1;
+ }
+#endif
+
+ if (unlikely(work & _TIF_SYSCALL_EMU))
+ ret = -1L;
+
+ if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
+ tracehook_report_syscall_entry(regs))
+ ret = -1L;
+
+ if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+ trace_sys_enter(regs, regs->orig_ax);
+
+ do_audit_syscall_entry(regs, arch);
+
+ return ret ?: regs->orig_ax;
+}
+
+long syscall_trace_enter(struct pt_regs *regs)
+{
+ u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
+ unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
+
+ if (phase1_result == 0)
+ return regs->orig_ax;
+ else
+ return syscall_trace_enter_phase2(regs, arch, phase1_result);
+}
+
+void syscall_trace_leave(struct pt_regs *regs)
+{
+ bool step;
+
+ /*
+ * We may come here right after calling schedule_user()
+ * or do_notify_resume(), in which case we can be in RCU
+ * user mode.
+ */
+ user_exit();
+
+ audit_syscall_exit(regs);
+
+ if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+ trace_sys_exit(regs, regs->ax);
+
+ /*
+ * If TIF_SYSCALL_EMU is set, we only get here because of
+ * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
+ * We already reported this syscall instruction in
+ * syscall_trace_enter().
+ */
+ step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
+ !test_thread_flag(TIF_SYSCALL_EMU);
+ if (step || test_thread_flag(TIF_SYSCALL_TRACE))
+ tracehook_report_syscall_exit(regs, step);
+
+ user_enter();
+}
+
+/*
+ * notification of userspace execution resumption
+ * - triggered by the TIF_WORK_MASK flags
+ */
+__visible void
+do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
+{
+ user_exit();
+
+ if (thread_info_flags & _TIF_UPROBE)
+ uprobe_notify_resume(regs);
+
+ /* deal with pending signal delivery */
+ if (thread_info_flags & _TIF_SIGPENDING)
+ do_signal(regs);
+
+ if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ }
+ if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
+ fire_user_return_notifiers();
+
+ user_enter();
+}
diff --git a/arch/x86/include/asm/signal.h b/arch/x86/include/asm/signal.h
index 31eab867e6d3..b42408bcf6b5 100644
--- a/arch/x86/include/asm/signal.h
+++ b/arch/x86/include/asm/signal.h
@@ -30,6 +30,7 @@ typedef sigset_t compat_sigset_t;
#endif /* __ASSEMBLY__ */
#include <uapi/asm/signal.h>
#ifndef __ASSEMBLY__
+extern void do_signal(struct pt_regs *regs);
extern void do_notify_resume(struct pt_regs *, void *, __u32);

#define __ARCH_HAS_SA_RESTORER
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..d9417430a4b0 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -37,12 +37,10 @@
#include <asm/proto.h>
#include <asm/hw_breakpoint.h>
#include <asm/traps.h>
+#include <asm/syscall.h>

#include "tls.h"

-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
enum x86_regset {
REGSET_GENERAL,
REGSET_FP,
@@ -1434,201 +1432,3 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
/* Send us the fake SIGTRAP */
force_sig_info(SIGTRAP, &info, tsk);
}
-
-static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
-{
-#ifdef CONFIG_X86_64
- if (arch == AUDIT_ARCH_X86_64) {
- audit_syscall_entry(regs->orig_ax, regs->di,
- regs->si, regs->dx, regs->r10);
- } else
-#endif
- {
- audit_syscall_entry(regs->orig_ax, regs->bx,
- regs->cx, regs->dx, regs->si);
- }
-}
-
-/*
- * We can return 0 to resume the syscall or anything else to go to phase
- * 2. If we resume the syscall, we need to put something appropriate in
- * regs->orig_ax.
- *
- * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
- * are fully functional.
- *
- * For phase 2's benefit, our return value is:
- * 0: resume the syscall
- * 1: go to phase 2; no seccomp phase 2 needed
- * anything else: go to phase 2; pass return value to seccomp
- */
-unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
-{
- unsigned long ret = 0;
- u32 work;
-
- BUG_ON(regs != task_pt_regs(current));
-
- work = ACCESS_ONCE(current_thread_info()->flags) &
- _TIF_WORK_SYSCALL_ENTRY;
-
- /*
- * If TIF_NOHZ is set, we are required to call user_exit() before
- * doing anything that could touch RCU.
- */
- if (work & _TIF_NOHZ) {
- user_exit();
- work &= ~_TIF_NOHZ;
- }
-
-#ifdef CONFIG_SECCOMP
- /*
- * Do seccomp first -- it should minimize exposure of other
- * code, and keeping seccomp fast is probably more valuable
- * than the rest of this.
- */
- if (work & _TIF_SECCOMP) {
- struct seccomp_data sd;
-
- sd.arch = arch;
- sd.nr = regs->orig_ax;
- sd.instruction_pointer = regs->ip;
-#ifdef CONFIG_X86_64
- if (arch == AUDIT_ARCH_X86_64) {
- sd.args[0] = regs->di;
- sd.args[1] = regs->si;
- sd.args[2] = regs->dx;
- sd.args[3] = regs->r10;
- sd.args[4] = regs->r8;
- sd.args[5] = regs->r9;
- } else
-#endif
- {
- sd.args[0] = regs->bx;
- sd.args[1] = regs->cx;
- sd.args[2] = regs->dx;
- sd.args[3] = regs->si;
- sd.args[4] = regs->di;
- sd.args[5] = regs->bp;
- }
-
- BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
- BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
-
- ret = seccomp_phase1(&sd);
- if (ret == SECCOMP_PHASE1_SKIP) {
- regs->orig_ax = -1;
- ret = 0;
- } else if (ret != SECCOMP_PHASE1_OK) {
- return ret; /* Go directly to phase 2 */
- }
-
- work &= ~_TIF_SECCOMP;
- }
-#endif
-
- /* Do our best to finish without phase 2. */
- if (work == 0)
- return ret; /* seccomp and/or nohz only (ret == 0 here) */
-
-#ifdef CONFIG_AUDITSYSCALL
- if (work == _TIF_SYSCALL_AUDIT) {
- /*
- * If there is no more work to be done except auditing,
- * then audit in phase 1. Phase 2 always audits, so, if
- * we audit here, then we can't go on to phase 2.
- */
- do_audit_syscall_entry(regs, arch);
- return 0;
- }
-#endif
-
- return 1; /* Something is enabled that we can't handle in phase 1 */
-}
-
-/* Returns the syscall nr to run (which should match regs->orig_ax). */
-long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
- unsigned long phase1_result)
-{
- long ret = 0;
- u32 work = ACCESS_ONCE(current_thread_info()->flags) &
- _TIF_WORK_SYSCALL_ENTRY;
-
- BUG_ON(regs != task_pt_regs(current));
-
- /*
- * If we stepped into a sysenter/syscall insn, it trapped in
- * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
- * If user-mode had set TF itself, then it's still clear from
- * do_debug() and we need to set it again to restore the user
- * state. If we entered on the slow path, TF was already set.
- */
- if (work & _TIF_SINGLESTEP)
- regs->flags |= X86_EFLAGS_TF;
-
-#ifdef CONFIG_SECCOMP
- /*
- * Call seccomp_phase2 before running the other hooks so that
- * they can see any changes made by a seccomp tracer.
- */
- if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
- /* seccomp failures shouldn't expose any additional code. */
- return -1;
- }
-#endif
-
- if (unlikely(work & _TIF_SYSCALL_EMU))
- ret = -1L;
-
- if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
- tracehook_report_syscall_entry(regs))
- ret = -1L;
-
- if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
- trace_sys_enter(regs, regs->orig_ax);
-
- do_audit_syscall_entry(regs, arch);
-
- return ret ?: regs->orig_ax;
-}
-
-long syscall_trace_enter(struct pt_regs *regs)
-{
- u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
- unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
-
- if (phase1_result == 0)
- return regs->orig_ax;
- else
- return syscall_trace_enter_phase2(regs, arch, phase1_result);
-}
-
-void syscall_trace_leave(struct pt_regs *regs)
-{
- bool step;
-
- /*
- * We may come here right after calling schedule_user()
- * or do_notify_resume(), in which case we can be in RCU
- * user mode.
- */
- user_exit();
-
- audit_syscall_exit(regs);
-
- if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
- trace_sys_exit(regs, regs->ax);
-
- /*
- * If TIF_SYSCALL_EMU is set, we only get here because of
- * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
- * We already reported this syscall instruction in
- * syscall_trace_enter().
- */
- step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
- !test_thread_flag(TIF_SYSCALL_EMU);
- if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
-
- user_enter();
-}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 1ea14fd53933..c0ad42a640b9 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -683,7 +683,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
* want to handle. Thus you cannot kill init even with a SIGKILL even by
* mistake.
*/
-static void do_signal(struct pt_regs *regs)
+void do_signal(struct pt_regs *regs)
{
struct ksignal ksig;

@@ -718,32 +718,6 @@ static void do_signal(struct pt_regs *regs)
restore_saved_sigmask();
}

-/*
- * notification of userspace execution resumption
- * - triggered by the TIF_WORK_MASK flags
- */
-__visible void
-do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
-{
- user_exit();
-
- if (thread_info_flags & _TIF_UPROBE)
- uprobe_notify_resume(regs);
-
- /* deal with pending signal delivery */
- if (thread_info_flags & _TIF_SIGPENDING)
- do_signal(regs);
-
- if (thread_info_flags & _TIF_NOTIFY_RESUME) {
- clear_thread_flag(TIF_NOTIFY_RESUME);
- tracehook_notify_resume(regs);
- }
- if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
- fire_user_return_notifiers();
-
- user_enter();
-}
-
void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
{
struct task_struct *me = current;
--
2.4.3

2015-06-16 20:18:15

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 04/13] x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries

Other than the super-atomic exception entries, all exception entries
are supposed to switch our context tracking state to CONTEXT_KERNEL.
Assert that they do. These assertions appear trivial at this point,
as exception_enter is the function responsible for switching
context, but I'm planning on reworking x86's exception context
tracking, and these assertions will help make sure that all of this
code keeps working.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/traps.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index de379366f6d1..786d359fe824 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -291,6 +291,8 @@ static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
enum ctx_state prev_state = exception_enter();
siginfo_t info;

+ context_tracking_assert_state(CONTEXT_KERNEL);
+
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
NOTIFY_STOP) {
conditional_sti(regs);
@@ -377,6 +379,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
siginfo_t *info;

prev_state = exception_enter();
+ context_tracking_assert_state(CONTEXT_KERNEL);
if (notify_die(DIE_TRAP, "bounds", regs, error_code,
X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
goto exit;
@@ -458,6 +461,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ context_tracking_assert_state(CONTEXT_KERNEL);
conditional_sti(regs);

if (v8086_mode(regs)) {
@@ -515,6 +519,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
return;

prev_state = ist_enter(regs);
+ context_tracking_assert_state(CONTEXT_KERNEL);
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
SIGTRAP) == NOTIFY_STOP)
@@ -794,6 +799,7 @@ dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ context_tracking_assert_state(CONTEXT_KERNEL);
math_error(regs, error_code, X86_TRAP_MF);
exception_exit(prev_state);
}
@@ -804,6 +810,7 @@ do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ context_tracking_assert_state(CONTEXT_KERNEL);
math_error(regs, error_code, X86_TRAP_XF);
exception_exit(prev_state);
}
@@ -862,6 +869,7 @@ do_device_not_available(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ context_tracking_assert_state(CONTEXT_KERNEL);
BUG_ON(use_eager_fpu());

#ifdef CONFIG_MATH_EMULATION
@@ -891,6 +899,7 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
enum ctx_state prev_state;

prev_state = exception_enter();
+ context_tracking_assert_state(CONTEXT_KERNEL);
local_irq_enable();

info.si_signo = SIGILL;
--
2.4.3

2015-06-16 20:17:58

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 05/13] x86/entry: Add enter_from_user_mode and use it in syscalls

Changing the x86 context tracking hooks is dangerous because there
are no good checks that we track our context correctly. Add a
helper to check that we're actually in CONTEXT_USER when we enter
from user mode and wire it up for syscall entries.

Subsequent patches will wire this up for all non-NMI entries as
well. NMIs are their own special beast and cannot currently switch
overall context tracking state. Instead, they have their own
special RCU hooks.

This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
branch) and a tiny slowdown if CONFIG_CONTEXT_TRACKING (adds a layer
of indirection). Eventually, we should fix up the core context
tracking code to supply a function that does what we want (and can
be much simpler than user_exit), which will enable us to get rid of
the extra call.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/common.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 348465473e55..54af0df50080 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -28,6 +28,15 @@
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>

+#ifdef CONFIG_CONTEXT_TRACKING
+/* Called on entry from user mode with IRQs off. */
+asmlinkage __visible void enter_from_user_mode(void)
+{
+ context_tracking_assert_state(CONTEXT_USER);
+ user_exit();
+}
+#endif
+
static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
{
#ifdef CONFIG_X86_64
@@ -65,14 +74,16 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
work = ACCESS_ONCE(current_thread_info()->flags) &
_TIF_WORK_SYSCALL_ENTRY;

+#ifdef CONFIG_CONTEXT_TRACKING
/*
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
if (work & _TIF_NOHZ) {
- user_exit();
+ enter_from_user_mode();
work &= ~_TIF_NOHZ;
}
+#endif

#ifdef CONFIG_SECCOMP
/*
--
2.4.3

2015-06-16 20:18:29

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 06/13] x86/entry: Add new, comprehensible entry and exit hooks

The current entry and exit code is incomprehensible, appears to work
primarily by luck, and is very difficult to incrementally improve. Add
new code in preparation for simply deleting the old code.

prepare_exit_to_usermode is a new function that will handle all slow
path exits to user mode. It is called with IRQs disabled and it
leaves us in a state in which it is safe to immediately return to
user mode. IRQs must not be re-enabled at any point after
prepare_exit_to_usermode returns and user mode is actually entered.
(We can, of course, fail to enter user mode and treat that failure
as a fresh entry to kernel mode.) All callers of do_notify_resume
will be migrated to call prepare_exit_to_usermode instead;
prepare_exit_to_usermode needs to do everything that
do_notify_resume does, but it also takes care of scheduling and
context tracking. Unlike do_notify_resume, it does not need to be
called in a loop.

syscall_return_slowpath is exactly what it sounds like. It will be
called on any syscall exit slow path. It will replace
syscall_trace_leave and call prepare_exit_to_usermode on the way
out.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/common.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 54af0df50080..c2ad6b60f4c9 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -207,6 +207,7 @@ long syscall_trace_enter(struct pt_regs *regs)
return syscall_trace_enter_phase2(regs, arch, phase1_result);
}

+/* Deprecated. */
void syscall_trace_leave(struct pt_regs *regs)
{
bool step;
@@ -237,8 +238,115 @@ void syscall_trace_leave(struct pt_regs *regs)
user_enter();
}

+static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
+{
+ unsigned long top_of_stack =
+ (unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
+ return (struct thread_info *)(top_of_stack - THREAD_SIZE);
+}
+
+/* Called with IRQs disabled. */
+asmlinkage __visible void prepare_exit_to_usermode(struct pt_regs *regs)
+{
+ if (WARN_ON(!irqs_disabled()))
+ local_irq_enable();
+
+ /*
+ * In order to return to user mode, we need to have IRQs off with
+ * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
+ * _TIF_UPROBE, or _TIF_NEED_RESCHED set. Several of these flags
+ * can be set at any time on preemptable kernels if we have IRQs on,
+ * so we need to loop. Disabling preemption wouldn't help: doing the
+ * work to clear some of the flags can sleep.
+ */
+ while (true) {
+ u32 cached_flags =
+ READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+
+ if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
+ _TIF_UPROBE | _TIF_NEED_RESCHED)))
+ break;
+
+ /* We have work to do. */
+ local_irq_enable();
+
+ if (cached_flags & _TIF_NEED_RESCHED)
+ schedule();
+
+ if (cached_flags & _TIF_UPROBE)
+ uprobe_notify_resume(regs);
+
+ /* deal with pending signal delivery */
+ if (cached_flags & _TIF_SIGPENDING)
+ do_signal(regs);
+
+ if (cached_flags & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ }
+
+ if (cached_flags & _TIF_USER_RETURN_NOTIFY)
+ fire_user_return_notifiers();
+
+ /* Disable IRQs and retry */
+ local_irq_disable();
+ }
+
+ user_enter();
+}
+
+/*
+ * Called with IRQs on and fully valid regs. Returns with IRQs off in a
+ * state such that we can immediately switch to user mode.
+ */
+asmlinkage __visible void syscall_return_slowpath(struct pt_regs *regs)
+{
+ struct thread_info *ti = pt_regs_to_thread_info(regs);
+ u32 cached_flags = READ_ONCE(ti->flags);
+ bool step;
+
+ context_tracking_assert_state(CONTEXT_KERNEL);
+
+ if (WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
+ regs->orig_ax))
+ local_irq_enable();
+
+ /*
+ * First do one-time work. If these work items are enabled, we
+ * want to run them exactly once per syscall exit with IRQs on.
+ */
+ if (cached_flags & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |
+ _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)) {
+ audit_syscall_exit(regs);
+
+ if (cached_flags & _TIF_SYSCALL_TRACEPOINT)
+ trace_sys_exit(regs, regs->ax);
+
+ /*
+ * If TIF_SYSCALL_EMU is set, we only get here because of
+ * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
+ * We already reported this syscall instruction in
+ * syscall_trace_enter().
+ */
+ step = unlikely(
+ (cached_flags & (_TIF_SINGLESTEP | _TIF_SYSCALL_EMU))
+ == _TIF_SINGLESTEP);
+ if (step || cached_flags & _TIF_SYSCALL_TRACE)
+ tracehook_report_syscall_exit(regs, step);
+ }
+
+ /*
+ * Compat syscalls set TS_COMPAT. Make sure we clear it before
+ * returning to user mode.
+ */
+ ti->status &= ~TS_COMPAT;
+
+ local_irq_disable();
+ prepare_exit_to_usermode(regs);
+}
+
/*
- * notification of userspace execution resumption
+ * Deprecated notification of userspace execution resumption
* - triggered by the TIF_WORK_MASK flags
*/
__visible void
--
2.4.3

2015-06-16 20:18:46

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 07/13] x86/entry/64: Really create an error-entry-from-usermode code path

In 539f51136500 ("x86/asm/entry/64: Disentangle error_entry/exit
gsbase/ebx/usermode code"), I arranged the code slightly wrong --
IRET faults would skip the code path that was intended to execute on
all error entries from user mode. Fix it up.

This does not fix a bug, but we'll need it, and it slightly shrinks
the code.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3bb2c4302df1..33acc3dcc281 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1145,7 +1145,11 @@ ENTRY(error_entry)
testb $3, CS+8(%rsp)
jz error_kernelspace

- /* We entered from user mode */
+error_entry_from_usermode:
+ /*
+ * We entered from user mode or we're pretending to have entered
+ * from user mode due to an IRET fault.
+ */
SWAPGS

error_entry_done:
@@ -1174,8 +1178,7 @@ error_kernelspace:
* gsbase and proceed. We'll fix up the exception and land in
* gs_change's error handler with kernel gsbase.
*/
- SWAPGS
- jmp error_entry_done
+ jmp error_entry_from_usermode

bstep_iret:
/* Fix truncated RIP */
--
2.4.3

2015-06-16 20:18:38

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks

This is separate for ease of bisection.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 68 +++++------------------------------------------
1 file changed, 7 insertions(+), 61 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 33acc3dcc281..a5044d7a9d43 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
*/
USERGS_SYSRET64

+GLOBAL(int_ret_from_sys_call_irqs_off)
+ TRACE_IRQS_ON
+ ENABLE_INTERRUPTS(CLBR_NONE)
+ jmp int_ret_from_sys_call
+
/* Do syscall entry tracing */
tracesys:
movq %rsp, %rdi
@@ -272,69 +277,10 @@ tracesys_phase2:
* Has correct iret frame.
*/
GLOBAL(int_ret_from_sys_call)
- DISABLE_INTERRUPTS(CLBR_NONE)
-int_ret_from_sys_call_irqs_off: /* jumps come here from the irqs-off SYSRET path */
- TRACE_IRQS_OFF
- movl $_TIF_ALLWORK_MASK, %edi
- /* edi: mask to check */
-GLOBAL(int_with_check)
- LOCKDEP_SYS_EXIT_IRQ
- GET_THREAD_INFO(%rcx)
- movl TI_flags(%rcx), %edx
- andl %edi, %edx
- jnz int_careful
- andl $~TS_COMPAT, TI_status(%rcx)
- jmp syscall_return
-
- /*
- * Either reschedule or signal or syscall exit tracking needed.
- * First do a reschedule test.
- * edx: work, edi: workmask
- */
-int_careful:
- bt $TIF_NEED_RESCHED, %edx
- jnc int_very_careful
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq %rdi
- SCHEDULE_USER
- popq %rdi
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp int_with_check
-
- /* handle signals and tracing -- both require a full pt_regs */
-int_very_careful:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_EXTRA_REGS
- /* Check for syscall exit trace */
- testl $_TIF_WORK_SYSCALL_EXIT, %edx
- jz int_signal
- pushq %rdi
- leaq 8(%rsp), %rdi /* &ptregs -> arg1 */
- call syscall_trace_leave
- popq %rdi
- andl $~(_TIF_WORK_SYSCALL_EXIT|_TIF_SYSCALL_EMU), %edi
- jmp int_restore_rest
-
-int_signal:
- testl $_TIF_DO_NOTIFY_MASK, %edx
- jz 1f
- movq %rsp, %rdi /* &ptregs -> arg1 */
- xorl %esi, %esi /* oldset -> arg2 */
- call do_notify_resume
-1: movl $_TIF_WORK_MASK, %edi
-int_restore_rest:
+ movq %rsp, %rdi
+ call syscall_return_slowpath /* returns with IRQs disabled */
RESTORE_EXTRA_REGS
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp int_with_check
-
-syscall_return:
- /* The IRETQ could re-enable interrupts: */
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_IRETQ

/*
* Try to use SYSRET instead of IRET if we're returning to
--
2.4.3

2015-06-16 20:20:15

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 09/13] x86/entry/compat: Migrate compat syscalls to new exit hooks

This is separate for ease of bisection.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64_compat.S | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index bb187a6a877c..415afa038edf 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -209,10 +209,10 @@ sysexit_from_sys_call:
.endm

.macro auditsys_exit exit
- testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
- jnz ia32_ret_from_sys_call
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
+ testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+ jnz ia32_ret_from_sys_call
movl %eax, %esi /* second arg, syscall return value */
cmpl $-MAX_ERRNO, %eax /* is it an error ? */
jbe 1f
@@ -227,11 +227,10 @@ sysexit_from_sys_call:
testl %edi, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jz \exit
xorl %eax, %eax /* Do not leak kernel information */
- movq %rax, R11(%rsp)
+ jmp int_ret_from_sys_call_irqs_off
movq %rax, R10(%rsp)
movq %rax, R9(%rsp)
movq %rax, R8(%rsp)
- jmp int_with_check
.endm

sysenter_auditsys:
--
2.4.3

2015-06-16 20:20:05

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 10/13] x86/asm/entry/64: Save all regs on interrupt entry

To prepare for the big rewrite of the error and interrupt exit
paths, we will need pt_regs completely filled in. It's already
completely filled in when error_exit runs, so rearrange interrupt
handling to match it. This will slow down interrupt handling very
slightly (eight instructions), but the simplification it enables
will be more than worth it.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 29 +++++++++--------------------
1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a5044d7a9d43..43bf5762443c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -501,21 +501,13 @@ END(irq_entries_start)
/* 0(%rsp): ~(interrupt number) */
.macro interrupt func
cld
- /*
- * Since nothing in interrupt handling code touches r12...r15 members
- * of "struct pt_regs", and since interrupts can nest, we can save
- * four stack slots and simultaneously provide
- * an unwind-friendly stack layout by saving "truncated" pt_regs
- * exactly up to rbp slot, without these members.
- */
- ALLOC_PT_GPREGS_ON_STACK -RBP
- SAVE_C_REGS -RBP
- /* this goes to 0(%rsp) for unwinder, not for saving the value: */
- SAVE_EXTRA_REGS_RBP -RBP
+ ALLOC_PT_GPREGS_ON_STACK
+ SAVE_C_REGS
+ SAVE_EXTRA_REGS

- leaq -RBP(%rsp), %rdi /* arg1 for \func (pointer to pt_regs) */
+ movq %rsp,%rdi /* arg1 for \func (pointer to pt_regs) */

- testb $3, CS-RBP(%rsp)
+ testb $3, CS(%rsp)
jz 1f
SWAPGS
1:
@@ -552,9 +544,7 @@ ret_from_intr:
decl PER_CPU_VAR(irq_count)

/* Restore saved previous stack */
- popq %rsi
- /* return code expects complete pt_regs - adjust rsp accordingly: */
- leaq -RBP(%rsi), %rsp
+ popq %rsp

testb $3, CS(%rsp)
jz retint_kernel
@@ -579,7 +569,7 @@ retint_swapgs: /* return to user-space */
TRACE_IRQS_IRETQ

SWAPGS
- jmp restore_c_regs_and_iret
+ jmp restore_regs_and_iret

/* Returning to kernel space */
retint_kernel:
@@ -603,6 +593,8 @@ retint_kernel:
* At this label, code paths which return to kernel and to user,
* which come from interrupts/exception and from syscalls, merge.
*/
+restore_regs_and_iret:
+ RESTORE_EXTRA_REGS
restore_c_regs_and_iret:
RESTORE_C_REGS
REMOVE_PT_GPREGS_FROM_STACK 8
@@ -673,12 +665,10 @@ retint_signal:
jz retint_swapgs
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
- SAVE_EXTRA_REGS
movq $-1, ORIG_RAX(%rsp)
xorl %esi, %esi /* oldset */
movq %rsp, %rdi /* &pt_regs */
call do_notify_resume
- RESTORE_EXTRA_REGS
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
GET_THREAD_INFO(%rcx)
@@ -1158,7 +1148,6 @@ END(error_entry)
*/
ENTRY(error_exit)
movl %ebx, %eax
- RESTORE_EXTRA_REGS
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
testl %eax, %eax
--
2.4.3

2015-06-16 20:19:56

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 11/13] x86/asm/entry/64: Simplify irq stack pt_regs handling

There's no need for both rsi and rdi to point to the original stack.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 43bf5762443c..ab8cbf602d19 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -505,8 +505,6 @@ END(irq_entries_start)
SAVE_C_REGS
SAVE_EXTRA_REGS

- movq %rsp,%rdi /* arg1 for \func (pointer to pt_regs) */
-
testb $3, CS(%rsp)
jz 1f
SWAPGS
@@ -518,14 +516,14 @@ END(irq_entries_start)
* a little cheaper to use a separate counter in the PDA (short of
* moving irq_enter into assembly, which would be too much work)
*/
- movq %rsp, %rsi
+ movq %rsp, %rdi
incl PER_CPU_VAR(irq_count)
cmovzq PER_CPU_VAR(irq_stack_ptr), %rsp
- pushq %rsi
+ pushq %rdi
/* We entered an interrupt context - irqs are off: */
TRACE_IRQS_OFF

- call \func
+ call \func /* rdi points to pt_regs */
.endm

/*
--
2.4.3

2015-06-16 20:19:45

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 12/13] x86/asm/entry/64: Migrate error and interrupt exit work to C

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 63 +++++++++++++----------------------------------
1 file changed, 17 insertions(+), 46 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ab8cbf602d19..9ae8b8ab91fa 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -507,7 +507,16 @@ END(irq_entries_start)

testb $3, CS(%rsp)
jz 1f
+
+ /*
+ * IRQ from user mode. Switch to kernel gsbase and inform context
+ * tracking that we're in kernel mode.
+ */
SWAPGS
+#ifdef CONFIG_CONTEXT_TRACKING
+ call enter_from_user_mode
+#endif
+
1:
/*
* Save previous stack pointer, optionally switch to interrupt stack.
@@ -546,26 +555,13 @@ ret_from_intr:

testb $3, CS(%rsp)
jz retint_kernel
- /* Interrupt came from user space */
-retint_user:
- GET_THREAD_INFO(%rcx)

- /* %rcx: thread info. Interrupts are off. */
-retint_with_reschedule:
- movl $_TIF_WORK_MASK, %edi
-retint_check:
+ /* Interrupt came from user space */
LOCKDEP_SYS_EXIT_IRQ
- movl TI_flags(%rcx), %edx
- andl %edi, %edx
- jnz retint_careful
-
-retint_swapgs: /* return to user-space */
- /*
- * The iretq could re-enable interrupts:
- */
- DISABLE_INTERRUPTS(CLBR_ANY)
+retint_user:
+ mov %rsp,%rdi
+ call prepare_exit_to_usermode
TRACE_IRQS_IRETQ
-
SWAPGS
jmp restore_regs_and_iret

@@ -643,35 +639,6 @@ native_irq_return_ldt:
popq %rax
jmp native_irq_return_iret
#endif
-
- /* edi: workmask, edx: work */
-retint_careful:
- bt $TIF_NEED_RESCHED, %edx
- jnc retint_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq %rdi
- SCHEDULE_USER
- popq %rdi
- GET_THREAD_INFO(%rcx)
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp retint_check
-
-retint_signal:
- testl $_TIF_DO_NOTIFY_MASK, %edx
- jz retint_swapgs
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- movq $-1, ORIG_RAX(%rsp)
- xorl %esi, %esi /* oldset */
- movq %rsp, %rdi /* &pt_regs */
- call do_notify_resume
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- GET_THREAD_INFO(%rcx)
- jmp retint_with_reschedule
-
END(common_interrupt)

/*
@@ -1086,6 +1053,10 @@ error_entry_from_usermode:
*/
SWAPGS

+#ifdef CONFIG_CONTEXT_TRACKING
+ call enter_from_user_mode
+#endif
+
error_entry_done:
TRACE_IRQS_OFF
ret
--
2.4.3

2015-06-16 20:19:30

by Andy Lutomirski

Subject: [RFC/INCOMPLETE 13/13] x86/entry: Remove SCHEDULE_USER and asm/context_tracking.h

SCHEDULE_USER is no longer used, and asm/context_tracking.h
contained nothing else. Remove the header entirely.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 1 -
arch/x86/include/asm/context_tracking.h | 10 ----------
2 files changed, 11 deletions(-)
delete mode 100644 arch/x86/include/asm/context_tracking.h

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ae8b8ab91fa..25cc9c36ca59 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -33,7 +33,6 @@
#include <asm/paravirt.h>
#include <asm/percpu.h>
#include <asm/asm.h>
-#include <asm/context_tracking.h>
#include <asm/smap.h>
#include <asm/pgtable_types.h>
#include <linux/err.h>
diff --git a/arch/x86/include/asm/context_tracking.h b/arch/x86/include/asm/context_tracking.h
deleted file mode 100644
index 1fe49704b146..000000000000
--- a/arch/x86/include/asm/context_tracking.h
+++ /dev/null
@@ -1,10 +0,0 @@
-#ifndef _ASM_X86_CONTEXT_TRACKING_H
-#define _ASM_X86_CONTEXT_TRACKING_H
-
-#ifdef CONFIG_CONTEXT_TRACKING
-# define SCHEDULE_USER call schedule_user
-#else
-# define SCHEDULE_USER call schedule
-#endif
-
-#endif
--
2.4.3

2015-06-17 09:41:27

by Ingo Molnar

Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state


* Andy Lutomirski <[email protected]> wrote:

> This will let us sprinkle sanity checks around the kernel without
> making too much of a mess.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> include/linux/context_tracking.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 2821838256b4..0fbea4b152e1 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> if (context_tracking_is_enabled())
> __context_tracking_task_switch(prev, next);
> }
> +
> +static inline void context_tracking_assert_state(enum ctx_state state)
> +{
> + rcu_lockdep_assert(!context_tracking_is_enabled() ||
> + this_cpu_read(context_tracking.state) == state,
> + "context tracking state was wrong");
> +}

Please don't introduce assert() style debug check interfaces!

(And RCU should be fixed too I suspect.)

They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
which are the dominant runtime check interface in the kernel.

Instead make it something like:

#define ct_state() (this_cpu_read(context_tracking.state))

#define CT_WARN_ON(cond) \
WARN_ON(context_tracking_is_enabled() && (cond))

and then the debug checks can be written as:

CT_WARN_ON(ct_state() != CONTEXT_KERNEL);

This is IMHO _far_ more readable than:

context_tracking_assert_state(CONTEXT_KERNEL);

ok?

(Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)

Thanks,

Ingo

2015-06-17 09:49:08

by Ingo Molnar

Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code


* Andy Lutomirski <[email protected]> wrote:

> This is incomplete, but it's finally good enough that I think it's
> time to get other opinions on it. It is a complete rewrite of the
> slow path code that handles exits to user mode.

Modulo the small comments I made about the debug checks interface plus naming
details, the structure and intention of this series give me warm fuzzy feelings.

> The exit-to-usermode code is copied in several places and is written in a nasty
> combination of asm and C. It's not at all clear what it's supposed to do, and
> the way it's structured makes it very hard to work with. For example, it's not
> even clear why syscall exit hooks are called only once per syscall right now.
> (It seems to be a side effect of the way that rdi and rdx are handled in the asm
> loop, and it seems reliable, but it's still pointlessly complicated.) The
> existing code also makes context tracking overly complicated and hard to
> understand. Finally, it's nearly impossible for anyone to change what happens
> on exit to usermode, since the existing code is so fragile.

Amen.

> I tried to clean it up incrementally, but I decided it was too hard. Instead,
> this series just replaces the code. It seems to work.

Any known bugs beyond UML build breakage?

> Context tracking in particular works very differently now. The low-level entry
> code checks that we're in CONTEXT_USER and switches to CONTEXT_KERNEL. The exit
> code does the reverse. There is no need to track what CONTEXT_XYZ state we came
> from, because we already know. Similarly, SCHEDULE_USER is gone, since we can
> reschedule if needed by simply calling schedule() from C code.
>
> The main things that are missing are that I haven't done the 32-bit parts
> (anyone want to help?) and therefore I haven't deleted the old C code. I also
> think this may break UML for trivial reasons.
>
> Because I haven't converted the 32-bit code yet, all of the now-unnecessary
> calls to exception_enter are still present in traps.c.
>
> IRQ context tracking is still duplicated. We should probably clean it up by
> changing the core code to supply something like
> irq_enter_we_are_already_in_context_kernel.
>
> Thoughts?

So assuming you fix the UML build I'm inclined to go for it, even in this
incomplete form, to increase testing coverage.

Doing that will also decrease ongoing merge friction between your work and other
entry code cleanups ...

Thanks,

Ingo

2015-06-17 10:00:19

by Ingo Molnar

Subject: Re: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks


* Andy Lutomirski <[email protected]> wrote:

> This is separate for ease of bisection.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/entry/entry_64.S | 68 +++++------------------------------------------
> 1 file changed, 7 insertions(+), 61 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 33acc3dcc281..a5044d7a9d43 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
> */
> USERGS_SYSRET64
>
> +GLOBAL(int_ret_from_sys_call_irqs_off)
> + TRACE_IRQS_ON
> + ENABLE_INTERRUPTS(CLBR_NONE)
> + jmp int_ret_from_sys_call

Any reason why irq state tracking cannot be done in C as well, like the rest of
the irq state tracking code?

(Btw., context tracking could in theory be merged with irq state tracking, they
have similar needs.)

One complication would be that we have not saved pt_regs yet:

> @@ -272,69 +277,10 @@ tracesys_phase2:
> * Has correct iret frame.
> */
> GLOBAL(int_ret_from_sys_call)
> SAVE_EXTRA_REGS
> + movq %rsp, %rdi
> + call syscall_return_slowpath /* returns with IRQs disabled */
> RESTORE_EXTRA_REGS

... but we can stipulate that IRQs can be enabled after we've constructed pt_regs.
The few cycles constant added latency there isn't an issue - maintainability of
the code is.
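
Something along these lines, as a purely illustrative sketch (the function
name is hypothetical; enter_from_user_mode() is from patch 5):

	__visible void syscall_entry_slowpath(struct pt_regs *regs)
	{
		enter_from_user_mode();	/* context tracking; IRQs still off */
		local_irq_enable();	/* pt_regs is complete, so this is safe */

		/* ... seccomp, ptrace, audit, tracepoints ... */
	}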

... this would also allow us to get rid of paravirt interface uglies like
'CLBR_NONE' which has no place in generic entry code.

With such a conversion very little 'high level, complex logic' should be left in
assembly and as much as possible should be moved to C: that way we can integrate
it all on the C level and achieve similar speedups as with assembly. A half-done
conversion will reduce that potential for speedups through integration.

A bit like how we have done it with do_page_fault(): there's very little left at
the assembly level, still it has not kept people from micro-optimizing the heck
out of the low level page fault code.

Thanks,

Ingo

2015-06-17 10:02:54

by Ingo Molnar

Subject: Re: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks


* Ingo Molnar <[email protected]> wrote:

>
> * Andy Lutomirski <[email protected]> wrote:
>
> > This is separate for ease of bisection.
> >
> > Signed-off-by: Andy Lutomirski <[email protected]>
> > ---
> > arch/x86/entry/entry_64.S | 68 +++++------------------------------------------
> > 1 file changed, 7 insertions(+), 61 deletions(-)
> >
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 33acc3dcc281..a5044d7a9d43 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
> > */
> > USERGS_SYSRET64
> >
> > +GLOBAL(int_ret_from_sys_call_irqs_off)
> > + TRACE_IRQS_ON
> > + ENABLE_INTERRUPTS(CLBR_NONE)
> > + jmp int_ret_from_sys_call
>
> Any reason why irq state tracking cannot be done in C as well, like the rest of
> the irq state tracking code?

Never mind, I see you've done exactly that in patch #12.

Thanks,

Ingo

2015-06-17 10:13:36

by Richard Weinberger

Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code

On Wed, Jun 17, 2015 at 11:48 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> This is incomplete, but it's finally good enough that I think it's
>> time to get other opinions on it. It is a complete rewrite of the
>> slow path code that handles exits to user mode.
>
> Modulo the small comments I made about the debug checks interface plus naming
> details, the structure and intention of this series give me warm fuzzy feelings.
>
>> The exit-to-usermode code is copied in several places and is written in a nasty
>> combination of asm and C. It's not at all clear what it's supposed to do, and
>> the way it's structured makes it very hard to work with. For example, it's not
>> even clear why syscall exit hooks are called only once per syscall right now.
>> (It seems to be a side effect of the way that rdi and rdx are handled in the asm
>> loop, and it seems reliable, but it's still pointlessly complicated.) The
>> existing code also makes context tracking overly complicated and hard to
>> understand. Finally, it's nearly impossible for anyone to change what happens
>> on exit to usermode, since the existing code is so fragile.
>
> Amen.
>
>> I tried to clean it up incrementally, but I decided it was too hard. Instead,
>> this series just replaces the code. It seems to work.
>
> Any known bugs beyond UML build breakage?
>
>> Context tracking in particular works very differently now. The low-level entry
>> code checks that we're in CONTEXT_USER and switches to CONTEXT_KERNEL. The exit
>> code does the reverse. There is no need to track what CONTEXT_XYZ state we came
>> from, because we already know. Similarly, SCHEDULE_USER is gone, since we can
>> reschedule if needed by simply calling schedule() from C code.
>>
>> The main things that are missing are that I haven't done the 32-bit parts
>> (anyone want to help?) and therefore I haven't deleted the old C code. I also
>> think this may break UML for trivial reasons.
>>
>> Because I haven't converted the 32-bit code yet, all of the now-unnecessary
>> calls to exception_enter are still present in traps.c.
>>
>> IRQ context tracking is still duplicated. We should probably clean it up by
>> changing the core code to supply something like
>> irq_enter_we_are_already_in_context_kernel.
>>
>> Thoughts?
>
> So assuming you fix the UML build I'm inclined to go for it, even in this
> incomplete form, to increase testing coverage.

Andy, can you please share the build breakage you're facing?
I'll happily help you fix it.

--
Thanks,
//richard

2015-06-17 10:32:40

by Ingo Molnar

Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code


* Andy Lutomirski <[email protected]> wrote:

> The main things that are missing are that I haven't done the 32-bit parts
> (anyone want to help?) and therefore I haven't deleted the old C code. I also
> think this may break UML for trivial reasons.

So I'd suggest moving most of the SYSRET fast path to C too.

This is how it looks now after your patches:

testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jnz tracesys
entry_SYSCALL_64_fastpath:
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
ja 1f /* return -ENOSYS (already in pt_regs->ax) */
movq %r10, %rcx
call *sys_call_table(, %rax, 8)
movq %rax, RAX(%rsp)
1:
/*
* Syscall return path ending with SYSRET (fast path).
* Has incompletely filled pt_regs.
*/
LOCKDEP_SYS_EXIT
/*
* We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
* it is too small to ever cause noticeable irq latency.
*/
DISABLE_INTERRUPTS(CLBR_NONE)

/*
* We must check ti flags with interrupts (or at least preemption)
* off because we must *never* return to userspace without
* processing exit work that is enqueued if we're preempted here.
* In particular, returning to userspace with any of the one-shot
* flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
* very bad.
*/
testl $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jnz int_ret_from_sys_call_irqs_off /* Go to the slow path */

Most of that can be done in C.

And I think we could also convert the IRET syscall return slow path to C too:

GLOBAL(int_ret_from_sys_call)
SAVE_EXTRA_REGS
movq %rsp, %rdi
call syscall_return_slowpath /* returns with IRQs disabled */
RESTORE_EXTRA_REGS

/*
* Try to use SYSRET instead of IRET if we're returning to
* a completely clean 64-bit userspace context.
*/
movq RCX(%rsp), %rcx
movq RIP(%rsp), %r11
cmpq %rcx, %r11 /* RCX == RIP */
jne opportunistic_sysret_failed

/*
* On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
* in kernel space. This essentially lets the user take over
* the kernel, since userspace controls RSP.
*
* If width of "canonical tail" ever becomes variable, this will need
* to be updated to remain correct on both old and new CPUs.
*/
.ifne __VIRTUAL_MASK_SHIFT - 47
.error "virtual address width changed -- SYSRET checks need update"
.endif

/* Change top 16 bits to be the sign-extension of 47th bit */
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx

/* If this changed %rcx, it was not canonical */
cmpq %rcx, %r11
jne opportunistic_sysret_failed

cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
jne opportunistic_sysret_failed

movq R11(%rsp), %r11
cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
jne opportunistic_sysret_failed

/*
* SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
* restoring TF results in a trap from userspace immediately after
* SYSRET. This would cause an infinite loop whenever #DB happens
* with register state that satisfies the opportunistic SYSRET
* conditions. For example, single-stepping this user code:
*
* movq $stuck_here, %rcx
* pushfq
* popq %r11
* stuck_here:
*
* would never get past 'stuck_here'.
*/
testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
jnz opportunistic_sysret_failed

/* nothing to check for RSP */

cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
jne opportunistic_sysret_failed

/*
* We win! This label is here just for ease of understanding
* perf profiles. Nothing jumps here.
*/
syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
RESTORE_C_REGS_EXCEPT_RCX_R11
movq RSP(%rsp), %rsp
USERGS_SYSRET64

opportunistic_sysret_failed:
SWAPGS
jmp restore_c_regs_and_iret
END(entry_SYSCALL_64)


Basically there would be a single C function we'd call, which returns a condition
(or fixes up its return address on the stack directly) to determine between the
SYSRET and IRET return paths.
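
The checks translate almost mechanically; e.g. something like this - a sketch
only, the name and calling convention are invented, but each test mirrors one of
the asm checks above:

/*
 * Illustrative only: the opportunistic-SYSRET checks from the asm above.
 * Returns true if the return to user mode may use SYSRET, false if it has
 * to fall back to IRET.
 */
static bool sysret_allowed(struct pt_regs *regs)
{
	unsigned int shift = 64 - (__VIRTUAL_MASK_SHIFT + 1);

	if (regs->cx != regs->ip)			/* RCX must equal RIP */
		return false;

	/* RIP must be canonical: bits 63..48 must be the sign-extension of bit 47 */
	if ((u64)(((s64)regs->ip << shift) >> shift) != regs->ip)
		return false;

	if (regs->cs != __USER_CS || regs->ss != __USER_DS)
		return false;

	if (regs->r11 != regs->flags)			/* R11 must equal RFLAGS */
		return false;

	/* SYSRET can't restore RF, and restoring TF would trap again right away */
	if (regs->r11 & (X86_EFLAGS_RF | X86_EFLAGS_TF))
		return false;

	return true;					/* nothing to check for RSP */
}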

Moving this to C too has immediate benefits: that way we could easily add
instrumentation to see how efficient these various return methods are, etc.

I.e. I don't think there's two ways about this: once the entry code moves to the
domain of C code, we get the best benefits by moving as much of it as possible.

The only low level bits remaining in assembly will be low level hardware ABI
details: saving registers and restoring registers to the expected format - no
'active' code whatsoever.

Thanks,

Ingo

2015-06-17 11:05:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code


* Richard Weinberger <[email protected]> wrote:

> On Wed, Jun 17, 2015 at 11:48 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Andy Lutomirski <[email protected]> wrote:
> >
> >> This is incomplete, but it's finally good enough that I think it's
> >> time to get other opinions on it. It is a complete rewrite of the
> >> slow path code that handles exits to user mode.
> >
> > Modulo the small comments I made about the debug checks interface plus naming
> > details the structure and intention of this series gives me warm fuzzy feelings.
> >
> >> The exit-to-usermode code is copied in several places and is written in a nasty
> >> combination of asm and C. It's not at all clear what it's supposed to do, and
> >> the way it's structured makes it very hard to work with. For example, it's not
> >> even clear why syscall exit hooks are called only once per syscall right now.
> >> (It seems to be a side effect of the way that rdi and rdx are handled in the asm
> >> loop, and it seems reliable, but it's still pointlessly complicated.) The
> >> existing code also makes context tracking overly complicated and hard to
> >> understand. Finally, it's nearly impossible for anyone to change what happens
> >> on exit to usermode, since the existing code is so fragile.
> >
> > Amen.
> >
> >> I tried to clean it up incrementally, but I decided it was too hard. Instead,
> >> this series just replaces the code. It seems to work.
> >
> > Any known bugs beyond UML build breakage?
> >
> >> Context tracking in particular works very differently now. The low-level entry
> >> code checks that we're in CONTEXT_USER and switches to CONTEXT_KERNEL. The exit
> >> code does the reverse. There is no need to track what CONTEXT_XYZ state we came
> >> from, because we already know. Similarly, SCHEDULE_USER is gone, since we can
> >> reschedule if needed by simply calling schedule() from C code.
> >>
> >> The main things that are missing are that I haven't done the 32-bit parts
> >> (anyone want to help?) and therefore I haven't deleted the old C code. I also
> >> think this may break UML for trivial reasons.
> >>
> >> Because I haven't converted the 32-bit code yet, all of the now-unnecessary
> >> calls to exception_enter are still present in traps.c.
> >>
> >> IRQ context tracking is still duplicated. We should probably clean it up by
> >> changing the core code to supply something like
> >> irq_enter_we_are_already_in_context_kernel.
> >>
> >> Thoughts?
> >
> > So assuming you fix the UML build I'm inclined to go for it, even in this
> > incomplete form, to increase testing coverage.
>
> Andy, can you please share the build breakage you're facing?
> I'll happily help you fix it.

So they come in the form of:

./arch/um/include/shared/kern_util.h:25:12: error: conflicting types for ‘do_signal’

which comes from x86 now also having a do_signal().

The patch below fixes it by harmonizing the UML implementation with the x86 one.
This improves the UML side a bit, and fixes the build failure.

Thanks,

Ingo

=========================>
Subject: uml: Fix do_signal() prototype
From: Ingo Molnar <[email protected]>
Date: Wed Jun 17 12:58:37 CEST 2015

Now that x86 exports its do_signal(), the prototypes clash.

Fix the clash and also improve the code a bit: remove the unnecessary
kern_do_signal() indirection. This allows interrupt_end() to share
the 'regs' parameter calculation.

Also remove the unused return code to match x86.

Minimally build and boot tested.

Cc: Richard Weinberger <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/um/include/shared/kern_util.h | 3 ++-
arch/um/kernel/process.c | 6 ++++--
arch/um/kernel/signal.c | 8 +-------
arch/um/kernel/tlb.c | 2 +-
arch/um/kernel/trap.c | 2 +-
5 files changed, 9 insertions(+), 12 deletions(-)

Index: tip/arch/um/include/shared/kern_util.h
===================================================================
--- tip.orig/arch/um/include/shared/kern_util.h
+++ tip/arch/um/include/shared/kern_util.h
@@ -22,7 +22,8 @@ extern int kmalloc_ok;
extern unsigned long alloc_stack(int order, int atomic);
extern void free_stack(unsigned long stack, int order);

-extern int do_signal(void);
+struct pt_regs;
+extern void do_signal(struct pt_regs *regs);
extern void interrupt_end(void);
extern void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs);

Index: tip/arch/um/kernel/process.c
===================================================================
--- tip.orig/arch/um/kernel/process.c
+++ tip/arch/um/kernel/process.c
@@ -90,12 +90,14 @@ void *__switch_to(struct task_struct *fr

void interrupt_end(void)
{
+ struct pt_regs *regs = &current->thread.regs;
+
if (need_resched())
schedule();
if (test_thread_flag(TIF_SIGPENDING))
- do_signal();
+ do_signal(regs);
if (test_and_clear_thread_flag(TIF_NOTIFY_RESUME))
- tracehook_notify_resume(&current->thread.regs);
+ tracehook_notify_resume(regs);
}

void exit_thread(void)
Index: tip/arch/um/kernel/signal.c
===================================================================
--- tip.orig/arch/um/kernel/signal.c
+++ tip/arch/um/kernel/signal.c
@@ -64,7 +64,7 @@ static void handle_signal(struct ksignal
signal_setup_done(err, ksig, singlestep);
}

-static int kern_do_signal(struct pt_regs *regs)
+void do_signal(struct pt_regs *regs)
{
struct ksignal ksig;
int handled_sig = 0;
@@ -110,10 +110,4 @@ static int kern_do_signal(struct pt_regs
*/
if (!handled_sig)
restore_saved_sigmask();
- return handled_sig;
-}
-
-int do_signal(void)
-{
- return kern_do_signal(&current->thread.regs);
}
Index: tip/arch/um/kernel/tlb.c
===================================================================
--- tip.orig/arch/um/kernel/tlb.c
+++ tip/arch/um/kernel/tlb.c
@@ -291,7 +291,7 @@ void fix_range_common(struct mm_struct *
/* We are under mmap_sem, release it such that current can terminate */
up_write(&current->mm->mmap_sem);
force_sig(SIGKILL, current);
- do_signal();
+ do_signal(&current->thread.regs);
}
}

Index: tip/arch/um/kernel/trap.c
===================================================================
--- tip.orig/arch/um/kernel/trap.c
+++ tip/arch/um/kernel/trap.c
@@ -173,7 +173,7 @@ static void bad_segv(struct faultinfo fi
void fatal_sigsegv(void)
{
force_sigsegv(SIGSEGV, current);
- do_signal();
+ do_signal(&current->thread.regs);
/*
* This is to tell gcc that we're not returning - do_signal
* can, in general, return, but in this case, it's not, since

2015-06-17 11:15:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code


* Ingo Molnar <[email protected]> wrote:

> Basically there would be a single C function we'd call, which returns a
> condition (or fixes up its return address on the stack directly) to determine
> between the SYSRET and IRET return paths.

This we could do by returning the syscall result in RAX, and the SYSRET/IRET
choice in RDX - that's the natural return parameter for 128-bit return values in
the 64-bit C function ABI, and it's clobbered so it's available 'for free'.
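
To illustrate that ABI point (plain user-space code, nothing here is from the
kernel - the struct and names are made up): a two-word aggregate comes back in
the RAX:RDX pair, so returning the extra flag costs nothing:

/* Standalone illustration of the System V x86-64 return convention:
 * a 16-byte aggregate is returned in RAX (first word) and RDX (second word). */
#include <stdbool.h>
#include <stdio.h>

struct syscall_ret {
	long	ax;		/* syscall result      -> comes back in RAX */
	bool	use_sysret;	/* SYSRET/IRET choice  -> comes back in RDX */
};

static struct syscall_ret fake_return_slowpath(long result)
{
	struct syscall_ret ret = {
		.ax		= result,
		.use_sysret	= (result >= 0),	/* arbitrary demo condition */
	};
	return ret;		/* returned in the %rax/%rdx register pair */
}

int main(void)
{
	struct syscall_ret r = fake_return_slowpath(42);

	printf("ax=%ld use_sysret=%d\n", r.ax, r.use_sysret);
	return 0;
}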

We could do something similar for the IRQ entry/return code as well: there's no
reason why IRQ flag tracking has to be maintained in assembly. We could move all
but the IRQ stack switching code to C.

We can safely flip around the IRQ stack setting with the enter_from_user_mode
call, so that IRQ stack switching becomes part of the register saving and kernel
mode preparatory preamble.

This would allow further optimizations in the IRQ code as well: for example we
could inline enter_from_user_mode() and prepare_exit_to_usermode().

Thanks,

Ingo

2015-06-17 14:12:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks

On Wed, Jun 17, 2015 at 3:02 AM, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>>
>> * Andy Lutomirski <[email protected]> wrote:
>>
>> > This is separate for ease of bisection.
>> >
>> > Signed-off-by: Andy Lutomirski <[email protected]>
>> > ---
>> > arch/x86/entry/entry_64.S | 68 +++++------------------------------------------
>> > 1 file changed, 7 insertions(+), 61 deletions(-)
>> >
>> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> > index 33acc3dcc281..a5044d7a9d43 100644
>> > --- a/arch/x86/entry/entry_64.S
>> > +++ b/arch/x86/entry/entry_64.S
>> > @@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
>> > */
>> > USERGS_SYSRET64
>> >
>> > +GLOBAL(int_ret_from_sys_call_irqs_off)
>> > + TRACE_IRQS_ON
>> > + ENABLE_INTERRUPTS(CLBR_NONE)
>> > + jmp int_ret_from_sys_call
>>
>> Any reason why irq state tracking cannot be done in C as well, like the rest of
>> the irq state tracking code?
>
> Never mind, I see you've done exactly that in patch #12.
>

There are still some TRACE_IRQS_ON, LOCKDEP_SYS_EXIT, and such
scattered throughout the asm. It's plausible that even more of that
could be moved to C.

We could also benchmark and find out how bad it would be if we just
always filled pt_regs in completely in syscalls. If the performance
hit isn't enough to matter, then we could potentially move the entire
syscall path except pt_regs setup and sysret/iret into three C
functions.
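
Structurally that could be as little as three prototypes plus the asm stubs -- a
sketch only; the first two names are invented, only syscall_return_slowpath() is
from this series:

/* Structural sketch only; signatures are illustrative. */
__visible void syscall_entry_work(struct pt_regs *regs);	/* ptrace/seccomp/audit entry hooks */
__visible void do_syscall_64(struct pt_regs *regs);		/* dispatch via sys_call_table */
__visible void syscall_return_slowpath(struct pt_regs *regs);	/* exit work, returns with IRQs off */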

--Andy

2015-06-17 14:16:28

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state

On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> This will let us sprinkle sanity checks around the kernel without
>> making too much of a mess.
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> include/linux/context_tracking.h | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>> index 2821838256b4..0fbea4b152e1 100644
>> --- a/include/linux/context_tracking.h
>> +++ b/include/linux/context_tracking.h
>> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
>> if (context_tracking_is_enabled())
>> __context_tracking_task_switch(prev, next);
>> }
>> +
>> +static inline void context_tracking_assert_state(enum ctx_state state)
>> +{
>> + rcu_lockdep_assert(!context_tracking_is_enabled() ||
>> + this_cpu_read(context_tracking.state) == state,
>> + "context tracking state was wrong");
>> +}
>
> Please don't introduce assert() style debug check interfaces!
>
> (And RCU should be fixed too I suspect.)
>
> They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
> which are the dominant runtime check interface in the kernel.
>
> Instead make it something like:
>
> #define ct_state() (this_cpu_read(context_tracking.state))
>
> #define CT_WARN_ON(cond) \
> WARN_ON(context_tracking_is_enabled() && (cond))
>
> and then the debug checks can be written as:
>
> CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>
> This is IMHO _far_ more readable than:
>
> context_tracking_assert_state(CONTEXT_KERNEL);
>
> ok?
>
> (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)

Hmm, ok I guess. The part I don't like is having ct_state() at all on
non-context-tracking kernels -- it seems like it's asking for trouble.
We could make CT_WARN_ON not even evaluate its argument if
!CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning
garbage if !context_tracking_is_enabled().

The assert macro avoids all these problems despite being a bit ugly.
It's either a no-op returning void or it does the right thing.

--Andy

2015-06-17 14:20:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code

On Wed, Jun 17, 2015 at 3:13 AM, Richard Weinberger
<[email protected]> wrote:
> On Wed, Jun 17, 2015 at 11:48 AM, Ingo Molnar <[email protected]> wrote:
>>
>> * Andy Lutomirski <[email protected]> wrote:
>>
>>> This is incomplete, but it's finally good enough that I think it's
>>> time to get other opinions on it. It is a complete rewrite of the
>>> slow path code that handles exits to user mode.
>>
>> Modulo the small comments I made about the debug checks interface plus naming
>> details the structure and intention of this series gives me warm fuzzy feelings.
>>
>>> The exit-to-usermode code is copied in several places and is written in a nasty
>>> combination of asm and C. It's not at all clear what it's supposed to do, and
>>> the way it's structured makes it very hard to work with. For example, it's not
>>> even clear why syscall exit hooks are called only once per syscall right now.
>>> (It seems to be a side effect of the way that rdi and rdx are handled in the asm
>>> loop, and it seems reliable, but it's still pointlessly complicated.) The
>>> existing code also makes context tracking overly complicated and hard to
>>> understand. Finally, it's nearly impossible for anyone to change what happens
>>> on exit to usermode, since the existing code is so fragile.
>>
>> Amen.
>>
>>> I tried to clean it up incrementally, but I decided it was too hard. Instead,
>>> this series just replaces the code. It seems to work.
>>
>> Any known bugs beyond UML build breakage?
>>
>>> Context tracking in particular works very differently now. The low-level entry
>>> code checks that we're in CONTEXT_USER and switches to CONTEXT_KERNEL. The exit
>>> code does the reverse. There is no need to track what CONTEXT_XYZ state we came
>>> from, because we already know. Similarly, SCHEDULE_USER is gone, since we can
>>> reschedule if needed by simply calling schedule() from C code.
>>>
>>> The main things that are missing are that I haven't done the 32-bit parts
>>> (anyone want to help?) and therefore I haven't deleted the old C code. I also
>>> think this may break UML for trivial reasons.
>>>
>>> Because I haven't converted the 32-bit code yet, all of the now-unnecessary
>>> calls to exception_enter are still present in traps.c.
>>>
>>> IRQ context tracking is still duplicated. We should probably clean it up by
>>> changing the core code to supply something like
>>> irq_enter_we_are_already_in_context_kernel.
>>>
>>> Thoughts?
>>
>> So assuming you fix the UML build I'm inclined to go for it, even in this
>> incomplete form, to increase testing coverage.
>
> Andy, can you please share the build breakage you're facing?
> I'll happily help you fix it.
>

The do_signal declaration in arch/um/include/shared/kern_util.h
conflicts with the one I added to arch/x86/include/asm/signal.h. The
latter shouldn't really be included in UML, I think, but I didn't see
how to fix it. Just renaming one of the functions would resolve the
conflict.

--Andy

2015-06-17 14:24:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code

On Wed, Jun 17, 2015 at 3:32 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> The main things that are missing are that I haven't done the 32-bit parts
>> (anyone want to help?) and therefore I haven't deleted the old C code. I also
>> think this may break UML for trivial reasons.
>
> So I'd suggest moving most of the SYSRET fast path to C too.
>
> This is how it looks now after your patches:
>
> testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> jnz tracesys
> entry_SYSCALL_64_fastpath:
> #if __SYSCALL_MASK == ~0
> cmpq $__NR_syscall_max, %rax
> #else
> andl $__SYSCALL_MASK, %eax
> cmpl $__NR_syscall_max, %eax
> #endif
> ja 1f /* return -ENOSYS (already in pt_regs->ax) */
> movq %r10, %rcx
> call *sys_call_table(, %rax, 8)
> movq %rax, RAX(%rsp)
> 1:
> /*
> * Syscall return path ending with SYSRET (fast path).
> * Has incompletely filled pt_regs.
> */
> LOCKDEP_SYS_EXIT
> /*
> * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
> * it is too small to ever cause noticeable irq latency.
> */
> DISABLE_INTERRUPTS(CLBR_NONE)
>
> /*
> * We must check ti flags with interrupts (or at least preemption)
> * off because we must *never* return to userspace without
> * processing exit work that is enqueued if we're preempted here.
> * In particular, returning to userspace with any of the one-shot
> * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
> * very bad.
> */
> testl $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> jnz int_ret_from_sys_call_irqs_off /* Go to the slow path */
>
> Most of that can be done in C.
>
> And I think we could also convert the IRET syscall return slow path to C too:
>
> GLOBAL(int_ret_from_sys_call)
> SAVE_EXTRA_REGS
> movq %rsp, %rdi
> call syscall_return_slowpath /* returns with IRQs disabled */
> RESTORE_EXTRA_REGS
>
> /*
> * Try to use SYSRET instead of IRET if we're returning to
> * a completely clean 64-bit userspace context.
> */
> movq RCX(%rsp), %rcx
> movq RIP(%rsp), %r11
> cmpq %rcx, %r11 /* RCX == RIP */
> jne opportunistic_sysret_failed
>
> /*
> * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
> * in kernel space. This essentially lets the user take over
> * the kernel, since userspace controls RSP.
> *
> * If width of "canonical tail" ever becomes variable, this will need
> * to be updated to remain correct on both old and new CPUs.
> */
> .ifne __VIRTUAL_MASK_SHIFT - 47
> .error "virtual address width changed -- SYSRET checks need update"
> .endif
>
> /* Change top 16 bits to be the sign-extension of 47th bit */
> shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>
> /* If this changed %rcx, it was not canonical */
> cmpq %rcx, %r11
> jne opportunistic_sysret_failed
>
> cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
> jne opportunistic_sysret_failed
>
> movq R11(%rsp), %r11
> cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
> jne opportunistic_sysret_failed
>
> /*
> * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
> * restoring TF results in a trap from userspace immediately after
> * SYSRET. This would cause an infinite loop whenever #DB happens
> * with register state that satisfies the opportunistic SYSRET
> * conditions. For example, single-stepping this user code:
> *
> * movq $stuck_here, %rcx
> * pushfq
> * popq %r11
> * stuck_here:
> *
> * would never get past 'stuck_here'.
> */
> testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
> jnz opportunistic_sysret_failed
>
> /* nothing to check for RSP */
>
> cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
> jne opportunistic_sysret_failed
>
> /*
> * We win! This label is here just for ease of understanding
> * perf profiles. Nothing jumps here.
> */
> syscall_return_via_sysret:
> /* rcx and r11 are already restored (see code above) */
> RESTORE_C_REGS_EXCEPT_RCX_R11
> movq RSP(%rsp), %rsp
> USERGS_SYSRET64
>
> opportunistic_sysret_failed:
> SWAPGS
> jmp restore_c_regs_and_iret
> END(entry_SYSCALL_64)
>
>
> Basically there would be a single C function we'd call, which returns a condition
> (or fixes up its return address on the stack directly) to determine between the
> SYSRET and IRET return paths.
>
> Moving this to C too has immediate benefits: that way we could easily add
> instrumentation to see how efficient these various return methods are, etc.
>
> I.e. I don't think there's two ways about this: once the entry code moves to the
> domain of C code, we get the best benefits by moving as much of it as possible.

This is almost certainly true. There are a lot more cleanups possible here.

I want to nail down the 32-bit case first so we can delete the old code.

>
> The only low level bits remaining in assembly will be low level hardware ABI
> details: saving registers and restoring registers to the expected format - no
> 'active' code whatsoever.

I think this is true for syscalls. Getting the weird special cases
(IRET and GS fault) for error_entry to work correctly in C could be
tricky.

--Andy

2015-06-17 15:17:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code

On Jun 17, 2015 2:49 AM, "Ingo Molnar" <[email protected]> wrote:
>
>
> * Andy Lutomirski <[email protected]> wrote:
>

>
> > I tried to clean it up incrementally, but I decided it was too hard. Instead,
> > this series just replaces the code. It seems to work.
>
> Any known bugs beyond UML build breakage?

One minor one: the 64-bit and compat syscall asm changes should be
folded together. I was overly optimistic about bisectability -- the
intermediate step doesn't build. I haven't tested well enough to be
really comfortable with it, though.

There's another minor error in my description: the 32-bit code is not
a prerequisite for the exception_enter removal, so v2 will remove a
whole bunch of exception_enter calls. This considerably improves the
quality of the debugging checks.

>
> > Context tracking in particular works very differently now. The low-level entry
> > code checks that we're in CONTEXT_USER and switches to CONTEXT_KERNEL. The exit
> > code does the reverse. There is no need to track what CONTEXT_XYZ state we came
> > from, because we already know. Similarly, SCHEDULE_USER is gone, since we can
> > reschedule if needed by simply calling schedule() from C code.
> >
> > The main things that are missing are that I haven't done the 32-bit parts
> > (anyone want to help?) and therefore I haven't deleted the old C code. I also
> > think this may break UML for trivial reasons.
> >
> > Because I haven't converted the 32-bit code yet, all of the now-unnecessary
> > calls to exception_enter are still present in traps.c.
> >
> > IRQ context tracking is still duplicated. We should probably clean it up by
> > changing the core code to supply something like
> > irq_enter_we_are_already_in_context_kernel.
> >
> > Thoughts?
>
> So assuming you fix the UML build I'm inclined to go for it, even in this
> incomplete form, to increase testing coverage.
>
> Doing that will also decrease ongoing merge friction between your work and other
> entry code cleanups ...

Sounds good to me. I'm not convinced this is 4.2 material, though.
Would it go in a separate branch for now?

On a related note, do you have any idea what work_notifysig_v86 in
entry_32.S is for? It seems unnecessary and wrong to me. Unnecessary
because we have return_to_32bit. Wrong because I don't see how we can
reliably enter vm86 mode if we have exit work enabled -- one of the
giant turds in vm86_32.c literally jumps from C code to
resume_userspace on vm86 entry, and resume_userspace promptly checks
for work and might land in work_notifysig_v86 before we ever make it
to v8086/user mode.

I think it may actually be impossible to use vm86 under ptrace. ISTR
I had some trouble when trying to strace my test case...

--Andy

2015-06-17 15:28:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state

On Wed, Jun 17, 2015 at 11:41:14AM +0200, Ingo Molnar wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
> > This will let us sprinkle sanity checks around the kernel without
> > making too much of a mess.
> >
> > Signed-off-by: Andy Lutomirski <[email protected]>
> > ---
> > include/linux/context_tracking.h | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > index 2821838256b4..0fbea4b152e1 100644
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> > if (context_tracking_is_enabled())
> > __context_tracking_task_switch(prev, next);
> > }
> > +
> > +static inline void context_tracking_assert_state(enum ctx_state state)
> > +{
> > + rcu_lockdep_assert(!context_tracking_is_enabled() ||
> > + this_cpu_read(context_tracking.state) == state,
> > + "context tracking state was wrong");
> > +}
>
> Please don't introduce assert() style debug check interfaces!
>
> (And RCU should be fixed too I suspect.)

The thought is to rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
by analogy to WARN()? Easy to do if so! Or am I missing the point?

Thanx, Paul

> They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
> which are the dominant runtime check interface in the kernel.
>
> Instead make it something like:
>
> #define ct_state() (this_cpu_read(context_tracking.state))
>
> #define CT_WARN_ON(cond) \
> WARN_ON(context_tracking_is_enabled() && (cond))
>
> and then the debug checks can be written as:
>
> CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>
> This is IMHO _far_ more readable than:
>
> context_tracking_assert_state(CONTEXT_KERNEL);
>
> ok?
>
> (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
>
> Thanks,
>
> Ingo
>

2015-06-18 09:57:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state


* Andy Lutomirski <[email protected]> wrote:

> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Andy Lutomirski <[email protected]> wrote:
> >
> >> This will let us sprinkle sanity checks around the kernel without
> >> making too much of a mess.
> >>
> >> Signed-off-by: Andy Lutomirski <[email protected]>
> >> ---
> >> include/linux/context_tracking.h | 8 ++++++++
> >> 1 file changed, 8 insertions(+)
> >>
> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> >> index 2821838256b4..0fbea4b152e1 100644
> >> --- a/include/linux/context_tracking.h
> >> +++ b/include/linux/context_tracking.h
> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> >> if (context_tracking_is_enabled())
> >> __context_tracking_task_switch(prev, next);
> >> }
> >> +
> >> +static inline void context_tracking_assert_state(enum ctx_state state)
> >> +{
> >> + rcu_lockdep_assert(!context_tracking_is_enabled() ||
> >> + this_cpu_read(context_tracking.state) == state,
> >> + "context tracking state was wrong");
> >> +}
> >
> > Please don't introduce assert() style debug check interfaces!
> >
> > (And RCU should be fixed too I suspect.)
> >
> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
> > which are the dominant runtime check interface in the kernel.
> >
> > Instead make it something like:
> >
> > #define ct_state() (this_cpu_read(context_tracking.state))
> >
> > #define CT_WARN_ON(cond) \
> > WARN_ON(context_tracking_is_enabled() && (cond))
> >
> > and then the debug checks can be written as:
> >
> > CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> >
> > This is IMHO _far_ more readable than:
> >
> > context_tracking_assert_state(CONTEXT_KERNEL);
> >
> > ok?
> >
> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
>
> Hmm, ok I guess. The part I don't like is having ct_state() at all on
> non-context-tracking kernels -- it seems like it's asking for trouble.

Well:

- if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.

- if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
CT_WARN_ON() compiles 'cond' in, but won't evaluate it at run time.

- only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
get as far as ct_state() evaluation.

so I'm not sure I see the problem you are seeing.

> We could make CT_WARN_ON not even evaluate its argument if
> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
> !context_tracking_is_enabled().

My understanding is that if !context_tracking_is_enabled() then the compiler
should not even try to evaluate the rest. This is why doing a NULL pointer check
like this is safe:

	if (tsk && tsk->field) {
		...
	}
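
I.e. laid out like this (a sketch - the CONFIG_CONTEXT_TRACKING plumbing is my
assumption, not part of the patch), the three cases above fall out by themselves:

/* Sketch only - illustrates the three cases described above. */
#ifdef CONFIG_CONTEXT_TRACKING
# define ct_state()		(this_cpu_read(context_tracking.state))
# define CT_WARN_ON(cond)	WARN_ON(context_tracking_is_enabled() && (cond))
#else
# define CT_WARN_ON(cond)	do { } while (0)	/* compiles away entirely */
#endif

When the static key behind context_tracking_is_enabled() is off, the && short-
circuits before 'cond' is ever evaluated.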

> The assert macro avoids all these problems despite being a bit ugly.

but writing good kernel code is all about not being ugly...

Thanks,

Ingo

2015-06-18 10:00:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state


* Paul E. McKenney <[email protected]> wrote:

> On Wed, Jun 17, 2015 at 11:41:14AM +0200, Ingo Molnar wrote:
> >
> > * Andy Lutomirski <[email protected]> wrote:
> >
> > > This will let us sprinkle sanity checks around the kernel without
> > > making too much of a mess.
> > >
> > > Signed-off-by: Andy Lutomirski <[email protected]>
> > > ---
> > > include/linux/context_tracking.h | 8 ++++++++
> > > 1 file changed, 8 insertions(+)
> > >
> > > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > > index 2821838256b4..0fbea4b152e1 100644
> > > --- a/include/linux/context_tracking.h
> > > +++ b/include/linux/context_tracking.h
> > > @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> > > if (context_tracking_is_enabled())
> > > __context_tracking_task_switch(prev, next);
> > > }
> > > +
> > > +static inline void context_tracking_assert_state(enum ctx_state state)
> > > +{
> > > + rcu_lockdep_assert(!context_tracking_is_enabled() ||
> > > + this_cpu_read(context_tracking.state) == state,
> > > + "context tracking state was wrong");
> > > +}
> >
> > Please don't introduce assert() style debug check interfaces!
> >
> > (And RCU should be fixed too I suspect.)
>
> The thought is to rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
> by analogy to WARN()? Easy to do if so! Or am I missing the point?

Yeah, and inverting the condition. Assuming the condition was assert()-style
inverted to begin with! :-)

And lockdep should be fixed too, I suspect: lockdep_assert_held() was a poorly
chosen name; it should be 'lockdep_check_held()' or so. It has very little to do
with the assert() interface.

Thanks,

Ingo

2015-06-18 10:11:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code


* Andy Lutomirski <[email protected]> wrote:

> > The only low level bits remaining in assembly will be low level hardware ABI
> > details: saving registers and restoring registers to the expected format - no
> > 'active' code whatsoever.
>
> I think this is true for syscalls. Getting the weird special cases (IRET and GS
> fault) for error_entry to work correctly in C could be tricky.

Correct, and I double checked the IRET fault path yesterday (fixup_bad_iret), and
it looks like a straightforward exception handler with limited control flow. It
can stay in asm just fine, it seems mostly orthogonal to the rest.

I didn't check the GS fault path, but that only affects 32-bit, as we use SWAPGS
on 64-bit, right? In any case, that code too (32-bit RESTORE_REGS) belongs into
the natural 'hardware ABI preparation code' that should stay in assembly. (Unless
I missed some other code that might cause trouble.)

The most deadly complexity in our asm code are IMHO the intertwined threads of
control flow - all of that should go into C, where it's much easier to see what's
going on, because we have named variables, established code patterns and a
compiler checking for common mistakes and such.

The other big area of complexity are our partial save/restore tricks, which makes
tracking of what is saved (and what is not) tricky and fragile.

Thanks,

Ingo

2015-06-18 10:14:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code


* Andy Lutomirski <[email protected]> wrote:

> > So assuming you fix the UML build I'm inclined to go for it, even in this
> > incomplete form, to increase testing coverage.
> >
> > Doing that will also decrease ongoing merge friction between your work and
> > other entry code cleanups ...
>
> Sounds good to me. I'm not convinced this is 4.2 material, though. Would it go
> in a separate branch for now?

Please send it out as a separate series, on top of -tip, I'll probably stick it
into a separate branch for the time being.

> On a related note, do you have any idea what work_notifysig_v86 in entry_32.S is
> for? It seems unnecessary and wrong to me. Unnecessary because we have
> return_to_32bit. Wrong because I don't see how we can reliably enter vm86 mode
> if we have exit work enabled -- one of the giant turds in vm86_32.c literally
> jumps from C code to resume_userspace on vm86 entry, and resume_userspace
> promptly checks for work and might land in work_notifysig_v86 before we ever
> make it to v8086/user mode.
>
> I think it may actually be impossible to use vm86 under ptrace. ISTR I had some
> trouble when trying to strace my test case...

Should be tested really, I'm all for removing it if simple vm86 mode games
continue to work ;-)

Thanks,

Ingo

2015-06-18 10:17:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks


* Andy Lutomirski <[email protected]> wrote:

> >> Any reason why irq state tracking cannot be done in C as well, like the rest
> >> of the irq state tracking code?
> >
> > Never mind, I see you've done exactly that in patch #12.
>
> There are still some TRACE_IRQS_ON, LOCKDEP_SYS_EXIT, and such scattered
> throughout the asm. It's plausible that even more of that could be moved to C.
>
> We could also benchmark and find out how bad it would be if we just always
> filled pt_regs in completely in syscalls. If the performance hit isn't enough
> to matter, then we could potentially move the entire syscall path except pt_regs
> setup and sysret/iret into three C functions.

The thing is, I'd not be against simplifying pt_regs handling even if it slows
down things a tiny bit. If anyone wants to reintroduce that complexity we'll see
what it looks like in isolation, done cleanly.

Thanks,

Ingo

2015-06-18 10:19:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks


* Ingo Molnar <[email protected]> wrote:

>
> * Andy Lutomirski <[email protected]> wrote:
>
> > >> Any reason why irq state tracking cannot be done in C as well, like the
> > >> rest of the irq state tracking code?
> > >
> > > Never mind, I see you've done exactly that in patch #12.
> >
> > There are still some TRACE_IRQS_ON, LOCKDEP_SYS_EXIT, and such scattered
> > throughout the asm. It's plausible that even more of that could be moved to
> > C.
> >
> > We could also benchmark and find out how bad it would be if we just always
> > filled pt_regs in completely in syscalls. If the performance hit isn't enough
> > to matter, then we could potentially move the entire syscall path except
> > pt_regs setup and sysret/iret into three C functions.
>
> The thing is, I'd not be against simplifying pt_regs handling even if it slows
> down things a tiny bit. If anyone wants to reintroduce that complexity we'll see
> what it looks like in isolation, done cleanly.

... and I suspect the reduction of entry points will allow the compiler to do a
better job - so some of the overhead might be won back.

So I'd say we try this approach and complicate it back in the future only if the
numbers warrant it.

Thanks,

Ingo

2015-06-18 11:06:49

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code

On Thu, Jun 18, 2015 at 3:11 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> > The only low level bits remaining in assembly will be low level hardware ABI
>> > details: saving registers and restoring registers to the expected format - no
>> > 'active' code whatsoever.
>>
>> I think this is true for syscalls. Getting the weird special cases (IRET and GS
>> fault) for error_entry to work correctly in C could be tricky.
>
> Correct, and I double checked the IRET fault path yesterday (fixup_bad_iret), and
> it looks like a straightforward exception handler with limited control flow. It
> can stay in asm just fine, it seems mostly orthogonal to the rest.
>
> I didn't check the GS fault path, but that only affects 32-bit, as we use SWAPGS
> on 64-bit, right? In any case, that code too (32-bit RESTORE_REGS) belongs into
> the natural 'hardware ABI preparation code' that should stay in assembly. (Unless
> I missed some other code that might cause trouble.)

Look for "gs_change". To change the gs selector, we do swapgs, then
load gs, then swapgs again. If the gs load fails, then we trigger a
special fixup.

--Andy

2015-06-18 11:08:26

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state

On Thu, Jun 18, 2015 at 2:57 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <[email protected]> wrote:
>> >
>> > * Andy Lutomirski <[email protected]> wrote:
>> >
>> >> This will let us sprinkle sanity checks around the kernel without
>> >> making too much of a mess.
>> >>
>> >> Signed-off-by: Andy Lutomirski <[email protected]>
>> >> ---
>> >> include/linux/context_tracking.h | 8 ++++++++
>> >> 1 file changed, 8 insertions(+)
>> >>
>> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>> >> index 2821838256b4..0fbea4b152e1 100644
>> >> --- a/include/linux/context_tracking.h
>> >> +++ b/include/linux/context_tracking.h
>> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
>> >> if (context_tracking_is_enabled())
>> >> __context_tracking_task_switch(prev, next);
>> >> }
>> >> +
>> >> +static inline void context_tracking_assert_state(enum ctx_state state)
>> >> +{
>> >> + rcu_lockdep_assert(!context_tracking_is_enabled() ||
>> >> + this_cpu_read(context_tracking.state) == state,
>> >> + "context tracking state was wrong");
>> >> +}
>> >
>> > Please don't introduce assert() style debug check interfaces!
>> >
>> > (And RCU should be fixed too I suspect.)
>> >
>> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
>> > which are the dominant runtime check interface in the kernel.
>> >
>> > Instead make it something like:
>> >
>> > #define ct_state() (this_cpu_read(context_tracking.state))
>> >
>> > #define CT_WARN_ON(cond) \
>> > WARN_ON(context_tracking_is_enabled() && (cond))
>> >
>> > and then the debug checks can be written as:
>> >
>> > CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>> >
>> > This is IMHO _far_ more readable than:
>> >
>> > context_tracking_assert_state(CONTEXT_KERNEL);
>> >
>> > ok?
>> >
>> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
>>
>> Hmm, ok I guess. The part I don't like is having ct_state() at all on
>> non-context-tracking kernels -- it seems like it's asking for trouble.
>
> Well:
>
> - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.
>
> - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
> CT_WARN_ON() will evaluate 'cond', but won't calculate it.
>
> - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
> get as far as ct_state() evaluation.
>
> so I'm not sure I see the problem you are seeing.
>
>> We could make CT_WARN_ON not even evaluate its argument if
>> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
>> !context_tracking_is_enabled().
>
> My understanding is that if !context_tracking_is_enabled() then the compiler
> should not even try to evaluate the rest. This is why doing a NULL pointer check
> like this is safe:

I'm fine with everything you just covered. My only objection is that,
if ct_state() exists, then someone might call it outside CT_WARN_ON,
in which case it will break on non-context-tracking setups.

--Andy

2015-06-18 15:52:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state

On Thu, Jun 18, 2015 at 4:07 AM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Jun 18, 2015 at 2:57 AM, Ingo Molnar <[email protected]> wrote:
>>
>> * Andy Lutomirski <[email protected]> wrote:
>>
>>> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <[email protected]> wrote:
>>> >
>>> > * Andy Lutomirski <[email protected]> wrote:
>>> >
>>> >> This will let us sprinkle sanity checks around the kernel without
>>> >> making too much of a mess.
>>> >>
>>> >> Signed-off-by: Andy Lutomirski <[email protected]>
>>> >> ---
>>> >> include/linux/context_tracking.h | 8 ++++++++
>>> >> 1 file changed, 8 insertions(+)
>>> >>
>>> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>>> >> index 2821838256b4..0fbea4b152e1 100644
>>> >> --- a/include/linux/context_tracking.h
>>> >> +++ b/include/linux/context_tracking.h
>>> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
>>> >> if (context_tracking_is_enabled())
>>> >> __context_tracking_task_switch(prev, next);
>>> >> }
>>> >> +
>>> >> +static inline void context_tracking_assert_state(enum ctx_state state)
>>> >> +{
>>> >> + rcu_lockdep_assert(!context_tracking_is_enabled() ||
>>> >> + this_cpu_read(context_tracking.state) == state,
>>> >> + "context tracking state was wrong");
>>> >> +}
>>> >
>>> > Please don't introduce assert() style debug check interfaces!
>>> >
>>> > (And RCU should be fixed too I suspect.)
>>> >
>>> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
>>> > which are the dominant runtime check interface in the kernel.
>>> >
>>> > Instead make it something like:
>>> >
>>> > #define ct_state() (this_cpu_read(context_tracking.state))
>>> >
>>> > #define CT_WARN_ON(cond) \
>>> > WARN_ON(context_tracking_is_enabled() && (cond))
>>> >
>>> > and then the debug checks can be written as:
>>> >
>>> > CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>>> >
>>> > This is IMHO _far_ more readable than:
>>> >
>>> > context_tracking_assert_state(CONTEXT_KERNEL);
>>> >
>>> > ok?
>>> >
>>> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
>>>
>>> Hmm, ok I guess. The part I don't like is having ct_state() at all on
>>> non-context-tracking kernels -- it seems like it's asking for trouble.
>>
>> Well:
>>
>> - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.
>>
>> - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
>> CT_WARN_ON() will evaluate 'cond', but won't calculate it.
>>
>> - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
>> get as far as ct_state() evaluation.
>>
>> so I'm not sure I see the problem you are seeing.
>>
>>> We could make CT_WARN_ON not even evaluate its argument if
>>> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
>>> !context_tracking_is_enabled().
>>
>> My understanding is that if !context_tracking_is_enabled() then the compiler
>> should not even try to evaluate the rest. This is why doing a NULL pointer check
>> like this is safe:
>
> I'm fine with everything you just covered. My only objection is that,
> if ct_state() exists, then someone might call it outside CT_WARN_ON,
> in which case it will break on non-context-tracking setups.

The more I think about it, the more I dislike ct_state(). We have
in_atomic(), which is already problematic because the return value
isn't reliable. ct_state(), if callable on non context-tracking
kernels, will also be unreliable. I prefer things like
lockdep_assert_held because they can't be misused.

It would be far too easy for someone to read:

CT_WARN_ON(ct_state() != CONTEXT_KERNEL);

and add:

if (ct_state() == CONTEXT_KERNEL)
do_something();

and that would be bad.

--Andy

>
> --Andy



--
Andy Lutomirski
AMA Capital Management, LLC

2015-06-18 16:17:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state


* Andy Lutomirski <[email protected]> wrote:

> On Thu, Jun 18, 2015 at 4:07 AM, Andy Lutomirski <[email protected]> wrote:
> > On Thu, Jun 18, 2015 at 2:57 AM, Ingo Molnar <[email protected]> wrote:
> >>
> >> * Andy Lutomirski <[email protected]> wrote:
> >>
> >>> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <[email protected]> wrote:
> >>> >
> >>> > * Andy Lutomirski <[email protected]> wrote:
> >>> >
> >>> >> This will let us sprinkle sanity checks around the kernel without
> >>> >> making too much of a mess.
> >>> >>
> >>> >> Signed-off-by: Andy Lutomirski <[email protected]>
> >>> >> ---
> >>> >> include/linux/context_tracking.h | 8 ++++++++
> >>> >> 1 file changed, 8 insertions(+)
> >>> >>
> >>> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> >>> >> index 2821838256b4..0fbea4b152e1 100644
> >>> >> --- a/include/linux/context_tracking.h
> >>> >> +++ b/include/linux/context_tracking.h
> >>> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> >>> >> if (context_tracking_is_enabled())
> >>> >> __context_tracking_task_switch(prev, next);
> >>> >> }
> >>> >> +
> >>> >> +static inline void context_tracking_assert_state(enum ctx_state state)
> >>> >> +{
> >>> >> + rcu_lockdep_assert(!context_tracking_is_enabled() ||
> >>> >> + this_cpu_read(context_tracking.state) == state,
> >>> >> + "context tracking state was wrong");
> >>> >> +}
> >>> >
> >>> > Please don't introduce assert() style debug check interfaces!
> >>> >
> >>> > (And RCU should be fixed too I suspect.)
> >>> >
> >>> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
> >>> > which are the dominant runtime check interface in the kernel.
> >>> >
> >>> > Instead make it something like:
> >>> >
> >>> > #define ct_state() (this_cpu_read(context_tracking.state))
> >>> >
> >>> > #define CT_WARN_ON(cond) \
> >>> > WARN_ON(context_tracking_is_enabled() && (cond))
> >>> >
> >>> > and then the debug checks can be written as:
> >>> >
> >>> > CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> >>> >
> >>> > This is IMHO _far_ more readable than:
> >>> >
> >>> > context_tracking_assert_state(CONTEXT_KERNEL);
> >>> >
> >>> > ok?
> >>> >
> >>> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
> >>>
> >>> Hmm, ok I guess. The part I don't like is having ct_state() at all on
> >>> non-context-tracking kernels -- it seems like it's asking for trouble.
> >>
> >> Well:
> >>
> >> - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.
> >>
> >> - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
> >> CT_WARN_ON() will evaluate 'cond', but won't calculate it.
> >>
> >> - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
> >> get as far as ct_state() evaluation.
> >>
> >> so I'm not sure I see the problem you are seeing.
> >>
> >>> We could make CT_WARN_ON not even evaluate its argument if
> >>> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
> >>> !context_tracking_is_enabled().
> >>
> >> My understanding is that if !context_tracking_is_enabled() then the compiler
> >> should not even try to evaluate the rest. This is why doing a NULL pointer check
> >> like this is safe:
> >
> > I'm fine with everything you just covered. My only objection is that,
> > if ct_state() exists, then someone might call it outside CT_WARN_ON,
> > in which case it will break on non-context-tracking setups.
>
> The more I think about it, the more I dislike ct_state(). We have
> in_atomic(), which is already problematic because the return value
> isn't reliable. ct_state(), if callable on non context-tracking
> kernels, will also be unreliable. I prefer things like
> lockdep_assert_held because they can't be misused.
>
> It would be far too easy for someone to read:
>
> CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>
> and add:
>
> if (ct_state() == CONTEXT_KERNEL)
> do_something();
>
> and that would be bad.

But ct_state() could be made reliable: if !context_tracking_is_enabled() then it
should return -1 or so.

I.e. we could make it something like:

enum ctx_state {
	CONTEXT_DISABLED = -1,
	CONTEXT_KERNEL   = 0,
	CONTEXT_USER     = 1,
	CONTEXT_GUEST    = 2,
} state;

static inline enum ctx_state ct_state(void)
{
	if (context_tracking_is_enabled())
		return this_cpu_read(context_tracking.state);

	return CONTEXT_DISABLED;
}

and then CT_WARN_ON() DTRT.
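
E.g. (illustrative only, reusing the do_something() example from above):

	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);	/* warns only when tracking is enabled */

	if (ct_state() == CONTEXT_KERNEL)		/* simply false when tracking is off */
		do_something();

Even the misuse quoted above becomes well-defined: with tracking disabled the
test is simply false.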

Thanks,

Ingo

2015-06-18 16:24:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code


* Andy Lutomirski <[email protected]> wrote:

> On Thu, Jun 18, 2015 at 3:11 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Andy Lutomirski <[email protected]> wrote:
> >
> >> > The only low level bits remaining in assembly will be low level hardware ABI
> >> > details: saving registers and restoring registers to the expected format - no
> >> > 'active' code whatsoever.
> >>
> >> I think this is true for syscalls. Getting the weird special cases (IRET and
> >> GS fault) for error_entry to work correctly in C could be tricky.
> >
> > Correct, and I double checked the IRET fault path yesterday (fixup_bad_iret),
> > and it looks like a straightforward exception handler with limited control
> > flow. It can stay in asm just fine, it seems mostly orthogonal to the rest.
> >
> > I didn't check the GS fault path, but that only affects 32-bit, as we use
> > SWAPGS on 64-bit, right? In any case, that code too (32-bit RESTORE_REGS)
> > belongs into the natural 'hardware ABI preparation code' that should stay in
> > assembly. (Unless I missed some other code that might cause trouble.)
>
> Look for "gs_change". To change the gs selector, we do swapgs, then load gs,
> then swapgs again. If the gs load fails, then we trigger a special fixup.

Yes, but I don't see the connection to moving the syscall (and IRQ) entry code to
.c: native_load_gs_index() is a separate API we call from regular kernel code, and
it has a regular exception fixup entry plus a trap handler special case.

It's fine in entry_64.S, but it would be equally fine in inline asm() as well.

I think it's fine in entry_64.S as long as the error trap code (which refers to
the change_gs RIP) lives there. But it could live in .c as well, as we can
generate global symbols within inline asm() too.
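
Purely as an illustration of that mechanism (the label name is invented):

/* A top-level asm() in a .c file can emit a global symbol that exception
 * fixup tables or other asm can refer to; illustration only. */
asm(".pushsection .text, \"ax\"\n"
    ".globl my_gs_change_label\n"
    "my_gs_change_label:\n"
    "\tnop\n"
    ".popsection\n");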

Thanks,

Ingo

2015-06-18 16:26:44

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state

On Thu, Jun 18, 2015 at 06:17:29PM +0200, Ingo Molnar wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
> > On Thu, Jun 18, 2015 at 4:07 AM, Andy Lutomirski <[email protected]> wrote:
> > > On Thu, Jun 18, 2015 at 2:57 AM, Ingo Molnar <[email protected]> wrote:
> > >>
> > >> * Andy Lutomirski <[email protected]> wrote:
> > >>
> > >>> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <[email protected]> wrote:
> > >>> >
> > >>> > * Andy Lutomirski <[email protected]> wrote:
> > >>> >
> > >>> >> This will let us sprinkle sanity checks around the kernel without
> > >>> >> making too much of a mess.
> > >>> >>
> > >>> >> Signed-off-by: Andy Lutomirski <[email protected]>
> > >>> >> ---
> > >>> >> include/linux/context_tracking.h | 8 ++++++++
> > >>> >> 1 file changed, 8 insertions(+)
> > >>> >>
> > >>> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > >>> >> index 2821838256b4..0fbea4b152e1 100644
> > >>> >> --- a/include/linux/context_tracking.h
> > >>> >> +++ b/include/linux/context_tracking.h
> > >>> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> > >>> >> if (context_tracking_is_enabled())
> > >>> >> __context_tracking_task_switch(prev, next);
> > >>> >> }
> > >>> >> +
> > >>> >> +static inline void context_tracking_assert_state(enum ctx_state state)
> > >>> >> +{
> > >>> >> + rcu_lockdep_assert(!context_tracking_is_enabled() ||
> > >>> >> + this_cpu_read(context_tracking.state) == state,
> > >>> >> + "context tracking state was wrong");
> > >>> >> +}
> > >>> >
> > >>> > Please don't introduce assert() style debug check interfaces!
> > >>> >
> > >>> > (And RCU should be fixed too I suspect.)
> > >>> >
> > >>> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
> > >>> > which are the dominant runtime check interface in the kernel.
> > >>> >
> > >>> > Instead make it something like:
> > >>> >
> > >>> > #define ct_state() (this_cpu_read(context_tracking.state))
> > >>> >
> > >>> > #define CT_WARN_ON(cond) \
> > >>> > WARN_ON(context_tracking_is_enabled() && (cond))
> > >>> >
> > >>> > and then the debug checks can be written as:
> > >>> >
> > >>> > CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> > >>> >
> > >>> > This is IMHO _far_ more readable than:
> > >>> >
> > >>> > context_tracking_assert_state(CONTEXT_KERNEL);
> > >>> >
> > >>> > ok?
> > >>> >
> > >>> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
> > >>>
> > >>> Hmm, ok I guess. The part I don't like is having ct_state() at all on
> > >>> non-context-tracking kernels -- it seems like it's asking for trouble.
> > >>
> > >> Well:
> > >>
> > >> - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.
> > >>
> > >> - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
> > >> CT_WARN_ON() will evaluate 'cond', but won't calculate it.
> > >>
> > >> - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
> > >> get as far as ct_state() evaluation.
> > >>
> > >> so I'm not sure I see the problem you are seeing.
> > >>
> > >>> We could make CT_WARN_ON not even evaluate its argument if
> > >>> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
> > >>> !context_tracking_is_enabled().
> > >>
> > >> My understanding is that if !context_tracking_is_enabled() then the compiler
> > >> should not even try to evaluate the rest. This is why doing a NULL pointer check
> > >> like this is safe:
> > >
> > > I'm fine with everything you just covered. My only objection is that,
> > > if ct_state() exists, then someone might call it outside CT_WARN_ON,
> > > in which case it will break on non-context-tracking setups.
> >
> > The more I think about it, the more I dislike ct_state(). We have
> > in_atomic(), which is already problematic because the return value
> > isn't reliable. ct_state(), if callable on non context-tracking
> > kernels, will also be unreliable. I prefer things like
> > lockdep_assert_held because they can't be misused.
> >
> > It would be far too easy for someone to read:
> >
> > CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> >
> > and add:
> >
> > if (ct_state() == CONTEXT_KERNEL)
> > do_something();
> >
> > and that would be bad.
>
> But ct_state() could be made reliable: if !context_tracking_is_enabled() then it
> should return -1 or so.
>
> I.e. we could make it something like:
>
> enum ctx_state {
> CONTEXT_DISABLED = -1,
> CONTEXT_KERNEL = 0,
> CONTEXT_USER = 1,
> CONTEXT_GUEST = 2,
> } state;
>
> static inline enum ctx_state ct_state(void)
> {
> if (context_tracking_is_enabled())
> return this_cpu_read(context_tracking.state);
>
> return CONTEXT_DISABLED;
> }
>
> and then CT_WARN_ON() DTRT.

That sounds good. With good layout of these things, the compiler should still be
able to nop related code when context tracking is disabled.
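
As a self-contained illustration of the interface being discussed (this is not code from the series): the sketch below models ct_state() and CT_WARN_ON() in plain userspace C. A file-scope bool stands in for the context-tracking static key, another variable stands in for the per-CPU state, and fprintf() stands in for WARN_ON(); only the enum values and the shape of the two helpers follow the proposal quoted above.

#include <stdbool.h>
#include <stdio.h>

enum ctx_state {
        CONTEXT_DISABLED = -1,  /* tracking compiled out or runtime-disabled */
        CONTEXT_KERNEL   = 0,
        CONTEXT_USER     = 1,
        CONTEXT_GUEST    = 2,
};

static bool context_tracking_enabled;                   /* stand-in for the static key */
static enum ctx_state this_cpu_state = CONTEXT_KERNEL;  /* stand-in for this_cpu_read() */

static inline enum ctx_state ct_state(void)
{
        if (context_tracking_enabled)
                return this_cpu_state;

        return CONTEXT_DISABLED;        /* reliable answer even when tracking is off */
}

/* The && short-circuit keeps 'cond' from being evaluated when tracking is off. */
#define CT_WARN_ON(cond) \
        do { \
                if (context_tracking_enabled && (cond)) \
                        fprintf(stderr, "CT_WARN_ON(%s) at %s:%d\n", \
                                #cond, __FILE__, __LINE__); \
        } while (0)

int main(void)
{
        CT_WARN_ON(ct_state() != CONTEXT_KERNEL);       /* silent: tracking disabled */

        context_tracking_enabled = true;
        this_cpu_state = CONTEXT_USER;
        CT_WARN_ON(ct_state() != CONTEXT_KERNEL);       /* prints a warning */

        return 0;
}

Built and run, this prints exactly one warning, which matches the intent above: callers outside CT_WARN_ON() get the unambiguous CONTEXT_DISABLED instead of stale per-CPU state.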

>
> Thanks,
>
> Ingo

2015-06-18 19:27:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state

On Thu, Jun 18, 2015 at 9:26 AM, Frederic Weisbecker <[email protected]> wrote:
> On Thu, Jun 18, 2015 at 06:17:29PM +0200, Ingo Molnar wrote:
>>
>> * Andy Lutomirski <[email protected]> wrote:
>>
>> > On Thu, Jun 18, 2015 at 4:07 AM, Andy Lutomirski <[email protected]> wrote:
>> > > On Thu, Jun 18, 2015 at 2:57 AM, Ingo Molnar <[email protected]> wrote:
>> > >>
>> > >> * Andy Lutomirski <[email protected]> wrote:
>> > >>
>> > >>> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <[email protected]> wrote:
>> > >>> >
>> > >>> > * Andy Lutomirski <[email protected]> wrote:
>> > >>> >
>> > >>> >> This will let us sprinkle sanity checks around the kernel without
>> > >>> >> making too much of a mess.
>> > >>> >>
>> > >>> >> Signed-off-by: Andy Lutomirski <[email protected]>
>> > >>> >> ---
>> > >>> >> include/linux/context_tracking.h | 8 ++++++++
>> > >>> >> 1 file changed, 8 insertions(+)
>> > >>> >>
>> > >>> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>> > >>> >> index 2821838256b4..0fbea4b152e1 100644
>> > >>> >> --- a/include/linux/context_tracking.h
>> > >>> >> +++ b/include/linux/context_tracking.h
>> > >>> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
>> > >>> >> if (context_tracking_is_enabled())
>> > >>> >> __context_tracking_task_switch(prev, next);
>> > >>> >> }
>> > >>> >> +
>> > >>> >> +static inline void context_tracking_assert_state(enum ctx_state state)
>> > >>> >> +{
>> > >>> >> + rcu_lockdep_assert(!context_tracking_is_enabled() ||
>> > >>> >> + this_cpu_read(context_tracking.state) == state,
>> > >>> >> + "context tracking state was wrong");
>> > >>> >> +}
>> > >>> >
>> > >>> > Please don't introduce assert() style debug check interfaces!
>> > >>> >
>> > >>> > (And RCU should be fixed too I suspect.)
>> > >>> >
>> > >>> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
>> > >>> > which are the dominant runtime check interface in the kernel.
>> > >>> >
>> > >>> > Instead make it something like:
>> > >>> >
>> > >>> > #define ct_state() (this_cpu_read(context_tracking.state))
>> > >>> >
>> > >>> > #define CT_WARN_ON(cond) \
>> > >>> > WARN_ON(context_tracking_is_enabled() && (cond))
>> > >>> >
>> > >>> > and then the debug checks can be written as:
>> > >>> >
>> > >>> > CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>> > >>> >
>> > >>> > This is IMHO _far_ more readable than:
>> > >>> >
>> > >>> > context_tracking_assert_state(CONTEXT_KERNEL);
>> > >>> >
>> > >>> > ok?
>> > >>> >
>> > >>> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
>> > >>>
>> > >>> Hmm, ok I guess. The part I don't like is having ct_state() at all on
>> > >>> non-context-tracking kernels -- it seems like it's asking for trouble.
>> > >>
>> > >> Well:
>> > >>
>> > >> - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.
>> > >>
>> > >> - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
>> > >> CT_WARN_ON() will still compile in 'cond', but won't evaluate it at run time.
>> > >>
>> > >> - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
>> > >> get as far as ct_state() evaluation.
>> > >>
>> > >> so I'm not sure I see the problem you are seeing.
>> > >>
>> > >>> We could make CT_WARN_ON not even evaluate its argument if
>> > >>> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
>> > >>> !context_tracking_is_enabled().
>> > >>
>> > >> My understanding is that if !context_tracking_is_enabled() then the compiler
>> > >> should not even try to evaluate the rest. This is why doing a NULL pointer check
>> > >> like this is safe:
>> > >
>> > > I'm fine with everything you just covered. My only objection is that,
>> > > if ct_state() exists, then someone might call it outside CT_WARN_ON,
>> > > in which case it will break on non-context-tracking setups.
>> >
>> > The more I think about it, the more I dislike ct_state(). We have
>> > in_atomic(), which is already problematic because the return value
>> > isn't reliable. ct_state(), if callable on non context-tracking
>> > kernels, will also be unreliable. I prefer things like
>> > lockdep_assert_held because they can't be misused.
>> >
>> > It would be far too easy for someone to read:
>> >
>> > CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>> >
>> > and add:
>> >
>> > if (ct_state() == CONTEXT_KERNEL)
>> > do_something();
>> >
>> > and that would be bad.
>>
>> But ct_state() could be made reliable: if !context_tracking_is_enabled() then it
>> should return -1 or so.
>>
>> I.e. we could make it something like:
>>
>> enum ctx_state {
>> CONTEXT_DISABLED = -1,
>> CONTEXT_KERNEL = 0,
>> CONTEXT_USER = 1,
>> CONTEXT_GUEST = 2,
>> } state;
>>
>> static inline enum ctx_state ct_state(void)
>> {
>> if (context_tracking_is_enabled())
>> return this_cpu_read(context_tracking.state);
>>
>> return CONTEXT_DISABLED;
>> }
>>
>> and then CT_WARN_ON() DTRT.
>
> That sounds good. With good layout of these things, the compiler should still be
> able to nop related code when context tracking is disabled.

Done.

--Andy

>
>>
>> Thanks,
>>
>> Ingo



--
Andy Lutomirski
AMA Capital Management, LLC

2015-06-18 22:54:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state

On Thu, Jun 18, 2015 at 11:59:55AM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > On Wed, Jun 17, 2015 at 11:41:14AM +0200, Ingo Molnar wrote:
> > >
> > > * Andy Lutomirski <[email protected]> wrote:
> > >
> > > > This will let us sprinkle sanity checks around the kernel without
> > > > making too much of a mess.
> > > >
> > > > Signed-off-by: Andy Lutomirski <[email protected]>
> > > > ---
> > > > include/linux/context_tracking.h | 8 ++++++++
> > > > 1 file changed, 8 insertions(+)
> > > >
> > > > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > > > index 2821838256b4..0fbea4b152e1 100644
> > > > --- a/include/linux/context_tracking.h
> > > > +++ b/include/linux/context_tracking.h
> > > > @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> > > > if (context_tracking_is_enabled())
> > > > __context_tracking_task_switch(prev, next);
> > > > }
> > > > +
> > > > +static inline void context_tracking_assert_state(enum ctx_state state)
> > > > +{
> > > > + rcu_lockdep_assert(!context_tracking_is_enabled() ||
> > > > + this_cpu_read(context_tracking.state) == state,
> > > > + "context tracking state was wrong");
> > > > +}
> > >
> > > Please don't introduce assert() style debug check interfaces!
> > >
> > > (And RCU should be fixed too I suspect.)
> >
> > The thought is to rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
> > by analogy to WARN()? Easy to do if so! Or am I missing the point?
>
> Yeah, and inverting the condition. Assuming the condition was assert()-style
> inverted to begin with! :-)

It appears to have been. ;-)

Please see below for an untested patch. I made this be one big patch,
but could have one patch add RCU_LOCKDEP_WARN(), a series convert uses
from rcu_lockdep_assert() to RCU_LOCKDEP_WARN(), and a final patch
remove rcu_lockdep_assert(). Let me know!

> and lockdep should be fixed too I suspect, lockdep_assert_held() was really a
> poorly chosen name I suspect, it should be 'lockdep_check_held()' or so? It has
> very little to do with the assert() interface.
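
For orientation before the diff (an illustration, not part of the patch below): the rename flips the sense of the condition at every call site, so a check that used to assert a condition now warns on its negation. A minimal before/after pair, mirroring the kernel/pid.c hunk further down (messages shortened here):

        /* old interface: complain unless the condition holds */
        rcu_lockdep_assert(rcu_read_lock_held(),
                           "need rcu_read_lock() protection");

        /* new interface: complain if the buggy condition is seen */
        RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
                         "need rcu_read_lock() protection");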

------------------------------------------------------------------------

commit a3053da26e914ab9d7fe25b03d35bbe5a2a718c0
Author: Paul E. McKenney <[email protected]>
Date: Thu Jun 18 15:50:02 2015 -0700

rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()

This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for
consistency with the WARN() series of macros. This also requires
inverting the sense of the conditional, which this commit also does.

Reported-by: Ingo Molnar <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 324ab5247687..37f827cb78ed 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -136,7 +136,7 @@ enum ctx_state ist_enter(struct pt_regs *regs)
preempt_count_add(HARDIRQ_OFFSET);

/* This code is a bit fragile. Test it. */
- rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work");

return prev_state;
}
diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index 677fb2843553..3b188f20b43f 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -110,8 +110,8 @@ static DEFINE_MUTEX(dev_opp_list_lock);

#define opp_rcu_lockdep_assert() \
do { \
- rcu_lockdep_assert(rcu_read_lock_held() || \
- lockdep_is_held(&dev_opp_list_lock), \
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held() && \
+ !lockdep_is_held(&dev_opp_list_lock), \
"Missing rcu_read_lock() or " \
"dev_opp_list_lock protection"); \
} while (0)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 230f87bdf5ad..910c75f6cd0b 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -83,8 +83,8 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i

static inline struct file *fcheck_files(struct files_struct *files, unsigned int fd)
{
- rcu_lockdep_assert(rcu_read_lock_held() ||
- lockdep_is_held(&files->file_lock),
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
+ !lockdep_is_held(&files->file_lock),
"suspicious rcu_dereference_check() usage");
return __fcheck_files(files, fd);
}
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 2f4a93a8ceab..e9e04c32cb4c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -537,14 +537,14 @@ static inline int rcu_read_lock_sched_held(void)
#ifdef CONFIG_PROVE_RCU

/**
- * rcu_lockdep_assert - emit lockdep splat if specified condition not met
+ * RCU_LOCKDEP_WARN - emit lockdep splat if specified condition is met
* @c: condition to check
* @s: informative message
*/
-#define rcu_lockdep_assert(c, s) \
+#define RCU_LOCKDEP_WARN(c, s) \
do { \
static bool __section(.data.unlikely) __warned; \
- if (debug_lockdep_rcu_enabled() && !__warned && !(c)) { \
+ if (debug_lockdep_rcu_enabled() && !__warned && (c)) { \
__warned = true; \
lockdep_rcu_suspicious(__FILE__, __LINE__, s); \
} \
@@ -553,8 +553,8 @@ static inline int rcu_read_lock_sched_held(void)
#if defined(CONFIG_PROVE_RCU) && !defined(CONFIG_PREEMPT_RCU)
static inline void rcu_preempt_sleep_check(void)
{
- rcu_lockdep_assert(!lock_is_held(&rcu_lock_map),
- "Illegal context switch in RCU read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_lock_map),
+ "Illegal context switch in RCU read-side critical section");
}
#else /* #ifdef CONFIG_PROVE_RCU */
static inline void rcu_preempt_sleep_check(void)
@@ -565,15 +565,15 @@ static inline void rcu_preempt_sleep_check(void)
#define rcu_sleep_check() \
do { \
rcu_preempt_sleep_check(); \
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map), \
- "Illegal context switch in RCU-bh read-side critical section"); \
- rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map), \
- "Illegal context switch in RCU-sched read-side critical section"); \
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map), \
+ "Illegal context switch in RCU-bh read-side critical section"); \
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map), \
+ "Illegal context switch in RCU-sched read-side critical section"); \
} while (0)

#else /* #ifdef CONFIG_PROVE_RCU */

-#define rcu_lockdep_assert(c, s) do { } while (0)
+#define RCU_LOCKDEP_WARN(c, s) do { } while (0)
#define rcu_sleep_check() do { } while (0)

#endif /* #else #ifdef CONFIG_PROVE_RCU */
@@ -604,13 +604,13 @@ static inline void rcu_preempt_sleep_check(void)
({ \
/* Dependency order vs. p above. */ \
typeof(*p) *________p1 = (typeof(*p) *__force)lockless_dereference(p); \
- rcu_lockdep_assert(c, "suspicious rcu_dereference_check() usage"); \
+ RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_check() usage"); \
rcu_dereference_sparse(p, space); \
((typeof(*p) __force __kernel *)(________p1)); \
})
#define __rcu_dereference_protected(p, c, space) \
({ \
- rcu_lockdep_assert(c, "suspicious rcu_dereference_protected() usage"); \
+ RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_protected() usage"); \
rcu_dereference_sparse(p, space); \
((typeof(*p) __force __kernel *)(p)); \
})
@@ -849,8 +849,8 @@ static inline void rcu_read_lock(void)
__rcu_read_lock();
__acquire(RCU);
rcu_lock_acquire(&rcu_lock_map);
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_lock() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_lock() used illegally while idle");
}

/*
@@ -900,8 +900,8 @@ static inline void rcu_read_lock(void)
*/
static inline void rcu_read_unlock(void)
{
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_unlock() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_unlock() used illegally while idle");
__release(RCU);
__rcu_read_unlock();
rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */
@@ -929,8 +929,8 @@ static inline void rcu_read_lock_bh(void)
local_bh_disable();
__acquire(RCU_BH);
rcu_lock_acquire(&rcu_bh_lock_map);
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_lock_bh() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_lock_bh() used illegally while idle");
}

/*
@@ -940,8 +940,8 @@ static inline void rcu_read_lock_bh(void)
*/
static inline void rcu_read_unlock_bh(void)
{
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_unlock_bh() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_unlock_bh() used illegally while idle");
rcu_lock_release(&rcu_bh_lock_map);
__release(RCU_BH);
local_bh_enable();
@@ -965,8 +965,8 @@ static inline void rcu_read_lock_sched(void)
preempt_disable();
__acquire(RCU_SCHED);
rcu_lock_acquire(&rcu_sched_lock_map);
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_lock_sched() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_lock_sched() used illegally while idle");
}

/* Used by lockdep and tracing: cannot be traced, cannot call lockdep. */
@@ -983,8 +983,8 @@ static inline notrace void rcu_read_lock_sched_notrace(void)
*/
static inline void rcu_read_unlock_sched(void)
{
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_unlock_sched() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_unlock_sched() used illegally while idle");
rcu_lock_release(&rcu_sched_lock_map);
__release(RCU_SCHED);
preempt_enable();
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 469dd547770c..ad2d0162532a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -104,8 +104,8 @@ static DEFINE_SPINLOCK(cgroup_idr_lock);
static DEFINE_SPINLOCK(release_agent_path_lock);

#define cgroup_assert_mutex_or_rcu_locked() \
- rcu_lockdep_assert(rcu_read_lock_held() || \
- lockdep_is_held(&cgroup_mutex), \
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held() && \
+ !lockdep_is_held(&cgroup_mutex), \
"cgroup_mutex or RCU read lock required");

/*
diff --git a/kernel/pid.c b/kernel/pid.c
index 4fd07d5b7baf..ca368793808e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -451,9 +451,8 @@ EXPORT_SYMBOL(pid_task);
*/
struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
{
- rcu_lockdep_assert(rcu_read_lock_held(),
- "find_task_by_pid_ns() needs rcu_read_lock()"
- " protection");
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
+ "find_task_by_pid_ns() needs rcu_read_lock() protection");
return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID);
}

diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index de35087c92a5..d3fcb2ec8536 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -415,11 +415,11 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
struct rcu_head *head = &rcu.head;
bool done = false;

- rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
- !lock_is_held(&rcu_bh_lock_map) &&
- !lock_is_held(&rcu_lock_map) &&
- !lock_is_held(&rcu_sched_lock_map),
- "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&sp->dep_map) ||
+ lock_is_held(&rcu_bh_lock_map) ||
+ lock_is_held(&rcu_lock_map) ||
+ lock_is_held(&rcu_sched_lock_map),
+ "Illegal synchronize_srcu() in same-type SRCU (or in RCU) read-side critical section");

might_sleep();
init_completion(&rcu.completion);
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 591af0cb7b9f..e888602d5c56 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -191,10 +191,10 @@ static void rcu_process_callbacks(struct softirq_action *unused)
*/
void synchronize_sched(void)
{
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
- !lock_is_held(&rcu_lock_map) &&
- !lock_is_held(&rcu_sched_lock_map),
- "Illegal synchronize_sched() in RCU read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+ lock_is_held(&rcu_lock_map) ||
+ lock_is_held(&rcu_sched_lock_map),
+ "Illegal synchronize_sched() in RCU read-side critical section");
cond_resched();
}
EXPORT_SYMBOL_GPL(synchronize_sched);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c844ef3c2fae..78d0a87ff354 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -644,12 +644,12 @@ static void rcu_eqs_enter_common(long long oldval, bool user)
* It is illegal to enter an extended quiescent state while
* in an RCU read-side critical section.
*/
- rcu_lockdep_assert(!lock_is_held(&rcu_lock_map),
- "Illegal idle entry in RCU read-side critical section.");
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
- "Illegal idle entry in RCU-bh read-side critical section.");
- rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
- "Illegal idle entry in RCU-sched read-side critical section.");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_lock_map),
+ "Illegal idle entry in RCU read-side critical section.");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map),
+ "Illegal idle entry in RCU-bh read-side critical section.");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map),
+ "Illegal idle entry in RCU-sched read-side critical section.");
}

/*
@@ -3162,10 +3162,10 @@ static inline int rcu_blocking_is_gp(void)
*/
void synchronize_sched(void)
{
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
- !lock_is_held(&rcu_lock_map) &&
- !lock_is_held(&rcu_sched_lock_map),
- "Illegal synchronize_sched() in RCU-sched read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+ lock_is_held(&rcu_lock_map) ||
+ lock_is_held(&rcu_sched_lock_map),
+ "Illegal synchronize_sched() in RCU-sched read-side critical section");
if (rcu_blocking_is_gp())
return;
if (rcu_gp_is_expedited())
@@ -3189,10 +3189,10 @@ EXPORT_SYMBOL_GPL(synchronize_sched);
*/
void synchronize_rcu_bh(void)
{
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
- !lock_is_held(&rcu_lock_map) &&
- !lock_is_held(&rcu_sched_lock_map),
- "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+ lock_is_held(&rcu_lock_map) ||
+ lock_is_held(&rcu_sched_lock_map),
+ "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
if (rcu_blocking_is_gp())
return;
if (rcu_gp_is_expedited())
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 85addcf7d184..07bd9fb393b4 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -522,10 +522,10 @@ EXPORT_SYMBOL_GPL(call_rcu);
*/
void synchronize_rcu(void)
{
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
- !lock_is_held(&rcu_lock_map) &&
- !lock_is_held(&rcu_sched_lock_map),
- "Illegal synchronize_rcu() in RCU read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+ lock_is_held(&rcu_lock_map) ||
+ lock_is_held(&rcu_sched_lock_map),
+ "Illegal synchronize_rcu() in RCU read-side critical section");
if (!rcu_scheduler_active)
return;
if (rcu_gp_is_expedited())
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index a0a0dd03c73a..47268fb1d27b 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -589,8 +589,8 @@ EXPORT_SYMBOL_GPL(call_rcu_tasks);
void synchronize_rcu_tasks(void)
{
/* Complain if the scheduler has not started. */
- rcu_lockdep_assert(!rcu_scheduler_active,
- "synchronize_rcu_tasks called too soon");
+ RCU_LOCKDEP_WARN(rcu_scheduler_active,
+ "synchronize_rcu_tasks called too soon");

/* Wait for the grace period. */
wait_rcu_gp(call_rcu_tasks);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe22f7510bce..7956c016e750 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1987,8 +1987,8 @@ unsigned long to_ratio(u64 period, u64 runtime)
#ifdef CONFIG_SMP
inline struct dl_bw *dl_bw_of(int i)
{
- rcu_lockdep_assert(rcu_read_lock_sched_held(),
- "sched RCU must be held");
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
return &cpu_rq(i)->rd->dl_bw;
}

@@ -1997,8 +1997,8 @@ static inline int dl_bw_cpus(int i)
struct root_domain *rd = cpu_rq(i)->rd;
int cpus = 0;

- rcu_lockdep_assert(rcu_read_lock_sched_held(),
- "sched RCU must be held");
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
for_each_cpu_and(i, rd->span, cpu_active_mask)
cpus++;

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 586ad91300b0..80bc8dc8d286 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -338,14 +338,14 @@ static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
#include <trace/events/workqueue.h>

#define assert_rcu_or_pool_mutex() \
- rcu_lockdep_assert(rcu_read_lock_sched_held() || \
- lockdep_is_held(&wq_pool_mutex), \
- "sched RCU or wq_pool_mutex should be held")
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held() && \
+ !lockdep_is_held(&wq_pool_mutex), \
+ "sched RCU or wq_pool_mutex should be held")

#define assert_rcu_or_wq_mutex(wq) \
- rcu_lockdep_assert(rcu_read_lock_sched_held() || \
- lockdep_is_held(&wq->mutex), \
- "sched RCU or wq->mutex should be held")
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held() && \
+ !lockdep_is_held(&wq->mutex), \
+ "sched RCU or wq->mutex should be held")

#define for_each_cpu_worker_pool(pool, cpu) \
for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0]; \
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 188c1d26393b..73455089feef 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -400,9 +400,9 @@ static bool verify_new_ex(struct dev_cgroup *dev_cgroup,
{
bool match = false;

- rcu_lockdep_assert(rcu_read_lock_held() ||
- lockdep_is_held(&devcgroup_mutex),
- "device_cgroup:verify_new_ex called without proper synchronization");
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
+ !lockdep_is_held(&devcgroup_mutex),
+ "device_cgroup:verify_new_ex called without proper synchronization");

if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) {
if (behavior == DEVCG_DEFAULT_ALLOW) {

2015-06-19 02:19:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state

On Thu, Jun 18, 2015 at 03:54:20PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 18, 2015 at 11:59:55AM +0200, Ingo Molnar wrote:
> > * Paul E. McKenney <[email protected]> wrote:

[ . . . ]

> > > The thought is to rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
> > > by analogy to WARN()? Easy to do if so! Or am I missing the point?
> >
> > Yeah, and inverting the condition. Assuming the condition was assert()-style
> > inverted to begin with! :-)
>
> It appears to have been. ;-)
>
> Please see below for an untested patch. I made this be one big patch,
> but could have one patch add RCU_LOCKDEP_WARN(), a series convert uses
> from rcu_lockdep_assert() to RCU_LOCKDEP_WARN(), and a final patch
> remove rcu_lockdep_assert(). Let me know!
>
> > and lockdep should be fixed too I suspect, lockdep_assert_held() was really a
> > poorly chosen name I suspect, it should be 'lockdep_check_held()' or so? It has
> > very little to do with the assert() interface.

And this one actually builds and passes light rcutorture testing. ;-)

Thanx, Paul

------------------------------------------------------------------------

commit eeacf89826376570bff42edec59fd1606ce1829f
Author: Paul E. McKenney <[email protected]>
Date: Thu Jun 18 15:50:02 2015 -0700

rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()

This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for
consistency with the WARN() series of macros. This also requires
inverting the sense of the conditional, which this commit also does.

Reported-by: Ingo Molnar <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 5746b0c77f3e..adc2184009c5 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -883,7 +883,7 @@ All: lockdep-checked RCU-protected pointer access

rcu_access_pointer
rcu_dereference_raw
- rcu_lockdep_assert
+ RCU_LOCKDEP_WARN
rcu_sleep_check
RCU_NONIDLE

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d4298feaa6fb..127ecd687c1d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -54,9 +54,9 @@ static DEFINE_MUTEX(mce_chrdev_read_mutex);

#define rcu_dereference_check_mce(p) \
({ \
- rcu_lockdep_assert(rcu_read_lock_sched_held() || \
- lockdep_is_held(&mce_chrdev_read_mutex), \
- "suspicious rcu_dereference_check_mce() usage"); \
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held() && \
+ !lockdep_is_held(&mce_chrdev_read_mutex), \
+ "suspicious rcu_dereference_check_mce() usage"); \
smp_load_acquire(&(p)); \
})

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 324ab5247687..37f827cb78ed 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -136,7 +136,7 @@ enum ctx_state ist_enter(struct pt_regs *regs)
preempt_count_add(HARDIRQ_OFFSET);

/* This code is a bit fragile. Test it. */
- rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work");

return prev_state;
}
diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index 677fb2843553..3b188f20b43f 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -110,8 +110,8 @@ static DEFINE_MUTEX(dev_opp_list_lock);

#define opp_rcu_lockdep_assert() \
do { \
- rcu_lockdep_assert(rcu_read_lock_held() || \
- lockdep_is_held(&dev_opp_list_lock), \
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held() && \
+ !lockdep_is_held(&dev_opp_list_lock), \
"Missing rcu_read_lock() or " \
"dev_opp_list_lock protection"); \
} while (0)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 230f87bdf5ad..910c75f6cd0b 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -83,8 +83,8 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i

static inline struct file *fcheck_files(struct files_struct *files, unsigned int fd)
{
- rcu_lockdep_assert(rcu_read_lock_held() ||
- lockdep_is_held(&files->file_lock),
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
+ !lockdep_is_held(&files->file_lock),
"suspicious rcu_dereference_check() usage");
return __fcheck_files(files, fd);
}
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 2f4a93a8ceab..e9e04c32cb4c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -537,14 +537,14 @@ static inline int rcu_read_lock_sched_held(void)
#ifdef CONFIG_PROVE_RCU

/**
- * rcu_lockdep_assert - emit lockdep splat if specified condition not met
+ * RCU_LOCKDEP_WARN - emit lockdep splat if specified condition is met
* @c: condition to check
* @s: informative message
*/
-#define rcu_lockdep_assert(c, s) \
+#define RCU_LOCKDEP_WARN(c, s) \
do { \
static bool __section(.data.unlikely) __warned; \
- if (debug_lockdep_rcu_enabled() && !__warned && !(c)) { \
+ if (debug_lockdep_rcu_enabled() && !__warned && (c)) { \
__warned = true; \
lockdep_rcu_suspicious(__FILE__, __LINE__, s); \
} \
@@ -553,8 +553,8 @@ static inline int rcu_read_lock_sched_held(void)
#if defined(CONFIG_PROVE_RCU) && !defined(CONFIG_PREEMPT_RCU)
static inline void rcu_preempt_sleep_check(void)
{
- rcu_lockdep_assert(!lock_is_held(&rcu_lock_map),
- "Illegal context switch in RCU read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_lock_map),
+ "Illegal context switch in RCU read-side critical section");
}
#else /* #ifdef CONFIG_PROVE_RCU */
static inline void rcu_preempt_sleep_check(void)
@@ -565,15 +565,15 @@ static inline void rcu_preempt_sleep_check(void)
#define rcu_sleep_check() \
do { \
rcu_preempt_sleep_check(); \
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map), \
- "Illegal context switch in RCU-bh read-side critical section"); \
- rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map), \
- "Illegal context switch in RCU-sched read-side critical section"); \
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map), \
+ "Illegal context switch in RCU-bh read-side critical section"); \
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map), \
+ "Illegal context switch in RCU-sched read-side critical section"); \
} while (0)

#else /* #ifdef CONFIG_PROVE_RCU */

-#define rcu_lockdep_assert(c, s) do { } while (0)
+#define RCU_LOCKDEP_WARN(c, s) do { } while (0)
#define rcu_sleep_check() do { } while (0)

#endif /* #else #ifdef CONFIG_PROVE_RCU */
@@ -604,13 +604,13 @@ static inline void rcu_preempt_sleep_check(void)
({ \
/* Dependency order vs. p above. */ \
typeof(*p) *________p1 = (typeof(*p) *__force)lockless_dereference(p); \
- rcu_lockdep_assert(c, "suspicious rcu_dereference_check() usage"); \
+ RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_check() usage"); \
rcu_dereference_sparse(p, space); \
((typeof(*p) __force __kernel *)(________p1)); \
})
#define __rcu_dereference_protected(p, c, space) \
({ \
- rcu_lockdep_assert(c, "suspicious rcu_dereference_protected() usage"); \
+ RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_protected() usage"); \
rcu_dereference_sparse(p, space); \
((typeof(*p) __force __kernel *)(p)); \
})
@@ -849,8 +849,8 @@ static inline void rcu_read_lock(void)
__rcu_read_lock();
__acquire(RCU);
rcu_lock_acquire(&rcu_lock_map);
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_lock() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_lock() used illegally while idle");
}

/*
@@ -900,8 +900,8 @@ static inline void rcu_read_lock(void)
*/
static inline void rcu_read_unlock(void)
{
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_unlock() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_unlock() used illegally while idle");
__release(RCU);
__rcu_read_unlock();
rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */
@@ -929,8 +929,8 @@ static inline void rcu_read_lock_bh(void)
local_bh_disable();
__acquire(RCU_BH);
rcu_lock_acquire(&rcu_bh_lock_map);
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_lock_bh() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_lock_bh() used illegally while idle");
}

/*
@@ -940,8 +940,8 @@ static inline void rcu_read_lock_bh(void)
*/
static inline void rcu_read_unlock_bh(void)
{
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_unlock_bh() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_unlock_bh() used illegally while idle");
rcu_lock_release(&rcu_bh_lock_map);
__release(RCU_BH);
local_bh_enable();
@@ -965,8 +965,8 @@ static inline void rcu_read_lock_sched(void)
preempt_disable();
__acquire(RCU_SCHED);
rcu_lock_acquire(&rcu_sched_lock_map);
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_lock_sched() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_lock_sched() used illegally while idle");
}

/* Used by lockdep and tracing: cannot be traced, cannot call lockdep. */
@@ -983,8 +983,8 @@ static inline notrace void rcu_read_lock_sched_notrace(void)
*/
static inline void rcu_read_unlock_sched(void)
{
- rcu_lockdep_assert(rcu_is_watching(),
- "rcu_read_unlock_sched() used illegally while idle");
+ RCU_LOCKDEP_WARN(!rcu_is_watching(),
+ "rcu_read_unlock_sched() used illegally while idle");
rcu_lock_release(&rcu_sched_lock_map);
__release(RCU_SCHED);
preempt_enable();
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 469dd547770c..ad2d0162532a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -104,8 +104,8 @@ static DEFINE_SPINLOCK(cgroup_idr_lock);
static DEFINE_SPINLOCK(release_agent_path_lock);

#define cgroup_assert_mutex_or_rcu_locked() \
- rcu_lockdep_assert(rcu_read_lock_held() || \
- lockdep_is_held(&cgroup_mutex), \
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held() && \
+ !lockdep_is_held(&cgroup_mutex), \
"cgroup_mutex or RCU read lock required");

/*
diff --git a/kernel/pid.c b/kernel/pid.c
index 4fd07d5b7baf..ca368793808e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -451,9 +451,8 @@ EXPORT_SYMBOL(pid_task);
*/
struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
{
- rcu_lockdep_assert(rcu_read_lock_held(),
- "find_task_by_pid_ns() needs rcu_read_lock()"
- " protection");
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
+ "find_task_by_pid_ns() needs rcu_read_lock() protection");
return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID);
}

diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index de35087c92a5..d3fcb2ec8536 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -415,11 +415,11 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
struct rcu_head *head = &rcu.head;
bool done = false;

- rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
- !lock_is_held(&rcu_bh_lock_map) &&
- !lock_is_held(&rcu_lock_map) &&
- !lock_is_held(&rcu_sched_lock_map),
- "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&sp->dep_map) ||
+ lock_is_held(&rcu_bh_lock_map) ||
+ lock_is_held(&rcu_lock_map) ||
+ lock_is_held(&rcu_sched_lock_map),
+ "Illegal synchronize_srcu() in same-type SRCU (or in RCU) read-side critical section");

might_sleep();
init_completion(&rcu.completion);
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 591af0cb7b9f..e888602d5c56 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -191,10 +191,10 @@ static void rcu_process_callbacks(struct softirq_action *unused)
*/
void synchronize_sched(void)
{
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
- !lock_is_held(&rcu_lock_map) &&
- !lock_is_held(&rcu_sched_lock_map),
- "Illegal synchronize_sched() in RCU read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+ lock_is_held(&rcu_lock_map) ||
+ lock_is_held(&rcu_sched_lock_map),
+ "Illegal synchronize_sched() in RCU read-side critical section");
cond_resched();
}
EXPORT_SYMBOL_GPL(synchronize_sched);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c844ef3c2fae..78d0a87ff354 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -644,12 +644,12 @@ static void rcu_eqs_enter_common(long long oldval, bool user)
* It is illegal to enter an extended quiescent state while
* in an RCU read-side critical section.
*/
- rcu_lockdep_assert(!lock_is_held(&rcu_lock_map),
- "Illegal idle entry in RCU read-side critical section.");
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
- "Illegal idle entry in RCU-bh read-side critical section.");
- rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
- "Illegal idle entry in RCU-sched read-side critical section.");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_lock_map),
+ "Illegal idle entry in RCU read-side critical section.");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map),
+ "Illegal idle entry in RCU-bh read-side critical section.");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map),
+ "Illegal idle entry in RCU-sched read-side critical section.");
}

/*
@@ -3162,10 +3162,10 @@ static inline int rcu_blocking_is_gp(void)
*/
void synchronize_sched(void)
{
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
- !lock_is_held(&rcu_lock_map) &&
- !lock_is_held(&rcu_sched_lock_map),
- "Illegal synchronize_sched() in RCU-sched read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+ lock_is_held(&rcu_lock_map) ||
+ lock_is_held(&rcu_sched_lock_map),
+ "Illegal synchronize_sched() in RCU-sched read-side critical section");
if (rcu_blocking_is_gp())
return;
if (rcu_gp_is_expedited())
@@ -3189,10 +3189,10 @@ EXPORT_SYMBOL_GPL(synchronize_sched);
*/
void synchronize_rcu_bh(void)
{
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
- !lock_is_held(&rcu_lock_map) &&
- !lock_is_held(&rcu_sched_lock_map),
- "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+ lock_is_held(&rcu_lock_map) ||
+ lock_is_held(&rcu_sched_lock_map),
+ "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
if (rcu_blocking_is_gp())
return;
if (rcu_gp_is_expedited())
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 85addcf7d184..07bd9fb393b4 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -522,10 +522,10 @@ EXPORT_SYMBOL_GPL(call_rcu);
*/
void synchronize_rcu(void)
{
- rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
- !lock_is_held(&rcu_lock_map) &&
- !lock_is_held(&rcu_sched_lock_map),
- "Illegal synchronize_rcu() in RCU read-side critical section");
+ RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+ lock_is_held(&rcu_lock_map) ||
+ lock_is_held(&rcu_sched_lock_map),
+ "Illegal synchronize_rcu() in RCU read-side critical section");
if (!rcu_scheduler_active)
return;
if (rcu_gp_is_expedited())
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index a0a0dd03c73a..47268fb1d27b 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -589,8 +589,8 @@ EXPORT_SYMBOL_GPL(call_rcu_tasks);
void synchronize_rcu_tasks(void)
{
/* Complain if the scheduler has not started. */
- rcu_lockdep_assert(!rcu_scheduler_active,
- "synchronize_rcu_tasks called too soon");
+ RCU_LOCKDEP_WARN(rcu_scheduler_active,
+ "synchronize_rcu_tasks called too soon");

/* Wait for the grace period. */
wait_rcu_gp(call_rcu_tasks);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe22f7510bce..7956c016e750 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1987,8 +1987,8 @@ unsigned long to_ratio(u64 period, u64 runtime)
#ifdef CONFIG_SMP
inline struct dl_bw *dl_bw_of(int i)
{
- rcu_lockdep_assert(rcu_read_lock_sched_held(),
- "sched RCU must be held");
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
return &cpu_rq(i)->rd->dl_bw;
}

@@ -1997,8 +1997,8 @@ static inline int dl_bw_cpus(int i)
struct root_domain *rd = cpu_rq(i)->rd;
int cpus = 0;

- rcu_lockdep_assert(rcu_read_lock_sched_held(),
- "sched RCU must be held");
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
for_each_cpu_and(i, rd->span, cpu_active_mask)
cpus++;

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 586ad91300b0..80bc8dc8d286 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -338,14 +338,14 @@ static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
#include <trace/events/workqueue.h>

#define assert_rcu_or_pool_mutex() \
- rcu_lockdep_assert(rcu_read_lock_sched_held() || \
- lockdep_is_held(&wq_pool_mutex), \
- "sched RCU or wq_pool_mutex should be held")
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held() && \
+ !lockdep_is_held(&wq_pool_mutex), \
+ "sched RCU or wq_pool_mutex should be held")

#define assert_rcu_or_wq_mutex(wq) \
- rcu_lockdep_assert(rcu_read_lock_sched_held() || \
- lockdep_is_held(&wq->mutex), \
- "sched RCU or wq->mutex should be held")
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held() && \
+ !lockdep_is_held(&wq->mutex), \
+ "sched RCU or wq->mutex should be held")

#define for_each_cpu_worker_pool(pool, cpu) \
for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0]; \
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 188c1d26393b..73455089feef 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -400,9 +400,9 @@ static bool verify_new_ex(struct dev_cgroup *dev_cgroup,
{
bool match = false;

- rcu_lockdep_assert(rcu_read_lock_held() ||
- lockdep_is_held(&devcgroup_mutex),
- "device_cgroup:verify_new_ex called without proper synchronization");
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
+ !lockdep_is_held(&devcgroup_mutex),
+ "device_cgroup:verify_new_ex called without proper synchronization");

if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) {
if (behavior == DEVCG_DEFAULT_ALLOW) {

2015-06-30 11:04:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state


* Paul E. McKenney <[email protected]> wrote:

> > Yeah, and inverting the condition. Assuming the condition was assert()-style
> > inverted to begin with! :-)
>
> It appears to have been. ;-)
>
> Please see below for an untested patch. I made this be one big patch, but could
> have one patch add RCU_LOCKDEP_WARN(), a series convert uses from
> rcu_lockdep_assert() to RCU_LOCKDEP_WARN(), and a final patch remove
> rcu_lockdep_assert(). Let me know!

One big patch is perfect I think - it's a rename and a relatively mechanical
inversion of conditions, no point in splitting it up any more IMHO. (But it's your
call really.)

So I had a quick look at this patch, and IMHO the RCU_LOCKDEP_WARN() lines read a
lot more 'naturally', because the new, inverted conditions now highlight buggy
scenarios - which has the same logic parity as the kernel's historic
WARN_ON()/BUG_ON() patterns:

Reviewed-by: Ingo Molnar <[email protected]>

This one looked a bit weird:

> index a0a0dd03c73a..47268fb1d27b 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -589,8 +589,8 @@ EXPORT_SYMBOL_GPL(call_rcu_tasks);
> void synchronize_rcu_tasks(void)
> {
> /* Complain if the scheduler has not started. */
> - rcu_lockdep_assert(!rcu_scheduler_active,
> - "synchronize_rcu_tasks called too soon");
> + RCU_LOCKDEP_WARN(rcu_scheduler_active,
> + "synchronize_rcu_tasks called too soon");
>

So I'd assume that a flag called 'rcu_scheduler_active' would be 1 if the RCU
scheduler is active.

So why do we warn on it being active? Shouldn't the condition be:

RCU_LOCKDEP_WARN(!rcu_scheduler_active,
"synchronize_rcu_tasks called too soon");

I.e. we warn when the RCU scheduler is not yet active and we called
synchronize_rcu_tasks() too soon?

So either the original assert() was wrong, or I'm missing something obvious?

Thanks,

Ingo

2015-06-30 16:16:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state

On Tue, Jun 30, 2015 at 01:04:14PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > > Yeah, and inverting the condition. Assuming the condition was assert()-style
> > > inverted to begin with! :-)
> >
> > It appears to have been. ;-)
> >
> > Please see below for an untested patch. I made this be one big patch, but could
> > have one patch add RCU_LOCKDEP_WARN(), a series convert uses from
> > rcu_lockdep_assert() to RCU_LOCKDEP_WARN(), and a final patch remove
> > rcu_lockdep_assert(). Let me know!
>
> One big patch is perfect I think - it's a rename and a relatively mechanical
> inversion of conditions, no point in splitting it up any more IMHO. (But it's your
> call really.)
>
> So I had a quick look at this patch, and IMHO the RCU_LOCKDEP_WARN() lines read a
> lot more 'naturally', because the new, inverted conditions now highlight buggy
> scenarios - which has the same logic parity as the kernel's historic
> WARN_ON()/BUG_ON() patterns:
>
> Reviewed-by: Ingo Molnar <[email protected]>

Thank you, added!

> This one looked a bit weird:
>
> > index a0a0dd03c73a..47268fb1d27b 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -589,8 +589,8 @@ EXPORT_SYMBOL_GPL(call_rcu_tasks);
> > void synchronize_rcu_tasks(void)
> > {
> > /* Complain if the scheduler has not started. */
> > - rcu_lockdep_assert(!rcu_scheduler_active,
> > - "synchronize_rcu_tasks called too soon");
> > + RCU_LOCKDEP_WARN(rcu_scheduler_active,
> > + "synchronize_rcu_tasks called too soon");
> >
>
> So I'd assume that a flag called 'rcu_scheduler_active' would be 1 if the RCU
> scheduler is active.
>
> So why do we warn on it being active? Shouldn't the condition be:
>
> RCU_LOCKDEP_WARN(!rcu_scheduler_active,
> "synchronize_rcu_tasks called too soon");
>
> I.e. we warn when the RCU scheduler is not yet active and we called
> synchronize_rcu_tasks() too soon?
>
> So either the original assert() was wrong, or I'm missing something obvious?

You are missing nothing! But I really do test this stuff...

Ah, I see... I need the following patch in order to enable lockdep-RCU
on one of my RCU-tasks rcutorture scenarios... :-/

Good catch! There were at least three bugs hiding behind that one! ;-)

Thanx, Paul

------------------------------------------------------------------------

commit dc883f1668c83f9525a13ee1b3cd45e9e85a0fe5
Author: Paul E. McKenney <[email protected]>
Date: Tue Jun 30 09:14:01 2015 -0700

rcutorture: Enable lockdep-RCU on TASKS01

Currently none of the RCU-tasks scenarios enables lockdep-RCU, which
causes bugs to be missed. This commit therefore enables lockdep-RCU
on TASKS01.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS01 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
index 2cc0e60eba6e..bafe94cbd739 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
@@ -5,6 +5,6 @@ CONFIG_PREEMPT_NONE=n
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=y
CONFIG_DEBUG_LOCK_ALLOC=y
-CONFIG_PROVE_LOCKING=n
-#CHECK#CONFIG_PROVE_RCU=n
+CONFIG_PROVE_LOCKING=y
+#CHECK#CONFIG_PROVE_RCU=y
CONFIG_RCU_EXPERT=y