2015-05-08 17:58:56

by Chris Metcalf

Subject: [PATCH 0/6] support "dataplane" mode for nohz_full

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
they add no overhead to the usual non-nohz_full mode, and only very
small overhead to the typical nohz_full mode. A prctl() option
(PR_SET_DATAPLANE) is added to control whether processes have requested
these stricter semantics, and within that prctl() option we provide a
number of different bits for more precise control. Additionally, we add
a new command-line boot argument to make it easier to debug where
unexpected interrupts are being delivered from.
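
As a usage sketch (written against the uapi added by this series; the
fallback #defines are only needed until the new constants reach the
installed headers), a thread already affinitized to a nohz_full core
would opt in like this:

#include <sys/prctl.h>

#ifndef PR_SET_DATAPLANE
#define PR_SET_DATAPLANE	47	/* from <uapi/linux/prctl.h> in this series */
#define PR_DATAPLANE_ENABLE	(1 << 0)
#endif

/* Call from a thread already pinned to a nohz_full core. */
static int enter_dataplane(void)
{
	return prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE, 0, 0, 0);
}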

Conceptually similar code, known as Zero-Overhead Linux, has been in
use in Tilera's Multicore Development Environment since 2008 and has
seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on dataplane cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive on the other, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2,
in turn based on 4.1-rc1) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (6):
nohz_full: add support for "dataplane" mode
nohz: dataplane: allow tick to be fully disabled for dataplane
dataplane nohz: run softirqs synchronously on user entry
nohz: support PR_DATAPLANE_QUIESCE
nohz: support PR_DATAPLANE_STRICT mode
nohz: add dataplane_debug boot flag

Documentation/kernel-parameters.txt | 6 ++
arch/tile/mm/homecache.c | 5 +-
include/linux/sched.h | 3 +
include/linux/tick.h | 12 ++++
include/uapi/linux/prctl.h | 8 +++
kernel/context_tracking.c | 3 +
kernel/irq_work.c | 4 +-
kernel/sched/core.c | 18 ++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 15 ++++-
kernel/sys.c | 8 +++
kernel/time/tick-sched.c | 112 +++++++++++++++++++++++++++++++++++-
13 files changed, 198 insertions(+), 5 deletions(-)

--
2.1.2


2015-05-08 17:59:05

by Chris Metcalf

Subject: [PATCH 1/6] nohz_full: add support for "dataplane" mode

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The dataplane state is indicated by setting a new task struct
field, dataplane_flags, to the value passed by prctl(). When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_dataplane_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

For this first patch, the only action taken is to call
lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core.

Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/sched.h | 3 +++
include/linux/tick.h | 10 ++++++++++
include/uapi/linux/prctl.h | 5 +++++
kernel/context_tracking.c | 3 +++
kernel/sys.c | 8 ++++++++
kernel/time/tick-sched.c | 13 +++++++++++++
6 files changed, 42 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..3680aa07c9ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
#endif
+#ifdef CONFIG_NO_HZ_FULL
+ unsigned int dataplane_flags;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..d191cda9b71a 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
#include <linux/context_tracking_state.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
+#include <linux/prctl.h>

#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}

+static inline bool tick_nohz_is_dataplane(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->dataplane_flags & PR_DATAPLANE_ENABLE);
+}
+
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_dataplane_enter(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick(void) { }
static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_dataplane(void) { return false; }
+static inline void tick_nohz_dataplane_enter(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..1aa8fa8a8b05 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query dataplane mode for NO_HZ_FULL kernels. */
+#define PR_SET_DATAPLANE 47
+#define PR_GET_DATAPLANE 48
+# define PR_DATAPLANE_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..dd6bdd6197b6 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/tick.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (tick_nohz_is_dataplane())
+ tick_nohz_dataplane_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..930b750aefde 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_NO_HZ_FULL
+ case PR_SET_DATAPLANE:
+ me->dataplane_flags = arg2;
+ break;
+ case PR_GET_DATAPLANE:
+ error = me->dataplane_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..31c674719647 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/swap.h>

#include <asm/irq_regs.h>

@@ -389,6 +390,18 @@ void __init tick_nohz_init(void)
pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
cpumask_pr_args(tick_nohz_full_mask));
}
+
+/*
+ * When returning to userspace on a nohz_full core after doing
+ * prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE), we come here and try
+ * more aggressively to prevent this core from being interrupted later.
+ */
+void tick_nohz_dataplane_enter(void)
+{
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+}
+
#endif

/*
--
2.1.2

2015-05-08 17:59:07

by Chris Metcalf

Subject: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.

This was previously discussed in

https://lkml.org/lkml/2014/10/31/364

and Thomas Gleixner observed that vruntime, load balancing data,
load accounting, and other things might be impacted. Frederic
Weisbecker similarly observed that allowing the tick to be indefinitely
deferred just meant that no one would ever fix the underlying bugs.
However it's at least true that the mode proposed in this patch can
only be enabled on an isolcpus core, which may limit how important
it is to maintain scheduler data correctly, for example.

It's also worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2005) and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc). So these semantics are very
useful if we can convince ourselves that doing this is safe.

Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 31c674719647..25fdd6bdd1eb 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -644,7 +644,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
}

#ifdef CONFIG_NO_HZ_FULL
- if (!ts->inidle) {
+ if (!ts->inidle && !tick_nohz_is_dataplane()) {
time_delta = min(time_delta,
scheduler_tick_max_deferment());
}
--
2.1.2

2015-05-08 17:59:12

by Chris Metcalf

Subject: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

For tasks which have elected dataplane functionality, we run
any pending softirqs for the core before returning to userspace,
rather than ever scheduling ksoftirqd to run. The problem we
fix is that by allowing another task to run on the core, we
guarantee more interrupts in the future to the dataplane task,
which is exactly what dataplane mode is required to prevent.

This may be an alternate approach to what Mike Galbraith
recently proposed in e.g.:

https://lkml.org/lkml/2015/3/13/11

Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/softirq.c | 14 +++++++++++++-
kernel/time/tick-sched.c | 26 +++++++++++++++++++++++++-
2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..bc9406337f82 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -291,6 +291,15 @@ restart:
--max_restart)
goto restart;

+ /*
+ * For dataplane tasks, waking ksoftirqd because the
+ * softirqs are slow is a bad idea; we would rather
+ * synchronously finish whatever is interrupting us,
+ * and then be able to cleanly enter dataplane mode.
+ */
+ if (tick_nohz_is_dataplane())
+ goto restart;
+
wakeup_softirqd();
}

@@ -410,8 +419,11 @@ inline void raise_softirq_irqoff(unsigned int nr)
*
* Otherwise we wake up ksoftirqd to make sure we
* schedule the softirq soon.
+ *
+ * For dataplane tasks, we will handle the softirq
+ * synchronously on return to userspace.
*/
- if (!in_interrupt())
+ if (!in_interrupt() && !tick_nohz_is_dataplane())
wakeup_softirqd();
}

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 25fdd6bdd1eb..fd0e6e5c931c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -398,8 +398,26 @@ void __init tick_nohz_init(void)
*/
void tick_nohz_dataplane_enter(void)
{
+ /*
+ * Check for softirqs as close as possible to our return to
+ * userspace, and run any that are waiting. We need to ensure
* that we can safely avoid running ksoftirqd, which would cause
+ * interrupts for nohz_full tasks. Note that interrupts may
+ * be enabled internally by do_softirq().
+ */
+ do_softirq();
+
/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
lru_add_drain();
+
+ /*
+ * Disable interrupts again since other code running in this
+ * function may have enabled them, and the caller expects
+ * interrupts to be disabled on return. Enabling them during
+ * this call is safe since the caller is not assuming any
+ * state that might have been altered by an interrupt.
+ */
+ local_irq_disable();
}

#endif
@@ -771,7 +789,13 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
if (need_resched())
return false;

- if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
+ /*
+ * If we are running dataplane for this process, don't worry
+ * about pending softirqs; we will force them to run
+ * synchronously before returning to userspace.
+ */
+ if (unlikely(local_softirq_pending() && cpu_online(cpu) &&
+ !tick_nohz_is_dataplane())) {
static int ratelimit;

if (ratelimit < 10 &&
--
2.1.2

2015-05-08 18:00:04

by Chris Metcalf

Subject: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
kernel to quiesce any pending timer interrupts prior to returning
to userspace. When running with this mode set, system calls (and page
faults, etc.) can be inordinately slow. However, user applications
that want to guarantee that no unexpected interrupts will occur
(even if they call into the kernel) can set this flag to guarantee
those semantics.
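
As a sketch of the intended usage (the flag names are from this
series; poll_device_rings() stands in for the application's real
fast-path work):

	/* Opt in: from now on, each return to userspace first waits
	 * for any pending timer interrupts to quiesce.
	 */
	prctl(PR_SET_DATAPLANE,
	      PR_DATAPLANE_ENABLE | PR_DATAPLANE_QUIESCE, 0, 0, 0);

	for (;;)
		poll_device_rings();	/* hypothetical userspace driver loop */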

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 1 +
kernel/time/tick-sched.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 1aa8fa8a8b05..8b735651304a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_DATAPLANE 47
#define PR_GET_DATAPLANE 48
# define PR_DATAPLANE_ENABLE (1 << 0)
+# define PR_DATAPLANE_QUIESCE (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fd0e6e5c931c..69d908c6cef8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -392,6 +392,53 @@ void __init tick_nohz_init(void)
}

/*
+ * We normally return immediately to userspace.
+ *
+ * The PR_DATAPLANE_QUIESCE flag causes us to wait until no more
+ * interrupts are pending. Otherwise we nap with interrupts enabled
+ * and wait for the next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two processes on the same core and both
+ * specify PR_DATAPLANE_QUIESCE, neither will ever leave the kernel,
+ * and one will have to be killed manually. Otherwise in situations
+ * where another process is in the runqueue on this cpu, this task
+ * will just wait for that other task to go idle before returning to
+ * user space.
+ */
+static void dataplane_quiesce(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: dataplane task blocked for %ld jiffies\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start));
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+
+ /* Idle with interrupts enabled and wait for the tick. */
+ set_current_state(TASK_INTERRUPTIBLE);
+ arch_cpu_idle();
+ set_current_state(TASK_RUNNING);
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: dataplane task unblocked after %ld jiffies\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start));
+ dump_stack();
+ }
+}
+
+/*
* When returning to userspace on a nohz_full core after doing
* prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE), we come here and try
* more aggressively to prevent this core from being interrupted later.
@@ -411,6 +458,13 @@ void tick_nohz_dataplane_enter(void)
lru_add_drain();

/*
+ * Quiesce any timer ticks if requested. On return from this
+ * function, no timer ticks are pending.
+ */
+ if ((current->dataplane_flags & PR_DATAPLANE_QUIESCE) != 0)
+ dataplane_quiesce();
+
+ /*
* Disable interrupts again since other code running in this
* function may have enabled them, and the caller expects
* interrupts to be disabled on return. Enabling them during
--
2.1.2

2015-05-08 17:59:19

by Chris Metcalf

Subject: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

With QUIESCE mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves. In particular,
if it enters the kernel via system call, page fault, or any of
a number of other synchronous traps, it may be unexpectedly
exposed to long latencies. Add a simple flag that puts the process
into a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we add an internal
bit to current->dataplane_flags that is set when prctl() sets the
flags. That way, when we are exiting the kernel after calling
prctl() to forbid future kernel entries, we don't get immediately
killed.
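
As a sketch of the intended usage (flag names are from this series;
the setup and fast-path functions are hypothetical):

	/* Setup phase: kernel entries are still allowed here. */
	map_device_buffers();

	/* After the next return to userspace, any system call, page
	 * fault, or other synchronous kernel entry is fatal.
	 */
	prctl(PR_SET_DATAPLANE,
	      PR_DATAPLANE_ENABLE | PR_DATAPLANE_QUIESCE |
	      PR_DATAPLANE_STRICT, 0, 0, 0);

	run_fast_path();	/* must never enter the kernel */

	/* Leaving strict mode is itself a syscall, but it survives
	 * because prctl() updates the flags (and sets the internal
	 * PR_DATAPLANE_PRCTL bit) before the exit-path check runs.
	 */
	prctl(PR_SET_DATAPLANE,
	      PR_DATAPLANE_ENABLE | PR_DATAPLANE_QUIESCE, 0, 0, 0);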

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/sys.c | 2 +-
kernel/time/tick-sched.c | 17 +++++++++++++++++
3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 8b735651304a..9cf79aa1e73f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_DATAPLANE 48
# define PR_DATAPLANE_ENABLE (1 << 0)
# define PR_DATAPLANE_QUIESCE (1 << 1)
+# define PR_DATAPLANE_STRICT (1 << 2)
+# define PR_DATAPLANE_PRCTL (1U << 31) /* kernel internal */

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 930b750aefde..8102433c9edd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2245,7 +2245,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
break;
#ifdef CONFIG_NO_HZ_FULL
case PR_SET_DATAPLANE:
- me->dataplane_flags = arg2;
+ me->dataplane_flags = arg2 | PR_DATAPLANE_PRCTL;
break;
case PR_GET_DATAPLANE:
error = me->dataplane_flags;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 69d908c6cef8..22ed0decb363 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
(jiffies - start));
dump_stack();
}
+
+ /*
+ * Kill the process if it violates STRICT mode. Note that this
+ * code also results in killing the task if a kernel bug causes an
+ * irq to be delivered to this core.
+ */
+ if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
+ == PR_DATAPLANE_STRICT) {
+ pr_warn("Dataplane STRICT mode violated; process killed.\n");
+ dump_stack();
+ task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
+ local_irq_enable();
+ do_group_exit(SIGKILL);
+ }
}

/*
@@ -464,6 +478,9 @@ void tick_nohz_dataplane_enter(void)
if ((current->dataplane_flags & PR_DATAPLANE_QUIESCE) != 0)
dataplane_quiesce();

+ /* Clear the bit set by prctl() when it updates the flags. */
+ current->dataplane_flags &= ~PR_DATAPLANE_PRCTL;
+
/*
* Disable interrupts again since other code running in this
* function may have enabled them, and the caller expects
--
2.1.2

2015-05-08 17:59:27

by Chris Metcalf

Subject: [PATCH 6/6] nohz: add dataplane_debug boot flag

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_DATAPLANE_QUIESCE mode. Such processes should
get no interrupts from the kernel; if they do, and this boot flag
is specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a dataplane core
has unexpectedly entered the kernel. But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a dataplane core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
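
For example, a box dedicating cores 2-15 to dataplane tasks might boot
with something like the following (the cpu list is illustrative):

	nohz_full=2-15 isolcpus=2-15 dataplane_debug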

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 6 ++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/tick.h | 2 ++
kernel/irq_work.c | 4 +++-
kernel/sched/core.c | 18 ++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 1 +
8 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f6befa9855c1..5c5af5258e17 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -794,6 +794,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
dasd= [HW,NET]
See header of drivers/s390/block/dasd_devmap.c.

+ dataplane_debug [KNL]
+ In kernels built with CONFIG_NO_HZ_FULL and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_DATAPLANE_QUIESCE.
+
db9.dev[2|3]= [HW,JOY] Multisystem joystick support via parallel port
(one device per port)
Format: <port#>,<type>
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..dd5ec7eca9a8 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/tick.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ tick_nohz_dataplane_debug(cpu);
+ }
}

/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index d191cda9b71a..4610cdf0f972 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,6 +147,7 @@ extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_dataplane_enter(void);
+extern void tick_nohz_dataplane_debug(int cpu);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -157,6 +158,7 @@ static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
static inline bool tick_nohz_is_dataplane(void) { return false; }
static inline void tick_nohz_dataplane_enter(void) { }
+static inline void tick_nohz_dataplane_debug(int cpu) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..0adc53c4e899 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ tick_nohz_dataplane_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..202fab0c41cb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,24 @@ bool sched_can_stop_tick(void)

return true;
}
+
+/* Enable debugging of any interrupts of dataplane cores. */
+static int dataplane_debug;
+static int __init dataplane_debug_func(char *str)
+{
+ dataplane_debug = true;
+ return 1;
+}
+__setup("dataplane_debug", dataplane_debug_func);
+
+void tick_nohz_dataplane_debug(int cpu)
+{
+ if (dataplane_debug && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->dataplane_flags & PR_DATAPLANE_QUIESCE)) {
+ pr_err("Interrupt detected for dataplane cpu %d\n", cpu);
+ dump_stack();
+ }
+}
#endif /* CONFIG_NO_HZ_FULL */

void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..ebc552cafff5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_NO_HZ_FULL
+ /* If the task is being killed, don't complain about dataplane. */
+ if (state & TASK_WAKEKILL)
+ t->dataplane_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..9518fc80321b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/tick.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ tick_nohz_dataplane_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ tick_nohz_dataplane_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bc9406337f82..eeacabf08ca6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -394,6 +394,7 @@ void irq_exit(void)
WARN_ON_ONCE(!irqs_disabled());
#endif

+ tick_nohz_dataplane_debug(smp_processor_id());
account_irq_exit_time(current);
preempt_count_sub(HARDIRQ_OFFSET);
if (!in_interrupt() && local_softirq_pending())
--
2.1.2

2015-05-08 21:18:28

by Andrew Morton

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:

> A prctl() option (PR_SET_DATAPLANE) is added

Dumb question: what does the term "dataplane" mean in this context? I
can't see the relationship between those words and what this patch
does.

2015-05-08 21:22:16

by Steven Rostedt

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Fri, 8 May 2015 14:18:24 -0700
Andrew Morton <[email protected]> wrote:

> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
>
> > A prctl() option (PR_SET_DATAPLANE) is added
>
> Dumb question: what does the term "dataplane" mean in this context? I
> can't see the relationship between those words and what this patch
> does.

I was thinking the same thing. I haven't gotten around to searching
DATAPLANE yet.

I would assume we want a name that is more meaningful for what is
happening.

-- Steve

2015-05-08 23:11:28

by Chris Metcalf

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> On Fri, 8 May 2015 14:18:24 -0700
> Andrew Morton <[email protected]> wrote:
>
>> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
>>
>>> A prctl() option (PR_SET_DATAPLANE) is added
>> Dumb question: what does the term "dataplane" mean in this context? I
>> can't see the relationship between those words and what this patch
>> does.
> I was thinking the same thing. I haven't gotten around to searching
> DATAPLANE yet.
>
> I would assume we want a name that is more meaningful for what is
> happening.

The text in the commit message and the 0/6 cover letter do try to explain
the concept. The terminology comes, I think, from networking line cards,
where the "dataplane" is the part of the application that handles all the
fast path processing of network packets, and the "control plane" is the part
that handles routing updates, etc., generally slow-path stuff. I've probably
just been using the terms so long they seem normal to me.

That said, what would be clearer? NO_HZ_STRICT as a superset of
NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all,
we're talking about no interrupts of any kind, and maybe NO_HZ is too
limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
to vendors who ship bare-metal runtimes and call it BARE_METAL?
Borrow the Tilera marketing name and call it ZERO_OVERHEAD?

Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
of course :-)

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-08 23:19:15

by Andrew Morton

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <[email protected]> wrote:

> On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > On Fri, 8 May 2015 14:18:24 -0700
> > Andrew Morton <[email protected]> wrote:
> >
> >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
> >>
> >>> A prctl() option (PR_SET_DATAPLANE) is added
> >> Dumb question: what does the term "dataplane" mean in this context? I
> >> can't see the relationship between those words and what this patch
> >> does.
> > I was thinking the same thing. I haven't gotten around to searching
> > DATAPLANE yet.
> >
> > I would assume we want a name that is more meaningful for what is
> > happening.
>
> The text in the commit message and the 0/6 cover letter do try to explain
> the concept. The terminology comes, I think, from networking line cards,
> where the "dataplane" is the part of the application that handles all the
> fast path processing of network packets, and the "control plane" is the part
> that handles routing updates, etc., generally slow-path stuff. I've probably
> just been using the terms so long they seem normal to me.
>
> That said, what would be clearer? NO_HZ_STRICT as a superset of
> NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all,
> we're talking about no interrupts of any kind, and maybe NO_HZ is too
> limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
> to vendors who ship bare-metal runtimes and call it BARE_METAL?
> Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
>
> Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> of course :-)

I like NO_INTERRUPTS. Simple, direct.

2015-05-09 07:04:09

by Mike Galbraith

Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
> For tasks which have elected dataplane functionality, we run
> any pending softirqs for the core before returning to userspace,
> rather than ever scheduling ksoftirqd to run. The problem we
> fix is that by allowing another task to run on the core, we
> guarantee more interrupts in the future to the dataplane task,
> which is exactly what dataplane mode is required to prevent.

If ksoftirqd were rt class, softirqs would be gone when the soloist gets
the CPU back and heads to userspace. Being a soloist, it has no use for
a priority, so why can't it just let ksoftirqd run if it raises the
occasional softirq? Meeting a contended lock while processing it will
wreck the soloist regardless of who does that processing.

-Mike

2015-05-09 07:05:46

by Ingo Molnar

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full


* Andrew Morton <[email protected]> wrote:

> On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <[email protected]> wrote:
>
> > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > > On Fri, 8 May 2015 14:18:24 -0700
> > > Andrew Morton <[email protected]> wrote:
> > >
> > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
> > >>
> > >>> A prctl() option (PR_SET_DATAPLANE) is added
> > >> Dumb question: what does the term "dataplane" mean in this context? I
> > >> can't see the relationship between those words and what this patch
> > >> does.
> > > I was thinking the same thing. I haven't gotten around to searching
> > > DATAPLANE yet.
> > >
> > > I would assume we want a name that is more meaningful for what is
> > > happening.
> >
> > The text in the commit message and the 0/6 cover letter do try to explain
> > the concept. The terminology comes, I think, from networking line cards,
> > where the "dataplane" is the part of the application that handles all the
> > fast path processing of network packets, and the "control plane" is the part
> > that handles routing updates, etc., generally slow-path stuff. I've probably
> > just been using the terms so long they seem normal to me.
> >
> > That said, what would be clearer? NO_HZ_STRICT as a superset of
> > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all,
> > we're talking about no interrupts of any kind, and maybe NO_HZ is too
> > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
> > to vendors who ship bare-metal runtimes and call it BARE_METAL?
> > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
> >
> > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> > of course :-)

'baremetal' has uses in virtualization speak, so I think that would be
confusing.

> I like NO_INTERRUPTS. Simple, direct.

NO_HZ_PURE?

That's what it's really about: user-space wants to run exclusively, in
pure user-mode, without any interrupts.

So I don't like 'NO_HZ_NO_INTERRUPTS', for several reasons:

- It is similar to a term we use in perf: PERF_PMU_CAP_NO_INTERRUPT.

- Another reason is that 'NO_INTERRUPTS', in most existing uses in
the kernel, relates to some sort of hardware weakness or
limitation, a negative property: we try to limp along without
having a hardware interrupt and have to poll. Other driver code
that uses variants of NO_INTERRUPT appears to be similar. So I
think there's some potential for confusion here.

- Here the fact that we don't disturb user-space is an absolutely
positive property, not a limitation, a kernel feature we work hard
to achieve. NO_HZ_PURE would convey that while NO_HZ_NO_INTERRUPTS
wouldn't.

- NO_HZ_NO_INTERRUPTS has a double negation, and it's also too long,
compared to NO_HZ_FULL or NO_HZ_PURE ;-) The term 'no HZ' already
expresses that we don't have periodic interruptions. We just
duplicate that information with NO_HZ_NO_INTERRUPTS, while
NO_HZ_FULL or NO_HZ_PURE qualifies it, makes it a stronger
property - which is what we want I think.

So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep
it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be
such a 'zero overhead' mode of operation, where if user-space runs, it
won't get interrupted in any way.

There's no need to add yet another Kconfig variant - let's just enhance
the current stuff and maybe rename it to NO_HZ_PURE to better express
its intent.

Thanks,

Ingo

2015-05-09 07:19:31

by Andy Lutomirski

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Sat, May 9, 2015 at 12:05 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
>> On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <[email protected]> wrote:
>>
>> > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
>> > > On Fri, 8 May 2015 14:18:24 -0700
>> > > Andrew Morton <[email protected]> wrote:
>> > >
>> > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
>> > >>
>> > >>> A prctl() option (PR_SET_DATAPLANE) is added
>> > >> Dumb question: what does the term "dataplane" mean in this context? I
>> > >> can't see the relationship between those words and what this patch
>> > >> does.
>> > > I was thinking the same thing. I haven't gotten around to searching
>> > > DATAPLANE yet.
>> > >
>> > > I would assume we want a name that is more meaningful for what is
>> > > happening.
>> >
>> > The text in the commit message and the 0/6 cover letter do try to explain
>> > the concept. The terminology comes, I think, from networking line cards,
>> > where the "dataplane" is the part of the application that handles all the
>> > fast path processing of network packets, and the "control plane" is the part
>> > that handles routing updates, etc., generally slow-path stuff. I've probably
>> > just been using the terms so long they seem normal to me.
>> >
>> > That said, what would be clearer? NO_HZ_STRICT as a superset of
>> > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all,
>> > we're talking about no interrupts of any kind, and maybe NO_HZ is too
>> > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
>> > to vendors who ship bare-metal runtimes and call it BARE_METAL?
>> > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
>> >
>> > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
>> > of course :-)
>
> 'baremetal' has uses in virtualization speak, so I think that would be
> confusing.
>
>> I like NO_INTERRUPTS. Simple, direct.
>
> NO_HZ_PURE?
>

Naming aside, I don't think this should be a per-task flag at all. We
already have way too much overhead per syscall in nohz mode, and it
would be nice to get the per-syscall overhead as low as possible. We
should strive, for all tasks, to keep syscall overhead down *and*
avoid as many interrupts as possible.

That being said, I do see a legitimate use for a way to tell the
kernel "I'm going to run in userspace for a long time; stay away".
But shouldn't that be a single operation, not an ongoing flag? IOW, I
think that we should have a new syscall quiesce() or something rather
than a prctl.

--Andy

2015-05-09 07:19:51

by Mike Galbraith

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Sat, 2015-05-09 at 09:05 +0200, Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
>
> > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <[email protected]> wrote:
> >
> > > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > > > On Fri, 8 May 2015 14:18:24 -0700
> > > > Andrew Morton <[email protected]> wrote:
> > > >
> > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
> > > >>
> > > >>> A prctl() option (PR_SET_DATAPLANE) is added
> > > >> Dumb question: what does the term "dataplane" mean in this context? I
> > > >> can't see the relationship between those words and what this patch
> > > >> does.
> > > > I was thinking the same thing. I haven't gotten around to searching
> > > > DATAPLANE yet.
> > > >
> > > > I would assume we want a name that is more meaningful for what is
> > > > happening.
> > >
> > > The text in the commit message and the 0/6 cover letter do try to explain
> > > the concept. The terminology comes, I think, from networking line cards,
> > > where the "dataplane" is the part of the application that handles all the
> > > fast path processing of network packets, and the "control plane" is the part
> > > that handles routing updates, etc., generally slow-path stuff. I've probably
> > > just been using the terms so long they seem normal to me.
> > >
> > > That said, what would be clearer? NO_HZ_STRICT as a superset of
> > > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all,
> > > we're talking about no interrupts of any kind, and maybe NO_HZ is too
> > > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
> > > to vendors who ship bare-metal runtimes and call it BARE_METAL?
> > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
> > >
> > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> > > of course :-)
>
> 'baremetal' has uses in virtualization speak, so I think that would be
> confusing.
>
> > I like NO_INTERRUPTS. Simple, direct.
>
> NO_HZ_PURE?

Hm, coke light, coke zero... OS_LIGHT and OS_ZERO?

-Mike

2015-05-09 07:29:10

by Andy Lutomirski

Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On May 8, 2015 11:44 PM, "Chris Metcalf" <[email protected]> wrote:
>
> With QUIESCE mode, the task is in principle guaranteed not to be
> interrupted by the kernel, but only if it behaves. In particular,
> if it enters the kernel via system call, page fault, or any of
> a number of other synchronous traps, it may be unexpectedly
> exposed to long latencies. Add a simple flag that puts the process
> into a state where any such kernel entry is fatal.
>
> To allow the state to be entered and exited, we add an internal
> bit to current->dataplane_flags that is set when prctl() sets the
> flags. That way, when we are exiting the kernel after calling
> prctl() to forbid future kernel exits, we don't get immediately
> killed.

Is there any reason this can't already be addressed in userspace using
/proc/interrupts or perf_events? ISTM the real goal here is to detect
when we screw up and fail to avoid an interrupt, and killing the task
seems like overkill to me.
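
For instance, a monitoring thread could diff a CPU's column of
/proc/interrupts around the critical section. A rough sketch (note
that single-count lines like ERR and MIS get lumped into CPU 0's
total):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum the /proc/interrupts counts for one CPU's column. */
static unsigned long long irq_total(int cpu)
{
	FILE *f = fopen("/proc/interrupts", "r");
	char line[4096];
	unsigned long long total = 0;

	if (!f)
		return 0;
	fgets(line, sizeof(line), f);	/* skip the "CPU0 CPU1 ..." header */
	while (fgets(line, sizeof(line), f)) {
		char *p = strchr(line, ':');
		int col;

		if (!p)
			continue;
		p++;
		for (col = 0; col <= cpu; col++) {
			char *end;
			unsigned long long v = strtoull(p, &end, 10);

			if (end == p)
				break;	/* ran into the description text */
			if (col == cpu)
				total += v;
			p = end;
		}
	}
	fclose(f);
	return total;
}

/* Snapshot irq_total(cpu) before and after the quiet period; any
 * delta means the core took an interrupt.
 */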

Also, can we please stop further torturing the exit paths? We have a
disaster of assembly code that calls into syscall_trace_leave and
do_notify_resume. Those functions, in turn, *both* call user_enter
(WTF?), and on very brief inspection user_enter makes it into the nohz
code through multiple levels of indirection, which, with these
patches, has yet another conditionally enabled helper, which does this
new stuff. It's getting to be impossible to tell what happens when we
exit to user space any more.

Also, I think your code is buggy. There's no particular guarantee
that user_enter is only called once between sys_prctl and the final
exit to user mode (see the above WTF), so you might spuriously kill
the process.

Also, I think that most users will be quite surprised if "strict
dataplane" code causes any machine check on the system to kill your
dataplane task. Similarly, a user accidentally running perf record -a
probably should have some reasonable semantics. /proc/interrupts gets
that right as is. Sure, MCEs will hurt your RT performance, but Intel
screwed up the way that MCEs work, so we should make do.

--Andy

2015-05-09 10:51:31

by Gilad Ben Yossef

Subject: RE: [PATCH 0/6] support "dataplane" mode for nohz_full


> From: Mike Galbraith [mailto:[email protected]]
> Sent: Saturday, May 09, 2015 10:20 AM
> To: Ingo Molnar
> Cc: Andrew Morton; Chris Metcalf; Steven Rostedt; Gilad Ben Yossef; Ingo
> Molnar; Peter Zijlstra; Rik van Riel; Tejun Heo; Frederic Weisbecker;
> Thomas Gleixner; Paul E. McKenney; Christoph Lameter; Srivatsa S. Bhat;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full
>
> On Sat, 2015-05-09 at 09:05 +0200, Ingo Molnar wrote:
> > * Andrew Morton <[email protected]> wrote:
> >
> > > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <[email protected]>
> wrote:
> > >
> > > > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > > > > On Fri, 8 May 2015 14:18:24 -0700
> > > > > Andrew Morton <[email protected]> wrote:
> > > > >
> > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf
> <[email protected]> wrote:
> > > > >>
> > > > >>> A prctl() option (PR_SET_DATAPLANE) is added
> > > > >> Dumb question: what does the term "dataplane" mean in this
> context? I
> > > > >> can't see the relationship between those words and what this
> patch
> > > > >> does.
> > > > > I was thinking the same thing. I haven't gotten around to
> searching
> > > > > DATAPLANE yet.
> > > > >
> > > > > I would assume we want a name that is more meaningful for what is
> > > > > happening.
> > > >
> > > > The text in the commit message and the 0/6 cover letter do try to
> explain
> > > > the concept. The terminology comes, I think, from networking line
> cards,
> > > > where the "dataplane" is the part of the application that handles
> all the
> > > > fast path processing of network packets, and the "control plane" is
> the part
> > > > that handles routing updates, etc., generally slow-path stuff. I've
> probably
> > > > just been using the terms so long they seem normal to me.
> > > >
> > > > That said, what would be clearer? NO_HZ_STRICT as a superset of
> > > > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after
> all,
> > > > we're talking about no interrupts of any kind, and maybe NO_HZ is
> too
> > > > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
> > > > to vendors who ship bare-metal runtimes and call it BARE_METAL?
> > > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
> > > >
> > > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> > > > of course :-)
> >
> > 'baremetal' has uses in virtualization speak, so I think that would be
> > confusing.
> >
> > > I like NO_INTERRUPTS. Simple, direct.
> >
> > NO_HZ_PURE?
>
> Hm, coke light, coke zero... OS_LIGHT and OS_ZERO?
LOL... you forgot OS_CLASSIC for backwards compatibility :-)
How about TASK_SOLO?
Yes, you are trying to achieve the least amount of interference, but
the bigger context is about monopolizing a single CPU for yourself.

Anyway, it is worth pointing out that while NO_HZ_FULL is very useful
in conjunction with this, turning the tick off is also useful when you
have multiple tasks runnable (e.g. if you know you only need to context
switch in 100 ms, why keep a periodic interrupt running?), even though
we don't support that *right now*. It might be a good idea not to
entangle these concepts too much.

Gilad
Gilad Ben-Yossef
Chief Software Architect
EZchip Technologies Ltd.
37 Israel Pollak Ave, Kiryat Gat 82025, Israel
Tel: +972-4-959-6666 ext. 576, Fax: +972-8-681-1483
Mobile: +972-52-826-0388, US Mobile: +1-973-826-0388
Email: [email protected], Web: http://www.ezchip.com


2015-05-09 10:53:36

by Gilad Ben Yossef

Subject: RE: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

> From: Andy Lutomirski [mailto:[email protected]]
> Sent: Saturday, May 09, 2015 10:29 AM
> To: Chris Metcalf
> Cc: Srivatsa S. Bhat; Paul E. McKenney; Frederic Weisbecker; Ingo Molnar;
> Rik van Riel; [email protected]; Andrew Morton; linux-
> [email protected]; Thomas Gleixner; Tejun Heo; Peter Zijlstra; Steven
> Rostedt; Christoph Lameter; Gilad Ben Yossef; Linux API
> Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
>
> On May 8, 2015 11:44 PM, "Chris Metcalf" <[email protected]> wrote:
> >
> > With QUIESCE mode, the task is in principle guaranteed not to be
> > interrupted by the kernel, but only if it behaves. In particular,
> > if it enters the kernel via system call, page fault, or any of
> > a number of other synchronous traps, it may be unexpectedly
> > exposed to long latencies. Add a simple flag that puts the process
> > into a state where any such kernel entry is fatal.
> >
> > To allow the state to be entered and exited, we add an internal
> > bit to current->dataplane_flags that is set when prctl() sets the
> > flags. That way, when we are exiting the kernel after calling
> > prctl() to forbid future kernel exits, we don't get immediately
> > killed.
>
> Is there any reason this can't already be addressed in userspace using
> /proc/interrupts or perf_events? ISTM the real goal here is to detect
> when we screw up and fail to avoid an interrupt, and killing the task
> seems like overkill to me.
>
> Also, can we please stop further torturing the exit paths?
So, I don't know if it is a practical suggestion or not, but would it
be better/easier to mark a pending signal on kernel entry for this case?

The upsides I see are that the user gets her notification (killing the
task or just logging the event in a signal handler) and, hopefully,
since return to userspace with a pending signal is already handled, we
don't need new code in the exit path?

Gilad

2015-05-11 12:58:07

by Steven Rostedt

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full


NO_HZ_LEAVE_ME_THE_FSCK_ALONE!


On Sat, 9 May 2015 09:05:38 +0200
Ingo Molnar <[email protected]> wrote:

> So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep
> it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be
> such a 'zero overhead' mode of operation, where if user-space runs, it
> won't get interrupted in any way.


All kidding aside, I think this is the real answer. We don't need a new
NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
what it was created to do. That should be fixed.

Please lets get NO_HZ_FULL up to par. That should be the main focus.

-- Steve

2015-05-11 15:36:11

by Frederic Weisbecker

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
>
> NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
>
>
> On Sat, 9 May 2015 09:05:38 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep
> > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be
> > such a 'zero overhead' mode of operation, where if user-space runs, it
> > won't get interrupted in any way.
>
>
> All kidding aside, I think this is the real answer. We don't need a new
> NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> what it was created to do. That should be fixed.
>
> Please lets get NO_HZ_FULL up to par. That should be the main focus.

Now if we can manage to make NO_HZ_FULL behave in a specific way
that fits everyone's usecase, I'll be happy.

But some people may expect hard isolation requirements (Real Time, deterministic
latency) and others softer isolation (HPC, only interested in performance, can
live with a rare random tick, so no need to loop before returning to userspace
until we have the no-noise guarantee).

I expect some Real Time users may want this kind of dataplane mode where a syscall
or whatever sleeps until the system is ready to provide the guarantee that no
disturbance is going to happen for a given time. I'm not sure HPC users are interested
in that.

In fact it goes along with the fact that NO_HZ_FULL was really only supposed to be
about the tick, and now people are introducing more and more kernel default presets
that assume NO_HZ_FULL implies ISOLATION, which is about all kinds of noise (tick,
tasks, irqs, ...). Which is true, but what kind of ISOLATION?

Probably NO_HZ_FULL should really only be about stopping the tick; then some sort
of CONFIG_ISOLATION would drive the kind of isolation we are interested in,
and thereby the behaviour of NO_HZ_FULL, workqueues, timers, task affinity,
irq affinity, dataplane mode, ...

2015-05-11 17:19:25

by Paul E. McKenney

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
>
> NO_HZ_LEAVE_ME_THE_FSCK_ALONE!

NO_HZ_OVERFLOWING?

Kconfig naming controversy aside, I believe this patchset is addressing
a real need. Might need additional adjustment, but something useful.

Thanx, Paul

> On Sat, 9 May 2015 09:05:38 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep
> > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be
> > such a 'zero overhead' mode of operation, where if user-space runs, it
> > won't get interrupted in any way.
>
>
> All kidding aside, I think this is the real answer. We don't need a new
> NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> what it was created to do. That should be fixed.
>
> Please lets get NO_HZ_FULL up to par. That should be the main focus.
>
> -- Steve
>

2015-05-11 17:27:48

by Andrew Morton

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <[email protected]> wrote:

> On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> >
> > NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
>
> NO_HZ_OVERFLOWING?

Actually, "NO_HZ" shouldn't appear in the name at all. The objective
is to permit userspace to execute without interruption. NO_HZ is a
part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical
artifact from an early partial implementation.

2015-05-11 17:33:14

by Frederic Weisbecker

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, May 11, 2015 at 10:27:44AM -0700, Andrew Morton wrote:
> On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <[email protected]> wrote:
>
> > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> > >
> > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
> >
> > NO_HZ_OVERFLOWING?
>
> Actually, "NO_HZ" shouldn't appear in the name at all. The objective
> is to permit userspace to execute without interruption. NO_HZ is a
> part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical
> artifact from an early partial implementation.

Agreed! Which is why I'd rather advocate in favour of CONFIG_ISOLATION.

2015-05-11 18:00:21

by Steven Rostedt

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, 11 May 2015 19:33:06 +0200
Frederic Weisbecker <[email protected]> wrote:

> On Mon, May 11, 2015 at 10:27:44AM -0700, Andrew Morton wrote:
> > On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <[email protected]> wrote:
> >
> > > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> > > >
> > > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
> > >
> > > NO_HZ_OVERFLOWING?
> >
> > Actually, "NO_HZ" shouldn't appear in the name at all. The objective
> > is to permit userspace to execute without interruption. NO_HZ is a
> > part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical
> > artifact from an early partial implementation.
>
> Agreed! Which is why I'd rather advocate in favour of CONFIG_ISOLATION.

Then we should have CONFIG_LEAVE_ME_THE_FSCK_ALONE. Hmm, I guess that's
just a synonym for CONFIG_ISOLATION.

-- Steve

2015-05-11 18:10:25

by Chris Metcalf

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

A bunch of issues have been raised by various folks (thanks!) and
I'll try to break them down and respond to them in a few different
emails. This email is just about the issue of naming and whether the
proposed patch series should even have its own "name" or just be part
of NO_HZ_FULL.

First, Ingo and Steven both suggested that this new "dataplane" mode
(or whatever we want to call it; see below) should just be rolled into
the existing NO_HZ_FULL and that we should focus on making that work
better.

Steven writes:
> All kidding aside, I think this is the real answer. We don't need a new
> NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> what it was created to do. That should be fixed.

The claim I'm making is that it's worthwhile to differentiate the two
semantics. Plain NO_HZ_FULL just says "kernel makes a best effort to
avoid periodic interrupts without incurring any serious overhead". My
patch series allows an app to request "kernel makes an absolute
commitment to avoid all interrupts regardless of cost when leaving
kernel space". These are different enough ideas, and serve different
enough application needs, that I think they should be kept distinct.

Frederic actually summed this up very nicely in his recent email when
he wrote "some people may expect hard isolation requirement (Real
Time, deterministic latency) and others softer isolation (HPC, only
interested in performance, can live with one rare random tick, so no
need to loop before returning to userspace until we have the no-noise
guarantee)."

So we need a way for apps to ask for the "harder" mode and let
the softer mode be the default.

What about naming? We may or may not want to have a Kconfig flag
for this, and we may or may not have a separate mode for it, but
we still will need some kind of name to talk about it with. (In
particular there's the prctl name, if we take that approach, and
potential boot command-line flags to consider naming for.)

I'll quickly cover the suggestions that have been raised:

- DATAPLANE. My suggestion, seemingly broadly disliked by folks
who felt it wasn't apparent what it meant. Probably a fair point.

- NO_INTERRUPTS (Andrew). Captures some of the sense, but was
criticized pretty fairly by Ingo as being too negative, confusing
with perf nomenclature, and too long :-)

- PURE (Ingo). Proposed as an alternative to NO_HZ_FULL, but we could
use it as a name for this new mode. However, I think it's not clear
enough how FULL and PURE can/should relate to each other from the
names alone.

- BARE_METAL (me). Ingo observes it's confusing with respect to
virtualization.

- TASK_SOLO (Gilad). Not sure this conveys enough of the semantics.

- OS_LIGHT/OS_ZERO and NO_HZ_LEAVE_ME_THE_FSCK_ALONE. Excellent
ideas :-)

- ISOLATION (Frederic). I like this but it conflicts with other uses
of "isolation" in the kernel: cgroup isolation, lru page isolation,
iommu isolation, scheduler isolation (at least it's a superset of
that one), etc. Also, we're not exactly isolating a task - often
a "dataplane" app consists of a bunch of interacting threads in
userspace, so not exactly isolated. So perhaps it's too confusing.

- OVERFLOWING (Steven) - not sure I understood this one, honestly.

I suggested earlier a few other candidates that I don't love, but no
one commented on: NO_HZ_STRICT, USERSPACE_ONLY, and ZERO_OVERHEAD.

One thing I'm leaning towards is to remove the intermediate state of
DATAPLANE_ENABLE and say that there is really only one primary state,
DATAPLANE_QUIESCE (or whatever we call it). The "dataplane but no
quiesce" state probably isn't that useful, since it doesn't offer the
hard guarantee that is the entire point of this patch series. So that
opens the idea of using the name NO_HZ_QUIESCE or just QUIESCE as the
word that describes the mode; of course this sort of conflicts with
RCU quiesce (though it is a superset of that so maybe that's OK).

One new idea I had is to use NO_HZ_HARD to reflect what Frederic was
suggesting about "soft" and "hard" requirements for NO_HZ. So
enabling NO_HZ_HARD would enable my suggested QUIESCE mode.

One way to focus this discussion is on the user API naming. I had
prctl(PR_SET_DATAPLANE), which was attractive in being a "positive"
noun. A lot of the other suggestions fail this test in various ways.
Reasonable candidates seem to be:

PR_SET_OS_ZERO
PR_SET_TASK_SOLO
PR_SET_ISOLATION

Another possibility:

PR_SET_NONSTOP

Or take Andrew's NO_INTERRUPTS and have:

PR_SET_UNINTERRUPTED

I slightly favor ISOLATION at this point despite the overlap with
other kernel concepts.
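
For concreteness, here is a sketch of what the userspace call might
look like if ISOLATION wins; every constant below is a hypothetical
placeholder, not committed ABI:

#include <sys/prctl.h>
#include <stdio.h>

#define PR_SET_ISOLATION        48              /* hypothetical prctl number */
#define PR_ISOLATION_ENABLE     (1 << 0)        /* hypothetical flag bits */
#define PR_ISOLATION_QUIESCE    (1 << 1)

int main(void)
{
        /* Assumes the task is already affinitized to a nohz_full core. */
        if (prctl(PR_SET_ISOLATION,
                  PR_ISOLATION_ENABLE | PR_ISOLATION_QUIESCE, 0, 0, 0) < 0) {
                perror("prctl(PR_SET_ISOLATION)");
                return 1;
        }
        /* ... userspace-only work loop runs from here on ... */
        return 0;
}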

Let the bike-shedding continue! :-)

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-11 18:36:50

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, 11 May 2015 14:09:59 -0400
Chris Metcalf <[email protected]> wrote:

> Steven writes:
> > All kidding aside, I think this is the real answer. We don't need a new
> > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> > what it was created to do. That should be fixed.
>
> The claim I'm making is that it's worthwhile to differentiate the two
> semantics. Plain NO_HZ_FULL just says "kernel makes a best effort to
> avoid periodic interrupts without incurring any serious overhead". My
> patch series allows an app to request "kernel makes an absolute
> commitment to avoid all interrupts regardless of cost when leaving
> kernel space". These are different enough ideas, and serve different
> enough application needs, that I think they should be kept distinct.
>
> Frederic actually summed this up very nicely in his recent email when
> he wrote "some people may expect hard isolation requirement (Real
> Time, deterministic latency) and others softer isolation (HPC, only
> interested in performance, can live with one rare random tick, so no
> need to loop before returning to userspace until we have the no-noise
> guarantee)."
>
> So we need a way for apps to ask for the "harder" mode and let
> the softer mode be the default.

Fair enough. But I would hope that this would improve on NO_HZ_FULL as
well.

>
> What about naming? We may or may not want to have a Kconfig flag
> for this, and we may or may not have a separate mode for it, but
> we still will need some kind of name to talk about it with. (In
> particular there's the prctl name, if we take that approach, and
> potential boot command-line flags to consider naming for.)
>
> I'll quickly cover the suggestions that have been raised:
>
> - DATAPLANE. My suggestion, seemingly broadly disliked by folks
> who felt it wasn't apparent what it meant. Probably a fair point.
>
> - NO_INTERRUPTS (Andrew). Captures some of the sense, but was
> criticized pretty fairly by Ingo as being too negative, confusing
> with perf nomenclature, and too long :-)

What about NO_INTERRUPTIONS?

>
> - PURE (Ingo). Proposed as an alternative to NO_HZ_FULL, but we could
> use it as a name for this new mode. However, I think it's not clear
> enough how FULL and PURE can/should relate to each other from the
> names alone.

I would find the two confusing as well.

>
> - BARE_METAL (me). Ingo observes it's confusing with respect to
> virtualization.

This is also confusing.

>
> - TASK_SOLO (Gilad). Not sure this conveys enough of the semantics.

Agreed.

>
> - OS_LIGHT/OS_ZERO and NO_HZ_LEAVE_ME_THE_FSCK_ALONE. Excellent
> ideas :-)

At least the LEAVE_ME_ALONE conveys the semantics ;-)

>
> - ISOLATION (Frederic). I like this but it conflicts with other uses
> of "isolation" in the kernel: cgroup isolation, lru page isolation,
> iommu isolation, scheduler isolation (at least it's a superset of
> that one), etc. Also, we're not exactly isolating a task - often
> a "dataplane" app consists of a bunch of interacting threads in
> userspace, so not exactly isolated. So perhaps it's too confusing.
>
> - OVERFLOWING (Steven) - not sure I understood this one, honestly.

Actually, that was suggested by Paul McKenney.

>
> I suggested earlier a few other candidates that I don't love, but no
> one commented on: NO_HZ_STRICT, USERSPACE_ONLY, and ZERO_OVERHEAD.
>
> One thing I'm leaning towards is to remove the intermediate state of
> DATAPLANE_ENABLE and say that there is really only one primary state,
> DATAPLANE_QUIESCE (or whatever we call it). The "dataplane but no
> quiesce" state probably isn't that useful, since it doesn't offer the
> hard guarantee that is the entire point of this patch series. So that
> opens the idea of using the name NO_HZ_QUIESCE or just QUIESCE as the
> word that describes the mode; of course this sort of conflicts with
> RCU quiesce (though it is a superset of that so maybe that's OK).
>
> One new idea I had is to use NO_HZ_HARD to reflect what Frederic was
> suggesting about "soft" and "hard" requirements for NO_HZ. So
> enabling NO_HZ_HARD would enable my suggested QUIESCE mode.
>
> One way to focus this discussion is on the user API naming. I had
> prctl(PR_SET_DATAPLANE), which was attractive in being a "positive"
> noun. A lot of the other suggestions fail this test in various ways.
> Reasonable candidates seem to be:
>
> PR_SET_OS_ZERO
> PR_SET_TASK_SOLO
> PR_SET_ISOLATION
>
> Another possibility:
>
> PR_SET_NONSTOP
>
> Or take Andrew's NO_INTERRUPTS and have:
>
> PR_SET_UNINTERRUPTED

For another possible answer, what about

SET_TRANQUILITY

A state with no disturbances.

-- Steve

>
> I slightly favor ISOLATION at this point despite the overlap with
> other kernel concepts.
>
> Let the bike-shedding continue! :-)
>

2015-05-11 19:13:58

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On 05/09/2015 03:28 AM, Andy Lutomirski wrote:
> On May 8, 2015 11:44 PM, "Chris Metcalf" <[email protected]> wrote:
>> With QUIESCE mode, the task is in principle guaranteed not to be
>> interrupted by the kernel, but only if it behaves. In particular,
>> if it enters the kernel via system call, page fault, or any of
>> a number of other synchronous traps, it may be unexpectedly
>> exposed to long latencies. Add a simple flag that puts the process
>> into a state where any such kernel entry is fatal.
>>
>> To allow the state to be entered and exited, we add an internal
>> bit to current->dataplane_flags that is set when prctl() sets the
>> flags. That way, when we are exiting the kernel after calling
>> prctl() to forbid future kernel exits, we don't get immediately
>> killed.
> Is there any reason this can't already be addressed in userspace using
> /proc/interrupts or perf_events? ISTM the real goal here is to detect
> when we screw up and fail to avoid an interrupt, and killing the task
> seems like overkill to me.

Patch 6/6 proposes a mechanism to track down times when the
kernel screws up and delivers an IRQ to a userspace-only task.
Here, we're just trying to identify the times when an application
screws itself up out of cluelessness, and provide a mechanism
that allows the developer to easily figure out why and fix it.

In particular, /proc/interrupts won't show syscalls or page faults,
which are two easy ways applications can screw themselves
when they think they're in userspace-only mode. Also, they don't
provide sufficient precision to make it clear what part of the
application caused the undesired kernel entry.

In this case, killing the task is appropriate, since that's exactly
the semantics that have been asked for - it's like on architectures
that don't natively support unaligned accesses, but fake it relatively
slowly in the kernel, and in development you just say "give me a
SIGBUS when that happens" and in production you might say
"fix it up and let's try to keep going".

You can argue that this is something that can be done by ftrace,
but certainly you'd want to have a way to programmatically
turn on ftrace at the moment when you're entering userspace-only
mode, so we'd want some API around that anyway. And honestly,
it's so easy to test a task state bit in a couple of places and
generate the failure on the spot, vs. the relative complexity
of setting up and understanding ftrace, that I think it merits
inclusion on that basis alone.

> Also, can we please stop further torturing the exit paths? We have a
> disaster of assembly code that calls into syscall_trace_leave and
> do_notify_resume. Those functions, in turn, *both* call user_enter
> (WTF?), and on very brief inspection user_enter makes it into the nohz
> code through multiple levels of indirection, which, with these
> patches, has yet another conditionally enabled helper, which does this
> new stuff. It's getting to be impossible to tell what happens when we
> exit to user space any more.
>
> Also, I think your code is buggy. There's no particular guarantee
> that user_enter is only called once between sys_prctl and the final
> exit to user mode (see the above WTF), so you might spuriously kill
> the process.

This is a good point; I also find the x86 kernel entry and exit
paths confusing, although I've reviewed them a bunch of times.
The tile architecture paths are a little easier to understand.

That said, I think the answer here is to avoid non-idempotent
actions in the dataplane code, such as clearing a syscall bit.

A better implementation, I think, is to put the tests for "you
screwed up and synchronously entered the kernel" in
the syscall_trace_enter() code, which TIF_NOHZ already
gets us into; there, we can test whether the dataplane "strict" bit is
set and the syscall is not prctl(), and if so generate the error.
(We'd exclude exit and exit_group here too, since we don't
need to shoot down a task that's just trying to kill itself.)
This needs a bit of platform-specific code for each platform,
but that doesn't seem like too big a problem.

Likewise we can test in exception_enter() since that's only
called for synchronous user entries like page faults.
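
Roughly, the syscall-entry check could look like this (sketch only;
the helper name is illustrative, and the per-arch hook would pass in
the syscall number):

static void dataplane_check_strict_syscall(unsigned long syscall_nr)
{
	struct task_struct *task = current;

	if (!(task->dataplane_flags & PR_DATAPLANE_STRICT))
		return;

	/* prctl() stays legal so the task can leave strict mode, and
	 * exit/exit_group don't need to be shot down. */
	if (syscall_nr == __NR_prctl ||
	    syscall_nr == __NR_exit ||
	    syscall_nr == __NR_exit_group)
		return;

	pr_warn("Dataplane STRICT mode violated by syscall %lu\n",
		syscall_nr);
	send_sig(SIGKILL, task, 1);
}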

> Also, I think that most users will be quite surprised if "strict
> dataplane" code causes any machine check on the system to kill your
> dataplane task.

Fair point, and avoided by testing as described above instead.
(Though presumably in development it's not such a big deal,
and as I said you'd likely turn it off in production.)

> Similarly, a user accidentally running perf record -a
> probably should have some reasonable semantics.

Yes, also avoided by doing this as above, though I'd argue we
could also just say that running perf disables this mode.
But it's not as clean as the above suggestion.

On 05/09/2015 06:37 AM, Gilad Ben Yossef wrote:
> So, I don't know if it is a practical suggestion or not, but would it be better/easier to mark a pending signal on kernel entry for this case?
> The upside I see is that the user gets her notification (killing the task or just logging the event in a signal handler) and hopefully, since return to userspace with a pending signal is already handled, we don't need new code in the exit path?

We could certainly do this now that I'm planning to do the
test at kernel entry rather than super-late in kernel exit.
Rather than just do_group_exit(SIGKILL), we should raise
a proper SIGKILL signal via send_sig(SIGKILL, current, 1),
and then we could catch it in the debugger; the pc should
help identify if it was a syscall, page fault, or other trap.

I'm not sure there's an argument to be made for the user
process being able to catch the signal itself; presumably in
production you don't turn this mode on anyway, and in
development, assuming a debugger is probably fine.

But if you want to argue for another signal (SIGILL?) please
do; I'm curious to hear if you think it would make more sense.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-11 19:19:48

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, 2015-05-11 at 17:36 +0200, Frederic Weisbecker wrote:

> I expect some Real Time users may want this kind of dataplane mode where a syscall
> or whatever sleeps until the system is ready to provide the guarantee that no
> disturbance is going to happen for a given time. I'm not sure HPC users are interested
> in that.

I bet they are. RT is just a different way to spell HPC, and the reverse.

> In fact it goes along the fact that NO_HZ_FULL was really only supposed to be about
> the tick and now people are introducing more and more kernel default presetting that
> assume NO_HZ_FULL implies ISOLATION which is about all kind of noise (tick, tasks, irqs,
> ...). Which is true but what kind of ISOLATION?

True, nohz mode and various isolation measures are distinct properties.
NO_HZ_FULL is kinda pointless without isolation measures to go with it,
but you're right.

I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
that old static isolcpus was _supposed_ to crawl off and die, I know
beyond doubt that having isolated a cpu as well as you can definitely
does NOT imply that said cpu should become tickless. I routinely run a
load model that wants all the isolation it can get. It's not single-task
compute though: an rt executive coordinating rt workers, which of course
wants every cycle it can get, so nohz_full is less than helpful.

-Mike

2015-05-11 19:25:38

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
> that old static isolcpus was _supposed_ to crawl off and die, I know
> beyond doubt that having isolated a cpu as well as you can definitely
> does NOT imply that said cpu should become tickless.

True, at a high level, I agree that it would be better to have a
top-level concept like Frederic's proposed ISOLATION that includes
isolcpus and nohz_cpu (and other stuff as needed).

That said, what you wrote above is wrong; even with the patch you
acked, setting isolcpus does not automatically turn on nohz_full for
a given cpu. The patch made it true the other way around: when
you say nohz_full, you automatically get isolcpus on that cpu too.
That does, at least, make sense for the semantics of nohz_full.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-11 19:54:51

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

(Oops, resending and forcing html off.)

On 05/09/2015 03:19 AM, Andy Lutomirski wrote:
> Naming aside, I don't think this should be a per-task flag at all. We
> already have way too much overhead per syscall in nohz mode, and it
> would be nice to get the per-syscall overhead as low as possible. We
should strive, for all tasks, to keep syscall overhead down *and*
> avoid as many interrupts as possible.
>
> That being said, I do see a legitimate use for a way to tell the
> kernel "I'm going to run in userspace for a long time; stay away".
> But shouldn't that be a single operation, not an ongoing flag? IOW, I
> think that we should have a new syscall quiesce() or something rather
> than a prctl.

Yes, if all you are concerned about is quiescing the tick, we could
probably do it as a new syscall.

I do note that you'd want to try to actually do the quiesce as late as
possible - in particular, if you just did it in the usual syscall, you
might miss out on a timer that is set by softirq, or even something
that happened when you called schedule() on the syscall exit path.
Doing it as late as we are doing helps to ensure that that doesn't
happen. We could still arrange for this semantics by having a new
quiesce() syscall set a temporary task bit that was cleared on
return to userspace, but as you pointed out in a different email,
that gets tricky if you end up doing multiple user_exit() calls on
your way back to userspace.
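
For illustration, that temporary-bit variant might look something like
this (pure sketch; the TIF_ flag name and the syscall itself are
hypothetical, while dataplane_quiesce() is the helper from this series):

SYSCALL_DEFINE0(quiesce)
{
	set_thread_flag(TIF_QUIESCE_ONCE);	/* hypothetical flag */
	return 0;
}

/* ... and at the very last step of the return-to-userspace path: */
if (test_and_clear_thread_flag(TIF_QUIESCE_ONCE))
	dataplane_quiesce();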

More to the point, I think it's actually important to know when an
application believes it's in userspace-only mode as an actual state
bit, rather than just during its transitional moment. If an
application calls the kernel at an unexpected time (third-party code
is the usual culprit for our customers, whether it's syscalls, page
faults, or other things) we would prefer to have the "quiesce"
semantics stay in force and cause the third-party code to be
visibly very slow, rather than cause a totally unexpected and
hard-to-diagnose interrupt to show up later as we are still going
around the loop that we thought was safely userspace-only.

And, for debugging the kernel, it's crazy helpful to have that state
bit in place: see patch 6/6 in the series for how we can diagnose
things like "a different core just queued an IPI that will hit a
dataplane core unexpectedly". Having that state bit makes this sort
of thing a trivial check in the kernel and relatively easy to debug.
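
As a sketch of the kind of check that state bit enables (the debug
flag and cpumask names below are placeholders, not the exact
identifiers in patch 6/6):

/* Called from the IPI send path when dataplane debugging is enabled. */
static void dataplane_debug_ipi(int cpu)
{
	if (dataplane_debug &&
	    cpumask_test_cpu(cpu, &dataplane_quiesce_mask)) {
		pr_warn("sending IPI to dataplane cpu %d\n", cpu);
		dump_stack();
	}
}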

Finally, I proposed a "strict" mode in patch 5/6 where we kill the
process if it voluntarily enters the kernel by mistake after saying it
wasn't going to any more. To do this requires a state bit, so
carrying another state bit for "quiesce on user entry" seems pretty
reasonable.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-11 20:13:39

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On 05/09/2015 03:04 AM, Mike Galbraith wrote:
> On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
>> For tasks which have elected dataplane functionality, we run
>> any pending softirqs for the core before returning to userspace,
>> rather than ever scheduling ksoftirqd to run. The problem we
>> fix is that by allowing another task to run on the core, we
>> guarantee more interrupts in the future to the dataplane task,
>> which is exactly what dataplane mode is required to prevent.
> If ksoftirqd were rt class

I realize I actually don't know if this is true or not. Is
ksoftirqd rt class? If not, it does seem pretty plausible that
it should be...

> softirqs would be gone when the soloist gets
> the CPU back and heads to userspace. Being a soloist, it has no use for
> a priority, so why can't it just let ksoftirqd run if it raises the
> occasional softirq? Meeting a contended lock while processing it will
> wreck the soloist regardless of who does that processing.

The thing you want to avoid is having two processes both
runnable at once, since then the "quiesce" mode can't make
forward progress and basically spins in cpu_idle() until ksoftirqd
can come in. Alas, my recollection of the precise failure mode
is somewhat dimmed; my commit notes from a year ago (for
a variant of the patch I'm upstreaming now):

- Trying to return to userspace with pending softirqs is not
currently allowed. Prior to this patch, when this happened
we would just wait in cpu_idle. Instead, what we now do is
directly run any pending softirqs, then go back and retry the
path where we return to userspace.

- Raising softirqs (in this case for hrtimer support) could
cause the ksoftirqd daemon to be woken on a core. This is
bad because on a dataplane core, a QUIESCE process will
then block until the ksoftirqd runs, and the system sometimes
seems to flag that soft irqs are available but not schedule
the timer to arrange for a context switch to ksoftirqd.
To handle this, we avoid bailing out in __do_softirq() when
we've been working for a while, if we're on a dataplane core,
and just keep working until done. Similarly, on a dataplane
core running a userspace task, we don't wake ksoftirqd when
we are raising a softirq, even if we're not in an interrupt
context where it will run promptly, since a non-interrupt
context will also run promptly.
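
In code terms, the first bullet amounts to something like this on the
exit-to-user path (sketch; task_is_dataplane() is a stand-in for
whatever predicate we end up with):

/* Run pending softirqs inline rather than waking ksoftirqd, then
 * retry the return-to-userspace path. */
if (task_is_dataplane(current)) {
	while (local_softirq_pending())
		do_softirq();
}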

I'm happy to drop this patch entirely from the series for now, and
if ksoftirqd shows up as a problem going forward, we can address it
as necessary at that time. What do you think?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-11 22:15:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On May 12, 2015 4:54 AM, "Chris Metcalf" <[email protected]> wrote:
>
> (Oops, resending and forcing html off.)
>
>
> On 05/09/2015 03:19 AM, Andy Lutomirski wrote:
>>
>> Naming aside, I don't think this should be a per-task flag at all. We
>> already have way too much overhead per syscall in nohz mode, and it
>> would be nice to get the per-syscall overhead as low as possible. We
>> should strive, for all tasks, to keep syscall overhead down *and*
>> avoid as many interrupts as possible.
>>
>> That being said, I do see a legitimate use for a way to tell the
>> kernel "I'm going to run in userspace for a long time; stay away".
>> But shouldn't that be a single operation, not an ongoing flag? IOW, I
>> think that we should have a new syscall quiesce() or something rather
>> than a prctl.
>
>
> Yes, if all you are concerned about is quiescing the tick, we could
> probably do it as a new syscall.
>
> I do note that you'd want to try to actually do the quiesce as late as
> possible - in particular, if you just did it in the usual syscall, you
> might miss out on a timer that is set by softirq, or even something
> that happened when you called schedule() on the syscall exit path.
> Doing it as late as we are doing helps to ensure that that doesn't
> happen. We could still arrange for this semantics by having a new
> quiesce() syscall set a temporary task bit that was cleared on
> return to userspace, but as you pointed out in a different email,
> that gets tricky if you end up doing multiple user_exit() calls on
> your way back to userspace.

We should fix that, then. A quiesce() syscall can certainly arrange
to clean up on final exit.

>
> More to the point, I think it's actually important to know when an
> application believes it's in userspace-only mode as an actual state
> bit, rather than just during its transitional moment.

We can do that, too, with a new flag that's cleared on the next entry.

> If an
> application calls the kernel at an unexpected time (third-party code
> is the usual culprit for our customers, whether it's syscalls, page
> faults, or other things) we would prefer to have the "quiesce"
> semantics stay in force and cause the third-party code to be
> visibly very slow, rather than cause a totally unexpected and
> hard-to-diagnose interrupt to show up later as we are still going
> around the loop that we thought was safely userspace-only.

I'm not really convinced that we should design this feature around
ease of debugging userspace screwups. There are already plenty of
ways to do that part. Userspace getting an interrupt because
userspace accidentally did a syscall is very different from userspace
getting interrupted due to an IPI.

>
> And, for debugging the kernel, it's crazy helpful to have that state
> bit in place: see patch 6/6 in the series for how we can diagnose
> things like "a different core just queued an IPI that will hit a
> dataplane core unexpectedly". Having that state bit makes this sort
> of thing a trivial check in the kernel and relatively easy to debug.

As above, this can be done with a one-time operation, too.

>
> Finally, I proposed a "strict" mode in patch 5/6 where we kill the
> process if it voluntarily enters the kernel by mistake after saying it
> wasn't going to any more. To do this requires a state bit, so
> carrying another state bit for "quiesce on user entry" seems pretty
> reasonable.

I still dislike that in the form you chose. It's too deadly to be
useful for anyone but the hardest RT users.

I think I'd be okay with variants, though: let a suitably privileged
process ask for a signal on inadvertent kernel entry or rig up an fd
to be notified when one of these bad entries happens. Queueing
something to a pollable fd would work, too.

See that thread for more comments.

--Andy

2015-05-11 22:29:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

[add peterz due to perf stuff]

On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <[email protected]> wrote:
> On 05/09/2015 03:28 AM, Andy Lutomirski wrote:
>>
>> On May 8, 2015 11:44 PM, "Chris Metcalf" <[email protected]> wrote:
>>>
>>> With QUIESCE mode, the task is in principle guaranteed not to be
>>> interrupted by the kernel, but only if it behaves. In particular,
>>> if it enters the kernel via system call, page fault, or any of
>>> a number of other synchronous traps, it may be unexpectedly
>>> exposed to long latencies. Add a simple flag that puts the process
>>> into a state where any such kernel entry is fatal.
>>>
>>> To allow the state to be entered and exited, we add an internal
>>> bit to current->dataplane_flags that is set when prctl() sets the
>>> flags. That way, when we are exiting the kernel after calling
>>> prctl() to forbid future kernel exits, we don't get immediately
>>> killed.
>>
>> Is there any reason this can't already be addressed in userspace using
>> /proc/interrupts or perf_events? ISTM the real goal here is to detect
>> when we screw up and fail to avoid an interrupt, and killing the task
>> seems like overkill to me.
>
>
> Patch 6/6 proposes a mechanism to track down times when the
> kernel screws up and delivers an IRQ to a userspace-only task.
> Here, we're just trying to identify the times when an application
> screws itself up out of cluelessness, and provide a mechanism
> that allows the developer to easily figure out why and fix it.
>
> In particular, /proc/interrupts won't show syscalls or page faults,
> which are two easy ways applications can screw themselves
> when they think they're in userspace-only mode. Also, they don't
> provide sufficient precision to make it clear what part of the
> application caused the undesired kernel entry.

Perf does, though, complete with context.

>
> In this case, killing the task is appropriate, since that's exactly
> the semantics that have been asked for - it's like on architectures
> that don't natively support unaligned accesses, but fake it relatively
> slowly in the kernel, and in development you just say "give me a
> SIGBUS when that happens" and in production you might say
> "fix it up and let's try to keep going".

I think more control is needed. I also think that, if we go this
route, we should distinguish syscalls, synchronous non-syscall
entries, and asynchronous non-syscall entries. They're quite
different.

>
> You can argue that this is something that can be done by ftrace,
> but certainly you'd want to have a way to programmatically
> turn on ftrace at the moment when you're entering userspace-only
> mode, so we'd want some API around that anyway. And honestly,
> it's so easy to test a task state bit in a couple of places and
> generate the failure on the spot, vs. the relative complexity
> of setting up and understanding ftrace, that I think it merits
> inclusion on that basis alone.

perf_event, not ftrace.

>
>> Also, can we please stop further torturing the exit paths? We have a
>> disaster of assembly code that calls into syscall_trace_leave and
>> do_notify_resume. Those functions, in turn, *both* call user_enter
>> (WTF?), and on very brief inspection user_enter makes it into the nohz
>> code through multiple levels of indirection, which, with these
>> patches, has yet another conditionally enabled helper, which does this
>> new stuff. It's getting to be impossible to tell what happens when we
>> exit to user space any more.
>>
>> Also, I think your code is buggy. There's no particular guarantee
>> that user_enter is only called once between sys_prctl and the final
>> exit to user mode (see the above WTF), so you might spuriously kill
>> the process.
>
>
> This is a good point; I also find the x86 kernel entry and exit
> paths confusing, although I've reviewed them a bunch of times.
> The tile architecture paths are a little easier to understand.
>
> That said, I think the answer here is to avoid non-idempotent
> actions in the dataplane code, such as clearing a syscall bit.
>
> A better implementation, I think, is to put the tests for "you
> screwed up and synchronously entered the kernel" in
> the syscall_trace_enter() code, which TIF_NOHZ already
> gets us into;

No, not unless you're planning on using that to distinguish syscalls
from other stuff *and* people think that's justified.

It's far too easy to just make a tiny change to the entry code. Add a
tiny trivial change here, a few lines of asm (that's you, audit!)
there, some weird written-in-asm scheduling code over here, and you
end up with the truly awful mess that we currently have.

If it really makes sense for this stuff to go with context tracking,
then fine, but we should *fix* the context tracking first rather than
kludging around it. I already have a prototype patch for the relevant
part of that.

> there, we can test whether the dataplane "strict" bit is
> set and the syscall is not prctl(), and if so generate the error.
> (We'd exclude exit and exit_group here too, since we don't
> need to shoot down a task that's just trying to kill itself.)
> This needs a bit of platform-specific code for each platform,
> but that doesn't seem like too big a problem.

I'd rather avoid that, too. This feature isn't really arch-specific,
so let's avoid the arch stuff if at all possible.

>
> Likewise we can test in exception_enter() since that's only
> called for synchronous user entries like page faults.

Let's try to generalize a bit. There's also irq_entry and ist_enter,
and some of the exception_enter cases are for synchronous entries
while (IIRC -- could be wrong) others aren't always like that.

>
>> Also, I think that most users will be quite surprised if "strict
>> dataplane" code causes any machine check on the system to kill your
>> dataplane task.
>
>
> Fair point, and avoided by testing as described above instead.
> (Though presumably in development it's not such a big deal,
> and as I said you'd likely turn it off in production.)

Until you forget to turn it off in production because it worked so
nicely in development.

What if we added a mode to perf where delivery of a sample
synchronously (or semi-synchronously by catching it on the next exit
to userspace) freezes the delivering task? It would be like debugger
support via perf.

peterz, do you think this would be a sensible thing to add to perf?
It would only make sense for some types of events (tracepoints and
hw_breakpoints mostly, I think).

>> So, I don't know if it is a practical suggestion or not, but would it
>> be better/easier to mark a pending signal on kernel entry for this case?
>> The upside I see is that the user gets her notification (killing the task
>> or just logging the event in a signal handler) and hopefully, since return
>> to userspace with a pending signal is already handled, we don't need new
>> code in the exit path?
>
>
> We could certainly do this now that I'm planning to do the
> test at kernel entry rather than super-late in kernel exit.
> Rather than just do_group_exit(SIGKILL), we should raise
> a proper SIGKILL signal via send_sig(SIGKILL, current, 1),
> and then we could catch it in the debugger; the pc should
> help identify if it was a syscall, page fault, or other trap.
>
> I'm not sure there's an argument to be made for the user
> process being able to catch the signal itself; presumably in
> production you don't turn this mode on anyway, and in
> development, assuming a debugger is probably fine.
>
> But if you want to argue for another signal (SIGILL?) please
> do; I'm curious to hear if you think it would make more sense.

Make it configurable as part of the prctl.

--Andy

2015-05-12 01:48:05

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> > I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
> > that old static isolcpus was _supposed_ to crawl off and die, I know
> > beyond doubt that having isolated a cpu as well as you can definitely
> > does NOT imply that said cpu should become tickless.
>
> True, at a high level, I agree that it would be better to have a
> top-level concept like Frederic's proposed ISOLATION that includes
> isolcpus and nohz_cpu (and other stuff as needed).
>
> That said, what you wrote above is wrong; even with the patch you
> acked, setting isolcpus does not automatically turn on nohz_full for
> a given cpu. The patch made it true the other way around: when
> you say nohz_full, you automatically get isolcpus on that cpu too.
> That does, at least, make sense for the semantics of nohz_full.

I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
Yes, with nohz_full currently being static, the old allegedly dying but
also static isolcpus scheduler off switch is a convenient thing to wire
the nohz_full CPU SET (<- hint;) property to.

-Mike

2015-05-12 02:21:48

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On Mon, 2015-05-11 at 16:13 -0400, Chris Metcalf wrote:
> On 05/09/2015 03:04 AM, Mike Galbraith wrote:
> > On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
> >> For tasks which have elected dataplane functionality, we run
> >> any pending softirqs for the core before returning to userspace,
> >> rather than ever scheduling ksoftirqd to run. The problem we
> >> fix is that by allowing another task to run on the core, we
> >> guarantee more interrupts in the future to the dataplane task,
> >> which is exactly what dataplane mode is required to prevent.
> > If ksoftirqd were rt class
>
> I realize I actually don't know if this is true or not. Is
> ksoftirqd rt class? If not, it does seem pretty plausible that
> it should be...

It is in an rt kernel, not in a stock kernel; it's malleable in both ;-)

> > softirqs would be gone when the soloist gets
> > the CPU back and heads to userspace. Being a soloist, it has no use for
> > a priority, so why can't it just let ksoftirqd run if it raises the
> > occasional softirq? Meeting a contended lock while processing it will
> > wreck the soloist regardless of who does that processing.
>
> The thing you want to avoid is having two processes both
> runnable at once, since then the "quiesce" mode can't make
> forward progress and basically spins in cpu_idle() until ksoftirqd
> can come in.

The only way ksoftirqd can appear is that the soloist woke it. If the
alleged soloist is raising enough softirqs to matter, it ain't really an
ultra-sensitive solo artist; it's part of a noise-inducing (locks) chorus.

> Alas, my recollection of the precise failure mode
> is somewhat dimmed; my commit notes from a year ago (for
> a variant of the patch I'm upstreaming now):
>
> - Trying to return to userspace with pending softirqs is not
> currently allowed. Prior to this patch, when this happened
> we would just wait in cpu_idle. Instead, what we now do is
> directly run any pending softirqs, then go back and retry the
> path where we return to userspace.
>
> - Raising softirqs (in this case for hrtimer support) could
> cause the ksoftirqd daemon to be woken on a core. This is
> bad because on a dataplane core, a QUIESCE process will
> then block until the ksoftirqd runs, and the system sometimes
> seems to flag that soft irqs are available but not schedule
> the timer to arrange for a context switch to ksoftirqd.
> To handle this, we avoid bailing out in __do_softirq() when
> we've been working for a while, if we're on a dataplane core,
> and just keep working until done. Similarly, on a dataplane
> core running a userspace task, we don't wake ksoftirqd when
> we are raising a softirq, even if we're not in an interrupt
> context where it will run promptly, since a non-interrupt
> context will also run promptly.

Thomas has nuked the hrtimer softirq.

> I'm happy to drop this patch entirely from the series for now, and
> if ksoftirqd shows up as a problem going forward, we can address it
> as necessary at that time. What do you think?

Inlining softirqs may save a context switch, but adds cycles that we may
consume at higher frequency than the thing we're avoiding.

-Mike

2015-05-12 04:35:42

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Tue, 2015-05-12 at 03:47 +0200, Mike Galbraith wrote:
> On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> > On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> > > I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
> > > that old static isolcpus was _supposed_ to crawl off and die, I know
> > > beyond doubt that having isolated a cpu as well as you can definitely
> > > does NOT imply that said cpu should become tickless.
> >
> > True, at a high level, I agree that it would be better to have a
> > top-level concept like Frederic's proposed ISOLATION that includes
> > isolcpus and nohz_cpu (and other stuff as needed).
> >
> > That said, what you wrote above is wrong; even with the patch you
> > acked, setting isolcpus does not automatically turn on nohz_full for
> > a given cpu. The patch made it true the other way around: when
> > you say nohz_full, you automatically get isolcpus on that cpu too.
> > That does, at least, make sense for the semantics of nohz_full.
>
> I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
> Yes, with nohz_full currently being static, the old allegedly dying but
> also static isolcpus scheduler off switch is a convenient thing to wire
> the nohz_full CPU SET (<- hint;) property to.

BTW, another facet of this: Rik wants to make isolcpus immune to
cpusets, which makes some sense (the user did say isolcpus=), but that
also makes isolcpus truly static. If the user now says nohz_full=, they lose
the ability to deactivate CPU isolation, making the set fairly useless
for anything other than HPC. Currently, the user can flip the isolation
switch as he sees fit. He takes a size extra large performance hit for
having said nohz_full=, but he doesn't lose generic utility.

-Mike

2015-05-12 09:10:58

by Ingo Molnar

[permalink] [raw]
Subject: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)


* Chris Metcalf <[email protected]> wrote:

> - ISOLATION (Frederic). I like this but it conflicts with other uses
> of "isolation" in the kernel: cgroup isolation, lru page isolation,
> iommu isolation, scheduler isolation (at least it's a superset of
> that one), etc. Also, we're not exactly isolating a task - often
> a "dataplane" app consists of a bunch of interacting threads in
> userspace, so not exactly isolated. So perhaps it's too confusing.

So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is
a high level kernel feature, so it won't conflict with isolation
concepts in lower level subsystems such as IOMMU isolation - and other
higher level features like scheduler isolation are basically another
partial implementation we want to merge with all this...

nohz, RCU tricks, watchdog defaults, isolcpus and various other
measures to keep these CPUs and workloads as isolated as possible
are (or should become) components of this high level concept.

Ideally CONFIG_ISOLATION=y would be a kernel feature that has almost
zero overhead on normal workloads and on non-isolated CPUs, so that
Linux distributions can enable it.

Enabling CONFIG_ISOLATION=y should be the only 'kernel config' step
needed: just like cpusets, the configuration of isolated CPUs should
be a completely boot-option-free exercise that can be dynamically
done and undone by the administrator via an intuitive interface.

Thanks,

Ingo

2015-05-12 09:26:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane

On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
> While the current fallback to 1-second tick is still helpful for
> maintaining completely correct kernel semantics, processes using
> prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
> completely tickless, so don't bound the time_delta for such processes.
>
> This was previously discussed in
>
> https://lkml.org/lkml/2014/10/31/364
>
> and Thomas Gleixner observed that vruntime, load balancing data,
> load accounting, and other things might be impacted. Frederic
> Weisbecker similarly observed that allowing the tick to be indefinitely
> deferred just meant that no one would ever fix the underlying bugs.
> However it's at least true that the mode proposed in this patch can
> only be enabled on an isolcpus core, which may limit how important
> it is to maintain scheduler data correctly, for example.

So how is making this available going to help people fix the actual
problem?

There is nothing fundamentally impossible about fixing this properly;
it's just a lot of hard work.

NAK on this, do it right.

2015-05-12 09:29:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote:
> - Raising softirqs (in this case for hrtimer support) could

Note that Thomas recently killed all the softirq wreckage in hrtimers.
So that specific case is dealt with.

2015-05-12 09:32:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote:
> The thing you want to avoid is having two processes both
> runnable at once

Right, because as soon as nr_running > 1 we kill the entire nohz_full
thing. RT or not for ksoftirqd doesn't matter.

Then again, like interrupts, you basically want to avoid softirqs in
this mode.

So I think the right solution is to figure out why the softirqs get
raised and cure that.

2015-05-12 09:34:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
> This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
> kernel to quiesce any pending timer interrupts prior to returning
> to userspace. When running with this mode set, sys calls (and page
> faults, etc.) can be inordinately slow. However, user applications
> that want to guarantee that no unexpected interrupts will occur
> (even if they call into the kernel) can set this flag to guarantee
> that semantics.

Currently people hot-unplug and hot-plug the CPU to do this. Obviously
that's a wee bit horrible :-)

Not sure if a prctl like this is any better though. This is a CPU
property, not a process one.

ISTR people talking about a 'quiesce' sysfs file, alongside the hotplug
stuff, but I can't quite remember.

2015-05-12 09:39:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote:
> +++ b/kernel/time/tick-sched.c
> @@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
> (jiffies - start));
> dump_stack();
> }
> +
> + /*
> + * Kill the process if it violates STRICT mode. Note that this
> + * code also results in killing the task if a kernel bug causes an
> + * irq to be delivered to this core.
> + */
> + if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
> + == PR_DATAPLANE_STRICT) {
> + pr_warn("Dataplane STRICT mode violated; process killed.\n");
> + dump_stack();
> + task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
> + local_irq_enable();
> + do_group_exit(SIGKILL);
> + }
> }

So while I'm all for hard fails like this, can we not provide a wee bit
more information in the siginfo? And maybe use a slightly less fatal
signal, such that userspace can actually catch it and dump state in
debug modes?
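
E.g. something like this (a sketch; the choice of SIGILL and the
si_code value are just placeholders for discussion):

struct siginfo info;

memset(&info, 0, sizeof(info));
info.si_signo = SIGILL;
info.si_code = ILL_ILLTRP;
info.si_addr = (void __user *)instruction_pointer(task_pt_regs(task));
force_sig_info(SIGILL, &info, task);

That way a debug build could catch the signal and dump state, with the
offending pc available to the handler via si_addr.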

2015-05-12 09:50:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE


* Peter Zijlstra <[email protected]> wrote:

> On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
> > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
> > kernel to quiesce any pending timer interrupts prior to returning
> > to userspace. When running with this mode set, sys calls (and page
> > faults, etc.) can be inordinately slow. However, user applications
> > that want to guarantee that no unexpected interrupts will occur
> > (even if they call into the kernel) can set this flag to guarantee
> > that semantics.
>
> Currently people hot-unplug and hot-plug the CPU to do this.
> Obviously that's a wee bit horrible :-)
>
> Not sure if a prctl like this is any better though. This is a CPU
> property, not a process one.

So if then a prctl() (or other system call) could be a shortcut to:

- move the task to an isolated CPU
- make sure there _is_ such an isolated domain available

I.e. have some programmatic, kernel provided way for an application to
be sure it's running in the right environment. Relying on random
administration flags here and there won't cut it.

Thanks,

Ingo

2015-05-12 10:38:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On Tue, May 12, 2015 at 11:50:30AM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
> > > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
> > > kernel to quiesce any pending timer interrupts prior to returning
> > > to userspace. When running with this mode set, sys calls (and page
> > > faults, etc.) can be inordinately slow. However, user applications
> > > that want to guarantee that no unexpected interrupts will occur
> > > (even if they call into the kernel) can set this flag to guarantee
> > > that semantics.
> >
> > Currently people hot-unplug and hot-plug the CPU to do this.
> > Obviously that's a wee bit horrible :-)
> >
> > Not sure if a prctl like this is any better though. This is a CPU
> > property, not a process one.
>
> So if then a prctl() (or other system call) could be a shortcut to:
>
> - move the task to an isolated CPU
> - make sure there _is_ such an isolated domain available
>
> I.e. have some programmatic, kernel provided way for an application to
> be sure it's running in the right environment. Relying on random
> administration flags here and there won't cut it.

No, we already have sched_setaffinity() and we should not duplicate its
ability to move tasks about.

What this is about is 'clearing' CPU state; it's nothing to do with
tasks.

Ideally we'd never have to clear the state because it should be
impossible to get into this predicament in the first place.

The typical example here is a periodic timer that found its way onto the
cpu and stays there. We're actually working on allowing such self-arming
timers to migrate, so once we have that sorted this could be fixed
properly, I think.

Not sure if there's more pollution that people worry about.

The hotplug hack worked because unplug force-migrates the timers away.

2015-05-12 10:46:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
>
> Please lets get NO_HZ_FULL up to par. That should be the main focus.
>

ACK, much of this dataplane stuff is (useful) hacks working around the
fact that nohz_full just isn't complete.

2015-05-12 11:48:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)

On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote:
>
> So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is
> a high level kernel feature, so it won't conflict with isolation
> concepts in lower level subsystems such as IOMMU isolation - and other
> higher level features like scheduler isolation are basically another
> partial implementation we want to merge with all this...
>

But why do we need a CONFIG flag for something that has no content?

That is, I do not see anything much; except the 'I want to stay in
userspace and kill me otherwise' flag, and I'm not sure that warrants a
CONFIG flag like this.

Other than that, it's all a combination of NOHZ_FULL and cpusets/isolcpus
and whatnot.

2015-05-12 12:34:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)


* Peter Zijlstra <[email protected]> wrote:

> On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote:
> >
> > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this
> > is a high level kernel feature, so it won't conflict with
> > isolation concepts in lower level subsystems such as IOMMU
> > isolation - and other higher level features like scheduler
> > isolation are basically another partial implementation we want to
> > merge with all this...
>
> But why do we need a CONFIG flag for something that has no content?
>
> That is, I do not see anything much; except the 'I want to stay in
> userspace and kill me otherwise' flag, and I'm not sure that
> warrants a CONFIG flag like this.
>
> Other than that, it's all a combination of NOHZ_FULL and
> cpusets/isolcpus and whatnot.

Yes, that's what I meant: CONFIG_ISOLATION would trigger what is
NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as
an individual Kconfig option?

CONFIG_ISOLATION=y would express the guarantee from the kernel that
it's possible for user-space to configure itself to run undisturbed -
instead of the current inconsistent set of options and facilities.

A bit like CONFIG_PREEMPT_RT is more than just preemptable spinlocks,
it also tries to offer various facilities and tune the defaults to
turn the kernel hard-rt.

Does that make sense to you?

Thanks,

Ingo

2015-05-12 12:39:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)

On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote:
> Yes, that's what I meant: CONFIG_ISOLATION would trigger what is
> NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as
> an individual Kconfig option?

Ah, as a rename of nohz_full, sure that might work.

2015-05-12 12:43:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)


* Peter Zijlstra <[email protected]> wrote:

> On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote:
>
> > Yes, that's what I meant: CONFIG_ISOLATION would trigger what is
> > NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL
> > as an individual Kconfig option?
>
> Ah, as a rename of nohz_full, sure that might work.

It could also be named CONFIG_CPU_ISOLATION=y, to make it more
explicit what it's about.

Thanks,

Ingo

2015-05-12 12:54:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE


* Peter Zijlstra <[email protected]> wrote:

> > So if then a prctl() (or other system call) could be a shortcut
> > to:
> >
> > - move the task to an isolated CPU
> > - make sure there _is_ such an isolated domain available
> >
> > I.e. have some programmatic, kernel provided way for an
> > application to be sure it's running in the right environment.
> > Relying on random administration flags here and there won't cut
> > it.
>
> No, we already have sched_setaffinity() and we should not duplicate
> its ability to move tasks about.

But sched_setaffinity() does not guarantee isolation - it's just a
syscall to move a task to a set of CPUs, which might be isolated or
not.

What I suggested is that it might make sense to offer a system call,
for example a sched_setparam() variant, that makes such guarantees.

Say if user-space does:

ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);

... then we would get the task moved to an isolated domain and get a 0
return code if the kernel is able to do all that and if the current
uid/namespace/etc. has the required permissions and such.

( BIND_ISOLATED will not replace the current p->policy value, so it's
still possible to use the regular policies as well on top of this. )

I.e. make it programmatic instead of relying on a fragile, kernel
version dependent combination of sysctl, sysfs, kernel config and boot
parameter details to get us this result.

I.e. provide a central hub to offer this feature in a more structured,
easier to use fashion.

We might still require the admin (or distro) to separately set up the
domain of isolated CPUs, and it would still be possible to simply
'move' tasks there using existing syscalls - but I say that it's not a
bad idea at all to offer a single central syscall interface for apps
to request such treatment.

> What this is about is 'clearing' CPU state, its nothing to do with
> tasks.
>
> Ideally we'd never have to clear the state because it should be
> impossible to get into this predicament in the first place.

That I absolutely agree about, that bit is nonsense.

We might offer facilities to debug such bugs, but we won't work
around or hack around it.

Thanks,

Ingo

2015-05-12 13:08:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On Tue, May 12, 2015 at 11:32:02AM +0200, Peter Zijlstra wrote:
> On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote:
> > The thing you want to avoid is having two processes both
> > runnable at once
>
> Right, because as soon as nr_running > 1 we kill the entire nohz_full
> thing. RT or not for ksoftirqd doesn't matter.
>
> Then again, like interrupts, you basically want to avoid softirqs in
> this mode.
>
> So I think the right solution is to figure out why the softirqs get
> raised and cure that.

Makes sense, but it also makes sense to have something that detects
when that cure fails and cleans up. And, in a test/debug environment,
it should also issue some sort of diagnostic in that case.

Thanx, Paul

2015-05-12 13:13:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane

On Tue, May 12, 2015 at 11:26:07AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
> > While the current fallback to 1-second tick is still helpful for
> > maintaining completely correct kernel semantics, processes using
> > prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
> > completely tickless, so don't bound the time_delta for such processes.
> >
> > This was previously discussed in
> >
> > https://lkml.org/lkml/2014/10/31/364
> >
> > and Thomas Gleixner observed that vruntime, load balancing data,
> > load accounting, and other things might be impacted. Frederic
> > Weisbecker similarly observed that allowing the tick to be indefinitely
> > deferred just meant that no one would ever fix the underlying bugs.
> > However it's at least true that the mode proposed in this patch can
> > only be enabled on an isolcpus core, which may limit how important
> > it is to maintain scheduler data correctly, for example.
>
> So how is making this available going to help people fix the actual
> problem?

It will at least provide an environment where adding more of this
problem might get punished. This would be an improvement over what
we have today, namely that the 1HZ fallback timer silently forgives
adding more problems of this sort.

Thanx, Paul

> There is nothing fundamentally impossible about fixing this proper, its
> just a lot of hard work.
>
> NAK on this, do it right.
>

2015-05-12 13:36:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, May 11, 2015 at 03:52:37PM -0400, Chris Metcalf wrote:
> On 05/09/2015 03:19 AM, Andy Lutomirski wrote:
> >Naming aside, I don't think this should be a per-task flag at all. We
> >already have way too much overhead per syscall in nohz mode, and it
> >would be nice to get the per-syscall overhead as low as possible. We
> >should strive, for all tasks, to keep syscall overhead down*and*
> >avoid as many interrupts as possible.
> >
> >That being said, I do see a legitimate use for a way to tell the
> >kernel "I'm going to run in userspace for a long time; stay away".
> >But shouldn't that be a single operation, not an ongoing flag? IOW, I
> >think that we should have a new syscall quiesce() or something rather
> >than a prctl.
>
> Yes, if all you are concerned about is quiescing the tick, we could
> probably do it as a new syscall.
>
> I do note that you'd want to try to actually do the quiesce as late as
> possible - in particular, if you just did it in the usual syscall, you
> might miss out on a timer that is set by softirq, or even something
> that happened when you called schedule() on the syscall exit path.
> Doing it as late as we are doing helps to ensure that that doesn't
> happen. We could still arrange for this semantics by having a new
> quiesce() syscall set a temporary task bit that was cleared on
> return to userspace, but as you pointed out in a different email,
> that gets tricky if you end up doing multiple user_exit() calls on
> your way back to userspace.
>
> More to the point, I think it's actually important to know when an
> application believes it's in userspace-only mode as an actual state
> bit, rather than just during its transitional moment. If an
> application calls the kernel at an unexpected time (third-party code
> is the usual culprit for our customers, whether it's syscalls, page
> faults, or other things) we would prefer to have the "quiesce"
> semantics stay in force and cause the third-party code to be
> visibly very slow, rather than cause a totally unexpected and
> hard-to-diagnose interrupt show up later as we are still going
> around the loop that we thought was safely userspace-only.
>
> And, for debugging the kernel, it's crazy helpful to have that state
> bit in place: see patch 6/6 in the series for how we can diagnose
> things like "a different core just queued an IPI that will hit a
> dataplane core unexpectedly". Having that state bit makes this sort
> of thing a trivial check in the kernel and relatively easy to debug.

I agree with this! It is currently a bit painful to debug problems
that might result in multiple tasks runnable on a given CPU. If you
suspect a problem, you enable tracing and re-run. Not particularly
friendly for chasing down intermittent problems, so some sort of
improvement would be a very good thing.

Thanx, Paul

> Finally, I proposed a "strict" mode in patch 5/6 where we kill the
> process if it voluntarily enters the kernel by mistake after saying it
> wasn't going to any more. To do this requires a state bit, so
> carrying another state bit for "quiesce on user entry" seems pretty
> reasonable.
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>
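
For illustration, the debug check described above might look roughly
like this sketch; task_dataplane_debug() is a hypothetical helper over
the per-task dataplane_flags, and the exact hook point is an assumption:

/* Called before queueing an IPI or remote work to 'cpu': warn if that
 * cpu is currently running a task that asked to stay in userspace. */
static void dataplane_debug_remote(int cpu)
{
	struct task_struct *t = cpu_curr(cpu);

	if (t && task_dataplane_debug(t))
		pr_err("queueing IPI to dataplane cpu %d (task %s/%d)\n",
		       cpu, t->comm, t->pid);
}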

2015-05-12 13:20:13

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On Tue, May 12, 2015 at 11:38:58AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote:
> > +++ b/kernel/time/tick-sched.c
> > @@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
> > (jiffies - start));
> > dump_stack();
> > }
> > +
> > + /*
> > + * Kill the process if it violates STRICT mode. Note that this
> > + * code also results in killing the task if a kernel bug causes an
> > + * irq to be delivered to this core.
> > + */
> > + if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
> > + == PR_DATAPLANE_STRICT) {
> > + pr_warn("Dataplane STRICT mode violated; process killed.\n");
> > + dump_stack();
> > + task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
> > + local_irq_enable();
> > + do_group_exit(SIGKILL);
> > + }
> > }
>
> So while I'm all for hard fails like this, can we not provide a wee bit
> more information in the siginfo ? And maybe use a slightly less fatal
> signal, such that userspace can actually catch it and dump state in
> debug modes?

Agreed, a bit more debug state would be helpful.

Thanx, Paul
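
For illustration, the change being asked for might look roughly like
this sketch, replacing the do_group_exit(SIGKILL) quoted above; the
choice of signal and si_code here is illustrative, not from the patch:

/* Deliver a catchable signal with a populated siginfo so debug
 * builds can install a handler and dump state. */
struct siginfo info;

memset(&info, 0, sizeof(info));
info.si_signo = SIGILL;
info.si_code = ILL_ILLTRP;
info.si_addr = (void __user *)
	instruction_pointer(task_pt_regs(task));
force_sig_info(info.si_signo, &info, task);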

2015-05-12 15:36:39

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)

On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote:
> > >
> > > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this
> > > is a high level kernel feature, so it won't conflict with
> > > isolation concepts in lower level subsystems such as IOMMU
> > > isolation - and other higher level features like scheduler
> > > isolation are basically another partial implementation we want to
> > > merge with all this...
> >
> > But why do we need a CONFIG flag for something that has no content?
> >
> > That is, I do not see anything much; except the 'I want to stay in
> > userspace and kill me otherwise' flag, and I'm not sure that
> > warrants a CONFIG flag like this.
> >
> > Other than that, its all a combination of NOHZ_FULL and
> > cpusets/isolcpus and whatnot.
>
> Yes, that's what I meant: CONFIG_ISOLATION would trigger what is
> NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as
> an individual Kconfig option?

Right, we could return to what we had previously: CONFIG_NO_HZ. A config
that enables dynticks-idle by default and allows full dynticks if the
nohz_full= boot option is passed (or something driven by a higher-level
isolation interface).

Because eventually, distros enable NO_HZ_FULL so that their 0.0001% users
can use it. Well at least Red Hat does.

>
> CONFIG_ISOLATION=y would express the guarantee from the kernel that
> it's possible for user-space to configure itself to run undisturbed -
> instead of the current inconsistent set of options and facilities.
>
> A bit like CONFIG_PREEMPT_RT is more than just preemptable spinlocks,
> it also tries to offer various facilities and tune the defaults to
> turn the kernel hard-rt.
>
> Does that make sense to you?

Right, although distros tend to want features that can be enabled
dynamically, so that they have a single kernel to maintain. Things like
PREEMPT_RT really need to be a different kernel because fundamental
primitives like spinlocks must be implemented statically.

But isolation can be boot-enabled, or even runtime-enabled, as it's only
about timer, irq, and task affinity. Full nohz is more complicated, but
it can be runtime-toggled in the future.

So we could introduce CONFIG_CPU_ISOLATION, if only so that distros
that are really not interested in this can disable it.
CONFIG_CPU_ISOLATION=y would provide an ability which is
default-disabled and driven dynamically through whatever interface.

2015-05-12 21:05:59

by Chris Metcalf

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y

On 05/12/2015 05:10 AM, Ingo Molnar wrote:
> * Chris Metcalf <[email protected]> wrote:
>
>> - ISOLATION (Frederic). I like this but it conflicts with other uses
>> of "isolation" in the kernel: cgroup isolation, lru page isolation,
>> iommu isolation, scheduler isolation (at least it's a superset of
>> that one), etc. Also, we're not exactly isolating a task - often
>> a "dataplane" app consists of a bunch of interacting threads in
>> userspace, so not exactly isolated. So perhaps it's too confusing.
> So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is
> a high level kernel feature, so it won't conflict with isolation
> concepts in lower level subsystems such as IOMMU isolation - and other
> higher level features like scheduler isolation are basically another
> partial implementation we want to merge with all this...
>
> nohz, RCU tricks, watchdog defaults, isolcpus and various other
> measures to keep these CPUs and workloads as isolated as possible
> are (or should become) components of this high level concept.
>
> Ideally CONFIG_ISOLATION=y would be a kernel feature that has almost
> zero overhead on normal workloads and on non-isolated CPUs, so that
> Linux distributions can enable it.

Using CONFIG_CPU_ISOLATION to capture all this stuff instead of
making CONFIG_NO_HZ_FULL do it seems plausible for naming.
However, this feels like just bombing the current naming to this
new name, right? I'd like to argue that this is orthogonal to adding
new isolation functionality into no_hz_full, as my patch series has
been doing. Perhaps we can defer this to a follow-up patch series?
I'm happy to do the work but I'm not sure we want to bundle all
that churn into the current patch series under consideration.
I can use cpu_isolation_xxx for naming in the current patch series
so we don't have to come back and bomb that later.

> Enabling CONFIG_ISOLATION=y should be the only 'kernel config' step
> needed: just like cpusets, the configuration of isolated CPUs should
> be a completely boot option free excercise that can be dynamically
> done and undone by the administrator via an intuitive interface.

Eventually isolation can be runtime-enabled, but for now I think
it makes sense for it to be boot-enabled. As Frederic suggested, we
can arrange full nohz to be runtime toggled in the future.
I agree that it should be reasonable to compile it in by default.

On 05/12/2015 07:48 AM, Peter Zijlstra wrote:
> But why do we need a CONFIG flag for something that has no content?
>
> That is, I do not see anything much; except the 'I want to stay in
> userspace and kill me otherwise' flag, and I'm not sure that warrants a
> CONFIG flag like this.
>
> Other than that, its all a combination of NOHZ_FULL and cpusets/isolcpus
> and whatnot.

There are three major pieces here - one is the STRICT piece
that you allude to, but there is also the piece where we quiesce
tasks in the kernel until no timer interrupts are pending, and the
piece that allows easy debugging of stray IRQs etc to isolated cpus.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-12 21:06:17

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
> [add peterz due to perf stuff]
>
> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <[email protected]> wrote:
>> Patch 6/6 proposes a mechanism to track down times when the
>> kernel screws up and delivers an IRQ to a userspace-only task.
>> Here, we're just trying to identify the times when an application
>> screws itself up out of cluelessness, and provide a mechanism
>> that allows the developer to easily figure out why and fix it.
>>
>> In particular, /proc/interrupts won't show syscalls or page faults,
>> which are two easy ways applications can screw themselves
>> when they think they're in userspace-only mode. Also, they don't
>> provide sufficient precision to make it clear what part of the
>> application caused the undesired kernel entry.
> Perf does, though, complete with context.

The perf_event suggestions are interesting, but I think it's plausible
for this to be an alternate way to debug the issues that STRICT
addresses.

>> In this case, killing the task is appropriate, since that's exactly
>> the semantics that have been asked for - it's like on architectures
>> that don't natively support unaligned accesses, but fake it relatively
>> slowly in the kernel, and in development you just say "give me a
>> SIGBUS when that happens" and in production you might say
>> "fix it up and let's try to keep going".
> I think more control is needed. I also think that, if we go this
> route, we should distinguish syscalls, synchronous non-syscall
> entries, and asynchronous non-syscall entries. They're quite
> different.

I don't think it's necessary to distinguish the types. As long as we
have a PC pointing to the instruction that triggered the problem,
we can see if it's a system call instruction, a memory write that
caused a page fault, a trap instruction, etc. We certainly could
add infrastructure to capture syscall numbers, fault/signal numbers,
etc etc, but I think it's overkill if it adds kernel overhead on
entry/exit.

>> A better implementation, I think, is to put the tests for "you
>> screwed up and synchronously entered the kernel" in
>> the syscall_trace_enter() code, which TIF_NOHZ already
>> gets us into;
> No, not unless you're planning on using that to distinguish syscalls
> from other stuff *and* people think that's justified.

So, the question is how we separate synchronous entries
from IRQs? At a high level, IRQs are kernel bugs (for cpu-isolated
tasks), and synchronous entries are application bugs. We'd
like to deliver a signal for the latter, and do some kind of
kernel diagnostics for the former. So we can't just add the
test in the context tracking code, which doesn't actually know
why we're entering or exiting.

That's why I was thinking that the syscall_trace_entry and
exception_enter paths were the best choices. I'm fairly sure
that exception_enter is only done for synchronous traps,
page faults, etc.
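
For concreteness, the check described above might look roughly like
this; dataplane_strict_violation() is a hypothetical helper, and the
call site would be each arch's syscall_trace_enter() path:

/* Flag a STRICT violation on a voluntary syscall, excluding prctl()
 * and the exit paths. */
static void dataplane_check_syscall(struct pt_regs *regs)
{
	int nr = syscall_get_nr(current, regs);

	if ((current->dataplane_flags & PR_DATAPLANE_STRICT) &&
	    nr != __NR_prctl && nr != __NR_exit && nr != __NR_exit_group)
		dataplane_strict_violation(regs);	/* hypothetical */
}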

Certainly on the tile architecture we include the trap number
in the pt_regs, so it's possible to just examine the pt_regs and
know why you entered or are exiting the kernel, but I don't
think we can rely on that for all architectures.

> It's far to easy to just make a tiny change to the entry code. Add a
> tiny trivial change here, a few lines of asm (that's you, audit!)
> there, some weird written-in-asm scheduling code over here, and you
> end up with the truly awful mess that we currently have.
>
> If it really makes sense for this stuff to go with context tracking,
> then fine, but we should *fix* the context tracking first rather than
> kludging around it. I already have a prototype patch for the relevant
> part of that.
>
>> there, we can test if the dataplane "strict" bit is
>> set and the syscall is not prctl(), then we generate the error.
>> (We'd exclude exit and exit_group here too, since we don't
>> need to shoot down a task that's just trying to kill itself.)
>> This needs a bit of platform-specific code for each platform,
>> but that doesn't seem like too big a problem.
> I'd rather avoid that, too. This feature isn't really arch-specific,
> so let's avoid the arch stuff if at all possible.

I'll put out a v2 of my patch that does both the things you
advise against :-) just so we can have a strawman to think
about how to do it better - unless you have a suggestion
offhand as to how we can better differentiate sync and async
entries into the kernel in a platform-independent way.

I could imagine modifying user_exit() and exception_enter()
to pass an identifier into the context system saying why they
were changing contexts, so we could have syscalls, trap
numbers, fault numbers, etc., and some way to query as
to whether they were synchronous or asynchronous, and
build this scheme on top of that, but I'm not sure the extra
infrastructure is worthwhile.
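
Roughly, the extra infrastructure being imagined here - purely a
sketch, not proposed code - might be something like:

/* Tag context transitions with a reason so the context-tracking
 * layer can tell synchronous entries from asynchronous ones. */
enum ctx_entry_reason {
	CTX_SYSCALL,	/* voluntary: syscall number available */
	CTX_TRAP,	/* synchronous: page fault, trap, etc. */
	CTX_IRQ,	/* asynchronous: device irq, IPI */
};

static inline bool ctx_entry_is_sync(enum ctx_entry_reason reason)
{
	return reason != CTX_IRQ;
}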

>> Likewise we can test in exception_enter() since that's only
>> called for all the synchronous user entries like page faults.
> Let's try to generalize a bit. There's also irq_entry and ist_enter,
> and some of the exception_enter cases are for synchronous entries
> while (IIRC -- could be wrong) others aren't always like that.

I don't think we need to generalize this piece. irq_entry()
shouldn't be reported by the STRICT mechanism but by
kernel bug reporting. For ist_enter(), it looks like if you're
coming from userspace it's just handled with exception_enter().
I'm more familiar with the tile architecture mechanisms than
with x86, though, to be honest.

>>> Also, I think that most users will be quite surprised if "strict
>>> dataplane" code causes any machine check on the system to kill your
>>> dataplane task.
>>
>> Fair point, and avoided by testing as described above instead.
>> (Though presumably in development it's not such a big deal,
>> and as I said you'd likely turn it off in production.)
> Until you forget to turn it off in production because it worked so
> nicely in development.

I guess that's an argument for using a non-fatal signal with a
handler from the get-go, since then even in production you'll
just end up with a slightly heavier-weight kernel overhead
(whatever stupid thing your application did, plus the time
spent in the signal handler), but then after that you can get
back to processing packets or whatever the app is doing.

You had mentioned some alternatives to a catchable signal
(a signal to some other process, or queuing to an fd); I think
it still seems reasonable to just deliver a signal to the process,
configurably by the prctl, and not do anything more complex.
Does this seem reasonable to you at this point?
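
For illustration, the userspace side of such a catchable-signal
arrangement might look like this sketch; the flag names, values, and
the idea that prctl() takes the signal number as an extra argument are
all assumptions here, not the posted interface:

#include <signal.h>
#include <sys/prctl.h>

#define PR_SET_DATAPLANE	47		/* hypothetical value */
#define PR_DATAPLANE_ENABLE	(1 << 0)
#define PR_DATAPLANE_STRICT	(1 << 1)	/* hypothetical bit */

static void strict_handler(int sig, siginfo_t *si, void *uc)
{
	/* Dump state in debug builds, then resume processing. */
	(void)sig; (void)si; (void)uc;
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = strict_handler,
				.sa_flags = SA_SIGINFO };

	sigaction(SIGUSR1, &sa, NULL);
	prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE | PR_DATAPLANE_STRICT,
	      SIGUSR1, 0, 0);
	/* ... userspace-only loop ... */
	return 0;
}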

> What if we added a mode to perf where delivery of a sample
> synchronously (or semi-synchronously by catching it on the next exit
> to userspace) freezes the delivering task? It would be like debugger
> support via perf.
>
> peterz, do you think this would be a sensible thing to add to perf?
> It would only make sense for some types of events (tracepoints and
> hw_breakpoints mostly, I think).

I suspect it's reasonable to consider this orthogonal, particularly
if there is some skid between the actual violation by the
application, and the freeze happening.

You pushed back somewhat on prctl() in favor of a quiesce()
syscall in your email, but it seemed like at the end of your
email you were adopting the prctl() perspective. Is that true?
I admit the prctl() still seems cleaner from my perspective.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-12 22:23:27

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On May 13, 2015 6:06 AM, "Chris Metcalf" <[email protected]> wrote:
>
> On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
>>
>> [add peterz due to perf stuff]
>>
>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <[email protected]> wrote:
>>>
>>> Patch 6/6 proposes a mechanism to track down times when the
>>> kernel screws up and delivers an IRQ to a userspace-only task.
>>> Here, we're just trying to identify the times when an application
>>> screws itself up out of cluelessness, and provide a mechanism
>>> that allows the developer to easily figure out why and fix it.
>>>
>>> In particular, /proc/interrupts won't show syscalls or page faults,
>>> which are two easy ways applications can screw themselves
>>> when they think they're in userspace-only mode. Also, they don't
>>> provide sufficient precision to make it clear what part of the
>>> application caused the undesired kernel entry.
>>
>> Perf does, though, complete with context.
>
>
> The perf_event suggestions are interesting, but I think it's plausible
> for this to be an alternate way to debug the issues that STRICT
> addresses.
>
>
>>> In this case, killing the task is appropriate, since that's exactly
>>> the semantics that have been asked for - it's like on architectures
>>> that don't natively support unaligned accesses, but fake it relatively
>>> slowly in the kernel, and in development you just say "give me a
>>> SIGBUS when that happens" and in production you might say
>>> "fix it up and let's try to keep going".
>>
>> I think more control is needed. I also think that, if we go this
>> route, we should distinguish syscalls, synchronous non-syscall
>> entries, and asynchronous non-syscall entries. They're quite
>> different.
>
>
> I don't think it's necessary to distinguish the types. As long as we
> have a PC pointing to the instruction that triggered the problem,
> we can see if it's a system call instruction, a memory write that
> caused a page fault, a trap instruction, etc.

Not true. PC right after a syscall insn could be any type of kernel
entry, and you can't even reliably tell whether the syscall insn was
executed or, on x86, whether it was a syscall at all. (x86 insns
can't be reliably decoded backwards.)

PC pointing at a load could be a page fault or an IPI.

> We certainly could
> add infrastructure to capture syscall numbers, fault/signal numbers,
> etc etc, but I think it's overkill if it adds kernel overhead on
> entry/exit.
>

None of these should add overhead.

>
>>> A better implementation, I think, is to put the tests for "you
>>> screwed up and synchronously entered the kernel" in
>>> the syscall_trace_enter() code, which TIF_NOHZ already
>>> gets us into;
>>
>> No, not unless you're planning on using that to distinguish syscalls
>> from other stuff *and* people think that's justified.
>
>
> So, the question is how we separate synchronous entries
> from IRQs? At a high level, IRQs are kernel bugs (for cpu-isolated
> tasks), and synchronous entries are application bugs. We'd
> like to deliver a signal for the latter, and do some kind of
> kernel diagnostics for the former. So we can't just add the
> test in the context tracking code, which doesn't actually know
> why we're entering or exiting.

Synchronous entries could be VM bugs, too.

>
> That's why I was thinking that the syscall_trace_entry and
> exception_enter paths were the best choices. I'm fairly sure
> that exception_enter is only done for synchronous traps,
> page faults, etc.

Maybe. Doing it through the actual entry/exit slow paths would be
overhead-free, although I'm not sure that IRQs have real slow paths
for entry.

>
> Certainly on the tile architecture we include the trap number
> in the pt_regs, so it's possible to just examine the pt_regs and
> know why you entered or are exiting the kernel, but I don't
> think we can rely on that for all architectures.

x86 can't do this.

> I'll put out a v2 of my patch that does both the things you
> advise against :-) just so we can have a strawman to think
> about how to do it better - unless you have a suggestion
> offhand as to how we can better differentiate sync and async
> entries into the kernel in a platform-independent way.
>
> I could imagine modifying user_exit() and exception_enter()
> to pass an identifier into the context system saying why they
> were changing contexts, so we could have syscalls, trap
> numbers, fault numbers, etc., and some way to query as
> to whether they were synchronous or asynchronous, and
> build this scheme on top of that, but I'm not sure the extra
> infrastructure is worthwhile.
>

I'll take a look.

Again, though, I think we really do need to distinguish at least MCE
and NMI (on x86) from the others.

>
>> What if we added a mode to perf where delivery of a sample
>> synchronously (or semi-synchronously by catching it on the next exit
>> to userspace) freezes the delivering task? It would be like debugger
>> support via perf.
>>
>> peterz, do you think this would be a sensible thing to add to perf?
>> It would only make sense for some types of events (tracepoints and
>> hw_breakpoints mostly, I think).
>
>
> I suspect it's reasonable to consider this orthogonal, particularly
> if there is some skid between the actual violation by the
> application, and the freeze happening.
>

I think it could be done without skid, except for async entries, but
for async entries we don't care about exact user state anyway.

> You pushed back somewhat on prctl() in favor of a quiesce()
> syscall in your email, but it seemed like at the end of your
> email you were adopting the prctl() perspective. Is that true?
> I admit the prctl() still seems cleaner from my perspective.
>

Prctl for the strict thing seems much more reasonable to me than prctl
for quiescing. Also, the scheduler people seem to thing that
quiescing should be automatic.

Anyway, I'll happily look at code and maybe even write more coherent
emails when I'm back in town in a week. Since you're thinking that
async entries should give kernel diagnostics instead of signals, maybe
the right thing to do is to separate them out completely and try to
address the individual entry types separately and as needed.

--Andy

2015-05-13 04:35:52

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On Tue, May 12, 2015 at 5:52 AM, Ingo Molnar <[email protected]> wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
>> > So if then a prctl() (or other system call) could be a shortcut
>> > to:
>> >
>> > - move the task to an isolated CPU
>> > - make sure there _is_ such an isolated domain available
>> >
>> > I.e. have some programmatic, kernel provided way for an
>> > application to be sure it's running in the right environment.
>> > Relying on random administration flags here and there won't cut
>> > it.
>>
>> No, we already have sched_setaffinity() and we should not duplicate
>> its ability to move tasks about.
>
> But sched_setaffinity() does not guarantee isolation - it's just a
> syscall to move a task to a set of CPUs, which might be isolated or
> not.
>
> What I suggested is that it might make sense to offer a system call,
> for example a sched_setparam() variant, that makes such guarantees.
>
> Say if user-space does:
>
> ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);
>
> ... then we would get the task moved to an isolated domain and get a 0
> return code if the kernel is able to do all that and if the current
> uid/namespace/etc. has the required permissions and such.
>
> ( BIND_ISOLATED will not replace the current p->policy value, so it's
> still possible to use the regular policies as well on top of this. )

I think we shouldn't have magic selection of an isolated domain.
Anyone using this has already configured some isolated CPUs and
probably wants to choose the CPU and, especially, NUMA node
themselves. Also, maybe it should be a special type of realtime
class/priority -- doing this should require RT permission IMO.

--Andy

2015-05-13 21:00:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On Tue, May 12, 2015 at 09:35:25PM -0700, Andy Lutomirski wrote:
> On Tue, May 12, 2015 at 5:52 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> >> > So if then a prctl() (or other system call) could be a shortcut
> >> > to:
> >> >
> >> > - move the task to an isolated CPU
> >> > - make sure there _is_ such an isolated domain available
> >> >
> >> > I.e. have some programmatic, kernel provided way for an
> >> > application to be sure it's running in the right environment.
> >> > Relying on random administration flags here and there won't cut
> >> > it.
> >>
> >> No, we already have sched_setaffinity() and we should not duplicate
> >> its ability to move tasks about.
> >
> > But sched_setaffinity() does not guarantee isolation - it's just a
> > syscall to move a task to a set of CPUs, which might be isolated or
> > not.
> >
> > What I suggested is that it might make sense to offer a system call,
> > for example a sched_setparam() variant, that makes such guarantees.
> >
> > Say if user-space does:
> >
> > ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);
> >
> > ... then we would get the task moved to an isolated domain and get a 0
> > return code if the kernel is able to do all that and if the current
> > uid/namespace/etc. has the required permissions and such.
> >
> > ( BIND_ISOLATED will not replace the current p->policy value, so it's
> > still possible to use the regular policies as well on top of this. )
>
> I think we shouldn't have magic selection of an isolated domain.
> Anyone using this has already configured some isolated CPUs and
> probably wants to choose the CPU and, especially, NUMA node
> themselves. Also, maybe it should be a special type of realtime
> class/priority -- doing this should require RT permission IMO.

I have no real argument against special permissions, but this feature
is totally orthogonal to realtime classes/priorities. It is perfectly
legitimate for a given CPU's single runnable task to be SCHED_OTHER,
for example.

Thanx, Paul

2015-05-14 20:54:57

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On 05/12/2015 05:33 AM, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
>> This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
>> kernel to quiesce any pending timer interrupts prior to returning
>> to userspace. When running with this mode set, sys calls (and page
>> faults, etc.) can be inordinately slow. However, user applications
>> that want to guarantee that no unexpected interrupts will occur
>> (even if they call into the kernel) can set this flag to guarantee
>> that semantics.
> Currently people hot-unplug and hot-plug the CPU to do this. Obviously
> that's a wee bit horrible :-)
>
> Not sure if a prctl like this is any better though. This is a CPU
> property, not a process one.

The CPU property aspects, I think, should be largely handled by
fixing kernel bugs that let work end up running on nohz_full cores
without having been explicitly requested to run there.

As you said in a follow-up email:

On 05/12/2015 06:38 AM, Peter Zijlstra wrote:
> Ideally we'd never have to clear the state because it should be
> impossible to get into this predicament in the first place.

What my prctl() proposal does is quiesce things that end up
happening specifically because the user process called on purpose
into the kernel. For example, perhaps RCU was invoked in the
kernel, and the core has to wait a timer tick to quiesce RCU.
Whatever causes it, the intent is that you're not allowed back into
userspace until everything has settled down from your call into
the kernel; the presumption is that it's all due to the kernel entry
that was just made, and not from other stray work.

In that sense, it's very appropriate for it to be a process property.

> ISTR people talking about 'quiesce' sysfs file, along side the hotplug
> stuff, I can't quite remember.

It seems somewhat similar (adding Viresh to the cc's) but does
seem like it might have been more intended to address the
CPU properties rather than process properties:

https://lkml.org/lkml/2014/4/4/99

One thing the original Tilera dataplane code did was to require
setting dataplane flags to succeed only on dataplane cores,
and only when the task had been affinitized to that single core.
This did not protect the task from later being re-affinitized in
a way that broke those assumptions, but I suppose you could
also imagine make sched_setaffinity() fail for such a process.
Somewhat unrelated, but it occurred to me in the context of this
reply, so what do you think? I can certainly add this to the
patch series if it seems like it makes setting the prctl() flags
more conservative.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-14 20:55:24

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On 05/12/2015 08:52 AM, Ingo Molnar wrote:
> What I suggested is that it might make sense to offer a system call,
> for example a sched_setparam() variant, that makes such guarantees.
>
> Say if user-space does:
>
> ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);
>
> ... then we would get the task moved to an isolated domain and get a 0
> return code if the kernel is able to do all that and if the current
> uid/namespace/etc. has the required permissions and such.

Unfortunately I don't know nearly as much about the scheduler
and scheduler policies as I might, since I mostly focused on
making the scheduler stay out of the way. :-) This does seem like
another way to set a policy bit on a process. I assume you
could only validly issue this call on a nohz_full core, and that
you're not assuming it migrates the task to such a core?

You suggested that BIND_ISOLATED would not replace the usual
scheduler policies, but perhaps SCHED_ISOLATED as a full
replacement would make sense - it would make it an error
to have any other schedulable task on that core. I guess that
brings it around to whether the "cpu_isolated" task just loses when
another task is scheduled on the core with it (the current
approach I'm proposing) or if it ends up truly owning the core
and other processes can be denied the right to run there:
which in that case clearly does get us into the area of requiring
privileges to set up, as Andy pointed out later.

This would leave the notion of "strict" as proposed elsewhere
as a separate thing, but presumably it could still be a prctl()
as originally proposed.

I admit I don't know enough to say whether this sounds like
a better approach than just using a prctl() to set the
cpu_isolated state. My instinct is that it's cleanest to avoid
requiring permissions to do this, and to simply enable the
quiescing semantics the process requested when it happens
to be alone on a core. If so, it's somewhat orthogonal to the
actual scheduler policy in force, so best not to conflate it with
the notion of scheduler code at all via sched_setscheduler()?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-14 20:55:52

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane

On 05/12/2015 09:12 AM, Paul E. McKenney wrote:
> On Tue, May 12, 2015 at 11:26:07AM +0200, Peter Zijlstra wrote:
>> On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
>>> While the current fallback to 1-second tick is still helpful for
>>> maintaining completely correct kernel semantics, processes using
>>> prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
>>> completely tickless, so don't bound the time_delta for such processes.
>>>
>>> This was previously discussed in
>>>
>>> https://lkml.org/lkml/2014/10/31/364
>>>
>>> and Thomas Gleixner observed that vruntime, load balancing data,
>>> load accounting, and other things might be impacted. Frederic
>>> Weisbecker similarly observed that allowing the tick to be indefinitely
>>> deferred just meant that no one would ever fix the underlying bugs.
>>> However it's at least true that the mode proposed in this patch can
>>> only be enabled on an isolcpus core, which may limit how important
>>> it is to maintain scheduler data correctly, for example.
>> So how is making this available going to help people fix the actual
>> problem?
> It will at least provide an environment where adding more of this
> problem might get punished. This would be an improvement over what
> we have today, namely that the 1HZ fallback timer silently forgives
> adding more problems of this sort.

So I guess the obvious question to ask is whether there is a mode
that can be dynamically enabled (/proc/sys/kernel/nohz_experimental
or whatever) where we allow turning off this tick - perhaps to make
it more likely tick-dependent code isn't added to the kernel as Paul
suggests, or perhaps to enable applications that want to avoid the
conservative tick fallback and are willing to do sufficient QA that
they are comfortable exploring possible issues with the 1Hz tick being
disabled?
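
For concreteness, such a knob might look roughly like the following
sketch; the name and semantics are purely illustrative:

/* 0 = keep the 1Hz fallback tick (default), 1 = allow full shutoff. */
static int sysctl_nohz_experimental;
static int zero;
static int one = 1;

static struct ctl_table nohz_experimental_table[] = {
	{
		.procname	= "nohz_experimental",
		.data		= &sysctl_nohz_experimental,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,
		.extra2		= &one,
	},
	{ }
};
/* registered under /proc/sys/kernel via register_sysctl("kernel", ...) */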

Paul, PeterZ, any thoughts on something along these lines?
Or another suggestion?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-15 15:05:56

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On 05/11/2015 09:47 PM, Mike Galbraith wrote:
> On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
>> On 05/11/2015 03:19 PM, Mike Galbraith wrote:
>>> I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
>>> that old static isolcpus was _supposed_ to crawl off and die, I know
>>> beyond doubt that having isolated a cpu as well as you can definitely
>>> does NOT imply that said cpu should become tickless.
>> True, at a high level, I agree that it would be better to have a
>> top-level concept like Frederic's proposed ISOLATION that includes
>> isolcpus and nohz_cpu (and other stuff as needed).
>>
>> That said, what you wrote above is wrong; even with the patch you
>> acked, setting isolcpus does not automatically turn on nohz_full for
>> a given cpu. The patch made it true the other way around: when
>> you say nohz_full, you automatically get isolcpus on that cpu too.
>> That does, at least, make sense for the semantics of nohz_full.
> I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
> Yes, with nohz_full currently being static, the old allegedly dying but
> also static isolcpus scheduler off switch is a convenient thing to wire
> the nohz_full CPU SET (<- hint;) property to.

Yes, I was responding to the bit where you said "having isolated a
cpu as well as you can does NOT imply it should become tickless",
but indeed, the "nohz_full -> isolcpus" patch didn't make that true.
In any case sounds like we were just talking past each other.

> BTW, another facet of this: Rik wants to make isolcpus immune to
> cpusets, which makes some sense, user did say isolcpus=, but that also
> makes isolcpus truly static. If the user now says nohz_full=, they lose
> the ability to deactivate CPU isolation, making the set fairly useless
> for anything other than HPC. Currently, the user can flip the isolation
> switch as he sees fit. He takes a size extra large performance hit for
> having said nohz_full=, but he doesn't lose generic utility.

I don't think I follow this completely. If the user says nohz_full=, he
probably doesn't care about deactivating isolcpus later, since that
defeats the entire purpose of the nohz_full= in the first place,
as far as I can tell. And when you say "anything other than HPC",
I'm not sure what you mean; as far as I know high-performance
computing only cares because it wants that extra 0.5% of the
cpu or whatever interrupts eat up, but just as a nice-to-have.
The real use case is high-performance userspace drivers where
the nohz_full cores are responding to real-time things like packet
arrivals with almost no latency to spare.

What is the generic utility you're envisioning for nohz_full cores
that have turned off scheduler isolation? I assume it's some
workload where you'd prefer not to have too many interrupts
but still are running multiple tasks, but in that case does it really
make much difference in practice?

> Thomas has nuked the hrtimer softirq.

Yes, this I didn't know. So I will drop my "no ksoftirqd" patch and
we will see if ksoftirqd emerges as an issue for my "cpu isolation"
stuff in the future; it may be that that was the only issue.

> Inlining softirqs may save a context switch, but adds cycles that we may
> consume at higher frequency than the thing we're avoiding.

Yes but consuming cycles is not nearly as much of a concern
as avoiding interrupts or scheduling, certainly for the case of
userspace drivers that I described above.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-15 15:10:54

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On 05/12/2015 06:46 AM, Peter Zijlstra wrote:
> On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
>> Please lets get NO_HZ_FULL up to par. That should be the main focus.
>>
> ACK, much of this dataplane stuff is (useful) hacks working around the
> fact that nohz_full just isn't complete.

There are enough disjoint threads on this topic that I want
to just touch base here and see if you have been convinced
on other threads that there is stuff beyond the hacks here:
in particular

1. The basic "dataplane" mode to arrange to do extra work on
return to kernel space that normally isn't warranted, to avoid
future IPIs, and additionally to wait in the kernel until any timer
interrupts required by the kernel invocation itself are done; and

2. The "strict" mode to allow a task to tell the kernel it isn't
planning on making any more such calls, and have the kernel
help diagnose any resulting application bugs.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-15 18:44:26

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Fri, 2015-05-15 at 11:05 -0400, Chris Metcalf wrote:
> On 05/11/2015 09:47 PM, Mike Galbraith wrote:
> > On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> >> On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> >>> I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
> >>> that old static isolcpus was _supposed_ to crawl off and die, I know
> >>> beyond doubt that having isolated a cpu as well as you can definitely
> >>> does NOT imply that said cpu should become tickless.
> >> True, at a high level, I agree that it would be better to have a
> >> top-level concept like Frederic's proposed ISOLATION that includes
> >> isolcpus and nohz_cpu (and other stuff as needed).
> >>
> >> That said, what you wrote above is wrong; even with the patch you
> >> acked, setting isolcpus does not automatically turn on nohz_full for
> >> a given cpu. The patch made it true the other way around: when
> >> you say nohz_full, you automatically get isolcpus on that cpu too.
> >> That does, at least, make sense for the semantics of nohz_full.
> > I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
> > Yes, with nohz_full currently being static, the old allegedly dying but
> > also static isolcpus scheduler off switch is a convenient thing to wire
> > the nohz_full CPU SET (<- hint;) property to.
>
> Yes, I was responding to the bit where you said "having isolated a
> cpu as well as you can does NOT imply it should become tickless",
> but indeed, the "nohz_full -> isolcpus" patch didn't make that true.
> In any case sounds like we were just talking past each other.

Yup.

> > BTW, another facet of this: Rik wants to make isolcpus immune to
> > cpusets, which makes some sense, user did say isolcpus=, but that also
> > makes isolcpus truly static. If the user now says nohz_full=, they lose
> > the ability to deactivate CPU isolation, making the set fairly useless
> > for anything other than HPC. Currently, the user can flip the isolation
> > switch as he sees fit. He takes a size extra large performance hit for
> > having said nohz_full=, but he doesn't lose generic utility.
>
> I don't think I follow this completely. If the user says nohz_full=, he
> probably doesn't care about deactivating isolcpus later, since that
> defeats the entire purpose of the nohz_full= in the first place,
> as far as I can tell. And when you say "anything other than HPC",
> I'm not sure what you mean; as far as I know high-performance
> computing only cares because it wants that extra 0.5% of the
> cpu or whatever interrupts eat up, but just as a nice-to-have.
> The real use case is high-performance userspace drivers where
> the nohz_full cores are responding to real-time things like packet
> arrivals with almost no latency to spare.

Ok, verbosity on.

Currently, nohz_full is static, meaning in a dynamic environment, where
the user may not have a constant need for it, if you make it imply
isolcpus, then make isolcpus immutable, you have just needlessly taken
an option from the user. Those CPUs are no longer part of his generic
resource pool, and he has nothing to say about it.

> What is the generic utility you're envisioning for nohz_full cores
> that have turned off scheduler isolation? I assume it's some
> workload where you'd prefer not to have too many interrupts
> but still are running multiple tasks, but in that case does it really
> make much difference in practice?

Again, I think we're talking past one another.

I'm saying there is no need to mandate, nothing more. For your needs,
my needs whatever, that immutable may sound good, but in fact, it
removes flexibility, and for no good reason.

This shows immediately in simple testing. Do I need nohz_full? Hell
no, only for testing. If I want to test, I obviously need it for a
while, and yes, I can reboot... but what's the difference between me the
silly tester who needs it only to see if it works at all, and how well,
and some guy who does something critical once in a while, or a company
with a pool of big boxen that they reconfigure on the fly to meet
whatever dynamic needs?

Just because the nohz_full feature itself is currently static is no
reason to put users thereof in a straitjacket by mandating that any
set they define irrevocably disappears from the generic resource pool.
Those CPUs are useful until the moment someone cripples them. Making
nohz_full imply isolcpus sounds perfectly fine until someone comes along
and makes isolcpus immutable (Rik's patch), at which point the user
loses a choice due to two changes that _alone_ sound perfectly fine.

See what I'm saying now?

> > Thomas has nuked the hrtimer softirq.
>
> Yes, this I didn't know. So I will drop my "no ksoftirqd" patch and
> we will see if ksoftirqs emerge as an issue for my "cpu isolation"
> stuff in the future; it may be that that was the only issue.
>
> > Inlining softirqs may save a context switch, but adds cycles that we may
> > consume at higher frequency than the thing we're avoiding.
>
> Yes but consuming cycles is not nearly as much of a concern
> as avoiding interrupts or scheduling, certainly for the case of
> userspace drivers that I described above.

If you're raising softirqs in an SMP kernel, you're also doing something
that puts you at very serious risk of meeting the jitter monster, locks,
and worse, sleeping locks, no?

-Mike

2015-05-15 21:25:31

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On 05/12/2015 06:23 PM, Andy Lutomirski wrote:
> On May 13, 2015 6:06 AM, "Chris Metcalf" <[email protected]> wrote:
>> On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
>>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <[email protected]> wrote:
>>>> In this case, killing the task is appropriate, since that's exactly
>>>> the semantics that have been asked for - it's like on architectures
>>>> that don't natively support unaligned accesses, but fake it relatively
>>>> slowly in the kernel, and in development you just say "give me a
>>>> SIGBUS when that happens" and in production you might say
>>>> "fix it up and let's try to keep going".
>>> I think more control is needed. I also think that, if we go this
>>> route, we should distinguish syscalls, synchronous non-syscall
>>> entries, and asynchronous non-syscall entries. They're quite
>>> different.
>>
>> I don't think it's necessary to distinguish the types. As long as we
>> have a PC pointing to the instruction that triggered the problem,
>> we can see if it's a system call instruction, a memory write that
>> caused a page fault, a trap instruction, etc.
> Not true. PC right after a syscall insn could be any type of kernel
> entry, and you can't even reliably tell whether the syscall insn was
> executed or, on x86, whether it was a syscall at all. (x86 insns
> can't be reliably decoded backwards.)
>
> PC pointing at a load could be a page fault or an IPI.

All that we are trying to do with this API, though, is distinguish
synchronous faults. So IPIs, etc., should not be happening
(they would be bugs), and hopefully we are mostly just
distinguishing different types of synchronous program entries.
That said, I did add a siginfo flag to differentiate syscalls from
other synchronous entries, and I'm open to looking at more such
distinctions if they seem useful.

> Again, though, I think we really do need to distinguish at least MCE
> and NMI (on x86) from the others.

Yes, those are both interesting cases, and I'm not entirely
sure what the right way to handle them is - for example,
we would likely disable STRICT if you are running with perf enabled.

I look forward to hearing more when you're back next week!

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-15 21:26:46

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 0/5] support "cpu_isolated" mode for nohz_full

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode. A prctl()
option (PR_SET_CPU_ISOLATED) is added to control whether processes have
requested this stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2,
in turn based on 4.1-rc1) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v2:
rename "dataplane" to "cpu_isolated"
drop ksoftirqd suppression changes (believed no longer needed)
merge previous "QUIESCE" functionality into baseline functionality
explicitly track syscalls and exceptions for "STRICT" functionality
allow configuring a signal to be delivered for STRICT mode failures
move debug tracking to irq_enter(), not irq_exit()

Note: I have not yet removed the hack to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555).

Chris Metcalf (5):
nohz_full: add support for "cpu_isolated" mode
nohz: support PR_CPU_ISOLATED_STRICT mode
nohz: cpu_isolated strict mode configurable signal
nohz: add cpu_isolated_debug boot flag
nohz: cpu_isolated: allow tick to be fully disabled

Documentation/kernel-parameters.txt | 6 +++
arch/tile/kernel/ptrace.c | 6 ++-
arch/tile/mm/homecache.c | 5 +-
arch/x86/kernel/ptrace.c | 2 +
include/linux/context_tracking.h | 11 +++--
include/linux/sched.h | 3 ++
include/linux/tick.h | 28 +++++++++++
include/uapi/linux/prctl.h | 8 +++
kernel/context_tracking.c | 12 +++--
kernel/irq_work.c | 4 +-
kernel/sched/core.c | 18 +++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 6 +++
kernel/sys.c | 8 +++
kernel/time/tick-sched.c | 98 ++++++++++++++++++++++++++++++++++++-
16 files changed, 214 insertions(+), 10 deletions(-)

--
2.1.2

2015-05-15 21:27:57

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl(). When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only two actions taken. First, the task
calls lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core. Then, the code checks for
pending timer interrupts and quiesces until they are no longer pending.
As a result, sys calls (and page faults, etc.) can be inordinately slow.
However, this quiescing guarantees that no unexpected interrupts will
occur, even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/sched.h | 3 +++
include/linux/tick.h | 10 +++++++++
include/uapi/linux/prctl.h | 5 +++++
kernel/context_tracking.c | 3 +++
kernel/sys.c | 8 ++++++++
kernel/time/tick-sched.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 80 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..fb4ba400d7e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
#endif
+#ifdef CONFIG_NO_HZ_FULL
+ unsigned int cpu_isolated_flags;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..ec1953474a65 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
#include <linux/context_tracking_state.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
+#include <linux/prctl.h>

#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}

+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick(void) { }
static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED 47
+#define PR_GET_CPU_ISOLATED 48
+# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..66739d7c1350 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/tick.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (tick_nohz_is_cpu_isolated())
+ tick_nohz_cpu_isolated_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..3fd9e47f8fc8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_NO_HZ_FULL
+ case PR_SET_CPU_ISOLATED:
+ me->cpu_isolated_flags = arg2;
+ break;
+ case PR_GET_CPU_ISOLATED:
+ error = me->cpu_isolated_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..f1551c946c45 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/swap.h>

#include <asm/irq_regs.h>

@@ -389,6 +390,56 @@ void __init tick_nohz_init(void)
pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
cpumask_pr_args(tick_nohz_full_mask));
}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start));
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+
+ /* Idle with interrupts enabled and wait for the tick. */
+ set_current_state(TASK_INTERRUPTIBLE);
+ arch_cpu_idle();
+ set_current_state(TASK_RUNNING);
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start));
+ dump_stack();
+ }
+}
+
#endif

/*
--
2.1.2

2015-05-15 21:28:04

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves. In particular, if it
enters the kernel via system call, page fault, or any of a number of other
synchronous traps, it may be unexpectedly exposed to long latencies.
Add a simple flag that puts the process into a state where any such
kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit to
current->cpu_isolated_flags that is set when prctl() sets the flags.
We check the bit on syscall entry as well as on any exception_enter().
The prctl() syscall is exempted so that the bit can be cleared again
later, and exit/exit_group are exempted so that the task can exit
without a pointless fatal signal being delivered on the way out.
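
As a sketch of the intended usage, assuming the constants above and a
task already pinned to a nohz_full core as in patch 1/5 (the counting
loop is a stand-in for a real polling workload):

#define _GNU_SOURCE
#include <sys/prctl.h>

#ifndef PR_CPU_ISOLATED_STRICT
# define PR_SET_CPU_ISOLATED    47
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
# define PR_CPU_ISOLATED_STRICT (1 << 1)
#endif

int main(void)
{
        volatile unsigned long n = 100000000UL;

        prctl(PR_SET_CPU_ISOLATED,
              PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT, 0, 0, 0);

        /* From here on, any syscall, page fault, or other synchronous
         * kernel entry is fatal, so do pure computation only. */
        while (--n)
                ;

        /* Exiting is still allowed: returning reaches exit_group(),
         * which the strict check ignores. */
        return 0;
}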

This change adds the syscall-detection hooks only for x86 and tile;
I am happy to try to add more for additional platforms in the final
version.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/ptrace.c | 6 +++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/tick.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++
7 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..d4e43a13bab1 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(
+ regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..7f784054ddea 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838256b4..d042f4cda39d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/tick.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);
extern void __context_tracking_task_switch(struct task_struct *prev,
@@ -37,8 +38,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ec1953474a65..b7ffb10337ba 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,6 +147,8 @@ extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
+extern void tick_nohz_cpu_isolated_syscall(int nr);
+extern void tick_nohz_cpu_isolated_exception(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -157,6 +159,8 @@ static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
+static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
+static inline void tick_nohz_cpu_isolated_exception(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
@@ -189,4 +193,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk)
__tick_nohz_task_switch(tsk);
}

+static inline bool tick_nohz_cpu_isolated_strict(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags &
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_CPU_ISOLATED 47
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+# define PR_CPU_ISOLATED_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 66739d7c1350..c82509caa42e 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -131,15 +131,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (__this_cpu_read(context_tracking.state) == state) {
@@ -150,6 +151,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -157,6 +159,7 @@ void context_tracking_exit(enum ctx_state state)
__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
}
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f1551c946c45..273820cd484a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
#include <linux/swap.h>

#include <asm/irq_regs.h>
+#include <asm/unistd.h>

#include "tick-internal.h"

@@ -440,6 +441,43 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

+static void kill_cpu_isolated_strict_task(void)
+{
+ dump_stack();
+ current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void tick_nohz_cpu_isolated_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void tick_nohz_cpu_isolated_exception(void)
+{
+ pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_cpu_isolated_strict_task();
+}
+
#endif

/*
--
2.1.2

2015-05-15 21:28:10

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 3/5] nohz: cpu_isolated strict mode configurable signal

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
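
A sketch of a task electing to catch the violation instead of being
killed, using the macros above (the signal choice is illustrative;
note that the kernel clears the _ENABLE bit before delivering the
signal, so execution continues after the handler returns):

#define _GNU_SOURCE
#include <signal.h>
#include <sys/prctl.h>

#ifndef PR_CPU_ISOLATED_SET_SIG
# define PR_SET_CPU_ISOLATED            47
# define PR_CPU_ISOLATED_ENABLE         (1 << 0)
# define PR_CPU_ISOLATED_STRICT         (1 << 1)
# define PR_CPU_ISOLATED_SET_SIG(sig)   (((sig) & 0x7f) << 8)
#endif

/* 1 if the last violation was a syscall, 0 for an exception. */
static volatile sig_atomic_t violation_was_syscall = -1;

static void on_violation(int sig, siginfo_t *info, void *uc)
{
        violation_was_syscall = info->si_code;
}

int main(void)
{
        struct sigaction sa = { 0 };

        sa.sa_sigaction = on_violation;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGUSR1, &sa, NULL);

        /* Deliver SIGUSR1 instead of SIGKILL on a strict violation. */
        prctl(PR_SET_CPU_ISOLATED,
              PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT |
              PR_CPU_ISOLATED_SET_SIG(SIGUSR1), 0, 0, 0);

        /* ... isolated work; any stray kernel entry runs the handler
         * and drops the task out of cpu_isolated mode. */
        return 0;
}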

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/time/tick-sched.c | 15 +++++++++++----
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
# define PR_CPU_ISOLATED_STRICT (1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 273820cd484a..772be78f926c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -441,11 +441,18 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

-static void kill_cpu_isolated_strict_task(void)
+static void kill_cpu_isolated_strict_task(int is_syscall)
{
+ siginfo_t info = {};
+ int sig;
+
dump_stack();
current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
- send_sig(SIGKILL, current, 1);
+
+ sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+ info.si_signo = sig;
+ info.si_code = is_syscall;
+ send_sig_info(sig, &info, current);
}

/*
@@ -464,7 +471,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall)

pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
current->comm, current->pid, syscall);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(1);
}

/*
@@ -475,7 +482,7 @@ void tick_nohz_cpu_isolated_exception(void)
{
pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
current->comm, current->pid);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(0);
}

#endif
--
2.1.2

2015-05-15 21:28:47

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 4/5] nohz: add cpu_isolated_debug boot flag

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_CPU_ISOLATED_ENABLE mode. Such processes should
get no interrupts from the kernel; if they do, and this boot flag
is specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a cpu_isolated core
has unexpectedly entered the kernel. But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
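
For reference, the hook this patch adds at each interrupt-generating
site is a single call just before the IPI is raised; here is a sketch
of a hypothetical send path (mirroring the irq_work and smp changes
below):

/* Hypothetical cross-cpu notification path in kernel code. */
static void notify_remote_cpu(int cpu)
{
        /*
         * A no-op unless cpu_isolated_debug was given on the command
         * line and the target is a nohz_full core currently running a
         * PR_CPU_ISOLATED_ENABLE task, in which case it logs the
         * pending interrupt and dumps the sender's stack.
         */
        tick_nohz_cpu_isolated_debug(cpu);
        arch_send_call_function_single_ipi(cpu);
}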

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 6 ++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/tick.h | 2 ++
kernel/irq_work.c | 4 +++-
kernel/sched/core.c | 18 ++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 6 ++++++
8 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f6befa9855c1..2b4c89225d25 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -743,6 +743,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
/proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.txt.

+ cpu_isolated_debug [KNL]
+ In kernels built with CONFIG_NO_HZ_FULL and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_CPU_ISOLATED_ENABLE.
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..f336880e1b01 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/tick.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ tick_nohz_cpu_isolated_debug(cpu);
+ }
}

/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b7ffb10337ba..0b0d76106b8c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -149,6 +149,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
extern void tick_nohz_cpu_isolated_syscall(int nr);
extern void tick_nohz_cpu_isolated_exception(void);
+extern void tick_nohz_cpu_isolated_debug(int cpu);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -161,6 +162,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
static inline void tick_nohz_cpu_isolated_exception(void) { }
+static inline void tick_nohz_cpu_isolated_debug(int cpu) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..7f35c90346de 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..7315e7272e94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,24 @@ bool sched_can_stop_tick(void)

return true;
}
+
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static int cpu_isolated_debug;
+static int __init cpu_isolated_debug_func(char *str)
+{
+ cpu_isolated_debug = true;
+ return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void tick_nohz_cpu_isolated_debug(int cpu)
+{
+ if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+ pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+ dump_stack();
+ }
+}
#endif /* CONFIG_NO_HZ_FULL */

void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..1a810ac2656e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_NO_HZ_FULL
+ /* If the task is being killed, don't complain about cpu_isolated. */
+ if (state & TASK_WAKEKILL)
+ t->cpu_isolated_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..6b7d8e2c8af4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/tick.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ tick_nohz_cpu_isolated_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..333872925ff6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
+#include <linux/context_tracking.h>
#include <linux/tick.h>
#include <linux/irq.h>

@@ -335,6 +336,11 @@ void irq_enter(void)
_local_bh_enable();
}

+ if (context_tracking_cpu_is_enabled() &&
+ context_tracking_in_user() &&
+ !in_interrupt())
+ tick_nohz_cpu_isolated_debug(smp_processor_id());
+
__irq_enter();
}

--
2.1.2

2015-05-15 21:28:27

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 5/5] nohz: cpu_isolated: allow tick to be fully disabled

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.

This was previously discussed in

https://lkml.org/lkml/2014/10/31/364

and Thomas Gleixner observed that vruntime, load balancing data,
load accounting, and other things might be impacted. Frederic
Weisbecker similarly observed that allowing the tick to be indefinitely
deferred just meant that no one would ever fix the underlying bugs.
However it's at least true that the mode proposed in this patch can
only be enabled on an isolcpus core, which may limit how important
it is to maintain scheduler data correctly, for example.

It's also worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2008) and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc.). So these semantics are very
useful if we can convince ourselves that doing this is safe.

Signed-off-by: Chris Metcalf <[email protected]>
---
Note: I have kept this in the series despite PeterZ's nack, since it
didn't seem resolved in the original thread from v1 of the patch
(https://lkml.org/lkml/2015/5/8/555).

kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 772be78f926c..be4db5d81ada 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -727,7 +727,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
}

#ifdef CONFIG_NO_HZ_FULL
- if (!ts->inidle) {
+ if (!ts->inidle && !tick_nohz_is_cpu_isolated()) {
time_delta = min(time_delta,
scheduler_tick_max_deferment());
}
--
2.1.2

2015-05-15 22:17:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode

On Fri, 15 May 2015, Chris Metcalf wrote:
> +/*
> + * We normally return immediately to userspace.
> + *
> + * In "cpu_isolated" mode we wait until no more interrupts are
> + * pending. Otherwise we nap with interrupts enabled and wait for the
> + * next interrupt to fire, then loop back and retry.
> + *
> + * Note that if you schedule two "cpu_isolated" processes on the same
> + * core, neither will ever leave the kernel, and one will have to be
> + * killed manually.

And why are we not preventing that situation in the first place? The
scheduler should be able to figure that out easily.

> + Otherwise in situations where another process is
> + * in the runqueue on this cpu, this task will just wait for that
> + * other task to go idle before returning to user space.
> + */
> +void tick_nohz_cpu_isolated_enter(void)
> +{
> + struct clock_event_device *dev =
> + __this_cpu_read(tick_cpu_device.evtdev);
> + struct task_struct *task = current;
> + unsigned long start = jiffies;
> + bool warned = false;
> +
> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> + lru_add_drain();
> +
> + while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {

What's the ACCESS_ONCE for?

> + if (!warned && (jiffies - start) >= (5 * HZ)) {
> + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n",
> + task->comm, task->pid, smp_processor_id(),
> + (jiffies - start));

What additional value has the jiffies delta over a plain human
readable '5sec'?

> + warned = true;
> + }
> + if (should_resched())
> + schedule();
> + if (test_thread_flag(TIF_SIGPENDING))
> + break;
> +
> + /* Idle with interrupts enabled and wait for the tick. */
> + set_current_state(TASK_INTERRUPTIBLE);
> + arch_cpu_idle();

Oh NO! Not another variant of fake idle task. The idle implementations
can call into code which rightfully expects that the CPU is actually
IDLE.

I wasted enough time already debugging the resulting wreckage. Feel
free to use it for experimental purposes, but this is not going
anywhere near to a mainline kernel.

I completely understand WHY you want to do that, but we need proper
mechanisms for that and not some duct tape engineering band aids which
will create hard to debug side effects.

Hint: It's a scheduler job to make sure that the machine has quiesced
_BEFORE_ letting the magic task off to user land.

> + set_current_state(TASK_RUNNING);
> + }
> + if (warned) {
> + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n",
> + task->comm, task->pid, smp_processor_id(),
> + (jiffies - start));
> + dump_stack();

And that dump_stack() tells us which important information?

tick_nohz_cpu_isolated_enter
context_tracking_enter
context_tracking_user_enter
arch_return_to_user_code

Thanks,

tglx

2015-05-26 19:52:11

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

Thanks for the clarification, and sorry for the slow reply; I had a busy
week of meetings last week, and then the long weekend in the U.S.

On 05/15/2015 02:44 PM, Mike Galbraith wrote:
> Just because the nohz_full feature itself is currently static is no
> reason to put users thereof in a straitjacket by mandating that any
> set they define irrevocably disappears from the generic resource pool.
> Those CPUS are useful until the moment someone cripples them, which
> making nohz_full imply isolcpus does if isolcpus then also becomes
> immutable, which Rik's patch does. Making nohz_full imply isolcpus
> sounds perfectly fine until someone comes along and makes isolcpus
> immutable (Rik's patch), at which point the user loses a choice due to
> two people making it imply things that _alone_ sound perfectly fine.
>
> See what I'm saying now?

That does make sense; my argument was that 99% of the time when
someone specifies nohz_full they also need isolcpus. You're right
that someone playing with nohz_full would be unpleasantly surprised.
And of course having more flexibility always feels like a plus.
On balance I suspect it's still better to make command line arguments
handle the common cases most succinctly.

Hopefully we'll get to a point where all of this is dynamic and how
we play with the boot arguments no longer matters. If not, perhaps
we revisit this and make a cpu_isolation=1-15 type command line
argument that enables isolcpus and nohz_full both.

>>> Thomas has nuked the hrtimer softirq.
>> Yes, this I didn't know. So I will drop my "no ksoftirqd" patch and
>> we will see if ksoftirqs emerge as an issue for my "cpu isolation"
>> stuff in the future; it may be that that was the only issue.
>>
>>> Inlining softirqs may save a context switch, but adds cycles that we may
>>> consume at higher frequency than the thing we're avoiding.
>> Yes but consuming cycles is not nearly as much of a concern
>> as avoiding interrupts or scheduling, certainly for the case of
>> userspace drivers that I described above.
> If you're raising softirqs in an SMP kernel, you're also doing something
> that puts you at very serious risk of meeting the jitter monster, locks,
> and worse, sleeping locks, no?

The softirqs were being raised by third parties for hrtimer, not by
the application code itself, if I remember correctly. In any case
this appears not to be an issue for nohz_full any more now.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-27 03:28:07

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Tue, 2015-05-26 at 15:51 -0400, Chris Metcalf wrote:

> On balance I suspect it's still better to make command line arguments
> handle the common cases most succinctly.

I prefer user specifies precisely, but yeah, that entails more typing.

Idle curiosity: can SGI monster from hell boot a NO_HZ_FULL_ALL kernel,
w/wo it implying isolcpus? Readers having same and a reactor to power
it in their basement, please test.

-Mike

2015-06-03 15:29:55

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 0/5] support "cpu_isolated" mode for nohz_full

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode. A prctl()
option (PR_SET_CPU_ISOLATED) is added to control whether processes have
requested this stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2,
in turn based on 4.1-rc1) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v3:
remove dependency on cpu_idle subsystem (Thomas Gleixner)
use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
use seconds for console messages instead of jiffies (Thomas Gleixner)
updated commit description for patch 5/5

v2:
rename "dataplane" to "cpu_isolated"
drop ksoftirqd suppression changes (believed no longer needed)
merge previous "QUIESCE" functionality into baseline functionality
explicitly track syscalls and exceptions for "STRICT" functionality
allow configuring a signal to be delivered for STRICT mode failures
move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit that disables the 1Hz timer tick
fallback, despite PeterZ's nack, pending a decision on that thread
(https://lkml.org/lkml/2015/5/8/555); also, without that commit,
cpu_isolated threads would never re-enter userspace, since a tick
would always be pending.

Chris Metcalf (5):
nohz_full: add support for "cpu_isolated" mode
nohz: support PR_CPU_ISOLATED_STRICT mode
nohz: cpu_isolated strict mode configurable signal
nohz: add cpu_isolated_debug boot flag
nohz: cpu_isolated: allow tick to be fully disabled

Documentation/kernel-parameters.txt | 6 +++
arch/tile/kernel/process.c | 9 ++++
arch/tile/kernel/ptrace.c | 6 ++-
arch/tile/mm/homecache.c | 5 +-
arch/x86/kernel/ptrace.c | 2 +
include/linux/context_tracking.h | 11 ++--
include/linux/sched.h | 3 ++
include/linux/tick.h | 28 ++++++++++
include/uapi/linux/prctl.h | 8 +++
kernel/context_tracking.c | 12 +++--
kernel/irq_work.c | 4 +-
kernel/sched/core.c | 18 +++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 6 +++
kernel/sys.c | 8 +++
kernel/time/tick-sched.c | 104 +++++++++++++++++++++++++++++++++++-
17 files changed, 229 insertions(+), 10 deletions(-)

--
2.1.2

2015-06-03 15:30:29

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 1/5] nohz_full: add support for "cpu_isolated" mode

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl(). When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only two actions taken. First, the task
calls lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core. Then, the code checks for
pending timer interrupts and quiesces until they are no longer pending.
As a result, syscalls (and page faults, etc.) can be inordinately slow.
However, this quiescing guarantees that no unexpected interrupts will
occur, even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/process.c | 9 ++++++++
include/linux/sched.h | 3 +++
include/linux/tick.h | 10 ++++++++
include/uapi/linux/prctl.h | 5 ++++
kernel/context_tracking.c | 3 +++
kernel/sys.c | 8 +++++++
kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 95 insertions(+)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..e20c3f4a6a82 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
_cpu_idle();
}

+#ifdef CONFIG_NO_HZ_FULL
+void tick_nohz_cpu_isolated_wait(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ _cpu_idle();
+ set_current_state(TASK_RUNNING);
+}
+#endif
+
/*
* Release a thread_info structure
*/
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..fb4ba400d7e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
#endif
+#ifdef CONFIG_NO_HZ_FULL
+ unsigned int cpu_isolated_flags;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..ec1953474a65 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
#include <linux/context_tracking_state.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
+#include <linux/prctl.h>

#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}

+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick(void) { }
static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED 47
+#define PR_GET_CPU_ISOLATED 48
+# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..66739d7c1350 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/tick.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (tick_nohz_is_cpu_isolated())
+ tick_nohz_cpu_isolated_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..3fd9e47f8fc8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_NO_HZ_FULL
+ case PR_SET_CPU_ISOLATED:
+ me->cpu_isolated_flags = arg2;
+ break;
+ case PR_GET_CPU_ISOLATED:
+ error = me->cpu_isolated_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..f6236b66788f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/swap.h>

#include <asm/irq_regs.h>

@@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
cpumask_pr_args(tick_nohz_full_mask));
}
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak tick_nohz_cpu_isolated_wait(void)
+{
+ cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+ tick_nohz_cpu_isolated_wait();
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ dump_stack();
+ }
+}
+
#endif

/*
--
2.1.2

2015-06-03 15:30:11

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves. In particular, if it
enters the kernel via system call, page fault, or any of a number of other
synchronous traps, it may be unexpectedly exposed to long latencies.
Add a simple flag that puts the process into a state where any such
kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit to
current->cpu_isolated_flags that is set when prctl() sets the flags.
We check the bit on syscall entry as well as on any exception_enter().
The prctl() syscall is exempted so that the bit can be cleared again
later, and exit/exit_group are exempted so that the task can exit
without a pointless fatal signal being delivered on the way out.

This change adds the syscall-detection hooks only for x86 and tile;
I am happy to try to add more for additional platforms in the final
version.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/ptrace.c | 6 +++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/tick.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++
7 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..d4e43a13bab1 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(
+ regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..7f784054ddea 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838256b4..d042f4cda39d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/tick.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);
extern void __context_tracking_task_switch(struct task_struct *prev,
@@ -37,8 +38,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ec1953474a65..b7ffb10337ba 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,6 +147,8 @@ extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
+extern void tick_nohz_cpu_isolated_syscall(int nr);
+extern void tick_nohz_cpu_isolated_exception(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -157,6 +159,8 @@ static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
+static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
+static inline void tick_nohz_cpu_isolated_exception(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
@@ -189,4 +193,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk)
__tick_nohz_task_switch(tsk);
}

+static inline bool tick_nohz_cpu_isolated_strict(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags &
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_CPU_ISOLATED 47
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+# define PR_CPU_ISOLATED_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 66739d7c1350..c82509caa42e 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -131,15 +131,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (__this_cpu_read(context_tracking.state) == state) {
@@ -150,6 +151,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -157,6 +159,7 @@ void context_tracking_exit(enum ctx_state state)
__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
}
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f6236b66788f..ce3bcf29a0f6 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
#include <linux/swap.h>

#include <asm/irq_regs.h>
+#include <asm/unistd.h>

#include "tick-internal.h"

@@ -446,6 +447,43 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

+static void kill_cpu_isolated_strict_task(void)
+{
+ dump_stack();
+ current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void tick_nohz_cpu_isolated_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void tick_nohz_cpu_isolated_exception(void)
+{
+ pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_cpu_isolated_strict_task();
+}
+
#endif

/*
--
2.1.2

2015-06-03 15:30:22

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 3/5] nohz: cpu_isolated strict mode configurable signal

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/time/tick-sched.c | 15 +++++++++++----
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
# define PR_CPU_ISOLATED_STRICT (1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ce3bcf29a0f6..f09c003da22f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -447,11 +447,18 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

-static void kill_cpu_isolated_strict_task(void)
+static void kill_cpu_isolated_strict_task(int is_syscall)
{
+ siginfo_t info = {};
+ int sig;
+
dump_stack();
current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
- send_sig(SIGKILL, current, 1);
+
+ sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+ info.si_signo = sig;
+ info.si_code = is_syscall;
+ send_sig_info(sig, &info, current);
}

/*
@@ -470,7 +477,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall)

pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
current->comm, current->pid, syscall);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(1);
}

/*
@@ -481,7 +488,7 @@ void tick_nohz_cpu_isolated_exception(void)
{
pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
current->comm, current->pid);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(0);
}

#endif
--
2.1.2

2015-06-03 15:31:29

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 4/5] nohz: add cpu_isolated_debug boot flag

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_CPU_ISOLATED_ENABLE mode. Such processes should
get no interrupts from the kernel; if they do, and this boot flag
is specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a cpu_isolated core
has unexpectedly entered the kernel. But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 6 ++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/tick.h | 2 ++
kernel/irq_work.c | 4 +++-
kernel/sched/core.c | 18 ++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 6 ++++++
8 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f6befa9855c1..2b4c89225d25 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -743,6 +743,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
/proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.txt.

+ cpu_isolated_debug [KNL]
+ In kernels built with CONFIG_NO_HZ_FULL and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_CPU_ISOLATED_ENABLE.
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..f336880e1b01 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/tick.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ tick_nohz_cpu_isolated_debug(cpu);
+ }
}

/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b7ffb10337ba..0b0d76106b8c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -149,6 +149,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
extern void tick_nohz_cpu_isolated_syscall(int nr);
extern void tick_nohz_cpu_isolated_exception(void);
+extern void tick_nohz_cpu_isolated_debug(int cpu);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -161,6 +162,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
static inline void tick_nohz_cpu_isolated_exception(void) { }
+static inline void tick_nohz_cpu_isolated_debug(int cpu) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..7f35c90346de 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..7315e7272e94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,24 @@ bool sched_can_stop_tick(void)

return true;
}
+
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static int cpu_isolated_debug;
+static int __init cpu_isolated_debug_func(char *str)
+{
+ cpu_isolated_debug = true;
+ return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void tick_nohz_cpu_isolated_debug(int cpu)
+{
+ if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+ pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+ dump_stack();
+ }
+}
#endif /* CONFIG_NO_HZ_FULL */

void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..1a810ac2656e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_NO_HZ_FULL
+ /* If the task is being killed, don't complain about cpu_isolated. */
+ if (state & TASK_WAKEKILL)
+ t->cpu_isolated_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..6b7d8e2c8af4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/tick.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ tick_nohz_cpu_isolated_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..333872925ff6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
+#include <linux/context_tracking.h>
#include <linux/tick.h>
#include <linux/irq.h>

@@ -335,6 +336,11 @@ void irq_enter(void)
_local_bh_enable();
}

+ if (context_tracking_cpu_is_enabled() &&
+ context_tracking_in_user() &&
+ !in_interrupt())
+ tick_nohz_cpu_isolated_debug(smp_processor_id());
+
__irq_enter();
}

--
2.1.2

2015-06-03 15:30:40

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 5/5] nohz: cpu_isolated: allow tick to be fully disabled

While the current fallback to a 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.
In addition, due to the way such processes quiesce by waiting for
the timer tick to stop prior to returning to userspace, without this
commit it won't be possible to use the cpu_isolated mode at all.

Removing the 1-second cap was previously discussed (see link below)
and Thomas Gleixner observed that vruntime, load balancing data, load
accounting, and other things might be impacted. Frederic Weisbecker
similarly observed that allowing the tick to be indefinitely deferred just
meant that no one would ever fix the underlying bugs. However, it's at
least true that the mode proposed in this patch can only be enabled on an
isolcpus core by a process requesting cpu_isolated mode, which may limit
how important it is to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz fallback timer
is removed, this will create an environment where new code that relies
on that tick will get punished, and we won't forgive such assumptions
silently, so it may also be worth it from that perspective.

Finally, it's worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2008) and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc.). So these semantics are very
useful if we can convince ourselves that doing this is safe.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f09c003da22f..ec36ed00af9d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -733,7 +733,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
}

#ifdef CONFIG_NO_HZ_FULL
- if (!ts->inidle) {
+ if (!ts->inidle && !tick_nohz_is_cpu_isolated()) {
time_delta = min(time_delta,
scheduler_tick_max_deferment());
}
--
2.1.2

2015-07-13 19:58:16

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 0/5] support "cpu_isolated" mode for nohz_full

This posting of the series is basically a "ping" since there were
no comments to the v3 version. I have rebased it to 4.2-rc1, added
support for arm64 syscall tracking for "strict" mode, and retested it;
are there any remaining concerns? Thomas, I haven't heard from you
whether my removal of the cpu_idle calls sufficiently addresses your
concerns about that aspect. Are there other concerns with this patch
series at this point?

Original patch series cover letter follows:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode. A prctl()
option (PR_SET_CPU_ISOLATED) is added to control whether processes have
requested this stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.2-rc1) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v4:
rebased on kernel v4.2-rc1
added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
remove dependency on cpu_idle subsystem (Thomas Gleixner)
use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
use seconds for console messages instead of jiffies (Thomas Gleixner)
updated commit description for patch 5/5

v2:
rename "dataplane" to "cpu_isolated"
drop ksoftirqd suppression changes (believed no longer needed)
merge previous "QUIESCE" functionality into baseline functionality
explicitly track syscalls and exceptions for "STRICT" functionality
allow configuring a signal to be delivered for STRICT mode failures
move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555); also since if
we remove the 1Hz tick, cpu_isolated threads will never re-enter
userspace since a tick will always be pending.

Chris Metcalf (5):
nohz_full: add support for "cpu_isolated" mode
nohz: support PR_CPU_ISOLATED_STRICT mode
nohz: cpu_isolated strict mode configurable signal
nohz: add cpu_isolated_debug boot flag
nohz: cpu_isolated: allow tick to be fully disabled

Documentation/kernel-parameters.txt | 6 +++
arch/tile/kernel/process.c | 9 ++++
arch/tile/kernel/ptrace.c | 6 ++-
arch/tile/mm/homecache.c | 5 +-
arch/x86/kernel/ptrace.c | 2 +
include/linux/context_tracking.h | 11 ++--
include/linux/sched.h | 3 ++
include/linux/tick.h | 28 ++++++++++
include/uapi/linux/prctl.h | 8 +++
kernel/context_tracking.c | 12 +++--
kernel/irq_work.c | 4 +-
kernel/sched/core.c | 18 +++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 6 +++
kernel/sys.c | 8 +++
kernel/time/tick-sched.c | 104 +++++++++++++++++++++++++++++++++++-
17 files changed, 229 insertions(+), 10 deletions(-)

--
2.1.2

2015-07-13 19:58:26

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl(). When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only two actions taken. First, the task
calls lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core. Then, the code checks for
pending timer interrupts and quiesces until they are no longer pending.
As a result, system calls (and page faults, etc.) can be inordinately slow.
However, this quiescing guarantees that no unexpected interrupts will
occur, even if the application intentionally calls into the kernel.
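
As an aside for illustration (not part of the patch itself): a task
pinned to a nohz_full core could opt in with a minimal sketch like the
following, assuming updated kernel headers that export the new
PR_CPU_ISOLATED constants from this series:

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
            /* Request cpu_isolated semantics before the main loop. */
            if (prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE,
                      0, 0, 0) != 0)
                    perror("PR_SET_CPU_ISOLATED");

            /* PR_GET_CPU_ISOLATED returns the flags as its result. */
            printf("cpu_isolated flags: %d\n",
                   prctl(PR_GET_CPU_ISOLATED, 0, 0, 0, 0));
            return 0;
    }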

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/process.c | 9 ++++++++
include/linux/sched.h | 3 +++
include/linux/tick.h | 10 ++++++++
include/uapi/linux/prctl.h | 5 ++++
kernel/context_tracking.c | 3 +++
kernel/sys.c | 8 +++++++
kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 95 insertions(+)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..3625e839ad62 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
_cpu_idle();
}

+#ifdef CONFIG_NO_HZ_FULL
+void tick_nohz_cpu_isolated_wait(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ _cpu_idle();
+ set_current_state(TASK_RUNNING);
+}
+#endif
+
/*
* Release a thread_info structure
*/
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f1591615..f350b0c20bbc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1778,6 +1778,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_NO_HZ_FULL
+ unsigned int cpu_isolated_flags;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 3741ba1a652c..cb5569181359 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
#include <linux/context_tracking_state.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
+#include <linux/prctl.h>

#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void __init tick_init(void);
@@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
cpumask_or(mask, mask, tick_nohz_full_mask);
}

+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick(void) { }
static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED 47
+#define PR_GET_CPU_ISOLATED 48
+# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..f9de3ee12723 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/tick.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (tick_nohz_is_cpu_isolated())
+ tick_nohz_cpu_isolated_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..36eb9a839f1f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_NO_HZ_FULL
+ case PR_SET_CPU_ISOLATED:
+ me->cpu_isolated_flags = arg2;
+ break;
+ case PR_GET_CPU_ISOLATED:
+ error = me->cpu_isolated_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429e98c6..4cf093c012d1 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/swap.h>

#include <asm/irq_regs.h>

@@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
cpumask_pr_args(tick_nohz_full_mask));
}
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak tick_nohz_cpu_isolated_wait(void)
+{
+ cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+ tick_nohz_cpu_isolated_wait();
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ dump_stack();
+ }
+}
+
#endif

/*
--
2.1.2

2015-07-13 19:59:43

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves. In particular, if it
enters the kernel via system call, page fault, or any of a number of other
synchronous traps, it may be unexpectedly exposed to long latencies.
Add a simple flag that puts the process into a state where any such
kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit to
current->cpu_isolated_flags that is set when prctl() sets the flags.
We check the bit on syscall entry as well as on any exception_enter().
The prctl() syscall is ignored to allow clearing the bit again later,
and exit/exit_group are ignored to allow exiting the task without
a pointless signal killing you as you try to do so.

This change adds the syscall-detection hooks only for x86, arm64,
and tile.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.
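
As a hedged usage sketch (again, not part of the patch), a task would
typically finish its setup syscalls first and only then arm strict mode:

    /* After this point, any syscall, page fault, or other synchronous
     * trap (other than prctl() and exit/exit_group) is fatal. */
    prctl(PR_SET_CPU_ISOLATED,
          PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT, 0, 0, 0);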

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/arm64/kernel/ptrace.c | 4 ++++
arch/tile/kernel/ptrace.c | 6 +++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/tick.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++
8 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..7315b1579cbd 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,

asmlinkage int syscall_trace_enter(struct pt_regs *regs)
{
+ /* Ensure we report cpu_isolated violations in all circumstances. */
+ if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(regs->syscallno);
+
/* Do the secure computing check first; failures should be fast. */
if (secure_computing() == -1)
return -1;
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..d4e43a13bab1 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(
+ regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..860f346977e2 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..8b994e2a0330 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/tick.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);

@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/tick.h b/include/linux/tick.h
index cb5569181359..f79f6945f762 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -157,6 +157,8 @@ extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
+extern void tick_nohz_cpu_isolated_syscall(int nr);
+extern void tick_nohz_cpu_isolated_exception(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -168,6 +170,8 @@ static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
+static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
+static inline void tick_nohz_cpu_isolated_exception(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
@@ -200,4 +204,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk)
__tick_nohz_task_switch(tsk);
}

+static inline bool tick_nohz_cpu_isolated_strict(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags &
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_CPU_ISOLATED 47
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+# define PR_CPU_ISOLATED_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index f9de3ee12723..fd051ea290ee 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
context_tracking_recursion_exit();
out_irq_restore:
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4cf093c012d1..9f495c7c7dc2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
#include <linux/swap.h>

#include <asm/irq_regs.h>
+#include <asm/unistd.h>

#include "tick-internal.h"

@@ -446,6 +447,43 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

+static void kill_cpu_isolated_strict_task(void)
+{
+ dump_stack();
+ current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void tick_nohz_cpu_isolated_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void tick_nohz_cpu_isolated_exception(void)
+{
+ pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_cpu_isolated_strict_task();
+}
+
#endif

/*
--
2.1.2

2015-07-13 19:58:30

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 3/5] nohz: cpu_isolated strict mode configurable signal

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
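
For example (an illustrative sketch using the macros added here), a task
that prefers a catchable SIGUSR1 over the default SIGKILL could request:

    prctl(PR_SET_CPU_ISOLATED,
          PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT |
          PR_CPU_ISOLATED_SET_SIG(SIGUSR1), 0, 0, 0);

The handler can then distinguish the cause via siginfo: si_code is 1 for
a syscall violation and 0 for an exception.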

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/time/tick-sched.c | 15 +++++++++++----
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
# define PR_CPU_ISOLATED_STRICT (1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f495c7c7dc2..c5eca9c99fad 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -447,11 +447,18 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

-static void kill_cpu_isolated_strict_task(void)
+static void kill_cpu_isolated_strict_task(int is_syscall)
{
+ siginfo_t info = {};
+ int sig;
+
dump_stack();
current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
- send_sig(SIGKILL, current, 1);
+
+ sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+ info.si_signo = sig;
+ info.si_code = is_syscall;
+ send_sig_info(sig, &info, current);
}

/*
@@ -470,7 +477,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall)

pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
current->comm, current->pid, syscall);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(1);
}

/*
@@ -481,7 +488,7 @@ void tick_nohz_cpu_isolated_exception(void)
{
pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
current->comm, current->pid);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(0);
}

#endif
--
2.1.2

2015-07-13 19:58:35

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 4/5] nohz: add cpu_isolated_debug boot flag

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_CPU_ISOLATED_ENABLE mode. Such processes should
get no interrupts from the kernel, and if they do, when this boot
flag is specified a kernel stack dump on the console is generated.

It's possible to use ftrace to simply detect whether a cpu_isolated core
has unexpectedly entered the kernel. But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
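
For instance, a box dedicating cpus 2-3 to isolated tasks might boot
with something like the following (cpu numbers are illustrative):

    nohz_full=2-3 isolcpus=2-3 cpu_isolated_debug

Any interrupt then aimed at a PR_CPU_ISOLATED_ENABLE task on those cpus
logs "Interrupt detected for cpu_isolated cpu N" plus a backtrace from
dump_stack() on the console.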

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 6 ++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/tick.h | 2 ++
kernel/irq_work.c | 4 +++-
kernel/sched/core.c | 18 ++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 6 ++++++
8 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f0459cd7b..76e8e2ff4a0a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -749,6 +749,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
/proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.txt.

+ cpu_isolated_debug [KNL]
+ In kernels built with CONFIG_NO_HZ_FULL and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_CPU_ISOLATED_ENABLE.
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..f336880e1b01 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/tick.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ tick_nohz_cpu_isolated_debug(cpu);
+ }
}

/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f79f6945f762..ed65551e2315 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -159,6 +159,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
extern void tick_nohz_cpu_isolated_syscall(int nr);
extern void tick_nohz_cpu_isolated_exception(void);
+extern void tick_nohz_cpu_isolated_debug(int cpu);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -172,6 +173,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
static inline void tick_nohz_cpu_isolated_exception(void) { }
+static inline void tick_nohz_cpu_isolated_debug(int cpu) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..7f35c90346de 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10081..c8388f9206b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -743,6 +743,24 @@ bool sched_can_stop_tick(void)

return true;
}
+
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static int cpu_isolated_debug;
+static int __init cpu_isolated_debug_func(char *str)
+{
+ cpu_isolated_debug = true;
+ return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void tick_nohz_cpu_isolated_debug(int cpu)
+{
+ if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+ pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+ dump_stack();
+ }
+}
#endif /* CONFIG_NO_HZ_FULL */

void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index 836df8dac6cc..90ee460c2586 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_NO_HZ_FULL
+ /* If the task is being killed, don't complain about cpu_isolated. */
+ if (state & TASK_WAKEKILL)
+ t->cpu_isolated_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..6b7d8e2c8af4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/tick.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ tick_nohz_cpu_isolated_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..333872925ff6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
+#include <linux/context_tracking.h>
#include <linux/tick.h>
#include <linux/irq.h>

@@ -335,6 +336,11 @@ void irq_enter(void)
_local_bh_enable();
}

+ if (context_tracking_cpu_is_enabled() &&
+ context_tracking_in_user() &&
+ !in_interrupt())
+ tick_nohz_cpu_isolated_debug(smp_processor_id());
+
__irq_enter();
}

--
2.1.2

2015-07-13 19:58:43

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 5/5] nohz: cpu_isolated: allow tick to be fully disabled

While the current fallback to a 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.
In addition, due to the way such processes quiesce by waiting for
the timer tick to stop prior to returning to userspace, without this
commit it won't be possible to use the cpu_isolated mode at all.

Removing the 1-second cap was previously discussed (see link below)
and Thomas Gleixner observed that vruntime, load balancing data, load
accounting, and other things might be impacted. Frederic Weisbecker
similarly observed that allowing the tick to be indefinitely deferred just
meant that no one would ever fix the underlying bugs. However, it's at
least true that the mode proposed in this patch can only be enabled on an
isolcpus core by a process requesting cpu_isolated mode, which may limit
how important it is to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz fallback timer
is removed, this will create an environment where new code that relies
on that tick will get punished, and we won't forgive such assumptions
silently, so it may also be worth it from that perspective.

Finally, it's worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2008) and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc.). So these semantics are very
useful if we can convince ourselves that doing this is safe.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c5eca9c99fad..8187b4b4c91c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -754,7 +754,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,

#ifdef CONFIG_NO_HZ_FULL
/* Limit the tick delta to the maximum scheduler deferment */
- if (!ts->inidle)
+ if (!ts->inidle && !tick_nohz_is_cpu_isolated())
delta = min(delta, scheduler_tick_max_deferment());
#endif

--
2.1.2

2015-07-13 20:41:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]> wrote:
> The existing nohz_full mode makes tradeoffs to minimize userspace
> interruptions while still attempting to avoid overheads in the
> kernel entry/exit path, to provide 100% kernel semantics, etc.
>
> However, some applications require a stronger commitment from the
> kernel to avoid interruptions, in particular userspace device
> driver style applications, such as high-speed networking code.
>
> This change introduces a framework to allow applications to elect
> to have the stronger semantics as needed, specifying
> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> Subsequent commits will add additional flags and additional
> semantics.

I thought the general consensus was that this should be the default
behavior and that any associated bugs should be fixed.

--Andy

2015-07-13 21:01:41

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]> wrote:
>> The existing nohz_full mode makes tradeoffs to minimize userspace
>> interruptions while still attempting to avoid overheads in the
>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>
>> However, some applications require a stronger commitment from the
>> kernel to avoid interruptions, in particular userspace device
>> driver style applications, such as high-speed networking code.
>>
>> This change introduces a framework to allow applications to elect
>> to have the stronger semantics as needed, specifying
>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>> Subsequent commits will add additional flags and additional
>> semantics.
> I thought the general consensus was that this should be the default
> behavior and that any associated bugs should be fixed.

I think it comes down to dividing the set of use cases in two:

- "Regular" nohz_full, as used to improve performance and limit
interruptions, possibly for power benefits, etc. But, stray
interrupts are not particularly bad, and you don't want to take
extreme measures to avoid them.

- What I'm calling "cpu_isolated" mode where when you return to
userspace, you expect that by God, the kernel doesn't interrupt you
again, and if it does, it's a flat-out bug.

There are a few things that cpu_isolated mode currently does to
accomplish its goals that are pretty heavy-weight:

Processes are held in kernel space until ticks are quiesced; this is
not necessarily what every nohz_full task wants. If a task makes a
kernel call, there may well be arbitrary timer fallout, and having a
way to select whether or not you are willing to take a timer tick after
return to userspace is pretty important.

Likewise, there are things that you may want to do on return to
userspace that are designed to prevent further interruptions in
cpu_isolated mode, even at a possible future performance cost if and
when you return to the kernel, such as flushing the per-cpu free page
list so that you won't be interrupted by an IPI to flush it later.

If you're arguing that the cpu_isolated semantic is really the only
one that makes sense for nohz_full, my sense is that it might be
surprising to many of the folks who do nohz_full work. But, I'm happy
to be wrong on this point, and maybe all the nohz_full community is
interested in making the same tradeoffs for nohz_full generally that
I've proposed in this patch series just for cpu_isolated?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-13 21:45:43

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <[email protected]> wrote:
> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
>>
>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]>
>> wrote:
>>>
>>> The existing nohz_full mode makes tradeoffs to minimize userspace
>>> interruptions while still attempting to avoid overheads in the
>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>>
>>> However, some applications require a stronger commitment from the
>>> kernel to avoid interruptions, in particular userspace device
>>> driver style applications, such as high-speed networking code.
>>>
>>> This change introduces a framework to allow applications to elect
>>> to have the stronger semantics as needed, specifying
>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>>> Subsequent commits will add additional flags and additional
>>> semantics.
>>
>> I thought the general consensus was that this should be the default
>> behavior and that any associated bugs should be fixed.
>
>
> I think it comes down to dividing the set of use cases in two:
>
> - "Regular" nohz_full, as used to improve performance and limit
> interruptions, possibly for power benefits, etc. But, stray
> interrupts are not particularly bad, and you don't want to take
> extreme measures to avoid them.
>
> - What I'm calling "cpu_isolated" mode where when you return to
> userspace, you expect that by God, the kernel doesn't interrupt you
> again, and if it does, it's a flat-out bug.
>
> There are a few things that cpu_isolated mode currently does to
> accomplish its goals that are pretty heavy-weight:
>
> Processes are held in kernel space until ticks are quiesced; this is
> not necessarily what every nohz_full task wants. If a task makes a
> kernel call, there may well be arbitrary timer fallout, and having a
> way to select whether or not you are willing to take a timer tick after
> return to userspace is pretty important.

Then shouldn't deferred work be done immediately in nohz_full mode
regardless? What is this delayed work that's being done?

>
> Likewise, there are things that you may want to do on return to
> userspace that are designed to prevent further interruptions in
> cpu_isolated mode, even at a possible future performance cost if and
> when you return to the kernel, such as flushing the per-cpu free page
> list so that you won't be interrupted by an IPI to flush it later.
>

Why not just kick the per-cpu free page over to whatever cpu is
monitoring your RCU state, etc? That should be very quick.

> If you're arguing that the cpu_isolated semantic is really the only
> one that makes sense for nohz_full, my sense is that it might be
> surprising to many of the folks who do nohz_full work. But, I'm happy
> to be wrong on this point, and maybe all the nohz_full community is
> interested in making the same tradeoffs for nohz_full generally that
> I've proposed in this patch series just for cpu_isolated?

nohz_full is currently dog slow for no particularly good reasons. I
suspect that the interrupts you're seeing are also there for no
particularly good reasons as well.

Let's fix them instead of adding new ABIs to work around them.

--Andy

2015-07-13 21:47:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]> wrote:
> With cpu_isolated mode, the task is in principle guaranteed not to be
> interrupted by the kernel, but only if it behaves. In particular, if it
> enters the kernel via system call, page fault, or any of a number of other
> synchronous traps, it may be unexpectedly exposed to long latencies.
> Add a simple flag that puts the process into a state where any such
> kernel entry is fatal.
>

To me, this seems like the wrong design. If nothing else, it seems
too much like an abusable anti-debugging mechanism. I can imagine
some per-task flag "I think I shouldn't be interrupted now" and a
tracepoint that fires if the task is interrupted with that flag set.
But the strong cpu isolation stuff requires systemwide configuration,
and I think that monitoring that it works should work similarly.

More comments below.

> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> arch/arm64/kernel/ptrace.c | 4 ++++
> arch/tile/kernel/ptrace.c | 6 +++++-
> arch/x86/kernel/ptrace.c | 2 ++
> include/linux/context_tracking.h | 11 ++++++++---
> include/linux/tick.h | 16 ++++++++++++++++
> include/uapi/linux/prctl.h | 1 +
> kernel/context_tracking.c | 9 ++++++---
> kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++
> 8 files changed, 80 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index d882b833dbdb..7315b1579cbd 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>
> asmlinkage int syscall_trace_enter(struct pt_regs *regs)
> {
> + /* Ensure we report cpu_isolated violations in all circumstances. */
> + if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
> + tick_nohz_cpu_isolated_syscall(regs->syscallno);

IMO this is pointless. If a user wants a syscall to kill them, use
seccomp. The kernel isn't at fault if the user does a syscall when it
didn't want to enter the kernel.


> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
> return 0;
>
> prev_ctx = this_cpu_read(context_tracking.state);
> - if (prev_ctx != CONTEXT_KERNEL)
> - context_tracking_exit(prev_ctx);
> + if (prev_ctx != CONTEXT_KERNEL) {
> + if (context_tracking_exit(prev_ctx)) {
> + if (tick_nohz_cpu_isolated_strict())
> + tick_nohz_cpu_isolated_exception();
> + }
> + }

NACK. I'm cautiously optimistic that an x86 kernel 4.3 or newer will
simply never call exception_enter. It certainly won't call it
frequently unless something goes wrong with the patches that are
already in -tip.

> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
> * This call supports re-entrancy. This way it can be called from any exception
> * handler without needing to know if we came from userspace or not.
> */
> -void context_tracking_exit(enum ctx_state state)
> +bool context_tracking_exit(enum ctx_state state)
> {
> unsigned long flags;
> + bool from_user = false;
>

IMO the internal context tracking API (e.g. context_tracking_exit) are
mostly of the form "hey context tracking: I don't really know what
you're doing or what I'm doing, but let me call you and make both of
us feel better." You're making it somewhat worse: now it's all of the
above plus "I don't even know whether I just entered the kernel --
maybe you have a better idea".

Starting with 4.3, x86 kernels will know *exactly* when they enter the
kernel. All of this context tracking what-was-my-previous-state stuff
will remain until someone kills it, but when it goes away we'll get a
nice performance boost.

So, no, let's implement this for real if we're going to implement it.

--Andy

2015-07-21 19:11:18

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

Sorry for the delay in responding; some other priorities came up internally.

On 07/13/2015 05:45 PM, Andy Lutomirski wrote:
> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <[email protected]> wrote:
>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]>
>>> wrote:
>>>> The existing nohz_full mode makes tradeoffs to minimize userspace
>>>> interruptions while still attempting to avoid overheads in the
>>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>>>
>>>> However, some applications require a stronger commitment from the
>>>> kernel to avoid interruptions, in particular userspace device
>>>> driver style applications, such as high-speed networking code.
>>>>
>>>> This change introduces a framework to allow applications to elect
>>>> to have the stronger semantics as needed, specifying
>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>>>> Subsequent commits will add additional flags and additional
>>>> semantics.
>>> I thought the general consensus was that this should be the default
>>> behavior and that any associated bugs should be fixed.
>>
>> I think it comes down to dividing the set of use cases in two:
>>
>> - "Regular" nohz_full, as used to improve performance and limit
>> interruptions, possibly for power benefits, etc. But, stray
>> interrupts are not particularly bad, and you don't want to take
>> extreme measures to avoid them.
>>
>> - What I'm calling "cpu_isolated" mode where when you return to
>> userspace, you expect that by God, the kernel doesn't interrupt you
>> again, and if it does, it's a flat-out bug.
>>
>> There are a few things that cpu_isolated mode currently does to
>> accomplish its goals that are pretty heavy-weight:
>>
>> Processes are held in kernel space until ticks are quiesced; this is
>> not necessarily what every nohz_full task wants. If a task makes a
>> kernel call, there may well be arbitrary timer fallout, and having a
>> way to select whether or not you are willing to take a timer tick after
>> return to userspace is pretty important.
> Then shouldn't deferred work be done immediately in nohz_full mode
> regardless? What is this delayed work that's being done?

I'm thinking of things like needing to wait for an RCU quiesce
period to complete.

In the current version, there's also the vmstat_update() that
may schedule delayed work and interrupt the core again
shortly before realizing that there are no more counter updates
happening, at which point it quiesces. Currently we handle
this in cpu_isolated mode simply by spinning and waiting for
the timer interrupts to complete.

>> Likewise, there are things that you may want to do on return to
>> userspace that are designed to prevent further interruptions in
>> cpu_isolated mode, even at a possible future performance cost if and
>> when you return to the kernel, such as flushing the per-cpu free page
>> list so that you won't be interrupted by an IPI to flush it later.
> Why not just kick the per-cpu free page over to whatever cpu is
> monitoring your RCU state, etc? That should be very quick.

So just for the sake of precision, the thing I'm talking about
is the lru_add_drain() call on kernel exit. Are you proposing
that we call that for every nohz_full core on kernel exit?
I'm not opposed to this, but I don't know if other nohz
developers feel like this is the right tradeoff.

Similarly, addressing the vmstat_update() issue above, in
cpu_isolated mode we might want to have a follow-on
patch that forces the vmstat system into quiesced state
on return to userspace. We would need to do this
unconditionally on all nohz_full cores if we tried to combine
the current nohz_full with my proposed cpu_isolated
functionality. Again, I'm not necessarily opposed, but
I suspect other nohz developers might not want this.

(I didn't want to introduce such a patch as part of this
series since it pulls in even more interested parties, and
it gets harder and harder to get to consensus.)

>> If you're arguing that the cpu_isolated semantic is really the only
>> one that makes sense for nohz_full, my sense is that it might be
>> surprising to many of the folks who do nohz_full work. But, I'm happy
>> to be wrong on this point, and maybe all the nohz_full community is
>> interested in making the same tradeoffs for nohz_full generally that
>> I've proposed in this patch series just for cpu_isolated?
> nohz_full is currently dog slow for no particularly good reasons. I
> suspect that the interrupts you're seeing are also there for no
> particularly good reasons as well.
>
> Let's fix them instead of adding new ABIs to work around them.

Well, in principle if we accepted my proposed patch series
and then over time came to decide that it was reasonable
for nohz_full to have these complete cpu isolation
semantics, the one proposed ABI simply becomes a no-op.
So it's not as problematic an ABI as some.

My issue is this: I'm totally happy with submitting a revised
patch series that does all the stuff for pure nohz_full that
I'm currently proposing for cpu_isolated. But, is it what
the community wants? Should I propose it and see?

Frederic, do you have any insight here? Thanks!

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-21 19:26:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <[email protected]> wrote:
> Sorry for the delay in responding; some other priorities came up internally.
>
> On 07/13/2015 05:45 PM, Andy Lutomirski wrote:
>>
>> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <[email protected]>
>> wrote:
>>>
>>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
>>>>
>>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]>
>>>>
>>>> wrote:
>>>>>
>>>>> The existing nohz_full mode makes tradeoffs to minimize userspace
>>>>> interruptions while still attempting to avoid overheads in the
>>>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>>>>
>>>>> However, some applications require a stronger commitment from the
>>>>> kernel to avoid interruptions, in particular userspace device
>>>>> driver style applications, such as high-speed networking code.
>>>>>
>>>>> This change introduces a framework to allow applications to elect
>>>>> to have the stronger semantics as needed, specifying
>>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>>>>> Subsequent commits will add additional flags and additional
>>>>> semantics.
>>>>
>>>> I thought the general consensus was that this should be the default
>>>> behavior and that any associated bugs should be fixed.
>>>
>>>
>>> I think it comes down to dividing the set of use cases in two:
>>>
>>> - "Regular" nohz_full, as used to improve performance and limit
>>> interruptions, possibly for power benefits, etc. But, stray
>>> interrupts are not particularly bad, and you don't want to take
>>> extreme measures to avoid them.
>>>
>>> - What I'm calling "cpu_isolated" mode where when you return to
>>> userspace, you expect that by God, the kernel doesn't interrupt you
>>> again, and if it does, it's a flat-out bug.
>>>
>>> There are a few things that cpu_isolated mode currently does to
>>> accomplish its goals that are pretty heavy-weight:
>>>
>>> Processes are held in kernel space until ticks are quiesced; this is
>>> not necessarily what every nohz_full task wants. If a task makes a
>>> kernel call, there may well be arbitrary timer fallout, and having a
>>> way to select whether or not you are willing to take a timer tick after
>>> return to userspace is pretty important.
>>
>> Then shouldn't deferred work be done immediately in nohz_full mode
>> regardless? What is this delayed work that's being done?
>
>
> I'm thinking of things like needing to wait for an RCU quiesce
> period to complete.

rcu_nocbs does this, right?

>
> In the current version, there's also the vmstat_update() that
> may schedule delayed work and interrupt the core again
> shortly before realizing that there are no more counter updates
> happening, at which point it quiesces. Currently we handle
> this in cpu_isolated mode simply by spinning and waiting for
> the timer interrupts to complete.

Perhaps we should fix that?

>
>>> Likewise, there are things that you may want to do on return to
>>> userspace that are designed to prevent further interruptions in
>>> cpu_isolated mode, even at a possible future performance cost if and
>>> when you return to the kernel, such as flushing the per-cpu free page
>>> list so that you won't be interrupted by an IPI to flush it later.
>>
>> Why not just kick the per-cpu free page over to whatever cpu is
>> monitoring your RCU state, etc? That should be very quick.
>
>
> So just for the sake of precision, the thing I'm talking about
> is the lru_add_drain() call on kernel exit. Are you proposing
> that we call that for every nohz_full core on kernel exit?
> I'm not opposed to this, but I don't know if other nohz
> developers feel like this is the right tradeoff.

I'm proposing either that we do that or that we arrange for other cpus
to be able to steal our LRU list while we're in RCU user/idle.

>> Let's fix them instead of adding new ABIs to work around them.
>
>
> Well, in principle if we accepted my proposed patch series
> and then over time came to decide that it was reasonable
> for nohz_full to have these complete cpu isolation
> semantics, the one proposed ABI simply becomes a no-op.
> So it's not as problematic an ABI as some.

What if we made it a debugfs thing instead of a prctl? Have a mode
where the system tries really hard to quiesce itself even at the cost
of performance.

>
> My issue is this: I'm totally happy with submitting a revised
> patch series that does all the stuff for pure nohz_full that
> I'm currently proposing for cpu_isolated. But, is it what
> the community wants? Should I propose it and see?
>
> Frederic, do you have any insight here? Thanks!
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>



--
Andy Lutomirski
AMA Capital Management, LLC

2015-07-21 19:34:28

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]> wrote:
>> With cpu_isolated mode, the task is in principle guaranteed not to be
>> interrupted by the kernel, but only if it behaves. In particular, if it
>> enters the kernel via system call, page fault, or any of a number of other
>> synchronous traps, it may be unexpectedly exposed to long latencies.
>> Add a simple flag that puts the process into a state where any such
>> kernel entry is fatal.
>>
> To me, this seems like the wrong design. If nothing else, it seems
> too much like an abusable anti-debugging mechanism. I can imagine
> some per-task flag "I think I shouldn't be interrupted now" and a
> tracepoint that fires if the task is interrupted with that flag set.
> But the strong cpu isolation stuff requires systemwide configuration,
> and I think that monitoring that it works should work similarly.

First, you mention a per-task flag, but not specifically whether the
proposed prctl() mechanism is a reasonable way to set that flag.
Just wanted to clarify that this wasn't an issue in and of itself for you.

Second, you suggest a tracepoint. I'm OK with creating a tracepoint
dedicated to cpu_isolated strict failures and making that the only
way this mechanism works. But, earlier community feedback seemed to
suggest that the signal mechanism was OK; one piece of feedback
just requested being able to set which signal was delivered. Do you
think the signal idea is a bad one? Are you proposing potentially
having a signal and/or a tracepoint?

Last, you mention systemwide configuration for monitoring. Can you
expand on what you mean by that? We already support the monitoring
only on the nohz_full cores, so to that extent it's already systemwide.
And the per-task flag has to be set by the running process when it's
ready for this state, so that can't really be systemwide configuration.
I don't understand your suggestion on this point.

>> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
>> index d882b833dbdb..7315b1579cbd 100644
>> --- a/arch/arm64/kernel/ptrace.c
>> +++ b/arch/arm64/kernel/ptrace.c
>> @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>>
>> asmlinkage int syscall_trace_enter(struct pt_regs *regs)
>> {
>> + /* Ensure we report cpu_isolated violations in all circumstances. */
>> + if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
>> + tick_nohz_cpu_isolated_syscall(regs->syscallno);
> IMO this is pointless. If a user wants a syscall to kill them, use
> seccomp. The kernel isn't at fault if the user does a syscall when it
> didn't want to enter the kernel.

Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT
was to what I wanted here. One concern is that there doesn't seem
to be a way to "escape" from seccomp strict mode, i.e. you can't
call seccomp() again to turn it off - which makes sense for seccomp
since it's a security issue, but not so much sense with cpu_isolated.

So, do you think there's a good role for the seccomp() API to play
in achieving this goal? It's certainly not a question of "the kernel at
fault" but rather "asking the kernel to help catch user mistakes"
(typically third-party libraries in our customers' experience). You
could imagine a SECCOMP_SET_MODE_ISOLATED or something.

Alternatively, we could stick with the API proposed in my patch
series, or something similar, and just try to piggy-back on the seccomp
internals to make it happen. It would require Kconfig to ensure
that SECCOMP was enabled though, which obviously isn't currently
required to do cpu isolation.
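
For illustration, here is a minimal sketch of the two models being
compared (error handling omitted; the seccomp call is the existing
API, while the PR_SET_CPU_ISOLATED calls are the ones proposed in
this series):

    #include <sys/prctl.h>
    #include <linux/seccomp.h>

    /* Existing seccomp strict mode: one-way. After this call, any
     * syscall other than read/write/_exit/sigreturn kills the task,
     * and there is no way to turn it back off. */
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

    /* Proposed cpu_isolated mode: a toggle. A task can drop back out
     * of isolation to run setup or slow-path code. */
    prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE);
    /* ... fast-path code ... */
    prctl(PR_SET_CPU_ISOLATED, 0);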

>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>> return 0;
>>
>> prev_ctx = this_cpu_read(context_tracking.state);
>> - if (prev_ctx != CONTEXT_KERNEL)
>> - context_tracking_exit(prev_ctx);
>> + if (prev_ctx != CONTEXT_KERNEL) {
>> + if (context_tracking_exit(prev_ctx)) {
>> + if (tick_nohz_cpu_isolated_strict())
>> + tick_nohz_cpu_isolated_exception();
>> + }
>> + }
> NACK. I'm cautiously optimistic that an x86 kernel 4.3 or newer will
> simply never call exception_enter. It certainly won't call it
> frequently unless something goes wrong with the patches that are
> already in -tip.

This is intended to catch user exceptions like page faults, general
protection violations, or (on platforms where this would happen)
unaligned data traps.
The kernel still has a role to play here and cpu_isolated mode
needs to let the user know they have accidentally entered
the kernel in this case.

>> --- a/kernel/context_tracking.c
>> +++ b/kernel/context_tracking.c
>> @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>> * This call supports re-entrancy. This way it can be called from any exception
>> * handler without needing to know if we came from userspace or not.
>> */
>> -void context_tracking_exit(enum ctx_state state)
>> +bool context_tracking_exit(enum ctx_state state)
>> {
>> unsigned long flags;
>> + bool from_user = false;
>>
> IMO the internal context tracking API (e.g. context_tracking_exit) are
> mostly of the form "hey context tracking: I don't really know what
> you're doing or what I'm doing, but let me call you and make both of
> us feel better." You're making it somewhat worse: now it's all of the
> above plus "I don't even know whether I just entered the kernel --
> maybe you have a better idea".
>
> Starting with 4.3, x86 kernels will know *exactly* when they enter the
> kernel. All of this context tracking what-was-my-previous-state stuff
> will remain until someone kills it, but when it goes away we'll get a
> nice performance boost.
>
> So, no, let's implement this for real if we're going to implement it.

I'm certainly OK with rebasing on top of 4.3 after the context
tracking stuff is better. That said, I think it makes sense to continue
to debate the intent of the patch series even if we pull this one
patch out and defer it until after 4.3, or having it end up pulled
into some other repo that includes the improvements and
is being pulled for 4.3.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-21 19:42:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

On Tue, Jul 21, 2015 at 12:34 PM, Chris Metcalf <[email protected]> wrote:
> On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
>>
>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]>
>> wrote:
>>>
>>> With cpu_isolated mode, the task is in principle guaranteed not to be
>>> interrupted by the kernel, but only if it behaves. In particular, if it
>>> enters the kernel via system call, page fault, or any of a number of
>>> other
>>> synchronous traps, it may be unexpectedly exposed to long latencies.
>>> Add a simple flag that puts the process into a state where any such
>>> kernel entry is fatal.
>>>
>> To me, this seems like the wrong design. If nothing else, it seems
>> too much like an abusable anti-debugging mechanism. I can imagine
>> some per-task flag "I think I shouldn't be interrupted now" and a
>> tracepoint that fires if the task is interrupted with that flag set.
>> But the strong cpu isolation stuff requires systemwide configuration,
>> and I think that monitoring that it works should work similarly.
>
>
> First, you mention a per-task flag, but not specifically whether the
> proposed prctl() mechanism is a reasonable way to set that flag.
> Just wanted to clarify that this wasn't an issue in and of itself for you.

I think I'm okay with a per-task flag for this and, if you add one,
then prctl() is presumably the way to go. Unless people think that
nohz should be 100% reliable always, in which case might as well make
the flag per-cpu.

>
> Second, you suggest a tracepoint. I'm OK with creating a tracepoint
> dedicated to cpu_isolated strict failures and making that the only
> way this mechanism works. But, earlier community feedback seemed to
> suggest that the signal mechanism was OK; one piece of feedback
> just requested being able to set which signal was delivered. Do you
> think the signal idea is a bad one? Are you proposing potentially
> having a signal and/or a tracepoint?

I prefer the tracepoint. It's friendlier to debuggers, and it's
really about diagnosing a kernel problem, not a userspace problem.
Also, I really doubt that people should deploy a signal thing in
production. What if an NMI fires and kills their realtime program?

>
> Last, you mention systemwide configuration for monitoring. Can you
> expand on what you mean by that? We already support the monitoring
> only on the nohz_full cores, so to that extent it's already systemwide.
> And the per-task flag has to be set by the running process when it's
> ready for this state, so that can't really be systemwide configuration.
> I don't understand your suggestion on this point.

I'm really thinking about systemwide configuration for isolation. I
think we'll always (at least in the nearish term) need the admin's
help to set up isolated CPUs. If the admin makes a whole CPU be
isolated, then monitoring just that CPU and monitoring it all the time
seems sensible. If we really do think that isolating a CPU should
require a syscall of some sort because it's too expensive otherwise,
then we can do it that way, too. And if full isolation requires some
user help (e.g. don't do certain things that break isolation), then
having a per-task monitoring flag seems reasonable.

We may always need the user's help to avoid IPIs. For example, if one
thread calls munmap, the other thread is going to get an IPI. There's
nothing we can do about that.

> I'm certainly OK with rebasing on top of 4.3 after the context
> tracking stuff is better. That said, I think it makes sense to continue
> to debate the intent of the patch series even if we pull this one
> patch out and defer it until after 4.3, or having it end up pulled
> into some other repo that includes the improvements and
> is being pulled for 4.3.

Sure, no problem.

--Andy

2015-07-21 20:36:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Tue, Jul 21, 2015 at 12:26:17PM -0700, Andy Lutomirski wrote:
> On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <[email protected]> wrote:
> > Sorry for the delay in responding; some other priorities came up internally.
> >
> > On 07/13/2015 05:45 PM, Andy Lutomirski wrote:
> >>
> >> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <[email protected]>
> >> wrote:
> >>>
> >>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
> >>>>
> >>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]>
> >>>>
> >>>> wrote:
> >>>>>
> >>>>> The existing nohz_full mode makes tradeoffs to minimize userspace
> >>>>> interruptions while still attempting to avoid overheads in the
> >>>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
> >>>>>
> >>>>> However, some applications require a stronger commitment from the
> >>>>> kernel to avoid interruptions, in particular userspace device
> >>>>> driver style applications, such as high-speed networking code.
> >>>>>
> >>>>> This change introduces a framework to allow applications to elect
> >>>>> to have the stronger semantics as needed, specifying
> >>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> >>>>> Subsequent commits will add additional flags and additional
> >>>>> semantics.
> >>>>
> >>>> I thought the general consensus was that this should be the default
> >>>> behavior and that any associated bugs should be fixed.
> >>>
> >>>
> >>> I think it comes down to dividing the set of use cases in two:
> >>>
> >>> - "Regular" nohz_full, as used to improve performance and limit
> >>> interruptions, possibly for power benefits, etc. But, stray
> >>> interrupts are not particularly bad, and you don't want to take
> >>> extreme measures to avoid them.
> >>>
> >>> - What I'm calling "cpu_isolated" mode where when you return to
> >>> userspace, you expect that by God, the kernel doesn't interrupt you
> >>> again, and if it does, it's a flat-out bug.
> >>>
> >>> There are a few things that cpu_isolated mode currently does to
> >>> accomplish its goals that are pretty heavy-weight:
> >>>
> >>> Processes are held in kernel space until ticks are quiesced; this is
> >>> not necessarily what every nohz_full task wants. If a task makes a
> >>> kernel call, there may well be arbitrary timer fallout, and having a
> >>> way to select whether or not you are willing to take a timer tick after
> >>> return to userspace is pretty important.
> >>
> >> Then shouldn't deferred work be done immediately in nohz_full mode
> >> regardless? What is this delayed work that's being done?
> >
> > I'm thinking of things like needing to wait for an RCU quiesce
> > period to complete.
>
> rcu_nocbs does this, right?

CONFIG_RCU_NOCB_CPUS offloads the RCU callbacks to a kthread, which
allows the nohz CPU to turn off its scheduling-clock tick more frequently.
Chris might have some other reason to wait for an RCU grace period, given
that waiting for an RCU grace period would not guarantee no callbacks.
Some more might have arrived in the meantime, and there can be some delay
between the end of the grace period and the invocation of the callbacks.

> > In the current version, there's also the vmstat_update() that
> > may schedule delayed work and interrupt the core again
> > shortly before realizing that there are no more counter updates
> > happening, at which point it quiesces. Currently we handle
> > this in cpu_isolated mode simply by spinning and waiting for
> > the timer interrupts to complete.
>
> Perhaps we should fix that?

Didn't Christoph Lameter fix this? Or is this an additional problem?

Thanx, Paul

> >>> Likewise, there are things that you may want to do on return to
> >>> userspace that are designed to prevent further interruptions in
> >>> cpu_isolated mode, even at a possible future performance cost if and
> >>> when you return to the kernel, such as flushing the per-cpu free page
> >>> list so that you won't be interrupted by an IPI to flush it later.
> >>
> >> Why not just kick the per-cpu free page over to whatever cpu is
> >> monitoring your RCU state, etc? That should be very quick.
> >
> >
> > So just for the sake of precision, the thing I'm talking about
> > is the lru_add_drain() call on kernel exit. Are you proposing
> > that we call that for every nohz_full core on kernel exit?
> > I'm not opposed to this, but I don't know if other nohz
> > developers feel like this is the right tradeoff.
>
> I'm proposing either that we do that or that we arrange for other cpus
> to be able to steal our LRU list while we're in RCU user/idle.
>
> >> Let's fix them instead of adding new ABIs to work around them.
> >
> >
> > Well, in principle if we accepted my proposed patch series
> > and then over time came to decide that it was reasonable
> > for nohz_full to have these complete cpu isolation
> > semantics, the one proposed ABI simply becomes a no-op.
> > So it's not as problematic an ABI as some.
>
> What if we made it a debugfs thing instead of a prctl? Have a mode
> where the system tries really hard to quiesce itself even at the cost
> of performance.
>
> >
> > My issue is this: I'm totally happy with submitting a revised
> > patch series that does all the stuff for pure nohz_full that
> > I'm currently proposing for cpu_isolated. But, is it what
> > the community wants? Should I propose it and see?
> >
> > Frederic, do you have any insight here? Thanks!
> >
> > --
> > Chris Metcalf, EZChip Semiconductor
> > http://www.ezchip.com
> >
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
>

Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Tue, 21 Jul 2015, Paul E. McKenney wrote:

> > > In the current version, there's also the vmstat_update() that
> > > may schedule delayed work and interrupt the core again
> > > shortly before realizing that there are no more counter updates
> > > happening, at which point it quiesces. Currently we handle
> > > this in cpu_isolated mode simply by spinning and waiting for
> > > the timer interrupts to complete.
> >
> > Perhaps we should fix that?
>
> Didn't Christoph Lameter fix this? Or is this an additional problem?

Well the vmstat update must realize first that there are no outstanding
updates before switching itself off. So typically there is one extra tick.
But we could add another function that will simply fold the differential
immediately and turn off the kworker task in the expectation that the
processor will stay quiet.

2015-07-22 19:28:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Wed, Jul 22, 2015 at 08:57:45AM -0500, Christoph Lameter wrote:
> On Tue, 21 Jul 2015, Paul E. McKenney wrote:
>
> > > > In the current version, there's also the vmstat_update() that
> > > > may schedule delayed work and interrupt the core again
> > > > shortly before realizing that there are no more counter updates
> > > > happening, at which point it quiesces. Currently we handle
> > > > this in cpu_isolated mode simply by spinning and waiting for
> > > > the timer interrupts to complete.
> > >
> > > Perhaps we should fix that?
> >
> > Didn't Christoph Lameter fix this? Or is this an additional problem?
>
> Well the vmstat update must realize first that there are no outstanding
> updates before switching itself off. So typically there is one extra tick.
> But we could add another function that will simply fold the differential
> immediately and turn off the kworker task in the expectation that the
> processor will stay quiet.

Got it, thank you!

Thanx, Paul

Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Wed, 22 Jul 2015, Paul E. McKenney wrote:

> > > Didn't Christoph Lameter fix this? Or is this an additional problem?
> >
> > Well the vmstat update must realize first that there are no outstanding
> > updates before switching itself off. So typically there is one extra tick.
> > But we could add another function that will simply fold the differential
> > immediately and turn off the kworker task in the expectation that the
> > processor will stay quiet.
>
> Got it, thank you!
>
> Thanx, Paul

Ok here is a function that quiets down the vmstat kworkers.


Subject: vmstat: provide a function to quiet down the diff processing

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_st
}

/*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+ do {
+ if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+ cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+ } while (refresh_cpu_vm_stats());
+}
+
+/*
* Check if the diffs for a certain cpu indicate that
* an update is needed.
*/
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
extern void dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
static inline void refresh_cpu_vm_stats(int cpu) { }
static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }

2015-07-24 13:27:11

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Mon, Jul 13, 2015 at 03:57:57PM -0400, Chris Metcalf wrote:
> The existing nohz_full mode makes tradeoffs to minimize userspace
> interruptions while still attempting to avoid overheads in the
> kernel entry/exit path, to provide 100% kernel semantics, etc.
>
> However, some applications require a stronger commitment from the
> kernel to avoid interruptions, in particular userspace device
> driver style applications, such as high-speed networking code.
>
> This change introduces a framework to allow applications to elect
> to have the stronger semantics as needed, specifying
> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> Subsequent commits will add additional flags and additional
> semantics.
>
> The "cpu_isolated" state is indicated by setting a new task struct
> field, cpu_isolated_flags, to the value passed by prctl(). When the
> _ENABLE bit is set for a task, and it is returning to userspace
> on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
> routine to take additional actions to help the task avoid being
> interrupted in the future.
>
> Initially, there are only two actions taken. First, the task
> calls lru_add_drain() to prevent being interrupted by a subsequent
> lru_add_drain_all() call on another core. Then, the code checks for
> pending timer interrupts and quiesces until they are no longer pending.
> As a result, sys calls (and page faults, etc.) can be inordinately slow.
> However, this quiescing guarantees that no unexpected interrupts will
> occur, even if the application intentionally calls into the kernel.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> arch/tile/kernel/process.c | 9 ++++++++
> include/linux/sched.h | 3 +++
> include/linux/tick.h | 10 ++++++++
> include/uapi/linux/prctl.h | 5 ++++
> kernel/context_tracking.c | 3 +++
> kernel/sys.c | 8 +++++++
> kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 95 insertions(+)
>
> diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
> index e036c0aa9792..3625e839ad62 100644
> --- a/arch/tile/kernel/process.c
> +++ b/arch/tile/kernel/process.c
> @@ -70,6 +70,15 @@ void arch_cpu_idle(void)
> _cpu_idle();
> }
>
> +#ifdef CONFIG_NO_HZ_FULL

I think this goes way beyond nohz itself. We don't only want the tick to shut down,
we also want the pending timers, workqueues, etc. to be quiesced...

It's time to create the CONFIG_ISOLATION_foo stuff.

> +void tick_nohz_cpu_isolated_wait(void)
> +{
> + set_current_state(TASK_INTERRUPTIBLE);
> + _cpu_idle();
> + set_current_state(TASK_RUNNING);
> +}
> +#endif
> +
> /*
> * Release a thread_info structure
> */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ae21f1591615..f350b0c20bbc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1778,6 +1778,9 @@ struct task_struct {
> unsigned long task_state_change;
> #endif
> int pagefault_disabled;
> +#ifdef CONFIG_NO_HZ_FULL
> + unsigned int cpu_isolated_flags;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 3741ba1a652c..cb5569181359 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -10,6 +10,7 @@
> #include <linux/context_tracking_state.h>
> #include <linux/cpumask.h>
> #include <linux/sched.h>
> +#include <linux/prctl.h>
>
> #ifdef CONFIG_GENERIC_CLOCKEVENTS
> extern void __init tick_init(void);
> @@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
> cpumask_or(mask, mask, tick_nohz_full_mask);
> }
>
> +static inline bool tick_nohz_is_cpu_isolated(void)
> +{
> + return tick_nohz_full_cpu(smp_processor_id()) &&
> + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
> +}
> +
> extern void __tick_nohz_full_check(void);
> extern void tick_nohz_full_kick(void);
> extern void tick_nohz_full_kick_cpu(int cpu);
> extern void tick_nohz_full_kick_all(void);
> extern void __tick_nohz_task_switch(struct task_struct *tsk);
> +extern void tick_nohz_cpu_isolated_enter(void);
> #else
> static inline bool tick_nohz_full_enabled(void) { return false; }
> static inline bool tick_nohz_full_cpu(int cpu) { return false; }
> @@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
> static inline void tick_nohz_full_kick(void) { }
> static inline void tick_nohz_full_kick_all(void) { }
> static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
> +static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
> +static inline void tick_nohz_cpu_isolated_enter(void) { }
> #endif
>
> static inline bool is_housekeeping_cpu(int cpu)
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 31891d9535e2..edb40b6b84db 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -190,4 +190,9 @@ struct prctl_mm_map {
> # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
> # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */
>
> +/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
> +#define PR_SET_CPU_ISOLATED 47
> +#define PR_GET_CPU_ISOLATED 48
> +# define PR_CPU_ISOLATED_ENABLE (1 << 0)
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 0a495ab35bc7..f9de3ee12723 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -20,6 +20,7 @@
> #include <linux/hardirq.h>
> #include <linux/export.h>
> #include <linux/kprobes.h>
> +#include <linux/tick.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/context_tracking.h>
> @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
> * on the tick.
> */
> if (state == CONTEXT_USER) {
> + if (tick_nohz_is_cpu_isolated())
> + tick_nohz_cpu_isolated_enter();
> trace_user_enter(0);
> vtime_user_enter(current);
> }
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 259fda25eb6b..36eb9a839f1f 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> case PR_GET_FP_MODE:
> error = GET_FP_MODE(me);
> break;
> +#ifdef CONFIG_NO_HZ_FULL
> + case PR_SET_CPU_ISOLATED:
> + me->cpu_isolated_flags = arg2;
> + break;
> + case PR_GET_CPU_ISOLATED:
> + error = me->cpu_isolated_flags;
> + break;
> +#endif
> default:
> error = -EINVAL;
> break;
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c792429e98c6..4cf093c012d1 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -24,6 +24,7 @@
> #include <linux/posix-timers.h>
> #include <linux/perf_event.h>
> #include <linux/context_tracking.h>
> +#include <linux/swap.h>
>
> #include <asm/irq_regs.h>
>
> @@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
> pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
> cpumask_pr_args(tick_nohz_full_mask));
> }
> +
> +/*
> + * Rather than continuously polling for the next_event in the
> + * tick_cpu_device, architectures can provide a method to save power
> + * by sleeping until an interrupt arrives.
> + */
> +void __weak tick_nohz_cpu_isolated_wait(void)
> +{
> + cpu_relax();
> +}
> +
> +/*
> + * We normally return immediately to userspace.
> + *
> + * In "cpu_isolated" mode we wait until no more interrupts are
> + * pending. Otherwise we nap with interrupts enabled and wait for the
> + * next interrupt to fire, then loop back and retry.
> + *
> + * Note that if you schedule two "cpu_isolated" processes on the same
> + * core, neither will ever leave the kernel, and one will have to be
> + * killed manually. Otherwise in situations where another process is
> + * in the runqueue on this cpu, this task will just wait for that
> + * other task to go idle before returning to user space.
> + */
> +void tick_nohz_cpu_isolated_enter(void)

Similarly, I'd rather see that in kernel/cpu_isolation.c and call it
cpu_isolation_enter().

> +{
> + struct clock_event_device *dev =
> + __this_cpu_read(tick_cpu_device.evtdev);
> + struct task_struct *task = current;
> + unsigned long start = jiffies;
> + bool warned = false;
> +
> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> + lru_add_drain();
> +
> + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
> + if (!warned && (jiffies - start) >= (5 * HZ)) {
> + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
> + task->comm, task->pid, smp_processor_id(),
> + (jiffies - start) / HZ);
> + warned = true;
> + }
> + if (should_resched())
> + schedule();
> + if (test_thread_flag(TIF_SIGPENDING))
> + break;
> + tick_nohz_cpu_isolated_wait();

If we call cpu_idle(), what is going to wake the CPU up if no further interrupts happen?

We could either implement some sort of tick waiters with proper wake up once the CPU sees
no tick to schedule. Arguably this is all risky because this involves a scheduler wake up
and thus the risk for new noise. But it might work.

Another possibility is an msleep() based wait. But that's about the same, maybe even worse
due to repetitive wake ups.

> + }
> + if (warned) {
> + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
> + task->comm, task->pid, smp_processor_id(),
> + (jiffies - start) / HZ);
> + dump_stack();
> + }
> +}
> +
> #endif
>
> /*
> --
> 2.1.2
>

2015-07-24 14:03:22

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Tue, Jul 21, 2015 at 03:10:54PM -0400, Chris Metcalf wrote:
> >>If you're arguing that the cpu_isolated semantic is really the only
> >>one that makes sense for nohz_full, my sense is that it might be
> >>surprising to many of the folks who do nohz_full work. But, I'm happy
> >>to be wrong on this point, and maybe all the nohz_full community is
> >>interested in making the same tradeoffs for nohz_full generally that
> >>I've proposed in this patch series just for cpu_isolated?
> >nohz_full is currently dog slow for no particularly good reasons. I
> >suspect that the interrupts you're seeing are also there for no
> >particularly good reasons as well.
> >
> >Let's fix them instead of adding new ABIs to work around them.
>
> Well, in principle if we accepted my proposed patch series
> and then over time came to decide that it was reasonable
> for nohz_full to have these complete cpu isolation
> semantics, the one proposed ABI simply becomes a no-op.
> So it's not as problematic an ABI as some.
>
> My issue is this: I'm totally happy with submitting a revised
> patch series that does all the stuff for pure nohz_full that
> I'm currently proposing for cpu_isolated. But, is it what
> the community wants? Should I propose it and see?
>
> Frederic, do you have any insight here? Thanks!

So you guys mean that if nohz_full were implemented fully as we
expect it to be, we wouldn't be burdened by noise at all and this
whole patchset would therefore be pointless, right? And that would
meet the requirements both for those who want hard isolation (a
critical noise-free guarantee) and for those who want soft isolation
(as little noise as possible, for performance).

Well, first of all, nohz is not isolation; it's a significant part of it
but it's not all of isolation. We really want to separate these things and
not mess up isolation policies in the tick code.

Second, yes, perhaps both the soft and hard isolation expectations can
eventually be implemented the same way, through hard isolation.
But that will only work if we don't do that polling for noise-free before
resuming userspace; polling might work for hard isolation that is ready to
sacrifice some warm-up before a run to meet guarantees, but it won't
work for soft isolation workloads.

So the only solution is to offload everything we can to housekeeping
CPUs. And if we still have stuff that can't be dealt with that way,
and which needs to be taken care of with some explicit operation
before resuming userspace, then we can start to think about splitting
things into several isolation configs.

Similarly, offloading everything to housekeepers means that we sacrifice
a CPU that could have been used in performance-oriented workloads, so that
might not suit soft isolation as well. But I think we'll see all that once
we manage to have pure noise-free CPUs (some patches are on the way to be
posted by Vatika Harlalka concerning killing the residual 1Hz tick).

To summarize, let's first split nohz and isolation. Introduce
CONFIG_CPU_ISOLATION and stuff all the isolation policies into
kernel/cpu_isolation.c; let's try to implement hard isolation and see if that
meets soft isolation workload users as well, and if not we'll split that later.

And we can keep the prctl to tell the user when hard isolation has been
broken, through SIGKILL or whatever. I think we do a similar thing
with SCHED_DEADLINE when the task hasn't met its deadline requirement. We
might want to do the same.

2015-07-24 20:19:59

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On 07/24/2015 10:03 AM, Frederic Weisbecker wrote:
> To summarize, let's first split nohz and isolation. Introduce
> CONFIG_CPU_ISOLATION and stuff all the isolation policies into
> kernel/cpu_isolation.c; let's try to implement hard isolation and see if that
> meets soft isolation workload users as well, and if not we'll split that later.

I will do that for v5.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-24 20:21:46

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On 07/24/2015 09:27 AM, Frederic Weisbecker wrote:
> On Mon, Jul 13, 2015 at 03:57:57PM -0400, Chris Metcalf wrote:
>> +{
>> + struct clock_event_device *dev =
>> + __this_cpu_read(tick_cpu_device.evtdev);
>> + struct task_struct *task = current;
>> + unsigned long start = jiffies;
>> + bool warned = false;
>> +
>> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
>> + lru_add_drain();
>> +
>> + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
>> + if (!warned && (jiffies - start) >= (5 * HZ)) {
>> + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
>> + task->comm, task->pid, smp_processor_id(),
>> + (jiffies - start) / HZ);
>> + warned = true;
>> + }
>> + if (should_resched())
>> + schedule();
>> + if (test_thread_flag(TIF_SIGPENDING))
>> + break;
>> + tick_nohz_cpu_isolated_wait();
> If we call cpu_idle(), what is going to wake the CPU up if no further interrupts happen?
>
> We could either implement some sort of tick waiters with proper wake up once the CPU sees
> no tick to schedule. Arguably this is all risky because this involves a scheduler wake up
> and thus the risk for new noise. But it might work.
>
> Another possibility is an msleep() based wait. But that's about the same, maybe even worse
> due to repetitive wake ups.

The presumption here is that it is not possible to have
tick_cpu_device have a pending next_event without also
having a timer interrupt pending to go off. That certainly
seems to be true on the architectures I have looked at.
Do we think that might ever not be the case?

We are running here with interrupts disabled, so this core won't
transition from "timer interrupt scheduled" to "no timer interrupt
scheduled" before we spin or idle, and presumably no other core
can reach across and turn off our timer interrupt either.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-24 20:22:06

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On 07/22/2015 04:02 PM, Christoph Lameter wrote:
> On Wed, 22 Jul 2015, Paul E. McKenney wrote:
>
>>>> Didn't Christoph Lameter fix this? Or is this an additional problem?
>>> Well the vmstat update must realize first that there are no outstanding
>>> updates before switching itself off. So typically there is one extra tick.
>>> But we could add another function that will simply fold the differential
>>> immediately and turn off the kworker task in the expectation that the
>>> processor will stay quiet.
>> Got it, thank you!
>>
>> Thanx, Paul
> Ok here is a function that quiets down the vmstat kworkers.

That's great - I will include this patch in my series then, and call it
as part of the "hard isolation" mode return to userspace. Thanks!

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-24 20:22:26

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On 07/21/2015 03:26 PM, Andy Lutomirski wrote:
> On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <[email protected]> wrote:
>> So just for the sake of precision, the thing I'm talking about
>> is the lru_add_drain() call on kernel exit. Are you proposing
>> that we call that for every nohz_full core on kernel exit?
>> I'm not opposed to this, but I don't know if other nohz
>> developers feel like this is the right tradeoff.
> I'm proposing either that we do that or that we arrange for other cpus
> to be able to steal our LRU list while we're in RCU user/idle.

That seems challenging; there is a lot that has to be done in
lru_add_drain() and we may not want to do it for the "soft
isolation" mode Frederic alludes to in a later email. And, we
would have to add a bunch of locking to allow another process
to steal the list from under us, so that's not obviously going
to be a performance win in terms of the per-cpu page cache
for normal operations.

Perhaps there could be a lock taken that nohz_full processes
have to take just to exit from userspace, and that other tasks
could take to do things on behalf of the nohz_full process that
it thinks it can do locklessly. It gets complicated, since you'd
want to tie that to whether the nohz_full process was currently
in the kernel or not, so some kind of atomic update on the
context_tracking state or some such, perhaps. Still not really
clear if that overhead is worth it (both from a maintenance
point of view and the possible performance hit).
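
As a very rough sketch of that idea (purely hypothetical; none of
these symbols exist in the series, and initialization is omitted), a
per-cpu lock could serialize the isolated CPU's kernel exits against
remote draining:

    /* Hypothetical sketch only. The isolated task takes the lock
     * briefly on its return-to-userspace path; housekeeping CPUs take
     * it when they want to drain this CPU's pagevecs remotely instead
     * of sending an IPI. The pagevec code would also have to become
     * aware of the lock for the remote access to actually be safe. */
    static DEFINE_PER_CPU(spinlock_t, isolation_exit_lock);

    /* On the isolated CPU, in the return-to-userspace path: */
    spin_lock(this_cpu_ptr(&isolation_exit_lock));
    lru_add_drain();        /* drain while we still own our state */
    spin_unlock(this_cpu_ptr(&isolation_exit_lock));

    /* On a housekeeping CPU, acting on behalf of isolated CPU "cpu": */
    spin_lock(&per_cpu(isolation_exit_lock, cpu));
    lru_add_drain_cpu(cpu); /* instead of an IPI */
    spin_unlock(&per_cpu(isolation_exit_lock, cpu));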

Limiting it just to the hard isolation mode seems like a good
answer since there we really know that userspace does not
care about the performance implications of kernel/userspace
transitions, and it doesn't cause slowdowns to anyone else.

For now I will bundle it in with my respin as part of the
"hard isolation" mode Frederic proposed.

>> Well, in principle if we accepted my proposed patch series
>> and then over time came to decide that it was reasonable
>> for nohz_full to have these complete cpu isolation
>> semantics, the one proposed ABI simply becomes a no-op.
>> So it's not as problematic an ABI as some.
> What if we made it a debugfs thing instead of a prctl? Have a mode
> where the system tries really hard to quiesce itself even at the cost
> of performance.

No, since it's really a mode within an individual task that you'd
like to switch on and off depending on what the task is trying
to do - strict mode while it's running its main fast-path userspace
code, but certainly not strict mode during its setup, and possibly
leaving strict mode to run some kinds of slow-path, diagnostic,
or error-handling code.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-24 20:30:26

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

On 07/21/2015 03:42 PM, Andy Lutomirski wrote:
> On Tue, Jul 21, 2015 at 12:34 PM, Chris Metcalf <[email protected]> wrote:
>> Second, you suggest a tracepoint. I'm OK with creating a tracepoint
>> dedicated to cpu_isolated strict failures and making that the only
>> way this mechanism works. But, earlier community feedback seemed to
>> suggest that the signal mechanism was OK; one piece of feedback
>> just requested being able to set which signal was delivered. Do you
>> think the signal idea is a bad one? Are you proposing potentially
>> having a signal and/or a tracepoint?
> I prefer the tracepoint. It's friendlier to debuggers, and it's
> really about diagnosing a kernel problem, not a userspace problem.
> Also, I really doubt that people should deploy a signal thing in
> production. What if an NMI fires and kills their realtime program?

No, this piece of the patch series is about diagnosing bugs in the
userspace program (likely in third-party code, in our customers'
experience). When you violate strict mode, you get a signal and
you have a nice pointer to what instruction it was that caused
you to enter the kernel.

You are right that running this in production is likely not a great
idea, as is true for other debugging mechanisms. But you might
really want to have it as a signal with a signal handler that fires
to generate a trace of some kind into the application's existing
tracing mechanisms, so the app doesn't just report "wow, I lost
a bunch of time in here somewhere, sorry about those packets
I dropped on the floor", but "here's where I took a strict signal".
You probably drop a few additional packets due to the signal
handling and logging, but given you've already fallen away from
100% in this case, the extra diagnostics are almost certainly
worth it.

In this case it's probably not as helpful to have a tracepoint-based
solution, just because you really do want to be able to easily
integrate into the app's existing logging framework.
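
As a purely illustrative sketch, assuming the configurable signal from
the later patch in this series is set to SIGUSR1, and with app_trace()
standing in for the application's own logging hook (exactly what the
siginfo carries depends on how the series fills it in):

    #include <signal.h>
    #include <stdio.h>

    static void app_trace(const char *msg, void *addr)
    {
        /* Stand-in; a real handler should log to an async-signal-safe
         * lock-free buffer rather than calling stdio. */
        fprintf(stderr, "strict violation: %s near %p\n", msg, addr);
    }

    static void strict_handler(int sig, siginfo_t *si, void *uc)
    {
        /* Extracting the exact faulting PC from uc is arch-specific. */
        app_trace("entered kernel", si->si_addr);
    }

    static void install_strict_handler(void)
    {
        struct sigaction sa = {
            .sa_sigaction = strict_handler,
            .sa_flags = SA_SIGINFO,
        };
        sigaction(SIGUSR1, &sa, NULL);
    }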

My sense, I think, is that we can easily add tracepoints to the
strict failure code in the future, so it may not be worth trying to
widen the scope of the patch series just now.

>> Last, you mention systemwide configuration for monitoring. Can you
>> expand on what you mean by that? We already support the monitoring
>> only on the nohz_full cores, so to that extent it's already systemwide.
>> And the per-task flag has to be set by the running process when it's
>> ready for this state, so that can't really be systemwide configuration.
>> I don't understand your suggestion on this point.
> I'm really thinking about systemwide configuration for isolation. I
> think we'll always (at least in the nearish term) need the admin's
> help to set up isolated CPUs. If the admin makes a whole CPU be
> isolated, then monitoring just that CPU and monitoring it all the time
> seems sensible. If we really do think that isolating a CPU should
> require a syscall of some sort because it's too expensive otherwise,
> then we can do it that way, too. And if full isolation requires some
> user help (e.g. don't do certain things that break isolation), then
> having a per-task monitoring flag seems reasonable.
>
> We may always need the user's help to avoid IPIs. For example, if one
> thread calls munmap, the other thread is going to get an IPI. There's
> nothing we can do about that.

I think we're mostly agreed on this stuff, though your use of
"monitored" doesn't really match the "strict" mode in this patch.

It's certainly true that, for example, we advise customers not to
run the slow-path code on a housekeeping cpu as a thread in the
same process space as the fast-path code on the nohz_full cores,
just because things like fclose() on a file descriptor will lead to
free() which can lead to munmap() and an IPI to the fast path.
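
For example, something as innocuous as the following in a slow-path
thread can end in an IPI to the fast-path cores if the two share an mm
(illustrative sketch; the log path is made up):

    #include <stdio.h>

    void log_stats(unsigned long npackets)
    {
        /* Runs on a housekeeping core but shares an address space
         * with the fast path. */
        FILE *f = fopen("/var/log/app.log", "a"); /* allocates a buffer */
        if (!f)
            return;
        fprintf(f, "stats: %lu packets\n", npackets);
        fclose(f);  /* free() can shrink the heap via munmap(), which
                     * IPIs every core running this address space for a
                     * TLB flush -- including the isolated ones. */
    }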

>> I'm certainly OK with rebasing on top of 4.3 after the context
>> tracking stuff is better. That said, I think it makes sense to continue
>> to debate the intent of the patch series even if we pull this one
>> patch out and defer it until after 4.3, or having it end up pulled
>> into some other repo that includes the improvements and
>> is being pulled for 4.3.
> Sure, no problem.

I will add a comment to the patch and a note to the series about
this, but for now I'll keep it in the series. If we can arrange to pull
it into Frederic's tree after the context_tracking changes, we can
respin it at that point to layer it on top.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-28 19:49:59

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 0/6] support "cpu_isolated" mode for nohz_full

This version of the patch series incorporates Christoph Lameter's
change to add a quiet_vmstat() call, and restructures cpu_isolated as
a "hard" isolation mode in contrast to nohz_full's "soft" isolation,
breaking it out as a separate CONFIG_CPU_ISOLATED with its own
include/linux/cpu_isolated.h and kernel/time/cpu_isolated.c.
It is rebased to 4.2-rc3.

Thomas: as I mentioned in v4, I haven't heard from you whether my
removal of the cpu_idle calls sufficiently addresses your concerns
about that aspect.

Andy: as I said in email, I've left in the support where cpu_isolated
relies on the context_tracking stuff currently in 4.2-rc3. I'm not
sure what the cleanest way is for me to pick up the new
context_tracking stuff; if that's all that ends up standing between
this patch series and having it be pulled, perhaps I can rebase it
onto whatever branch it is that has the new context_tracking?

Original patch series cover letter follows:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode. The
kernel must be built with CONFIG_CPU_ISOLATED to take advantage of
this new mode. A prctl() option (PR_SET_CPU_ISOLATED) is added to
control whether processes have requested this stricter semantics, and
within that prctl() option we provide a number of different bits for
more precise control. Additionally, we add a new command-line boot
argument to facilitate debugging where unexpected interrupts are being
delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.2-rc3) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v5:
rebased on kernel v4.2-rc3
converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
incorporates Christoph Lameter's quiet_vmstat() call

v4:
rebased on kernel v4.2-rc1
added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
remove dependency on cpu_idle subsystem (Thomas Gleixner)
use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
use seconds for console messages instead of jiffies (Thomas Gleixner)
updated commit description for patch 5/5

v2:
rename "dataplane" to "cpu_isolated"
drop ksoftirqd suppression changes (believed no longer needed)
merge previous "QUIESCE" functionality into baseline functionality
explicitly track syscalls and exceptions for "STRICT" functionality
allow configuring a signal to be delivered for STRICT mode failures
move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555); also because,
if the 1Hz tick were left enabled, cpu_isolated threads would never
re-enter userspace, since a tick would always be pending.

Chris Metcalf (5):
cpu_isolated: add initial support
cpu_isolated: support PR_CPU_ISOLATED_STRICT mode
cpu_isolated: provide strict mode configurable signal
cpu_isolated: add debug boot flag
nohz: cpu_isolated: allow tick to be fully disabled

Christoph Lameter (1):
vmstat: provide a function to quiet down the diff processing

Documentation/kernel-parameters.txt | 7 +++
arch/arm64/kernel/ptrace.c | 5 ++
arch/tile/kernel/process.c | 9 +++
arch/tile/kernel/ptrace.c | 5 +-
arch/tile/mm/homecache.c | 5 +-
arch/x86/kernel/ptrace.c | 2 +
include/linux/context_tracking.h | 11 +++-
include/linux/cpu_isolated.h | 42 +++++++++++++
include/linux/sched.h | 3 +
include/linux/vmstat.h | 2 +
include/uapi/linux/prctl.h | 8 +++
kernel/context_tracking.c | 12 +++-
kernel/irq_work.c | 5 +-
kernel/sched/core.c | 21 +++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 7 +++
kernel/sys.c | 8 +++
kernel/time/Kconfig | 20 +++++++
kernel/time/Makefile | 1 +
kernel/time/cpu_isolated.c | 116 ++++++++++++++++++++++++++++++++++++
kernel/time/tick-sched.c | 3 +-
mm/vmstat.c | 14 +++++
23 files changed, 305 insertions(+), 10 deletions(-)
create mode 100644 include/linux/cpu_isolated.h
create mode 100644 kernel/time/cpu_isolated.c

--
2.1.2

2015-07-28 19:50:01

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 1/6] vmstat: provide a function to quiet down the diff processing

From: Christoph Lameter <[email protected]>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <[email protected]>
---
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 14 ++++++++++++++
2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7f7100..c013b8d8e434 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
extern void dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page,
static inline void refresh_cpu_vm_stats(int cpu) { }
static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..cf7d324f16e2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_struct *w)
}

/*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+ do {
+ if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+ cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+ } while (refresh_cpu_vm_stats());
+}
+
+/*
* Check if the diffs for a certain cpu indicate that
* an update is needed.
*/
--
2.1.2

2015-07-28 19:50:11

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 2/6] cpu_isolated: add initial support

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new CPU_ISOLATED Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument. The "cpu_isolated" state is then
indicated by setting a new task struct field, cpu_isolated_flags,
to the value passed by prctl(). When the _ENABLE bit is set for a
task, and it is returning to userspace on a nohz_full core, it calls
the new cpu_isolated_enter() routine to take additional actions
to help the task avoid being interrupted in the future.

Initially, there are only three actions taken. First, the
task calls lru_add_drain() to prevent being interrupted by a
subsequent lru_add_drain_all() call on another core. Then, it calls
quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
interrupt. Finally, the code checks for pending timer interrupts
and quiesces until they are no longer pending. As a result, sys
calls (and page faults, etc.) can be inordinately slow. However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.
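
For illustration (this snippet is not part of the patch), a task would
opt in roughly as follows, defining the new prctl constants locally
since installed headers will not have them yet; the core number is a
placeholder:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/prctl.h>

    #define PR_SET_CPU_ISOLATED    47
    #define PR_CPU_ISOLATED_ENABLE (1 << 0)

    int main(void)
    {
        cpu_set_t set;

        /* Pin ourselves to a core in the nohz_full= set. */
        CPU_ZERO(&set);
        CPU_SET(3, &set);    /* placeholder core number */
        sched_setaffinity(0, sizeof(set), &set);

        /* Request hard isolation; it takes effect on each return to
         * userspace, which then quiesces pending timer ticks. */
        prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0);

        for (;;)
            ;    /* fast-path polling loop, never entering the kernel */
    }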

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/process.c | 9 ++++++
include/linux/cpu_isolated.h | 24 +++++++++++++++
include/linux/sched.h | 3 ++
include/uapi/linux/prctl.h | 5 ++++
kernel/context_tracking.c | 3 ++
kernel/sys.c | 8 +++++
kernel/time/Kconfig | 20 +++++++++++++
kernel/time/Makefile | 1 +
kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 144 insertions(+)
create mode 100644 include/linux/cpu_isolated.h
create mode 100644 kernel/time/cpu_isolated.c

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..7db6f8386417 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
_cpu_idle();
}

+#ifdef CONFIG_CPU_ISOLATED
+void cpu_isolated_wait(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ _cpu_idle();
+ set_current_state(TASK_RUNNING);
+}
+#endif
+
/*
* Release a thread_info structure
*/
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
new file mode 100644
index 000000000000..a3d17360f7ae
--- /dev/null
+++ b/include/linux/cpu_isolated.h
@@ -0,0 +1,24 @@
+/*
+ * CPU isolation related global functions
+ */
+#ifndef _LINUX_CPU_ISOLATED_H
+#define _LINUX_CPU_ISOLATED_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_CPU_ISOLATED
+static inline bool is_cpu_isolated(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
+extern void cpu_isolated_enter(void);
+extern void cpu_isolated_wait(void);
+#else
+static inline bool is_cpu_isolated(void) { return false; }
+static inline void cpu_isolated_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..0bb248385d88 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1776,6 +1776,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_CPU_ISOLATED
+ unsigned int cpu_isolated_flags;
+#endif
/* CPU-specific state of this task */
struct thread_struct thread;
/*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED 47
+#define PR_GET_CPU_ISOLATED 48
+# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..36b6509c3e2a 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/cpu_isolated.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (is_cpu_isolated())
+ cpu_isolated_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..c68417ff4800 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_CPU_ISOLATED
+ case PR_SET_CPU_ISOLATED:
+ me->cpu_isolated_flags = arg2;
+ break;
+ case PR_GET_CPU_ISOLATED:
+ error = me->cpu_isolated_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 579ce1b929af..141969149994 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -195,5 +195,25 @@ config HIGH_RES_TIMERS
hardware is not capable then this option only increases
the size of the kernel image.

+config CPU_ISOLATED
+ bool "Provide hard CPU isolation from the kernel on demand"
+ depends on NO_HZ_FULL
+ help
+ Allow userspace processes to place themselves on nohz_full
+ cores and run prctl(PR_SET_CPU_ISOLATED) to "isolate"
+ themselves from the kernel. On return to userspace,
+ cpu-isolated tasks will first arrange that no future kernel
+ activity will interrupt the task while the task is running
+ in userspace. This "hard" isolation from the kernel is
+ required for userspace tasks with hard real-time requirements,
+ such as a 10 Gbit network driver running in userspace.
+
+ Without this option, but with NO_HZ_FULL enabled, the kernel
+ will make a good-faith, "soft" effort to shield a single userspace
+ process from interrupts, but makes no guarantees.
+
+ You should say "N" unless you are intending to run a
+ high-performance userspace driver or similar task.
+
endmenu
endif
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 49eca0beed32..984081cce974 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o tick-sched.o
obj-$(CONFIG_TIMER_STATS) += timer_stats.o
obj-$(CONFIG_DEBUG_FS) += timekeeping_debug.o
obj-$(CONFIG_TEST_UDELAY) += test_udelay.o
+obj-$(CONFIG_CPU_ISOLATED) += cpu_isolated.o
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
new file mode 100644
index 000000000000..e27259f30caf
--- /dev/null
+++ b/kernel/time/cpu_isolated.c
@@ -0,0 +1,71 @@
+/*
+ * linux/kernel/time/cpu_isolated.c
+ *
+ * Implementation for cpu isolation.
+ *
+ * Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/cpu_isolated.h>
+#include "tick-sched.h"
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak cpu_isolated_wait(void)
+{
+ cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In cpu_isolated mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two cpu_isolated processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void cpu_isolated_enter(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ /* Quieten the vmstat worker so it won't interrupt us. */
+ quiet_vmstat();
+
+ while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+ cpu_isolated_wait();
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ dump_stack();
+ }
+}
--
2.1.2

2015-07-28 19:51:37

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 3/6] cpu_isolated: support PR_CPU_ISOLATED_STRICT mode

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves. In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies. Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.
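
For example (an illustrative fragment, not part of the patch;
run_main_loop() is a hypothetical application function), a task might
bracket its latency-critical loop like this, relying on the prctl()
exemption to turn strict mode back off afterwards:

	prctl(PR_SET_CPU_ISOLATED,
	      PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT, 0, 0, 0);
	run_main_loop();	/* any syscall or trap here is fatal */
	/* prctl() is exempt from the strict check, so this is safe: */
	prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0);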

This change adds the syscall-detection hooks only for x86, arm64,
and tile.

The signature of context_tracking_exit() changes to report whether
we are, in fact, exiting back to user space, so that we can properly
track user exceptions separately from other kernel entries.

Signed-off-by: Chris Metcalf <[email protected]>
---
Note: Andy Lutomirski points out that improvements are coming
to the context_tracking code to make it more robust, which may
mean that some of the code suggested here for context_tracking
may not be necessary. I am keeping it in the series for now since
it is required for it to work based on 4.2-rc3.

arch/arm64/kernel/ptrace.c | 5 +++++
arch/tile/kernel/ptrace.c | 5 ++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/cpu_isolated.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/time/cpu_isolated.c | 38 ++++++++++++++++++++++++++++++++++++++
8 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..ff83968ab4d4 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
#include <linux/regset.h>
#include <linux/tracehook.h>
#include <linux/elf.h>
+#include <linux/cpu_isolated.h>

#include <asm/compat.h>
#include <asm/debug-monitors.h>
@@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,

asmlinkage int syscall_trace_enter(struct pt_regs *regs)
{
+ /* Ensure we report cpu_isolated violations in all circumstances. */
+ if (test_thread_flag(TIF_NOHZ) && cpu_isolated_strict())
+ cpu_isolated_syscall(regs->syscallno);
+
/* Do the secure computing check first; failures should be fast. */
if (secure_computing() == -1)
return -1;
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..e54256c54311 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (cpu_isolated_strict())
+ cpu_isolated_syscall(regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..e5aec57e8e25 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (cpu_isolated_strict())
+ cpu_isolated_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..590414ef2bf1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/cpu_isolated.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);

@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (cpu_isolated_strict())
+ cpu_isolated_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
index a3d17360f7ae..b0f1c2669b2f 100644
--- a/include/linux/cpu_isolated.h
+++ b/include/linux/cpu_isolated.h
@@ -15,10 +15,26 @@ static inline bool is_cpu_isolated(void)
}

extern void cpu_isolated_enter(void);
+extern void cpu_isolated_syscall(int nr);
+extern void cpu_isolated_exception(void);
extern void cpu_isolated_wait(void);
#else
static inline bool is_cpu_isolated(void) { return false; }
static inline void cpu_isolated_enter(void) { }
+static inline void cpu_isolated_syscall(int nr) { }
+static inline void cpu_isolated_exception(void) { }
#endif

+static inline bool cpu_isolated_strict(void)
+{
+#ifdef CONFIG_CPU_ISOLATED
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags &
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_CPU_ISOLATED 47
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+# define PR_CPU_ISOLATED_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 36b6509c3e2a..c740850eea11 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
context_tracking_recursion_exit();
out_irq_restore:
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
index e27259f30caf..d30bf3852897 100644
--- a/kernel/time/cpu_isolated.c
+++ b/kernel/time/cpu_isolated.c
@@ -10,6 +10,7 @@
#include <linux/swap.h>
#include <linux/vmstat.h>
#include <linux/cpu_isolated.h>
+#include <asm/unistd.h>
#include "tick-sched.h"

/*
@@ -69,3 +70,40 @@ void cpu_isolated_enter(void)
dump_stack();
}
}
+
+static void kill_cpu_isolated_strict_task(void)
+{
+ dump_stack();
+ current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void cpu_isolated_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void cpu_isolated_exception(void)
+{
+ pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_cpu_isolated_strict_task();
+}
--
2.1.2

2015-07-28 19:50:18

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 4/6] cpu_isolated: provide strict mode configurable signal

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
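
As an illustrative sketch (not part of the patch), a task could arrange
to survive a violation with SIGUSR1 rather than dying with SIGKILL, and
distinguish syscalls from exceptions via si_code; the fallback macro
values are the ones this series defines in <linux/prctl.h>:

	#include <signal.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_CPU_ISOLATED
	# define PR_SET_CPU_ISOLATED 47
	# define PR_CPU_ISOLATED_ENABLE (1 << 0)
	# define PR_CPU_ISOLATED_STRICT (1 << 1)
	# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8)
	#endif

	static void isol_handler(int sig, siginfo_t *info, void *uc)
	{
		/* Per this patch: si_code is 1 for a syscall, 0 otherwise. */
		const char *msg = info->si_code ?
			"isolation lost: syscall\n" :
			"isolation lost: exception\n";
		write(STDERR_FILENO, msg, strlen(msg));	/* signal-safe */
	}

	static int enable_strict_with_signal(void)
	{
		struct sigaction sa = {
			.sa_sigaction = isol_handler,
			.sa_flags = SA_SIGINFO,
		};

		sigemptyset(&sa.sa_mask);
		if (sigaction(SIGUSR1, &sa, NULL) != 0)
			return -1;
		return prctl(PR_SET_CPU_ISOLATED,
			     PR_CPU_ISOLATED_ENABLE |
			     PR_CPU_ISOLATED_STRICT |
			     PR_CPU_ISOLATED_SET_SIG(SIGUSR1), 0, 0, 0);
	}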

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/time/cpu_isolated.c | 17 ++++++++++++-----
2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
# define PR_CPU_ISOLATED_STRICT (1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
index d30bf3852897..9f8fcbd97770 100644
--- a/kernel/time/cpu_isolated.c
+++ b/kernel/time/cpu_isolated.c
@@ -71,11 +71,18 @@ void cpu_isolated_enter(void)
}
}

-static void kill_cpu_isolated_strict_task(void)
-{
+static void kill_cpu_isolated_strict_task(int is_syscall)
+{
+ siginfo_t info = {};
+ int sig;
+
dump_stack();
current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
- send_sig(SIGKILL, current, 1);
+
+ sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+ info.si_signo = sig;
+ info.si_code = is_syscall;
+ send_sig_info(sig, &info, current);
}

/*
@@ -94,7 +101,7 @@ void cpu_isolated_syscall(int syscall)

pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
current->comm, current->pid, syscall);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(1);
}

/*
@@ -105,5 +112,5 @@ void cpu_isolated_exception(void)
{
pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
current->comm, current->pid);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(0);
}
--
2.1.2

2015-07-28 19:50:58

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 5/6] cpu_isolated: add debug boot flag

The new "cpu_isolated_debug" flag simplifies debugging
of CPU_ISOLATED kernels when processes are running in
PR_CPU_ISOLATED_ENABLE mode. Such processes should get no interrupts
from the kernel, and if they do, this boot flag causes a kernel
stack dump to be generated on the console.

It's possible to use ftrace to simply detect whether a cpu_isolated
core has unexpectedly entered the kernel. But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code what remote core and context
is preparing to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
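
For reference, a boot command line exercising the flag might look like
this (the nohz_full CPU list here is just an example):

	nohz_full=1-3 cpu_isolated_debug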

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 7 +++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/cpu_isolated.h | 2 ++
kernel/irq_work.c | 5 ++++-
kernel/sched/core.c | 21 +++++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 7 +++++++
8 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f0459cd7b..940e4c9f1978 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -749,6 +749,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
/proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.txt.

+ cpu_isolated_debug [KNL]
+ In kernels built with CONFIG_CPU_ISOLATED and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_CPU_ISOLATED_ENABLE
+ and is running on a nohz_full core.
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..fdef5e3d6396 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/cpu_isolated.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ cpu_isolated_debug(cpu);
+ }
}

/*
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
index b0f1c2669b2f..4ea67d640be7 100644
--- a/include/linux/cpu_isolated.h
+++ b/include/linux/cpu_isolated.h
@@ -18,11 +18,13 @@ extern void cpu_isolated_enter(void);
extern void cpu_isolated_syscall(int nr);
extern void cpu_isolated_exception(void);
extern void cpu_isolated_wait(void);
+extern void cpu_isolated_debug(int cpu);
#else
static inline bool is_cpu_isolated(void) { return false; }
static inline void cpu_isolated_enter(void) { }
static inline void cpu_isolated_syscall(int nr) { }
static inline void cpu_isolated_exception(void) { }
+static inline void cpu_isolated_debug(int cpu) { }
#endif

static inline bool cpu_isolated_strict(void)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..3c08a41f9898 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/smp.h>
+#include <linux/cpu_isolated.h>
#include <asm/processor.h>


@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ cpu_isolated_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10081..647671900497 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
#include <linux/binfmts.h>
#include <linux/context_tracking.h>
#include <linux/compiler.h>
+#include <linux/cpu_isolated.h>

#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -745,6 +746,26 @@ bool sched_can_stop_tick(void)
}
#endif /* CONFIG_NO_HZ_FULL */

+#ifdef CONFIG_CPU_ISOLATED
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static bool cpu_isolated_debug_flag;
+static int __init cpu_isolated_debug_func(char *str)
+{
+ cpu_isolated_debug_flag = true;
+ return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void cpu_isolated_debug(int cpu)
+{
+ if (cpu_isolated_debug_flag && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+ pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+ dump_stack();
+ }
+}
+#endif
+
void sched_avg_update(struct rq *rq)
{
s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 836df8dac6cc..90ee460c2586 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_CPU_ISOLATED
+ /* If the task is being killed, don't complain about cpu_isolated. */
+ if (state & TASK_WAKEKILL)
+ t->cpu_isolated_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..846e42a3daa3 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/cpu_isolated.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ cpu_isolated_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ cpu_isolated_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..456149a4a34f 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,8 +24,10 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
+#include <linux/context_tracking.h>
#include <linux/tick.h>
#include <linux/irq.h>
+#include <linux/cpu_isolated.h>

#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>
@@ -335,6 +337,11 @@ void irq_enter(void)
_local_bh_enable();
}

+ if (context_tracking_cpu_is_enabled() &&
+ context_tracking_in_user() &&
+ !in_interrupt())
+ cpu_isolated_debug(smp_processor_id());
+
__irq_enter();
}

--
2.1.2

2015-07-28 19:50:27

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 6/6] nohz: cpu_isolated: allow tick to be fully disabled

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on
running completely tickless, so don't bound the time_delta for such
processes. In addition, due to the way such processes quiesce by
waiting for the timer tick to stop prior to returning to userspace,
without this commit it won't be possible to use the cpu_isolated
mode at all.

Removing the 1-second cap was previously discussed (see link
below) and Thomas Gleixner observed that vruntime, load balancing
data, load accounting, and other things might be impacted.
Frederic Weisbecker similarly observed that allowing the tick to
be indefinitely deferred just meant that no one would ever fix the
underlying bugs. However it's at least true that the mode proposed
in this patch can only be enabled on a nohz_full core by a process
requesting cpu_isolated mode, which may limit how important it is
to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz
fallback timer is removed, this will create an environment where new code
that relies on that tick will get punished, and we won't forgive
such assumptions silently, so it may also be worth it from that
perspective.

Finally, it's worth observing that the tile architecture has been
using similar code for its Zero-Overhead Linux for many years
(starting in 2008) and customers are very enthusiastic about the
resulting bare-metal performance on cores that are available to
run full Linux semantics on demand (crash, logging, shutdown, etc).
So these semantics are very useful if we can convince ourselves
that doing this is safe.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/time/tick-sched.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429e98c6..3a1d48418499 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/cpu_isolated.h>

#include <asm/irq_regs.h>

@@ -652,7 +653,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,

#ifdef CONFIG_NO_HZ_FULL
/* Limit the tick delta to the maximum scheduler deferment */
- if (!ts->inidle)
+ if (!ts->inidle && !is_cpu_isolated())
delta = min(delta, scheduler_tick_max_deferment());
#endif

--
2.1.2

2015-08-12 16:00:27

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v5 2/6] cpu_isolated: add initial support

On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote:
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
>
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device
> driver style applications, such as high-speed networking code.
>
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> Subsequent commits will add additional flags and additional
> semantics.

We are doing this at the process level, but the isolation works at
the CPU scope... Now I wonder if prctl() is the right interface.

That said, the user is mostly interested in isolating a task; the
CPU is just the eventual backend.

For example if the task is migrated by accident, we want it to be
warned about that. And if the isolation is done on the CPU level
instead of the task level, this won't happen.

I'm also afraid that the naming clashes with cpu_isolated_map,
although it could be a subset of it.

So probably in this case we should consider talking about task rather
than CPU isolation and change naming accordingly (sorry, I know I
suggested cpu_isolation.c, I guess I had to see the result to realize).

We must sort that out first. Either we consider isolation on the task
level (and thus the underlying CPU by backend effect) and we use prctl().
Or we do this on the CPU level and we use a specific syscall or sysfs
which takes effect on any task in the relevant isolated CPUs.

What do you think?

It would be nice to hear others opinions as well.

> The kernel must be built with the new CPU_ISOLATED Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> nohz_full=CPULIST boot argument. The "cpu_isolated" state is then
> indicated by setting a new task struct field, cpu_isolated_flags,
> to the value passed by prctl(). When the _ENABLE bit is set for a
> task, and it is returning to userspace on a nohz_full core, it calls
> the new cpu_isolated_enter() routine to take additional actions
> to help the task avoid being interrupted in the future.
>
> Initially, there are only three actions taken. First, the
> task calls lru_add_drain() to prevent being interrupted by a
> subsequent lru_add_drain_all() call on another core. Then, it calls
> quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
> interrupt. Finally, the code checks for pending timer interrupts
> and quiesces until they are no longer pending. As a result, system
> calls (and page faults, etc.) can be inordinately slow. However,
> this quiescing guarantees that no unexpected interrupts will occur,
> even if the application intentionally calls into the kernel.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> arch/tile/kernel/process.c | 9 ++++++
> include/linux/cpu_isolated.h | 24 +++++++++++++++
> include/linux/sched.h | 3 ++
> include/uapi/linux/prctl.h | 5 ++++
> kernel/context_tracking.c | 3 ++
> kernel/sys.c | 8 +++++
> kernel/time/Kconfig | 20 +++++++++++++
> kernel/time/Makefile | 1 +
> kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++

It's not about time :-)

The timer is only a part of the isolation.

Moreover "isolatED" is a state. The filename should reflect the process. "isolatION" would
better fit.

kernel/task_isolation.c maybe or just kernel/isolation.c

I think I prefer the latter because I'm not only interested in the hard
task-isolation feature; I would also like to drive all the general isolation
operations from there (workqueue affinity, rcu nocb, ...).

> 9 files changed, 144 insertions(+)
> create mode 100644 include/linux/cpu_isolated.h
> create mode 100644 kernel/time/cpu_isolated.c
>
> diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
> index e036c0aa9792..7db6f8386417 100644
> --- a/arch/tile/kernel/process.c
> +++ b/arch/tile/kernel/process.c
> @@ -70,6 +70,15 @@ void arch_cpu_idle(void)
> _cpu_idle();
> }
>
> +#ifdef CONFIG_CPU_ISOLATED
> +void cpu_isolated_wait(void)
> +{
> + set_current_state(TASK_INTERRUPTIBLE);
> + _cpu_idle();
> + set_current_state(TASK_RUNNING);
> +}

I'm still uncomfortable with that. Could a wake-up model work?

> +#endif
> +
> /*
> * Release a thread_info structure
> */
> diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
> new file mode 100644
> index 000000000000..a3d17360f7ae
> --- /dev/null
> +++ b/include/linux/cpu_isolated.h
> @@ -0,0 +1,24 @@
> +/*
> + * CPU isolation related global functions
> + */
> +#ifndef _LINUX_CPU_ISOLATED_H
> +#define _LINUX_CPU_ISOLATED_H
> +
> +#include <linux/tick.h>
> +#include <linux/prctl.h>
> +
> +#ifdef CONFIG_CPU_ISOLATED
> +static inline bool is_cpu_isolated(void)
> +{
> + return tick_nohz_full_cpu(smp_processor_id()) &&
> + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
> +}
> +
> +extern void cpu_isolated_enter(void);
> +extern void cpu_isolated_wait(void);
> +#else
> +static inline bool is_cpu_isolated(void) { return false; }
> +static inline void cpu_isolated_enter(void) { }
> +#endif

And all the naming should be about task as well, if we take that task direction.

> +
> +#endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04b5ada460b4..0bb248385d88 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1776,6 +1776,9 @@ struct task_struct {
> unsigned long task_state_change;
> #endif
> int pagefault_disabled;
> +#ifdef CONFIG_CPU_ISOLATED
> + unsigned int cpu_isolated_flags;
> +#endif

Can't we add a new flag to tsk->flags? There seem to be some values remaining.

Thanks.

2015-08-12 18:22:28

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v5 2/6] cpu_isolated: add initial support

On 08/12/2015 12:00 PM, Frederic Weisbecker wrote:
> On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote:
>> The existing nohz_full mode is designed as a "soft" isolation mode
>> that makes tradeoffs to minimize userspace interruptions while
>> still attempting to avoid overheads in the kernel entry/exit path,
>> to provide 100% kernel semantics, etc.
>>
>> However, some applications require a "hard" commitment from the
>> kernel to avoid interruptions, in particular userspace device
>> driver style applications, such as high-speed networking code.
>>
>> This change introduces a framework to allow applications
>> to elect to have the "hard" semantics as needed, specifying
>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>> Subsequent commits will add additional flags and additional
>> semantics.
> We are doing this at the process level but the isolation works on
> the CPU scope... Now I wonder if prctl is the right interface.
>
> That said the user is rather interested in isolating a task. The CPU
> being the backend eventually.
>
> For example if the task is migrated by accident, we want it to be
> warned about that. And if the isolation is done on the CPU level
> instead of the task level, this won't happen.
>
> I'm also afraid that the naming clashes with cpu_isolated_map,
> although it could be a subset of it.
>
> So probably in this case we should consider talking about task rather
> than CPU isolation and change naming accordingly (sorry, I know I
> suggested cpu_isolation.c, I guess I had to see the result to realize).
>
> We must sort that out first. Either we consider isolation on the task
> level (and thus the underlying CPU by backend effect) and we use prctl().
> Or we do this on the CPU level and we use a specific syscall or sysfs
> which takes effect on any task in the relevant isolated CPUs.
>
> What do you think?

Yes, definitely task-centric is the right model.

With the original tilegx version of this code, we also checked that
the process had only a single core in its affinity mask, and that the
single core in question was a nohz_full core, before allowing the
"task isolated" mode to take effect. I didn't do that in this round
of patches because it seemed a little silly in that the user could
then immediately reset their affinity to another core and lose the
effect, and it wasn't clear how to handle that: do we return EINVAL
from sched_setaffinity() after enabling the "task isolated" mode?
That seems potentially ugly, maybe standards-violating, etc. So I
didn't bother.

But you could certainly argue for failing prctl() in that case anyway,
as a way to make sure users aren't doing something stupid like calling
the prctl() from a task that's running on a housekeeping core. And
you could even argue for doing some kind of console spew if you try to
migrate a task that is in "task isolation" state - though I suppose if
you migrate it to another isolcpus and nohz_full core, maybe that's
kind of reasonable and doesn't deserve a warning? I'm not sure.
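
(For concreteness, the tilegx-style check I'm describing would look
roughly like the sketch below; this is not code from the series, and
the helper name is invented:

	/* Allow the prctl() only if pinned to a single nohz_full core. */
	static int cpu_isolated_may_enable(struct task_struct *p)
	{
		if (cpumask_weight(tsk_cpus_allowed(p)) != 1)
			return -EINVAL;
		if (!tick_nohz_full_cpu(cpumask_first(tsk_cpus_allowed(p))))
			return -EINVAL;
		return 0;
	}

Though, as noted, nothing stops the task from resetting its affinity
immediately after a check like this succeeds.)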

>> The kernel must be built with the new CPU_ISOLATED Kconfig flag
>> to enable this mode, and the kernel booted with an appropriate
>> nohz_full=CPULIST boot argument. The "cpu_isolated" state is then
>> indicated by setting a new task struct field, cpu_isolated_flags,
>> to the value passed by prctl(). When the _ENABLE bit is set for a
>> task, and it is returning to userspace on a nohz_full core, it calls
>> the new cpu_isolated_enter() routine to take additional actions
>> to help the task avoid being interrupted in the future.
>>
>> Initially, there are only three actions taken. First, the
>> task calls lru_add_drain() to prevent being interrupted by a
>> subsequent lru_add_drain_all() call on another core. Then, it calls
>> quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
>> interrupt. Finally, the code checks for pending timer interrupts
>> and quiesces until they are no longer pending. As a result, sys
>> calls (and page faults, etc.) can be inordinately slow. However,
>> this quiescing guarantees that no unexpected interrupts will occur,
>> even if the application intentionally calls into the kernel.
>>
>> Signed-off-by: Chris Metcalf <[email protected]>
>> ---
>> arch/tile/kernel/process.c | 9 ++++++
>> include/linux/cpu_isolated.h | 24 +++++++++++++++
>> include/linux/sched.h | 3 ++
>> include/uapi/linux/prctl.h | 5 ++++
>> kernel/context_tracking.c | 3 ++
>> kernel/sys.c | 8 +++++
>> kernel/time/Kconfig | 20 +++++++++++++
>> kernel/time/Makefile | 1 +
>> kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++
> It's not about time :-)
>
> The timer is only a part of the isolation.
>
> Moreover "isolatED" is a state. The filename should reflect the process. "isolatION" would
> better fit.
>
> kernel/task_isolation.c maybe or just kernel/isolation.c
>
> I think I prefer the latter because I'm not only interested in that task
> hard isolation feature, I would like to also drive all the general isolation
> operations from there (workqueue affinity, rcu nocb, ...).

That's reasonable, but I think the "task isolation" naming is probably
better for all the stuff that we're doing in this patch. In other words,
we probably should use "task_isolation" as the prefix for symbols
names and API names, even if we put the code in kernel/isolation.c
for now in anticipation of non-task isolation being added later.

I think my instinct would still be to call it kernel/task_isolation.c
until we actually add some non-task isolation, and at that point we
can decide if it makes sense to rename the file, or put the new
code somewhere else, but I'm OK with doing it the way I described
in the previous paragraph if you think it's better.

>> +#ifdef CONFIG_CPU_ISOLATED
>> +void cpu_isolated_wait(void)
>> +{
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + _cpu_idle();
>> + set_current_state(TASK_RUNNING);
>> +}
> I'm still uncomfortable with that. Could a wake-up model work?

I don't know exactly what you have in mind. The theory is that
at this point we're ready to return to user space and we're just
waiting for a timer tick that is guaranteed to arrive, since there
is something pending for the timer.

And, this is an arch-specific method anyway; the generic method
is actually checking to see if a signal has been delivered,
scheduling is needed, etc., each time around the loop, so if
you're not sure your architecture will do the right thing, just
don't provide a method that idles while waiting. For tilegx I'm
sure it works correctly, so I'm OK providing that method.

>> +extern void cpu_isolated_enter(void);
>> +extern void cpu_isolated_wait(void);
>> +#else
>> +static inline bool is_cpu_isolated(void) { return false; }
>> +static inline void cpu_isolated_enter(void) { }
>> +#endif
> And all the naming should be about task as well, if we take that task direction.

As discussed above, probably task_isolation_enter(), etc.

>> +
>> +#endif
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 04b5ada460b4..0bb248385d88 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1776,6 +1776,9 @@ struct task_struct {
>> unsigned long task_state_change;
>> #endif
>> int pagefault_disabled;
>> +#ifdef CONFIG_CPU_ISOLATED
>> + unsigned int cpu_isolated_flags;
>> +#endif
> Can't we add a new flag to tsk->flags? There seem to be some values remaining.

Yeah, I thought of that, but it seems like a pretty scarce resource,
and I wasn't sure it was the right thing to do. Also, I'm not actually
sure why the lowest two bits aren't apparently being used; looks
like PF_EXITING (0x4) is the first bit used. And there are only three
more bits higher up in the word that are not assigned.

Also, right now we are allowing users to customize the signal delivered
for STRICT violation, and that signal value is stored in the
cpu_isolated_flags word as well, so we really don't have room in
tsk->flags for all of that anyway.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-08-25 19:56:16

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 0/6] support "task_isolated" mode for nohz_full

The cover email for the patch series is getting a little unwieldy
so I will provide a terser summary here, and just update the
list of changes from version to version. Please see the previous
versions linked by the In-Reply-To for more detailed comments
about changes in earlier versions of the patch series.

v6:
restructured to be a "task_isolation" mode not a "cpu_isolated"
mode (Frederic)

v5:
rebased on kernel v4.2-rc3
converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
incorporates Christoph Lameter's quiet_vmstat() call

v4:
rebased on kernel v4.2-rc1
added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
remove dependency on cpu_idle subsystem (Thomas Gleixner)
use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
use seconds for console messages instead of jiffies (Thomas Gleixner)
updated commit description for patch 5/5

v2:
rename "dataplane" to "cpu_isolated"
drop ksoftirqd suppression changes (believed no longer needed)
merge previous "QUIESCE" functionality into baseline functionality
explicitly track syscalls and exceptions for "STRICT" functionality
allow configuring a signal to be delivered for STRICT mode failures
move debug tracking to irq_enter(), not irq_exit()

General summary:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode. The
kernel must be built with CONFIG_TASK_ISOLATION to take advantage of
this new mode. A prctl() option (PR_SET_TASK_ISOLATION) is added to
control whether processes have requested this stricter semantics, and
within that prctl() option we provide a number of different bits for
more precise control. Additionally, we add a new command-line boot
argument to facilitate debugging where unexpected interrupts are being
delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on task_isolation cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.2-rc3) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555); also because
if we keep the 1Hz tick, task_isolation threads will never re-enter
userspace, since a tick will always be pending.

Chris Metcalf (5):
task_isolation: add initial support
task_isolation: support PR_TASK_ISOLATION_STRICT mode
task_isolation: provide strict mode configurable signal
task_isolation: add debug boot flag
nohz: task_isolation: allow tick to be fully disabled

Christoph Lameter (1):
vmstat: provide a function to quiet down the diff processing

Documentation/kernel-parameters.txt | 7 +++
arch/arm64/kernel/ptrace.c | 5 ++
arch/tile/kernel/process.c | 9 +++
arch/tile/kernel/ptrace.c | 5 +-
arch/tile/mm/homecache.c | 5 +-
arch/x86/kernel/ptrace.c | 2 +
include/linux/context_tracking.h | 11 +++-
include/linux/isolation.h | 42 +++++++++++++
include/linux/sched.h | 3 +
include/linux/vmstat.h | 2 +
include/uapi/linux/prctl.h | 8 +++
init/Kconfig | 20 ++++++
kernel/Makefile | 1 +
kernel/context_tracking.c | 12 +++-
kernel/irq_work.c | 5 +-
kernel/isolation.c | 122 ++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 21 +++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 7 +++
kernel/sys.c | 8 +++
kernel/time/tick-sched.c | 3 +-
mm/vmstat.c | 14 +++++
23 files changed, 311 insertions(+), 10 deletions(-)
create mode 100644 include/linux/isolation.h
create mode 100644 kernel/isolation.c

--
2.1.2

2015-08-25 19:56:26

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 1/6] vmstat: provide a function to quiet down the diff processing

From: Christoph Lameter <[email protected]>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <[email protected]>
---
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 14 ++++++++++++++
2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7f7100..c013b8d8e434 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
extern void dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page,
static inline void refresh_cpu_vm_stats(int cpu) { }
static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..cf7d324f16e2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_struct *w)
}

/*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+ do {
+ if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+ cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+ } while (refresh_cpu_vm_stats());
+}
+
+/*
* Check if the diffs for a certain cpu indicate that
* an update is needed.
*/
--
2.1.2

2015-08-25 19:56:33

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 2/6] task_isolation: add initial support

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.
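
(For illustration, and not part of the patch itself: with the v6
renaming, the userspace call becomes

	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);

for a task already affinitized to a nohz_full core.)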

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument. The "task_isolation" state is then
indicated by setting a new task struct field, task_isolation_flag,
to the value passed by prctl(). When the _ENABLE bit is set for a
task, and it is returning to userspace on a nohz_full core, it calls
the new task_isolation_enter() routine to take additional actions
to help the task avoid being interrupted in the future.

Initially, there are only three actions taken. First, the
task calls lru_add_drain() to prevent being interrupted by a
subsequent lru_add_drain_all() call on another core. Then, it calls
quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
interrupt. Finally, the code checks for pending timer interrupts
and quiesces until they are no longer pending. As a result, system
calls (and page faults, etc.) can be inordinately slow. However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/process.c | 9 ++++++
include/linux/isolation.h | 24 +++++++++++++++
include/linux/sched.h | 3 ++
include/uapi/linux/prctl.h | 5 ++++
init/Kconfig | 20 +++++++++++++
kernel/Makefile | 1 +
kernel/context_tracking.c | 3 ++
kernel/isolation.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 8 +++++
9 files changed, 148 insertions(+)
create mode 100644 include/linux/isolation.h
create mode 100644 kernel/isolation.c

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..1d9bd2320a50 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
_cpu_idle();
}

+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_wait(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ _cpu_idle();
+ set_current_state(TASK_RUNNING);
+}
+#endif
+
/*
* Release a thread_info structure
*/
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..fd04011b1c1e
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,24 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+static inline bool task_isolation_enabled(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
+}
+
+extern void task_isolation_enter(void);
+extern void task_isolation_wait(void);
+#else
+static inline bool task_isolation_enabled(void) { return false; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..2acb618189d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1776,6 +1776,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+ unsigned int task_isolation_flags;
+#endif
/* CPU-specific state of this task */
struct thread_struct thread;
/*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..79da784fe17a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION 47
+#define PR_GET_TASK_ISOLATION 48
+# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index af09b4fb43d2..82d313cbd70f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -795,6 +795,26 @@ config RCU_EXPEDITE_BOOT

endmenu # "RCU Subsystem"

+config TASK_ISOLATION
+ bool "Provide hard CPU isolation from the kernel on demand"
+ depends on NO_HZ_FULL
+ help
+ Allow userspace processes to place themselves on nohz_full
+ cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+ themselves from the kernel. On return to userspace,
+ isolated tasks will first arrange that no future kernel
+ activity will interrupt the task while the task is running
+ in userspace. This "hard" isolation from the kernel is
+ required for userspace tasks with hard real-time requirements,
+ such as a 10 Gbit network driver running in userspace.
+
+ Without this option, but with NO_HZ_FULL enabled, the kernel
+ will make a good-faith, "soft" effort to shield a single userspace
+ process from interrupts, but makes no guarantees.
+
+ You should say "N" unless you are intending to run a
+ high-performance userspace driver or similar task.
+
config BUILD_BIN2C
bool
default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 43c4c920f30a..9ffb5c021767 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..c57c99f5c4d7 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/isolation.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (task_isolation_enabled())
+ task_isolation_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..d4618cd9e23d
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,75 @@
+/*
+ * linux/kernel/isolation.c
+ *
+ * Implementation for task isolation.
+ *
+ * Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include "time/tick-sched.h"
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ *
+ * Note that it must be guaranteed for a particular architecture
+ * that if next_event is not KTIME_MAX, then a timer interrupt will
+ * occur; otherwise the sleep may never awaken.
+ */
+void __weak task_isolation_wait(void)
+{
+ cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In task_isolation mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two task_isolation processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void task_isolation_enter(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ /* Quieten the vmstat worker so it won't interrupt us. */
+ quiet_vmstat();
+
+ while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+ task_isolation_wait();
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: task_isolation task unblocked after %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ dump_stack();
+ }
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..c7024be2d79b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_TASK_ISOLATION
+ case PR_SET_TASK_ISOLATION:
+ me->task_isolation_flags = arg2;
+ break;
+ case PR_GET_TASK_ISOLATION:
+ error = me->task_isolation_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
--
2.1.2

2015-08-25 19:56:36

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves. In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies. Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.
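
As an illustrative sketch (not part of this patch; run_isolated_loop()
is a hypothetical application function that makes no syscalls and
touches only pre-faulted memory), the intended flow looks like:

#include <sys/prctl.h>

static void run_strict_section(void)
{
	/* Enter strict mode: any syscall or fault is now a violation. */
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT);

	run_isolated_loop();

	/* prctl() itself is exempt, so we can leave strict mode again. */
	prctl(PR_SET_TASK_ISOLATION, 0);
}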

This change adds the syscall-detection hooks only for x86, arm64,
and tile.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/arm64/kernel/ptrace.c | 5 +++++
arch/tile/kernel/ptrace.c | 5 ++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/isolation.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++
8 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..e3d83a12f3cf 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
#include <linux/regset.h>
#include <linux/tracehook.h>
#include <linux/elf.h>
+#include <linux/isolation.h>

#include <asm/compat.h>
#include <asm/debug-monitors.h>
@@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,

asmlinkage int syscall_trace_enter(struct pt_regs *regs)
{
+ /* Ensure we report task_isolation violations in all circumstances. */
+ if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
+ task_isolation_syscall(regs->syscallno);
+
/* Do the secure computing check first; failures should be fast. */
if (secure_computing() == -1)
return -1;
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..c327cb918a44 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (task_isolation_strict())
+ task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..2f9ce9466daf 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (task_isolation_strict())
+ task_isolation_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..e0ac0228fea1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/isolation.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);

@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (task_isolation_strict())
+ task_isolation_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index fd04011b1c1e..27a4469831c1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void)
}

extern void task_isolation_enter(void);
+extern void task_isolation_syscall(int nr);
+extern void task_isolation_exception(void);
extern void task_isolation_wait(void);
#else
static inline bool task_isolation_enabled(void) { return false; }
static inline void task_isolation_enter(void) { }
+static inline void task_isolation_syscall(int nr) { }
+static inline void task_isolation_exception(void) { }
#endif

+static inline bool task_isolation_strict(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->task_isolation_flags &
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 79da784fe17a..e16e13911e8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_TASK_ISOLATION 47
#define PR_GET_TASK_ISOLATION 48
# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+# define PR_TASK_ISOLATION_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index c57c99f5c4d7..17a71f7b66b8 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
context_tracking_recursion_exit();
out_irq_restore:
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/isolation.c b/kernel/isolation.c
index d4618cd9e23d..a89a6e9adfb4 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -10,6 +10,7 @@
#include <linux/swap.h>
#include <linux/vmstat.h>
#include <linux/isolation.h>
+#include <asm/unistd.h>
#include "time/tick-sched.h"

/*
@@ -73,3 +74,40 @@ void task_isolation_enter(void)
dump_stack();
}
}
+
+static void kill_task_isolation_strict_task(void)
+{
+ dump_stack();
+ current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void task_isolation_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_task_isolation_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(void)
+{
+ pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_task_isolation_strict_task();
+}
--
2.1.2

2015-08-25 19:57:29

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 4/6] task_isolation: provide strict mode configurable signal

Allow userspace to override the default SIGKILL delivered
when a task_isolation process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
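
For example (a sketch, not part of the patch; the function names are
illustrative and the PR_ constants come from the patched
<uapi/linux/prctl.h>), a task could ask for SIGUSR1 and use si_code
to tell syscalls from other kernel entries:

#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>

static void strict_handler(int sig, siginfo_t *info, void *uc)
{
	/* Per this patch, si_code is 1 when a syscall caused the
	 * violation and 0 for any other synchronous kernel entry
	 * (e.g. a page fault). The kernel has already cleared
	 * PR_TASK_ISOLATION_ENABLE by the time we get here. */
}

static void enter_strict_with_sigusr1(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = strict_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGUSR1, &sa, NULL);

	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
}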

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/isolation.c | 17 +++++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index e16e13911e8a..2a4ddc890e22 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_TASK_ISOLATION 48
# define PR_TASK_ISOLATION_ENABLE (1 << 0)
# define PR_TASK_ISOLATION_STRICT (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index a89a6e9adfb4..b776aa632c8f 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -75,11 +75,20 @@ void task_isolation_enter(void)
}
}

-static void kill_task_isolation_strict_task(void)
+static void kill_task_isolation_strict_task(int is_syscall)
{
+ siginfo_t info = {};
+ int sig;
+
dump_stack();
current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
- send_sig(SIGKILL, current, 1);
+
+ sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
+ if (sig == 0)
+ sig = SIGKILL;
+ info.si_signo = sig;
+ info.si_code = is_syscall;
+ send_sig_info(sig, &info, current);
}

/*
@@ -98,7 +107,7 @@ void task_isolation_syscall(int syscall)

pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
current->comm, current->pid, syscall);
- kill_task_isolation_strict_task();
+ kill_task_isolation_strict_task(1);
}

/*
@@ -109,5 +118,5 @@ void task_isolation_exception(void)
{
pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
current->comm, current->pid);
- kill_task_isolation_strict_task();
+ kill_task_isolation_strict_task(0);
}
--
2.1.2

2015-08-25 19:56:44

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 5/6] task_isolation: add debug boot flag

The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode. Such processes should get no
interrupts from the kernel, and if they do, when this boot flag is
specified a kernel stack dump on the console is generated.

It's possible to use ftrace to simply detect whether a task_isolation
core has unexpectedly entered the kernel. But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code what remote core and context
is preparing to deliver an interrupt to a task_isolation core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
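
For example (the nohz_full cpu list here is illustrative), the flag
is simply appended to the kernel command line:

	nohz_full=1-7 task_isolation_debug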

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 7 +++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/isolation.h | 2 ++
kernel/irq_work.c | 5 ++++-
kernel/sched/core.c | 21 +++++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 7 +++++++
8 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f0459cd7b..934f172eb140 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3595,6 +3595,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.

+ task_isolation_debug [KNL]
+ In kernels built with CONFIG_TASK_ISOLATION and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_TASK_ISOLATION_ENABLE
+ and is running on a nohz_full core.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..a79325113105 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/isolation.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ task_isolation_debug(cpu);
+ }
}

/*
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 27a4469831c1..9f1747331a36 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -18,11 +18,13 @@ extern void task_isolation_enter(void);
extern void task_isolation_syscall(int nr);
extern void task_isolation_exception(void);
extern void task_isolation_wait(void);
+extern void task_isolation_debug(int cpu);
#else
static inline bool task_isolation_enabled(void) { return false; }
static inline void task_isolation_enter(void) { }
static inline void task_isolation_syscall(int nr) { }
static inline void task_isolation_exception(void) { }
+static inline void task_isolation_debug(int cpu) { }
#endif

static inline bool task_isolation_strict(void)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..745c2ea6a4e4 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/smp.h>
+#include <linux/isolation.h>
#include <asm/processor.h>


@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ task_isolation_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10081..0c4e4eba69b1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
#include <linux/binfmts.h>
#include <linux/context_tracking.h>
#include <linux/compiler.h>
+#include <linux/isolation.h>

#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -745,6 +746,26 @@ bool sched_can_stop_tick(void)
}
#endif /* CONFIG_NO_HZ_FULL */

+#ifdef CONFIG_TASK_ISOLATION
+/* Enable debugging of any interrupts of task_isolation cores. */
+static int task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+ task_isolation_debug_flag = true;
+ return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug(int cpu)
+{
+ if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) {
+ pr_err("Interrupt detected for task_isolation cpu %d\n", cpu);
+ dump_stack();
+ }
+}
+#endif
+
void sched_avg_update(struct rq *rq)
{
s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 836df8dac6cc..60e15e835b9e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_TASK_ISOLATION
+ /* If the task is being killed, don't complain about task_isolation. */
+ if (state & TASK_WAKEKILL)
+ t->task_isolation_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..b0bddff2693d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/isolation.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ task_isolation_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ task_isolation_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..ed762fec7265 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,8 +24,10 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
+#include <linux/context_tracking.h>
#include <linux/tick.h>
#include <linux/irq.h>
+#include <linux/isolation.h>

#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>
@@ -335,6 +337,11 @@ void irq_enter(void)
_local_bh_enable();
}

+ if (context_tracking_cpu_is_enabled() &&
+ context_tracking_in_user() &&
+ !in_interrupt())
+ task_isolation_debug(smp_processor_id());
+
__irq_enter();
}

--
2.1.2

2015-08-25 19:56:59

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 6/6] nohz: task_isolation: allow tick to be fully disabled

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on
running completely tickless, so don't bound the time_delta for such
processes. In addition, due to the way such processes quiesce by
waiting for the timer tick to stop prior to returning to userspace,
without this commit it won't be possible to use the task_isolation
mode at all.

Removing the 1-second cap was previously discussed (see link
below) and Thomas Gleixner observed that vruntime, load balancing
data, load accounting, and other things might be impacted.
Frederic Weisbecker similarly observed that allowing the tick to
be indefinitely deferred just meant that no one would ever fix the
underlying bugs. However, it's at least true that the mode proposed
in this patch can only be enabled on a nohz_full core by a process
requesting task_isolation mode, which may limit how important it is
to keep the scheduler data exactly correct, for example.

Paul McKenney observed that if we provide a mode where the 1Hz
fallback timer is removed, this will create an environment where new
code that relies on that tick will get punished, and we won't
silently forgive such assumptions, so it may also be worth it from
that perspective.

Finally, it's worth observing that the tile architecture has been
using similar code for its Zero-Overhead Linux for many years
(starting in 2008) and customers are very enthusiastic about the
resulting bare-metal performance on cores that are available to
run full Linux semantics on demand (crash, logging, shutdown, etc).
So these semantics are very useful if we can convince ourselves
that doing this is safe.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/time/tick-sched.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429e98c6..be296499b753 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/isolation.h>

#include <asm/irq_regs.h>

@@ -652,7 +653,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,

#ifdef CONFIG_NO_HZ_FULL
/* Limit the tick delta to the maximum scheduler deferment */
- if (!ts->inidle)
+ if (!ts->inidle && !task_isolation_enabled())
delta = min(delta, scheduler_tick_max_deferment());
#endif

--
2.1.2

2015-08-26 10:36:57

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode

Hi Chris,

On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote:
> With task_isolation mode, the task is in principle guaranteed not to
> be interrupted by the kernel, but only if it behaves. In particular,
> if it enters the kernel via system call, page fault, or any of a
> number of other synchronous traps, it may be unexpectedly exposed
> to long latencies. Add a simple flag that puts the process into
> a state where any such kernel entry is fatal.
>
> To allow the state to be entered and exited, we ignore the prctl()
> syscall so that we can clear the bit again later, and we ignore
> exit/exit_group to allow exiting the task without a pointless signal
> killing you as you try to do so.
>
> This change adds the syscall-detection hooks only for x86, arm64,
> and tile.
>
> The signature of context_tracking_exit() changes to report whether
> we, in fact, are exiting back to user space, so that we can track
> user exceptions properly separately from other kernel entries.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> arch/arm64/kernel/ptrace.c | 5 +++++
> arch/tile/kernel/ptrace.c | 5 ++++-
> arch/x86/kernel/ptrace.c | 2 ++
> include/linux/context_tracking.h | 11 ++++++++---
> include/linux/isolation.h | 16 ++++++++++++++++
> include/uapi/linux/prctl.h | 1 +
> kernel/context_tracking.c | 9 ++++++---
> kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++
> 8 files changed, 80 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index d882b833dbdb..e3d83a12f3cf 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -37,6 +37,7 @@
> #include <linux/regset.h>
> #include <linux/tracehook.h>
> #include <linux/elf.h>
> +#include <linux/isolation.h>
>
> #include <asm/compat.h>
> #include <asm/debug-monitors.h>
> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>
> asmlinkage int syscall_trace_enter(struct pt_regs *regs)
> {
> + /* Ensure we report task_isolation violations in all circumstances. */
> + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())

This is going to force us to check TIF_NOHZ on the syscall slowpath even
when CONFIG_TASK_ISOLATION=n.

> + task_isolation_syscall(regs->syscallno);
> +
> /* Do the secure computing check first; failures should be fast. */

Here we have the usual priority problems with all the subsystems that
hook into the syscall path. If a prctl is later rewritten to a different
syscall, do you care about catching it? Either way, the comment about
doing secure computing "first" needs fixing.

Cheers,

Will

2015-08-26 15:10:52

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode

On 08/26/2015 06:36 AM, Will Deacon wrote:
> Hi Chris,
>
> On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote:
>> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
>> index d882b833dbdb..e3d83a12f3cf 100644
>> --- a/arch/arm64/kernel/ptrace.c
>> +++ b/arch/arm64/kernel/ptrace.c
>> @@ -37,6 +37,7 @@
>> #include <linux/regset.h>
>> #include <linux/tracehook.h>
>> #include <linux/elf.h>
>> +#include <linux/isolation.h>
>>
>> #include <asm/compat.h>
>> #include <asm/debug-monitors.h>
>> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>>
>> asmlinkage int syscall_trace_enter(struct pt_regs *regs)
>> {
>> + /* Ensure we report task_isolation violations in all circumstances. */
>> + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
> This is going to force us to check TIF_NOHZ on the syscall slowpath even
> when CONFIG_TASK_ISOLATION=n.

Yes, good catch. I was thinking the "&& false" would suppress the TIF
test but I forgot that test_bit() takes a volatile argument, so it gets
evaluated even though the result isn't actually used.

But I don't want to just reorder the two tests, because when isolation
is enabled, testing TIF_NOHZ first is better. I think probably the right
solution is just to put an #ifdef CONFIG_TASK_ISOLATION around that
test, even though that is a little crufty. The alternative is to provide
a task_isolation_configured() macro that just returns true or false, and
make it a three-part "&&" test with that new macro first, but
that seems a little crufty as well. Do you have a preference?
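
For concreteness, a sketch of that second alternative (not in the
posted patch; the macro name is the one suggested above):

static inline bool task_isolation_configured(void)
{
	return IS_ENABLED(CONFIG_TASK_ISOLATION);
}

/* ...and then in syscall_trace_enter(): */
if (task_isolation_configured() && test_thread_flag(TIF_NOHZ) &&
    task_isolation_strict())
	task_isolation_syscall(regs->syscallno);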

>> + task_isolation_syscall(regs->syscallno);
>> +
>> /* Do the secure computing check first; failures should be fast. */
> Here we have the usual priority problems with all the subsystems that
> hook into the syscall path. If a prctl is later rewritten to a different
> syscall, do you care about catching it? Either way, the comment about
> doing secure computing "first" needs fixing.

I admit I am unclear on the utility of rewriting prctl. My instinct is that
we are trying to catch userspace invocations of prctl and allow them,
and fail almost everything else, so doing it pre-rewrite seems OK.

I'm not sure if it makes sense to catch it before or after the
secure computing check, though. On reflection maybe doing it
afterwards makes more sense - what do you think?

Thanks!

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-08-26 15:26:58

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v5 2/6] cpu_isolated: add initial support

On Wed, Aug 12, 2015 at 02:22:09PM -0400, Chris Metcalf wrote:
> On 08/12/2015 12:00 PM, Frederic Weisbecker wrote:
> >>+#ifdef CONFIG_CPU_ISOLATED
> >>+void cpu_isolated_wait(void)
> >>+{
> >>+ set_current_state(TASK_INTERRUPTIBLE);
> >>+ _cpu_idle();
> >>+ set_current_state(TASK_RUNNING);
> >>+}
> >I'm still uncomfortable with that. A wake up model could work?
>
> I don't know exactly what you have in mind. The theory is that
> at this point we're ready to return to user space and we're just
> waiting for a timer tick that is guaranteed to arrive, since there
> is something pending for the timer.

Hmm, ok I'm going to discuss that in the new version. One worry is that
it gets racy and we sleep there for ever.

>
> And, this is an arch-specific method anyway; the generic method
> is actually checking to see if a signal has been delivered,
> scheduling is needed, etc., each time around the loop, so if
> you're not sure your architecture will do the right thing, just
> don't provide a method that idles while waiting. For tilegx I'm
> sure it works correctly, so I'm OK providing that method.

Yes but we do busy waiting on all other archs then. And since we can wait
for a while there, it doesn't look sane.

> >>diff --git a/include/linux/sched.h b/include/linux/sched.h
> >>index 04b5ada460b4..0bb248385d88 100644
> >>--- a/include/linux/sched.h
> >>+++ b/include/linux/sched.h
> >>@@ -1776,6 +1776,9 @@ struct task_struct {
> >> unsigned long task_state_change;
> >> #endif
> >> int pagefault_disabled;
> >>+#ifdef CONFIG_CPU_ISOLATED
> >>+ unsigned int cpu_isolated_flags;
> >>+#endif
> >Can't we add a new flag to tsk->flags? There seem to be some values remaining.
>
> Yeah, I thought of that, but it seems like a pretty scarce resource,
> and I wasn't sure it was the right thing to do. Also, I'm not actually
> sure why the lowest two bits aren't apparently being used

Probably they were used but got removed.

> looks
> like PF_EXITING (0x4) is the first bit used. And there are only three
> more bits higher up in the word that are not assigned.

Which makes room for 5 :)

>
> Also, right now we are allowing users to customize the signal delivered
> for STRICT violation, and that signal value is stored in the
> cpu_isolated_flags word as well, so we really don't have room in
> tsk->flags for all of that anyway.

Yeah indeed, ok lets keep it that way for now.

Thanks.

2015-08-26 15:55:37

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v5 2/6] cpu_isolated: add initial support

On 08/26/2015 11:26 AM, Frederic Weisbecker wrote:
> On Wed, Aug 12, 2015 at 02:22:09PM -0400, Chris Metcalf wrote:
>> On 08/12/2015 12:00 PM, Frederic Weisbecker wrote:
>>>> +#ifdef CONFIG_CPU_ISOLATED
>>>> +void cpu_isolated_wait(void)
>>>> +{
>>>> + set_current_state(TASK_INTERRUPTIBLE);
>>>> + _cpu_idle();
>>>> + set_current_state(TASK_RUNNING);
>>>> +}
>>> I'm still uncomfortable with that. A wake up model could work?
>> I don't know exactly what you have in mind. The theory is that
>> at this point we're ready to return to user space and we're just
>> waiting for a timer tick that is guaranteed to arrive, since there
>> is something pending for the timer.
> Hmm, ok I'm going to discuss that in the new version. One worry is that
> it gets racy and we sleep there for ever.
>
>> And, this is an arch-specific method anyway; the generic method
>> is actually checking to see if a signal has been delivered,
>> scheduling is needed, etc., each time around the loop, so if
>> you're not sure your architecture will do the right thing, just
>> don't provide a method that idles while waiting. For tilegx I'm
>> sure it works correctly, so I'm OK providing that method.
> Yes but we do busy waiting on all other archs then. And since we can wait
> for a while there, it doesn't look sane.

We can wait for a while (potentially multiple ticks), which is
certainly a long time, but that's what the user asked for.

Since we're checking signals and scheduling in the busy loop,
we definitely won't get into some nasty unkillable state, which
would be the real worst-case.

I think the question is, could a process just get stuck there
somehow in the normal course of events, where there is a
future event on the tick_cpu_device, but no interrupt is
enabled that will eventually deal with it? This seems like it
would be a pretty fundamental timekeeping bug, so my
assumption here is that it can't happen, but maybe...?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-08-28 15:31:42

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6.1 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves. In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies. Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.

This change adds the syscall-detection hooks only for x86, arm64,
and tile. For arm64 we use an explicit #ifdef CONFIG_TASK_ISOLATION
so we achieve both zero overhead for !TASK_ISOLATION and low
latency (testing TIF_NOHZ first) for TASK_ISOLATION.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <[email protected]>
---
This "v6.1" is just a tweak to the existing v6 series to reflect
Will Deacon's suggestions about the arm64 syscall entry code.
I've updated the git tree with this updated patch in the series.
A more disruptive change would be to capture the thread flags
up front like x86 and tile, which allows the test itself to be
optimized away if the task_isolation call becomes a no-op.
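
For reference, a rough sketch of that more disruptive variant (not
what this patch does; it mirrors the x86/tile pattern of loading the
thread flags into a local up front):

	unsigned long work = READ_ONCE(current_thread_info()->flags);

	/* 'work' would normally feed the other entry-work tests too;
	 * once it is a plain local, the whole test below can be
	 * optimized away when task_isolation_strict() is
	 * compile-time false. */
	if ((work & _TIF_NOHZ) && task_isolation_strict())
		task_isolation_syscall(regs->syscallno);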

arch/arm64/kernel/ptrace.c | 6 ++++++
arch/tile/kernel/ptrace.c | 5 ++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/isolation.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++
8 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..5d4284445f70 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
#include <linux/regset.h>
#include <linux/tracehook.h>
#include <linux/elf.h>
+#include <linux/isolation.h>

#include <asm/compat.h>
#include <asm/debug-monitors.h>
@@ -1154,6 +1155,11 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs)
if (secure_computing() == -1)
return -1;

+#ifdef CONFIG_TASK_ISOLATION
+ if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
+ task_isolation_syscall(regs->syscallno);
+#endif
+
if (test_thread_flag(TIF_SYSCALL_TRACE))
tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);

diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..c327cb918a44 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (task_isolation_strict())
+ task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..2f9ce9466daf 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (task_isolation_strict())
+ task_isolation_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..e0ac0228fea1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/isolation.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);

@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (task_isolation_strict())
+ task_isolation_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index fd04011b1c1e..27a4469831c1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void)
}

extern void task_isolation_enter(void);
+extern void task_isolation_syscall(int nr);
+extern void task_isolation_exception(void);
extern void task_isolation_wait(void);
#else
static inline bool task_isolation_enabled(void) { return false; }
static inline void task_isolation_enter(void) { }
+static inline void task_isolation_syscall(int nr) { }
+static inline void task_isolation_exception(void) { }
#endif

+static inline bool task_isolation_strict(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->task_isolation_flags &
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 79da784fe17a..e16e13911e8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_TASK_ISOLATION 47
#define PR_GET_TASK_ISOLATION 48
# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+# define PR_TASK_ISOLATION_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index c57c99f5c4d7..17a71f7b66b8 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
context_tracking_recursion_exit();
out_irq_restore:
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/isolation.c b/kernel/isolation.c
index d4618cd9e23d..a89a6e9adfb4 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -10,6 +10,7 @@
#include <linux/swap.h>
#include <linux/vmstat.h>
#include <linux/isolation.h>
+#include <asm/unistd.h>
#include "time/tick-sched.h"

/*
@@ -73,3 +74,40 @@ void task_isolation_enter(void)
dump_stack();
}
}
+
+static void kill_task_isolation_strict_task(void)
+{
+ dump_stack();
+ current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void task_isolation_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_task_isolation_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(void)
+{
+ pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_task_isolation_strict_task();
+}
--
2.1.2

2015-08-28 19:23:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v6 4/6] task_isolation: provide strict mode configurable signal

On Tue, Aug 25, 2015 at 12:55 PM, Chris Metcalf <[email protected]> wrote:
> Allow userspace to override the default SIGKILL delivered
> when a task_isolation process in STRICT mode does a syscall
> or otherwise synchronously enters the kernel.
>
> In addition to being able to set the signal, we now also
> pass whether or not the interruption was from a syscall in
> the si_code field of the siginfo.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> include/uapi/linux/prctl.h | 2 ++
> kernel/isolation.c | 17 +++++++++++++----
> 2 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index e16e13911e8a..2a4ddc890e22 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -195,5 +195,7 @@ struct prctl_mm_map {
> #define PR_GET_TASK_ISOLATION 48
> # define PR_TASK_ISOLATION_ENABLE (1 << 0)
> # define PR_TASK_ISOLATION_STRICT (1 << 1)
> +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
>
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> index a89a6e9adfb4..b776aa632c8f 100644
> --- a/kernel/isolation.c
> +++ b/kernel/isolation.c
> @@ -75,11 +75,20 @@ void task_isolation_enter(void)
> }
> }
>
> -static void kill_task_isolation_strict_task(void)
> +static void kill_task_isolation_strict_task(int is_syscall)
> {
> + siginfo_t info = {};
> + int sig;
> +
> dump_stack();
> current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
> - send_sig(SIGKILL, current, 1);
> +
> + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
> + if (sig == 0)
> + sig = SIGKILL;
> + info.si_signo = sig;
> + info.si_code = is_syscall;
> + send_sig_info(sig, &info, current);

The stuff you're doing here is sufficiently nasty that I think you
should add something like:

rcu_lockdep_assert(rcu_is_watching(), "some message here");

Because as it stands this is just asking for trouble.

For the record, I am *extremely* unhappy with the state of the context
tracking hooks.

--Andy