2015-05-08 17:58:56

by Chris Metcalf

Subject: [PATCH 0/6] support "dataplane" mode for nohz_full

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
they add no overhead to the usual non-nohz_full mode, and only very
small overhead to the typical nohz_full mode. A prctl() option
(PR_SET_DATAPLANE) is added to control whether processes have requested
these stricter semantics, and within that prctl() option we provide a
number of different bits for more precise control. Additionally, we add
a new command-line boot argument to make it easier to debug where
unexpected interrupts are being delivered from.
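
As a usage sketch (written against the uapi added by this series; the
fallback #defines are only needed until the new constants reach the
installed headers), a thread already affinitized to a nohz_full core
would opt in like this:

#include <sys/prctl.h>

#ifndef PR_SET_DATAPLANE
#define PR_SET_DATAPLANE	47	/* from <uapi/linux/prctl.h> in this series */
#define PR_DATAPLANE_ENABLE	(1 << 0)
#endif

/* Call from a thread already pinned to a nohz_full core. */
static int enter_dataplane(void)
{
	return prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE, 0, 0, 0);
}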

Conceptually similar code, known as Zero-Overhead Linux, has been in
use in Tilera's Multicore Development Environment since 2008 and has
seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on dataplane cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive on the other, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2,
in turn based on 4.1-rc1) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (6):
nohz_full: add support for "dataplane" mode
nohz: dataplane: allow tick to be fully disabled for dataplane
dataplane nohz: run softirqs synchronously on user entry
nohz: support PR_DATAPLANE_QUIESCE
nohz: support PR_DATAPLANE_STRICT mode
nohz: add dataplane_debug boot flag

Documentation/kernel-parameters.txt | 6 ++
arch/tile/mm/homecache.c | 5 +-
include/linux/sched.h | 3 +
include/linux/tick.h | 12 ++++
include/uapi/linux/prctl.h | 8 +++
kernel/context_tracking.c | 3 +
kernel/irq_work.c | 4 +-
kernel/sched/core.c | 18 ++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 15 ++++-
kernel/sys.c | 8 +++
kernel/time/tick-sched.c | 112 +++++++++++++++++++++++++++++++++++-
13 files changed, 198 insertions(+), 5 deletions(-)

--
2.1.2


2015-05-08 17:59:05

by Chris Metcalf

Subject: [PATCH 1/6] nohz_full: add support for "dataplane" mode

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The dataplane state is indicated by setting a new task struct
field, dataplane_flags, to the value passed by prctl(). When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_dataplane_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

For this first patch, the only action taken is to call
lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core.

Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/sched.h | 3 +++
include/linux/tick.h | 10 ++++++++++
include/uapi/linux/prctl.h | 5 +++++
kernel/context_tracking.c | 3 +++
kernel/sys.c | 8 ++++++++
kernel/time/tick-sched.c | 13 +++++++++++++
6 files changed, 42 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..3680aa07c9ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
#endif
+#ifdef CONFIG_NO_HZ_FULL
+ unsigned int dataplane_flags;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..d191cda9b71a 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
#include <linux/context_tracking_state.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
+#include <linux/prctl.h>

#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}

+static inline bool tick_nohz_is_dataplane(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->dataplane_flags & PR_DATAPLANE_ENABLE);
+}
+
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_dataplane_enter(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick(void) { }
static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_dataplane(void) { return false; }
+static inline void tick_nohz_dataplane_enter(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..1aa8fa8a8b05 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query dataplane mode for NO_HZ_FULL kernels. */
+#define PR_SET_DATAPLANE 47
+#define PR_GET_DATAPLANE 48
+# define PR_DATAPLANE_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..dd6bdd6197b6 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/tick.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (tick_nohz_is_dataplane())
+ tick_nohz_dataplane_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..930b750aefde 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_NO_HZ_FULL
+ case PR_SET_DATAPLANE:
+ me->dataplane_flags = arg2;
+ break;
+ case PR_GET_DATAPLANE:
+ error = me->dataplane_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..31c674719647 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/swap.h>

#include <asm/irq_regs.h>

@@ -389,6 +390,18 @@ void __init tick_nohz_init(void)
pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
cpumask_pr_args(tick_nohz_full_mask));
}
+
+/*
+ * When returning to userspace on a nohz_full core after doing
+ * prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE), we come here and try
+ * more aggressively to prevent this core from being interrupted later.
+ */
+void tick_nohz_dataplane_enter(void)
+{
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+}
+
#endif

/*
--
2.1.2

2015-05-08 17:59:07

by Chris Metcalf

Subject: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.

This was previously discussed in

https://lkml.org/lkml/2014/10/31/364

and Thomas Gleixner observed that vruntime, load balancing data,
load accounting, and other things might be impacted. Frederic
Weisbecker similarly observed that allowing the tick to be indefinitely
deferred just meant that no one would ever fix the underlying bugs.
However it's at least true that the mode proposed in this patch can
only be enabled on an isolcpus core, which may limit how important
it is to maintain scheduler data correctly, for example.

It's also worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2005) and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc). So these semantics are very
useful if we can convince ourselves that doing this is safe.

Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 31c674719647..25fdd6bdd1eb 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -644,7 +644,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
}

#ifdef CONFIG_NO_HZ_FULL
- if (!ts->inidle) {
+ if (!ts->inidle && !tick_nohz_is_dataplane()) {
time_delta = min(time_delta,
scheduler_tick_max_deferment());
}
--
2.1.2

2015-05-08 17:59:12

by Chris Metcalf

Subject: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

For tasks which have elected dataplane functionality, we run
any pending softirqs for the core before returning to userspace,
rather than ever scheduling ksoftirqd to run. The problem we
fix is that by allowing another task to run on the core, we
guarantee more interrupts in the future to the dataplane task,
which is exactly what dataplane mode is required to prevent.

This may be an alternate approach to what Mike Galbraith
recently proposed in e.g.:

https://lkml.org/lkml/2015/3/13/11

Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/softirq.c | 14 +++++++++++++-
kernel/time/tick-sched.c | 26 +++++++++++++++++++++++++-
2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..bc9406337f82 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -291,6 +291,15 @@ restart:
--max_restart)
goto restart;

+ /*
+ * For dataplane tasks, waking ksoftirqd because the
+ * softirqs are slow is a bad idea; we would rather
+ * synchronously finish whatever is interrupting us,
+ * and then be able to cleanly enter dataplane mode.
+ */
+ if (tick_nohz_is_dataplane())
+ goto restart;
+
wakeup_softirqd();
}

@@ -410,8 +419,11 @@ inline void raise_softirq_irqoff(unsigned int nr)
*
* Otherwise we wake up ksoftirqd to make sure we
* schedule the softirq soon.
+ *
+ * For dataplane tasks, we will handle the softirq
+ * synchronously on return to userspace.
*/
- if (!in_interrupt())
+ if (!in_interrupt() && !tick_nohz_is_dataplane())
wakeup_softirqd();
}

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 25fdd6bdd1eb..fd0e6e5c931c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -398,8 +398,26 @@ void __init tick_nohz_init(void)
*/
void tick_nohz_dataplane_enter(void)
{
+ /*
+ * Check for softirqs as close as possible to our return to
+ * userspace, and run any that are waiting. We need to ensure
* that we can safely avoid running ksoftirqd, which would cause
+ * interrupts for nohz_full tasks. Note that interrupts may
+ * be enabled internally by do_softirq().
+ */
+ do_softirq();
+
/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
lru_add_drain();
+
+ /*
+ * Disable interrupts again since other code running in this
+ * function may have enabled them, and the caller expects
+ * interrupts to be disabled on return. Enabling them during
+ * this call is safe since the caller is not assuming any
+ * state that might have been altered by an interrupt.
+ */
+ local_irq_disable();
}

#endif
@@ -771,7 +789,13 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
if (need_resched())
return false;

- if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
+ /*
+ * If we are running dataplane for this process, don't worry
+ * about pending softirqs; we will force them to run
+ * synchronously before returning to userspace.
+ */
+ if (unlikely(local_softirq_pending() && cpu_online(cpu) &&
+ !tick_nohz_is_dataplane())) {
static int ratelimit;

if (ratelimit < 10 &&
--
2.1.2

2015-05-08 18:00:04

by Chris Metcalf

Subject: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
kernel to quiesce any pending timer interrupts prior to returning
to userspace. When running with this mode set, system calls (and page
faults, etc.) can be inordinately slow. However, user applications
that want to guarantee that no unexpected interrupts will occur
(even if they call into the kernel) can set this flag to guarantee
those semantics.
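
As a sketch of the intended usage (the flag names are from this
series; poll_device_rings() stands in for the application's real
fast-path work):

	/* Opt in: from now on, each return to userspace first waits
	 * for any pending timer interrupts to quiesce.
	 */
	prctl(PR_SET_DATAPLANE,
	      PR_DATAPLANE_ENABLE | PR_DATAPLANE_QUIESCE, 0, 0, 0);

	for (;;)
		poll_device_rings();	/* hypothetical userspace driver loop */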

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 1 +
kernel/time/tick-sched.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 1aa8fa8a8b05..8b735651304a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_DATAPLANE 47
#define PR_GET_DATAPLANE 48
# define PR_DATAPLANE_ENABLE (1 << 0)
+# define PR_DATAPLANE_QUIESCE (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fd0e6e5c931c..69d908c6cef8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -392,6 +392,53 @@ void __init tick_nohz_init(void)
}

/*
+ * We normally return immediately to userspace.
+ *
+ * The PR_DATAPLANE_QUIESCE flag causes us to wait until no more
+ * interrupts are pending. Otherwise we nap with interrupts enabled
+ * and wait for the next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two processes on the same core and both
+ * specify PR_DATAPLANE_QUIESCE, neither will ever leave the kernel,
+ * and one will have to be killed manually. Otherwise in situations
+ * where another process is in the runqueue on this cpu, this task
+ * will just wait for that other task to go idle before returning to
+ * user space.
+ */
+static void dataplane_quiesce(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: dataplane task blocked for %ld jiffies\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start));
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+
+ /* Idle with interrupts enabled and wait for the tick. */
+ set_current_state(TASK_INTERRUPTIBLE);
+ arch_cpu_idle();
+ set_current_state(TASK_RUNNING);
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: dataplane task unblocked after %ld jiffies\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start));
+ dump_stack();
+ }
+}
+
+/*
* When returning to userspace on a nohz_full core after doing
* prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE), we come here and try
* more aggressively to prevent this core from being interrupted later.
@@ -411,6 +458,13 @@ void tick_nohz_dataplane_enter(void)
lru_add_drain();

/*
+ * Quiesce any timer ticks if requested. On return from this
+ * function, no timer ticks are pending.
+ */
+ if ((current->dataplane_flags & PR_DATAPLANE_QUIESCE) != 0)
+ dataplane_quiesce();
+
+ /*
* Disable interrupts again since other code running in this
* function may have enabled them, and the caller expects
* interrupts to be disabled on return. Enabling them during
--
2.1.2

2015-05-08 17:59:19

by Chris Metcalf

Subject: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

With QUIESCE mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves. In particular,
if it enters the kernel via system call, page fault, or any of
a number of other synchronous traps, it may be unexpectedly
exposed to long latencies. Add a simple flag that puts the process
into a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we add an internal
bit to current->dataplane_flags that is set when prctl() sets the
flags. That way, when we are exiting the kernel after calling
prctl() to forbid future kernel entries, we don't get immediately
killed.
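
As a sketch of the intended usage (flag names are from this series;
the setup and fast-path functions are hypothetical):

	/* Setup phase: kernel entries are still allowed here. */
	map_device_buffers();

	/* After the next return to userspace, any system call, page
	 * fault, or other synchronous kernel entry is fatal.
	 */
	prctl(PR_SET_DATAPLANE,
	      PR_DATAPLANE_ENABLE | PR_DATAPLANE_QUIESCE |
	      PR_DATAPLANE_STRICT, 0, 0, 0);

	run_fast_path();	/* must never enter the kernel */

	/* Leaving strict mode is itself a syscall, but it survives
	 * because prctl() updates the flags (and sets the internal
	 * PR_DATAPLANE_PRCTL bit) before the exit-path check runs.
	 */
	prctl(PR_SET_DATAPLANE,
	      PR_DATAPLANE_ENABLE | PR_DATAPLANE_QUIESCE, 0, 0, 0);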

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/sys.c | 2 +-
kernel/time/tick-sched.c | 17 +++++++++++++++++
3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 8b735651304a..9cf79aa1e73f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_DATAPLANE 48
# define PR_DATAPLANE_ENABLE (1 << 0)
# define PR_DATAPLANE_QUIESCE (1 << 1)
+# define PR_DATAPLANE_STRICT (1 << 2)
+# define PR_DATAPLANE_PRCTL (1U << 31) /* kernel internal */

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 930b750aefde..8102433c9edd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2245,7 +2245,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
break;
#ifdef CONFIG_NO_HZ_FULL
case PR_SET_DATAPLANE:
- me->dataplane_flags = arg2;
+ me->dataplane_flags = arg2 | PR_DATAPLANE_PRCTL;
break;
case PR_GET_DATAPLANE:
error = me->dataplane_flags;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 69d908c6cef8..22ed0decb363 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
(jiffies - start));
dump_stack();
}
+
+ /*
+ * Kill the process if it violates STRICT mode. Note that this
+ * code also results in killing the task if a kernel bug causes an
+ * irq to be delivered to this core.
+ */
+ if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
+ == PR_DATAPLANE_STRICT) {
+ pr_warn("Dataplane STRICT mode violated; process killed.\n");
+ dump_stack();
+ task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
+ local_irq_enable();
+ do_group_exit(SIGKILL);
+ }
}

/*
@@ -464,6 +478,9 @@ void tick_nohz_dataplane_enter(void)
if ((current->dataplane_flags & PR_DATAPLANE_QUIESCE) != 0)
dataplane_quiesce();

+ /* Clear the bit set by prctl() when it updates the flags. */
+ current->dataplane_flags &= ~PR_DATAPLANE_PRCTL;
+
/*
* Disable interrupts again since other code running in this
* function may have enabled them, and the caller expects
--
2.1.2

2015-05-08 17:59:27

by Chris Metcalf

Subject: [PATCH 6/6] nohz: add dataplane_debug boot flag

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_DATAPLANE_QUIESCE mode. Such processes should
get no interrupts from the kernel; if they do, and this boot flag
is specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a dataplane core
has unexpectedly entered the kernel. But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a dataplane core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
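
For example, a box dedicating cores 2-15 to dataplane tasks might boot
with something like the following (the cpu list is illustrative):

	nohz_full=2-15 isolcpus=2-15 dataplane_debug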

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 6 ++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/tick.h | 2 ++
kernel/irq_work.c | 4 +++-
kernel/sched/core.c | 18 ++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 1 +
8 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f6befa9855c1..5c5af5258e17 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -794,6 +794,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
dasd= [HW,NET]
See header of drivers/s390/block/dasd_devmap.c.

+ dataplane_debug [KNL]
+ In kernels built with CONFIG_NO_HZ_FULL and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_DATAPLANE_QUIESCE.
+
db9.dev[2|3]= [HW,JOY] Multisystem joystick support via parallel port
(one device per port)
Format: <port#>,<type>
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..dd5ec7eca9a8 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/tick.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ tick_nohz_dataplane_debug(cpu);
+ }
}

/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index d191cda9b71a..4610cdf0f972 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,6 +147,7 @@ extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_dataplane_enter(void);
+extern void tick_nohz_dataplane_debug(int cpu);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -157,6 +158,7 @@ static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
static inline bool tick_nohz_is_dataplane(void) { return false; }
static inline void tick_nohz_dataplane_enter(void) { }
+static inline void tick_nohz_dataplane_debug(int cpu) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..0adc53c4e899 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ tick_nohz_dataplane_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..202fab0c41cb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,24 @@ bool sched_can_stop_tick(void)

return true;
}
+
+/* Enable debugging of any interrupts of dataplane cores. */
+static int dataplane_debug;
+static int __init dataplane_debug_func(char *str)
+{
+ dataplane_debug = true;
+ return 1;
+}
+__setup("dataplane_debug", dataplane_debug_func);
+
+void tick_nohz_dataplane_debug(int cpu)
+{
+ if (dataplane_debug && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->dataplane_flags & PR_DATAPLANE_QUIESCE)) {
+ pr_err("Interrupt detected for dataplane cpu %d\n", cpu);
+ dump_stack();
+ }
+}
#endif /* CONFIG_NO_HZ_FULL */

void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..ebc552cafff5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_NO_HZ_FULL
+ /* If the task is being killed, don't complain about dataplane. */
+ if (state & TASK_WAKEKILL)
+ t->dataplane_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..9518fc80321b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/tick.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ tick_nohz_dataplane_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ tick_nohz_dataplane_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bc9406337f82..eeacabf08ca6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -394,6 +394,7 @@ void irq_exit(void)
WARN_ON_ONCE(!irqs_disabled());
#endif

+ tick_nohz_dataplane_debug(smp_processor_id());
account_irq_exit_time(current);
preempt_count_sub(HARDIRQ_OFFSET);
if (!in_interrupt() && local_softirq_pending())
--
2.1.2

2015-05-08 21:18:28

by Andrew Morton

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:

> A prctl() option (PR_SET_DATAPLANE) is added

Dumb question: what does the term "dataplane" mean in this context? I
can't see the relationship between those words and what this patch
does.

2015-05-08 21:22:16

by Steven Rostedt

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Fri, 8 May 2015 14:18:24 -0700
Andrew Morton <[email protected]> wrote:

> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
>
> > A prctl() option (PR_SET_DATAPLANE) is added
>
> Dumb question: what does the term "dataplane" mean in this context? I
> can't see the relationship between those words and what this patch
> does.

I was thinking the same thing. I haven't gotten around to searching
DATAPLANE yet.

I would assume we want a name that is more meaningful for what is
happening.

-- Steve

2015-05-08 23:11:28

by Chris Metcalf

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> On Fri, 8 May 2015 14:18:24 -0700
> Andrew Morton <[email protected]> wrote:
>
>> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
>>
>>> A prctl() option (PR_SET_DATAPLANE) is added
>> Dumb question: what does the term "dataplane" mean in this context? I
>> can't see the relationship between those words and what this patch
>> does.
> I was thinking the same thing. I haven't gotten around to searching
> DATAPLANE yet.
>
> I would assume we want a name that is more meaningful for what is
> happening.

The text in the commit message and the 0/6 cover letter do try to explain
the concept. The terminology comes, I think, from networking line cards,
where the "dataplane" is the part of the application that handles all the
fast path processing of network packets, and the "control plane" is the part
that handles routing updates, etc., generally slow-path stuff. I've probably
just been using the terms so long they seem normal to me.

That said, what would be clearer? NO_HZ_STRICT as a superset of
NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all,
we're talking about no interrupts of any kind, and maybe NO_HZ is too
limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
to vendors who ship bare-metal runtimes and call it BARE_METAL?
Borrow the Tilera marketing name and call it ZERO_OVERHEAD?

Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
of course :-)

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-08 23:19:15

by Andrew Morton

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <[email protected]> wrote:

> On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > On Fri, 8 May 2015 14:18:24 -0700
> > Andrew Morton <[email protected]> wrote:
> >
> >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
> >>
> >>> A prctl() option (PR_SET_DATAPLANE) is added
> >> Dumb question: what does the term "dataplane" mean in this context? I
> >> can't see the relationship between those words and what this patch
> >> does.
> > I was thinking the same thing. I haven't gotten around to searching
> > DATAPLANE yet.
> >
> > I would assume we want a name that is more meaningful for what is
> > happening.
>
> The text in the commit message and the 0/6 cover letter do try to explain
> the concept. The terminology comes, I think, from networking line cards,
> where the "dataplane" is the part of the application that handles all the
> fast path processing of network packets, and the "control plane" is the part
> that handles routing updates, etc., generally slow-path stuff. I've probably
> just been using the terms so long they seem normal to me.
>
> That said, what would be clearer? NO_HZ_STRICT as a superset of
> NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all,
> we're talking about no interrupts of any kind, and maybe NO_HZ is too
> limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
> to vendors who ship bare-metal runtimes and call it BARE_METAL?
> Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
>
> Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> of course :-)

I like NO_INTERRUPTS. Simple, direct.

2015-05-09 07:04:09

by Mike Galbraith

Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
> For tasks which have elected dataplane functionality, we run
> any pending softirqs for the core before returning to userspace,
> rather than ever scheduling ksoftirqd to run. The problem we
> fix is that by allowing another task to run on the core, we
> guarantee more interrupts in the future to the dataplane task,
> which is exactly what dataplane mode is required to prevent.

If ksoftirqd were rt class, softirqs would be gone when the soloist gets
the CPU back and heads to userspace. Being a soloist, it has no use for
a priority, so why can't it just let ksoftirqd run if it raises the
occasional softirq? Meeting a contended lock while processing it will
wreck the soloist regardless of who does that processing.

-Mike

2015-05-09 07:05:46

by Ingo Molnar

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full


* Andrew Morton <[email protected]> wrote:

> On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <[email protected]> wrote:
>
> > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > > On Fri, 8 May 2015 14:18:24 -0700
> > > Andrew Morton <[email protected]> wrote:
> > >
> > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
> > >>
> > >>> A prctl() option (PR_SET_DATAPLANE) is added
> > >> Dumb question: what does the term "dataplane" mean in this context? I
> > >> can't see the relationship between those words and what this patch
> > >> does.
> > > I was thinking the same thing. I haven't gotten around to searching
> > > DATAPLANE yet.
> > >
> > > I would assume we want a name that is more meaningful for what is
> > > happening.
> >
> > The text in the commit message and the 0/6 cover letter do try to explain
> > the concept. The terminology comes, I think, from networking line cards,
> > where the "dataplane" is the part of the application that handles all the
> > fast path processing of network packets, and the "control plane" is the part
> > that handles routing updates, etc., generally slow-path stuff. I've probably
> > just been using the terms so long they seem normal to me.
> >
> > That said, what would be clearer? NO_HZ_STRICT as a superset of
> > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all,
> > we're talking about no interrupts of any kind, and maybe NO_HZ is too
> > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
> > to vendors who ship bare-metal runtimes and call it BARE_METAL?
> > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
> >
> > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> > of course :-)

'baremetal' has uses in virtualization speak, so I think that would be
confusing.

> I like NO_INTERRUPTS. Simple, direct.

NO_HZ_PURE?

That's what it's really about: user-space wants to run exclusively, in
pure user-mode, without any interrupts.

So I don't like 'NO_HZ_NO_INTERRUPTS', for several reasons:

- It is similar to a term we use in perf: PERF_PMU_CAP_NO_INTERRUPT.

- Another reason is that 'NO_INTERRUPTS', in most existing uses in
the kernel, relates to some sort of hardware weakness or
limitation, a negative property: we try to limp along without
having a hardware interrupt and have to poll. Other driver code
that uses variants of NO_INTERRUPT appears to be similar. So I
think there's some potential for confusion here.

- Here the fact that we don't disturb user-space is an absolutely
positive property, not a limitation, a kernel feature we work hard
to achieve. NO_HZ_PURE would convey that while NO_HZ_NO_INTERRUPTS
wouldn't.

- NO_HZ_NO_INTERRUPTS has a double negation, and it's also too long,
compared to NO_HZ_FULL or NO_HZ_PURE ;-) The term 'no HZ' already
expresses that we don't have periodic interruptions. We just
duplicate that information with NO_HZ_NO_INTERRUPTS, while
NO_HZ_FULL or NO_HZ_PURE qualifies it, makes it a stronger
property - which is what we want I think.

So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep
it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be
such a 'zero overhead' mode of operation, where if user-space runs, it
won't get interrupted in any way.

There's no need to add yet another Kconfig variant - let's just enhance
the current stuff and maybe rename it to NO_HZ_PURE to better express
its intent.

Thanks,

Ingo

2015-05-09 07:19:31

by Andy Lutomirski

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Sat, May 9, 2015 at 12:05 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
>> On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <[email protected]> wrote:
>>
>> > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
>> > > On Fri, 8 May 2015 14:18:24 -0700
>> > > Andrew Morton <[email protected]> wrote:
>> > >
>> > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
>> > >>
>> > >>> A prctl() option (PR_SET_DATAPLANE) is added
>> > >> Dumb question: what does the term "dataplane" mean in this context? I
>> > >> can't see the relationship between those words and what this patch
>> > >> does.
>> > > I was thinking the same thing. I haven't gotten around to searching
>> > > DATAPLANE yet.
>> > >
>> > > I would assume we want a name that is more meaningful for what is
>> > > happening.
>> >
>> > The text in the commit message and the 0/6 cover letter do try to explain
>> > the concept. The terminology comes, I think, from networking line cards,
>> > where the "dataplane" is the part of the application that handles all the
>> > fast path processing of network packets, and the "control plane" is the part
>> > that handles routing updates, etc., generally slow-path stuff. I've probably
>> > just been using the terms so long they seem normal to me.
>> >
>> > That said, what would be clearer? NO_HZ_STRICT as a superset of
>> > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all,
>> > we're talking about no interrupts of any kind, and maybe NO_HZ is too
>> > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
>> > to vendors who ship bare-metal runtimes and call it BARE_METAL?
>> > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
>> >
>> > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
>> > of course :-)
>
> 'baremetal' has uses in virtualization speak, so I think that would be
> confusing.
>
>> I like NO_INTERRUPTS. Simple, direct.
>
> NO_HZ_PURE?
>

Naming aside, I don't think this should be a per-task flag at all. We
already have way too much overhead per syscall in nohz mode, and it
would be nice to get the per-syscall overhead as low as possible. We
should strive, for all tasks, to keep syscall overhead down *and*
avoid as many interrupts as possible.

That being said, I do see a legitimate use for a way to tell the
kernel "I'm going to run in userspace for a long time; stay away".
But shouldn't that be a single operation, not an ongoing flag? IOW, I
think that we should have a new syscall quiesce() or something rather
than a prctl.

--Andy

2015-05-09 07:19:51

by Mike Galbraith

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Sat, 2015-05-09 at 09:05 +0200, Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
>
> > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <[email protected]> wrote:
> >
> > > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > > > On Fri, 8 May 2015 14:18:24 -0700
> > > > Andrew Morton <[email protected]> wrote:
> > > >
> > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <[email protected]> wrote:
> > > >>
> > > >>> A prctl() option (PR_SET_DATAPLANE) is added
> > > >> Dumb question: what does the term "dataplane" mean in this context? I
> > > >> can't see the relationship between those words and what this patch
> > > >> does.
> > > > I was thinking the same thing. I haven't gotten around to searching
> > > > DATAPLANE yet.
> > > >
> > > > I would assume we want a name that is more meaningful for what is
> > > > happening.
> > >
> > > The text in the commit message and the 0/6 cover letter do try to explain
> > > the concept. The terminology comes, I think, from networking line cards,
> > > where the "dataplane" is the part of the application that handles all the
> > > fast path processing of network packets, and the "control plane" is the part
> > > that handles routing updates, etc., generally slow-path stuff. I've probably
> > > just been using the terms so long they seem normal to me.
> > >
> > > That said, what would be clearer? NO_HZ_STRICT as a superset of
> > > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all,
> > > we're talking about no interrupts of any kind, and maybe NO_HZ is too
> > > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
> > > to vendors who ship bare-metal runtimes and call it BARE_METAL?
> > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
> > >
> > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> > > of course :-)
>
> 'baremetal' has uses in virtualization speak, so I think that would be
> confusing.
>
> > I like NO_INTERRUPTS. Simple, direct.
>
> NO_HZ_PURE?

Hm, coke light, coke zero... OS_LIGHT and OS_ZERO?

-Mike

2015-05-09 07:29:10

by Andy Lutomirski

Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On May 8, 2015 11:44 PM, "Chris Metcalf" <[email protected]> wrote:
>
> With QUIESCE mode, the task is in principle guaranteed not to be
> interrupted by the kernel, but only if it behaves. In particular,
> if it enters the kernel via system call, page fault, or any of
> a number of other synchronous traps, it may be unexpectedly
> exposed to long latencies. Add a simple flag that puts the process
> into a state where any such kernel entry is fatal.
>
> To allow the state to be entered and exited, we add an internal
> bit to current->dataplane_flags that is set when prctl() sets the
> flags. That way, when we are exiting the kernel after calling
> prctl() to forbid future kernel exits, we don't get immediately
> killed.

Is there any reason this can't already be addressed in userspace using
/proc/interrupts or perf_events? ISTM the real goal here is to detect
when we screw up and fail to avoid an interrupt, and killing the task
seems like overkill to me.
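
For instance, a monitoring thread could diff a CPU's column of
/proc/interrupts around the critical section. A rough sketch (note
that single-count lines like ERR and MIS get lumped into CPU 0's
total):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum the /proc/interrupts counts for one CPU's column. */
static unsigned long long irq_total(int cpu)
{
	FILE *f = fopen("/proc/interrupts", "r");
	char line[4096];
	unsigned long long total = 0;

	if (!f)
		return 0;
	fgets(line, sizeof(line), f);	/* skip the "CPU0 CPU1 ..." header */
	while (fgets(line, sizeof(line), f)) {
		char *p = strchr(line, ':');
		int col;

		if (!p)
			continue;
		p++;
		for (col = 0; col <= cpu; col++) {
			char *end;
			unsigned long long v = strtoull(p, &end, 10);

			if (end == p)
				break;	/* ran into the description text */
			if (col == cpu)
				total += v;
			p = end;
		}
	}
	fclose(f);
	return total;
}

/* Snapshot irq_total(cpu) before and after the quiet period; any
 * delta means the core took an interrupt.
 */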

Also, can we please stop further torturing the exit paths? We have a
disaster of assembly code that calls into syscall_trace_leave and
do_notify_resume. Those functions, in turn, *both* call user_enter
(WTF?), and on very brief inspection user_enter makes it into the nohz
code through multiple levels of indirection, which, with these
patches, has yet another conditionally enabled helper, which does this
new stuff. It's getting to be impossible to tell what happens when we
exit to user space any more.

Also, I think your code is buggy. There's no particular guarantee
that user_enter is only called once between sys_prctl and the final
exit to user mode (see the above WTF), so you might spuriously kill
the process.

Also, I think that most users will be quite surprised if "strict
dataplane" code causes any machine check on the system to kill your
dataplane task. Similarly, a user accidentally running perf record -a
probably should have some reasonable semantics. /proc/interrupts gets
that right as is. Sure, MCEs will hurt your RT performance, but Intel
screwed up the way that MCEs work, so we should make do.

--Andy

2015-05-09 10:51:31

by Gilad Ben Yossef

Subject: RE: [PATCH 0/6] support "dataplane" mode for nohz_full


> From: Mike Galbraith [mailto:[email protected]]
> Sent: Saturday, May 09, 2015 10:20 AM
> To: Ingo Molnar
> Cc: Andrew Morton; Chris Metcalf; Steven Rostedt; Gilad Ben Yossef; Ingo
> Molnar; Peter Zijlstra; Rik van Riel; Tejun Heo; Frederic Weisbecker;
> Thomas Gleixner; Paul E. McKenney; Christoph Lameter; Srivatsa S. Bhat;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full
>
> On Sat, 2015-05-09 at 09:05 +0200, Ingo Molnar wrote:
> > * Andrew Morton <[email protected]> wrote:
> >
> > > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <[email protected]>
> wrote:
> > >
> > > > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > > > > On Fri, 8 May 2015 14:18:24 -0700
> > > > > Andrew Morton <[email protected]> wrote:
> > > > >
> > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf
> <[email protected]> wrote:
> > > > >>
> > > > >>> A prctl() option (PR_SET_DATAPLANE) is added
> > > > >> Dumb question: what does the term "dataplane" mean in this
> context? I
> > > > >> can't see the relationship between those words and what this
> patch
> > > > >> does.
> > > > > I was thinking the same thing. I haven't gotten around to
> searching
> > > > > DATAPLANE yet.
> > > > >
> > > > > I would assume we want a name that is more meaningful for what is
> > > > > happening.
> > > >
> > > > The text in the commit message and the 0/6 cover letter do try to
> explain
> > > > the concept. The terminology comes, I think, from networking line
> cards,
> > > > where the "dataplane" is the part of the application that handles
> all the
> > > > fast path processing of network packets, and the "control plane" is
> the part
> > > > that handles routing updates, etc., generally slow-path stuff. I've
> probably
> > > > just been using the terms so long they seem normal to me.
> > > >
> > > > That said, what would be clearer? NO_HZ_STRICT as a superset of
> > > > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after
> all,
> > > > we're talking about no interrupts of any kind, and maybe NO_HZ is
> too
> > > > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look
> > > > to vendors who ship bare-metal runtimes and call it BARE_METAL?
> > > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
> > > >
> > > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> > > > of course :-)
> >
> > 'baremetal' has uses in virtualization speak, so I think that would be
> > confusing.
> >
> > > I like NO_INTERRUPTS. Simple, direct.
> >
> > NO_HZ_PURE?
>
> Hm, coke light, coke zero... OS_LIGHT and OS_ZERO?
LOL... you forgot OS_CLASSIC for backwards compatibility :-)
How about TASK_SOLO?
Yes, you are trying to achieve the least amount of interference, but
the bigger context is about monopolizing a single CPU for yourself.

Anyway, it is worth pointing out that while NO_HZ_FULL is very useful
in conjunction with this, turning the tick off is also useful when you
have multiple tasks runnable (e.g. if you know you only need to context
switch in 100 ms, why keep a periodic interrupt running?), even though
we don't support that *right now*. It might be a good idea not to
entangle these concepts too much.

Gilad
Gilad Ben-Yossef
Chief Software Architect
EZchip Technologies Ltd.
37 Israel Pollak Ave, Kiryat Gat 82025, Israel
Tel: +972-4-959-6666 ext. 576, Fax: +972-8-681-1483
Mobile: +972-52-826-0388, US Mobile: +1-973-826-0388
Email: [email protected], Web: http://www.ezchip.com


2015-05-09 10:53:36

by Gilad Ben Yossef

Subject: RE: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

> From: Andy Lutomirski [mailto:[email protected]]
> Sent: Saturday, May 09, 2015 10:29 AM
> To: Chris Metcalf
> Cc: Srivatsa S. Bhat; Paul E. McKenney; Frederic Weisbecker; Ingo Molnar;
> Rik van Riel; [email protected]; Andrew Morton; linux-
> [email protected]; Thomas Gleixner; Tejun Heo; Peter Zijlstra; Steven
> Rostedt; Christoph Lameter; Gilad Ben Yossef; Linux API
> Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
>
> On May 8, 2015 11:44 PM, "Chris Metcalf" <[email protected]> wrote:
> >
> > With QUIESCE mode, the task is in principle guaranteed not to be
> > interrupted by the kernel, but only if it behaves. In particular,
> > if it enters the kernel via system call, page fault, or any of
> > a number of other synchronous traps, it may be unexpectedly
> > exposed to long latencies. Add a simple flag that puts the process
> > into a state where any such kernel entry is fatal.
> >
> > To allow the state to be entered and exited, we add an internal
> > bit to current->dataplane_flags that is set when prctl() sets the
> > flags. That way, when we are exiting the kernel after calling
> > prctl() to forbid future kernel exits, we don't get immediately
> > killed.
>
> Is there any reason this can't already be addressed in userspace using
> /proc/interrupts or perf_events? ISTM the real goal here is to detect
> when we screw up and fail to avoid an interrupt, and killing the task
> seems like overkill to me.
>
> Also, can we please stop further torturing the exit paths?
So, I don't know if it is a practical suggestion or not, but would it
be better/easier to mark a pending signal on kernel entry for this case?

The upsides I see are that the user gets her notification (killing the
task or just logging the event in a signal handler) and, hopefully,
since return to userspace with a pending signal is already handled, we
don't need new code in the exit path?

Gilad

2015-05-11 12:58:07

by Steven Rostedt

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full


NO_HZ_LEAVE_ME_THE_FSCK_ALONE!


On Sat, 9 May 2015 09:05:38 +0200
Ingo Molnar <[email protected]> wrote:

> So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep
> it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be
> such a 'zero overhead' mode of operation, where if user-space runs, it
> won't get interrupted in any way.


All kidding aside, I think this is the real answer. We don't need a new
NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
what it was created to do. That should be fixed.

Please lets get NO_HZ_FULL up to par. That should be the main focus.

-- Steve

2015-05-11 15:36:11

by Frederic Weisbecker

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
>
> NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
>
>
> On Sat, 9 May 2015 09:05:38 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep
> > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be
> > such a 'zero overhead' mode of operation, where if user-space runs, it
> > won't get interrupted in any way.
>
>
> All kidding aside, I think this is the real answer. We don't need a new
> NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> what it was created to do. That should be fixed.
>
> Please lets get NO_HZ_FULL up to par. That should be the main focus.

Now if we can manage to make NO_HZ_FULL behave in a specific way
that fits everyone's usecase, I'll be happy.

But some people may expect hard isolation requirements (Real Time, deterministic
latency) and others softer isolation (HPC, only interested in performance, can
live with a rare random tick, so no need to loop before returning to userspace
until we have the no-noise guarantee).

I expect some Real Time users may want this kind of dataplane mode where a syscall
or whatever sleeps until the system is ready to provide the guarantee that no
disturbance is going to happen for a given time. I'm not sure HPC users are interested
in that.

In fact it goes along with the fact that NO_HZ_FULL was really only supposed to be
about the tick, and now people are introducing more and more kernel default presets
that assume NO_HZ_FULL implies ISOLATION, which is about all kinds of noise (tick,
tasks, irqs, ...). Which is true, but what kind of ISOLATION?

Probably NO_HZ_FULL should really only be about stopping the tick; then some sort
of CONFIG_ISOLATION would drive the kind of isolation we are interested in,
and thereby the behaviour of NO_HZ_FULL, workqueues, timers, task affinity,
irq affinity, dataplane mode, ...

2015-05-11 17:19:25

by Paul E. McKenney

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
>
> NO_HZ_LEAVE_ME_THE_FSCK_ALONE!

NO_HZ_OVERFLOWING?

Kconfig naming controversy aside, I believe this patchset is addressing
a real need. Might need additional adjustment, but something useful.

Thanx, Paul

> On Sat, 9 May 2015 09:05:38 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep
> > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be
> > such a 'zero overhead' mode of operation, where if user-space runs, it
> > won't get interrupted in any way.
>
>
> All kidding aside, I think this is the real answer. We don't need a new
> NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> what it was created to do. That should be fixed.
>
> Please lets get NO_HZ_FULL up to par. That should be the main focus.
>
> -- Steve
>

2015-05-11 17:27:48

by Andrew Morton

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <[email protected]> wrote:

> On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> >
> > NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
>
> NO_HZ_OVERFLOWING?

Actually, "NO_HZ" shouldn't appear in the name at all. The objective
is to permit userspace to execute without interruption. NO_HZ is a
part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical
artifact from an early partial implementation.

2015-05-11 17:33:14

by Frederic Weisbecker

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, May 11, 2015 at 10:27:44AM -0700, Andrew Morton wrote:
> On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <[email protected]> wrote:
>
> > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> > >
> > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
> >
> > NO_HZ_OVERFLOWING?
>
> Actually, "NO_HZ" shouldn't appear in the name at all. The objective
> is to permit userspace to execute without interruption. NO_HZ is a
> part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical
> artifact from an early partial implementation.

Agreed! Which is why I'd rather advocate in favour of CONFIG_ISOLATION.

2015-05-11 18:00:21

by Steven Rostedt

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, 11 May 2015 19:33:06 +0200
Frederic Weisbecker <[email protected]> wrote:

> On Mon, May 11, 2015 at 10:27:44AM -0700, Andrew Morton wrote:
> > On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <[email protected]> wrote:
> >
> > > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> > > >
> > > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
> > >
> > > NO_HZ_OVERFLOWING?
> >
> > Actually, "NO_HZ" shouldn't appear in the name at all. The objective
> > is to permit userspace to execute without interruption. NO_HZ is a
> > part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical
> > artifact from an early partial implementation.
>
> Agreed! Which is why I'd rather advocate in favour of CONFIG_ISOLATION.

Then we should have CONFIG_LEAVE_ME_THE_FSCK_ALONE. Hmm, I guess that's
just a synonym for CONFIG_ISOLATION.

-- Steve

2015-05-11 18:10:25

by Chris Metcalf

Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

A bunch of issues have been raised by various folks (thanks!) and
I'll try to break them down and respond to them in a few different
emails. This email is just about the issue of naming and whether the
proposed patch series should even have its own "name" or just be part
of NO_HZ_FULL.

First, Ingo and Steven both suggested that this new "dataplane" mode
(or whatever we want to call it; see below) should just be rolled into
the existing NO_HZ_FULL and that we should focus on making that work
better.

Steven writes:
> All kidding aside, I think this is the real answer. We don't need a new
> NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> what it was created to do. That should be fixed.

The claim I'm making is that it's worthwhile to differentiate the two
semantics. Plain NO_HZ_FULL just says "kernel makes a best effort to
avoid periodic interrupts without incurring any serious overhead". My
patch series allows an app to request "kernel makes an absolute
commitment to avoid all interrupts regardless of cost when leaving
kernel space". These are different enough ideas, and serve different
enough application needs, that I think they should be kept distinct.

Frederic actually summed this up very nicely in his recent email when
he wrote "some people may expect hard isolation requirement (Real
Time, deterministic latency) and others softer isolation (HPC, only
interested in performance, can live with one rare random tick, so no
need to loop before returning to userspace until we have the no-noise
guarantee)."

So we need a way for apps to ask for the "harder" mode and let
the softer mode be the default.

What about naming? We may or may not want to have a Kconfig flag
for this, and we may or may not have a separate mode for it, but
we still will need some kind of name to talk about it with. (In
particular there's the prctl name, if we take that approach, and
potential boot command-line flags to consider naming for.)

I'll quickly cover the suggestions that have been raised:

- DATAPLANE. My suggestion, seemingly broadly disliked by folks
who felt it wasn't apparent what it meant. Probably a fair point.

- NO_INTERRUPTS (Andrew). Captures some of the sense, but was
criticized pretty fairly by Ingo as being too negative, confusing
with perf nomenclature, and too long :-)

- PURE (Ingo). Proposed as an alternative to NO_HZ_FULL, but we could
use it as a name for this new mode. However, I think it's not clear
enough how FULL and PURE can/should relate to each other from the
names alone.

- BARE_METAL (me). Ingo observes it's confusing with respect to
virtualization.

- TASK_SOLO (Gilad). Not sure this conveys enough of the semantics.

- OS_LIGHT/OS_ZERO and NO_HZ_LEAVE_ME_THE_FSCK_ALONE. Excellent
ideas :-)

- ISOLATION (Frederic). I like this but it conflicts with other uses
of "isolation" in the kernel: cgroup isolation, lru page isolation,
iommu isolation, scheduler isolation (at least it's a superset of
that one), etc. Also, we're not exactly isolating a task - often
a "dataplane" app consists of a bunch of interacting threads in
userspace, so not exactly isolated. So perhaps it's too confusing.

- OVERFLOWING (Steven) - not sure I understood this one, honestly.

I suggested earlier a few other candidates that I don't love, but no
one commented on: NO_HZ_STRICT, USERSPACE_ONLY, and ZERO_OVERHEAD.

One thing I'm leaning towards is to remove the intermediate state of
DATAPLANE_ENABLE and say that there is really only one primary state,
DATAPLANE_QUIESCE (or whatever we call it). The "dataplane but no
quiesce" state probably isn't that useful, since it doesn't offer the
hard guarantee that is the entire point of this patch series. So that
opens the idea of using the name NO_HZ_QUIESCE or just QUIESCE as the
word that describes the mode; of course this sort of conflicts with
RCU quiesce (though it is a superset of that so maybe that's OK).

One new idea I had is to use NO_HZ_HARD to reflect what Frederic was
suggesting about "soft" and "hard" requirements for NO_HZ. So
enabling NO_HZ_HARD would enable my suggested QUIESCE mode.

One way to focus this discussion is on the user API naming. I had
prctl(PR_SET_DATAPLANE), which was attractive in being a "positive"
noun. A lot of the other suggestions fail this test in various ways.
Reasonable candidates seem to be:

PR_SET_OS_ZERO
PR_SET_TASK_SOLO
PR_SET_ISOLATION

Another possibility:

PR_SET_NONSTOP

Or take Andrew's NO_INTERRUPTS and have:

PR_SET_UNINTERRUPTED

I slightly favor ISOLATION at this point despite the overlap with
other kernel concepts.
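
For concreteness, here is a sketch of what the userspace call might
look like if ISOLATION wins; every constant below is a hypothetical
placeholder, not committed ABI:

#include <sys/prctl.h>
#include <stdio.h>

#define PR_SET_ISOLATION        48              /* hypothetical prctl number */
#define PR_ISOLATION_ENABLE     (1 << 0)        /* hypothetical flag bits */
#define PR_ISOLATION_QUIESCE    (1 << 1)

int main(void)
{
        /* Assumes the task is already affinitized to a nohz_full core. */
        if (prctl(PR_SET_ISOLATION,
                  PR_ISOLATION_ENABLE | PR_ISOLATION_QUIESCE, 0, 0, 0) < 0) {
                perror("prctl(PR_SET_ISOLATION)");
                return 1;
        }
        /* ... userspace-only work loop runs from here on ... */
        return 0;
}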

Let the bike-shedding continue! :-)

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-11 18:36:50

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, 11 May 2015 14:09:59 -0400
Chris Metcalf <[email protected]> wrote:

> Steven writes:
> > All kidding aside, I think this is the real answer. We don't need a new
> > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> > what it was created to do. That should be fixed.
>
> The claim I'm making is that it's worthwhile to differentiate the two
> semantics. Plain NO_HZ_FULL just says "kernel makes a best effort to
> avoid periodic interrupts without incurring any serious overhead". My
> patch series allows an app to request "kernel makes an absolute
> commitment to avoid all interrupts regardless of cost when leaving
> kernel space". These are different enough ideas, and serve different
> enough application needs, that I think they should be kept distinct.
>
> Frederic actually summed this up very nicely in his recent email when
> he wrote "some people may expect hard isolation requirement (Real
> Time, deterministic latency) and others softer isolation (HPC, only
> interested in performance, can live with one rare random tick, so no
> need to loop before returning to userspace until we have the no-noise
> guarantee)."
>
> So we need a way for apps to ask for the "harder" mode and let
> the softer mode be the default.

Fair enough. But I would hope that this would improve on NO_HZ_FULL as
well.

>
> What about naming? We may or may not want to have a Kconfig flag
> for this, and we may or may not have a separate mode for it, but
> we still will need some kind of name to talk about it with. (In
> particular there's the prctl name, if we take that approach, and
> potential boot command-line flags to consider naming for.)
>
> I'll quickly cover the suggestions that have been raised:
>
> - DATAPLANE. My suggestion, seemingly broadly disliked by folks
> who felt it wasn't apparent what it meant. Probably a fair point.
>
> - NO_INTERRUPTS (Andrew). Captures some of the sense, but was
> criticized pretty fairly by Ingo as being too negative, confusing
> with perf nomenclature, and too long :-)

What about NO_INTERRUPTIONS?

>
> - PURE (Ingo). Proposed as an alternative to NO_HZ_FULL, but we could
> use it as a name for this new mode. However, I think it's not clear
> enough how FULL and PURE can/should relate to each other from the
> names alone.

I would find the two confusing as well.

>
> - BARE_METAL (me). Ingo observes it's confusing with respect to
> virtualization.

This is also confusing.

>
> - TASK_SOLO (Gilad). Not sure this conveys enough of the semantics.

Agreed.

>
> - OS_LIGHT/OS_ZERO and NO_HZ_LEAVE_ME_THE_FSCK_ALONE. Excellent
> ideas :-)

At least the LEAVE_ME_ALONE conveys the semantics ;-)

>
> - ISOLATION (Frederic). I like this but it conflicts with other uses
> of "isolation" in the kernel: cgroup isolation, lru page isolation,
> iommu isolation, scheduler isolation (at least it's a superset of
> that one), etc. Also, we're not exactly isolating a task - often
> a "dataplane" app consists of a bunch of interacting threads in
> userspace, so not exactly isolated. So perhaps it's too confusing.
>
> - OVERFLOWING (Steven) - not sure I understood this one, honestly.

Actually, that was suggested by Paul McKenney.

>
> I suggested earlier a few other candidates that I don't love, but no
> one commented on: NO_HZ_STRICT, USERSPACE_ONLY, and ZERO_OVERHEAD.
>
> One thing I'm leaning towards is to remove the intermediate state of
> DATAPLANE_ENABLE and say that there is really only one primary state,
> DATAPLANE_QUIESCE (or whatever we call it). The "dataplane but no
> quiesce" state probably isn't that useful, since it doesn't offer the
> hard guarantee that is the entire point of this patch series. So that
> opens the idea of using the name NO_HZ_QUIESCE or just QUIESCE as the
> word that describes the mode; of course this sort of conflicts with
> RCU quiesce (though it is a superset of that so maybe that's OK).
>
> One new idea I had is to use NO_HZ_HARD to reflect what Frederic was
> suggesting about "soft" and "hard" requirements for NO_HZ. So
> enabling NO_HZ_HARD would enable my suggested QUIESCE mode.
>
> One way to focus this discussion is on the user API naming. I had
> prctl(PR_SET_DATAPLANE), which was attractive in being a "positive"
> noun. A lot of the other suggestions fail this test in various ways.
> Reasonable candidates seem to be:
>
> PR_SET_OS_ZERO
> PR_SET_TASK_SOLO
> PR_SET_ISOLATION
>
> Another possibility:
>
> PR_SET_NONSTOP
>
> Or take Andrew's NO_INTERRUPTS and have:
>
> PR_SET_UNINTERRUPTED

For another possible answer, what about

SET_TRANQUILITY

A state with no disturbances.

-- Steve

>
> I slightly favor ISOLATION at this point despite the overlap with
> other kernel concepts.
>
> Let the bike-shedding continue! :-)
>

2015-05-11 19:13:58

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On 05/09/2015 03:28 AM, Andy Lutomirski wrote:
> On May 8, 2015 11:44 PM, "Chris Metcalf" <[email protected]> wrote:
>> With QUIESCE mode, the task is in principle guaranteed not to be
>> interrupted by the kernel, but only if it behaves. In particular,
>> if it enters the kernel via system call, page fault, or any of
>> a number of other synchronous traps, it may be unexpectedly
>> exposed to long latencies. Add a simple flag that puts the process
>> into a state where any such kernel entry is fatal.
>>
>> To allow the state to be entered and exited, we add an internal
>> bit to current->dataplane_flags that is set when prctl() sets the
>> flags. That way, when we are exiting the kernel after calling
>> prctl() to forbid future kernel exits, we don't get immediately
>> killed.
> Is there any reason this can't already be addressed in userspace using
> /proc/interrupts or perf_events? ISTM the real goal here is to detect
> when we screw up and fail to avoid an interrupt, and killing the task
> seems like overkill to me.

Patch 6/6 proposes a mechanism to track down times when the
kernel screws up and delivers an IRQ to a userspace-only task.
Here, we're just trying to identify the times when an application
screws itself up out of cluelessness, and provide a mechanism
that allows the developer to easily figure out why and fix it.

In particular, /proc/interrupts won't show syscalls or page faults,
which are two easy ways applications can screw themselves
when they think they're in userspace-only mode. Also, they don't
provide sufficient precision to make it clear what part of the
application caused the undesired kernel entry.

In this case, killing the task is appropriate, since that's exactly
the semantics that have been asked for - it's like on architectures
that don't natively support unaligned accesses, but fake it relatively
slowly in the kernel, and in development you just say "give me a
SIGBUS when that happens" and in production you might say
"fix it up and let's try to keep going".

You can argue that this is something that can be done by ftrace,
but certainly you'd want to have a way to programmatically
turn on ftrace at the moment when you're entering userspace-only
mode, so we'd want some API around that anyway. And honestly,
it's so easy to test a task state bit in a couple of places and
generate the failure on the spot, vs. the relative complexity
of setting up and understanding ftrace, that I think it merits
inclusion on that basis alone.

> Also, can we please stop further torturing the exit paths? We have a
> disaster of assembly code that calls into syscall_trace_leave and
> do_notify_resume. Those functions, in turn, *both* call user_enter
> (WTF?), and on very brief inspection user_enter makes it into the nohz
> code through multiple levels of indirection, which, with these
> patches, has yet another conditionally enabled helper, which does this
> new stuff. It's getting to be impossible to tell what happens when we
> exit to user space any more.
>
> Also, I think your code is buggy. There's no particular guarantee
> that user_enter is only called once between sys_prctl and the final
> exit to user mode (see the above WTF), so you might spuriously kill
> the process.

This is a good point; I also find the x86 kernel entry and exit
paths confusing, although I've reviewed them a bunch of times.
The tile architecture paths are a little easier to understand.

That said, I think the answer here is to avoid non-idempotent
actions in the dataplane code, such as clearing a syscall bit.

A better implementation, I think, is to put the tests for "you
screwed up and synchronously entered the kernel" in
the syscall_trace_enter() code, which TIF_NOHZ already
gets us into; there, we can test whether the dataplane "strict" bit is
set and the syscall is not prctl(), and if so generate the error.
(We'd exclude exit and exit_group here too, since we don't
need to shoot down a task that's just trying to kill itself.)
This needs a bit of platform-specific code for each platform,
but that doesn't seem like too big a problem.

Likewise we can test in exception_enter() since that's only
called for synchronous user entries like page faults.
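
Roughly, the syscall-entry check could look like this (sketch only;
the helper name is illustrative, and the per-arch hook would pass in
the syscall number):

static void dataplane_check_strict_syscall(unsigned long syscall_nr)
{
	struct task_struct *task = current;

	if (!(task->dataplane_flags & PR_DATAPLANE_STRICT))
		return;

	/* prctl() stays legal so the task can leave strict mode, and
	 * exit/exit_group don't need to be shot down. */
	if (syscall_nr == __NR_prctl ||
	    syscall_nr == __NR_exit ||
	    syscall_nr == __NR_exit_group)
		return;

	pr_warn("Dataplane STRICT mode violated by syscall %lu\n",
		syscall_nr);
	send_sig(SIGKILL, task, 1);
}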

> Also, I think that most users will be quite surprised if "strict
> dataplane" code causes any machine check on the system to kill your
> dataplane task.

Fair point, and avoided by testing as described above instead.
(Though presumably in development it's not such a big deal,
and as I said you'd likely turn it off in production.)

> Similarly, a user accidentally running perf record -a
> probably should have some reasonable semantics.

Yes, also avoided by doing this as above, though I'd argue we
could also just say that running perf disables this mode.
But it's not as clean as the above suggestion.

On 05/09/2015 06:37 AM, Gilad Ben Yossef wrote:
> So, I don't know if it is a practical suggestion or not, but would it be better/easier to mark a pending signal on kernel entry for this case?
> The upside I see is that the user gets her notification (killing the task or just logging the event in a signal handler) and hopefully, since return to userspace with a pending signal is already handled, we don't need new code in the exit path?

We could certainly do this now that I'm planning to do the
test at kernel entry rather than super-late in kernel exit.
Rather than just do_group_exit(SIGKILL), we should raise
a proper SIGKILL signal via send_sig(SIGKILL, current, 1),
and then we could catch it in the debugger; the pc should
help identify if it was a syscall, page fault, or other trap.

I'm not sure there's an argument to be made for the user
process being able to catch the signal itself; presumably in
production you don't turn this mode on anyway, and in
development, assuming a debugger is probably fine.

But if you want to argue for another signal (SIGILL?) please
do; I'm curious to hear if you think it would make more sense.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-11 19:19:48

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, 2015-05-11 at 17:36 +0200, Frederic Weisbecker wrote:

> I expect some Real Time users may want this kind of dataplane mode where a syscall
> or whatever sleeps until the system is ready to provide the guarantee that no
> disturbance is going to happen for a given time. I'm not sure HPC users are interested
> in that.

I bet they are. RT is just a different way to spell HPC, and the reverse.

> In fact it goes along the fact that NO_HZ_FULL was really only supposed to be about
> the tick and now people are introducing more and more kernel default presetting that
> assume NO_HZ_FULL implies ISOLATION which is about all kind of noise (tick, tasks, irqs,
> ...). Which is true but what kind of ISOLATION?

True, nohz mode and various isolation measures are distinct properties.
NO_HZ_FULL is kinda pointless without isolation measures to go with it,
but you're right.

I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
that old static isolcpus was _supposed_ to crawl off and die, I know
beyond doubt that having isolated a cpu as well as you can definitely
does NOT imply that said cpu should become tickless. I routinely run a
load model that wants all the isolation it can get. It's not single-task
compute though: an rt executive coordinating rt workers, which of course
wants every cycle it can get, so nohz_full is less than helpful.

-Mike

2015-05-11 19:25:38

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
> that old static isolcpus was _supposed_ to crawl off and die, I know
> beyond doubt that having isolated a cpu as well as you can definitely
> does NOT imply that said cpu should become tickless.

True, at a high level, I agree that it would be better to have a
top-level concept like Frederic's proposed ISOLATION that includes
isolcpus and nohz_cpu (and other stuff as needed).

That said, what you wrote above is wrong; even with the patch you
acked, setting isolcpus does not automatically turn on nohz_full for
a given cpu. The patch made it true the other way around: when
you say nohz_full, you automatically get isolcpus on that cpu too.
That does, at least, make sense for the semantics of nohz_full.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-11 19:54:51

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

(Oops, resending and forcing html off.)

On 05/09/2015 03:19 AM, Andy Lutomirski wrote:
> Naming aside, I don't think this should be a per-task flag at all. We
> already have way too much overhead per syscall in nohz mode, and it
> would be nice to get the per-syscall overhead as low as possible. We
should strive, for all tasks, to keep syscall overhead down *and*
> avoid as many interrupts as possible.
>
> That being said, I do see a legitimate use for a way to tell the
> kernel "I'm going to run in userspace for a long time; stay away".
> But shouldn't that be a single operation, not an ongoing flag? IOW, I
> think that we should have a new syscall quiesce() or something rather
> than a prctl.

Yes, if all you are concerned about is quiescing the tick, we could
probably do it as a new syscall.

I do note that you'd want to try to actually do the quiesce as late as
possible - in particular, if you just did it in the usual syscall, you
might miss out on a timer that is set by softirq, or even something
that happened when you called schedule() on the syscall exit path.
Doing it as late as we are doing helps to ensure that that doesn't
happen. We could still arrange for this semantics by having a new
quiesce() syscall set a temporary task bit that was cleared on
return to userspace, but as you pointed out in a different email,
that gets tricky if you end up doing multiple user_exit() calls on
your way back to userspace.
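
For illustration, that temporary-bit variant might look something like
this (pure sketch; the TIF_ flag name and the syscall itself are
hypothetical, while dataplane_quiesce() is the helper from this series):

SYSCALL_DEFINE0(quiesce)
{
	set_thread_flag(TIF_QUIESCE_ONCE);	/* hypothetical flag */
	return 0;
}

/* ... and at the very last step of the return-to-userspace path: */
if (test_and_clear_thread_flag(TIF_QUIESCE_ONCE))
	dataplane_quiesce();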

More to the point, I think it's actually important to know when an
application believes it's in userspace-only mode as an actual state
bit, rather than just during its transitional moment. If an
application calls the kernel at an unexpected time (third-party code
is the usual culprit for our customers, whether it's syscalls, page
faults, or other things) we would prefer to have the "quiesce"
semantics stay in force and cause the third-party code to be
visibly very slow, rather than cause a totally unexpected and
hard-to-diagnose interrupt to show up later as we are still going
around the loop that we thought was safely userspace-only.

And, for debugging the kernel, it's crazy helpful to have that state
bit in place: see patch 6/6 in the series for how we can diagnose
things like "a different core just queued an IPI that will hit a
dataplane core unexpectedly". Having that state bit makes this sort
of thing a trivial check in the kernel and relatively easy to debug.
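
As a sketch of the kind of check that state bit enables (the debug
flag and cpumask names below are placeholders, not the exact
identifiers in patch 6/6):

/* Called from the IPI send path when dataplane debugging is enabled. */
static void dataplane_debug_ipi(int cpu)
{
	if (dataplane_debug &&
	    cpumask_test_cpu(cpu, &dataplane_quiesce_mask)) {
		pr_warn("sending IPI to dataplane cpu %d\n", cpu);
		dump_stack();
	}
}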

Finally, I proposed a "strict" mode in patch 5/6 where we kill the
process if it voluntarily enters the kernel by mistake after saying it
wasn't going to any more. To do this requires a state bit, so
carrying another state bit for "quiesce on user entry" seems pretty
reasonable.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-11 20:13:39

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On 05/09/2015 03:04 AM, Mike Galbraith wrote:
> On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
>> For tasks which have elected dataplane functionality, we run
>> any pending softirqs for the core before returning to userspace,
>> rather than ever scheduling ksoftirqd to run. The problem we
>> fix is that by allowing another task to run on the core, we
>> guarantee more interrupts in the future to the dataplane task,
>> which is exactly what dataplane mode is required to prevent.
> If ksoftirqd were rt class

I realize I actually don't know if this is true or not. Is
ksoftirqd rt class? If not, it does seem pretty plausible that
it should be...

> softirqs would be gone when the soloist gets
> the CPU back and heads to userspace. Being a soloist, it has no use for
> a priority, so why can't it just let ksoftirqd run if it raises the
> occasional softirq? Meeting a contended lock while processing it will
> wreck the soloist regardless of who does that processing.

The thing you want to avoid is having two processes both
runnable at once, since then the "quiesce" mode can't make
forward progress and basically spins in cpu_idle() until ksoftirqd
can come in. Alas, my recollection of the precise failure mode
is somewhat dimmed; my commit notes from a year ago (for
a variant of the patch I'm upstreaming now):

- Trying to return to userspace with pending softirqs is not
currently allowed. Prior to this patch, when this happened
we would just wait in cpu_idle. Instead, what we now do is
directly run any pending softirqs, then go back and retry the
path where we return to userspace.

- Raising softirqs (in this case for hrtimer support) could
cause the ksoftirqd daemon to be woken on a core. This is
bad because on a dataplane core, a QUIESCE process will
then block until the ksoftirqd runs, and the system sometimes
seems to flag that soft irqs are available but not schedule
the timer to arrange for a context switch to ksoftirqd.
To handle this, we avoid bailing out in __do_softirq() when
we've been working for a while, if we're on a dataplane core,
and just keep working until done. Similarly, on a dataplane
core running a userspace task, we don't wake ksoftirqd when
we are raising a softirq, even if we're not in an interrupt
context where it will run promptly, since a non-interrupt
context will also run promptly.
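
In code terms, the first bullet amounts to something like this on the
exit-to-user path (sketch; task_is_dataplane() is a stand-in for
whatever predicate we end up with):

/* Run pending softirqs inline rather than waking ksoftirqd, then
 * retry the return-to-userspace path. */
if (task_is_dataplane(current)) {
	while (local_softirq_pending())
		do_softirq();
}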

I'm happy to drop this patch entirely from the series for now, and
if ksoftirqd shows up as a problem going forward, we can address it
as necessary at that time. What do you think?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-11 22:15:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On May 12, 2015 4:54 AM, "Chris Metcalf" <[email protected]> wrote:
>
> (Oops, resending and forcing html off.)
>
>
> On 05/09/2015 03:19 AM, Andy Lutomirski wrote:
>>
>> Naming aside, I don't think this should be a per-task flag at all. We
>> already have way too much overhead per syscall in nohz mode, and it
>> would be nice to get the per-syscall overhead as low as possible. We
>> should strive, for all tasks, to keep syscall overhead down *and*
>> avoid as many interrupts as possible.
>>
>> That being said, I do see a legitimate use for a way to tell the
>> kernel "I'm going to run in userspace for a long time; stay away".
>> But shouldn't that be a single operation, not an ongoing flag? IOW, I
>> think that we should have a new syscall quiesce() or something rather
>> than a prctl.
>
>
> Yes, if all you are concerned about is quiescing the tick, we could
> probably do it as a new syscall.
>
> I do note that you'd want to try to actually do the quiesce as late as
> possible - in particular, if you just did it in the usual syscall, you
> might miss out on a timer that is set by softirq, or even something
> that happened when you called schedule() on the syscall exit path.
> Doing it as late as we are doing helps to ensure that that doesn't
> happen. We could still arrange for this semantics by having a new
> quiesce() syscall set a temporary task bit that was cleared on
> return to userspace, but as you pointed out in a different email,
> that gets tricky if you end up doing multiple user_exit() calls on
> your way back to userspace.

We should fix that, then. A quiesce() syscall can certainly arrange
to clean up on final exit.

>
> More to the point, I think it's actually important to know when an
> application believes it's in userspace-only mode as an actual state
> bit, rather than just during its transitional moment.

We can do that, too, with a new flag that's cleared on the next entry.

> If an
> application calls the kernel at an unexpected time (third-party code
> is the usual culprit for our customers, whether it's syscalls, page
> faults, or other things) we would prefer to have the "quiesce"
> semantics stay in force and cause the third-party code to be
> visibly very slow, rather than cause a totally unexpected and
> hard-to-diagnose interrupt to show up later as we are still going
> around the loop that we thought was safely userspace-only.

I'm not really convinced that we should design this feature around
ease of debugging userspace screwups. There are already plenty of
ways to do that part. Userspace getting an interrupt because
userspace accidentally did a syscall is very different from userspace
getting interrupted due to an IPI.

>
> And, for debugging the kernel, it's crazy helpful to have that state
> bit in place: see patch 6/6 in the series for how we can diagnose
> things like "a different core just queued an IPI that will hit a
> dataplane core unexpectedly". Having that state bit makes this sort
> of thing a trivial check in the kernel and relatively easy to debug.

As above, this can be done with a one-time operation, too.

>
> Finally, I proposed a "strict" mode in patch 5/6 where we kill the
> process if it voluntarily enters the kernel by mistake after saying it
> wasn't going to any more. To do this requires a state bit, so
> carrying another state bit for "quiesce on user entry" seems pretty
> reasonable.

I still dislike that in the form you chose. It's too deadly to be
useful for anyone but the hardest RT users.

I think I'd be okay with variants, though: let a suitably privileged
process ask for a signal on inadvertent kernel entry or rig up an fd
to be notified when one of these bad entries happens. Queueing
something to a pollable fd would work, too.

See that thread for more comments.

--Andy

2015-05-11 22:29:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

[add peterz due to perf stuff]

On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <[email protected]> wrote:
> On 05/09/2015 03:28 AM, Andy Lutomirski wrote:
>>
>> On May 8, 2015 11:44 PM, "Chris Metcalf" <[email protected]> wrote:
>>>
>>> With QUIESCE mode, the task is in principle guaranteed not to be
>>> interrupted by the kernel, but only if it behaves. In particular,
>>> if it enters the kernel via system call, page fault, or any of
>>> a number of other synchronous traps, it may be unexpectedly
>>> exposed to long latencies. Add a simple flag that puts the process
>>> into a state where any such kernel entry is fatal.
>>>
>>> To allow the state to be entered and exited, we add an internal
>>> bit to current->dataplane_flags that is set when prctl() sets the
>>> flags. That way, when we are exiting the kernel after calling
>>> prctl() to forbid future kernel exits, we don't get immediately
>>> killed.
>>
>> Is there any reason this can't already be addressed in userspace using
>> /proc/interrupts or perf_events? ISTM the real goal here is to detect
>> when we screw up and fail to avoid an interrupt, and killing the task
>> seems like overkill to me.
>
>
> Patch 6/6 proposes a mechanism to track down times when the
> kernel screws up and delivers an IRQ to a userspace-only task.
> Here, we're just trying to identify the times when an application
> screws itself up out of cluelessness, and provide a mechanism
> that allows the developer to easily figure out why and fix it.
>
> In particular, /proc/interrupts won't show syscalls or page faults,
> which are two easy ways applications can screw themselves
> when they think they're in userspace-only mode. Also, they don't
> provide sufficient precision to make it clear what part of the
> application caused the undesired kernel entry.

Perf does, though, complete with context.

>
> In this case, killing the task is appropriate, since that's exactly
> the semantics that have been asked for - it's like on architectures
> that don't natively support unaligned accesses, but fake it relatively
> slowly in the kernel, and in development you just say "give me a
> SIGBUS when that happens" and in production you might say
> "fix it up and let's try to keep going".

I think more control is needed. I also think that, if we go this
route, we should distinguish syscalls, synchronous non-syscall
entries, and asynchronous non-syscall entries. They're quite
different.

>
> You can argue that this is something that can be done by ftrace,
> but certainly you'd want to have a way to programmatically
> turn on ftrace at the moment when you're entering userspace-only
> mode, so we'd want some API around that anyway. And honestly,
> it's so easy to test a task state bit in a couple of places and
> generate the failure on the spot, vs. the relative complexity
> of setting up and understanding ftrace, that I think it merits
> inclusion on that basis alone.

perf_event, not ftrace.

>
>> Also, can we please stop further torturing the exit paths? We have a
>> disaster of assembly code that calls into syscall_trace_leave and
>> do_notify_resume. Those functions, in turn, *both* call user_enter
>> (WTF?), and on very brief inspection user_enter makes it into the nohz
>> code through multiple levels of indirection, which, with these
>> patches, has yet another conditionally enabled helper, which does this
>> new stuff. It's getting to be impossible to tell what happens when we
>> exit to user space any more.
>>
>> Also, I think your code is buggy. There's no particular guarantee
>> that user_enter is only called once between sys_prctl and the final
>> exit to user mode (see the above WTF), so you might spuriously kill
>> the process.
>
>
> This is a good point; I also find the x86 kernel entry and exit
> paths confusing, although I've reviewed them a bunch of times.
> The tile architecture paths are a little easier to understand.
>
> That said, I think the answer here is to avoid non-idempotent
> actions in the dataplane code, such as clearing a syscall bit.
>
> A better implementation, I think, is to put the tests for "you
> screwed up and synchronously entered the kernel" in
> the syscall_trace_enter() code, which TIF_NOHZ already
> gets us into;

No, not unless you're planning on using that to distinguish syscalls
from other stuff *and* people think that's justified.

It's far too easy to just make a tiny change to the entry code. Add a
tiny trivial change here, a few lines of asm (that's you, audit!)
there, some weird written-in-asm scheduling code over here, and you
end up with the truly awful mess that we currently have.

If it really makes sense for this stuff to go with context tracking,
then fine, but we should *fix* the context tracking first rather than
kludging around it. I already have a prototype patch for the relevant
part of that.

> there, we can test whether the dataplane "strict" bit is
> set and the syscall is not prctl(), and if so generate the error.
> (We'd exclude exit and exit_group here too, since we don't
> need to shoot down a task that's just trying to kill itself.)
> This needs a bit of platform-specific code for each platform,
> but that doesn't seem like too big a problem.

I'd rather avoid that, too. This feature isn't really arch-specific,
so let's avoid the arch stuff if at all possible.

>
> Likewise we can test in exception_enter() since that's only
> called for synchronous user entries like page faults.

Let's try to generalize a bit. There's also irq_entry and ist_enter,
and some of the exception_enter cases are for synchronous entries
while (IIRC -- could be wrong) others aren't always like that.

>
>> Also, I think that most users will be quite surprised if "strict
>> dataplane" code causes any machine check on the system to kill your
>> dataplane task.
>
>
> Fair point, and avoided by testing as described above instead.
> (Though presumably in development it's not such a big deal,
> and as I said you'd likely turn it off in production.)

Until you forget to turn it off in production because it worked so
nicely in development.

What if we added a mode to perf where delivery of a sample
synchronously (or semi-synchronously by catching it on the next exit
to userspace) freezes the delivering task? It would be like debugger
support via perf.

peterz, do you think this would be a sensible thing to add to perf?
It would only make sense for some types of events (tracepoints and
hw_breakpoints mostly, I think).

>> So, I don't know if it is a practical suggestion or not, but would it
>> be better/easier to mark a pending signal on kernel entry for this case?
>> The upside I see is that the user gets her notification (killing the task
>> or just logging the event in a signal handler) and hopefully, since return
>> to userspace with a pending signal is already handled, we don't need new
>> code in the exit path?
>
>
> We could certainly do this now that I'm planning to do the
> test at kernel entry rather than super-late in kernel exit.
> Rather than just do_group_exit(SIGKILL), we should raise
> a proper SIGKILL signal via send_sig(SIGKILL, current, 1),
> and then we could catch it in the debugger; the pc should
> help identify if it was a syscall, page fault, or other trap.
>
> I'm not sure there's an argument to be made for the user
> process being able to catch the signal itself; presumably in
> production you don't turn this mode on anyway, and in
> development, assuming a debugger is probably fine.
>
> But if you want to argue for another signal (SIGILL?) please
> do; I'm curious to hear if you think it would make more sense.

Make it configurable as part of the prctl.

--Andy

2015-05-12 01:48:05

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> > I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
> > that old static isolcpus was _supposed_ to crawl off and die, I know
> > beyond doubt that having isolated a cpu as well as you can definitely
> > does NOT imply that said cpu should become tickless.
>
> True, at a high level, I agree that it would be better to have a
> top-level concept like Frederic's proposed ISOLATION that includes
> isolcpus and nohz_cpu (and other stuff as needed).
>
> That said, what you wrote above is wrong; even with the patch you
> acked, setting isolcpus does not automatically turn on nohz_full for
> a given cpu. The patch made it true the other way around: when
> you say nohz_full, you automatically get isolcpus on that cpu too.
> That does, at least, make sense for the semantics of nohz_full.

I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
Yes, with nohz_full currently being static, the old allegedly dying but
also static isolcpus scheduler off switch is a convenient thing to wire
the nohz_full CPU SET (<- hint;) property to.

-Mike

2015-05-12 02:21:48

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On Mon, 2015-05-11 at 16:13 -0400, Chris Metcalf wrote:
> On 05/09/2015 03:04 AM, Mike Galbraith wrote:
> > On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
> >> For tasks which have elected dataplane functionality, we run
> >> any pending softirqs for the core before returning to userspace,
> >> rather than ever scheduling ksoftirqd to run. The problem we
> >> fix is that by allowing another task to run on the core, we
> >> guarantee more interrupts in the future to the dataplane task,
> >> which is exactly what dataplane mode is required to prevent.
> > If ksoftirqd were rt class
>
> I realize I actually don't know if this is true or not. Is
> ksoftirqd rt class? If not, it does seem pretty plausible that
> it should be...

It is in an rt kernel, not in a stock kernel; it's malleable in both ;-)

> > softirqs would be gone when the soloist gets
> > the CPU back and heads to userspace. Being a soloist, it has no use for
> > a priority, so why can't it just let ksoftirqd run if it raises the
> > occasional softirq? Meeting a contended lock while processing it will
> > wreck the soloist regardless of who does that processing.
>
> The thing you want to avoid is having two processes both
> runnable at once, since then the "quiesce" mode can't make
> forward progress and basically spins in cpu_idle() until ksoftirqd
> can come in.

The only way ksoftirqd can appear is that the soloist woke it. If the
alleged soloist is raising enough softirqs to matter, it ain't really an
ultra-sensitive solo artist; it's part of a noise-inducing (locks) chorus.

> Alas, my recollection of the precise failure mode
> is somewhat dimmed; my commit notes from a year ago (for
> a variant of the patch I'm upstreaming now):
>
> - Trying to return to userspace with pending softirqs is not
> currently allowed. Prior to this patch, when this happened
> we would just wait in cpu_idle. Instead, what we now do is
> directly run any pending softirqs, then go back and retry the
> path where we return to userspace.
>
> - Raising softirqs (in this case for hrtimer support) could
> cause the ksoftirqd daemon to be woken on a core. This is
> bad because on a dataplane core, a QUIESCE process will
> then block until the ksoftirqd runs, and the system sometimes
> seems to flag that soft irqs are available but not schedule
> the timer to arrange for a context switch to ksoftirqd.
> To handle this, we avoid bailing out in __do_softirq() when
> we've been working for a while, if we're on a dataplane core,
> and just keep working until done. Similarly, on a dataplane
> core running a userspace task, we don't wake ksoftirqd when
> we are raising a softirq, even if we're not in an interrupt
> context where it will run promptly, since a non-interrupt
> context will also run promptly.

Thomas has nuked the hrtimer softirq.

> I'm happy to drop this patch entirely from the series for now, and
> if ksoftirqd shows up as a problem going forward, we can address it
> as necessary at that time. What do you think?

Inlining softirqs may save a context switch, but adds cycles that we may
consume at higher frequency than the thing we're avoiding.

-Mike

2015-05-12 04:35:42

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Tue, 2015-05-12 at 03:47 +0200, Mike Galbraith wrote:
> On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> > On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> > > I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
> > > that old static isolcpus was _supposed_ to crawl off and die, I know
> > > beyond doubt that having isolated a cpu as well as you can definitely
> > > does NOT imply that said cpu should become tickless.
> >
> > True, at a high level, I agree that it would be better to have a
> > top-level concept like Frederic's proposed ISOLATION that includes
> > isolcpus and nohz_cpu (and other stuff as needed).
> >
> > That said, what you wrote above is wrong; even with the patch you
> > acked, setting isolcpus does not automatically turn on nohz_full for
> > a given cpu. The patch made it true the other way around: when
> > you say nohz_full, you automatically get isolcpus on that cpu too.
> > That does, at least, make sense for the semantics of nohz_full.
>
> I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
> Yes, with nohz_full currently being static, the old allegedly dying but
> also static isolcpus scheduler off switch is a convenient thing to wire
> the nohz_full CPU SET (<- hint;) property to.

BTW, another facet of this: Rik wants to make isolcpus immune to
cpusets, which makes some sense (the user did say isolcpus=), but that
also makes isolcpus truly static. If the user now says nohz_full=, they lose
the ability to deactivate CPU isolation, making the set fairly useless
for anything other than HPC. Currently, the user can flip the isolation
switch as he sees fit. He takes a size extra large performance hit for
having said nohz_full=, but he doesn't lose generic utility.

-Mike

2015-05-12 09:10:58

by Ingo Molnar

[permalink] [raw]
Subject: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)


* Chris Metcalf <[email protected]> wrote:

> - ISOLATION (Frederic). I like this but it conflicts with other uses
> of "isolation" in the kernel: cgroup isolation, lru page isolation,
> iommu isolation, scheduler isolation (at least it's a superset of
> that one), etc. Also, we're not exactly isolating a task - often
> a "dataplane" app consists of a bunch of interacting threads in
> userspace, so not exactly isolated. So perhaps it's too confusing.

So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is
a high level kernel feature, so it won't conflict with isolation
concepts in lower level subsystems such as IOMMU isolation - and other
higher level features like scheduler isolation are basically another
partial implementation we want to merge with all this...

nohz, RCU tricks, watchdog defaults, isolcpus and various other
measures to keep these CPUs and workloads as isolated as possible
are (or should become) components of this high level concept.

Ideally CONFIG_ISOLATION=y would be a kernel feature that has almost
zero overhead on normal workloads and on non-isolated CPUs, so that
Linux distributions can enable it.

Enabling CONFIG_ISOLATION=y should be the only 'kernel config' step
needed: just like cpusets, the configuration of isolated CPUs should
be a completely boot-option-free exercise that can be dynamically
done and undone by the administrator via an intuitive interface.

Thanks,

Ingo

2015-05-12 09:26:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane

On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
> While the current fallback to 1-second tick is still helpful for
> maintaining completely correct kernel semantics, processes using
> prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
> completely tickless, so don't bound the time_delta for such processes.
>
> This was previously discussed in
>
> https://lkml.org/lkml/2014/10/31/364
>
> and Thomas Gleixner observed that vruntime, load balancing data,
> load accounting, and other things might be impacted. Frederic
> Weisbecker similarly observed that allowing the tick to be indefinitely
> deferred just meant that no one would ever fix the underlying bugs.
> However it's at least true that the mode proposed in this patch can
> only be enabled on an isolcpus core, which may limit how important
> it is to maintain scheduler data correctly, for example.

So how is making this available going to help people fix the actual
problem?

There is nothing fundamentally impossible about fixing this properly;
it's just a lot of hard work.

NAK on this, do it right.

2015-05-12 09:29:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote:
> - Raising softirqs (in this case for hrtimer support) could

Note that Thomas recently killed all the softirq wreckage in hrtimers.
So that specific case is dealt with.

2015-05-12 09:32:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote:
> The thing you want to avoid is having two processes both
> runnable at once

Right, because as soon as nr_running > 1 we kill the entire nohz_full
thing. RT or not for ksoftirqd doesn't matter.

Then again, like interrupts, you basically want to avoid softirqs in
this mode.

So I think the right solution is to figure out why the softirqs get
raised and cure that.

2015-05-12 09:34:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
> This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
> kernel to quiesce any pending timer interrupts prior to returning
> to userspace. When running with this mode set, sys calls (and page
> faults, etc.) can be inordinately slow. However, user applications
> that want to guarantee that no unexpected interrupts will occur
> (even if they call into the kernel) can set this flag to guarantee
> that semantics.

Currently people hot-unplug and hot-plug the CPU to do this. Obviously
that's a wee bit horrible :-)

Not sure if a prctl like this is any better though. This is a CPU
property, not a process one.

ISTR people talking about a 'quiesce' sysfs file, alongside the hotplug
stuff, but I can't quite remember.

2015-05-12 09:39:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote:
> +++ b/kernel/time/tick-sched.c
> @@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
> (jiffies - start));
> dump_stack();
> }
> +
> + /*
> + * Kill the process if it violates STRICT mode. Note that this
> + * code also results in killing the task if a kernel bug causes an
> + * irq to be delivered to this core.
> + */
> + if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
> + == PR_DATAPLANE_STRICT) {
> + pr_warn("Dataplane STRICT mode violated; process killed.\n");
> + dump_stack();
> + task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
> + local_irq_enable();
> + do_group_exit(SIGKILL);
> + }
> }

So while I'm all for hard fails like this, can we not provide a wee bit
more information in the siginfo? And maybe use a slightly less fatal
signal, such that userspace can actually catch it and dump state in
debug modes?
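
E.g. something like this (a sketch; the choice of SIGILL and the
si_code value are just placeholders for discussion):

struct siginfo info;

memset(&info, 0, sizeof(info));
info.si_signo = SIGILL;
info.si_code = ILL_ILLTRP;
info.si_addr = (void __user *)instruction_pointer(task_pt_regs(task));
force_sig_info(SIGILL, &info, task);

That way a debug build could catch the signal and dump state, with the
offending pc available to the handler via si_addr.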

2015-05-12 09:50:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE


* Peter Zijlstra <[email protected]> wrote:

> On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
> > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
> > kernel to quiesce any pending timer interrupts prior to returning
> > to userspace. When running with this mode set, sys calls (and page
> > faults, etc.) can be inordinately slow. However, user applications
> > that want to guarantee that no unexpected interrupts will occur
> > (even if they call into the kernel) can set this flag to guarantee
> > that semantics.
>
> Currently people hot-unplug and hot-plug the CPU to do this.
> Obviously that's a wee bit horrible :-)
>
> Not sure if a prctl like this is any better though. This is a CPU
> property, not a process one.

So if then a prctl() (or other system call) could be a shortcut to:

- move the task to an isolated CPU
- make sure there _is_ such an isolated domain available

I.e. have some programmatic, kernel provided way for an application to
be sure it's running in the right environment. Relying on random
administration flags here and there won't cut it.

Thanks,

Ingo

2015-05-12 10:38:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On Tue, May 12, 2015 at 11:50:30AM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
> > > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
> > > kernel to quiesce any pending timer interrupts prior to returning
> > > to userspace. When running with this mode set, sys calls (and page
> > > faults, etc.) can be inordinately slow. However, user applications
> > > that want to guarantee that no unexpected interrupts will occur
> > > (even if they call into the kernel) can set this flag to guarantee
> > > that semantics.
> >
> > Currently people hot-unplug and hot-plug the CPU to do this.
> > Obviously that's a wee bit horrible :-)
> >
> > Not sure if a prctl like this is any better though. This is a CPU
> > property, not a process one.
>
> So if then a prctl() (or other system call) could be a shortcut to:
>
> - move the task to an isolated CPU
> - make sure there _is_ such an isolated domain available
>
> I.e. have some programmatic, kernel provided way for an application to
> be sure it's running in the right environment. Relying on random
> administration flags here and there won't cut it.

No, we already have sched_setaffinity() and we should not duplicate its
ability to move tasks about.

What this is about is 'clearing' CPU state; it's nothing to do with
tasks.

Ideally we'd never have to clear the state because it should be
impossible to get into this predicament in the first place.

The typical example here is a periodic timer that found its way onto the
cpu and stays there. We're actually working on allowing such self-arming
timers to migrate, so once we have that sorted this could be fixed
properly, I think.

Not sure if there's more pollution that people worry about.

The hotplug hack worked because unplug force-migrates the timers away.

2015-05-12 10:46:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
>
> Please lets get NO_HZ_FULL up to par. That should be the main focus.
>

ACK, much of this dataplane stuff is (useful) hacks working around the
fact that nohz_full just isn't complete.

2015-05-12 11:48:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)

On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote:
>
> So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is
> a high level kernel feature, so it won't conflict with isolation
> concepts in lower level subsystems such as IOMMU isolation - and other
> higher level features like scheduler isolation are basically another
> partial implementation we want to merge with all this...
>

But why do we need a CONFIG flag for something that has no content?

That is, I do not see anything much; except the 'I want to stay in
userspace and kill me otherwise' flag, and I'm not sure that warrants a
CONFIG flag like this.

Other than that, it's all a combination of NOHZ_FULL and cpusets/isolcpus
and whatnot.

2015-05-12 12:34:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)


* Peter Zijlstra <[email protected]> wrote:

> On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote:
> >
> > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this
> > is a high level kernel feature, so it won't conflict with
> > isolation concepts in lower level subsystems such as IOMMU
> > isolation - and other higher level features like scheduler
> > isolation are basically another partial implementation we want to
> > merge with all this...
>
> But why do we need a CONFIG flag for something that has no content?
>
> That is, I do not see anything much; except the 'I want to stay in
> userspace and kill me otherwise' flag, and I'm not sure that
> warrants a CONFIG flag like this.
>
> Other than that, it's all a combination of NOHZ_FULL and
> cpusets/isolcpus and whatnot.

Yes, that's what I meant: CONFIG_ISOLATION would trigger what is
NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as
an individual Kconfig option?

CONFIG_ISOLATION=y would express the guarantee from the kernel that
it's possible for user-space to configure itself to run undisturbed -
instead of the current inconsistent set of options and facilities.

A bit like CONFIG_PREEMPT_RT is more than just preemptable spinlocks,
it also tries to offer various facilities and tune the defaults to
turn the kernel hard-rt.

Does that make sense to you?

Thanks,

Ingo

2015-05-12 12:39:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)

On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote:
> Yes, that's what I meant: CONFIG_ISOLATION would trigger what is
> NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as
> an individual Kconfig option?

Ah, as a rename of nohz_full, sure that might work.

2015-05-12 12:43:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)


* Peter Zijlstra <[email protected]> wrote:

> On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote:
>
> > Yes, that's what I meant: CONFIG_ISOLATION would trigger what is
> > NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL
> > as an individual Kconfig option?
>
> Ah, as a rename of nohz_full, sure that might work.

It could also be named CONFIG_CPU_ISOLATION=y, to make it more
explicit what it's about.

Thanks,

Ingo

2015-05-12 12:54:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE


* Peter Zijlstra <[email protected]> wrote:

> > So if then a prctl() (or other system call) could be a shortcut
> > to:
> >
> > - move the task to an isolated CPU
> > - make sure there _is_ such an isolated domain available
> >
> > I.e. have some programmatic, kernel provided way for an
> > application to be sure it's running in the right environment.
> > Relying on random administration flags here and there won't cut
> > it.
>
> No, we already have sched_setaffinity() and we should not duplicate
> its ability to move tasks about.

But sched_setaffinity() does not guarantee isolation - it's just a
syscall to move a task to a set of CPUs, which might be isolated or
not.

What I suggested is that it might make sense to offer a system call,
for example a sched_setparam() variant, that makes such guarantees.

Say if user-space does:

ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);

... then we would get the task moved to an isolated domain and get a 0
return code if the kernel is able to do all that and if the current
uid/namespace/etc. has the required permissions and such.

( BIND_ISOLATED will not replace the current p->policy value, so it's
still possible to use the regular policies as well on top of this. )

I.e. make it programmatic instead of relying on a fragile, kernel
version dependent combination of sysctl, sysfs, kernel config and boot
parameter details to get us this result.

I.e. provide a central hub to offer this feature in a more structured,
easier to use fashion.

We might still require the admin (or distro) to separately set up the
domain of isolated CPUs, and it would still be possible to simply
'move' tasks there using existing syscalls - but I say that it's not a
bad idea at all to offer a single central syscall interface for apps
to request such treatment.

> What this is about is 'clearing' CPU state, its nothing to do with
> tasks.
>
> Ideally we'd never have to clear the state because it should be
> impossible to get into this predicament in the first place.

That I absolutely agree about, that bit is nonsense.

We might offer facilities to debug such bugs, but we won't work
around or hack around it.

Thanks,

Ingo

2015-05-12 13:08:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry

On Tue, May 12, 2015 at 11:32:02AM +0200, Peter Zijlstra wrote:
> On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote:
> > The thing you want to avoid is having two processes both
> > runnable at once
>
> Right, because as soon as nr_running > 1 we kill the entire nohz_full
> thing. RT or not for ksoftirqd doesn't matter.
>
> Then again, like interrupts, you basically want to avoid softirqs in
> this mode.
>
> So I think the right solution is to figure out why the softirqs get
> raised and cure that.

Makes sense, but it also makes sense to have something that detects
when that cure fails and cleans up. And, in a test/debug environment,
it should also issue some sort of diagnostic in that case.

Thanx, Paul

2015-05-12 13:13:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane

On Tue, May 12, 2015 at 11:26:07AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
> > While the current fallback to 1-second tick is still helpful for
> > maintaining completely correct kernel semantics, processes using
> > prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
> > completely tickless, so don't bound the time_delta for such processes.
> >
> > This was previously discussed in
> >
> > https://lkml.org/lkml/2014/10/31/364
> >
> > and Thomas Gleixner observed that vruntime, load balancing data,
> > load accounting, and other things might be impacted. Frederic
> > Weisbecker similarly observed that allowing the tick to be indefinitely
> > deferred just meant that no one would ever fix the underlying bugs.
> > However it's at least true that the mode proposed in this patch can
> > only be enabled on an isolcpus core, which may limit how important
> > it is to maintain scheduler data correctly, for example.
>
> So how is making this available going to help people fix the actual
> problem?

It will at least provide an environment where adding more of this
problem might get punished. This would be an improvement over what
we have today, namely that the 1HZ fallback timer silently forgives
adding more problems of this sort.

Thanx, Paul

> There is nothing fundamentally impossible about fixing this proper, its
> just a lot of hard work.
>
> NAK on this, do it right.
>

2015-05-12 13:36:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Mon, May 11, 2015 at 03:52:37PM -0400, Chris Metcalf wrote:
> On 05/09/2015 03:19 AM, Andy Lutomirski wrote:
> >Naming aside, I don't think this should be a per-task flag at all. We
> >already have way too much overhead per syscall in nohz mode, and it
> >would be nice to get the per-syscall overhead as low as possible. We
> >should strive, for all tasks, to keep syscall overhead down*and*
> >avoid as many interrupts as possible.
> >
> >That being said, I do see a legitimate use for a way to tell the
> >kernel "I'm going to run in userspace for a long time; stay away".
> >But shouldn't that be a single operation, not an ongoing flag? IOW, I
> >think that we should have a new syscall quiesce() or something rather
> >than a prctl.
>
> Yes, if all you are concerned about is quiescing the tick, we could
> probably do it as a new syscall.
>
> I do note that you'd want to try to actually do the quiesce as late as
> possible - in particular, if you just did it in the usual syscall, you
> might miss out on a timer that is set by softirq, or even something
> that happened when you called schedule() on the syscall exit path.
> Doing it as late as we are doing helps to ensure that that doesn't
> happen. We could still arrange for this semantics by having a new
> quiesce() syscall set a temporary task bit that was cleared on
> return to userspace, but as you pointed out in a different email,
> that gets tricky if you end up doing multiple user_exit() calls on
> your way back to userspace.
>
> More to the point, I think it's actually important to know when an
> application believes it's in userspace-only mode as an actual state
> bit, rather than just during its transitional moment. If an
> application calls the kernel at an unexpected time (third-party code
> is the usual culprit for our customers, whether it's syscalls, page
> faults, or other things) we would prefer to have the "quiesce"
> semantics stay in force and cause the third-party code to be
> visibly very slow, rather than cause a totally unexpected and
> hard-to-diagnose interrupt show up later as we are still going
> around the loop that we thought was safely userspace-only.
>
> And, for debugging the kernel, it's crazy helpful to have that state
> bit in place: see patch 6/6 in the series for how we can diagnose
> things like "a different core just queued an IPI that will hit a
> dataplane core unexpectedly". Having that state bit makes this sort
> of thing a trivial check in the kernel and relatively easy to debug.

I agree with this! It is currently a bit painful to debug problems
that might result in multiple tasks runnable on a given CPU. If you
suspect a problem, you enable tracing and re-run. Not particularly
friendly for chasing down intermittent problems, so some sort of
improvement would be a very good thing.

Thanx, Paul

> Finally, I proposed a "strict" mode in patch 5/6 where we kill the
> process if it voluntarily enters the kernel by mistake after saying it
> wasn't going to any more. To do this requires a state bit, so
> carrying another state bit for "quiesce on user entry" seems pretty
> reasonable.
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>
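
For illustration, the debug check described above might look roughly
like this sketch; task_dataplane_debug() is a hypothetical helper over
the per-task dataplane_flags, and the exact hook point is an assumption:

/* Called before queueing an IPI or remote work to 'cpu': warn if that
 * cpu is currently running a task that asked to stay in userspace. */
static void dataplane_debug_remote(int cpu)
{
	struct task_struct *t = cpu_curr(cpu);

	if (t && task_dataplane_debug(t))
		pr_err("queueing IPI to dataplane cpu %d (task %s/%d)\n",
		       cpu, t->comm, t->pid);
}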

2015-05-12 13:20:13

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On Tue, May 12, 2015 at 11:38:58AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote:
> > +++ b/kernel/time/tick-sched.c
> > @@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
> > (jiffies - start));
> > dump_stack();
> > }
> > +
> > + /*
> > + * Kill the process if it violates STRICT mode. Note that this
> > + * code also results in killing the task if a kernel bug causes an
> > + * irq to be delivered to this core.
> > + */
> > + if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
> > + == PR_DATAPLANE_STRICT) {
> > + pr_warn("Dataplane STRICT mode violated; process killed.\n");
> > + dump_stack();
> > + task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
> > + local_irq_enable();
> > + do_group_exit(SIGKILL);
> > + }
> > }
>
> So while I'm all for hard fails like this, can we not provide a wee bit
> more information in the siginfo ? And maybe use a slightly less fatal
> signal, such that userspace can actually catch it and dump state in
> debug modes?

Agreed, a bit more debug state would be helpful.

Thanx, Paul
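
For illustration, the change being asked for might look roughly like
this sketch, replacing the do_group_exit(SIGKILL) quoted above; the
choice of signal and si_code here is illustrative, not from the patch:

/* Deliver a catchable signal with a populated siginfo so debug
 * builds can install a handler and dump state. */
struct siginfo info;

memset(&info, 0, sizeof(info));
info.si_signo = SIGILL;
info.si_code = ILL_ILLTRP;
info.si_addr = (void __user *)
	instruction_pointer(task_pt_regs(task));
force_sig_info(info.si_signo, &info, task);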

2015-05-12 15:36:39

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)

On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote:
> > >
> > > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this
> > > is a high level kernel feature, so it won't conflict with
> > > isolation concepts in lower level subsystems such as IOMMU
> > > isolation - and other higher level features like scheduler
> > > isolation are basically another partial implementation we want to
> > > merge with all this...
> >
> > But why do we need a CONFIG flag for something that has no content?
> >
> > That is, I do not see anything much; except the 'I want to stay in
> > userspace and kill me otherwise' flag, and I'm not sure that
> > warrants a CONFIG flag like this.
> >
> > Other than that, its all a combination of NOHZ_FULL and
> > cpusets/isolcpus and whatnot.
>
> Yes, that's what I meant: CONFIG_ISOLATION would trigger what is
> NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as
> an individual Kconfig option?

Right, we could return to what we had previously: CONFIG_NO_HZ. A config
that enables dynticks-idle by default and allows full dynticks if the
nohz_full= boot option is passed (or something driven by a higher-level
isolation interface).

Because eventually, distros enable NO_HZ_FULL so that their 0.0001% users
can use it. Well at least Red Hat does.

>
> CONFIG_ISOLATION=y would express the guarantee from the kernel that
> it's possible for user-space to configure itself to run undisturbed -
> instead of the current inconsistent set of options and facilities.
>
> A bit like CONFIG_PREEMPT_RT is more than just preemptable spinlocks,
> it also tries to offer various facilities and tune the defaults to
> turn the kernel hard-rt.
>
> Does that make sense to you?

Right, although distros tend to want features that can be enabled
dynamically, so that they have a single kernel to maintain. Things like
PREEMPT_RT really need to be a different kernel because fundamental
primitives like spinlocks must be implemented statically.

But isolation can be boot-enabled, or even runtime-enabled, as it's only
about timer, irq, and task affinity. Full nohz is more complicated, but
it can be runtime-toggled in the future.

So we could introduce CONFIG_CPU_ISOLATION, if only so that distros
that are really not interested in this can disable it.
CONFIG_CPU_ISOLATION=y would provide an ability which is
default-disabled and driven dynamically through whatever interface.

2015-05-12 21:05:59

by Chris Metcalf

[permalink] [raw]
Subject: Re: CONFIG_ISOLATION=y

On 05/12/2015 05:10 AM, Ingo Molnar wrote:
> * Chris Metcalf <[email protected]> wrote:
>
>> - ISOLATION (Frederic). I like this but it conflicts with other uses
>> of "isolation" in the kernel: cgroup isolation, lru page isolation,
>> iommu isolation, scheduler isolation (at least it's a superset of
>> that one), etc. Also, we're not exactly isolating a task - often
>> a "dataplane" app consists of a bunch of interacting threads in
>> userspace, so not exactly isolated. So perhaps it's too confusing.
> So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is
> a high level kernel feature, so it won't conflict with isolation
> concepts in lower level subsystems such as IOMMU isolation - and other
> higher level features like scheduler isolation are basically another
> partial implementation we want to merge with all this...
>
> nohz, RCU tricks, watchdog defaults, isolcpus and various other
> measures to keep these CPUs and workloads as isolated as possible
> are (or should become) components of this high level concept.
>
> Ideally CONFIG_ISOLATION=y would be a kernel feature that has almost
> zero overhead on normal workloads and on non-isolated CPUs, so that
> Linux distributions can enable it.

Using CONFIG_CPU_ISOLATION to capture all this stuff instead of
making CONFIG_NO_HZ_FULL do it seems plausible for naming.
However, this feels like just bombing the current naming to this
new name, right? I'd like to argue that this is orthogonal to adding
new isolation functionality into no_hz_full, as my patch series has
been doing. Perhaps we can defer this to a follow-up patch series?
I'm happy to do the work but I'm not sure we want to bundle all
that churn into the current patch series under consideration.
I can use cpu_isolation_xxx for naming in the current patch series
so we don't have to come back and bomb that later.

> Enabling CONFIG_ISOLATION=y should be the only 'kernel config' step
> needed: just like cpusets, the configuration of isolated CPUs should
> be a completely boot option free excercise that can be dynamically
> done and undone by the administrator via an intuitive interface.

Eventually isolation can be runtime-enabled, but for now I think
it makes sense for it to be boot-enabled. As Frederic suggested, we
can arrange full nohz to be runtime toggled in the future.
I agree that it should be reasonable to compile it in by default.

On 05/12/2015 07:48 AM, Peter Zijlstra wrote:
> But why do we need a CONFIG flag for something that has no content?
>
> That is, I do not see anything much; except the 'I want to stay in
> userspace and kill me otherwise' flag, and I'm not sure that warrants a
> CONFIG flag like this.
>
> Other than that, its all a combination of NOHZ_FULL and cpusets/isolcpus
> and whatnot.

There are three major pieces here - one is the STRICT piece
that you allude to, but there is also the piece where we quiesce
tasks in the kernel until no timer interrupts are pending, and the
piece that allows easy debugging of stray IRQs etc to isolated cpus.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-12 21:06:17

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
> [add peterz due to perf stuff]
>
> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <[email protected]> wrote:
>> Patch 6/6 proposes a mechanism to track down times when the
>> kernel screws up and delivers an IRQ to a userspace-only task.
>> Here, we're just trying to identify the times when an application
>> screws itself up out of cluelessness, and provide a mechanism
>> that allows the developer to easily figure out why and fix it.
>>
>> In particular, /proc/interrupts won't show syscalls or page faults,
>> which are two easy ways applications can screw themselves
>> when they think they're in userspace-only mode. Also, they don't
>> provide sufficient precision to make it clear what part of the
>> application caused the undesired kernel entry.
> Perf does, though, complete with context.

The perf_event suggestions are interesting, but I think it's plausible
for this to be an alternate way to debug the issues that STRICT
addresses.

>> In this case, killing the task is appropriate, since that's exactly
>> the semantics that have been asked for - it's like on architectures
>> that don't natively support unaligned accesses, but fake it relatively
>> slowly in the kernel, and in development you just say "give me a
>> SIGBUS when that happens" and in production you might say
>> "fix it up and let's try to keep going".
> I think more control is needed. I also think that, if we go this
> route, we should distinguish syscalls, synchronous non-syscall
> entries, and asynchronous non-syscall entries. They're quite
> different.

I don't think it's necessary to distinguish the types. As long as we
have a PC pointing to the instruction that triggered the problem,
we can see if it's a system call instruction, a memory write that
caused a page fault, a trap instruction, etc. We certainly could
add infrastructure to capture syscall numbers, fault/signal numbers,
etc etc, but I think it's overkill if it adds kernel overhead on
entry/exit.

>> A better implementation, I think, is to put the tests for "you
>> screwed up and synchronously entered the kernel" in
>> the syscall_trace_enter() code, which TIF_NOHZ already
>> gets us into;
> No, not unless you're planning on using that to distinguish syscalls
> from other stuff *and* people think that's justified.

So, the question is how we separate synchronous entries
from IRQs? At a high level, IRQs are kernel bugs (for cpu-isolated
tasks), and synchronous entries are application bugs. We'd
like to deliver a signal for the latter, and do some kind of
kernel diagnostics for the former. So we can't just add the
test in the context tracking code, which doesn't actually know
why we're entering or exiting.

That's why I was thinking that the syscall_trace_entry and
exception_enter paths were the best choices. I'm fairly sure
that exception_enter is only done for synchronous traps,
page faults, etc.
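
For concreteness, the check described above might look roughly like
this; dataplane_strict_violation() is a hypothetical helper, and the
call site would be each arch's syscall_trace_enter() path:

/* Flag a STRICT violation on a voluntary syscall, excluding prctl()
 * and the exit paths. */
static void dataplane_check_syscall(struct pt_regs *regs)
{
	int nr = syscall_get_nr(current, regs);

	if ((current->dataplane_flags & PR_DATAPLANE_STRICT) &&
	    nr != __NR_prctl && nr != __NR_exit && nr != __NR_exit_group)
		dataplane_strict_violation(regs);	/* hypothetical */
}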

Certainly on the tile architecture we include the trap number
in the pt_regs, so it's possible to just examine the pt_regs and
know why you entered or are exiting the kernel, but I don't
think we can rely on that for all architectures.

> It's far to easy to just make a tiny change to the entry code. Add a
> tiny trivial change here, a few lines of asm (that's you, audit!)
> there, some weird written-in-asm scheduling code over here, and you
> end up with the truly awful mess that we currently have.
>
> If it really makes sense for this stuff to go with context tracking,
> then fine, but we should *fix* the context tracking first rather than
> kludging around it. I already have a prototype patch for the relevant
> part of that.
>
>> there, we can test if the dataplane "strict" bit is
>> set and the syscall is not prctl(), then we generate the error.
>> (We'd exclude exit and exit_group here too, since we don't
>> need to shoot down a task that's just trying to kill itself.)
>> This needs a bit of platform-specific code for each platform,
>> but that doesn't seem like too big a problem.
> I'd rather avoid that, too. This feature isn't really arch-specific,
> so let's avoid the arch stuff if at all possible.

I'll put out a v2 of my patch that does both the things you
advise against :-) just so we can have a strawman to think
about how to do it better - unless you have a suggestion
offhand as to how we can better differentiate sync and async
entries into the kernel in a platform-independent way.

I could imagine modifying user_exit() and exception_enter()
to pass an identifier into the context system saying why they
were changing contexts, so we could have syscalls, trap
numbers, fault numbers, etc., and some way to query as
to whether they were synchronous or asynchronous, and
build this scheme on top of that, but I'm not sure the extra
infrastructure is worthwhile.
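
Roughly, the extra infrastructure being imagined here - purely a
sketch, not proposed code - might be something like:

/* Tag context transitions with a reason so the context-tracking
 * layer can tell synchronous entries from asynchronous ones. */
enum ctx_entry_reason {
	CTX_SYSCALL,	/* voluntary: syscall number available */
	CTX_TRAP,	/* synchronous: page fault, trap, etc. */
	CTX_IRQ,	/* asynchronous: device irq, IPI */
};

static inline bool ctx_entry_is_sync(enum ctx_entry_reason reason)
{
	return reason != CTX_IRQ;
}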

>> Likewise we can test in exception_enter() since that's only
>> called for all the synchronous user entries like page faults.
> Let's try to generalize a bit. There's also irq_entry and ist_enter,
> and some of the exception_enter cases are for synchronous entries
> while (IIRC -- could be wrong) others aren't always like that.

I don't think we need to generalize this piece. irq_entry()
shouldn't be reported by the STRICT mechanism but by
kernel bug reporting. For ist_enter(), it looks like if you're
coming from userspace it's just handled with exception_enter().
I'm more familiar with the tile architecture mechanisms than
with x86, though, to be honest.

>>> Also, I think that most users will be quite surprised if "strict
>>> dataplane" code causes any machine check on the system to kill your
>>> dataplane task.
>>
>> Fair point, and avoided by testing as described above instead.
>> (Though presumably in development it's not such a big deal,
>> and as I said you'd likely turn it off in production.)
> Until you forget to turn it off in production because it worked so
> nicely in development.

I guess that's an argument for using a non-fatal signal with a
handler from the get-go, since then even in production you'll
just end up with a slightly heavier-weight kernel overhead
(whatever stupid thing your application did, plus the time
spent in the signal handler), but then after that you can get
back to processing packets or whatever the app is doing.

You had mentioned some alternatives to a catchable signal
(a signal to some other process, or queuing to an fd); I think
it still seems reasonable to just deliver a signal to the process,
configurably by the prctl, and not do anything more complex.
Does this seem reasonable to you at this point?
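
For illustration, the userspace side of such a catchable-signal
arrangement might look like this sketch; the flag names, values, and
the idea that prctl() takes the signal number as an extra argument are
all assumptions here, not the posted interface:

#include <signal.h>
#include <sys/prctl.h>

#define PR_SET_DATAPLANE	47		/* hypothetical value */
#define PR_DATAPLANE_ENABLE	(1 << 0)
#define PR_DATAPLANE_STRICT	(1 << 1)	/* hypothetical bit */

static void strict_handler(int sig, siginfo_t *si, void *uc)
{
	/* Dump state in debug builds, then resume processing. */
	(void)sig; (void)si; (void)uc;
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = strict_handler,
				.sa_flags = SA_SIGINFO };

	sigaction(SIGUSR1, &sa, NULL);
	prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE | PR_DATAPLANE_STRICT,
	      SIGUSR1, 0, 0);
	/* ... userspace-only loop ... */
	return 0;
}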

> What if we added a mode to perf where delivery of a sample
> synchronously (or semi-synchronously by catching it on the next exit
> to userspace) freezes the delivering task? It would be like debugger
> support via perf.
>
> peterz, do you think this would be a sensible thing to add to perf?
> It would only make sense for some types of events (tracepoints and
> hw_breakpoints mostly, I think).

I suspect it's reasonable to consider this orthogonal, particularly
if there is some skid between the actual violation by the
application, and the freeze happening.

You pushed back somewhat on prctl() in favor of a quiesce()
syscall in your email, but it seemed like at the end of your
email you were adopting the prctl() perspective. Is that true?
I admit the prctl() still seems cleaner from my perspective.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-12 22:23:27

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On May 13, 2015 6:06 AM, "Chris Metcalf" <[email protected]> wrote:
>
> On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
>>
>> [add peterz due to perf stuff]
>>
>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <[email protected]> wrote:
>>>
>>> Patch 6/6 proposes a mechanism to track down times when the
>>> kernel screws up and delivers an IRQ to a userspace-only task.
>>> Here, we're just trying to identify the times when an application
>>> screws itself up out of cluelessness, and provide a mechanism
>>> that allows the developer to easily figure out why and fix it.
>>>
>>> In particular, /proc/interrupts won't show syscalls or page faults,
>>> which are two easy ways applications can screw themselves
>>> when they think they're in userspace-only mode. Also, they don't
>>> provide sufficient precision to make it clear what part of the
>>> application caused the undesired kernel entry.
>>
>> Perf does, though, complete with context.
>
>
> The perf_event suggestions are interesting, but I think it's plausible
> for this to be an alternate way to debug the issues that STRICT
> addresses.
>
>
>>> In this case, killing the task is appropriate, since that's exactly
>>> the semantics that have been asked for - it's like on architectures
>>> that don't natively support unaligned accesses, but fake it relatively
>>> slowly in the kernel, and in development you just say "give me a
>>> SIGBUS when that happens" and in production you might say
>>> "fix it up and let's try to keep going".
>>
>> I think more control is needed. I also think that, if we go this
>> route, we should distinguish syscalls, synchronous non-syscall
>> entries, and asynchronous non-syscall entries. They're quite
>> different.
>
>
> I don't think it's necessary to distinguish the types. As long as we
> have a PC pointing to the instruction that triggered the problem,
> we can see if it's a system call instruction, a memory write that
> caused a page fault, a trap instruction, etc.

Not true. PC right after a syscall insn could be any type of kernel
entry, and you can't even reliably tell whether the syscall insn was
executed or, on x86, whether it was a syscall at all. (x86 insns
can't be reliably decoded backwards.)

PC pointing at a load could be a page fault or an IPI.

> We certainly could
> add infrastructure to capture syscall numbers, fault/signal numbers,
> etc etc, but I think it's overkill if it adds kernel overhead on
> entry/exit.
>

None of these should add overhead.

>
>>> A better implementation, I think, is to put the tests for "you
>>> screwed up and synchronously entered the kernel" in
>>> the syscall_trace_enter() code, which TIF_NOHZ already
>>> gets us into;
>>
>> No, not unless you're planning on using that to distinguish syscalls
>> from other stuff *and* people think that's justified.
>
>
> So, the question is how we separate synchronous entries
> from IRQs? At a high level, IRQs are kernel bugs (for cpu-isolated
> tasks), and synchronous entries are application bugs. We'd
> like to deliver a signal for the latter, and do some kind of
> kernel diagnostics for the former. So we can't just add the
> test in the context tracking code, which doesn't actually know
> why we're entering or exiting.

Synchronous entries could be VM bugs, too.

>
> That's why I was thinking that the syscall_trace_entry and
> exception_enter paths were the best choices. I'm fairly sure
> that exception_enter is only done for synchronous traps,
> page faults, etc.

Maybe. Doing it through the actual entry/exit slow paths would be
overhead-free, although I'm not sure that IRQs have real slow paths
for entry.

>
> Certainly on the tile architecture we include the trap number
> in the pt_regs, so it's possible to just examine the pt_regs and
> know why you entered or are exiting the kernel, but I don't
> think we can rely on that for all architectures.

x86 can't do this.

> I'll put out a v2 of my patch that does both the things you
> advise against :-) just so we can have a strawman to think
> about how to do it better - unless you have a suggestion
> offhand as to how we can better differentiate sync and async
> entries into the kernel in a platform-independent way.
>
> I could imagine modifying user_exit() and exception_enter()
> to pass an identifier into the context system saying why they
> were changing contexts, so we could have syscalls, trap
> numbers, fault numbers, etc., and some way to query as
> to whether they were synchronous or asynchronous, and
> build this scheme on top of that, but I'm not sure the extra
> infrastructure is worthwhile.
>

I'll take a look.

Again, though, I think we really do need to distinguish at least MCE
and NMI (on x86) from the others.

>
>> What if we added a mode to perf where delivery of a sample
>> synchronously (or semi-synchronously by catching it on the next exit
>> to userspace) freezes the delivering task? It would be like debugger
>> support via perf.
>>
>> peterz, do you think this would be a sensible thing to add to perf?
>> It would only make sense for some types of events (tracepoints and
>> hw_breakpoints mostly, I think).
>
>
> I suspect it's reasonable to consider this orthogonal, particularly
> if there is some skid between the actual violation by the
> application, and the freeze happening.
>

I think it could be done without skid, except for async entries, but
for async entries we don't care about exact user state anyway.

> You pushed back somewhat on prctl() in favor of a quiesce()
> syscall in your email, but it seemed like at the end of your
> email you were adopting the prctl() perspective. Is that true?
> I admit the prctl() still seems cleaner from my perspective.
>

Prctl for the strict thing seems much more reasonable to me than prctl
for quiescing. Also, the scheduler people seem to thing that
quiescing should be automatic.

Anyway, I'll happily look at code and maybe even write more coherent
emails when I'm back in town in a week. Since you're thinking that
async entries should give kernel diagnostics instead of signals, maybe
the right thing to do is to separate them out completely and try to
address the individual entry types separately and as needed.

--Andy

2015-05-13 04:35:52

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On Tue, May 12, 2015 at 5:52 AM, Ingo Molnar <[email protected]> wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
>> > So if then a prctl() (or other system call) could be a shortcut
>> > to:
>> >
>> > - move the task to an isolated CPU
>> > - make sure there _is_ such an isolated domain available
>> >
>> > I.e. have some programmatic, kernel provided way for an
>> > application to be sure it's running in the right environment.
>> > Relying on random administration flags here and there won't cut
>> > it.
>>
>> No, we already have sched_setaffinity() and we should not duplicate
>> its ability to move tasks about.
>
> But sched_setaffinity() does not guarantee isolation - it's just a
> syscall to move a task to a set of CPUs, which might be isolated or
> not.
>
> What I suggested is that it might make sense to offer a system call,
> for example a sched_setparam() variant, that makes such guarantees.
>
> Say if user-space does:
>
> ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);
>
> ... then we would get the task moved to an isolated domain and get a 0
> return code if the kernel is able to do all that and if the current
> uid/namespace/etc. has the required permissions and such.
>
> ( BIND_ISOLATED will not replace the current p->policy value, so it's
> still possible to use the regular policies as well on top of this. )

I think we shouldn't have magic selection of an isolated domain.
Anyone using this has already configured some isolated CPUs and
probably wants to choose the CPU and, especially, NUMA node
themselves. Also, maybe it should be a special type of realtime
class/priority -- doing this should require RT permission IMO.

--Andy

2015-05-13 21:00:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On Tue, May 12, 2015 at 09:35:25PM -0700, Andy Lutomirski wrote:
> On Tue, May 12, 2015 at 5:52 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> >> > So if then a prctl() (or other system call) could be a shortcut
> >> > to:
> >> >
> >> > - move the task to an isolated CPU
> >> > - make sure there _is_ such an isolated domain available
> >> >
> >> > I.e. have some programmatic, kernel provided way for an
> >> > application to be sure it's running in the right environment.
> >> > Relying on random administration flags here and there won't cut
> >> > it.
> >>
> >> No, we already have sched_setaffinity() and we should not duplicate
> >> its ability to move tasks about.
> >
> > But sched_setaffinity() does not guarantee isolation - it's just a
> > syscall to move a task to a set of CPUs, which might be isolated or
> > not.
> >
> > What I suggested is that it might make sense to offer a system call,
> > for example a sched_setparam() variant, that makes such guarantees.
> >
> > Say if user-space does:
> >
> > ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);
> >
> > ... then we would get the task moved to an isolated domain and get a 0
> > return code if the kernel is able to do all that and if the current
> > uid/namespace/etc. has the required permissions and such.
> >
> > ( BIND_ISOLATED will not replace the current p->policy value, so it's
> > still possible to use the regular policies as well on top of this. )
>
> I think we shouldn't have magic selection of an isolated domain.
> Anyone using this has already configured some isolated CPUs and
> probably wants to choose the CPU and, especially, NUMA node
> themselves. Also, maybe it should be a special type of realtime
> class/priority -- doing this should require RT permission IMO.

I have no real argument against special permissions, but this feature
is totally orthogonal to realtime classes/priorities. It is perfectly
legitimate for a given CPU's single runnable task to be SCHED_OTHER,
for example.

Thanx, Paul

2015-05-14 20:54:57

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On 05/12/2015 05:33 AM, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
>> This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
>> kernel to quiesce any pending timer interrupts prior to returning
>> to userspace. When running with this mode set, sys calls (and page
>> faults, etc.) can be inordinately slow. However, user applications
>> that want to guarantee that no unexpected interrupts will occur
>> (even if they call into the kernel) can set this flag to guarantee
>> that semantics.
> Currently people hot-unplug and hot-plug the CPU to do this. Obviously
> that's a wee bit horrible :-)
>
> Not sure if a prctl like this is any better though. This is a CPU
> property, not a process one.

The CPU property aspects, I think, should be largely handled by
fixing kernel bugs that let work end up running on nohz_full cores
without having been explicitly requested to run there.

As you said in a follow-up email:

On 05/12/2015 06:38 AM, Peter Zijlstra wrote:
> Ideally we'd never have to clear the state because it should be
> impossible to get into this predicament in the first place.

What my prctl() proposal does is quiesce things that end up
happening specifically because the user process called on purpose
into the kernel. For example, perhaps RCU was invoked in the
kernel, and the core has to wait a timer tick to quiesce RCU.
Whatever causes it, the intent is that you're not allowed back into
userspace until everything has settled down from your call into
the kernel; the presumption is that it's all due to the kernel entry
that was just made, and not from other stray work.

In that sense, it's very appropriate for it to be a process property.

> ISTR people talking about 'quiesce' sysfs file, along side the hotplug
> stuff, I can't quite remember.

It seems somewhat similar (adding Viresh to the cc's) but does
seem like it might have been more intended to address the
CPU properties rather than process properties:

https://lkml.org/lkml/2014/4/4/99

One thing the original Tilera dataplane code did was to require
setting dataplane flags to succeed only on dataplane cores,
and only when the task had been affinitized to that single core.
This did not protect the task from later being re-affinitized in
a way that broke those assumptions, but I suppose you could
also imagine make sched_setaffinity() fail for such a process.
Somewhat unrelated, but it occurred to me in the context of this
reply, so what do you think? I can certainly add this to the
patch series if it seems like it makes setting the prctl() flags
more conservative.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-14 20:55:24

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE

On 05/12/2015 08:52 AM, Ingo Molnar wrote:
> What I suggested is that it might make sense to offer a system call,
> for example a sched_setparam() variant, that makes such guarantees.
>
> Say if user-space does:
>
> ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);
>
> ... then we would get the task moved to an isolated domain and get a 0
> return code if the kernel is able to do all that and if the current
> uid/namespace/etc. has the required permissions and such.

Unfortunately I don't know nearly as much about the scheduler
and scheduler policies as I might, since I mostly focused on
making the scheduler stay out of the way. :-) This does seem like
another way to set a policy bit on a process. I assume you
could only validly issue this call on a nohz_full core, and that
you're not assuming it migrates the task to such a core?

You suggested that BIND_ISOLATED would not replace the usual
scheduler policies, but perhaps SCHED_ISOLATED as a full
replacement would make sense - it would make it an error
to have any other schedulable task on that core. I guess that
brings it around to whether the "cpu_isolated" task just loses when
another task is scheduled on the core with it (the current
approach I'm proposing) or if it ends up truly owning the core
and other processes can be denied the right to run there:
which in that case clearly does get us into the area of requiring
privileges to set up, as Andy pointed out later.

This would leave the notion of "strict" as proposed elsewhere
as a separate thing, but presumably it could still be a prctl()
as originally proposed.

I admit I don't know enough to say whether this sounds like
a better approach than just using a prctl() to set the
cpu_isolated state. My instinct is that it's cleanest to avoid
requiring permissions to do this, and to simply enable the
quiescing semantics the process requested when it happens
to be alone on a core. If so, it's somewhat orthogonal to the
actual scheduler policy in force, so best not to conflate it with
the notion of scheduler code at all via sched_setscheduler()?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-14 20:55:52

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane

On 05/12/2015 09:12 AM, Paul E. McKenney wrote:
> On Tue, May 12, 2015 at 11:26:07AM +0200, Peter Zijlstra wrote:
>> On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
>>> While the current fallback to 1-second tick is still helpful for
>>> maintaining completely correct kernel semantics, processes using
>>> prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
>>> completely tickless, so don't bound the time_delta for such processes.
>>>
>>> This was previously discussed in
>>>
>>> https://lkml.org/lkml/2014/10/31/364
>>>
>>> and Thomas Gleixner observed that vruntime, load balancing data,
>>> load accounting, and other things might be impacted. Frederic
>>> Weisbecker similarly observed that allowing the tick to be indefinitely
>>> deferred just meant that no one would ever fix the underlying bugs.
>>> However it's at least true that the mode proposed in this patch can
>>> only be enabled on an isolcpus core, which may limit how important
>>> it is to maintain scheduler data correctly, for example.
>> So how is making this available going to help people fix the actual
>> problem?
> It will at least provide an environment where adding more of this
> problem might get punished. This would be an improvement over what
> we have today, namely that the 1HZ fallback timer silently forgives
> adding more problems of this sort.

So I guess the obvious question to ask is whether there is a mode
that can be dynamically enabled (/proc/sys/kernel/nohz_experimental
or whatever) where we allow turning off this tick - perhaps to make
it more likely tick-dependent code isn't added to the kernel as Paul
suggests, or perhaps to enable applications that want to avoid the
conservative tick fallback and are willing to do sufficient QA that
they are comfortable exploring possible issues with the 1Hz tick being
disabled?
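
For concreteness, such a knob might look roughly like the following
sketch; the name and semantics are purely illustrative:

/* 0 = keep the 1Hz fallback tick (default), 1 = allow full shutoff. */
static int sysctl_nohz_experimental;
static int zero;
static int one = 1;

static struct ctl_table nohz_experimental_table[] = {
	{
		.procname	= "nohz_experimental",
		.data		= &sysctl_nohz_experimental,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,
		.extra2		= &one,
	},
	{ }
};
/* registered under /proc/sys/kernel via register_sysctl("kernel", ...) */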

Paul, PeterZ, any thoughts on something along these lines?
Or another suggestion?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-15 15:05:56

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On 05/11/2015 09:47 PM, Mike Galbraith wrote:
> On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
>> On 05/11/2015 03:19 PM, Mike Galbraith wrote:
>>> I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
>>> that old static isolcpus was _supposed_ to crawl off and die, I know
>>> beyond doubt that having isolated a cpu as well as you can definitely
>>> does NOT imply that said cpu should become tickless.
>> True, at a high level, I agree that it would be better to have a
>> top-level concept like Frederic's proposed ISOLATION that includes
>> isolcpus and nohz_cpu (and other stuff as needed).
>>
>> That said, what you wrote above is wrong; even with the patch you
>> acked, setting isolcpus does not automatically turn on nohz_full for
>> a given cpu. The patch made it true the other way around: when
>> you say nohz_full, you automatically get isolcpus on that cpu too.
>> That does, at least, make sense for the semantics of nohz_full.
> I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
> Yes, with nohz_full currently being static, the old allegedly dying but
> also static isolcpus scheduler off switch is a convenient thing to wire
> the nohz_full CPU SET (<- hint;) property to.

Yes, I was responding to the bit where you said "having isolated a
cpu as well as you can does NOT imply it should become tickless",
but indeed, the "nohz_full -> isolcpus" patch didn't make that true.
In any case sounds like we were just talking past each other.

> BTW, another facet of this: Rik wants to make isolcpus immune to
> cpusets, which makes some sense, user did say isolcpus=, but that also
> makes isolcpus truly static. If the user now says nohz_full=, they lose
> the ability to deactivate CPU isolation, making the set fairly useless
> for anything other than HPC. Currently, the user can flip the isolation
> switch as he sees fit. He takes a size extra large performance hit for
> having said nohz_full=, but he doesn't lose generic utility.

I don't think I follow this completely. If the user says nohz_full=, he
probably doesn't care about deactivating isolcpus later, since that
defeats the entire purpose of the nohz_full= in the first place,
as far as I can tell. And when you say "anything other than HPC",
I'm not sure what you mean; as far as I know high-performance
computing only cares because it wants that extra 0.5% of the
cpu or whatever interrupts eat up, but just as a nice-to-have.
The real use case is high-performance userspace drivers where
the nohz_full cores are responding to real-time things like packet
arrivals with almost no latency to spare.

What is the generic utility you're envisioning for nohz_full cores
that have turned off scheduler isolation? I assume it's some
workload where you'd prefer not to have too many interrupts
but still are running multiple tasks, but in that case does it really
make much difference in practice?

> Thomas has nuked the hrtimer softirq.

Yes, this I didn't know. So I will drop my "no ksoftirqd" patch and
we will see if ksoftirqd emerges as an issue for my "cpu isolation"
stuff in the future; it may be that that was the only issue.

> Inlining softirqs may save a context switch, but adds cycles that we may
> consume at higher frequency than the thing we're avoiding.

Yes but consuming cycles is not nearly as much of a concern
as avoiding interrupts or scheduling, certainly for the case of
userspace drivers that I described above.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-15 15:10:54

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On 05/12/2015 06:46 AM, Peter Zijlstra wrote:
> On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
>> Please lets get NO_HZ_FULL up to par. That should be the main focus.
>>
> ACK, much of this dataplane stuff is (useful) hacks working around the
> fact that nohz_full just isn't complete.

There are enough disjoint threads on this topic that I want
to just touch base here and see if you have been convinced
on other threads that there is stuff beyond the hacks here:
in particular

1. The basic "dataplane" mode to arrange to do extra work on
return to kernel space that normally isn't warranted, to avoid
future IPIs, and additionally to wait in the kernel until any timer
interrupts required by the kernel invocation itself are done; and

2. The "strict" mode to allow a task to tell the kernel it isn't
planning on making any more such calls, and have the kernel
help diagnose any resulting application bugs.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-15 18:44:26

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Fri, 2015-05-15 at 11:05 -0400, Chris Metcalf wrote:
> On 05/11/2015 09:47 PM, Mike Galbraith wrote:
> > On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> >> On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> >>> I really shouldn't have acked nohz_full -> isolcpus. Besides the fact
> >>> that old static isolcpus was _supposed_ to crawl off and die, I know
> >>> beyond doubt that having isolated a cpu as well as you can definitely
> >>> does NOT imply that said cpu should become tickless.
> >> True, at a high level, I agree that it would be better to have a
> >> top-level concept like Frederic's proposed ISOLATION that includes
> >> isolcpus and nohz_cpu (and other stuff as needed).
> >>
> >> That said, what you wrote above is wrong; even with the patch you
> >> acked, setting isolcpus does not automatically turn on nohz_full for
> >> a given cpu. The patch made it true the other way around: when
> >> you say nohz_full, you automatically get isolcpus on that cpu too.
> >> That does, at least, make sense for the semantics of nohz_full.
> > I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
> > Yes, with nohz_full currently being static, the old allegedly dying but
> > also static isolcpus scheduler off switch is a convenient thing to wire
> > the nohz_full CPU SET (<- hint;) property to.
>
> Yes, I was responding to the bit where you said "having isolated a
> cpu as well as you can does NOT imply it should become tickless",
> but indeed, the "nohz_full -> isolcpus" patch didn't make that true.
> In any case sounds like we were just talking past each other.

Yup.

> > BTW, another facet of this: Rik wants to make isolcpus immune to
> > cpusets, which makes some sense, user did say isolcpus=, but that also
> > makes isolcpus truly static. If the user now says nohz_full=, they lose
> > the ability to deactivate CPU isolation, making the set fairly useless
> > for anything other than HPC. Currently, the user can flip the isolation
> > switch as he sees fit. He takes a size extra large performance hit for
> > having said nohz_full=, but he doesn't lose generic utility.
>
> I don't think I follow this completely. If the user says nohz_full=, he
> probably doesn't care about deactivating isolcpus later, since that
> defeats the entire purpose of the nohz_full= in the first place,
> as far as I can tell. And when you say "anything other than HPC",
> I'm not sure what you mean; as far as I know high-performance
> computing only cares because it wants that extra 0.5% of the
> cpu or whatever interrupts eat up, but just as a nice-to-have.
> The real use case is high-performance userspace drivers where
> the nohz_full cores are responding to real-time things like packet
> arrivals with almost no latency to spare.

Ok, verbosity on.

Currently, nohz_full is static, meaning in a dynamic environment, where
the user may not have a constant need for it, if you make it imply
isolcpus, then make isolcpus immutable, you have just needlessly taken
an option from the user. Those CPUs are no longer part of his generic
resource pool, and he has nothing to say about it.

> What is the generic utility you're envisioning for nohz_full cores
> that have turned off scheduler isolation? I assume it's some
> workload where you'd prefer not to have too many interrupts
> but still are running multiple tasks, but in that case does it really
> make much difference in practice?

Again, I think we're talking past one another.

I'm saying there is no need to mandate, nothing more. For your needs,
my needs whatever, that immutable may sound good, but in fact, it
removes flexibility, and for no good reason.

This shows immediately in simple testing. Do I need nohz_full? Hell
no, only for testing. If I want to test, I obviously need it for a
while, and yes, I can reboot... but what's the difference between me the
silly tester who needs it only to see if it works at all, and how well,
and some guy who does something critical once in a while, or a company
with a pool of big boxen that they reconfigure on the fly to meet
whatever dynamic needs?

Just because the nohz_full feature itself is currently static is no
reason to put users thereof in a straitjacket by mandating that any
set they define irrevocably disappears from the generic resource pool.
Those CPUs are useful until the moment someone cripples them. Making
nohz_full imply isolcpus sounds perfectly fine until someone comes along
and makes isolcpus immutable (Rik's patch), at which point the user
loses a choice due to two changes that _alone_ sound perfectly fine.

See what I'm saying now?

> > Thomas has nuked the hrtimer softirq.
>
> Yes, this I didn't know. So I will drop my "no ksoftirqd" patch and
> we will see if ksoftirqs emerge as an issue for my "cpu isolation"
> stuff in the future; it may be that that was the only issue.
>
> > Inlining softirqs may save a context switch, but adds cycles that we may
> > consume at higher frequency than the thing we're avoiding.
>
> Yes but consuming cycles is not nearly as much of a concern
> as avoiding interrupts or scheduling, certainly for the case of
> userspace drivers that I described above.

If you're raising softirqs in an SMP kernel, you're also doing something
that puts you at very serious risk of meeting the jitter monster, locks,
and worse, sleeping locks, no?

-Mike

2015-05-15 21:25:31

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

On 05/12/2015 06:23 PM, Andy Lutomirski wrote:
> On May 13, 2015 6:06 AM, "Chris Metcalf" <[email protected]> wrote:
>> On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
>>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <[email protected]> wrote:
>>>> In this case, killing the task is appropriate, since that's exactly
>>>> the semantics that have been asked for - it's like on architectures
>>>> that don't natively support unaligned accesses, but fake it relatively
>>>> slowly in the kernel, and in development you just say "give me a
>>>> SIGBUS when that happens" and in production you might say
>>>> "fix it up and let's try to keep going".
>>> I think more control is needed. I also think that, if we go this
>>> route, we should distinguish syscalls, synchronous non-syscall
>>> entries, and asynchronous non-syscall entries. They're quite
>>> different.
>>
>> I don't think it's necessary to distinguish the types. As long as we
>> have a PC pointing to the instruction that triggered the problem,
>> we can see if it's a system call instruction, a memory write that
>> caused a page fault, a trap instruction, etc.
> Not true. PC right after a syscall insn could be any type of kernel
> entry, and you can't even reliably tell whether the syscall insn was
> executed or, on x86, whether it was a syscall at all. (x86 insns
> can't be reliably decoded backwards.)
>
> PC pointing at a load could be a page fault or an IPI.

All that we are trying to do with this API, though, is distinguish
synchronous faults. So IPIs, etc., should not be happening
(they would be bugs), and hopefully we are mostly just
distinguishing different types of synchronous program entries.
That said, I did add a siginfo flag to differentiate syscalls from
other synchronous entries, and I'm open to looking at more such
distinctions if they seem useful.

> Again, though, I think we really do need to distinguish at least MCE
> and NMI (on x86) from the others.

Yes, those are both interesting cases, and I'm not entirely
sure what the right way to handle them is - for example,
we would likely disable STRICT if you are running with perf enabled.

I look forward to hearing more when you're back next week!

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-15 21:26:46

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 0/5] support "cpu_isolated" mode for nohz_full

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode. A prctl()
option (PR_SET_CPU_ISOLATED) is added to control whether processes have
requested this stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2,
in turn based on 4.1-rc1) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v2:
rename "dataplane" to "cpu_isolated"
drop ksoftirqd suppression changes (believed no longer needed)
merge previous "QUIESCE" functionality into baseline functionality
explicitly track syscalls and exceptions for "STRICT" functionality
allow configuring a signal to be delivered for STRICT mode failures
move debug tracking to irq_enter(), not irq_exit()

Note: I have not yet removed the hack to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555).

Chris Metcalf (5):
nohz_full: add support for "cpu_isolated" mode
nohz: support PR_CPU_ISOLATED_STRICT mode
nohz: cpu_isolated strict mode configurable signal
nohz: add cpu_isolated_debug boot flag
nohz: cpu_isolated: allow tick to be fully disabled

Documentation/kernel-parameters.txt | 6 +++
arch/tile/kernel/ptrace.c | 6 ++-
arch/tile/mm/homecache.c | 5 +-
arch/x86/kernel/ptrace.c | 2 +
include/linux/context_tracking.h | 11 +++--
include/linux/sched.h | 3 ++
include/linux/tick.h | 28 +++++++++++
include/uapi/linux/prctl.h | 8 +++
kernel/context_tracking.c | 12 +++--
kernel/irq_work.c | 4 +-
kernel/sched/core.c | 18 +++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 6 +++
kernel/sys.c | 8 +++
kernel/time/tick-sched.c | 98 ++++++++++++++++++++++++++++++++++++-
16 files changed, 214 insertions(+), 10 deletions(-)

--
2.1.2

2015-05-15 21:27:57

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl(). When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only two actions taken. First, the task
calls lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core. Then, the code checks for
pending timer interrupts and quiesces until they are no longer pending.
As a result, sys calls (and page faults, etc.) can be inordinately slow.
However, this quiescing guarantees that no unexpected interrupts will
occur, even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/sched.h | 3 +++
include/linux/tick.h | 10 +++++++++
include/uapi/linux/prctl.h | 5 +++++
kernel/context_tracking.c | 3 +++
kernel/sys.c | 8 ++++++++
kernel/time/tick-sched.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 80 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..fb4ba400d7e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
#endif
+#ifdef CONFIG_NO_HZ_FULL
+ unsigned int cpu_isolated_flags;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..ec1953474a65 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
#include <linux/context_tracking_state.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
+#include <linux/prctl.h>

#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}

+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick(void) { }
static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED 47
+#define PR_GET_CPU_ISOLATED 48
+# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..66739d7c1350 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/tick.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (tick_nohz_is_cpu_isolated())
+ tick_nohz_cpu_isolated_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..3fd9e47f8fc8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_NO_HZ_FULL
+ case PR_SET_CPU_ISOLATED:
+ me->cpu_isolated_flags = arg2;
+ break;
+ case PR_GET_CPU_ISOLATED:
+ error = me->cpu_isolated_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..f1551c946c45 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/swap.h>

#include <asm/irq_regs.h>

@@ -389,6 +390,56 @@ void __init tick_nohz_init(void)
pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
cpumask_pr_args(tick_nohz_full_mask));
}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start));
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+
+ /* Idle with interrupts enabled and wait for the tick. */
+ set_current_state(TASK_INTERRUPTIBLE);
+ arch_cpu_idle();
+ set_current_state(TASK_RUNNING);
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start));
+ dump_stack();
+ }
+}
+
#endif

/*
--
2.1.2

2015-05-15 21:28:04

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves. In particular, if it
enters the kernel via system call, page fault, or any of a number of other
synchronous traps, it may be unexpectedly exposed to long latencies.
Add a simple flag that puts the process into a state where any such
kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit to
current->cpu_isolated_flags that is set when prctl() sets the flags.
We check the bit on syscall entry as well as on any exception_enter().
The prctl() syscall is exempted so that the bit can be cleared again
later, and exit/exit_group are exempted so that the task can exit
without a pointless fatal signal being delivered on the way out.
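
As a sketch of the intended usage, assuming the constants above and a
task already pinned to a nohz_full core as in patch 1/5 (the counting
loop is a stand-in for a real polling workload):

#define _GNU_SOURCE
#include <sys/prctl.h>

#ifndef PR_CPU_ISOLATED_STRICT
# define PR_SET_CPU_ISOLATED    47
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
# define PR_CPU_ISOLATED_STRICT (1 << 1)
#endif

int main(void)
{
        volatile unsigned long n = 100000000UL;

        prctl(PR_SET_CPU_ISOLATED,
              PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT, 0, 0, 0);

        /* From here on, any syscall, page fault, or other synchronous
         * kernel entry is fatal, so do pure computation only. */
        while (--n)
                ;

        /* Exiting is still allowed: returning reaches exit_group(),
         * which the strict check ignores. */
        return 0;
}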

This change adds the syscall-detection hooks only for x86 and tile;
I am happy to try to add more for additional platforms in the final
version.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/ptrace.c | 6 +++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/tick.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++
7 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..d4e43a13bab1 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(
+ regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..7f784054ddea 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838256b4..d042f4cda39d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/tick.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);
extern void __context_tracking_task_switch(struct task_struct *prev,
@@ -37,8 +38,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ec1953474a65..b7ffb10337ba 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,6 +147,8 @@ extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
+extern void tick_nohz_cpu_isolated_syscall(int nr);
+extern void tick_nohz_cpu_isolated_exception(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -157,6 +159,8 @@ static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
+static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
+static inline void tick_nohz_cpu_isolated_exception(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
@@ -189,4 +193,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk)
__tick_nohz_task_switch(tsk);
}

+static inline bool tick_nohz_cpu_isolated_strict(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags &
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_CPU_ISOLATED 47
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+# define PR_CPU_ISOLATED_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 66739d7c1350..c82509caa42e 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -131,15 +131,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (__this_cpu_read(context_tracking.state) == state) {
@@ -150,6 +151,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -157,6 +159,7 @@ void context_tracking_exit(enum ctx_state state)
__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
}
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f1551c946c45..273820cd484a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
#include <linux/swap.h>

#include <asm/irq_regs.h>
+#include <asm/unistd.h>

#include "tick-internal.h"

@@ -440,6 +441,43 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

+static void kill_cpu_isolated_strict_task(void)
+{
+ dump_stack();
+ current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void tick_nohz_cpu_isolated_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void tick_nohz_cpu_isolated_exception(void)
+{
+ pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_cpu_isolated_strict_task();
+}
+
#endif

/*
--
2.1.2

2015-05-15 21:28:10

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 3/5] nohz: cpu_isolated strict mode configurable signal

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
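
A sketch of a task electing to catch the violation instead of being
killed, using the macros above (the signal choice is illustrative;
note that the kernel clears the _ENABLE bit before delivering the
signal, so execution continues after the handler returns):

#define _GNU_SOURCE
#include <signal.h>
#include <sys/prctl.h>

#ifndef PR_CPU_ISOLATED_SET_SIG
# define PR_SET_CPU_ISOLATED            47
# define PR_CPU_ISOLATED_ENABLE         (1 << 0)
# define PR_CPU_ISOLATED_STRICT         (1 << 1)
# define PR_CPU_ISOLATED_SET_SIG(sig)   (((sig) & 0x7f) << 8)
#endif

/* 1 if the last violation was a syscall, 0 for an exception. */
static volatile sig_atomic_t violation_was_syscall = -1;

static void on_violation(int sig, siginfo_t *info, void *uc)
{
        violation_was_syscall = info->si_code;
}

int main(void)
{
        struct sigaction sa = { 0 };

        sa.sa_sigaction = on_violation;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGUSR1, &sa, NULL);

        /* Deliver SIGUSR1 instead of SIGKILL on a strict violation. */
        prctl(PR_SET_CPU_ISOLATED,
              PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT |
              PR_CPU_ISOLATED_SET_SIG(SIGUSR1), 0, 0, 0);

        /* ... isolated work; any stray kernel entry runs the handler
         * and drops the task out of cpu_isolated mode. */
        return 0;
}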

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/time/tick-sched.c | 15 +++++++++++----
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
# define PR_CPU_ISOLATED_STRICT (1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 273820cd484a..772be78f926c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -441,11 +441,18 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

-static void kill_cpu_isolated_strict_task(void)
+static void kill_cpu_isolated_strict_task(int is_syscall)
{
+ siginfo_t info = {};
+ int sig;
+
dump_stack();
current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
- send_sig(SIGKILL, current, 1);
+
+ sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+ info.si_signo = sig;
+ info.si_code = is_syscall;
+ send_sig_info(sig, &info, current);
}

/*
@@ -464,7 +471,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall)

pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
current->comm, current->pid, syscall);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(1);
}

/*
@@ -475,7 +482,7 @@ void tick_nohz_cpu_isolated_exception(void)
{
pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
current->comm, current->pid);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(0);
}

#endif
--
2.1.2

2015-05-15 21:28:47

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 4/5] nohz: add cpu_isolated_debug boot flag

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_CPU_ISOLATED_ENABLE mode. Such processes should
get no interrupts from the kernel; if they do, and this boot flag
is specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a cpu_isolated core
has unexpectedly entered the kernel. But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
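
For reference, the hook this patch adds at each interrupt-generating
site is a single call just before the IPI is raised; here is a sketch
of a hypothetical send path (mirroring the irq_work and smp changes
below):

/* Hypothetical cross-cpu notification path in kernel code. */
static void notify_remote_cpu(int cpu)
{
        /*
         * A no-op unless cpu_isolated_debug was given on the command
         * line and the target is a nohz_full core currently running a
         * PR_CPU_ISOLATED_ENABLE task, in which case it logs the
         * pending interrupt and dumps the sender's stack.
         */
        tick_nohz_cpu_isolated_debug(cpu);
        arch_send_call_function_single_ipi(cpu);
}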

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 6 ++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/tick.h | 2 ++
kernel/irq_work.c | 4 +++-
kernel/sched/core.c | 18 ++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 6 ++++++
8 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f6befa9855c1..2b4c89225d25 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -743,6 +743,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
/proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.txt.

+ cpu_isolated_debug [KNL]
+ In kernels built with CONFIG_NO_HZ_FULL and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_CPU_ISOLATED_ENABLE.
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..f336880e1b01 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/tick.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ tick_nohz_cpu_isolated_debug(cpu);
+ }
}

/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b7ffb10337ba..0b0d76106b8c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -149,6 +149,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
extern void tick_nohz_cpu_isolated_syscall(int nr);
extern void tick_nohz_cpu_isolated_exception(void);
+extern void tick_nohz_cpu_isolated_debug(int cpu);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -161,6 +162,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
static inline void tick_nohz_cpu_isolated_exception(void) { }
+static inline void tick_nohz_cpu_isolated_debug(int cpu) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..7f35c90346de 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..7315e7272e94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,24 @@ bool sched_can_stop_tick(void)

return true;
}
+
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static int cpu_isolated_debug;
+static int __init cpu_isolated_debug_func(char *str)
+{
+ cpu_isolated_debug = true;
+ return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void tick_nohz_cpu_isolated_debug(int cpu)
+{
+ if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+ pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+ dump_stack();
+ }
+}
#endif /* CONFIG_NO_HZ_FULL */

void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..1a810ac2656e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_NO_HZ_FULL
+ /* If the task is being killed, don't complain about cpu_isolated. */
+ if (state & TASK_WAKEKILL)
+ t->cpu_isolated_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..6b7d8e2c8af4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/tick.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ tick_nohz_cpu_isolated_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..333872925ff6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
+#include <linux/context_tracking.h>
#include <linux/tick.h>
#include <linux/irq.h>

@@ -335,6 +336,11 @@ void irq_enter(void)
_local_bh_enable();
}

+ if (context_tracking_cpu_is_enabled() &&
+ context_tracking_in_user() &&
+ !in_interrupt())
+ tick_nohz_cpu_isolated_debug(smp_processor_id());
+
__irq_enter();
}

--
2.1.2

2015-05-15 21:28:27

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2 5/5] nohz: cpu_isolated: allow tick to be fully disabled

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.

This was previously discussed in

https://lkml.org/lkml/2014/10/31/364

and Thomas Gleixner observed that vruntime, load balancing data,
load accounting, and other things might be impacted. Frederic
Weisbecker similarly observed that allowing the tick to be indefinitely
deferred just meant that no one would ever fix the underlying bugs.
However it's at least true that the mode proposed in this patch can
only be enabled on an isolcpus core, which may limit how important
it is to maintain scheduler data correctly, for example.

It's also worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2008) and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc.). So these semantics are very
useful if we can convince ourselves that doing this is safe.

Signed-off-by: Chris Metcalf <[email protected]>
---
Note: I have kept this in the series despite PeterZ's nack, since it
didn't seem resolved in the original thread from v1 of the patch
(https://lkml.org/lkml/2015/5/8/555).

kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 772be78f926c..be4db5d81ada 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -727,7 +727,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
}

#ifdef CONFIG_NO_HZ_FULL
- if (!ts->inidle) {
+ if (!ts->inidle && !tick_nohz_is_cpu_isolated()) {
time_delta = min(time_delta,
scheduler_tick_max_deferment());
}
--
2.1.2

2015-05-15 22:17:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode

On Fri, 15 May 2015, Chris Metcalf wrote:
> +/*
> + * We normally return immediately to userspace.
> + *
> + * In "cpu_isolated" mode we wait until no more interrupts are
> + * pending. Otherwise we nap with interrupts enabled and wait for the
> + * next interrupt to fire, then loop back and retry.
> + *
> + * Note that if you schedule two "cpu_isolated" processes on the same
> + * core, neither will ever leave the kernel, and one will have to be
> + * killed manually.

And why are we not preventing that situation in the first place? The
scheduler should be able to figure that out easily.

> + Otherwise in situations where another process is
> + * in the runqueue on this cpu, this task will just wait for that
> + * other task to go idle before returning to user space.
> + */
> +void tick_nohz_cpu_isolated_enter(void)
> +{
> + struct clock_event_device *dev =
> + __this_cpu_read(tick_cpu_device.evtdev);
> + struct task_struct *task = current;
> + unsigned long start = jiffies;
> + bool warned = false;
> +
> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> + lru_add_drain();
> +
> + while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {

What's the ACCESS_ONCE for?

> + if (!warned && (jiffies - start) >= (5 * HZ)) {
> + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n",
> + task->comm, task->pid, smp_processor_id(),
> + (jiffies - start));

What additional value has the jiffies delta over a plain human
readable '5sec'?

> + warned = true;
> + }
> + if (should_resched())
> + schedule();
> + if (test_thread_flag(TIF_SIGPENDING))
> + break;
> +
> + /* Idle with interrupts enabled and wait for the tick. */
> + set_current_state(TASK_INTERRUPTIBLE);
> + arch_cpu_idle();

Oh NO! Not another variant of fake idle task. The idle implementations
can call into code which rightfully expects that the CPU is actually
IDLE.

I wasted enough time already debugging the resulting wreckage. Feel
free to use it for experimental purposes, but this is not going
anywhere near to a mainline kernel.

I completely understand WHY you want to do that, but we need proper
mechanisms for that and not some duct tape engineering band aids which
will create hard to debug side effects.

Hint: It's a scheduler job to make sure that the machine has quiesced
_BEFORE_ letting the magic task off to user land.

> + set_current_state(TASK_RUNNING);
> + }
> + if (warned) {
> + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n",
> + task->comm, task->pid, smp_processor_id(),
> + (jiffies - start));
> + dump_stack();

And that dump_stack() tells us which important information?

tick_nohz_cpu_isolated_enter
context_tracking_enter
context_tracking_user_enter
arch_return_to_user_code

Thanks,

tglx

2015-05-26 19:52:11

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

Thanks for the clarification, and sorry for the slow reply; I had a busy
week of meetings last week, and then the long weekend in the U.S.

On 05/15/2015 02:44 PM, Mike Galbraith wrote:
> Just because the nohz_full feature itself is currently static is no
> reason to put users thereof in a straitjacket by mandating that any
> set they define irrevocably disappears from the generic resource pool.
> Those CPUS are useful until the moment someone cripples them, which
> making nohz_full imply isolcpus does if isolcpus then also becomes
> immutable, which Rik's patch does. Making nohz_full imply isolcpus
> sounds perfectly fine until someone comes along and makes isolcpus
> immutable (Rik's patch), at which point the user loses a choice due to
> two people making it imply things that _alone_ sound perfectly fine.
>
> See what I'm saying now?

That does make sense; my argument was that 99% of the time when
someone specifies nohz_full they also need isolcpus. You're right
that someone playing with nohz_full would be unpleasantly surprised.
And of course having more flexibility always feels like a plus.
On balance I suspect it's still better to make command line arguments
handle the common cases most succinctly.

Hopefully we'll get to a point where all of this is dynamic and how
we play with the boot arguments no longer matters. If not, perhaps
we revisit this and make a cpu_isolation=1-15 type command line
argument that enables isolcpus and nohz_full both.

>>> Thomas has nuked the hrtimer softirq.
>> Yes, this I didn't know. So I will drop my "no ksoftirqd" patch and
>> we will see if ksoftirqs emerge as an issue for my "cpu isolation"
>> stuff in the future; it may be that that was the only issue.
>>
>>> Inlining softirqs may save a context switch, but adds cycles that we may
>>> consume at higher frequency than the thing we're avoiding.
>> Yes but consuming cycles is not nearly as much of a concern
>> as avoiding interrupts or scheduling, certainly for the case of
>> userspace drivers that I described above.
> If you're raising softirqs in an SMP kernel, you're also doing something
> that puts you at very serious risk of meeting the jitter monster, locks,
> and worse, sleeping locks, no?

The softirqs were being raised by third parties for hrtimer, not by
the application code itself, if I remember correctly. In any case
this appears not to be an issue for nohz_full any more now.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-05-27 03:28:07

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full

On Tue, 2015-05-26 at 15:51 -0400, Chris Metcalf wrote:

> On balance I suspect it's still better to make command line arguments
> handle the common cases most succinctly.

I prefer user specifies precisely, but yeah, that entails more typing.

Idle curiosity: can SGI monster from hell boot a NO_HZ_FULL_ALL kernel,
w/wo it implying isolcpus? Readers having same and a reactor to power
it in their basement, please test.

-Mike

2015-06-03 15:29:55

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 0/5] support "cpu_isolated" mode for nohz_full

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode. A prctl()
option (PR_SET_CPU_ISOLATED) is added to control whether processes have
requested this stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2,
in turn based on 4.1-rc1) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v3:
remove dependency on cpu_idle subsystem (Thomas Gleixner)
use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
use seconds for console messages instead of jiffies (Thomas Gleixner)
updated commit description for patch 5/5

v2:
rename "dataplane" to "cpu_isolated"
drop ksoftirqd suppression changes (believed no longer needed)
merge previous "QUIESCE" functionality into baseline functionality
explicitly track syscalls and exceptions for "STRICT" functionality
allow configuring a signal to be delivered for STRICT mode failures
move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit that disables the 1Hz timer tick
fallback, despite PeterZ's nack, pending a decision on that thread
(https://lkml.org/lkml/2015/5/8/555); also, without that commit,
cpu_isolated threads would never re-enter userspace, since a tick
would always be pending.

Chris Metcalf (5):
nohz_full: add support for "cpu_isolated" mode
nohz: support PR_CPU_ISOLATED_STRICT mode
nohz: cpu_isolated strict mode configurable signal
nohz: add cpu_isolated_debug boot flag
nohz: cpu_isolated: allow tick to be fully disabled

Documentation/kernel-parameters.txt | 6 +++
arch/tile/kernel/process.c | 9 ++++
arch/tile/kernel/ptrace.c | 6 ++-
arch/tile/mm/homecache.c | 5 +-
arch/x86/kernel/ptrace.c | 2 +
include/linux/context_tracking.h | 11 ++--
include/linux/sched.h | 3 ++
include/linux/tick.h | 28 ++++++++++
include/uapi/linux/prctl.h | 8 +++
kernel/context_tracking.c | 12 +++--
kernel/irq_work.c | 4 +-
kernel/sched/core.c | 18 +++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 6 +++
kernel/sys.c | 8 +++
kernel/time/tick-sched.c | 104 +++++++++++++++++++++++++++++++++++-
17 files changed, 229 insertions(+), 10 deletions(-)

--
2.1.2

2015-06-03 15:30:29

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 1/5] nohz_full: add support for "cpu_isolated" mode

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl(). When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only two actions taken. First, the task
calls lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core. Then, the code checks for
pending timer interrupts and quiesces until they are no longer pending.
As a result, syscalls (and page faults, etc.) can be inordinately slow.
However, this quiescing guarantees that no unexpected interrupts will
occur, even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/process.c | 9 ++++++++
include/linux/sched.h | 3 +++
include/linux/tick.h | 10 ++++++++
include/uapi/linux/prctl.h | 5 ++++
kernel/context_tracking.c | 3 +++
kernel/sys.c | 8 +++++++
kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 95 insertions(+)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..e20c3f4a6a82 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
_cpu_idle();
}

+#ifdef CONFIG_NO_HZ_FULL
+void tick_nohz_cpu_isolated_wait(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ _cpu_idle();
+ set_current_state(TASK_RUNNING);
+}
+#endif
+
/*
* Release a thread_info structure
*/
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..fb4ba400d7e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
#endif
+#ifdef CONFIG_NO_HZ_FULL
+ unsigned int cpu_isolated_flags;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..ec1953474a65 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
#include <linux/context_tracking_state.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
+#include <linux/prctl.h>

#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}

+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick(void) { }
static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED 47
+#define PR_GET_CPU_ISOLATED 48
+# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..66739d7c1350 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/tick.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (tick_nohz_is_cpu_isolated())
+ tick_nohz_cpu_isolated_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..3fd9e47f8fc8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_NO_HZ_FULL
+ case PR_SET_CPU_ISOLATED:
+ me->cpu_isolated_flags = arg2;
+ break;
+ case PR_GET_CPU_ISOLATED:
+ error = me->cpu_isolated_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..f6236b66788f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/swap.h>

#include <asm/irq_regs.h>

@@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
cpumask_pr_args(tick_nohz_full_mask));
}
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak tick_nohz_cpu_isolated_wait(void)
+{
+ cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+ tick_nohz_cpu_isolated_wait();
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ dump_stack();
+ }
+}
+
#endif

/*
--
2.1.2

2015-06-03 15:30:11

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves. In particular, if it
enters the kernel via system call, page fault, or any of a number of other
synchronous traps, it may be unexpectedly exposed to long latencies.
Add a simple flag that puts the process into a state where any such
kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit to
current->cpu_isolated_flags that is set when prctl() sets the flags.
We check the bit on syscall entry as well as on any exception_enter().
The prctl() syscall is exempted so that the bit can be cleared again
later, and exit/exit_group are exempted so that the task can exit
without a pointless fatal signal being delivered on the way out.

This change adds the syscall-detection hooks only for x86 and tile;
I am happy to try to add more for additional platforms in the final
version.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/ptrace.c | 6 +++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/tick.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++
7 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..d4e43a13bab1 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(
+ regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..7f784054ddea 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838256b4..d042f4cda39d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/tick.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);
extern void __context_tracking_task_switch(struct task_struct *prev,
@@ -37,8 +38,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ec1953474a65..b7ffb10337ba 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,6 +147,8 @@ extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
+extern void tick_nohz_cpu_isolated_syscall(int nr);
+extern void tick_nohz_cpu_isolated_exception(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -157,6 +159,8 @@ static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
+static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
+static inline void tick_nohz_cpu_isolated_exception(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
@@ -189,4 +193,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk)
__tick_nohz_task_switch(tsk);
}

+static inline bool tick_nohz_cpu_isolated_strict(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags &
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_CPU_ISOLATED 47
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+# define PR_CPU_ISOLATED_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 66739d7c1350..c82509caa42e 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -131,15 +131,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (__this_cpu_read(context_tracking.state) == state) {
@@ -150,6 +151,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -157,6 +159,7 @@ void context_tracking_exit(enum ctx_state state)
__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
}
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f6236b66788f..ce3bcf29a0f6 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
#include <linux/swap.h>

#include <asm/irq_regs.h>
+#include <asm/unistd.h>

#include "tick-internal.h"

@@ -446,6 +447,43 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

+static void kill_cpu_isolated_strict_task(void)
+{
+ dump_stack();
+ current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void tick_nohz_cpu_isolated_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void tick_nohz_cpu_isolated_exception(void)
+{
+ pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_cpu_isolated_strict_task();
+}
+
#endif

/*
--
2.1.2

2015-06-03 15:30:22

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 3/5] nohz: cpu_isolated strict mode configurable signal

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/time/tick-sched.c | 15 +++++++++++----
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
# define PR_CPU_ISOLATED_STRICT (1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ce3bcf29a0f6..f09c003da22f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -447,11 +447,18 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

-static void kill_cpu_isolated_strict_task(void)
+static void kill_cpu_isolated_strict_task(int is_syscall)
{
+ siginfo_t info = {};
+ int sig;
+
dump_stack();
current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
- send_sig(SIGKILL, current, 1);
+
+ sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+ info.si_signo = sig;
+ info.si_code = is_syscall;
+ send_sig_info(sig, &info, current);
}

/*
@@ -470,7 +477,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall)

pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
current->comm, current->pid, syscall);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(1);
}

/*
@@ -481,7 +488,7 @@ void tick_nohz_cpu_isolated_exception(void)
{
pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
current->comm, current->pid);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(0);
}

#endif
--
2.1.2

2015-06-03 15:31:29

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 4/5] nohz: add cpu_isolated_debug boot flag

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_CPU_ISOLATED_ENABLE mode. Such processes should
get no interrupts from the kernel; if they do, and this boot flag
is specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a cpu_isolated core
has unexpectedly entered the kernel. But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 6 ++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/tick.h | 2 ++
kernel/irq_work.c | 4 +++-
kernel/sched/core.c | 18 ++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 6 ++++++
8 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f6befa9855c1..2b4c89225d25 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -743,6 +743,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
/proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.txt.

+ cpu_isolated_debug [KNL]
+ In kernels built with CONFIG_NO_HZ_FULL and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_CPU_ISOLATED_ENABLE.
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..f336880e1b01 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/tick.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ tick_nohz_cpu_isolated_debug(cpu);
+ }
}

/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b7ffb10337ba..0b0d76106b8c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -149,6 +149,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
extern void tick_nohz_cpu_isolated_syscall(int nr);
extern void tick_nohz_cpu_isolated_exception(void);
+extern void tick_nohz_cpu_isolated_debug(int cpu);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -161,6 +162,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
static inline void tick_nohz_cpu_isolated_exception(void) { }
+static inline void tick_nohz_cpu_isolated_debug(int cpu) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..7f35c90346de 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..7315e7272e94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,24 @@ bool sched_can_stop_tick(void)

return true;
}
+
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static int cpu_isolated_debug;
+static int __init cpu_isolated_debug_func(char *str)
+{
+ cpu_isolated_debug = true;
+ return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void tick_nohz_cpu_isolated_debug(int cpu)
+{
+ if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+ pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+ dump_stack();
+ }
+}
#endif /* CONFIG_NO_HZ_FULL */

void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..1a810ac2656e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_NO_HZ_FULL
+ /* If the task is being killed, don't complain about cpu_isolated. */
+ if (state & TASK_WAKEKILL)
+ t->cpu_isolated_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..6b7d8e2c8af4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/tick.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ tick_nohz_cpu_isolated_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..333872925ff6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
+#include <linux/context_tracking.h>
#include <linux/tick.h>
#include <linux/irq.h>

@@ -335,6 +336,11 @@ void irq_enter(void)
_local_bh_enable();
}

+ if (context_tracking_cpu_is_enabled() &&
+ context_tracking_in_user() &&
+ !in_interrupt())
+ tick_nohz_cpu_isolated_debug(smp_processor_id());
+
__irq_enter();
}

--
2.1.2

2015-06-03 15:30:40

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v3 5/5] nohz: cpu_isolated: allow tick to be fully disabled

While the current fallback to a 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.
In addition, due to the way such processes quiesce by waiting for
the timer tick to stop prior to returning to userspace, without this
commit it won't be possible to use the cpu_isolated mode at all.

Removing the 1-second cap was previously discussed (see link below)
and Thomas Gleixner observed that vruntime, load balancing data, load
accounting, and other things might be impacted. Frederic Weisbecker
similarly observed that allowing the tick to be indefinitely deferred just
meant that no one would ever fix the underlying bugs. However, it's at
least true that the mode proposed in this patch can only be enabled on an
isolcpus core by a process requesting cpu_isolated mode, which may limit
how important it is to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz fallback timer
is removed, this will create an environment where new code that relies
on that tick will get punished, and we won't forgive such assumptions
silently, so it may also be worth it from that perspective.

Finally, it's worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2008) and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc.). So these semantics are very
useful if we can convince ourselves that doing this is safe.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f09c003da22f..ec36ed00af9d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -733,7 +733,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
}

#ifdef CONFIG_NO_HZ_FULL
- if (!ts->inidle) {
+ if (!ts->inidle && !tick_nohz_is_cpu_isolated()) {
time_delta = min(time_delta,
scheduler_tick_max_deferment());
}
--
2.1.2

2015-07-13 19:58:16

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 0/5] support "cpu_isolated" mode for nohz_full

This posting of the series is basically a "ping" since there were
no comments to the v3 version. I have rebased it to 4.2-rc1, added
support for arm64 syscall tracking for "strict" mode, and retested it;
are there any remaining concerns? Thomas, I haven't heard from you
whether my removal of the cpu_idle calls sufficiently addresses your
concerns about that aspect. Are there other concerns with this patch
series at this point?

Original patch series cover letter follows:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode. A prctl()
option (PR_SET_CPU_ISOLATED) is added to control whether processes have
requested this stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.2-rc1) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v4:
rebased on kernel v4.2-rc1
added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
remove dependency on cpu_idle subsystem (Thomas Gleixner)
use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
use seconds for console messages instead of jiffies (Thomas Gleixner)
updated commit description for patch 5/5

v2:
rename "dataplane" to "cpu_isolated"
drop ksoftirqd suppression changes (believed no longer needed)
merge previous "QUIESCE" functionality into baseline functionality
explicitly track syscalls and exceptions for "STRICT" functionality
allow configuring a signal to be delivered for STRICT mode failures
move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555); also since if
we remove the 1Hz tick, cpu_isolated threads will never re-enter
userspace since a tick will always be pending.

Chris Metcalf (5):
nohz_full: add support for "cpu_isolated" mode
nohz: support PR_CPU_ISOLATED_STRICT mode
nohz: cpu_isolated strict mode configurable signal
nohz: add cpu_isolated_debug boot flag
nohz: cpu_isolated: allow tick to be fully disabled

Documentation/kernel-parameters.txt | 6 +++
arch/tile/kernel/process.c | 9 ++++
arch/tile/kernel/ptrace.c | 6 ++-
arch/tile/mm/homecache.c | 5 +-
arch/x86/kernel/ptrace.c | 2 +
include/linux/context_tracking.h | 11 ++--
include/linux/sched.h | 3 ++
include/linux/tick.h | 28 ++++++++++
include/uapi/linux/prctl.h | 8 +++
kernel/context_tracking.c | 12 +++--
kernel/irq_work.c | 4 +-
kernel/sched/core.c | 18 +++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 6 +++
kernel/sys.c | 8 +++
kernel/time/tick-sched.c | 104 +++++++++++++++++++++++++++++++++++-
17 files changed, 229 insertions(+), 10 deletions(-)

--
2.1.2

2015-07-13 19:58:26

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl(). When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only two actions taken. First, the task
calls lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core. Then, the code checks for
pending timer interrupts and quiesces until they are no longer pending.
As a result, system calls (and page faults, etc.) can be inordinately slow.
However, this quiescing guarantees that no unexpected interrupts will
occur, even if the application intentionally calls into the kernel.
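
As an aside for illustration (not part of the patch itself): a task
pinned to a nohz_full core could opt in with a minimal sketch like the
following, assuming updated kernel headers that export the new
PR_CPU_ISOLATED constants from this series:

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
            /* Request cpu_isolated semantics before the main loop. */
            if (prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE,
                      0, 0, 0) != 0)
                    perror("PR_SET_CPU_ISOLATED");

            /* PR_GET_CPU_ISOLATED returns the flags as its result. */
            printf("cpu_isolated flags: %d\n",
                   prctl(PR_GET_CPU_ISOLATED, 0, 0, 0, 0));
            return 0;
    }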

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/process.c | 9 ++++++++
include/linux/sched.h | 3 +++
include/linux/tick.h | 10 ++++++++
include/uapi/linux/prctl.h | 5 ++++
kernel/context_tracking.c | 3 +++
kernel/sys.c | 8 +++++++
kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 95 insertions(+)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..3625e839ad62 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
_cpu_idle();
}

+#ifdef CONFIG_NO_HZ_FULL
+void tick_nohz_cpu_isolated_wait(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ _cpu_idle();
+ set_current_state(TASK_RUNNING);
+}
+#endif
+
/*
* Release a thread_info structure
*/
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f1591615..f350b0c20bbc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1778,6 +1778,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_NO_HZ_FULL
+ unsigned int cpu_isolated_flags;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 3741ba1a652c..cb5569181359 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
#include <linux/context_tracking_state.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
+#include <linux/prctl.h>

#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void __init tick_init(void);
@@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
cpumask_or(mask, mask, tick_nohz_full_mask);
}

+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick(void) { }
static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED 47
+#define PR_GET_CPU_ISOLATED 48
+# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..f9de3ee12723 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/tick.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (tick_nohz_is_cpu_isolated())
+ tick_nohz_cpu_isolated_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..36eb9a839f1f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_NO_HZ_FULL
+ case PR_SET_CPU_ISOLATED:
+ me->cpu_isolated_flags = arg2;
+ break;
+ case PR_GET_CPU_ISOLATED:
+ error = me->cpu_isolated_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429e98c6..4cf093c012d1 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/swap.h>

#include <asm/irq_regs.h>

@@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
cpumask_pr_args(tick_nohz_full_mask));
}
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak tick_nohz_cpu_isolated_wait(void)
+{
+ cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+ tick_nohz_cpu_isolated_wait();
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ dump_stack();
+ }
+}
+
#endif

/*
--
2.1.2

2015-07-13 19:59:43

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves. In particular, if it
enters the kernel via system call, page fault, or any of a number of other
synchronous traps, it may be unexpectedly exposed to long latencies.
Add a simple flag that puts the process into a state where any such
kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit to
current->cpu_isolated_flags that is set when prctl() sets the flags.
We check the bit on syscall entry as well as on any exception_enter().
The prctl() syscall is ignored to allow clearing the bit again later,
and exit/exit_group are ignored to allow exiting the task without
a pointless signal killing you as you try to do so.

This change adds the syscall-detection hooks only for x86, arm64,
and tile.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.
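
As a hedged usage sketch (again, not part of the patch), a task would
typically finish its setup syscalls first and only then arm strict mode:

    /* After this point, any syscall, page fault, or other synchronous
     * trap (other than prctl() and exit/exit_group) is fatal. */
    prctl(PR_SET_CPU_ISOLATED,
          PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT, 0, 0, 0);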

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/arm64/kernel/ptrace.c | 4 ++++
arch/tile/kernel/ptrace.c | 6 +++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/tick.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++
8 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..7315b1579cbd 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,

asmlinkage int syscall_trace_enter(struct pt_regs *regs)
{
+ /* Ensure we report cpu_isolated violations in all circumstances. */
+ if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(regs->syscallno);
+
/* Do the secure computing check first; failures should be fast. */
if (secure_computing() == -1)
return -1;
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..d4e43a13bab1 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(
+ regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..860f346977e2 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..8b994e2a0330 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/tick.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);

@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (tick_nohz_cpu_isolated_strict())
+ tick_nohz_cpu_isolated_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/tick.h b/include/linux/tick.h
index cb5569181359..f79f6945f762 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -157,6 +157,8 @@ extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
+extern void tick_nohz_cpu_isolated_syscall(int nr);
+extern void tick_nohz_cpu_isolated_exception(void);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -168,6 +170,8 @@ static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
+static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
+static inline void tick_nohz_cpu_isolated_exception(void) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
@@ -200,4 +204,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk)
__tick_nohz_task_switch(tsk);
}

+static inline bool tick_nohz_cpu_isolated_strict(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags &
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_CPU_ISOLATED 47
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+# define PR_CPU_ISOLATED_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index f9de3ee12723..fd051ea290ee 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
context_tracking_recursion_exit();
out_irq_restore:
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4cf093c012d1..9f495c7c7dc2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
#include <linux/swap.h>

#include <asm/irq_regs.h>
+#include <asm/unistd.h>

#include "tick-internal.h"

@@ -446,6 +447,43 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

+static void kill_cpu_isolated_strict_task(void)
+{
+ dump_stack();
+ current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void tick_nohz_cpu_isolated_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void tick_nohz_cpu_isolated_exception(void)
+{
+ pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_cpu_isolated_strict_task();
+}
+
#endif

/*
--
2.1.2

2015-07-13 19:58:30

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 3/5] nohz: cpu_isolated strict mode configurable signal

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
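
For example (an illustrative sketch using the macros added here), a task
that prefers a catchable SIGUSR1 over the default SIGKILL could request:

    prctl(PR_SET_CPU_ISOLATED,
          PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT |
          PR_CPU_ISOLATED_SET_SIG(SIGUSR1), 0, 0, 0);

The handler can then distinguish the cause via siginfo: si_code is 1 for
a syscall violation and 0 for an exception.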

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/time/tick-sched.c | 15 +++++++++++----
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
# define PR_CPU_ISOLATED_STRICT (1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f495c7c7dc2..c5eca9c99fad 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -447,11 +447,18 @@ void tick_nohz_cpu_isolated_enter(void)
}
}

-static void kill_cpu_isolated_strict_task(void)
+static void kill_cpu_isolated_strict_task(int is_syscall)
{
+ siginfo_t info = {};
+ int sig;
+
dump_stack();
current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
- send_sig(SIGKILL, current, 1);
+
+ sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+ info.si_signo = sig;
+ info.si_code = is_syscall;
+ send_sig_info(sig, &info, current);
}

/*
@@ -470,7 +477,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall)

pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
current->comm, current->pid, syscall);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(1);
}

/*
@@ -481,7 +488,7 @@ void tick_nohz_cpu_isolated_exception(void)
{
pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
current->comm, current->pid);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(0);
}

#endif
--
2.1.2

2015-07-13 19:58:35

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 4/5] nohz: add cpu_isolated_debug boot flag

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_CPU_ISOLATED_ENABLE mode. Such processes should
get no interrupts from the kernel, and if they do, when this boot
flag is specified a kernel stack dump on the console is generated.

It's possible to use ftrace to simply detect whether a cpu_isolated core
has unexpectedly entered the kernel. But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
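
For instance, a box dedicating cpus 2-3 to isolated tasks might boot
with something like the following (cpu numbers are illustrative):

    nohz_full=2-3 isolcpus=2-3 cpu_isolated_debug

Any interrupt then aimed at a PR_CPU_ISOLATED_ENABLE task on those cpus
logs "Interrupt detected for cpu_isolated cpu N" plus a backtrace from
dump_stack() on the console.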

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 6 ++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/tick.h | 2 ++
kernel/irq_work.c | 4 +++-
kernel/sched/core.c | 18 ++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 6 ++++++
8 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f0459cd7b..76e8e2ff4a0a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -749,6 +749,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
/proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.txt.

+ cpu_isolated_debug [KNL]
+ In kernels built with CONFIG_NO_HZ_FULL and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_CPU_ISOLATED_ENABLE.
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..f336880e1b01 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/tick.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ tick_nohz_cpu_isolated_debug(cpu);
+ }
}

/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f79f6945f762..ed65551e2315 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -159,6 +159,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
extern void tick_nohz_cpu_isolated_enter(void);
extern void tick_nohz_cpu_isolated_syscall(int nr);
extern void tick_nohz_cpu_isolated_exception(void);
+extern void tick_nohz_cpu_isolated_debug(int cpu);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -172,6 +173,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
static inline void tick_nohz_cpu_isolated_enter(void) { }
static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
static inline void tick_nohz_cpu_isolated_exception(void) { }
+static inline void tick_nohz_cpu_isolated_debug(int cpu) { }
#endif

static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..7f35c90346de 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10081..c8388f9206b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -743,6 +743,24 @@ bool sched_can_stop_tick(void)

return true;
}
+
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static int cpu_isolated_debug;
+static int __init cpu_isolated_debug_func(char *str)
+{
+ cpu_isolated_debug = true;
+ return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void tick_nohz_cpu_isolated_debug(int cpu)
+{
+ if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+ pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+ dump_stack();
+ }
+}
#endif /* CONFIG_NO_HZ_FULL */

void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index 836df8dac6cc..90ee460c2586 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_NO_HZ_FULL
+ /* If the task is being killed, don't complain about cpu_isolated. */
+ if (state & TASK_WAKEKILL)
+ t->cpu_isolated_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..6b7d8e2c8af4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/tick.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ tick_nohz_cpu_isolated_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ tick_nohz_cpu_isolated_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..333872925ff6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
+#include <linux/context_tracking.h>
#include <linux/tick.h>
#include <linux/irq.h>

@@ -335,6 +336,11 @@ void irq_enter(void)
_local_bh_enable();
}

+ if (context_tracking_cpu_is_enabled() &&
+ context_tracking_in_user() &&
+ !in_interrupt())
+ tick_nohz_cpu_isolated_debug(smp_processor_id());
+
__irq_enter();
}

--
2.1.2

2015-07-13 19:58:43

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v4 5/5] nohz: cpu_isolated: allow tick to be fully disabled

While the current fallback to a 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.
In addition, due to the way such processes quiesce by waiting for
the timer tick to stop prior to returning to userspace, without this
commit it won't be possible to use the cpu_isolated mode at all.

Removing the 1-second cap was previously discussed (see link below)
and Thomas Gleixner observed that vruntime, load balancing data, load
accounting, and other things might be impacted. Frederic Weisbecker
similarly observed that allowing the tick to be indefinitely deferred just
meant that no one would ever fix the underlying bugs. However, it's at
least true that the mode proposed in this patch can only be enabled on an
isolcpus core by a process requesting cpu_isolated mode, which may limit
how important it is to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz fallback timer
is removed, this will create an environment where new code that relies
on that tick will get punished, and we won't forgive such assumptions
silently, so it may also be worth it from that perspective.

Finally, it's worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2008) and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc.). So these semantics are very
useful if we can convince ourselves that doing this is safe.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c5eca9c99fad..8187b4b4c91c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -754,7 +754,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,

#ifdef CONFIG_NO_HZ_FULL
/* Limit the tick delta to the maximum scheduler deferment */
- if (!ts->inidle)
+ if (!ts->inidle && !tick_nohz_is_cpu_isolated())
delta = min(delta, scheduler_tick_max_deferment());
#endif

--
2.1.2

2015-07-13 20:41:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]> wrote:
> The existing nohz_full mode makes tradeoffs to minimize userspace
> interruptions while still attempting to avoid overheads in the
> kernel entry/exit path, to provide 100% kernel semantics, etc.
>
> However, some applications require a stronger commitment from the
> kernel to avoid interruptions, in particular userspace device
> driver style applications, such as high-speed networking code.
>
> This change introduces a framework to allow applications to elect
> to have the stronger semantics as needed, specifying
> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> Subsequent commits will add additional flags and additional
> semantics.

I thought the general consensus was that this should be the default
behavior and that any associated bugs should be fixed.

--Andy

2015-07-13 21:01:41

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]> wrote:
>> The existing nohz_full mode makes tradeoffs to minimize userspace
>> interruptions while still attempting to avoid overheads in the
>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>
>> However, some applications require a stronger commitment from the
>> kernel to avoid interruptions, in particular userspace device
>> driver style applications, such as high-speed networking code.
>>
>> This change introduces a framework to allow applications to elect
>> to have the stronger semantics as needed, specifying
>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>> Subsequent commits will add additional flags and additional
>> semantics.
> I thought the general consensus was that this should be the default
> behavior and that any associated bugs should be fixed.

I think it comes down to dividing the set of use cases in two:

- "Regular" nohz_full, as used to improve performance and limit
interruptions, possibly for power benefits, etc. But, stray
interrupts are not particularly bad, and you don't want to take
extreme measures to avoid them.

- What I'm calling "cpu_isolated" mode where when you return to
userspace, you expect that by God, the kernel doesn't interrupt you
again, and if it does, it's a flat-out bug.

There are a few things that cpu_isolated mode currently does to
accomplish its goals that are pretty heavy-weight:

Processes are held in kernel space until ticks are quiesced; this is
not necessarily what every nohz_full task wants. If a task makes a
kernel call, there may well be arbitrary timer fallout, and having a
way to select whether or not you are willing to take a timer tick after
return to userspace is pretty important.

Likewise, there are things that you may want to do on return to
userspace that are designed to prevent further interruptions in
cpu_isolated mode, even at a possible future performance cost if and
when you return to the kernel, such as flushing the per-cpu free page
list so that you won't be interrupted by an IPI to flush it later.

If you're arguing that the cpu_isolated semantic is really the only
one that makes sense for nohz_full, my sense is that it might be
surprising to many of the folks who do nohz_full work. But, I'm happy
to be wrong on this point, and maybe all the nohz_full community is
interested in making the same tradeoffs for nohz_full generally that
I've proposed in this patch series just for cpu_isolated?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-13 21:45:43

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <[email protected]> wrote:
> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
>>
>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]>
>> wrote:
>>>
>>> The existing nohz_full mode makes tradeoffs to minimize userspace
>>> interruptions while still attempting to avoid overheads in the
>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>>
>>> However, some applications require a stronger commitment from the
>>> kernel to avoid interruptions, in particular userspace device
>>> driver style applications, such as high-speed networking code.
>>>
>>> This change introduces a framework to allow applications to elect
>>> to have the stronger semantics as needed, specifying
>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>>> Subsequent commits will add additional flags and additional
>>> semantics.
>>
>> I thought the general consensus was that this should be the default
>> behavior and that any associated bugs should be fixed.
>
>
> I think it comes down to dividing the set of use cases in two:
>
> - "Regular" nohz_full, as used to improve performance and limit
> interruptions, possibly for power benefits, etc. But, stray
> interrupts are not particularly bad, and you don't want to take
> extreme measures to avoid them.
>
> - What I'm calling "cpu_isolated" mode where when you return to
> userspace, you expect that by God, the kernel doesn't interrupt you
> again, and if it does, it's a flat-out bug.
>
> There are a few things that cpu_isolated mode currently does to
> accomplish its goals that are pretty heavy-weight:
>
> Processes are held in kernel space until ticks are quiesced; this is
> not necessarily what every nohz_full task wants. If a task makes a
> kernel call, there may well be arbitrary timer fallout, and having a
> way to select whether or not you are willing to take a timer tick after
> return to userspace is pretty important.

Then shouldn't deferred work be done immediately in nohz_full mode
regardless? What is this delayed work that's being done?

>
> Likewise, there are things that you may want to do on return to
> userspace that are designed to prevent further interruptions in
> cpu_isolated mode, even at a possible future performance cost if and
> when you return to the kernel, such as flushing the per-cpu free page
> list so that you won't be interrupted by an IPI to flush it later.
>

Why not just kick the per-cpu free page over to whatever cpu is
monitoring your RCU state, etc? That should be very quick.

> If you're arguing that the cpu_isolated semantic is really the only
> one that makes sense for nohz_full, my sense is that it might be
> surprising to many of the folks who do nohz_full work. But, I'm happy
> to be wrong on this point, and maybe all the nohz_full community is
> interested in making the same tradeoffs for nohz_full generally that
> I've proposed in this patch series just for cpu_isolated?

nohz_full is currently dog slow for no particularly good reasons. I
suspect that the interrupts you're seeing are also there for no
particularly good reasons as well.

Let's fix them instead of adding new ABIs to work around them.

--Andy

2015-07-13 21:47:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]> wrote:
> With cpu_isolated mode, the task is in principle guaranteed not to be
> interrupted by the kernel, but only if it behaves. In particular, if it
> enters the kernel via system call, page fault, or any of a number of other
> synchronous traps, it may be unexpectedly exposed to long latencies.
> Add a simple flag that puts the process into a state where any such
> kernel entry is fatal.
>

To me, this seems like the wrong design. If nothing else, it seems
too much like an abusable anti-debugging mechanism. I can imagine
some per-task flag "I think I shouldn't be interrupted now" and a
tracepoint that fires if the task is interrupted with that flag set.
But the strong cpu isolation stuff requires systemwide configuration,
and I think that monitoring that it works should work similarly.

More comments below.

> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> arch/arm64/kernel/ptrace.c | 4 ++++
> arch/tile/kernel/ptrace.c | 6 +++++-
> arch/x86/kernel/ptrace.c | 2 ++
> include/linux/context_tracking.h | 11 ++++++++---
> include/linux/tick.h | 16 ++++++++++++++++
> include/uapi/linux/prctl.h | 1 +
> kernel/context_tracking.c | 9 ++++++---
> kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++
> 8 files changed, 80 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index d882b833dbdb..7315b1579cbd 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>
> asmlinkage int syscall_trace_enter(struct pt_regs *regs)
> {
> + /* Ensure we report cpu_isolated violations in all circumstances. */
> + if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
> + tick_nohz_cpu_isolated_syscall(regs->syscallno);

IMO this is pointless. If a user wants a syscall to kill them, use
seccomp. The kernel isn't at fault if the user does a syscall when it
didn't want to enter the kernel.


> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
> return 0;
>
> prev_ctx = this_cpu_read(context_tracking.state);
> - if (prev_ctx != CONTEXT_KERNEL)
> - context_tracking_exit(prev_ctx);
> + if (prev_ctx != CONTEXT_KERNEL) {
> + if (context_tracking_exit(prev_ctx)) {
> + if (tick_nohz_cpu_isolated_strict())
> + tick_nohz_cpu_isolated_exception();
> + }
> + }

NACK. I'm cautiously optimistic that an x86 kernel 4.3 or newer will
simply never call exception_enter. It certainly won't call it
frequently unless something goes wrong with the patches that are
already in -tip.

> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
> * This call supports re-entrancy. This way it can be called from any exception
> * handler without needing to know if we came from userspace or not.
> */
> -void context_tracking_exit(enum ctx_state state)
> +bool context_tracking_exit(enum ctx_state state)
> {
> unsigned long flags;
> + bool from_user = false;
>

IMO the internal context tracking API (e.g. context_tracking_exit) are
mostly of the form "hey context tracking: I don't really know what
you're doing or what I'm doing, but let me call you and make both of
us feel better." You're making it somewhat worse: now it's all of the
above plus "I don't even know whether I just entered the kernel --
maybe you have a better idea".

Starting with 4.3, x86 kernels will know *exactly* when they enter the
kernel. All of this context tracking what-was-my-previous-state stuff
will remain until someone kills it, but when it goes away we'll get a
nice performance boost.

So, no, let's implement this for real if we're going to implement it.

--Andy

2015-07-21 19:11:18

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

Sorry for the delay in responding; some other priorities came up internally.

On 07/13/2015 05:45 PM, Andy Lutomirski wrote:
> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <[email protected]> wrote:
>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]>
>>> wrote:
>>>> The existing nohz_full mode makes tradeoffs to minimize userspace
>>>> interruptions while still attempting to avoid overheads in the
>>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>>>
>>>> However, some applications require a stronger commitment from the
>>>> kernel to avoid interruptions, in particular userspace device
>>>> driver style applications, such as high-speed networking code.
>>>>
>>>> This change introduces a framework to allow applications to elect
>>>> to have the stronger semantics as needed, specifying
>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>>>> Subsequent commits will add additional flags and additional
>>>> semantics.
>>> I thought the general consensus was that this should be the default
>>> behavior and that any associated bugs should be fixed.
>>
>> I think it comes down to dividing the set of use cases in two:
>>
>> - "Regular" nohz_full, as used to improve performance and limit
>> interruptions, possibly for power benefits, etc. But, stray
>> interrupts are not particularly bad, and you don't want to take
>> extreme measures to avoid them.
>>
>> - What I'm calling "cpu_isolated" mode where when you return to
>> userspace, you expect that by God, the kernel doesn't interrupt you
>> again, and if it does, it's a flat-out bug.
>>
>> There are a few things that cpu_isolated mode currently does to
>> accomplish its goals that are pretty heavy-weight:
>>
>> Processes are held in kernel space until ticks are quiesced; this is
>> not necessarily what every nohz_full task wants. If a task makes a
>> kernel call, there may well be arbitrary timer fallout, and having a
>> way to select whether or not you are willing to take a timer tick after
>> return to userspace is pretty important.
> Then shouldn't deferred work be done immediately in nohz_full mode
> regardless? What is this delayed work that's being done?

I'm thinking of things like needing to wait for an RCU quiesce
period to complete.

In the current version, there's also the vmstat_update() that
may schedule delayed work and interrupt the core again
shortly before realizing that there are no more counter updates
happening, at which point it quiesces. Currently we handle
this in cpu_isolated mode simply by spinning and waiting for
the timer interrupts to complete.

>> Likewise, there are things that you may want to do on return to
>> userspace that are designed to prevent further interruptions in
>> cpu_isolated mode, even at a possible future performance cost if and
>> when you return to the kernel, such as flushing the per-cpu free page
>> list so that you won't be interrupted by an IPI to flush it later.
> Why not just kick the per-cpu free page over to whatever cpu is
> monitoring your RCU state, etc? That should be very quick.

So just for the sake of precision, the thing I'm talking about
is the lru_add_drain() call on kernel exit. Are you proposing
that we call that for every nohz_full core on kernel exit?
I'm not opposed to this, but I don't know if other nohz
developers feel like this is the right tradeoff.

Similarly, addressing the vmstat_update() issue above, in
cpu_isolated mode we might want to have a follow-on
patch that forces the vmstat system into quiesced state
on return to userspace. We would need to do this
unconditionally on all nohz_full cores if we tried to combine
the current nohz_full with my proposed cpu_isolated
functionality. Again, I'm not necessarily opposed, but
I suspect other nohz developers might not want this.

(I didn't want to introduce such a patch as part of this
series since it pulls in even more interested parties, and
it gets harder and harder to get to consensus.)

>> If you're arguing that the cpu_isolated semantic is really the only
>> one that makes sense for nohz_full, my sense is that it might be
>> surprising to many of the folks who do nohz_full work. But, I'm happy
>> to be wrong on this point, and maybe all the nohz_full community is
>> interested in making the same tradeoffs for nohz_full generally that
>> I've proposed in this patch series just for cpu_isolated?
> nohz_full is currently dog slow for no particularly good reasons. I
> suspect that the interrupts you're seeing are also there for no
> particularly good reasons as well.
>
> Let's fix them instead of adding new ABIs to work around them.

Well, in principle if we accepted my proposed patch series
and then over time came to decide that it was reasonable
for nohz_full to have these complete cpu isolation
semantics, the one proposed ABI simply becomes a no-op.
So it's not as problematic an ABI as some.

My issue is this: I'm totally happy with submitting a revised
patch series that does all the stuff for pure nohz_full that
I'm currently proposing for cpu_isolated. But, is it what
the community wants? Should I propose it and see?

Frederic, do you have any insight here? Thanks!

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-21 19:26:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <[email protected]> wrote:
> Sorry for the delay in responding; some other priorities came up internally.
>
> On 07/13/2015 05:45 PM, Andy Lutomirski wrote:
>>
>> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <[email protected]>
>> wrote:
>>>
>>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
>>>>
>>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]>
>>>>
>>>> wrote:
>>>>>
>>>>> The existing nohz_full mode makes tradeoffs to minimize userspace
>>>>> interruptions while still attempting to avoid overheads in the
>>>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>>>>
>>>>> However, some applications require a stronger commitment from the
>>>>> kernel to avoid interruptions, in particular userspace device
>>>>> driver style applications, such as high-speed networking code.
>>>>>
>>>>> This change introduces a framework to allow applications to elect
>>>>> to have the stronger semantics as needed, specifying
>>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>>>>> Subsequent commits will add additional flags and additional
>>>>> semantics.
>>>>
>>>> I thought the general consensus was that this should be the default
>>>> behavior and that any associated bugs should be fixed.
>>>
>>>
>>> I think it comes down to dividing the set of use cases in two:
>>>
>>> - "Regular" nohz_full, as used to improve performance and limit
>>> interruptions, possibly for power benefits, etc. But, stray
>>> interrupts are not particularly bad, and you don't want to take
>>> extreme measures to avoid them.
>>>
>>> - What I'm calling "cpu_isolated" mode where when you return to
>>> userspace, you expect that by God, the kernel doesn't interrupt you
>>> again, and if it does, it's a flat-out bug.
>>>
>>> There are a few things that cpu_isolated mode currently does to
>>> accomplish its goals that are pretty heavy-weight:
>>>
>>> Processes are held in kernel space until ticks are quiesced; this is
>>> not necessarily what every nohz_full task wants. If a task makes a
>>> kernel call, there may well be arbitrary timer fallout, and having a
>>> way to select whether or not you are willing to take a timer tick after
>>> return to userspace is pretty important.
>>
>> Then shouldn't deferred work be done immediately in nohz_full mode
>> regardless? What is this delayed work that's being done?
>
>
> I'm thinking of things like needing to wait for an RCU quiesce
> period to complete.

rcu_nocbs does this, right?

>
> In the current version, there's also the vmstat_update() that
> may schedule delayed work and interrupt the core again
> shortly before realizing that there are no more counter updates
> happening, at which point it quiesces. Currently we handle
> this in cpu_isolated mode simply by spinning and waiting for
> the timer interrupts to complete.

Perhaps we should fix that?

>
>>> Likewise, there are things that you may want to do on return to
>>> userspace that are designed to prevent further interruptions in
>>> cpu_isolated mode, even at a possible future performance cost if and
>>> when you return to the kernel, such as flushing the per-cpu free page
>>> list so that you won't be interrupted by an IPI to flush it later.
>>
>> Why not just kick the per-cpu free page over to whatever cpu is
>> monitoring your RCU state, etc? That should be very quick.
>
>
> So just for the sake of precision, the thing I'm talking about
> is the lru_add_drain() call on kernel exit. Are you proposing
> that we call that for every nohz_full core on kernel exit?
> I'm not opposed to this, but I don't know if other nohz
> developers feel like this is the right tradeoff.

I'm proposing either that we do that or that we arrange for other cpus
to be able to steal our LRU list while we're in RCU user/idle.

>> Let's fix them instead of adding new ABIs to work around them.
>
>
> Well, in principle if we accepted my proposed patch series
> and then over time came to decide that it was reasonable
> for nohz_full to have these complete cpu isolation
> semantics, the one proposed ABI simply becomes a no-op.
> So it's not as problematic an ABI as some.

What if we made it a debugfs thing instead of a prctl? Have a mode
where the system tries really hard to quiesce itself even at the cost
of performance.

>
> My issue is this: I'm totally happy with submitting a revised
> patch series that does all the stuff for pure nohz_full that
> I'm currently proposing for cpu_isolated. But, is it what
> the community wants? Should I propose it and see?
>
> Frederic, do you have any insight here? Thanks!
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>



--
Andy Lutomirski
AMA Capital Management, LLC

2015-07-21 19:34:28

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]> wrote:
>> With cpu_isolated mode, the task is in principle guaranteed not to be
>> interrupted by the kernel, but only if it behaves. In particular, if it
>> enters the kernel via system call, page fault, or any of a number of other
>> synchronous traps, it may be unexpectedly exposed to long latencies.
>> Add a simple flag that puts the process into a state where any such
>> kernel entry is fatal.
>>
> To me, this seems like the wrong design. If nothing else, it seems
> too much like an abusable anti-debugging mechanism. I can imagine
> some per-task flag "I think I shouldn't be interrupted now" and a
> tracepoint that fires if the task is interrupted with that flag set.
> But the strong cpu isolation stuff requires systemwide configuration,
> and I think that monitoring that it works should work similarly.

First, you mention a per-task flag, but not specifically whether the
proposed prctl() mechanism is a reasonable way to set that flag.
Just wanted to clarify that this wasn't an issue in and of itself for you.

Second, you suggest a tracepoint. I'm OK with creating a tracepoint
dedicated to cpu_isolated strict failures and making that the only
way this mechanism works. But, earlier community feedback seemed to
suggest that the signal mechanism was OK; one piece of feedback
just requested being able to set which signal was delivered. Do you
think the signal idea is a bad one? Are you proposing potentially
having a signal and/or a tracepoint?

Last, you mention systemwide configuration for monitoring. Can you
expand on what you mean by that? We already support the monitoring
only on the nohz_full cores, so to that extent it's already systemwide.
And the per-task flag has to be set by the running process when it's
ready for this state, so that can't really be systemwide configuration.
I don't understand your suggestion on this point.

>> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
>> index d882b833dbdb..7315b1579cbd 100644
>> --- a/arch/arm64/kernel/ptrace.c
>> +++ b/arch/arm64/kernel/ptrace.c
>> @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>>
>> asmlinkage int syscall_trace_enter(struct pt_regs *regs)
>> {
>> + /* Ensure we report cpu_isolated violations in all circumstances. */
>> + if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
>> + tick_nohz_cpu_isolated_syscall(regs->syscallno);
> IMO this is pointless. If a user wants a syscall to kill them, use
> seccomp. The kernel isn't at fault if the user does a syscall when it
> didn't want to enter the kernel.

Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT
was to what I wanted here. One concern is that there doesn't seem
to be a way to "escape" from seccomp strict mode, i.e. you can't
call seccomp() again to turn it off - which makes sense for seccomp
since it's a security issue, but not so much sense with cpu_isolated.

So, do you think there's a good role for the seccomp() API to play
in achieving this goal? It's certainly not a question of "the kernel at
fault" but rather "asking the kernel to help catch user mistakes"
(typically third-party libraries in our customers' experience). You
could imagine a SECCOMP_SET_MODE_ISOLATED or something.

Alternatively, we could stick with the API proposed in my patch
series, or something similar, and just try to piggy-back on the seccomp
internals to make it happen. It would require Kconfig to ensure
that SECCOMP was enabled though, which obviously isn't currently
required to do cpu isolation.
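
For illustration, here is a minimal sketch of the two models being
compared (error handling omitted; the seccomp call is the existing
API, while the PR_SET_CPU_ISOLATED calls are the ones proposed in
this series):

    #include <sys/prctl.h>
    #include <linux/seccomp.h>

    /* Existing seccomp strict mode: one-way. After this call, any
     * syscall other than read/write/_exit/sigreturn kills the task,
     * and there is no way to turn it back off. */
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

    /* Proposed cpu_isolated mode: a toggle. A task can drop back out
     * of isolation to run setup or slow-path code. */
    prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE);
    /* ... fast-path code ... */
    prctl(PR_SET_CPU_ISOLATED, 0);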

>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>> return 0;
>>
>> prev_ctx = this_cpu_read(context_tracking.state);
>> - if (prev_ctx != CONTEXT_KERNEL)
>> - context_tracking_exit(prev_ctx);
>> + if (prev_ctx != CONTEXT_KERNEL) {
>> + if (context_tracking_exit(prev_ctx)) {
>> + if (tick_nohz_cpu_isolated_strict())
>> + tick_nohz_cpu_isolated_exception();
>> + }
>> + }
> NACK. I'm cautiously optimistic that an x86 kernel 4.3 or newer will
> simply never call exception_enter. It certainly won't call it
> frequently unless something goes wrong with the patches that are
> already in -tip.

This is intended to catch user exceptions like page faults, general
protection violations, or (on platforms where this would happen)
unaligned data traps.
The kernel still has a role to play here and cpu_isolated mode
needs to let the user know they have accidentally entered
the kernel in this case.

>> --- a/kernel/context_tracking.c
>> +++ b/kernel/context_tracking.c
>> @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>> * This call supports re-entrancy. This way it can be called from any exception
>> * handler without needing to know if we came from userspace or not.
>> */
>> -void context_tracking_exit(enum ctx_state state)
>> +bool context_tracking_exit(enum ctx_state state)
>> {
>> unsigned long flags;
>> + bool from_user = false;
>>
> IMO the internal context tracking API (e.g. context_tracking_exit) are
> mostly of the form "hey context tracking: I don't really know what
> you're doing or what I'm doing, but let me call you and make both of
> us feel better." You're making it somewhat worse: now it's all of the
> above plus "I don't even know whether I just entered the kernel --
> maybe you have a better idea".
>
> Starting with 4.3, x86 kernels will know *exactly* when they enter the
> kernel. All of this context tracking what-was-my-previous-state stuff
> will remain until someone kills it, but when it goes away we'll get a
> nice performance boost.
>
> So, no, let's implement this for real if we're going to implement it.

I'm certainly OK with rebasing on top of 4.3 after the context
tracking stuff is better. That said, I think it makes sense to continue
to debate the intent of the patch series even if we pull this one
patch out and defer it until after 4.3, or having it end up pulled
into some other repo that includes the improvements and
is being pulled for 4.3.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-21 19:42:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

On Tue, Jul 21, 2015 at 12:34 PM, Chris Metcalf <[email protected]> wrote:
> On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
>>
>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]>
>> wrote:
>>>
>>> With cpu_isolated mode, the task is in principle guaranteed not to be
>>> interrupted by the kernel, but only if it behaves. In particular, if it
>>> enters the kernel via system call, page fault, or any of a number of
>>> other
>>> synchronous traps, it may be unexpectedly exposed to long latencies.
>>> Add a simple flag that puts the process into a state where any such
>>> kernel entry is fatal.
>>>
>> To me, this seems like the wrong design. If nothing else, it seems
>> too much like an abusable anti-debugging mechanism. I can imagine
>> some per-task flag "I think I shouldn't be interrupted now" and a
>> tracepoint that fires if the task is interrupted with that flag set.
>> But the strong cpu isolation stuff requires systemwide configuration,
>> and I think that monitoring that it works should work similarly.
>
>
> First, you mention a per-task flag, but not specifically whether the
> proposed prctl() mechanism is a reasonable way to set that flag.
> Just wanted to clarify that this wasn't an issue in and of itself for you.

I think I'm okay with a per-task flag for this and, if you add one,
then prctl() is presumably the way to go. Unless people think that
nohz should be 100% reliable always, in which case might as well make
the flag per-cpu.

>
> Second, you suggest a tracepoint. I'm OK with creating a tracepoint
> dedicated to cpu_isolated strict failures and making that the only
> way this mechanism works. But, earlier community feedback seemed to
> suggest that the signal mechanism was OK; one piece of feedback
> just requested being able to set which signal was delivered. Do you
> think the signal idea is a bad one? Are you proposing potentially
> having a signal and/or a tracepoint?

I prefer the tracepoint. It's friendlier to debuggers, and it's
really about diagnosing a kernel problem, not a userspace problem.
Also, I really doubt that people should deploy a signal thing in
production. What if an NMI fires and kills their realtime program?

>
> Last, you mention systemwide configuration for monitoring. Can you
> expand on what you mean by that? We already support the monitoring
> only on the nohz_full cores, so to that extent it's already systemwide.
> And the per-task flag has to be set by the running process when it's
> ready for this state, so that can't really be systemwide configuration.
> I don't understand your suggestion on this point.

I'm really thinking about systemwide configuration for isolation. I
think we'll always (at least in the nearish term) need the admin's
help to set up isolated CPUs. If the admin makes a whole CPU be
isolated, then monitoring just that CPU and monitoring it all the time
seems sensible. If we really do think that isolating a CPU should
require a syscall of some sort because it's too expensive otherwise,
then we can do it that way, too. And if full isolation requires some
user help (e.g. don't do certain things that break isolation), then
having a per-task monitoring flag seems reasonable.

We may always need the user's help to avoid IPIs. For example, if one
thread calls munmap, the other thread is going to get an IPI. There's
nothing we can do about that.

> I'm certainly OK with rebasing on top of 4.3 after the context
> tracking stuff is better. That said, I think it makes sense to continue
> to debate the intent of the patch series even if we pull this one
> patch out and defer it until after 4.3, or having it end up pulled
> into some other repo that includes the improvements and
> is being pulled for 4.3.

Sure, no problem.

--Andy

2015-07-21 20:36:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Tue, Jul 21, 2015 at 12:26:17PM -0700, Andy Lutomirski wrote:
> On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <[email protected]> wrote:
> > Sorry for the delay in responding; some other priorities came up internally.
> >
> > On 07/13/2015 05:45 PM, Andy Lutomirski wrote:
> >>
> >> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <[email protected]>
> >> wrote:
> >>>
> >>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
> >>>>
> >>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <[email protected]>
> >>>>
> >>>> wrote:
> >>>>>
> >>>>> The existing nohz_full mode makes tradeoffs to minimize userspace
> >>>>> interruptions while still attempting to avoid overheads in the
> >>>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
> >>>>>
> >>>>> However, some applications require a stronger commitment from the
> >>>>> kernel to avoid interruptions, in particular userspace device
> >>>>> driver style applications, such as high-speed networking code.
> >>>>>
> >>>>> This change introduces a framework to allow applications to elect
> >>>>> to have the stronger semantics as needed, specifying
> >>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> >>>>> Subsequent commits will add additional flags and additional
> >>>>> semantics.
> >>>>
> >>>> I thought the general consensus was that this should be the default
> >>>> behavior and that any associated bugs should be fixed.
> >>>
> >>>
> >>> I think it comes down to dividing the set of use cases in two:
> >>>
> >>> - "Regular" nohz_full, as used to improve performance and limit
> >>> interruptions, possibly for power benefits, etc. But, stray
> >>> interrupts are not particularly bad, and you don't want to take
> >>> extreme measures to avoid them.
> >>>
> >>> - What I'm calling "cpu_isolated" mode where when you return to
> >>> userspace, you expect that by God, the kernel doesn't interrupt you
> >>> again, and if it does, it's a flat-out bug.
> >>>
> >>> There are a few things that cpu_isolated mode currently does to
> >>> accomplish its goals that are pretty heavy-weight:
> >>>
> >>> Processes are held in kernel space until ticks are quiesced; this is
> >>> not necessarily what every nohz_full task wants. If a task makes a
> >>> kernel call, there may well be arbitrary timer fallout, and having a
> >>> way to select whether or not you are willing to take a timer tick after
> >>> return to userspace is pretty important.
> >>
> >> Then shouldn't deferred work be done immediately in nohz_full mode
> >> regardless? What is this delayed work that's being done?
> >
> > I'm thinking of things like needing to wait for an RCU quiesce
> > period to complete.
>
> rcu_nocbs does this, right?

CONFIG_RCU_NOCB_CPUS offloads the RCU callbacks to a kthread, which
allows the nohz CPU to turn off its scheduling-clock tick more frequently.
Chris might have some other reason to wait for an RCU grace period, given
that waiting for an RCU grace period would not guarantee no callbacks.
Some more might have arrived in the meantime, and there can be some delay
between the end of the grace period and the invocation of the callbacks.

> > In the current version, there's also the vmstat_update() that
> > may schedule delayed work and interrupt the core again
> > shortly before realizing that there are no more counter updates
> > happening, at which point it quiesces. Currently we handle
> > this in cpu_isolated mode simply by spinning and waiting for
> > the timer interrupts to complete.
>
> Perhaps we should fix that?

Didn't Christoph Lameter fix this? Or is this an additional problem?

Thanx, Paul

> >>> Likewise, there are things that you may want to do on return to
> >>> userspace that are designed to prevent further interruptions in
> >>> cpu_isolated mode, even at a possible future performance cost if and
> >>> when you return to the kernel, such as flushing the per-cpu free page
> >>> list so that you won't be interrupted by an IPI to flush it later.
> >>
> >> Why not just kick the per-cpu free page over to whatever cpu is
> >> monitoring your RCU state, etc? That should be very quick.
> >
> >
> > So just for the sake of precision, the thing I'm talking about
> > is the lru_add_drain() call on kernel exit. Are you proposing
> > that we call that for every nohz_full core on kernel exit?
> > I'm not opposed to this, but I don't know if other nohz
> > developers feel like this is the right tradeoff.
>
> I'm proposing either that we do that or that we arrange for other cpus
> to be able to steal our LRU list while we're in RCU user/idle.
>
> >> Let's fix them instead of adding new ABIs to work around them.
> >
> >
> > Well, in principle if we accepted my proposed patch series
> > and then over time came to decide that it was reasonable
> > for nohz_full to have these complete cpu isolation
> > semantics, the one proposed ABI simply becomes a no-op.
> > So it's not as problematic an ABI as some.
>
> What if we made it a debugfs thing instead of a prctl? Have a mode
> where the system tries really hard to quiesce itself even at the cost
> of performance.
>
> >
> > My issue is this: I'm totally happy with submitting a revised
> > patch series that does all the stuff for pure nohz_full that
> > I'm currently proposing for cpu_isolated. But, is it what
> > the community wants? Should I propose it and see?
> >
> > Frederic, do you have any insight here? Thanks!
> >
> > --
> > Chris Metcalf, EZChip Semiconductor
> > http://www.ezchip.com
> >
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
>

Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Tue, 21 Jul 2015, Paul E. McKenney wrote:

> > > In the current version, there's also the vmstat_update() that
> > > may schedule delayed work and interrupt the core again
> > > shortly before realizing that there are no more counter updates
> > > happening, at which point it quiesces. Currently we handle
> > > this in cpu_isolated mode simply by spinning and waiting for
> > > the timer interrupts to complete.
> >
> > Perhaps we should fix that?
>
> Didn't Christoph Lameter fix this? Or is this an additional problem?

Well the vmstat update must realize first that there are no outstanding
updates before switching itself off. So typically there is one extra tick.
But we could add another function that will simply fold the differential
immediately and turn off the kworker task in the expectation that the
processor will stay quiet.

2015-07-22 19:28:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Wed, Jul 22, 2015 at 08:57:45AM -0500, Christoph Lameter wrote:
> On Tue, 21 Jul 2015, Paul E. McKenney wrote:
>
> > > > In the current version, there's also the vmstat_update() that
> > > > may schedule delayed work and interrupt the core again
> > > > shortly before realizing that there are no more counter updates
> > > > happening, at which point it quiesces. Currently we handle
> > > > this in cpu_isolated mode simply by spinning and waiting for
> > > > the timer interrupts to complete.
> > >
> > > Perhaps we should fix that?
> >
> > Didn't Christoph Lameter fix this? Or is this an additional problem?
>
> Well the vmstat update must realize first that there are no outstanding
> updates before switching itself off. So typically there is one extra tick.
> But we could add another function that will simply fold the differential
> immediately and turn off the kworker task in the expectation that the
> processor will stay quiet.

Got it, thank you!

Thanx, Paul

Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Wed, 22 Jul 2015, Paul E. McKenney wrote:

> > > Didn't Christoph Lameter fix this? Or is this an additional problem?
> >
> > Well the vmstat update must realize first that there are no outstanding
> > updates before switching itself off. So typically there is one extra tick.
> > But we could add another function that will simply fold the differential
> > immediately and turn off the kworker task in the expectation that the
> > processor will stay quiet.
>
> Got it, thank you!
>
> Thanx, Paul

Ok here is a function that quiets down the vmstat kworkers.


Subject: vmstat: provide a function to quiet down the diff processing

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_st
}

/*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+ do {
+ if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+ cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+ } while (refresh_cpu_vm_stats());
+}
+
+/*
* Check if the diffs for a certain cpu indicate that
* an update is needed.
*/
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
extern void dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
static inline void refresh_cpu_vm_stats(int cpu) { }
static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }

2015-07-24 13:27:11

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Mon, Jul 13, 2015 at 03:57:57PM -0400, Chris Metcalf wrote:
> The existing nohz_full mode makes tradeoffs to minimize userspace
> interruptions while still attempting to avoid overheads in the
> kernel entry/exit path, to provide 100% kernel semantics, etc.
>
> However, some applications require a stronger commitment from the
> kernel to avoid interruptions, in particular userspace device
> driver style applications, such as high-speed networking code.
>
> This change introduces a framework to allow applications to elect
> to have the stronger semantics as needed, specifying
> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> Subsequent commits will add additional flags and additional
> semantics.
>
> The "cpu_isolated" state is indicated by setting a new task struct
> field, cpu_isolated_flags, to the value passed by prctl(). When the
> _ENABLE bit is set for a task, and it is returning to userspace
> on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
> routine to take additional actions to help the task avoid being
> interrupted in the future.
>
> Initially, there are only two actions taken. First, the task
> calls lru_add_drain() to prevent being interrupted by a subsequent
> lru_add_drain_all() call on another core. Then, the code checks for
> pending timer interrupts and quiesces until they are no longer pending.
> As a result, sys calls (and page faults, etc.) can be inordinately slow.
> However, this quiescing guarantees that no unexpected interrupts will
> occur, even if the application intentionally calls into the kernel.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> arch/tile/kernel/process.c | 9 ++++++++
> include/linux/sched.h | 3 +++
> include/linux/tick.h | 10 ++++++++
> include/uapi/linux/prctl.h | 5 ++++
> kernel/context_tracking.c | 3 +++
> kernel/sys.c | 8 +++++++
> kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 95 insertions(+)
>
> diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
> index e036c0aa9792..3625e839ad62 100644
> --- a/arch/tile/kernel/process.c
> +++ b/arch/tile/kernel/process.c
> @@ -70,6 +70,15 @@ void arch_cpu_idle(void)
> _cpu_idle();
> }
>
> +#ifdef CONFIG_NO_HZ_FULL

I think this goes way beyond nohz itself. We don't only want the tick to shut down,
we also want the pending timers, workqueues, etc. to be quiesced...

It's time to create the CONFIG_ISOLATION_foo stuff.

> +void tick_nohz_cpu_isolated_wait(void)
> +{
> + set_current_state(TASK_INTERRUPTIBLE);
> + _cpu_idle();
> + set_current_state(TASK_RUNNING);
> +}
> +#endif
> +
> /*
> * Release a thread_info structure
> */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ae21f1591615..f350b0c20bbc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1778,6 +1778,9 @@ struct task_struct {
> unsigned long task_state_change;
> #endif
> int pagefault_disabled;
> +#ifdef CONFIG_NO_HZ_FULL
> + unsigned int cpu_isolated_flags;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 3741ba1a652c..cb5569181359 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -10,6 +10,7 @@
> #include <linux/context_tracking_state.h>
> #include <linux/cpumask.h>
> #include <linux/sched.h>
> +#include <linux/prctl.h>
>
> #ifdef CONFIG_GENERIC_CLOCKEVENTS
> extern void __init tick_init(void);
> @@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
> cpumask_or(mask, mask, tick_nohz_full_mask);
> }
>
> +static inline bool tick_nohz_is_cpu_isolated(void)
> +{
> + return tick_nohz_full_cpu(smp_processor_id()) &&
> + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
> +}
> +
> extern void __tick_nohz_full_check(void);
> extern void tick_nohz_full_kick(void);
> extern void tick_nohz_full_kick_cpu(int cpu);
> extern void tick_nohz_full_kick_all(void);
> extern void __tick_nohz_task_switch(struct task_struct *tsk);
> +extern void tick_nohz_cpu_isolated_enter(void);
> #else
> static inline bool tick_nohz_full_enabled(void) { return false; }
> static inline bool tick_nohz_full_cpu(int cpu) { return false; }
> @@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
> static inline void tick_nohz_full_kick(void) { }
> static inline void tick_nohz_full_kick_all(void) { }
> static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
> +static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
> +static inline void tick_nohz_cpu_isolated_enter(void) { }
> #endif
>
> static inline bool is_housekeeping_cpu(int cpu)
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 31891d9535e2..edb40b6b84db 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -190,4 +190,9 @@ struct prctl_mm_map {
> # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
> # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */
>
> +/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
> +#define PR_SET_CPU_ISOLATED 47
> +#define PR_GET_CPU_ISOLATED 48
> +# define PR_CPU_ISOLATED_ENABLE (1 << 0)
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 0a495ab35bc7..f9de3ee12723 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -20,6 +20,7 @@
> #include <linux/hardirq.h>
> #include <linux/export.h>
> #include <linux/kprobes.h>
> +#include <linux/tick.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/context_tracking.h>
> @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
> * on the tick.
> */
> if (state == CONTEXT_USER) {
> + if (tick_nohz_is_cpu_isolated())
> + tick_nohz_cpu_isolated_enter();
> trace_user_enter(0);
> vtime_user_enter(current);
> }
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 259fda25eb6b..36eb9a839f1f 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> case PR_GET_FP_MODE:
> error = GET_FP_MODE(me);
> break;
> +#ifdef CONFIG_NO_HZ_FULL
> + case PR_SET_CPU_ISOLATED:
> + me->cpu_isolated_flags = arg2;
> + break;
> + case PR_GET_CPU_ISOLATED:
> + error = me->cpu_isolated_flags;
> + break;
> +#endif
> default:
> error = -EINVAL;
> break;
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c792429e98c6..4cf093c012d1 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -24,6 +24,7 @@
> #include <linux/posix-timers.h>
> #include <linux/perf_event.h>
> #include <linux/context_tracking.h>
> +#include <linux/swap.h>
>
> #include <asm/irq_regs.h>
>
> @@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
> pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
> cpumask_pr_args(tick_nohz_full_mask));
> }
> +
> +/*
> + * Rather than continuously polling for the next_event in the
> + * tick_cpu_device, architectures can provide a method to save power
> + * by sleeping until an interrupt arrives.
> + */
> +void __weak tick_nohz_cpu_isolated_wait(void)
> +{
> + cpu_relax();
> +}
> +
> +/*
> + * We normally return immediately to userspace.
> + *
> + * In "cpu_isolated" mode we wait until no more interrupts are
> + * pending. Otherwise we nap with interrupts enabled and wait for the
> + * next interrupt to fire, then loop back and retry.
> + *
> + * Note that if you schedule two "cpu_isolated" processes on the same
> + * core, neither will ever leave the kernel, and one will have to be
> + * killed manually. Otherwise in situations where another process is
> + * in the runqueue on this cpu, this task will just wait for that
> + * other task to go idle before returning to user space.
> + */
> +void tick_nohz_cpu_isolated_enter(void)

Similarly, I'd rather see that in kernel/cpu_isolation.c and call it
cpu_isolation_enter().

> +{
> + struct clock_event_device *dev =
> + __this_cpu_read(tick_cpu_device.evtdev);
> + struct task_struct *task = current;
> + unsigned long start = jiffies;
> + bool warned = false;
> +
> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> + lru_add_drain();
> +
> + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
> + if (!warned && (jiffies - start) >= (5 * HZ)) {
> + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
> + task->comm, task->pid, smp_processor_id(),
> + (jiffies - start) / HZ);
> + warned = true;
> + }
> + if (should_resched())
> + schedule();
> + if (test_thread_flag(TIF_SIGPENDING))
> + break;
> + tick_nohz_cpu_isolated_wait();

If we call cpu_idle(), what is going to wake the CPU up if no further interrupts happen?

We could either implement some sort of tick waiters with proper wake up once the CPU sees
no tick to schedule. Arguably this is all risky because this involves a scheduler wake up
and thus the risk for new noise. But it might work.

Another possibility is an msleep() based wait. But that's about the same, maybe even worse
due to repetitive wake ups.

> + }
> + if (warned) {
> + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
> + task->comm, task->pid, smp_processor_id(),
> + (jiffies - start) / HZ);
> + dump_stack();
> + }
> +}
> +
> #endif
>
> /*
> --
> 2.1.2
>

2015-07-24 14:03:22

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On Tue, Jul 21, 2015 at 03:10:54PM -0400, Chris Metcalf wrote:
> >>If you're arguing that the cpu_isolated semantic is really the only
> >>one that makes sense for nohz_full, my sense is that it might be
> >>surprising to many of the folks who do nohz_full work. But, I'm happy
> >>to be wrong on this point, and maybe all the nohz_full community is
> >>interested in making the same tradeoffs for nohz_full generally that
> >>I've proposed in this patch series just for cpu_isolated?
> >nohz_full is currently dog slow for no particularly good reasons. I
> >suspect that the interrupts you're seeing are also there for no
> >particularly good reasons as well.
> >
> >Let's fix them instead of adding new ABIs to work around them.
>
> Well, in principle if we accepted my proposed patch series
> and then over time came to decide that it was reasonable
> for nohz_full to have these complete cpu isolation
> semantics, the one proposed ABI simply becomes a no-op.
> So it's not as problematic an ABI as some.
>
> My issue is this: I'm totally happy with submitting a revised
> patch series that does all the stuff for pure nohz_full that
> I'm currently proposing for cpu_isolated. But, is it what
> the community wants? Should I propose it and see?
>
> Frederic, do you have any insight here? Thanks!

So you guys mean that if nohz_full were implemented fully as we
expect it to be, we wouldn't be burdened by noise at all and this
whole patchset would therefore be pointless, right? And that would
meet the requirements both for those who want hard isolation (a
critical noise-free guarantee) and for those who want soft isolation
(as little noise as possible, for performance).

Well, first of all, nohz is not isolation; it's a significant part of it
but it's not all of isolation. We really want to separate these things and
not mess up isolation policies in the tick code.

Second, yes, perhaps both the soft and hard isolation expectations can
eventually be implemented the same way, through hard isolation.
But that will only work if we don't do that polling for noise-free before
resuming userspace; polling might work for hard isolation that is ready to
sacrifice some warm-up before a run to meet guarantees, but it won't
work for soft isolation workloads.

So the only solution is to offload everything we can to housekeeping
CPUs. And if we still have stuff that can't be dealt with that way,
and which needs to be taken care of with some explicit operation
before resuming userspace, then we can start to think about splitting
things into several isolation configs.

Similarly, offloading everything to housekeepers means that we sacrifice
a CPU that could have been used in performance-oriented workloads, so that
might not suit soft isolation as well. But I think we'll see all that once
we manage to have pure noise-free CPUs (some patches are on the way to be
posted by Vatika Harlalka concerning killing the residual 1Hz tick).

To summarize, let's first split nohz and isolation. Introduce
CONFIG_CPU_ISOLATION and stuff all the isolation policies into
kernel/cpu_isolation.c; let's try to implement hard isolation and see if that
meets soft isolation workload users as well, and if not we'll split that later.

And we can keep the prctl to tell the user when hard isolation has been
broken, through SIGKILL or whatever. I think we do a similar thing
with SCHED_DEADLINE when the task hasn't met its deadline requirement. We
might want to do the same.

2015-07-24 20:19:59

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On 07/24/2015 10:03 AM, Frederic Weisbecker wrote:
> To summarize, let's first split nohz and isolation. Introduce
> CONFIG_CPU_ISOLATION and stuff all the isolation policies into
> kernel/cpu_isolation.c; let's try to implement hard isolation and see if that
> meets soft isolation workload users as well, and if not we'll split that later.

I will do that for v5.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-24 20:21:46

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On 07/24/2015 09:27 AM, Frederic Weisbecker wrote:
> On Mon, Jul 13, 2015 at 03:57:57PM -0400, Chris Metcalf wrote:
>> +{
>> + struct clock_event_device *dev =
>> + __this_cpu_read(tick_cpu_device.evtdev);
>> + struct task_struct *task = current;
>> + unsigned long start = jiffies;
>> + bool warned = false;
>> +
>> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
>> + lru_add_drain();
>> +
>> + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
>> + if (!warned && (jiffies - start) >= (5 * HZ)) {
>> + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
>> + task->comm, task->pid, smp_processor_id(),
>> + (jiffies - start) / HZ);
>> + warned = true;
>> + }
>> + if (should_resched())
>> + schedule();
>> + if (test_thread_flag(TIF_SIGPENDING))
>> + break;
>> + tick_nohz_cpu_isolated_wait();
> If we call cpu_idle(), what is going to wake the CPU up if no further interrupts happen?
>
> We could either implement some sort of tick waiters with proper wake up once the CPU sees
> no tick to schedule. Arguably this is all risky because this involves a scheduler wake up
> and thus the risk for new noise. But it might work.
>
> Another possibility is an msleep() based wait. But that's about the same, maybe even worse
> due to repetitive wake ups.

The presumption here is that it is not possible to have
tick_cpu_device have a pending next_event without also
having a timer interrupt pending to go off. That certainly
seems to be true on the architectures I have looked at.
Do we think that might ever not be the case?

We are running here with interrupts disabled, so this core won't
transition from "timer interrupt scheduled" to "no timer interrupt
scheduled" before we spin or idle, and presumably no other core
can reach across and turn off our timer interrupt either.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-24 20:22:06

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On 07/22/2015 04:02 PM, Christoph Lameter wrote:
> On Wed, 22 Jul 2015, Paul E. McKenney wrote:
>
>>>> Didn't Christoph Lameter fix this? Or is this an additional problem?
>>> Well the vmstat update must realize first that there are no outstanding
>>> updates before switching itself off. So typically there is one extra tick.
>>> But we could add another function that will simply fold the differential
>>> immediately and turn off the kworker task in the expectation that the
>>> processor will stay quiet.
>> Got it, thank you!
>>
>> Thanx, Paul
> Ok here is a function that quiets down the vmstat kworkers.

That's great - I will include this patch in my series then, and call it
as part of the "hard isolation" mode return to userspace. Thanks!

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-24 20:22:26

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode

On 07/21/2015 03:26 PM, Andy Lutomirski wrote:
> On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <[email protected]> wrote:
>> So just for the sake of precision, the thing I'm talking about
>> is the lru_add_drain() call on kernel exit. Are you proposing
>> that we call that for every nohz_full core on kernel exit?
>> I'm not opposed to this, but I don't know if other nohz
>> developers feel like this is the right tradeoff.
> I'm proposing either that we do that or that we arrange for other cpus
> to be able to steal our LRU list while we're in RCU user/idle.

That seems challenging; there is a lot that has to be done in
lru_add_drain() and we may not want to do it for the "soft
isolation" mode Frederic alludes to in a later email. And, we
would have to add a bunch of locking to allow another process
to steal the list from under us, so that's not obviously going
to be a performance win in terms of the per-cpu page cache
for normal operations.

Perhaps there could be a lock taken that nohz_full processes
have to take just to exit from userspace, and that other tasks
could take to do things on behalf of the nohz_full process that
it thinks it can do locklessly. It gets complicated, since you'd
want to tie that to whether the nohz_full process was currently
in the kernel or not, so some kind of atomic update on the
context_tracking state or some such, perhaps. Still not really
clear if that overhead is worth it (both from a maintenance
point of view and the possible performance hit).
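
As a very rough sketch of that idea (purely hypothetical; none of
these symbols exist in the series, and initialization is omitted), a
per-cpu lock could serialize the isolated CPU's kernel exits against
remote draining:

    /* Hypothetical sketch only. The isolated task takes the lock
     * briefly on its return-to-userspace path; housekeeping CPUs take
     * it when they want to drain this CPU's pagevecs remotely instead
     * of sending an IPI. The pagevec code would also have to become
     * aware of the lock for the remote access to actually be safe. */
    static DEFINE_PER_CPU(spinlock_t, isolation_exit_lock);

    /* On the isolated CPU, in the return-to-userspace path: */
    spin_lock(this_cpu_ptr(&isolation_exit_lock));
    lru_add_drain();        /* drain while we still own our state */
    spin_unlock(this_cpu_ptr(&isolation_exit_lock));

    /* On a housekeeping CPU, acting on behalf of isolated CPU "cpu": */
    spin_lock(&per_cpu(isolation_exit_lock, cpu));
    lru_add_drain_cpu(cpu); /* instead of an IPI */
    spin_unlock(&per_cpu(isolation_exit_lock, cpu));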

Limiting it just to the hard isolation mode seems like a good
answer since there we really know that userspace does not
care about the performance implications of kernel/userspace
transitions, and it doesn't cause slowdowns to anyone else.

For now I will bundle it in with my respin as part of the
"hard isolation" mode Frederic proposed.

>> Well, in principle if we accepted my proposed patch series
>> and then over time came to decide that it was reasonable
>> for nohz_full to have these complete cpu isolation
>> semantics, the one proposed ABI simply becomes a no-op.
>> So it's not as problematic an ABI as some.
> What if we made it a debugfs thing instead of a prctl? Have a mode
> where the system tries really hard to quiesce itself even at the cost
> of performance.

No, since it's really a mode within an individual task that you'd
like to switch on and off depending on what the task is trying
to do - strict mode while it's running its main fast-path userspace
code, but certainly not strict mode during its setup, and possibly
leaving strict mode to run some kinds of slow-path, diagnostic,
or error-handling code.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-24 20:30:26

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode

On 07/21/2015 03:42 PM, Andy Lutomirski wrote:
> On Tue, Jul 21, 2015 at 12:34 PM, Chris Metcalf <[email protected]> wrote:
>> Second, you suggest a tracepoint. I'm OK with creating a tracepoint
>> dedicated to cpu_isolated strict failures and making that the only
>> way this mechanism works. But, earlier community feedback seemed to
>> suggest that the signal mechanism was OK; one piece of feedback
>> just requested being able to set which signal was delivered. Do you
>> think the signal idea is a bad one? Are you proposing potentially
>> having a signal and/or a tracepoint?
> I prefer the tracepoint. It's friendlier to debuggers, and it's
> really about diagnosing a kernel problem, not a userspace problem.
> Also, I really doubt that people should deploy a signal thing in
> production. What if an NMI fires and kills their realtime program?

No, this piece of the patch series is about diagnosing bugs in the
userspace program (likely in third-party code, in our customers'
experience). When you violate strict mode, you get a signal and
you have a nice pointer to what instruction it was that caused
you to enter the kernel.

You are right that running this in production is likely not a great
idea, as is true for other debugging mechanisms. But you might
really want to have it as a signal with a signal handler that fires
to generate a trace of some kind into the application's existing
tracing mechanisms, so the app doesn't just report "wow, I lost
a bunch of time in here somewhere, sorry about those packets
I dropped on the floor", but "here's where I took a strict signal".
You probably drop a few additional packets due to the signal
handling and logging, but given you've already fallen away from
100% in this case, the extra diagnostics are almost certainly
worth it.

In this case it's probably not as helpful to have a tracepoint-based
solution, just because you really do want to be able to easily
integrate into the app's existing logging framework.
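
As a purely illustrative sketch, assuming the configurable signal from
the later patch in this series is set to SIGUSR1, and with app_trace()
standing in for the application's own logging hook (exactly what the
siginfo carries depends on how the series fills it in):

    #include <signal.h>
    #include <stdio.h>

    static void app_trace(const char *msg, void *addr)
    {
        /* Stand-in; a real handler should log to an async-signal-safe
         * lock-free buffer rather than calling stdio. */
        fprintf(stderr, "strict violation: %s near %p\n", msg, addr);
    }

    static void strict_handler(int sig, siginfo_t *si, void *uc)
    {
        /* Extracting the exact faulting PC from uc is arch-specific. */
        app_trace("entered kernel", si->si_addr);
    }

    static void install_strict_handler(void)
    {
        struct sigaction sa = {
            .sa_sigaction = strict_handler,
            .sa_flags = SA_SIGINFO,
        };
        sigaction(SIGUSR1, &sa, NULL);
    }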

My sense, I think, is that we can easily add tracepoints to the
strict failure code in the future, so it may not be worth trying to
widen the scope of the patch series just now.

>> Last, you mention systemwide configuration for monitoring. Can you
>> expand on what you mean by that? We already support the monitoring
>> only on the nohz_full cores, so to that extent it's already systemwide.
>> And the per-task flag has to be set by the running process when it's
>> ready for this state, so that can't really be systemwide configuration.
>> I don't understand your suggestion on this point.
> I'm really thinking about systemwide configuration for isolation. I
> think we'll always (at least in the nearish term) need the admin's
> help to set up isolated CPUs. If the admin makes a whole CPU be
> isolated, then monitoring just that CPU and monitoring it all the time
> seems sensible. If we really do think that isolating a CPU should
> require a syscall of some sort because it's too expensive otherwise,
> then we can do it that way, too. And if full isolation requires some
> user help (e.g. don't do certain things that break isolation), then
> having a per-task monitoring flag seems reasonable.
>
> We may always need the user's help to avoid IPIs. For example, if one
> thread calls munmap, the other thread is going to get an IPI. There's
> nothing we can do about that.

I think we're mostly agreed on this stuff, though your use of
"monitored" doesn't really match the "strict" mode in this patch.

It's certainly true that, for example, we advise customers not to
run the slow-path code on a housekeeping cpu as a thread in the
same process space as the fast-path code on the nohz_full cores,
just because things like fclose() on a file descriptor will lead to
free() which can lead to munmap() and an IPI to the fast path.
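
For example, something as innocuous as the following in a slow-path
thread can end in an IPI to the fast-path cores if the two share an mm
(illustrative sketch; the log path is made up):

    #include <stdio.h>

    void log_stats(unsigned long npackets)
    {
        /* Runs on a housekeeping core but shares an address space
         * with the fast path. */
        FILE *f = fopen("/var/log/app.log", "a"); /* allocates a buffer */
        if (!f)
            return;
        fprintf(f, "stats: %lu packets\n", npackets);
        fclose(f);  /* free() can shrink the heap via munmap(), which
                     * IPIs every core running this address space for a
                     * TLB flush -- including the isolated ones. */
    }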

>> I'm certainly OK with rebasing on top of 4.3 after the context
>> tracking stuff is better. That said, I think it makes sense to continue
>> to debate the intent of the patch series even if we pull this one
>> patch out and defer it until after 4.3, or having it end up pulled
>> into some other repo that includes the improvements and
>> is being pulled for 4.3.
> Sure, no problem.

I will add a comment to the patch and a note to the series about
this, but for now I'll keep it in the series. If we can arrange to pull
it into Frederic's tree after the context_tracking changes, we can
respin it at that point to layer it on top.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-07-28 19:49:59

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 0/6] support "cpu_isolated" mode for nohz_full

This version of the patch series incorporates Christoph Lameter's
change to add a quiet_vmstat() call, and restructures cpu_isolated as
a "hard" isolation mode in contrast to nohz_full's "soft" isolation,
breaking it out as a separate CONFIG_CPU_ISOLATED with its own
include/linux/cpu_isolated.h and kernel/time/cpu_isolated.c.
It is rebased to 4.2-rc3.

Thomas: as I mentioned in v4, I haven't heard from you whether my
removal of the cpu_idle calls sufficiently addresses your concerns
about that aspect.

Andy: as I said in email, I've left in the support where cpu_isolated
relies on the context_tracking stuff currently in 4.2-rc3. I'm not
sure what the cleanest way is for me to pick up the new
context_tracking stuff; if that's all that ends up standing between
this patch series and having it be pulled, perhaps I can rebase it
onto whatever branch it is that has the new context_tracking?

Original patch series cover letter follows:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode. The
kernel must be built with CONFIG_CPU_ISOLATED to take advantage of
this new mode. A prctl() option (PR_SET_CPU_ISOLATED) is added to
control whether processes have requested this stricter semantics, and
within that prctl() option we provide a number of different bits for
more precise control. Additionally, we add a new command-line boot
argument to facilitate debugging where unexpected interrupts are being
delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.2-rc3) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v5:
rebased on kernel v4.2-rc3
converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
incorporates Christoph Lameter's quiet_vmstat() call

v4:
rebased on kernel v4.2-rc1
added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
remove dependency on cpu_idle subsystem (Thomas Gleixner)
use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
use seconds for console messages instead of jiffies (Thomas Gleixner)
updated commit description for patch 5/5

v2:
rename "dataplane" to "cpu_isolated"
drop ksoftirqd suppression changes (believed no longer needed)
merge previous "QUIESCE" functionality into baseline functionality
explicitly track syscalls and exceptions for "STRICT" functionality
allow configuring a signal to be delivered for STRICT mode failures
move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555); also because,
if the 1Hz tick were left enabled, cpu_isolated threads would never
re-enter userspace, since a tick would always be pending.

Chris Metcalf (5):
cpu_isolated: add initial support
cpu_isolated: support PR_CPU_ISOLATED_STRICT mode
cpu_isolated: provide strict mode configurable signal
cpu_isolated: add debug boot flag
nohz: cpu_isolated: allow tick to be fully disabled

Christoph Lameter (1):
vmstat: provide a function to quiet down the diff processing

Documentation/kernel-parameters.txt | 7 +++
arch/arm64/kernel/ptrace.c | 5 ++
arch/tile/kernel/process.c | 9 +++
arch/tile/kernel/ptrace.c | 5 +-
arch/tile/mm/homecache.c | 5 +-
arch/x86/kernel/ptrace.c | 2 +
include/linux/context_tracking.h | 11 +++-
include/linux/cpu_isolated.h | 42 +++++++++++++
include/linux/sched.h | 3 +
include/linux/vmstat.h | 2 +
include/uapi/linux/prctl.h | 8 +++
kernel/context_tracking.c | 12 +++-
kernel/irq_work.c | 5 +-
kernel/sched/core.c | 21 +++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 7 +++
kernel/sys.c | 8 +++
kernel/time/Kconfig | 20 +++++++
kernel/time/Makefile | 1 +
kernel/time/cpu_isolated.c | 116 ++++++++++++++++++++++++++++++++++++
kernel/time/tick-sched.c | 3 +-
mm/vmstat.c | 14 +++++
23 files changed, 305 insertions(+), 10 deletions(-)
create mode 100644 include/linux/cpu_isolated.h
create mode 100644 kernel/time/cpu_isolated.c

--
2.1.2

2015-07-28 19:50:01

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 1/6] vmstat: provide a function to quiet down the diff processing

From: Christoph Lameter <[email protected]>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <[email protected]>
---
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 14 ++++++++++++++
2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7f7100..c013b8d8e434 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
extern void dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page,
static inline void refresh_cpu_vm_stats(int cpu) { }
static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..cf7d324f16e2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_struct *w)
}

/*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+ do {
+ if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+ cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+ } while (refresh_cpu_vm_stats());
+}
+
+/*
* Check if the diffs for a certain cpu indicate that
* an update is needed.
*/
--
2.1.2

2015-07-28 19:50:11

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 2/6] cpu_isolated: add initial support

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new CPU_ISOLATED Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument. The "cpu_isolated" state is then
indicated by setting a new task struct field, cpu_isolated_flags,
to the value passed by prctl(). When the _ENABLE bit is set for a
task, and it is returning to userspace on a nohz_full core, it calls
the new cpu_isolated_enter() routine to take additional actions
to help the task avoid being interrupted in the future.

Initially, there are only three actions taken. First, the
task calls lru_add_drain() to prevent being interrupted by a
subsequent lru_add_drain_all() call on another core. Then, it calls
quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
interrupt. Finally, the code checks for pending timer interrupts
and quiesces until they are no longer pending. As a result, sys
calls (and page faults, etc.) can be inordinately slow. However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.
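
For illustration (this snippet is not part of the patch), a task would
opt in roughly as follows, defining the new prctl constants locally
since installed headers will not have them yet; the core number is a
placeholder:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/prctl.h>

    #define PR_SET_CPU_ISOLATED    47
    #define PR_CPU_ISOLATED_ENABLE (1 << 0)

    int main(void)
    {
        cpu_set_t set;

        /* Pin ourselves to a core in the nohz_full= set. */
        CPU_ZERO(&set);
        CPU_SET(3, &set);    /* placeholder core number */
        sched_setaffinity(0, sizeof(set), &set);

        /* Request hard isolation; it takes effect on each return to
         * userspace, which then quiesces pending timer ticks. */
        prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0);

        for (;;)
            ;    /* fast-path polling loop, never entering the kernel */
    }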

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/process.c | 9 ++++++
include/linux/cpu_isolated.h | 24 +++++++++++++++
include/linux/sched.h | 3 ++
include/uapi/linux/prctl.h | 5 ++++
kernel/context_tracking.c | 3 ++
kernel/sys.c | 8 +++++
kernel/time/Kconfig | 20 +++++++++++++
kernel/time/Makefile | 1 +
kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 144 insertions(+)
create mode 100644 include/linux/cpu_isolated.h
create mode 100644 kernel/time/cpu_isolated.c

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..7db6f8386417 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
_cpu_idle();
}

+#ifdef CONFIG_CPU_ISOLATED
+void cpu_isolated_wait(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ _cpu_idle();
+ set_current_state(TASK_RUNNING);
+}
+#endif
+
/*
* Release a thread_info structure
*/
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
new file mode 100644
index 000000000000..a3d17360f7ae
--- /dev/null
+++ b/include/linux/cpu_isolated.h
@@ -0,0 +1,24 @@
+/*
+ * CPU isolation related global functions
+ */
+#ifndef _LINUX_CPU_ISOLATED_H
+#define _LINUX_CPU_ISOLATED_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_CPU_ISOLATED
+static inline bool is_cpu_isolated(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
+extern void cpu_isolated_enter(void);
+extern void cpu_isolated_wait(void);
+#else
+static inline bool is_cpu_isolated(void) { return false; }
+static inline void cpu_isolated_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..0bb248385d88 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1776,6 +1776,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_CPU_ISOLATED
+ unsigned int cpu_isolated_flags;
+#endif
/* CPU-specific state of this task */
struct thread_struct thread;
/*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED 47
+#define PR_GET_CPU_ISOLATED 48
+# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..36b6509c3e2a 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/cpu_isolated.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (is_cpu_isolated())
+ cpu_isolated_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..c68417ff4800 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_CPU_ISOLATED
+ case PR_SET_CPU_ISOLATED:
+ me->cpu_isolated_flags = arg2;
+ break;
+ case PR_GET_CPU_ISOLATED:
+ error = me->cpu_isolated_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 579ce1b929af..141969149994 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -195,5 +195,25 @@ config HIGH_RES_TIMERS
hardware is not capable then this option only increases
the size of the kernel image.

+config CPU_ISOLATED
+ bool "Provide hard CPU isolation from the kernel on demand"
+ depends on NO_HZ_FULL
+ help
+ Allow userspace processes to place themselves on nohz_full
+ cores and run prctl(PR_SET_CPU_ISOLATED) to "isolate"
+ themselves from the kernel. On return to userspace,
+ cpu-isolated tasks will first arrange that no future kernel
+ activity will interrupt the task while the task is running
+ in userspace. This "hard" isolation from the kernel is
+ required for userspace tasks with hard real-time requirements,
+ such as a 10 Gbit network driver running in userspace.
+
+ Without this option, but with NO_HZ_FULL enabled, the kernel
+ will make a good-faith, "soft" effort to shield a single userspace
+ process from interrupts, but makes no guarantees.
+
+ You should say "N" unless you are intending to run a
+ high-performance userspace driver or similar task.
+
endmenu
endif
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 49eca0beed32..984081cce974 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o tick-sched.o
obj-$(CONFIG_TIMER_STATS) += timer_stats.o
obj-$(CONFIG_DEBUG_FS) += timekeeping_debug.o
obj-$(CONFIG_TEST_UDELAY) += test_udelay.o
+obj-$(CONFIG_CPU_ISOLATED) += cpu_isolated.o
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
new file mode 100644
index 000000000000..e27259f30caf
--- /dev/null
+++ b/kernel/time/cpu_isolated.c
@@ -0,0 +1,71 @@
+/*
+ * linux/kernel/time/cpu_isolated.c
+ *
+ * Implementation for cpu isolation.
+ *
+ * Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/cpu_isolated.h>
+#include "tick-sched.h"
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak cpu_isolated_wait(void)
+{
+ cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In cpu_isolated mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two cpu_isolated processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void cpu_isolated_enter(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ /* Quieten the vmstat worker so it won't interrupt us. */
+ quiet_vmstat();
+
+ while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+ cpu_isolated_wait();
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ dump_stack();
+ }
+}
--
2.1.2

2015-07-28 19:51:37

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 3/6] cpu_isolated: support PR_CPU_ISOLATED_STRICT mode

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves. In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies. Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.
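
For example (an illustrative fragment, not part of the patch;
run_main_loop() is a hypothetical application function), a task might
bracket its latency-critical loop like this, relying on the prctl()
exemption to turn strict mode back off afterwards:

	prctl(PR_SET_CPU_ISOLATED,
	      PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT, 0, 0, 0);
	run_main_loop();	/* any syscall or trap here is fatal */
	/* prctl() is exempt from the strict check, so this is safe: */
	prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0);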

This change adds the syscall-detection hooks only for x86, arm64,
and tile.

The signature of context_tracking_exit() changes to report whether
we are, in fact, exiting back to user space, so that we can properly
track user exceptions separately from other kernel entries.

Signed-off-by: Chris Metcalf <[email protected]>
---
Note: Andy Lutomirski points out that improvements are coming
to the context_tracking code to make it more robust, which may
mean that some of the code suggested here for context_tracking
may not be necessary. I am keeping it in the series for now since
it is required for it to work based on 4.2-rc3.

arch/arm64/kernel/ptrace.c | 5 +++++
arch/tile/kernel/ptrace.c | 5 ++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/cpu_isolated.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/time/cpu_isolated.c | 38 ++++++++++++++++++++++++++++++++++++++
8 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..ff83968ab4d4 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
#include <linux/regset.h>
#include <linux/tracehook.h>
#include <linux/elf.h>
+#include <linux/cpu_isolated.h>

#include <asm/compat.h>
#include <asm/debug-monitors.h>
@@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,

asmlinkage int syscall_trace_enter(struct pt_regs *regs)
{
+ /* Ensure we report cpu_isolated violations in all circumstances. */
+ if (test_thread_flag(TIF_NOHZ) && cpu_isolated_strict())
+ cpu_isolated_syscall(regs->syscallno);
+
/* Do the secure computing check first; failures should be fast. */
if (secure_computing() == -1)
return -1;
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..e54256c54311 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (cpu_isolated_strict())
+ cpu_isolated_syscall(regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..e5aec57e8e25 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (cpu_isolated_strict())
+ cpu_isolated_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..590414ef2bf1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/cpu_isolated.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);

@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (cpu_isolated_strict())
+ cpu_isolated_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
index a3d17360f7ae..b0f1c2669b2f 100644
--- a/include/linux/cpu_isolated.h
+++ b/include/linux/cpu_isolated.h
@@ -15,10 +15,26 @@ static inline bool is_cpu_isolated(void)
}

extern void cpu_isolated_enter(void);
+extern void cpu_isolated_syscall(int nr);
+extern void cpu_isolated_exception(void);
extern void cpu_isolated_wait(void);
#else
static inline bool is_cpu_isolated(void) { return false; }
static inline void cpu_isolated_enter(void) { }
+static inline void cpu_isolated_syscall(int nr) { }
+static inline void cpu_isolated_exception(void) { }
#endif

+static inline bool cpu_isolated_strict(void)
+{
+#ifdef CONFIG_CPU_ISOLATED
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->cpu_isolated_flags &
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+ (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_CPU_ISOLATED 47
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
+# define PR_CPU_ISOLATED_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 36b6509c3e2a..c740850eea11 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
context_tracking_recursion_exit();
out_irq_restore:
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
index e27259f30caf..d30bf3852897 100644
--- a/kernel/time/cpu_isolated.c
+++ b/kernel/time/cpu_isolated.c
@@ -10,6 +10,7 @@
#include <linux/swap.h>
#include <linux/vmstat.h>
#include <linux/cpu_isolated.h>
+#include <asm/unistd.h>
#include "tick-sched.h"

/*
@@ -69,3 +70,40 @@ void cpu_isolated_enter(void)
dump_stack();
}
}
+
+static void kill_cpu_isolated_strict_task(void)
+{
+ dump_stack();
+ current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void cpu_isolated_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void cpu_isolated_exception(void)
+{
+ pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_cpu_isolated_strict_task();
+}
--
2.1.2

2015-07-28 19:50:18

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 4/6] cpu_isolated: provide strict mode configurable signal

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
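
As an illustrative sketch (not part of the patch), a task could arrange
to survive a violation with SIGUSR1 rather than dying with SIGKILL, and
distinguish syscalls from exceptions via si_code; the fallback macro
values are the ones this series defines in <linux/prctl.h>:

	#include <signal.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_CPU_ISOLATED
	# define PR_SET_CPU_ISOLATED 47
	# define PR_CPU_ISOLATED_ENABLE (1 << 0)
	# define PR_CPU_ISOLATED_STRICT (1 << 1)
	# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8)
	#endif

	static void isol_handler(int sig, siginfo_t *info, void *uc)
	{
		/* Per this patch: si_code is 1 for a syscall, 0 otherwise. */
		const char *msg = info->si_code ?
			"isolation lost: syscall\n" :
			"isolation lost: exception\n";
		write(STDERR_FILENO, msg, strlen(msg));	/* signal-safe */
	}

	static int enable_strict_with_signal(void)
	{
		struct sigaction sa = {
			.sa_sigaction = isol_handler,
			.sa_flags = SA_SIGINFO,
		};

		sigemptyset(&sa.sa_mask);
		if (sigaction(SIGUSR1, &sa, NULL) != 0)
			return -1;
		return prctl(PR_SET_CPU_ISOLATED,
			     PR_CPU_ISOLATED_ENABLE |
			     PR_CPU_ISOLATED_STRICT |
			     PR_CPU_ISOLATED_SET_SIG(SIGUSR1), 0, 0, 0);
	}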

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/time/cpu_isolated.c | 17 ++++++++++++-----
2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_CPU_ISOLATED 48
# define PR_CPU_ISOLATED_ENABLE (1 << 0)
# define PR_CPU_ISOLATED_STRICT (1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
index d30bf3852897..9f8fcbd97770 100644
--- a/kernel/time/cpu_isolated.c
+++ b/kernel/time/cpu_isolated.c
@@ -71,11 +71,18 @@ void cpu_isolated_enter(void)
}
}

-static void kill_cpu_isolated_strict_task(void)
-{
+static void kill_cpu_isolated_strict_task(int is_syscall)
+{
+ siginfo_t info = {};
+ int sig;
+
dump_stack();
current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
- send_sig(SIGKILL, current, 1);
+
+ sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+ info.si_signo = sig;
+ info.si_code = is_syscall;
+ send_sig_info(sig, &info, current);
}

/*
@@ -94,7 +101,7 @@ void cpu_isolated_syscall(int syscall)

pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
current->comm, current->pid, syscall);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(1);
}

/*
@@ -105,5 +112,5 @@ void cpu_isolated_exception(void)
{
pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
current->comm, current->pid);
- kill_cpu_isolated_strict_task();
+ kill_cpu_isolated_strict_task(0);
}
--
2.1.2

2015-07-28 19:50:58

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 5/6] cpu_isolated: add debug boot flag

The new "cpu_isolated_debug" flag simplifies debugging
of CPU_ISOLATED kernels when processes are running in
PR_CPU_ISOLATED_ENABLE mode. Such processes should get no interrupts
from the kernel, and if they do, this boot flag causes a kernel
stack dump to be generated on the console.

It's possible to use ftrace to simply detect whether a cpu_isolated
core has unexpectedly entered the kernel. But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code what remote core and context
is preparing to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
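
For reference, a boot command line exercising the flag might look like
this (the nohz_full CPU list here is just an example):

	nohz_full=1-3 cpu_isolated_debug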

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 7 +++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/cpu_isolated.h | 2 ++
kernel/irq_work.c | 5 ++++-
kernel/sched/core.c | 21 +++++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 7 +++++++
8 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f0459cd7b..940e4c9f1978 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -749,6 +749,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
/proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.txt.

+ cpu_isolated_debug [KNL]
+ In kernels built with CONFIG_CPU_ISOLATED and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_CPU_ISOLATED_ENABLE
+ and is running on a nohz_full core.
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..fdef5e3d6396 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/cpu_isolated.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ cpu_isolated_debug(cpu);
+ }
}

/*
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
index b0f1c2669b2f..4ea67d640be7 100644
--- a/include/linux/cpu_isolated.h
+++ b/include/linux/cpu_isolated.h
@@ -18,11 +18,13 @@ extern void cpu_isolated_enter(void);
extern void cpu_isolated_syscall(int nr);
extern void cpu_isolated_exception(void);
extern void cpu_isolated_wait(void);
+extern void cpu_isolated_debug(int cpu);
#else
static inline bool is_cpu_isolated(void) { return false; }
static inline void cpu_isolated_enter(void) { }
static inline void cpu_isolated_syscall(int nr) { }
static inline void cpu_isolated_exception(void) { }
+static inline void cpu_isolated_debug(int cpu) { }
#endif

static inline bool cpu_isolated_strict(void)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..3c08a41f9898 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/smp.h>
+#include <linux/cpu_isolated.h>
#include <asm/processor.h>


@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ cpu_isolated_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10081..647671900497 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
#include <linux/binfmts.h>
#include <linux/context_tracking.h>
#include <linux/compiler.h>
+#include <linux/cpu_isolated.h>

#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -745,6 +746,26 @@ bool sched_can_stop_tick(void)
}
#endif /* CONFIG_NO_HZ_FULL */

+#ifdef CONFIG_CPU_ISOLATED
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static bool cpu_isolated_debug_flag;
+static int __init cpu_isolated_debug_func(char *str)
+{
+ cpu_isolated_debug_flag = true;
+ return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void cpu_isolated_debug(int cpu)
+{
+ if (cpu_isolated_debug_flag && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+ pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+ dump_stack();
+ }
+}
+#endif
+
void sched_avg_update(struct rq *rq)
{
s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 836df8dac6cc..90ee460c2586 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_CPU_ISOLATED
+ /* If the task is being killed, don't complain about cpu_isolated. */
+ if (state & TASK_WAKEKILL)
+ t->cpu_isolated_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..846e42a3daa3 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/cpu_isolated.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ cpu_isolated_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ cpu_isolated_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..456149a4a34f 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,8 +24,10 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
+#include <linux/context_tracking.h>
#include <linux/tick.h>
#include <linux/irq.h>
+#include <linux/cpu_isolated.h>

#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>
@@ -335,6 +337,11 @@ void irq_enter(void)
_local_bh_enable();
}

+ if (context_tracking_cpu_is_enabled() &&
+ context_tracking_in_user() &&
+ !in_interrupt())
+ cpu_isolated_debug(smp_processor_id());
+
__irq_enter();
}

--
2.1.2

2015-07-28 19:50:27

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v5 6/6] nohz: cpu_isolated: allow tick to be fully disabled

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on
running completely tickless, so don't bound the time_delta for such
processes. In addition, due to the way such processes quiesce by
waiting for the timer tick to stop prior to returning to userspace,
without this commit it won't be possible to use the cpu_isolated
mode at all.

Removing the 1-second cap was previously discussed (see link
below) and Thomas Gleixner observed that vruntime, load balancing
data, load accounting, and other things might be impacted.
Frederic Weisbecker similarly observed that allowing the tick to
be indefinitely deferred just meant that no one would ever fix the
underlying bugs. However it's at least true that the mode proposed
in this patch can only be enabled on a nohz_full core by a process
requesting cpu_isolated mode, which may limit how important it is
to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz
fallback timer is removed, this will create an environment where new code
that relies on that tick will get punished, and we won't forgive
such assumptions silently, so it may also be worth it from that
perspective.

Finally, it's worth observing that the tile architecture has been
using similar code for its Zero-Overhead Linux for many years
(starting in 2008) and customers are very enthusiastic about the
resulting bare-metal performance on cores that are available to
run full Linux semantics on demand (crash, logging, shutdown, etc).
So these semantics are very useful if we can convince ourselves
that doing this is safe.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/time/tick-sched.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429e98c6..3a1d48418499 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/cpu_isolated.h>

#include <asm/irq_regs.h>

@@ -652,7 +653,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,

#ifdef CONFIG_NO_HZ_FULL
/* Limit the tick delta to the maximum scheduler deferment */
- if (!ts->inidle)
+ if (!ts->inidle && !is_cpu_isolated())
delta = min(delta, scheduler_tick_max_deferment());
#endif

--
2.1.2

2015-08-12 16:00:27

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v5 2/6] cpu_isolated: add initial support

On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote:
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
>
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device
> driver style applications, such as high-speed networking code.
>
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> Subsequent commits will add additional flags and additional
> semantics.

We are doing this at the process level, but the isolation works at
the CPU scope... Now I wonder if prctl() is the right interface.

That said, the user is mostly interested in isolating a task; the
CPU is just the eventual backend.

For example if the task is migrated by accident, we want it to be
warned about that. And if the isolation is done on the CPU level
instead of the task level, this won't happen.

I'm also afraid that the naming clashes with cpu_isolated_map,
although it could be a subset of it.

So probably in this case we should consider talking about task rather
than CPU isolation and change naming accordingly (sorry, I know I
suggested cpu_isolation.c, I guess I had to see the result to realize).

We must sort that out first. Either we consider isolation on the task
level (and thus the underlying CPU by backend effect) and we use prctl().
Or we do this on the CPU level and we use a specific syscall or sysfs
which takes effect on any task in the relevant isolated CPUs.

What do you think?

It would be nice to hear others opinions as well.

> The kernel must be built with the new CPU_ISOLATED Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> nohz_full=CPULIST boot argument. The "cpu_isolated" state is then
> indicated by setting a new task struct field, cpu_isolated_flags,
> to the value passed by prctl(). When the _ENABLE bit is set for a
> task, and it is returning to userspace on a nohz_full core, it calls
> the new cpu_isolated_enter() routine to take additional actions
> to help the task avoid being interrupted in the future.
>
> Initially, there are only three actions taken. First, the
> task calls lru_add_drain() to prevent being interrupted by a
> subsequent lru_add_drain_all() call on another core. Then, it calls
> quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
> interrupt. Finally, the code checks for pending timer interrupts
> and quiesces until they are no longer pending. As a result, system
> calls (and page faults, etc.) can be inordinately slow. However,
> this quiescing guarantees that no unexpected interrupts will occur,
> even if the application intentionally calls into the kernel.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> arch/tile/kernel/process.c | 9 ++++++
> include/linux/cpu_isolated.h | 24 +++++++++++++++
> include/linux/sched.h | 3 ++
> include/uapi/linux/prctl.h | 5 ++++
> kernel/context_tracking.c | 3 ++
> kernel/sys.c | 8 +++++
> kernel/time/Kconfig | 20 +++++++++++++
> kernel/time/Makefile | 1 +
> kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++

It's not about time :-)

The timer is only a part of the isolation.

Moreover "isolatED" is a state. The filename should reflect the process. "isolatION" would
better fit.

kernel/task_isolation.c maybe or just kernel/isolation.c

I think I prefer the latter because I'm not only interested in the hard
task-isolation feature; I would also like to drive all the general isolation
operations from there (workqueue affinity, rcu nocb, ...).

> 9 files changed, 144 insertions(+)
> create mode 100644 include/linux/cpu_isolated.h
> create mode 100644 kernel/time/cpu_isolated.c
>
> diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
> index e036c0aa9792..7db6f8386417 100644
> --- a/arch/tile/kernel/process.c
> +++ b/arch/tile/kernel/process.c
> @@ -70,6 +70,15 @@ void arch_cpu_idle(void)
> _cpu_idle();
> }
>
> +#ifdef CONFIG_CPU_ISOLATED
> +void cpu_isolated_wait(void)
> +{
> + set_current_state(TASK_INTERRUPTIBLE);
> + _cpu_idle();
> + set_current_state(TASK_RUNNING);
> +}

I'm still uncomfortable with that. Could a wake-up model work?

> +#endif
> +
> /*
> * Release a thread_info structure
> */
> diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
> new file mode 100644
> index 000000000000..a3d17360f7ae
> --- /dev/null
> +++ b/include/linux/cpu_isolated.h
> @@ -0,0 +1,24 @@
> +/*
> + * CPU isolation related global functions
> + */
> +#ifndef _LINUX_CPU_ISOLATED_H
> +#define _LINUX_CPU_ISOLATED_H
> +
> +#include <linux/tick.h>
> +#include <linux/prctl.h>
> +
> +#ifdef CONFIG_CPU_ISOLATED
> +static inline bool is_cpu_isolated(void)
> +{
> + return tick_nohz_full_cpu(smp_processor_id()) &&
> + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
> +}
> +
> +extern void cpu_isolated_enter(void);
> +extern void cpu_isolated_wait(void);
> +#else
> +static inline bool is_cpu_isolated(void) { return false; }
> +static inline void cpu_isolated_enter(void) { }
> +#endif

And all the naming should be about task as well, if we take that task direction.

> +
> +#endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04b5ada460b4..0bb248385d88 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1776,6 +1776,9 @@ struct task_struct {
> unsigned long task_state_change;
> #endif
> int pagefault_disabled;
> +#ifdef CONFIG_CPU_ISOLATED
> + unsigned int cpu_isolated_flags;
> +#endif

Can't we add a new flag to tsk->flags? There seem to be some values remaining.

Thanks.

2015-08-12 18:22:28

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v5 2/6] cpu_isolated: add initial support

On 08/12/2015 12:00 PM, Frederic Weisbecker wrote:
> On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote:
>> The existing nohz_full mode is designed as a "soft" isolation mode
>> that makes tradeoffs to minimize userspace interruptions while
>> still attempting to avoid overheads in the kernel entry/exit path,
>> to provide 100% kernel semantics, etc.
>>
>> However, some applications require a "hard" commitment from the
>> kernel to avoid interruptions, in particular userspace device
>> driver style applications, such as high-speed networking code.
>>
>> This change introduces a framework to allow applications
>> to elect to have the "hard" semantics as needed, specifying
>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>> Subsequent commits will add additional flags and additional
>> semantics.
> We are doing this at the process level but the isolation works on
> the CPU scope... Now I wonder if prctl is the right interface.
>
> That said the user is rather interested in isolating a task. The CPU
> being the backend eventually.
>
> For example if the task is migrated by accident, we want it to be
> warned about that. And if the isolation is done on the CPU level
> instead of the task level, this won't happen.
>
> I'm also afraid that the naming clashes with cpu_isolated_map,
> although it could be a subset of it.
>
> So probably in this case we should consider talking about task rather
> than CPU isolation and change naming accordingly (sorry, I know I
> suggested cpu_isolation.c, I guess I had to see the result to realize).
>
> We must sort that out first. Either we consider isolation on the task
> level (and thus the underlying CPU by backend effect) and we use prctl().
> Or we do this on the CPU level and we use a specific syscall or sysfs
> which takes effect on any task in the relevant isolated CPUs.
>
> What do you think?

Yes, definitely task-centric is the right model.

With the original tilegx version of this code, we also checked that
the process had only a single core in its affinity mask, and that the
single core in question was a nohz_full core, before allowing the
"task isolated" mode to take effect. I didn't do that in this round
of patches because it seemed a little silly in that the user could
then immediately reset their affinity to another core and lose the
effect, and it wasn't clear how to handle that: do we return EINVAL
from sched_setaffinity() after enabling the "task isolated" mode?
That seems potentially ugly, maybe standards-violating, etc. So I
didn't bother.

But you could certainly argue for failing prctl() in that case anyway,
as a way to make sure users aren't doing something stupid like calling
the prctl() from a task that's running on a housekeeping core. And
you could even argue for doing some kind of console spew if you try to
migrate a task that is in "task isolation" state - though I suppose if
you migrate it to another isolcpus and nohz_full core, maybe that's
kind of reasonable and doesn't deserve a warning? I'm not sure.
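
(For concreteness, the tilegx-style check I'm describing would look
roughly like the sketch below; this is not code from the series, and
the helper name is invented:

	/* Allow the prctl() only if pinned to a single nohz_full core. */
	static int cpu_isolated_may_enable(struct task_struct *p)
	{
		if (cpumask_weight(tsk_cpus_allowed(p)) != 1)
			return -EINVAL;
		if (!tick_nohz_full_cpu(cpumask_first(tsk_cpus_allowed(p))))
			return -EINVAL;
		return 0;
	}

Though, as noted, nothing stops the task from resetting its affinity
immediately after a check like this succeeds.)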

>> The kernel must be built with the new CPU_ISOLATED Kconfig flag
>> to enable this mode, and the kernel booted with an appropriate
>> nohz_full=CPULIST boot argument. The "cpu_isolated" state is then
>> indicated by setting a new task struct field, cpu_isolated_flags,
>> to the value passed by prctl(). When the _ENABLE bit is set for a
>> task, and it is returning to userspace on a nohz_full core, it calls
>> the new cpu_isolated_enter() routine to take additional actions
>> to help the task avoid being interrupted in the future.
>>
>> Initially, there are only three actions taken. First, the
>> task calls lru_add_drain() to prevent being interrupted by a
>> subsequent lru_add_drain_all() call on another core. Then, it calls
>> quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
>> interrupt. Finally, the code checks for pending timer interrupts
>> and quiesces until they are no longer pending. As a result, sys
>> calls (and page faults, etc.) can be inordinately slow. However,
>> this quiescing guarantees that no unexpected interrupts will occur,
>> even if the application intentionally calls into the kernel.
>>
>> Signed-off-by: Chris Metcalf <[email protected]>
>> ---
>> arch/tile/kernel/process.c | 9 ++++++
>> include/linux/cpu_isolated.h | 24 +++++++++++++++
>> include/linux/sched.h | 3 ++
>> include/uapi/linux/prctl.h | 5 ++++
>> kernel/context_tracking.c | 3 ++
>> kernel/sys.c | 8 +++++
>> kernel/time/Kconfig | 20 +++++++++++++
>> kernel/time/Makefile | 1 +
>> kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++
> It's not about time :-)
>
> The timer is only a part of the isolation.
>
> Moreover "isolatED" is a state. The filename should reflect the process. "isolatION" would
> better fit.
>
> kernel/task_isolation.c maybe or just kernel/isolation.c
>
> I think I prefer the latter because I'm not only interested in that task
> hard isolation feature, I would like to also drive all the general isolation
> operations from there (workqueue affinity, rcu nocb, ...).

That's reasonable, but I think the "task isolation" naming is probably
better for all the stuff that we're doing in this patch. In other words,
we probably should use "task_isolation" as the prefix for symbols
names and API names, even if we put the code in kernel/isolation.c
for now in anticipation of non-task isolation being added later.

I think my instinct would still be to call it kernel/task_isolation.c
until we actually add some non-task isolation, and at that point we
can decide if it makes sense to rename the file, or put the new
code somewhere else, but I'm OK with doing it the way I described
in the previous paragraph if you think it's better.

>> +#ifdef CONFIG_CPU_ISOLATED
>> +void cpu_isolated_wait(void)
>> +{
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + _cpu_idle();
>> + set_current_state(TASK_RUNNING);
>> +}
> I'm still uncomfortable with that. Could a wake-up model work?

I don't know exactly what you have in mind. The theory is that
at this point we're ready to return to user space and we're just
waiting for a timer tick that is guaranteed to arrive, since there
is something pending for the timer.

And, this is an arch-specific method anyway; the generic method
is actually checking to see if a signal has been delivered,
scheduling is needed, etc., each time around the loop, so if
you're not sure your architecture will do the right thing, just
don't provide a method that idles while waiting. For tilegx I'm
sure it works correctly, so I'm OK providing that method.

>> +extern void cpu_isolated_enter(void);
>> +extern void cpu_isolated_wait(void);
>> +#else
>> +static inline bool is_cpu_isolated(void) { return false; }
>> +static inline void cpu_isolated_enter(void) { }
>> +#endif
> And all the naming should be about task as well, if we take that task direction.

As discussed above, probably task_isolation_enter(), etc.

>> +
>> +#endif
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 04b5ada460b4..0bb248385d88 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1776,6 +1776,9 @@ struct task_struct {
>> unsigned long task_state_change;
>> #endif
>> int pagefault_disabled;
>> +#ifdef CONFIG_CPU_ISOLATED
>> + unsigned int cpu_isolated_flags;
>> +#endif
> Can't we add a new flag to tsk->flags? There seem to be some values remaining.

Yeah, I thought of that, but it seems like a pretty scarce resource,
and I wasn't sure it was the right thing to do. Also, I'm not actually
sure why the lowest two bits aren't apparently being used; looks
like PF_EXITING (0x4) is the first bit used. And there are only three
more bits higher up in the word that are not assigned.

Also, right now we are allowing users to customize the signal delivered
for STRICT violation, and that signal value is stored in the
cpu_isolated_flags word as well, so we really don't have room in
tsk->flags for all of that anyway.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-08-25 19:56:16

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 0/6] support "task_isolated" mode for nohz_full

The cover email for the patch series is getting a little unwieldy
so I will provide a terser summary here, and just update the
list of changes from version to version. Please see the previous
versions linked by the In-Reply-To for more detailed comments
about changes in earlier versions of the patch series.

v6:
restructured to be a "task_isolation" mode not a "cpu_isolated"
mode (Frederic)

v5:
rebased on kernel v4.2-rc3
converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
incorporates Christoph Lameter's quiet_vmstat() call

v4:
rebased on kernel v4.2-rc1
added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
remove dependency on cpu_idle subsystem (Thomas Gleixner)
use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
use seconds for console messages instead of jiffies (Thomas Gleixner)
updated commit description for patch 5/5

v2:
rename "dataplane" to "cpu_isolated"
drop ksoftirqd suppression changes (believed no longer needed)
merge previous "QUIESCE" functionality into baseline functionality
explicitly track syscalls and exceptions for "STRICT" functionality
allow configuring a signal to be delivered for STRICT mode failures
move debug tracking to irq_enter(), not irq_exit()

General summary:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it. However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode. The
kernel must be built with CONFIG_TASK_ISOLATION to take advantage of
this new mode. A prctl() option (PR_SET_TASK_ISOLATION) is added to
control whether processes have requested this stricter semantics, and
within that prctl() option we provide a number of different bits for
more precise control. Additionally, we add a new command-line boot
argument to facilitate debugging where unexpected interrupts are being
delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers. This patch
series represents the first serious attempt to upstream that
functionality. Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on task_isolation cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.2-rc3) is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555); also because
if we keep the 1Hz tick, task_isolation threads will never re-enter
userspace, since a tick will always be pending.

Chris Metcalf (5):
task_isolation: add initial support
task_isolation: support PR_TASK_ISOLATION_STRICT mode
task_isolation: provide strict mode configurable signal
task_isolation: add debug boot flag
nohz: task_isolation: allow tick to be fully disabled

Christoph Lameter (1):
vmstat: provide a function to quiet down the diff processing

Documentation/kernel-parameters.txt | 7 +++
arch/arm64/kernel/ptrace.c | 5 ++
arch/tile/kernel/process.c | 9 +++
arch/tile/kernel/ptrace.c | 5 +-
arch/tile/mm/homecache.c | 5 +-
arch/x86/kernel/ptrace.c | 2 +
include/linux/context_tracking.h | 11 +++-
include/linux/isolation.h | 42 +++++++++++++
include/linux/sched.h | 3 +
include/linux/vmstat.h | 2 +
include/uapi/linux/prctl.h | 8 +++
init/Kconfig | 20 ++++++
kernel/Makefile | 1 +
kernel/context_tracking.c | 12 +++-
kernel/irq_work.c | 5 +-
kernel/isolation.c | 122 ++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 21 +++++++
kernel/signal.c | 5 ++
kernel/smp.c | 4 ++
kernel/softirq.c | 7 +++
kernel/sys.c | 8 +++
kernel/time/tick-sched.c | 3 +-
mm/vmstat.c | 14 +++++
23 files changed, 311 insertions(+), 10 deletions(-)
create mode 100644 include/linux/isolation.h
create mode 100644 kernel/isolation.c

--
2.1.2

2015-08-25 19:56:26

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 1/6] vmstat: provide a function to quiet down the diff processing

From: Christoph Lameter <[email protected]>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <[email protected]>
---
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 14 ++++++++++++++
2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7f7100..c013b8d8e434 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
extern void dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page,
static inline void refresh_cpu_vm_stats(int cpu) { }
static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..cf7d324f16e2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_struct *w)
}

/*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+ do {
+ if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+ cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+ } while (refresh_cpu_vm_stats());
+}
+
+/*
* Check if the diffs for a certain cpu indicate that
* an update is needed.
*/
--
2.1.2

2015-08-25 19:56:33

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 2/6] task_isolation: add initial support

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.
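
(For illustration, and not part of the patch itself: with the v6
renaming, the userspace call becomes

	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);

for a task already affinitized to a nohz_full core.)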

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument. The "task_isolation" state is then
indicated by setting a new task struct field, task_isolation_flag,
to the value passed by prctl(). When the _ENABLE bit is set for a
task, and it is returning to userspace on a nohz_full core, it calls
the new task_isolation_enter() routine to take additional actions
to help the task avoid being interrupted in the future.

Initially, there are only three actions taken. First, the
task calls lru_add_drain() to prevent being interrupted by a
subsequent lru_add_drain_all() call on another core. Then, it calls
quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
interrupt. Finally, the code checks for pending timer interrupts
and quiesces until they are no longer pending. As a result, system
calls (and page faults, etc.) can be inordinately slow. However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/process.c | 9 ++++++
include/linux/isolation.h | 24 +++++++++++++++
include/linux/sched.h | 3 ++
include/uapi/linux/prctl.h | 5 ++++
init/Kconfig | 20 +++++++++++++
kernel/Makefile | 1 +
kernel/context_tracking.c | 3 ++
kernel/isolation.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 8 +++++
9 files changed, 148 insertions(+)
create mode 100644 include/linux/isolation.h
create mode 100644 kernel/isolation.c

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..1d9bd2320a50 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
_cpu_idle();
}

+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_wait(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ _cpu_idle();
+ set_current_state(TASK_RUNNING);
+}
+#endif
+
/*
* Release a thread_info structure
*/
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..fd04011b1c1e
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,24 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+static inline bool task_isolation_enabled(void)
+{
+ return tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
+}
+
+extern void task_isolation_enter(void);
+extern void task_isolation_wait(void);
+#else
+static inline bool task_isolation_enabled(void) { return false; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..2acb618189d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1776,6 +1776,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+ unsigned int task_isolation_flags;
+#endif
/* CPU-specific state of this task */
struct thread_struct thread;
/*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..79da784fe17a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */

+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION 47
+#define PR_GET_TASK_ISOLATION 48
+# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index af09b4fb43d2..82d313cbd70f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -795,6 +795,26 @@ config RCU_EXPEDITE_BOOT

endmenu # "RCU Subsystem"

+config TASK_ISOLATION
+ bool "Provide hard CPU isolation from the kernel on demand"
+ depends on NO_HZ_FULL
+ help
+ Allow userspace processes to place themselves on nohz_full
+ cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+ themselves from the kernel. On return to userspace,
+ isolated tasks will first arrange that no future kernel
+ activity will interrupt the task while the task is running
+ in userspace. This "hard" isolation from the kernel is
+ required for userspace tasks with hard real-time requirements,
+ such as a 10 Gbit network driver running in userspace.
+
+ Without this option, but with NO_HZ_FULL enabled, the kernel
+ will make a good-faith, "soft" effort to shield a single userspace
+ process from interrupts, but makes no guarantees.
+
+ You should say "N" unless you are intending to run a
+ high-performance userspace driver or similar task.
+
config BUILD_BIN2C
bool
default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 43c4c920f30a..9ffb5c021767 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..c57c99f5c4d7 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/isolation.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
* on the tick.
*/
if (state == CONTEXT_USER) {
+ if (task_isolation_enabled())
+ task_isolation_enter();
trace_user_enter(0);
vtime_user_enter(current);
}
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..d4618cd9e23d
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,75 @@
+/*
+ * linux/kernel/isolation.c
+ *
+ * Implementation for task isolation.
+ *
+ * Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include "time/tick-sched.h"
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ *
+ * Note that it must be guaranteed for a particular architecture
+ * that if next_event is not KTIME_MAX, then a timer interrupt will
+ * occur; otherwise the sleep may never awaken.
+ */
+void __weak task_isolation_wait(void)
+{
+ cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In task_isolation mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two task_isolation processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void task_isolation_enter(void)
+{
+ struct clock_event_device *dev =
+ __this_cpu_read(tick_cpu_device.evtdev);
+ struct task_struct *task = current;
+ unsigned long start = jiffies;
+ bool warned = false;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ /* Quieten the vmstat worker so it won't interrupt us. */
+ quiet_vmstat();
+
+ while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+ if (!warned && (jiffies - start) >= (5 * HZ)) {
+ pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ warned = true;
+ }
+ if (should_resched())
+ schedule();
+ if (test_thread_flag(TIF_SIGPENDING))
+ break;
+ task_isolation_wait();
+ }
+ if (warned) {
+ pr_warn("%s/%d: cpu %d: task_isolation task unblocked after %ld seconds\n",
+ task->comm, task->pid, smp_processor_id(),
+ (jiffies - start) / HZ);
+ dump_stack();
+ }
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..c7024be2d79b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_TASK_ISOLATION
+ case PR_SET_TASK_ISOLATION:
+ me->task_isolation_flags = arg2;
+ break;
+ case PR_GET_TASK_ISOLATION:
+ error = me->task_isolation_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
--
2.1.2

2015-08-25 19:56:36

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves. In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies. Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.
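
As an illustrative sketch (not part of this patch; run_isolated_loop()
is a hypothetical application function that makes no syscalls and
touches only pre-faulted memory), the intended flow looks like:

#include <sys/prctl.h>

static void run_strict_section(void)
{
	/* Enter strict mode: any syscall or fault is now a violation. */
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT);

	run_isolated_loop();

	/* prctl() itself is exempt, so we can leave strict mode again. */
	prctl(PR_SET_TASK_ISOLATION, 0);
}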

This change adds the syscall-detection hooks only for x86, arm64,
and tile.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/arm64/kernel/ptrace.c | 5 +++++
arch/tile/kernel/ptrace.c | 5 ++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/isolation.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++
8 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..e3d83a12f3cf 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
#include <linux/regset.h>
#include <linux/tracehook.h>
#include <linux/elf.h>
+#include <linux/isolation.h>

#include <asm/compat.h>
#include <asm/debug-monitors.h>
@@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,

asmlinkage int syscall_trace_enter(struct pt_regs *regs)
{
+ /* Ensure we report task_isolation violations in all circumstances. */
+ if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
+ task_isolation_syscall(regs->syscallno);
+
/* Do the secure computing check first; failures should be fast. */
if (secure_computing() == -1)
return -1;
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..c327cb918a44 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (task_isolation_strict())
+ task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..2f9ce9466daf 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (task_isolation_strict())
+ task_isolation_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..e0ac0228fea1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/isolation.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);

@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (task_isolation_strict())
+ task_isolation_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index fd04011b1c1e..27a4469831c1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void)
}

extern void task_isolation_enter(void);
+extern void task_isolation_syscall(int nr);
+extern void task_isolation_exception(void);
extern void task_isolation_wait(void);
#else
static inline bool task_isolation_enabled(void) { return false; }
static inline void task_isolation_enter(void) { }
+static inline void task_isolation_syscall(int nr) { }
+static inline void task_isolation_exception(void) { }
#endif

+static inline bool task_isolation_strict(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->task_isolation_flags &
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 79da784fe17a..e16e13911e8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_TASK_ISOLATION 47
#define PR_GET_TASK_ISOLATION 48
# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+# define PR_TASK_ISOLATION_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index c57c99f5c4d7..17a71f7b66b8 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
context_tracking_recursion_exit();
out_irq_restore:
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/isolation.c b/kernel/isolation.c
index d4618cd9e23d..a89a6e9adfb4 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -10,6 +10,7 @@
#include <linux/swap.h>
#include <linux/vmstat.h>
#include <linux/isolation.h>
+#include <asm/unistd.h>
#include "time/tick-sched.h"

/*
@@ -73,3 +74,40 @@ void task_isolation_enter(void)
dump_stack();
}
}
+
+static void kill_task_isolation_strict_task(void)
+{
+ dump_stack();
+ current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void task_isolation_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_task_isolation_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(void)
+{
+ pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_task_isolation_strict_task();
+}
--
2.1.2

2015-08-25 19:57:29

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 4/6] task_isolation: provide strict mode configurable signal

Allow userspace to override the default SIGKILL delivered
when a task_isolation process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
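
For example (a sketch, not part of the patch; the function names are
illustrative and the PR_ constants come from the patched
<uapi/linux/prctl.h>), a task could ask for SIGUSR1 and use si_code
to tell syscalls from other kernel entries:

#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>

static void strict_handler(int sig, siginfo_t *info, void *uc)
{
	/* Per this patch, si_code is 1 when a syscall caused the
	 * violation and 0 for any other synchronous kernel entry
	 * (e.g. a page fault). The kernel has already cleared
	 * PR_TASK_ISOLATION_ENABLE by the time we get here. */
}

static void enter_strict_with_sigusr1(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = strict_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGUSR1, &sa, NULL);

	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
}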

Signed-off-by: Chris Metcalf <[email protected]>
---
include/uapi/linux/prctl.h | 2 ++
kernel/isolation.c | 17 +++++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index e16e13911e8a..2a4ddc890e22 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
#define PR_GET_TASK_ISOLATION 48
# define PR_TASK_ISOLATION_ENABLE (1 << 0)
# define PR_TASK_ISOLATION_STRICT (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index a89a6e9adfb4..b776aa632c8f 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -75,11 +75,20 @@ void task_isolation_enter(void)
}
}

-static void kill_task_isolation_strict_task(void)
+static void kill_task_isolation_strict_task(int is_syscall)
{
+ siginfo_t info = {};
+ int sig;
+
dump_stack();
current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
- send_sig(SIGKILL, current, 1);
+
+ sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
+ if (sig == 0)
+ sig = SIGKILL;
+ info.si_signo = sig;
+ info.si_code = is_syscall;
+ send_sig_info(sig, &info, current);
}

/*
@@ -98,7 +107,7 @@ void task_isolation_syscall(int syscall)

pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
current->comm, current->pid, syscall);
- kill_task_isolation_strict_task();
+ kill_task_isolation_strict_task(1);
}

/*
@@ -109,5 +118,5 @@ void task_isolation_exception(void)
{
pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
current->comm, current->pid);
- kill_task_isolation_strict_task();
+ kill_task_isolation_strict_task(0);
}
--
2.1.2

2015-08-25 19:56:44

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 5/6] task_isolation: add debug boot flag

The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode. Such processes should get no
interrupts from the kernel, and if they do, when this boot flag is
specified a kernel stack dump on the console is generated.

It's possible to use ftrace to simply detect whether a task_isolation
core has unexpectedly entered the kernel. But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code what remote core and context
is preparing to deliver an interrupt to a task_isolation core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
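
For example (the nohz_full cpu list here is illustrative), the flag
is simply appended to the kernel command line:

	nohz_full=1-7 task_isolation_debug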

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/kernel-parameters.txt | 7 +++++++
arch/tile/mm/homecache.c | 5 ++++-
include/linux/isolation.h | 2 ++
kernel/irq_work.c | 5 ++++-
kernel/sched/core.c | 21 +++++++++++++++++++++
kernel/signal.c | 5 +++++
kernel/smp.c | 4 ++++
kernel/softirq.c | 7 +++++++
8 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f0459cd7b..934f172eb140 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3595,6 +3595,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.

+ task_isolation_debug [KNL]
+ In kernels built with CONFIG_TASK_ISOLATION and booted
+ in nohz_full= mode, this setting will generate console
+ backtraces when the kernel is about to interrupt a
+ task that has requested PR_TASK_ISOLATION_ENABLE
+ and is running on a nohz_full core.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..a79325113105 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/isolation.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
- for_each_cpu(cpu, &mask)
+ for_each_cpu(cpu, &mask) {
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+ task_isolation_debug(cpu);
+ }
}

/*
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 27a4469831c1..9f1747331a36 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -18,11 +18,13 @@ extern void task_isolation_enter(void);
extern void task_isolation_syscall(int nr);
extern void task_isolation_exception(void);
extern void task_isolation_wait(void);
+extern void task_isolation_debug(int cpu);
#else
static inline bool task_isolation_enabled(void) { return false; }
static inline void task_isolation_enter(void) { }
static inline void task_isolation_syscall(int nr) { }
static inline void task_isolation_exception(void) { }
+static inline void task_isolation_debug(int cpu) { }
#endif

static inline bool task_isolation_strict(void)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..745c2ea6a4e4 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/smp.h>
+#include <linux/isolation.h>
#include <asm/processor.h>


@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ task_isolation_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10081..0c4e4eba69b1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
#include <linux/binfmts.h>
#include <linux/context_tracking.h>
#include <linux/compiler.h>
+#include <linux/isolation.h>

#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -745,6 +746,26 @@ bool sched_can_stop_tick(void)
}
#endif /* CONFIG_NO_HZ_FULL */

+#ifdef CONFIG_TASK_ISOLATION
+/* Enable debugging of any interrupts of task_isolation cores. */
+static int task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+ task_isolation_debug_flag = true;
+ return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug(int cpu)
+{
+ if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu) &&
+ (cpu_curr(cpu)->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) {
+ pr_err("Interrupt detected for task_isolation cpu %d\n", cpu);
+ dump_stack();
+ }
+}
+#endif
+
void sched_avg_update(struct rq *rq)
{
s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 836df8dac6cc..60e15e835b9e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_TASK_ISOLATION
+ /* If the task is being killed, don't complain about task_isolation. */
+ if (state & TASK_WAKEKILL)
+ t->task_isolation_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..b0bddff2693d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/isolation.h>

#include "smpboot.h"

@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
+ task_isolation_debug(cpu);
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ for_each_cpu(cpu, cfd->cpumask)
+ task_isolation_debug(cpu);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..ed762fec7265 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,8 +24,10 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
+#include <linux/context_tracking.h>
#include <linux/tick.h>
#include <linux/irq.h>
+#include <linux/isolation.h>

#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>
@@ -335,6 +337,11 @@ void irq_enter(void)
_local_bh_enable();
}

+ if (context_tracking_cpu_is_enabled() &&
+ context_tracking_in_user() &&
+ !in_interrupt())
+ task_isolation_debug(smp_processor_id());
+
__irq_enter();
}

--
2.1.2

2015-08-25 19:56:59

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6 6/6] nohz: task_isolation: allow tick to be fully disabled

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on
running completely tickless, so don't bound the time_delta for such
processes. In addition, due to the way such processes quiesce by
waiting for the timer tick to stop prior to returning to userspace,
without this commit it won't be possible to use the task_isolation
mode at all.

Removing the 1-second cap was previously discussed (see link
below) and Thomas Gleixner observed that vruntime, load balancing
data, load accounting, and other things might be impacted.
Frederic Weisbecker similarly observed that allowing the tick to
be indefinitely deferred just meant that no one would ever fix the
underlying bugs. However, it's at least true that the mode proposed
in this patch can only be enabled on a nohz_full core by a process
requesting task_isolation mode, which may limit how important it is
to keep the scheduler data exactly correct, for example.

Paul McKenney observed that if we provide a mode where the 1Hz
fallback timer is removed, this will create an environment where new
code that relies on that tick will get punished, and we won't
silently forgive such assumptions, so it may also be worth it from
that perspective.

Finally, it's worth observing that the tile architecture has been
using similar code for its Zero-Overhead Linux for many years
(starting in 2008) and customers are very enthusiastic about the
resulting bare-metal performance on cores that are available to
run full Linux semantics on demand (crash, logging, shutdown, etc).
So these semantics are very useful if we can convince ourselves
that doing this is safe.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/time/tick-sched.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429e98c6..be296499b753 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/isolation.h>

#include <asm/irq_regs.h>

@@ -652,7 +653,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,

#ifdef CONFIG_NO_HZ_FULL
/* Limit the tick delta to the maximum scheduler deferment */
- if (!ts->inidle)
+ if (!ts->inidle && !task_isolation_enabled())
delta = min(delta, scheduler_tick_max_deferment());
#endif

--
2.1.2

2015-08-26 10:36:57

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode

Hi Chris,

On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote:
> With task_isolation mode, the task is in principle guaranteed not to
> be interrupted by the kernel, but only if it behaves. In particular,
> if it enters the kernel via system call, page fault, or any of a
> number of other synchronous traps, it may be unexpectedly exposed
> to long latencies. Add a simple flag that puts the process into
> a state where any such kernel entry is fatal.
>
> To allow the state to be entered and exited, we ignore the prctl()
> syscall so that we can clear the bit again later, and we ignore
> exit/exit_group to allow exiting the task without a pointless signal
> killing you as you try to do so.
>
> This change adds the syscall-detection hooks only for x86, arm64,
> and tile.
>
> The signature of context_tracking_exit() changes to report whether
> we, in fact, are exiting back to user space, so that we can track
> user exceptions properly separately from other kernel entries.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> arch/arm64/kernel/ptrace.c | 5 +++++
> arch/tile/kernel/ptrace.c | 5 ++++-
> arch/x86/kernel/ptrace.c | 2 ++
> include/linux/context_tracking.h | 11 ++++++++---
> include/linux/isolation.h | 16 ++++++++++++++++
> include/uapi/linux/prctl.h | 1 +
> kernel/context_tracking.c | 9 ++++++---
> kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++
> 8 files changed, 80 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index d882b833dbdb..e3d83a12f3cf 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -37,6 +37,7 @@
> #include <linux/regset.h>
> #include <linux/tracehook.h>
> #include <linux/elf.h>
> +#include <linux/isolation.h>
>
> #include <asm/compat.h>
> #include <asm/debug-monitors.h>
> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>
> asmlinkage int syscall_trace_enter(struct pt_regs *regs)
> {
> + /* Ensure we report task_isolation violations in all circumstances. */
> + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())

This is going to force us to check TIF_NOHZ on the syscall slowpath even
when CONFIG_TASK_ISOLATION=n.

> + task_isolation_syscall(regs->syscallno);
> +
> /* Do the secure computing check first; failures should be fast. */

Here we have the usual priority problems with all the subsystems that
hook into the syscall path. If a prctl is later rewritten to a different
syscall, do you care about catching it? Either way, the comment about
doing secure computing "first" needs fixing.

Cheers,

Will

2015-08-26 15:10:52

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode

On 08/26/2015 06:36 AM, Will Deacon wrote:
> Hi Chris,
>
> On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote:
>> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
>> index d882b833dbdb..e3d83a12f3cf 100644
>> --- a/arch/arm64/kernel/ptrace.c
>> +++ b/arch/arm64/kernel/ptrace.c
>> @@ -37,6 +37,7 @@
>> #include <linux/regset.h>
>> #include <linux/tracehook.h>
>> #include <linux/elf.h>
>> +#include <linux/isolation.h>
>>
>> #include <asm/compat.h>
>> #include <asm/debug-monitors.h>
>> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>>
>> asmlinkage int syscall_trace_enter(struct pt_regs *regs)
>> {
>> + /* Ensure we report task_isolation violations in all circumstances. */
>> + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
> This is going to force us to check TIF_NOHZ on the syscall slowpath even
> when CONFIG_TASK_ISOLATION=n.

Yes, good catch. I was thinking the "&& false" would suppress the TIF
test but I forgot that test_bit() takes a volatile argument, so it gets
evaluated even though the result isn't actually used.

But I don't want to just reorder the two tests, because when isolation
is enabled, testing TIF_NOHZ first is better. I think probably the right
solution is just to put an #ifdef CONFIG_TASK_ISOLATION around that
test, even though that is a little crufty. The alternative is to provide
a task_isolation_configured() macro that just returns true or false, and
make it a three-part "&&" test with that new macro first, but
that seems a little crufty as well. Do you have a preference?
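
For concreteness, a sketch of that second alternative (not in the
posted patch; the macro name is the one suggested above):

static inline bool task_isolation_configured(void)
{
	return IS_ENABLED(CONFIG_TASK_ISOLATION);
}

/* ...and then in syscall_trace_enter(): */
if (task_isolation_configured() && test_thread_flag(TIF_NOHZ) &&
    task_isolation_strict())
	task_isolation_syscall(regs->syscallno);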

>> + task_isolation_syscall(regs->syscallno);
>> +
>> /* Do the secure computing check first; failures should be fast. */
> Here we have the usual priority problems with all the subsystems that
> hook into the syscall path. If a prctl is later rewritten to a different
> syscall, do you care about catching it? Either way, the comment about
> doing secure computing "first" needs fixing.

I admit I am unclear on the utility of rewriting prctl. My instinct is that
we are trying to catch userspace invocations of prctl and allow them,
and fail almost everything else, so doing it pre-rewrite seems OK.

I'm not sure if it makes sense to catch it before or after the
secure computing check, though. On reflection maybe doing it
afterwards makes more sense - what do you think?

Thanks!

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-08-26 15:26:58

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v5 2/6] cpu_isolated: add initial support

On Wed, Aug 12, 2015 at 02:22:09PM -0400, Chris Metcalf wrote:
> On 08/12/2015 12:00 PM, Frederic Weisbecker wrote:
> >>+#ifdef CONFIG_CPU_ISOLATED
> >>+void cpu_isolated_wait(void)
> >>+{
> >>+ set_current_state(TASK_INTERRUPTIBLE);
> >>+ _cpu_idle();
> >>+ set_current_state(TASK_RUNNING);
> >>+}
> >I'm still uncomfortable with that. A wake up model could work?
>
> I don't know exactly what you have in mind. The theory is that
> at this point we're ready to return to user space and we're just
> waiting for a timer tick that is guaranteed to arrive, since there
> is something pending for the timer.

Hmm, ok I'm going to discuss that in the new version. One worry is that
it gets racy and we sleep there for ever.

>
> And, this is an arch-specific method anyway; the generic method
> is actually checking to see if a signal has been delivered,
> scheduling is needed, etc., each time around the loop, so if
> you're not sure your architecture will do the right thing, just
> don't provide a method that idles while waiting. For tilegx I'm
> sure it works correctly, so I'm OK providing that method.

Yes but we do busy waiting on all other archs then. And since we can wait
for a while there, it doesn't look sane.

> >>diff --git a/include/linux/sched.h b/include/linux/sched.h
> >>index 04b5ada460b4..0bb248385d88 100644
> >>--- a/include/linux/sched.h
> >>+++ b/include/linux/sched.h
> >>@@ -1776,6 +1776,9 @@ struct task_struct {
> >> unsigned long task_state_change;
> >> #endif
> >> int pagefault_disabled;
> >>+#ifdef CONFIG_CPU_ISOLATED
> >>+ unsigned int cpu_isolated_flags;
> >>+#endif
> >Can't we add a new flag to tsk->flags? There seem to be some values remaining.
>
> Yeah, I thought of that, but it seems like a pretty scarce resource,
> and I wasn't sure it was the right thing to do. Also, I'm not actually
> sure why the lowest two bits aren't apparently being used

Probably they were used but got removed.

> looks
> like PF_EXITING (0x4) is the first bit used. And there are only three
> more bits higher up in the word that are not assigned.

Which makes room for 5 :)

>
> Also, right now we are allowing users to customize the signal delivered
> for STRICT violation, and that signal value is stored in the
> cpu_isolated_flags word as well, so we really don't have room in
> tsk->flags for all of that anyway.

Yeah indeed, ok lets keep it that way for now.

Thanks.

2015-08-26 15:55:37

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v5 2/6] cpu_isolated: add initial support

On 08/26/2015 11:26 AM, Frederic Weisbecker wrote:
> On Wed, Aug 12, 2015 at 02:22:09PM -0400, Chris Metcalf wrote:
>> On 08/12/2015 12:00 PM, Frederic Weisbecker wrote:
>>>> +#ifdef CONFIG_CPU_ISOLATED
>>>> +void cpu_isolated_wait(void)
>>>> +{
>>>> + set_current_state(TASK_INTERRUPTIBLE);
>>>> + _cpu_idle();
>>>> + set_current_state(TASK_RUNNING);
>>>> +}
>>> I'm still uncomfortable with that. A wake up model could work?
>> I don't know exactly what you have in mind. The theory is that
>> at this point we're ready to return to user space and we're just
>> waiting for a timer tick that is guaranteed to arrive, since there
>> is something pending for the timer.
> Hmm, ok I'm going to discuss that in the new version. One worry is that
> it gets racy and we sleep there for ever.
>
>> And, this is an arch-specific method anyway; the generic method
>> is actually checking to see if a signal has been delivered,
>> scheduling is needed, etc., each time around the loop, so if
>> you're not sure your architecture will do the right thing, just
>> don't provide a method that idles while waiting. For tilegx I'm
>> sure it works correctly, so I'm OK providing that method.
> Yes but we do busy waiting on all other archs then. And since we can wait
> for a while there, it doesn't look sane.

We can wait for a while (potentially multiple ticks), which is
certainly a long time, but that's what the user asked for.

Since we're checking signals and scheduling in the busy loop,
we definitely won't get into some nasty unkillable state, which
would be the real worst-case.

I think the question is, could a process just get stuck there
somehow in the normal course of events, where there is a
future event on the tick_cpu_device, but no interrupt is
enabled that will eventually deal with it? This seems like it
would be a pretty fundamental timekeeping bug, so my
assumption here is that it can't happen, but maybe...?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-08-28 15:31:42

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v6.1 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves. In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies. Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.

This change adds the syscall-detection hooks only for x86, arm64,
and tile. For arm64 we use an explicit #ifdef CONFIG_TASK_ISOLATION
so we achieve both zero overhead for !TASK_ISOLATION and low
latency (testing TIF_NOHZ first) for TASK_ISOLATION.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <[email protected]>
---
This "v6.1" is just a tweak to the existing v6 series to reflect
Will Deacon's suggestions about the arm64 syscall entry code.
I've updated the git tree with this updated patch in the series.
A more disruptive change would be to capture the thread flags
up front like x86 and tile, which allows the test itself to be
optimized away if the task_isolation call becomes a no-op.
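
For reference, a rough sketch of that more disruptive variant (not
what this patch does; it mirrors the x86/tile pattern of loading the
thread flags into a local up front):

	unsigned long work = READ_ONCE(current_thread_info()->flags);

	/* 'work' would normally feed the other entry-work tests too;
	 * once it is a plain local, the whole test below can be
	 * optimized away when task_isolation_strict() is
	 * compile-time false. */
	if ((work & _TIF_NOHZ) && task_isolation_strict())
		task_isolation_syscall(regs->syscallno);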

arch/arm64/kernel/ptrace.c | 6 ++++++
arch/tile/kernel/ptrace.c | 5 ++++-
arch/x86/kernel/ptrace.c | 2 ++
include/linux/context_tracking.h | 11 ++++++++---
include/linux/isolation.h | 16 ++++++++++++++++
include/uapi/linux/prctl.h | 1 +
kernel/context_tracking.c | 9 ++++++---
kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++
8 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..5d4284445f70 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
#include <linux/regset.h>
#include <linux/tracehook.h>
#include <linux/elf.h>
+#include <linux/isolation.h>

#include <asm/compat.h>
#include <asm/debug-monitors.h>
@@ -1154,6 +1155,11 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs)
if (secure_computing() == -1)
return -1;

+#ifdef CONFIG_TASK_ISOLATION
+ if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
+ task_isolation_syscall(regs->syscallno);
+#endif
+
if (test_thread_flag(TIF_SYSCALL_TRACE))
tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);

diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..c327cb918a44 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
* If TIF_NOHZ is set, we are required to call user_exit() before
* doing anything that could touch RCU.
*/
- if (work & _TIF_NOHZ)
+ if (work & _TIF_NOHZ) {
user_exit();
+ if (task_isolation_strict())
+ task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]);
+ }

if (work & _TIF_SYSCALL_TRACE) {
if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..2f9ce9466daf 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
if (work & _TIF_NOHZ) {
user_exit();
work &= ~_TIF_NOHZ;
+ if (task_isolation_strict())
+ task_isolation_syscall(regs->orig_ax);
}

#ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..e0ac0228fea1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@

#include <linux/sched.h>
#include <linux/vtime.h>
+#include <linux/isolation.h>
#include <linux/context_tracking_state.h>
#include <asm/ptrace.h>

@@ -11,7 +12,7 @@
extern void context_tracking_cpu_set(int cpu);

extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);

@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
return 0;

prev_ctx = this_cpu_read(context_tracking.state);
- if (prev_ctx != CONTEXT_KERNEL)
- context_tracking_exit(prev_ctx);
+ if (prev_ctx != CONTEXT_KERNEL) {
+ if (context_tracking_exit(prev_ctx)) {
+ if (task_isolation_strict())
+ task_isolation_exception();
+ }
+ }

return prev_ctx;
}
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index fd04011b1c1e..27a4469831c1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void)
}

extern void task_isolation_enter(void);
+extern void task_isolation_syscall(int nr);
+extern void task_isolation_exception(void);
extern void task_isolation_wait(void);
#else
static inline bool task_isolation_enabled(void) { return false; }
static inline void task_isolation_enter(void) { }
+static inline void task_isolation_syscall(int nr) { }
+static inline void task_isolation_exception(void) { }
#endif

+static inline bool task_isolation_strict(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+ if (tick_nohz_full_cpu(smp_processor_id()) &&
+ (current->task_isolation_flags &
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT))
+ return true;
+#endif
+ return false;
+}
+
#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 79da784fe17a..e16e13911e8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
#define PR_SET_TASK_ISOLATION 47
#define PR_GET_TASK_ISOLATION 48
# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+# define PR_TASK_ISOLATION_STRICT (1 << 1)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index c57c99f5c4d7..17a71f7b66b8 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
* This call supports re-entrancy. This way it can be called from any exception
* handler without needing to know if we came from userspace or not.
*/
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
{
unsigned long flags;
+ bool from_user = false;

if (!context_tracking_is_enabled())
- return;
+ return false;

if (in_interrupt())
- return;
+ return false;

local_irq_save(flags);
if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
*/
rcu_user_exit();
if (state == CONTEXT_USER) {
+ from_user = true;
vtime_user_exit(current);
trace_user_exit(0);
}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
context_tracking_recursion_exit();
out_irq_restore:
local_irq_restore(flags);
+ return from_user;
}
NOKPROBE_SYMBOL(context_tracking_exit);
EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/isolation.c b/kernel/isolation.c
index d4618cd9e23d..a89a6e9adfb4 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -10,6 +10,7 @@
#include <linux/swap.h>
#include <linux/vmstat.h>
#include <linux/isolation.h>
+#include <asm/unistd.h>
#include "time/tick-sched.h"

/*
@@ -73,3 +74,40 @@ void task_isolation_enter(void)
dump_stack();
}
}
+
+static void kill_task_isolation_strict_task(void)
+{
+ dump_stack();
+ current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+ send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void task_isolation_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return;
+ }
+
+ pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
+ current->comm, current->pid, syscall);
+ kill_task_isolation_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(void)
+{
+ pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
+ current->comm, current->pid);
+ kill_task_isolation_strict_task();
+}
--
2.1.2

2015-08-28 19:23:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v6 4/6] task_isolation: provide strict mode configurable signal

On Tue, Aug 25, 2015 at 12:55 PM, Chris Metcalf <[email protected]> wrote:
> Allow userspace to override the default SIGKILL delivered
> when a task_isolation process in STRICT mode does a syscall
> or otherwise synchronously enters the kernel.
>
> In addition to being able to set the signal, we now also
> pass whether or not the interruption was from a syscall in
> the si_code field of the siginfo.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> include/uapi/linux/prctl.h | 2 ++
> kernel/isolation.c | 17 +++++++++++++----
> 2 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index e16e13911e8a..2a4ddc890e22 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -195,5 +195,7 @@ struct prctl_mm_map {
> #define PR_GET_TASK_ISOLATION 48
> # define PR_TASK_ISOLATION_ENABLE (1 << 0)
> # define PR_TASK_ISOLATION_STRICT (1 << 1)
> +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
>
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> index a89a6e9adfb4..b776aa632c8f 100644
> --- a/kernel/isolation.c
> +++ b/kernel/isolation.c
> @@ -75,11 +75,20 @@ void task_isolation_enter(void)
> }
> }
>
> -static void kill_task_isolation_strict_task(void)
> +static void kill_task_isolation_strict_task(int is_syscall)
> {
> + siginfo_t info = {};
> + int sig;
> +
> dump_stack();
> current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
> - send_sig(SIGKILL, current, 1);
> +
> + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
> + if (sig == 0)
> + sig = SIGKILL;
> + info.si_signo = sig;
> + info.si_code = is_syscall;
> + send_sig_info(sig, &info, current);

The stuff you're doing here is sufficiently nasty that I think you
should add something like:

rcu_lockdep_assert(rcu_is_watching(), "some message here");

Because as it stands this is just asking for trouble.

For the record, I am *extremely* unhappy with the state of the context
tracking hooks.

--Andy