Ingo,
Please pull the nohz/core branch (on top of tip:timers/core) that can be
found at:
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/core
HEAD: 60d0b4dc36259029e4cc8d030d8f59b33a119814
---
Summary of changes:
* Fix rare crashes due to context tracking recursion when faulting on
vmalloc'ed areas.
* Simplify the TIF_NOHZ propagation (less context switch overhead)
* Set nohz full CPUs as isolcpus. This way we ensure that nohz CPUs
remain undisturbed by globally affine tasks.
Thanks,
Frederic
---
Chris Metcalf (2):
nohz: Add tick_nohz_full_add_cpus_to() API
nohz: Set isolcpus when nohz_full is set
Frederic Weisbecker (2):
context_tracking: Protect against recursion
context_tracking: Inherit TIF_NOHZ through forks instead of context switches
include/linux/context_tracking.h | 10 -----
include/linux/context_tracking_state.h | 1 +
include/linux/sched.h | 3 ++
include/linux/tick.h | 7 ++++
kernel/context_tracking.c | 68 +++++++++++++++++++++++-----------
kernel/sched/core.c | 4 +-
6 files changed, 60 insertions(+), 33 deletions(-)
Context tracking recursion can happen when an exception triggers in the
middle of a call to a context tracking probe.
This special case can be caused by vmalloc faults. If an access to a
memory area allocated by vmalloc happens in the middle of
context_tracking_enter(), we may run into an endless fault loop because
the exception in turn calls context_tracking_enter() which faults on
the same vmalloc'ed memory, triggering an exception again, etc...
Some rare crashes have been reported, so let's protect against this with
a recursion counter.
Reported-by: Dave Jones <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
include/linux/context_tracking_state.h | 1 +
kernel/context_tracking.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 6b7b96a..678ecdf 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -12,6 +12,7 @@ struct context_tracking {
* may be further optimized using static keys.
*/
bool active;
+ int recursion;
enum ctx_state {
CONTEXT_KERNEL = 0,
CONTEXT_USER,
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1..b9e0b4f 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -38,6 +38,25 @@ void context_tracking_cpu_set(int cpu)
}
}
+static bool context_tracking_recursion_enter(void)
+{
+ int recursion;
+
+ recursion = __this_cpu_inc_return(context_tracking.recursion);
+ if (recursion == 1)
+ return true;
+
+ WARN_ONCE((recursion < 1), "Invalid context tracking recursion value %d\n", recursion);
+ __this_cpu_dec(context_tracking.recursion);
+
+ return false;
+}
+
+static void context_tracking_recursion_exit(void)
+{
+ __this_cpu_dec(context_tracking.recursion);
+}
+
/**
* context_tracking_enter - Inform the context tracking that the CPU is going
* enter user or guest space mode.
@@ -75,6 +94,11 @@ void context_tracking_enter(enum ctx_state state)
WARN_ON_ONCE(!current->mm);
local_irq_save(flags);
+ if (!context_tracking_recursion_enter()) {
+ local_irq_restore(flags);
+ return;
+ }
+
if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
/*
@@ -105,6 +129,7 @@ void context_tracking_enter(enum ctx_state state)
*/
__this_cpu_write(context_tracking.state, state);
}
+ context_tracking_recursion_exit();
local_irq_restore(flags);
}
NOKPROBE_SYMBOL(context_tracking_enter);
@@ -139,6 +164,10 @@ void context_tracking_exit(enum ctx_state state)
return;
local_irq_save(flags);
+ if (!context_tracking_recursion_enter()) {
+ local_irq_restore(flags);
+ return;
+ }
if (__this_cpu_read(context_tracking.state) == state) {
if (__this_cpu_read(context_tracking.active)) {
/*
@@ -153,6 +182,7 @@ void context_tracking_exit(enum ctx_state state)
}
__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
}
+ context_tracking_recursion_exit();
local_irq_restore(flags);
}
NOKPROBE_SYMBOL(context_tracking_exit);
--
2.1.4
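For illustration, here is a minimal user-space sketch of the
reentrancy-guard pattern the patch above introduces. This is not the
kernel code: the per-CPU context_tracking.recursion counter is
approximated by a plain static variable, and the vmalloc-fault reentry
is simulated by a direct recursive call.

	#include <stdbool.h>
	#include <stdio.h>

	/* Stand-in for the per-CPU context_tracking.recursion counter. */
	static int recursion;

	static bool tracking_recursion_enter(void)
	{
		if (++recursion == 1)
			return true;
		/* Nested entry: back out and tell the caller to bail. */
		recursion--;
		return false;
	}

	static void tracking_recursion_exit(void)
	{
		recursion--;
	}

	static void probe(int depth)
	{
		if (!tracking_recursion_enter()) {
			printf("recursion blocked at depth %d\n", depth);
			return;
		}
		if (depth < 1)
			probe(depth + 1); /* simulate an exception re-entering the probe */
		tracking_recursion_exit();
	}

	int main(void)
	{
		probe(0); /* prints "recursion blocked at depth 1" */
		return 0;
	}

The nested call backs out instead of recursing, which is exactly what
breaks the endless vmalloc-fault loop in the kernel version.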
TIF_NOHZ is used by context_tracking to force the syscall slow path on
every task in order to track userspace roundtrips. As such, it must be
set on all running tasks.
It's currently explicitly inherited through context switches. There is
no need to do it on this fast path, though. The flag could simply be
set once and for all on all tasks, whether they are running or not.
Let's do this by setting the flag on the init task at early boot and
letting it propagate through fork inheritance.
While at it, mark context_tracking_cpu_set() as init code; we only need
it at early boot time.
Suggested-by: Oleg Nesterov <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
include/linux/context_tracking.h | 10 ---------
include/linux/sched.h | 3 +++
kernel/context_tracking.c | 44 +++++++++++++++++-----------------------
kernel/sched/core.c | 1 -
4 files changed, 22 insertions(+), 36 deletions(-)
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838..b96bd29 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -14,8 +14,6 @@ extern void context_tracking_enter(enum ctx_state state);
extern void context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);
-extern void __context_tracking_task_switch(struct task_struct *prev,
- struct task_struct *next);
static inline void user_enter(void)
{
@@ -51,19 +49,11 @@ static inline void exception_exit(enum ctx_state prev_ctx)
}
}
-static inline void context_tracking_task_switch(struct task_struct *prev,
- struct task_struct *next)
-{
- if (context_tracking_is_enabled())
- __context_tracking_task_switch(prev, next);
-}
#else
static inline void user_enter(void) { }
static inline void user_exit(void) { }
static inline enum ctx_state exception_enter(void) { return 0; }
static inline void exception_exit(enum ctx_state prev_ctx) { }
-static inline void context_tracking_task_switch(struct task_struct *prev,
- struct task_struct *next) { }
#endif /* !CONFIG_CONTEXT_TRACKING */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae4..f5d4f9e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2540,6 +2540,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
}
#endif
+#define tasklist_empty() \
+ list_empty(&init_task.tasks)
+
#define next_task(p) \
list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index b9e0b4f..a3aaca7 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -30,14 +30,6 @@ EXPORT_SYMBOL_GPL(context_tracking_enabled);
DEFINE_PER_CPU(struct context_tracking, context_tracking);
EXPORT_SYMBOL_GPL(context_tracking);
-void context_tracking_cpu_set(int cpu)
-{
- if (!per_cpu(context_tracking.active, cpu)) {
- per_cpu(context_tracking.active, cpu) = true;
- static_key_slow_inc(&context_tracking_enabled);
- }
-}
-
static bool context_tracking_recursion_enter(void)
{
int recursion;
@@ -194,24 +186,26 @@ void context_tracking_user_exit(void)
}
NOKPROBE_SYMBOL(context_tracking_user_exit);
-/**
- * __context_tracking_task_switch - context switch the syscall callbacks
- * @prev: the task that is being switched out
- * @next: the task that is being switched in
- *
- * The context tracking uses the syscall slow path to implement its user-kernel
- * boundaries probes on syscalls. This way it doesn't impact the syscall fast
- * path on CPUs that don't do context tracking.
- *
- * But we need to clear the flag on the previous task because it may later
- * migrate to some CPU that doesn't do the context tracking. As such the TIF
- * flag may not be desired there.
- */
-void __context_tracking_task_switch(struct task_struct *prev,
- struct task_struct *next)
+void __init context_tracking_cpu_set(int cpu)
{
- clear_tsk_thread_flag(prev, TIF_NOHZ);
- set_tsk_thread_flag(next, TIF_NOHZ);
+ static __initdata bool initialized = false;
+
+ if (!per_cpu(context_tracking.active, cpu)) {
+ per_cpu(context_tracking.active, cpu) = true;
+ static_key_slow_inc(&context_tracking_enabled);
+ }
+
+ if (initialized)
+ return;
+
+ /*
+ * Set TIF_NOHZ to init/0 and let it propagate to all tasks through fork
+ * This assumes that init is the only task at this early boot stage.
+ */
+ set_tsk_thread_flag(&init_task, TIF_NOHZ);
+ WARN_ON_ONCE(!tasklist_empty());
+
+ initialized = true;
}
#ifdef CONFIG_CONTEXT_TRACKING_FORCE
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8a6196..6f149f8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2338,7 +2338,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
*/
spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
- context_tracking_task_switch(prev, next);
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
barrier();
--
2.1.4
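The design point of the patch above - set a flag once on the initial
task and let fork() copy it to every descendant - can be sketched in
user space like this (a hypothetical analogue, not the kernel code):

	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	/* Analogue of TIF_NOHZ: set once in the initial process and copied
	 * to every child by fork(), so no per-context-switch bookkeeping
	 * is needed. */
	static int nohz_flag;

	int main(void)
	{
		nohz_flag = 1; /* like setting TIF_NOHZ on init_task at early boot */

		for (int i = 0; i < 2; i++) {
			pid_t pid = fork();

			if (pid == 0) {
				/* Each child sees its inherited copy of the flag. */
				printf("child %d sees nohz_flag = %d\n", i, nohz_flag);
				_exit(0);
			}
		}
		while (wait(NULL) > 0)
			;
		return 0;
	}

This is also why the patch warns if any task other than init exists
when the flag is set: the scheme only works if every future task
descends from a task that already carries the flag.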
From: Chris Metcalf <[email protected]>
This API is useful for modifying a cpumask that indicates some special
nohz-type functionality, so that the nohz cores are automatically added
to that set.
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Chris Metcalf <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Ingo Molnar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/tick.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da5..4191b56 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -134,6 +134,12 @@ static inline bool tick_nohz_full_cpu(int cpu)
return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}
+static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
+{
+ if (tick_nohz_full_enabled())
+ cpumask_or(mask, mask, tick_nohz_full_mask);
+}
+
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
@@ -142,6 +148,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
+static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
static inline void __tick_nohz_full_check(void) { }
static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick(void) { }
--
2.1.4
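For the calling convention, here is a hedged sketch of a hypothetical
in-kernel user (special_mask and special_mask_init() are invented for
illustration; the real first caller, sched_init_smp(), appears in the
next patch). Since the helper is a no-op unless nohz_full= was given on
the boot command line, callers need no conditional of their own:

	#include <linux/cpumask.h>
	#include <linux/init.h>
	#include <linux/slab.h>
	#include <linux/tick.h>

	/* Hypothetical mask of CPUs a subsystem wants to treat specially. */
	static cpumask_var_t special_mask;

	static int __init special_mask_init(void)
	{
		if (!alloc_cpumask_var(&special_mask, GFP_KERNEL))
			return -ENOMEM;

		cpumask_clear(special_mask);
		/* Fold in the nohz_full CPUs; no-op when nohz_full is unused. */
		tick_nohz_full_add_cpus_to(special_mask);
		return 0;
	}
	early_initcall(special_mask_init);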
From: Chris Metcalf <[email protected]>
nohz_full is only useful with isolcpus also set, since otherwise the
scheduler has to run periodically to try to determine whether to steal
work from other cores.
Accordingly, when booting with nohz_full=xxx on the command line, we
should act as if isolcpus=xxx was also set, and set (or extend) the
isolcpus set to include the nohz_full cpus.
Acked-by: Mike Galbraith <[email protected]> ["thumbs up!"]
Acked-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Chris Metcalf <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6f149f8..e95b4d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7042,6 +7042,9 @@ void __init sched_init_smp(void)
alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
+ /* nohz_full won't take effect without isolating the cpus. */
+ tick_nohz_full_add_cpus_to(cpu_isolated_map);
+
sched_init_numa();
/*
--
2.1.4
* Frederic Weisbecker <[email protected]> wrote:
> @@ -75,6 +94,11 @@ void context_tracking_enter(enum ctx_state state)
> WARN_ON_ONCE(!current->mm);
>
> local_irq_save(flags);
> + if (!context_tracking_recursion_enter()) {
> + local_irq_restore(flags);
> + return;
> + }
> +
> if ( __this_cpu_read(context_tracking.state) != state) {
> if (__this_cpu_read(context_tracking.active)) {
> /*
> @@ -105,6 +129,7 @@ void context_tracking_enter(enum ctx_state state)
> */
> __this_cpu_write(context_tracking.state, state);
> }
> + context_tracking_recursion_exit();
> local_irq_restore(flags);
> }
So why not add an 'out_irq_restore:' label and use goto instead of
duplicating the return path in the recursion check?
> NOKPROBE_SYMBOL(context_tracking_enter);
> @@ -139,6 +164,10 @@ void context_tracking_exit(enum ctx_state state)
> return;
>
> local_irq_save(flags);
> + if (!context_tracking_recursion_enter()) {
> + local_irq_restore(flags);
> + return;
Ditto.
No need to resend, I fixed this up in the patch.
Thanks,
Ingo
Commit-ID: aed5ed47724f6a7453fa62e3c90f3cee93edbfe3
Gitweb: http://git.kernel.org/tip/aed5ed47724f6a7453fa62e3c90f3cee93edbfe3
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Wed, 6 May 2015 18:04:23 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 7 May 2015 12:02:50 +0200
context_tracking: Protect against recursion
Context tracking recursion can happen when an exception triggers
in the middle of a call to a context tracking probe.
This special case can be caused by vmalloc faults. If an access
to a memory area allocated by vmalloc happens in the middle of
context_tracking_enter(), we may run into an endless fault loop
because the exception in turn calls context_tracking_enter()
which faults on the same vmalloc'ed memory, triggering an
exception again, etc...
Some rare crashes have been reported, so let's protect against
this with a recursion counter.
Reported-by: Dave Jones <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/context_tracking_state.h | 1 +
kernel/context_tracking.c | 29 +++++++++++++++++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 6b7b96a..678ecdf 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -12,6 +12,7 @@ struct context_tracking {
* may be further optimized using static keys.
*/
bool active;
+ int recursion;
enum ctx_state {
CONTEXT_KERNEL = 0,
CONTEXT_USER,
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1..5b11a10 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -38,6 +38,25 @@ void context_tracking_cpu_set(int cpu)
}
}
+static bool context_tracking_recursion_enter(void)
+{
+ int recursion;
+
+ recursion = __this_cpu_inc_return(context_tracking.recursion);
+ if (recursion == 1)
+ return true;
+
+ WARN_ONCE((recursion < 1), "Invalid context tracking recursion value %d\n", recursion);
+ __this_cpu_dec(context_tracking.recursion);
+
+ return false;
+}
+
+static void context_tracking_recursion_exit(void)
+{
+ __this_cpu_dec(context_tracking.recursion);
+}
+
/**
* context_tracking_enter - Inform the context tracking that the CPU is going
* enter user or guest space mode.
@@ -75,6 +94,9 @@ void context_tracking_enter(enum ctx_state state)
WARN_ON_ONCE(!current->mm);
local_irq_save(flags);
+ if (!context_tracking_recursion_enter())
+ goto out_irq_restore;
+
if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
/*
@@ -105,6 +127,8 @@ void context_tracking_enter(enum ctx_state state)
*/
__this_cpu_write(context_tracking.state, state);
}
+ context_tracking_recursion_exit();
+out_irq_restore:
local_irq_restore(flags);
}
NOKPROBE_SYMBOL(context_tracking_enter);
@@ -139,6 +163,9 @@ void context_tracking_exit(enum ctx_state state)
return;
local_irq_save(flags);
+ if (!context_tracking_recursion_enter())
+ goto out_irq_restore;
+
if (__this_cpu_read(context_tracking.state) == state) {
if (__this_cpu_read(context_tracking.active)) {
/*
@@ -153,6 +180,8 @@ void context_tracking_exit(enum ctx_state state)
}
__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
}
+ context_tracking_recursion_exit();
+out_irq_restore:
local_irq_restore(flags);
}
NOKPROBE_SYMBOL(context_tracking_exit);
Commit-ID: fafe870f31212a72f3c2d74e7b90e4ef39e83ee1
Gitweb: http://git.kernel.org/tip/fafe870f31212a72f3c2d74e7b90e4ef39e83ee1
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Wed, 6 May 2015 18:04:24 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 7 May 2015 12:02:51 +0200
context_tracking: Inherit TIF_NOHZ through forks instead of context switches
TIF_NOHZ is used by context_tracking to force the syscall slow path
on every task in order to track userspace roundtrips. As such,
it must be set on all running tasks.
It's currently explicitly inherited through context switches.
There is no need to do it in this fast path, though. The flag
could simply be set once and for all on all tasks, whether they are
running or not.
Let's do this by setting the flag for the init task at early boot,
and letting it propagate through fork inheritance.
While at it, mark context_tracking_cpu_set() as init code; we
only need it at early boot time.
Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/context_tracking.h | 10 ---------
include/linux/sched.h | 3 +++
kernel/context_tracking.c | 44 +++++++++++++++++-----------------------
kernel/sched/core.c | 1 -
4 files changed, 22 insertions(+), 36 deletions(-)
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838..b96bd29 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -14,8 +14,6 @@ extern void context_tracking_enter(enum ctx_state state);
extern void context_tracking_exit(enum ctx_state state);
extern void context_tracking_user_enter(void);
extern void context_tracking_user_exit(void);
-extern void __context_tracking_task_switch(struct task_struct *prev,
- struct task_struct *next);
static inline void user_enter(void)
{
@@ -51,19 +49,11 @@ static inline void exception_exit(enum ctx_state prev_ctx)
}
}
-static inline void context_tracking_task_switch(struct task_struct *prev,
- struct task_struct *next)
-{
- if (context_tracking_is_enabled())
- __context_tracking_task_switch(prev, next);
-}
#else
static inline void user_enter(void) { }
static inline void user_exit(void) { }
static inline enum ctx_state exception_enter(void) { return 0; }
static inline void exception_exit(enum ctx_state prev_ctx) { }
-static inline void context_tracking_task_switch(struct task_struct *prev,
- struct task_struct *next) { }
#endif /* !CONFIG_CONTEXT_TRACKING */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a2e61..185a750 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2532,6 +2532,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
}
#endif
+#define tasklist_empty() \
+ list_empty(&init_task.tasks)
+
#define next_task(p) \
list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 5b11a10..0a495ab 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -30,14 +30,6 @@ EXPORT_SYMBOL_GPL(context_tracking_enabled);
DEFINE_PER_CPU(struct context_tracking, context_tracking);
EXPORT_SYMBOL_GPL(context_tracking);
-void context_tracking_cpu_set(int cpu)
-{
- if (!per_cpu(context_tracking.active, cpu)) {
- per_cpu(context_tracking.active, cpu) = true;
- static_key_slow_inc(&context_tracking_enabled);
- }
-}
-
static bool context_tracking_recursion_enter(void)
{
int recursion;
@@ -193,24 +185,26 @@ void context_tracking_user_exit(void)
}
NOKPROBE_SYMBOL(context_tracking_user_exit);
-/**
- * __context_tracking_task_switch - context switch the syscall callbacks
- * @prev: the task that is being switched out
- * @next: the task that is being switched in
- *
- * The context tracking uses the syscall slow path to implement its user-kernel
- * boundaries probes on syscalls. This way it doesn't impact the syscall fast
- * path on CPUs that don't do context tracking.
- *
- * But we need to clear the flag on the previous task because it may later
- * migrate to some CPU that doesn't do the context tracking. As such the TIF
- * flag may not be desired there.
- */
-void __context_tracking_task_switch(struct task_struct *prev,
- struct task_struct *next)
+void __init context_tracking_cpu_set(int cpu)
{
- clear_tsk_thread_flag(prev, TIF_NOHZ);
- set_tsk_thread_flag(next, TIF_NOHZ);
+ static __initdata bool initialized = false;
+
+ if (!per_cpu(context_tracking.active, cpu)) {
+ per_cpu(context_tracking.active, cpu) = true;
+ static_key_slow_inc(&context_tracking_enabled);
+ }
+
+ if (initialized)
+ return;
+
+ /*
+ * Set TIF_NOHZ to init/0 and let it propagate to all tasks through fork
+ * This assumes that init is the only task at this early boot stage.
+ */
+ set_tsk_thread_flag(&init_task, TIF_NOHZ);
+ WARN_ON_ONCE(!tasklist_empty());
+
+ initialized = true;
}
#ifdef CONFIG_CONTEXT_TRACKING_FORCE
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe22f75..54f032c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2332,7 +2332,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
*/
spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
- context_tracking_task_switch(prev, next);
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
barrier();
Commit-ID: 83dedea8a07fb4bf91863764b15c1c4ec00330f9
Gitweb: http://git.kernel.org/tip/83dedea8a07fb4bf91863764b15c1c4ec00330f9
Author: Chris Metcalf <[email protected]>
AuthorDate: Wed, 6 May 2015 18:04:25 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 7 May 2015 12:02:51 +0200
nohz: Add tick_nohz_full_add_cpus_to() API
This API is useful for modifying a cpumask that indicates some special
nohz-type functionality, so that the nohz cores are automatically
added to that set.
Signed-off-by: Chris Metcalf <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/tick.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da5..4191b56 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -134,6 +134,12 @@ static inline bool tick_nohz_full_cpu(int cpu)
return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}
+static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
+{
+ if (tick_nohz_full_enabled())
+ cpumask_or(mask, mask, tick_nohz_full_mask);
+}
+
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
@@ -142,6 +148,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
+static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
static inline void __tick_nohz_full_check(void) { }
static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick(void) { }
Commit-ID: 8cb9764fc88b41db11f251e8b2a0d006578b7eb4
Gitweb: http://git.kernel.org/tip/8cb9764fc88b41db11f251e8b2a0d006578b7eb4
Author: Chris Metcalf <[email protected]>
AuthorDate: Wed, 6 May 2015 18:04:26 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 7 May 2015 12:02:51 +0200
nohz: Set isolcpus when nohz_full is set
nohz_full is only useful with isolcpus also set, since
otherwise the scheduler has to run periodically to try to
determine whether to steal work from other cores.
Accordingly, when booting with nohz_full=xxx on the command
line, we should act as if isolcpus=xxx was also set, and set
(or extend) the isolcpus set to include the nohz_full cpus.
Signed-off-by: Chris Metcalf <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mike Galbraith <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54f032c..b8f4876 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7036,6 +7036,9 @@ void __init sched_init_smp(void)
alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
+ /* nohz_full won't take effect without isolating the cpus. */
+ tick_nohz_full_add_cpus_to(cpu_isolated_map);
+
sched_init_numa();
/*
On Thu, May 07, 2015 at 11:58:32AM +0200, Ingo Molnar wrote:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
> > @@ -75,6 +94,11 @@ void context_tracking_enter(enum ctx_state state)
> > WARN_ON_ONCE(!current->mm);
> >
> > local_irq_save(flags);
> > + if (!context_tracking_recursion_enter()) {
> > + local_irq_restore(flags);
> > + return;
> > + }
> > +
> > if ( __this_cpu_read(context_tracking.state) != state) {
> > if (__this_cpu_read(context_tracking.active)) {
> > /*
> > @@ -105,6 +129,7 @@ void context_tracking_enter(enum ctx_state state)
> > */
> > __this_cpu_write(context_tracking.state, state);
> > }
> > + context_tracking_recursion_exit();
>
> > local_irq_restore(flags);
> > }
>
> So why not add an 'out_irq_restore:' label and use goto instead of
> duplicating the return path in the recursion check?
Ah yeah. Sometimes people prefer that we just repeat the rollback code if
it's only one line. But I'm fine with the label as well.
>
> > NOKPROBE_SYMBOL(context_tracking_enter);
> > @@ -139,6 +164,10 @@ void context_tracking_exit(enum ctx_state state)
> > return;
> >
> > local_irq_save(flags);
> > + if (!context_tracking_recursion_enter()) {
> > + local_irq_restore(flags);
> > + return;
>
> Ditto.
>
> No need to resend, I fixed this up in the patch.
>
> Thanks,
>
> Ingo
Ah thanks a lot Ingo!
On 05/06/2015 12:04 PM, Frederic Weisbecker wrote:
> From: Chris Metcalf <[email protected]>
>
> nohz_full is only useful with isolcpus also set, since otherwise the
> scheduler has to run periodically to try to determine whether to steal
> work from other cores.
>
> Accordingly, when booting with nohz_full=xxx on the command line, we
> should act as if isolcpus=xxx was also set, and set (or extend) the
> isolcpus set to include the nohz_full cpus.
>
> Acked-by: Mike Galbraith <[email protected]> ["thumbs up!"]
> Acked-by: Rik van Riel <[email protected]>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Chris Metcalf <[email protected]>
> Cc: Peter Zijlstra (Intel) <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
Hi folks,
I noticed a regression in my testing a few days ago and bisected it down to
this patch. I was seeing frequent soft lockups/RCU lockups and the load of the
testing VMs would go beyond 400-500 (on 32 VCPU guests) - note I'm booting them
with nohz_full=1-27.
This patch sort of explains the behaviour I was seeing: most of the cores
are no longer being used by the scheduler, and the remaining cores can't deal
with the load imposed on them, which results in "lockups" that are really just
the CPUs being unable to keep up.
I always thought that nohz_full without isolcpus meant that the cores would
be available to the scheduler, but that it wouldn't interfere if there was one
task running on them. It seems that this patch changed that behaviour.
Did I misunderstand that?
Thanks,
Sasha
On Sat, 2015-05-16 at 15:39 -0400, Sasha Levin wrote:
> On 05/06/2015 12:04 PM, Frederic Weisbecker wrote:
> > From: Chris Metcalf <[email protected]>
> >
> > nohz_full is only useful with isolcpus also set, since otherwise the
> > scheduler has to run periodically to try to determine whether to steal
> > work from other cores.
> >
> > Accordingly, when booting with nohz_full=xxx on the command line, we
> > should act as if isolcpus=xxx was also set, and set (or extend) the
> > isolcpus set to include the nohz_full cpus.
> >
> > Acked-by: Mike Galbraith <[email protected]> ["thumbs up!"]
> > Acked-by: Rik van Riel <[email protected]>
> > Acked-by: Peter Zijlstra (Intel) <[email protected]>
> > Signed-off-by: Chris Metcalf <[email protected]>
> > Cc: Peter Zijlstra (Intel) <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Martin Schwidefsky <[email protected]>
> > Cc: Mike Galbraith <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
>
> Hi folks,
>
> I've noticed a regression in my testing a few days ago and bisected it down to
> this patch. I was seeing frequent soft lockups/RCU lockups and the load of the
> testing VMs would go beyond 400-500 (on 32 VCPU guests) - note I'm booting them
> with nohz_full=1-27.
>
> This patch sort of explains the behaviour I was seeing now: most of the cores
> are no longer being used by the scheduler, and the remaining cores can't deal
> with the load imposed on them which results in "lockups" which are really just
> the CPUs being unable to keep up.
>
> I always thought that nohz_full without isolcpus meant that the cores would
> be available to the scheduler, but it won't interfere if there is one task
> running on them. It seems that this patch changed that behaviour.
>
> Did I misunderstand that?
Yeah, tying the nohz_full set to isolcpus sets up an initial condition that
you have to tear down with cpusets if you want those cpus returned to
the general purpose pool. I had considered the kernel setting initial
state to be an optimization, but have reconsidered.
You may have misunderstood somewhat, though: if you do not explicitly
isolate the nohz_full set from the scheduler via either isolcpus or
cpusets, there is no exclusion from load balancing, and the scheduler may
place additional tasks on a nohz_full cpu to resolve load imbalance.
Given that kernel-initiated association to isolcpus, a user turning
NO_HZ_FULL_ALL on had better not have much generic load to manage. If
he/she does not have CPUSETS enabled, or should Rik's patch rendering
isolcpus immutable be merged, he/she would quickly discover that the
generic box has been transformed into an utterly inflexible specialist,
incapable of performing mundane tasks ever again.
-Mike
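For concreteness, here is a hedged sketch of the cpuset tear-down Mike
describes: create a cpuset spanning all CPUs and attach a task to it,
so that task (and its children) regain access to the formerly isolated
nohz_full CPUs. It assumes a v1 cpuset hierarchy mounted at
/sys/fs/cgroup/cpuset and the 0-27 CPU range from Sasha's example;
mount point and file names vary by setup.

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/stat.h>
	#include <sys/types.h>

	#define CS "/sys/fs/cgroup/cpuset/general"	/* assumed mount/path */

	static void write_file(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			exit(EXIT_FAILURE);
		}
		fprintf(f, "%s\n", val);
		fclose(f);
	}

	int main(void)
	{
		mkdir(CS, 0755);
		write_file(CS "/cpuset.cpus", "0-27");	/* all CPUs, incl. nohz_full ones */
		write_file(CS "/cpuset.mems", "0");	/* required before attaching tasks */
		write_file(CS "/tasks", "0");		/* "0" attaches the current task */
		return 0;
	}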
On 05/17/2015 01:30 AM, Mike Galbraith wrote:
> Given that kernel initiated association to isolcpus, a user turning
> NO_HZ_FULL_ALL on had better not have much generic load to manage. If
> he/she does not have CPUSETS enabled, or should Rik's patch rendering
> isolcpus immutable be merged,
My patch does not aim to make isolcpus immutable; it aims to make
isolcpus resistant to system management tools (like libvirt)
automatically undoing isolcpus the instant a cpuset with the default
cpus (inherited from the root group) is created.
--
All rights reversed
On Sun, 2015-05-17 at 22:17 -0400, Rik van Riel wrote:
> On 05/17/2015 01:30 AM, Mike Galbraith wrote:
>
> > Given that kernel initiated association to isolcpus, a user turning
> > NO_HZ_FULL_ALL on had better not have much generic load to manage. If
> > he/she does not have CPUSETS enabled, or should Rik's patch rendering
> > isolcpus immutable be merged,
>
> My patch does not aim to make isolcpus immutable, it aims to make
> isolcpus resistent to system management tools (like libvirt)
> automatically undoing isolcpus the instant a cpuset with the default
> cpus (inherited from the root group) is created.
Aim or not, if cpusets is the sole modifier, it'll render isolcpus
immutable, no? Cpusets could grow an override to the override I
suppose, to regain control of the resource it thinks it manages.
-Mike
On 05/17/2015 11:29 PM, Mike Galbraith wrote:
> On Sun, 2015-05-17 at 22:17 -0400, Rik van Riel wrote:
>> On 05/17/2015 01:30 AM, Mike Galbraith wrote:
>>
>>> Given that kernel initiated association to isolcpus, a user turning
>>> NO_HZ_FULL_ALL on had better not have much generic load to manage. If
>>> he/she does not have CPUSETS enabled, or should Rik's patch rendering
>>> isolcpus immutable be merged,
>>
>> My patch does not aim to make isolcpus immutable, it aims to make
>> isolcpus resistent to system management tools (like libvirt)
>> automatically undoing isolcpus the instant a cpuset with the default
>> cpus (inherited from the root group) is created.
>
> Aim or not, if cpusets is the sole modifier, it'll render isolcpus
> immutable, no? Cpusets could grow an override to the override I
> suppose, to regain control of the resource it thinks it manages.
The other way would be to make /sys/devices/system/cpu/isolcpus
(which Greg KH promised he would queue up for 4.2) writable.
I am all for making isolcpus changeable at run time. I just want
to prevent it being changed accidentally at run time, by system
tools that want to place their workloads in cpusets.
--
All rights reversed
On Mon, 2015-05-18 at 10:07 -0400, Rik van Riel wrote:
> On 05/17/2015 11:29 PM, Mike Galbraith wrote:
> > On Sun, 2015-05-17 at 22:17 -0400, Rik van Riel wrote:
> >> On 05/17/2015 01:30 AM, Mike Galbraith wrote:
> >>
> >>> Given that kernel initiated association to isolcpus, a user turning
> >>> NO_HZ_FULL_ALL on had better not have much generic load to manage. If
> >>> he/she does not have CPUSETS enabled, or should Rik's patch rendering
> >>> isolcpus immutable be merged,
> >>
> >> My patch does not aim to make isolcpus immutable, it aims to make
> >> isolcpus resistent to system management tools (like libvirt)
> >> automatically undoing isolcpus the instant a cpuset with the default
> >> cpus (inherited from the root group) is created.
> >
> > Aim or not, if cpusets is the sole modifier, it'll render isolcpus
> > immutable, no? Cpusets could grow an override to the override I
> > suppose, to regain control of the resource it thinks it manages.
>
> The other way would be to make /sys/devices/system/cpu/isolcpus
> (which Greg KH promised he would queue up for 4.2) writable.
Anything is better than overriding the override. That's easy, but the
changelog would have to be farmed out to a talented used car salesman.
-Mike
On 05/18/2015 10:22 AM, Mike Galbraith wrote:
> On Mon, 2015-05-18 at 10:07 -0400, Rik van Riel wrote:
>> On 05/17/2015 11:29 PM, Mike Galbraith wrote:
>>> On Sun, 2015-05-17 at 22:17 -0400, Rik van Riel wrote:
>>>> On 05/17/2015 01:30 AM, Mike Galbraith wrote:
>>>>
>>>>> Given that kernel initiated association to isolcpus, a user turning
>>>>> NO_HZ_FULL_ALL on had better not have much generic load to manage. If
>>>>> he/she does not have CPUSETS enabled, or should Rik's patch rendering
>>>>> isolcpus immutable be merged,
>>>>
>>>> My patch does not aim to make isolcpus immutable, it aims to make
>>>> isolcpus resistent to system management tools (like libvirt)
>>>> automatically undoing isolcpus the instant a cpuset with the default
>>>> cpus (inherited from the root group) is created.
>>>
>>> Aim or not, if cpusets is the sole modifier, it'll render isolcpus
>>> immutable, no? Cpusets could grow an override to the override I
>>> suppose, to regain control of the resource it thinks it manages.
>>
>> The other way would be to make /sys/devices/system/cpu/isolcpus
>> (which Greg KH promised he would queue up for 4.2) writable.
>
> Anything is better than override the override. That's easy, but the
> changelog would have to be farmed out to a talented used car salesman.
For real time KVM, it is desirable to run the VCPU threads on
isolated CPUs as real time tasks.
Meanwhile, the emulator threads can run as normal tasks anywhere
on the system.
This means the /machine cpuset, which all guests live under,
needs to contain all the system's CPUs, both isolated and
non-isolated, otherwise it will be unable to have the VCPU
threads on isolated CPUs and the emulator threads on other
CPUs.
--
All rights reversed
On Mon, 2015-05-18 at 10:52 -0400, Rik van Riel wrote:
> For real time KVM, it is desirable to run the VCPU threads on
> isolated CPUs as real time tasks.
>
> Meanwhile, the emulator threads can run as normal tasks anywhere
> on the system.
>
> This means the /machine cpuset, which all guests live under,
> needs to contain all the system's CPUs, both isolated and
> non-isolated, otherwise it will be unable to have the VCPU
> threads on isolated CPUs and the emulator threads on other
> CPUs.
Ok, I know next to nothing about virtualization only because I failed at
maintaining the desired absolute zero, but...
Why do you use isolcpus for this? This KVM setup sounds very similar to
what I do for real realtime. I use a shield script to set up an
exclusive isolated set and a system set and move the mundane world into
the system set. Inject realtime proggy into its exclusive home, it
creates/pins worker bees, done. Why the cpuset isolcpus hybrid?
When you say emulator threads can run anywhere, it sounds like that
includes the isolated set, but even so, nothing prevents injecting
whatever (the kernel permits) into whichever set.
-Mike
Hi,
On Sun, May 17, 2015 at 07:30:50AM +0200, Mike Galbraith wrote:
> Yeah, tying nohz_full set to isolcpus set up an initial condition that
> you have to tear down with cpusets if you want those cpus returned to
> the general purpose pool. I had considered the kernel setting initial
> state to be an optimization, but have reconsidered.
> Given that kernel initiated association to isolcpus, a user turning
> NO_HZ_FULL_ALL on had better not have much generic load to manage. If
On a quad-core desktop system with NO_HZ_FULL_ALL, hackbench took 3x the
time compared to without this patch; except for the boot cpu, every other
cpu was jobless. Though NO_HZ_FULL_ALL (afaik) is not meant for generic
load, it was working fine, but not after this - it is now like a single
core system.
Regards
Afzal
> he/she does not have CPUSETS enabled, or should Rik's patch rendering
> isolcpus immutable be merged, he/she would quickly discover that the
> generic box has been transformed into an utterly inflexible specialist
> incapable of performing mundane tasks ever again.
On Thu, May 21, 2015 at 02:08:09AM +0530, Afzal Mohammed wrote:
> Hi,
>
> On Sun, May 17, 2015 at 07:30:50AM +0200, Mike Galbraith wrote:
>
> > Yeah, tying nohz_full set to isolcpus set up an initial condition that
> > you have to tear down with cpusets if you want those cpus returned to
> > the general purpose pool. I had considered the kernel setting initial
> > state to be an optimization, but have reconsidered.
>
> > Given that kernel initiated association to isolcpus, a user turning
> > NO_HZ_FULL_ALL on had better not have much generic load to manage. If
>
> On a quad-core desktop system with NO_HZ_FULL_ALL, hackbench took 3x
> time as compared to w/o this patch, except boot cpu every one else
> jobless. Though NO_HZ_FULL_ALL (afaik) is not meant for generic load,
> it was working fine, but not after this - it is now like a single core
> system.
I have to ask... What is your use case? What are you wanting NO_HZ_FULL
to do for you?
Thanx, Paul
> Regards
> Afzal
>
> > he/she does not have CPUSETS enabled, or should Rik's patch rendering
> > isolcpus immutable be merged, he/she would quickly discover that the
> > generic box has been transformed into an utterly inflexible specialist
> > incapable of performing mundane tasks ever again.
>
Hi,
On Wed, May 20, 2015 at 02:00:26PM -0700, Paul E. McKenney wrote:
> > > Given that kernel initiated association to isolcpus, a user turning
> > > NO_HZ_FULL_ALL on had better not have much generic load to manage. If
> >
> > On a quad-core desktop system with NO_HZ_FULL_ALL, hackbench took 3x
> > time as compared to w/o this patch, except boot cpu every one else
> > jobless. Though NO_HZ_FULL_ALL (afaik) is not meant for generic load,
> > it was working fine, but not after this - it is now like a single core
> > system.
>
> I have to ask... What is your use case? What are you wanting NO_HZ_FULL
> to do for you?
I was just playing NO_HZ_FULL with tip-[sched,timers]-* changes.
Thought that shutting down ticks as much as possible would be
beneficial to normal loads too, though it has been mentioned to be used
for specialized loads. It seems the drawbacks weigh against normal
loads, but I hadn't observed any so far (on a laptop with normal
activities) before this change.
Regards
Afzal
On Thu, May 21, 2015 at 05:42:46PM +0530, Afzal Mohammed wrote:
> Hi,
>
> On Wed, May 20, 2015 at 02:00:26PM -0700, Paul E. McKenney wrote:
>
> > > > Given that kernel initiated association to isolcpus, a user turning
> > > > NO_HZ_FULL_ALL on had better not have much generic load to manage. If
> > >
> > > On a quad-core desktop system with NO_HZ_FULL_ALL, hackbench took 3x
> > > time as compared to w/o this patch, except boot cpu every one else
> > > jobless. Though NO_HZ_FULL_ALL (afaik) is not meant for generic load,
> > > it was working fine, but not after this - it is now like a single core
> > > system.
> >
> > I have to ask... What is your use case? What are you wanting NO_HZ_FULL
> > to do for you?
>
> I was just playing NO_HZ_FULL with tip-[sched,timers]-* changes.
>
> Thought that shutting down ticks as much as possible would be
> beneficial to normal loads too, though it has been mentioned to be used
> for specialized loads. Seems like drawbacks due to it weigh against
> normal loads, but haven't so far observed any (on a laptop with normal
> activities) before this change.
Indeed, NO_HZ_FULL is special purpose. You normally would select
NO_HZ_FULL_ALL only on a system intended for heavy compute without
normal-workload distractions or for some real-time systems. For mixed
workloads, you would build with NO_HZ_FULL (but not NO_HZ_FULL_ALL) and
use the boot parameters to select which CPUs are to be running the
specialized portion of the workload.
And you would of course need to leave enough CPUs running normally to
handle the non-specialized portion of the workload.
This sort of thing has traditionally required specialized kernels,
so the cool thing here is that we can make Linux do it. Though, as
you noticed, careful configuration is still required.
Seem reasonable?
Thanx, Paul
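For a mixed workload on, say, an 8-CPU box, the configuration Paul
describes might look like this hypothetical boot line (the CPU numbers
are an assumption for illustration), keeping CPUs 0-1 for the
non-specialized housekeeping work:

	nohz_full=2-7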
On Thu, May 21, 2015 at 05:57:59AM -0700, Paul E. McKenney wrote:
> On Thu, May 21, 2015 at 05:42:46PM +0530, Afzal Mohammed wrote:
> > Hi,
> >
> > On Wed, May 20, 2015 at 02:00:26PM -0700, Paul E. McKenney wrote:
> >
> > > > > Given that kernel initiated association to isolcpus, a user turning
> > > > > NO_HZ_FULL_ALL on had better not have much generic load to manage. If
> > > >
> > > > On a quad-core desktop system with NO_HZ_FULL_ALL, hackbench took 3x
> > > > time as compared to w/o this patch, except boot cpu every one else
> > > > jobless. Though NO_HZ_FULL_ALL (afaik) is not meant for generic load,
> > > > it was working fine, but not after this - it is now like a single core
> > > > system.
> > >
> > > I have to ask... What is your use case? What are you wanting NO_HZ_FULL
> > > to do for you?
> >
> > I was just playing NO_HZ_FULL with tip-[sched,timers]-* changes.
> >
> > Thought that shutting down ticks as much as possible would be
> > beneficial to normal loads too, though it has been mentioned to be used
> > for specialized loads. Seems like drawbacks due to it weigh against
> > normal loads, but haven't so far observed any (on a laptop with normal
> > activities) before this change.
>
> Indeed, NO_HZ_FULL is special purpose. You normally would select
> NO_HZ_FULL_ALL only on a system intended for heavy compute without
> normal-workload distractions or for some real-time systems. For mixed
> workloads, you would build with NO_HZ_FULL (but not NO_HZ_FULL_ALL) and
> use the boot parameters to select which CPUs are to be running the
> specialized portion of the workload.
>
> And you would of course need to lead enough CPUs running normally to
> handle the non-specialized portion of the workload.
>
> This sort of thing has traditionally required specialized kernels,
> so the cool thing here is that we can make Linux do it. Though, as
> you noticed, careful configuration is still required.
>
> Seem reasonable?
That said, if he saw a big performance regression after applying these patches,
then there is likely a problem in the patchset. Well, it could be due to that mode
which loops on full dynticks before resuming to userspace. Indeed, when that is
enabled, I expect real throughput issues on workloads doing lots of kernel <->
userspace roundtrips. We just need to make sure this thing only works when requested.
Anyway, I need to look at the patchset.
On Thu, May 21, 2015 at 03:06:23PM +0200, Frederic Weisbecker wrote:
> On Thu, May 21, 2015 at 05:57:59AM -0700, Paul E. McKenney wrote:
> > On Thu, May 21, 2015 at 05:42:46PM +0530, Afzal Mohammed wrote:
> > > Hi,
> > >
> > > On Wed, May 20, 2015 at 02:00:26PM -0700, Paul E. McKenney wrote:
> > >
> > > > > > Given that kernel initiated association to isolcpus, a user turning
> > > > > > NO_HZ_FULL_ALL on had better not have much generic load to manage. If
> > > > >
> > > > > On a quad-core desktop system with NO_HZ_FULL_ALL, hackbench took 3x
> > > > > time as compared to w/o this patch, except boot cpu every one else
> > > > > jobless. Though NO_HZ_FULL_ALL (afaik) is not meant for generic load,
> > > > > it was working fine, but not after this - it is now like a single core
> > > > > system.
> > > >
> > > > I have to ask... What is your use case? What are you wanting NO_HZ_FULL
> > > > to do for you?
> > >
> > > I was just playing NO_HZ_FULL with tip-[sched,timers]-* changes.
> > >
> > > Thought that shutting down ticks as much as possible would be
> > > beneficial to normal loads too, though it has been mentioned to be used
> > > for specialized loads. Seems like drawbacks due to it weigh against
> > > normal loads, but haven't so far observed any (on a laptop with normal
> > > activities) before this change.
> >
> > Indeed, NO_HZ_FULL is special purpose. You normally would select
> > NO_HZ_FULL_ALL only on a system intended for heavy compute without
> > normal-workload distractions or for some real-time systems. For mixed
> > workloads, you would build with NO_HZ_FULL (but not NO_HZ_FULL_ALL) and
> > use the boot parameters to select which CPUs are to be running the
> > specialized portion of the workload.
> >
> > And you would of course need to lead enough CPUs running normally to
> > handle the non-specialized portion of the workload.
> >
> > This sort of thing has traditionally required specialized kernels,
> > so the cool thing here is that we can make Linux do it. Though, as
> > you noticed, careful configuration is still required.
> >
> > Seem reasonable?
>
> That said if he saw a big performance regression after applying these patches,
> then there is likely a problem in the patchset. Well it could be due to that mode
> which loops on full dynticks before resuming to userspace. Indeed when that is
> enabled, I expect real throughput issues on workloads doing lots of kernel <->
> userspace roundtrips. We just need to make sure this thing only works when requested.
>
> Anyway, I need to look at the patchset.
Fair enough!
Thanx, Paul
Hi,
On Thu, May 21, 2015 at 03:06:23PM +0200, Frederic Weisbecker wrote:
> On Thu, May 21, 2015 at 05:57:59AM -0700, Paul E. McKenney wrote:
> > Indeed, NO_HZ_FULL is special purpose. You normally would select
> > NO_HZ_FULL_ALL only on a system intended for heavy compute without
> > normal-workload distractions or for some real-time systems. For mixed
> > workloads, you would build with NO_HZ_FULL (but not NO_HZ_FULL_ALL) and
> > use the boot parameters to select which CPUs are to be running the
> > specialized portion of the workload.
> >
> > And you would of course need to lead enough CPUs running normally to
> > handle the non-specialized portion of the workload.
> >
> > This sort of thing has traditionally required specialized kernels,
> > so the cool thing here is that we can make Linux do it. Though, as
> > you noticed, careful configuration is still required.
> >
> > Seem reasonable?
Yes, thanks, some dots got connected :)
> That said if he saw a big performance regression after applying these patches,
> then there is likely a problem in the patchset. Well it could be due to that mode
> which loops on full dynticks before resuming to userspace. Indeed when that is
> enabled, I expect real throughput issues on workloads doing lots of kernel <->
> userspace roundtrips. We just need to make sure this thing only works when requested.
With this change (& having NO_HZ_FULL_ALL), hackbench was being served
only by the boot cpu, while w/o this change all 8 (this is a quad
core HT processor) were being used - observation based on 'top'.
Regards
Afzal
On Thu, May 21, 2015 at 06:59:05PM +0530, Afzal Mohammed wrote:
> Hi,
>
> On Thu, May 21, 2015 at 03:06:23PM +0200, Frederic Weisbecker wrote:
> > On Thu, May 21, 2015 at 05:57:59AM -0700, Paul E. McKenney wrote:
>
> > > Indeed, NO_HZ_FULL is special purpose. You normally would select
> > > NO_HZ_FULL_ALL only on a system intended for heavy compute without
> > > normal-workload distractions or for some real-time systems. For mixed
> > > workloads, you would build with NO_HZ_FULL (but not NO_HZ_FULL_ALL) and
> > > use the boot parameters to select which CPUs are to be running the
> > > specialized portion of the workload.
> > >
> > > And you would of course need to lead enough CPUs running normally to
> > > handle the non-specialized portion of the workload.
> > >
> > > This sort of thing has traditionally required specialized kernels,
> > > so the cool thing here is that we can make Linux do it. Though, as
> > > you noticed, careful configuration is still required.
> > >
> > > Seem reasonable?
>
> Yes, thanks, some dots got connected :)
>
> > That said if he saw a big performance regression after applying these patches,
> > then there is likely a problem in the patchset. Well it could be due to that mode
> > which loops on full dynticks before resuming to userspace. Indeed when that is
> > enabled, I expect real throughput issues on workloads doing lots of kernel <->
> > userspace roundtrips. We just need to make sure this thing only works when requested.
>
> With this change (& having NO_HZ_FULL_ALL), hackbench was being served
> only by the boot cpu, while w/o this change, all 8 (this is a quad
> core HT processor) was being used - observation based on 'top'.
Good to know! And it will be interesting to see what Frederic decides
based on his review of the patchset.
Thanx, Paul
On Thu, May 21, 2015 at 07:14:09AM -0700, Paul E. McKenney wrote:
> On Thu, May 21, 2015 at 06:59:05PM +0530, Afzal Mohammed wrote:
> > Hi,
> >
> > On Thu, May 21, 2015 at 03:06:23PM +0200, Frederic Weisbecker wrote:
> > > On Thu, May 21, 2015 at 05:57:59AM -0700, Paul E. McKenney wrote:
> >
> > > > Indeed, NO_HZ_FULL is special purpose. You normally would select
> > > > NO_HZ_FULL_ALL only on a system intended for heavy compute without
> > > > normal-workload distractions or for some real-time systems. For mixed
> > > > workloads, you would build with NO_HZ_FULL (but not NO_HZ_FULL_ALL) and
> > > > use the boot parameters to select which CPUs are to be running the
> > > > specialized portion of the workload.
> > > >
> > > > And you would of course need to lead enough CPUs running normally to
> > > > handle the non-specialized portion of the workload.
> > > >
> > > > This sort of thing has traditionally required specialized kernels,
> > > > so the cool thing here is that we can make Linux do it. Though, as
> > > > you noticed, careful configuration is still required.
> > > >
> > > > Seem reasonable?
> >
> > Yes, thanks, some dots got connected :)
> >
> > > That said if he saw a big performance regression after applying these patches,
> > > then there is likely a problem in the patchset. Well it could be due to that mode
> > > which loops on full dynticks before resuming to userspace. Indeed when that is
> > > enabled, I expect real throughput issues on workloads doing lots of kernel <->
> > > userspace roundtrips. We just need to make sure this thing only works when requested.
> >
> > With this change (& having NO_HZ_FULL_ALL), hackbench was being served
> > only by the boot cpu, while w/o this change, all 8 (this is a quad
> > core HT processor) was being used - observation based on 'top'.
>
> Good to know! And it will be interesting to see what Frederic decides
> based on his review of the patchset.
Ah wait, I was confusing this patchset with the dataplane mode.
No, the performance loss seen by Afzal is actually expected. Now that we set
all nohz_full CPUs as isolcpus, when you start a process such as hackbench,
it will be affine to CPU 0 unless you explicitly tweak the affinity of
hackbench.
So yeah, this is a desired effect :-)
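In other words, with the nohz_full CPUs now in cpu_isolated_map, work
lands on them only when pinned there explicitly. A hedged sketch of
such a tweak for a test program (CPU 1 is an assumption, matching a
boot line like nohz_full=1-27):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(1, &set); /* a nohz_full (and now isolated) CPU */
		if (sched_setaffinity(0, sizeof(set), &set)) {
			perror("sched_setaffinity");
			return EXIT_FAILURE;
		}
		/* CPU-bound work placed here now runs on the isolated CPU. */
		return 0;
	}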
On Thu, 2015-05-21 at 15:06 +0200, Frederic Weisbecker wrote:
> On Thu, May 21, 2015 at 05:57:59AM -0700, Paul E. McKenney wrote:
> > On Thu, May 21, 2015 at 05:42:46PM +0530, Afzal Mohammed wrote:
> > > Hi,
> > >
> > > On Wed, May 20, 2015 at 02:00:26PM -0700, Paul E. McKenney wrote:
> > >
> > > > > > Given that kernel initiated association to isolcpus, a user turning
> > > > > > NO_HZ_FULL_ALL on had better not have much generic load to manage. If
> > > > >
> > > > > On a quad-core desktop system with NO_HZ_FULL_ALL, hackbench took 3x
> > > > > time as compared to w/o this patch, except boot cpu every one else
> > > > > jobless. Though NO_HZ_FULL_ALL (afaik) is not meant for generic load,
> > > > > it was working fine, but not after this - it is now like a single core
> > > > > system.
> > > >
> > > > I have to ask... What is your use case? What are you wanting NO_HZ_FULL
> > > > to do for you?
> > >
> > > I was just playing NO_HZ_FULL with tip-[sched,timers]-* changes.
> > >
> > > Thought that shutting down ticks as much as possible would be
> > > beneficial to normal loads too, though it has been mentioned to be used
> > > for specialized loads. Seems like drawbacks due to it weigh against
> > > normal loads, but haven't so far observed any (on a laptop with normal
> > > activities) before this change.
> >
> > Indeed, NO_HZ_FULL is special purpose. You normally would select
> > NO_HZ_FULL_ALL only on a system intended for heavy compute without
> > normal-workload distractions or for some real-time systems. For mixed
> > workloads, you would build with NO_HZ_FULL (but not NO_HZ_FULL_ALL) and
> > use the boot parameters to select which CPUs are to be running the
> > specialized portion of the workload.
> >
> > And you would of course need to lead enough CPUs running normally to
> > handle the non-specialized portion of the workload.
> >
> > This sort of thing has traditionally required specialized kernels,
> > so the cool thing here is that we can make Linux do it. Though, as
> > you noticed, careful configuration is still required.
> >
> > Seem reasonable?
>
> That said if he saw a big performance regression after applying these patches,
> then there is likely a problem in the patchset. Well it could be due to that mode
> which loops on full dynticks before resuming to userspace. Indeed when that is
> enabled, I expect real throughput issues on workloads doing lots of kernel <->
> userspace roundtrips. We just need to make sure this thing only works when requested.
>
> Anyway, I need to look at the patchset.
I think it's a mistake to make any rash assumption and/or mandate that
the user WILL use nohz_full CPUs immediately, or even at all for that
matter. nohz_full currently is nothing but a CPU attribute, period,
nothing more, nothing less. The overhead that comes with it is the
only thing that suggests that the set really really should be used
exclusively for isolated compute loads, and even then, they had better
be pure userspace due to the hefty border crossing overhead.
Let that overhead be seriously reduced, as per the Ingo/Andy discussion,
and presto, these things become very interesting as dynamic sets. You
can really really use them as you see fit, and on the fly; it doesn't
cost you an arm and a leg just to turn it on. The only restriction
being the static set declaration, that you have to manage until that too
goes away.
I see no reason to turn nohz_full into a rigid construct.
-Mike
On Thu, May 21, 2015 at 08:59:38PM +0200, Mike Galbraith wrote:
> I think it's a mistake to make any rash assumption and/or mandate that
> the user WILL use nohz_full CPUs immediately, or even at all for that
> matter. nohz_full currently is nothing but a CPU attribute, period,
> nothing more, nothing less.
That's what the nohz_full parameter is for. Now if you're referring to
changing the nohz_full toggle to a runtime interface such as sysfs or such,
I don't see that as a pressing need, especially considering the added
complexity. At least I haven't heard many requests for it.
> The overhead that comes with the it is the
> only thing that suggests that the set really really should be used
> exclusively for isolated compute loads, and even then, they had better
> be pure userspace due to the hefty border crossing overhead.
Yep.
>
> Let that overhear be seriously reduced, as per the Ingo/Andy discussion,
> and presto, the things become a very interesting to dynamic sets. You
> can really really use them as you see fit, and on the fly, it doesn't
> cost you an arm and a leg just to turn it on. The only restriction
> being the static set declaration, that you have to manage until that too
> goes away.
>
> I see no reason to turn nohz_full into a rigid construct.
I'm all for making nohz_full just about the tick and driving the rest of the
isolation features through other settings.
Ideally the isolation should be tuned by userspace; we have all the interfaces
available for that: process affinity, irq affinity, workqueue affinity (soon),
watchdog, etc... Then we'll also add a syscall or prctl to loop on hard isolation
(dataplane).
Unfortunately, people seem to prefer adding that complexity to the kernel and
letting it assume bold and inflexible pre-settings on its own.
On Fri, 2015-05-22 at 16:39 +0200, Frederic Weisbecker wrote:
> On Thu, May 21, 2015 at 08:59:38PM +0200, Mike Galbraith wrote:
> > I think it's a mistake to make any rash assumption and/or mandate that
> > the user WILL use nohz_full CPUs immediately, or even at all for that
> > matter. nohz_full currently is nothing but a CPU attribute, period,
> > nothing more, nothing less.
>
> That's what the nohz_full parameter is for. Now if you're referring to
> change the nohz_full toggle to a runtime interface such as sysfs or such,
> I don't see that's a pressing need, especially considering the added
> complexity. At least I haven't heard much requests for it.
I'm not advocating anything like that, I'm just standing on my soap box
advocating NOT restricting user choices. Cpusets work fine for me, and
I'd like them to keep on working. If I need to add any hooks, buttons,
or rubber tubing I'll do it there ;-)
-Mike
On 05/18/2015 10:30 PM, Mike Galbraith wrote:
> On Mon, 2015-05-18 at 10:52 -0400, Rik van Riel wrote:
>
>> For real time KVM, it is desirable to run the VCPU threads on
>> isolated CPUs as real time tasks.
>>
>> Meanwhile, the emulator threads can run as normal tasks anywhere
>> on the system.
>>
>> This means the /machine cpuset, which all guests live under,
>> needs to contain all the system's CPUs, both isolated and
>> non-isolated, otherwise it will be unable to have the VCPU
>> threads on isolated CPUs and the emulator threads on other
>> CPUs.
>
> Ok, I know next to nothing about virtualization only because I failed at
> maintaining the desired absolute zero, but...
>
> Why do you use isolcpus for this? This KVM setup sounds very similar to
> what I do for real realtime. I use a shield script to set up an
> exclusive isolated set and a system set and move the mundane world into
> the system set. Inject realtime proggy into its exclusive home, it
> creates/pins worker bees, done. Why the cpuset isolcpus hybrid?
Because libvirt doesn't know any better currently.
Ideally in the long run, we would use only cpusets, and not
boot with isolcpus at all.
Of course, things like irqbalance also key off isolcpus, to
avoid assigning irqs to the isolated cpus, so there are quite
a few puzzle pieces, and the best I can hope for in the short
term is to simply resolve the incompatibilities where one
piece of software undoes the settings required by another...
--
All rights reversed