Ingo,
Please pull the sched/0hz branch that can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
sched/0hz
HEAD: 9e932b2cc707209febd130978a5eb9f4a943a3f4
--
Now that scheduler_tick() has become resilient to the absence of
ticks, current->sched_class->task_tick() is the last piece that needs
at least a 1Hz tick to keep the scheduler stats alive.
This patchset adds a flag to the isolcpus boot option to offload the
residual 1Hz tick. This way the nohz_full CPUs no longer have any tick
(assuming nothing else requires it), as their residual 1Hz tick is
offloaded to the housekeepers.
For quick testing, say on CPUs 1-7:
"isolcpus=nohz_offload,domain,1-7"
Thanks,
Frederic
---
Frederic Weisbecker (5):
sched: Rename init_rq_hrtick to hrtick_rq_init
sched/isolation: Add scheduler tick offloading interface
nohz: Allow to check if remote CPU tick is stopped
sched/isolation: Residual 1Hz scheduler tick offload
sched/isolation: Document "nohz_offload" flag
Documentation/admin-guide/kernel-parameters.txt | 7 +-
include/linux/sched/isolation.h | 3 +-
include/linux/tick.h | 2 +
kernel/sched/core.c | 94 +++++++++++++++++++++++--
kernel/sched/isolation.c | 10 +++
kernel/sched/sched.h | 2 +
kernel/time/tick-sched.c | 7 ++
7 files changed, 117 insertions(+), 8 deletions(-)
Do that rename in order to normalize the hrtick namespace.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 644fa2e..d72d0e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -333,7 +333,7 @@ void hrtick_start(struct rq *rq, u64 delay)
}
#endif /* CONFIG_SMP */
-static void init_rq_hrtick(struct rq *rq)
+static void hrtick_rq_init(struct rq *rq)
{
#ifdef CONFIG_SMP
rq->hrtick_csd_pending = 0;
@@ -351,7 +351,7 @@ static inline void hrtick_clear(struct rq *rq)
{
}
-static inline void init_rq_hrtick(struct rq *rq)
+static inline void hrtick_rq_init(struct rq *rq)
{
}
#endif /* CONFIG_SCHED_HRTICK */
@@ -5955,7 +5955,7 @@ void __init sched_init(void)
rq->last_sched_tick = 0;
#endif
#endif /* CONFIG_SMP */
- init_rq_hrtick(rq);
+ hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
}
--
2.7.4
This check is racy but provides a good heuristic to determine whether
a CPU may need a remote tick or not.
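For instance, the remote tick handler added later in this series is
expected to use it roughly as follows (a sketch of the caller from the
"Residual 1Hz scheduler tick offload" patch, not part of this patch):

	if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
		/* CPU appears busy in full dynticks mode: tick it remotely */
	}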
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/tick.h | 2 ++
kernel/time/tick-sched.c | 7 +++++++
2 files changed, 9 insertions(+)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7cc3592..944c829 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -114,6 +114,7 @@ enum tick_dep_bits {
#ifdef CONFIG_NO_HZ_COMMON
extern bool tick_nohz_enabled;
extern int tick_nohz_tick_stopped(void);
+extern int tick_nohz_tick_stopped_cpu(int cpu);
extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
@@ -125,6 +126,7 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
#else /* !CONFIG_NO_HZ_COMMON */
#define tick_nohz_enabled (0)
static inline int tick_nohz_tick_stopped(void) { return 0; }
+static inline int tick_nohz_tick_stopped_cpu(int cpu) { return 0; }
static inline void tick_nohz_idle_enter(void) { }
static inline void tick_nohz_idle_exit(void) { }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f7cc7ab..97c4317 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -486,6 +486,13 @@ int tick_nohz_tick_stopped(void)
return __this_cpu_read(tick_cpu_sched.tick_stopped);
}
+int tick_nohz_tick_stopped_cpu(int cpu)
+{
+ struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);
+
+ return ts->tick_stopped;
+}
+
/**
* tick_nohz_update_jiffies - update jiffies when idle was interrupted
*
--
2.7.4
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
keep the scheduler stats alive. However, this residual tick is a burden
for bare-metal tasks that can't stand any interruptions at all, or want
to minimize them.
Adding the boot parameter "isolcpus=nohz_offload" will now outsource
these scheduler ticks to the global workqueue so that a housekeeping CPU
handles that tick remotely.
Note it's still up to the user to affine the global workqueues to the
housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or
domains isolation.
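For example (illustration only, the file takes a hexadecimal cpumask):
writing "1" to /sys/devices/virtual/workqueue/cpumask keeps the unbound
workqueues, and thus the offloaded ticks, on CPU 0.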
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/isolation.c | 4 +++
kernel/sched/sched.h | 2 ++
3 files changed, 91 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d72d0e9..b964890 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3052,9 +3052,14 @@ void scheduler_tick(void)
*/
u64 scheduler_tick_max_deferment(void)
{
- struct rq *rq = this_rq();
- unsigned long next, now = READ_ONCE(jiffies);
+ struct rq *rq;
+ unsigned long next, now;
+ if (!housekeeping_cpu(smp_processor_id(), HK_FLAG_TICK_SCHED))
+ return ktime_to_ns(KTIME_MAX);
+
+ rq = this_rq();
+ now = READ_ONCE(jiffies);
next = rq->last_sched_tick + HZ;
if (time_before_eq(next, now))
@@ -3062,7 +3067,82 @@ u64 scheduler_tick_max_deferment(void)
return jiffies_to_nsecs(next - now);
}
-#endif
+
+struct tick_work {
+ int cpu;
+ struct delayed_work work;
+};
+
+static struct tick_work __percpu *tick_work_cpu;
+
+static void sched_tick_remote(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+ struct tick_work *twork = container_of(dwork, struct tick_work, work);
+ int cpu = twork->cpu;
+ struct rq *rq = cpu_rq(cpu);
+ struct rq_flags rf;
+
+ /*
+ * Handle the tick only if it appears the remote CPU is running
+ * in full dynticks mode. The check is racy by nature, but
+ * missing a tick or having one too much is no big deal.
+ */
+ if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
+ rq_lock_irq(rq, &rf);
+ update_rq_clock(rq);
+ rq->curr->sched_class->task_tick(rq, rq->curr, 0);
+ rq_unlock_irq(rq, &rf);
+ }
+
+ queue_delayed_work(system_unbound_wq, dwork, HZ);
+}
+
+static void sched_tick_start(int cpu)
+{
+ struct tick_work *twork;
+
+ if (housekeeping_cpu(cpu, HK_FLAG_TICK_SCHED))
+ return;
+
+ WARN_ON_ONCE(!tick_work_cpu);
+
+ twork = per_cpu_ptr(tick_work_cpu, cpu);
+ twork->cpu = cpu;
+ INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
+ queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void sched_tick_stop(int cpu)
+{
+ struct tick_work *twork;
+
+ if (housekeeping_cpu(cpu, HK_FLAG_TICK_SCHED))
+ return;
+
+ WARN_ON_ONCE(!tick_work_cpu);
+
+ twork = per_cpu_ptr(tick_work_cpu, cpu);
+ cancel_delayed_work_sync(&twork->work);
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
+int __init sched_tick_offload_init(void)
+{
+ tick_work_cpu = alloc_percpu(struct tick_work);
+ if (!tick_work_cpu) {
+ pr_err("Can't allocate remote tick struct\n");
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+#else
+static void sched_tick_start(int cpu) { }
+static void sched_tick_stop(int cpu) { }
+#endif /* CONFIG_NO_HZ_FULL */
#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
defined(CONFIG_PREEMPT_TRACER))
@@ -5713,6 +5793,7 @@ int sched_cpu_starting(unsigned int cpu)
{
set_cpu_rq_start_time(cpu);
sched_rq_cpu_starting(cpu);
+ sched_tick_start(cpu);
return 0;
}
@@ -5724,6 +5805,7 @@ int sched_cpu_dying(unsigned int cpu)
/* Handle pending wakeups and then migrate everything off */
sched_ttwu_pending();
+ sched_tick_stop(cpu);
rq_lock_irqsave(rq, &rf);
if (rq->rd) {
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 264ddcd..c5e7e90a 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -12,6 +12,7 @@
#include <linux/kernel.h>
#include <linux/static_key.h>
#include <linux/ctype.h>
+#include "sched.h"
DEFINE_STATIC_KEY_FALSE(housekeeping_overriden);
EXPORT_SYMBOL_GPL(housekeeping_overriden);
@@ -60,6 +61,9 @@ void __init housekeeping_init(void)
static_branch_enable(&housekeeping_overriden);
+ if (housekeeping_flags & HK_FLAG_TICK_SCHED)
+ sched_tick_offload_init();
+
/* We need at least one CPU to handle housekeeping work */
WARN_ON_ONCE(cpumask_empty(housekeeping_mask));
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a2..5a3b82c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1587,6 +1587,7 @@ extern void post_init_entity_util_avg(struct sched_entity *se);
#ifdef CONFIG_NO_HZ_FULL
extern bool sched_can_stop_tick(struct rq *rq);
+extern int __init sched_tick_offload_init(void);
/*
* Tick may be needed by tasks in the runqueue depending on their policy and
@@ -1611,6 +1612,7 @@ static inline void sched_update_tick_dependency(struct rq *rq)
tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
}
#else
+static inline int sched_tick_offload_init(void) { return 0; }
static inline void sched_update_tick_dependency(struct rq *rq) { }
#endif
--
2.7.4
Document the interface to offload the 1Hz scheduler tick in full
dynticks mode. Also improve the comment about the existing "nohz" flag
in order to differentiate its behaviour.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index af7104a..2524296 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1749,7 +1749,12 @@
specified in the flag list (default: domain):
nohz
- Disable the tick when a single task runs.
+ Disable the tick when a single task runs. A residual 1Hz
+ tick remains to keep the scheduler stats alive.
+ nohz_offload
+ Like nohz but the residual 1Hz tick is offloaded to
+ housekeeping CPUs, leaving the CPU free of any tick if
+ nothing else requests it.
domain
Isolate from the general SMP balancing and scheduling
algorithms. Note that performing domain isolation this way
--
2.7.4
Add the boot option that will allow us to offload the 1Hz scheduler tick
to the housekeeping CPU.
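For instance (illustration only), "isolcpus=nohz_offload,1-7" will set
HK_FLAG_TICK | HK_FLAG_TICK_SCHED and exclude CPUs 1-7 from the
housekeeping mask.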
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/sched/isolation.h | 3 ++-
kernel/sched/isolation.c | 6 ++++++
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d849431..c831855 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -11,7 +11,8 @@ enum hk_flags {
HK_FLAG_MISC = (1 << 2),
HK_FLAG_SCHED = (1 << 3),
HK_FLAG_TICK = (1 << 4),
- HK_FLAG_DOMAIN = (1 << 5),
+ HK_FLAG_TICK_SCHED = (1 << 5),
+ HK_FLAG_DOMAIN = (1 << 6),
};
#ifdef CONFIG_CPU_ISOLATION
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index b71b436..264ddcd 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -136,6 +136,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
continue;
}
+ if (!strncmp(str, "nohz_offload,", 13)) {
+ str += 13;
+ flags |= HK_FLAG_TICK | HK_FLAG_TICK_SCHED;
+ continue;
+ }
+
if (!strncmp(str, "domain,", 7)) {
str += 7;
flags |= HK_FLAG_DOMAIN;
--
2.7.4
On Thu, 4 Jan 2018 05:25:32 +0100
Frederic Weisbecker <[email protected]> wrote:
> Ingo,
>
> Please pull the sched/0hz branch that can be found at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> sched/0hz
>
> HEAD: 9e932b2cc707209febd130978a5eb9f4a943a3f4
>
> --
> Now that scheduler_tick() has become resilient to the absence of
> ticks, current->sched_class->task_tick() is the last piece that needs
> at least a 1Hz tick to keep the scheduler stats alive.
>
> This patchset adds a flag to the isolcpus boot option to offload the
> residual 1Hz tick. This way the nohz_full CPUs no longer have any tick
> (assuming nothing else requires it), as their residual 1Hz tick is
> offloaded to the housekeepers.
>
> For quick testing, say on CPUs 1-7:
>
> "isolcpus=nohz_offload,domain,1-7"
Sorry for being very late to this series, but I've a few comments to
make (one right now and others in individual patches).
Why are we extending isolcpus= given that it's a deprecated interface?
Some people have already moved away from isolcpus= now, but with this
new feature they will be forced back to using it.
What about just adding the new functionality to nohz_full=? That is,
no new options, just make the tick go away since this has always been
what nohz_full= was intended to do?
>
> Thanks,
> Frederic
> ---
>
> Frederic Weisbecker (5):
> sched: Rename init_rq_hrtick to hrtick_rq_init
> sched/isolation: Add scheduler tick offloading interface
> nohz: Allow to check if remote CPU tick is stopped
> sched/isolation: Residual 1Hz scheduler tick offload
> sched/isolation: Document "nohz_offload" flag
>
>
> Documentation/admin-guide/kernel-parameters.txt | 7 +-
> include/linux/sched/isolation.h | 3 +-
> include/linux/tick.h | 2 +
> kernel/sched/core.c | 94 +++++++++++++++++++++++--
> kernel/sched/isolation.c | 10 +++
> kernel/sched/sched.h | 2 +
> kernel/time/tick-sched.c | 7 ++
> 7 files changed, 117 insertions(+), 8 deletions(-)
>
On Thu, 4 Jan 2018 05:25:36 +0100
Frederic Weisbecker <[email protected]> wrote:
> When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> keep the scheduler stats alive. However, this residual tick is a burden
> for bare-metal tasks that can't stand any interruptions at all, or want
> to minimize them.
>
> Adding the boot parameter "isolcpus=nohz_offload" will now outsource
> these scheduler ticks to the global workqueue so that a housekeeping CPU
> handles that tick remotely.
>
> Note it's still up to the user to affine the global workqueues to the
> housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or
> domains isolation.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Chris Metcalf <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Luiz Capitulino <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Wanpeng Li <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
> kernel/sched/core.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
> kernel/sched/isolation.c | 4 +++
> kernel/sched/sched.h | 2 ++
> 3 files changed, 91 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d72d0e9..b964890 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3052,9 +3052,14 @@ void scheduler_tick(void)
> */
> u64 scheduler_tick_max_deferment(void)
> {
> - struct rq *rq = this_rq();
> - unsigned long next, now = READ_ONCE(jiffies);
> + struct rq *rq;
> + unsigned long next, now;
>
> + if (!housekeeping_cpu(smp_processor_id(), HK_FLAG_TICK_SCHED))
> + return ktime_to_ns(KTIME_MAX);
> +
> + rq = this_rq();
> + now = READ_ONCE(jiffies);
> next = rq->last_sched_tick + HZ;
>
> if (time_before_eq(next, now))
> @@ -3062,7 +3067,82 @@ u64 scheduler_tick_max_deferment(void)
>
> return jiffies_to_nsecs(next - now);
> }
> -#endif
> +
> +struct tick_work {
> + int cpu;
> + struct delayed_work work;
> +};
> +
> +static struct tick_work __percpu *tick_work_cpu;
> +
> +static void sched_tick_remote(struct work_struct *work)
> +{
> + struct delayed_work *dwork = to_delayed_work(work);
> + struct tick_work *twork = container_of(dwork, struct tick_work, work);
> + int cpu = twork->cpu;
> + struct rq *rq = cpu_rq(cpu);
> + struct rq_flags rf;
> +
> + /*
> + * Handle the tick only if it appears the remote CPU is running
> + * in full dynticks mode. The check is racy by nature, but
> + * missing a tick or having one too much is no big deal.
> + */
> + if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> + rq_lock_irq(rq, &rf);
> + update_rq_clock(rq);
> + rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> + rq_unlock_irq(rq, &rf);
> + }
OK, so this executes task_tick() remotely. What about account_process_tick()?
Don't we need it as well?
In particular, when I run a hog application on a nohz_full core configured
with tick offload, I can see in top that the CPU usage goes from 100%
to idle for a few seconds every couple of seconds. Could this be related?
Also, in my testing I'm sometimes seeing the tick, sometimes at 10- or
20-second intervals. Is this expected? I'll dig deeper next week.
> +
> + queue_delayed_work(system_unbound_wq, dwork, HZ);
> +}
> +
> +static void sched_tick_start(int cpu)
> +{
> + struct tick_work *twork;
> +
> + if (housekeeping_cpu(cpu, HK_FLAG_TICK_SCHED))
> + return;
> +
> + WARN_ON_ONCE(!tick_work_cpu);
> +
> + twork = per_cpu_ptr(tick_work_cpu, cpu);
> + twork->cpu = cpu;
> + INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> + queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +}
> +
> +#ifdef CONFIG_HOTPLUG_CPU
> +static void sched_tick_stop(int cpu)
> +{
> + struct tick_work *twork;
> +
> + if (housekeeping_cpu(cpu, HK_FLAG_TICK_SCHED))
> + return;
> +
> + WARN_ON_ONCE(!tick_work_cpu);
> +
> + twork = per_cpu_ptr(tick_work_cpu, cpu);
> + cancel_delayed_work_sync(&twork->work);
> +}
> +#endif /* CONFIG_HOTPLUG_CPU */
> +
> +int __init sched_tick_offload_init(void)
> +{
> + tick_work_cpu = alloc_percpu(struct tick_work);
> + if (!tick_work_cpu) {
> + pr_err("Can't allocate remote tick struct\n");
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> +#else
> +static void sched_tick_start(int cpu) { }
> +static void sched_tick_stop(int cpu) { }
> +#endif /* CONFIG_NO_HZ_FULL */
>
> #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
> defined(CONFIG_PREEMPT_TRACER))
> @@ -5713,6 +5793,7 @@ int sched_cpu_starting(unsigned int cpu)
> {
> set_cpu_rq_start_time(cpu);
> sched_rq_cpu_starting(cpu);
> + sched_tick_start(cpu);
> return 0;
> }
>
> @@ -5724,6 +5805,7 @@ int sched_cpu_dying(unsigned int cpu)
>
> /* Handle pending wakeups and then migrate everything off */
> sched_ttwu_pending();
> + sched_tick_stop(cpu);
>
> rq_lock_irqsave(rq, &rf);
> if (rq->rd) {
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 264ddcd..c5e7e90a 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -12,6 +12,7 @@
> #include <linux/kernel.h>
> #include <linux/static_key.h>
> #include <linux/ctype.h>
> +#include "sched.h"
>
> DEFINE_STATIC_KEY_FALSE(housekeeping_overriden);
> EXPORT_SYMBOL_GPL(housekeeping_overriden);
> @@ -60,6 +61,9 @@ void __init housekeeping_init(void)
>
> static_branch_enable(&housekeeping_overriden);
>
> + if (housekeeping_flags & HK_FLAG_TICK_SCHED)
> + sched_tick_offload_init();
> +
> /* We need at least one CPU to handle housekeeping work */
> WARN_ON_ONCE(cpumask_empty(housekeeping_mask));
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b19552a2..5a3b82c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1587,6 +1587,7 @@ extern void post_init_entity_util_avg(struct sched_entity *se);
>
> #ifdef CONFIG_NO_HZ_FULL
> extern bool sched_can_stop_tick(struct rq *rq);
> +extern int __init sched_tick_offload_init(void);
>
> /*
> * Tick may be needed by tasks in the runqueue depending on their policy and
> @@ -1611,6 +1612,7 @@ static inline void sched_update_tick_dependency(struct rq *rq)
> tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> }
> #else
> +static inline int sched_tick_offload_init(void) { return 0; }
> static inline void sched_update_tick_dependency(struct rq *rq) { }
> #endif
>
On Fri, Jan 12, 2018 at 02:18:13PM -0500, Luiz Capitulino wrote:
> On Thu, 4 Jan 2018 05:25:32 +0100
> Frederic Weisbecker <[email protected]> wrote:
>
> > Ingo,
> >
> > Please pull the sched/0hz branch that can be found at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > sched/0hz
> >
> > HEAD: 9e932b2cc707209febd130978a5eb9f4a943a3f4
> >
> > --
> > Now that scheduler_tick() has become resilient to the absence of
> > ticks, current->sched_class->task_tick() is the last piece that needs
> > at least a 1Hz tick to keep the scheduler stats alive.
> >
> > This patchset adds a flag to the isolcpus boot option to offload the
> > residual 1Hz tick. This way the nohz_full CPUs no longer have any tick
> > (assuming nothing else requires it), as their residual 1Hz tick is
> > offloaded to the housekeepers.
> >
> > For quick testing, say on CPUs 1-7:
> >
> > "isolcpus=nohz_offload,domain,1-7"
>
> Sorry for being very late to this series, but I've a few comments to
> make (one right now and others in individual patches).
>
> Why are we extending isolcpus= given that it's a deprecated interface?
> Some people have already moved away from isolcpus= now, but with this
> new feature they will be forced back to using it.
I tried to remove isolcpus or at least change the way it works so that its
effects are reversible (ie: affine the init task instead of isolating domains)
but that got nacked due to userspace's expectations about the behaviour.
That's when I realized that kernel parameters are like userspace ABIs,
they can't be removed easily whether we deprecate them or not.
Also I needed to be able to control the various isolation features, and
nohz_full is the wrong place to do that as nohz_full is really just an
isolation feature like the others; nohz_full= should really just imply
full dynticks and not watchdog, workqueue or tilegx NAPI isolation...
So isolcpus= is now the place where we control the isolation features
and nohz is one of them.
The complaint about isolcpus is that its result is immutable. I'm thinking
about making it modifiable via cpusets, but I only see two possible solutions:
- Make the root cpuset modifiable
- Create a directory called "isolcpus" visible on the first cpuset mount
and move all processes there.
> What about just adding the new functionality to nohz_full=? That is,
> no new options, just make the tick go away since this has always been
> what nohz_full= was intended to do?
We can, or have "isolcpus=nohz" do it, as both do almost the same thing.
But I'm afraid of the overhead for people used to nohz_full= once
they upgrade their kernels and see those workqueues once per second.
We can still affine those workqueues (in fact the whole unbound workqueue
mask) outside the nohz_full range. Still, current users may be surprised
by that new overhead on the housekeeping CPUs...
On Fri, Jan 12, 2018 at 02:22:58PM -0500, Luiz Capitulino wrote:
> On Thu, 4 Jan 2018 05:25:36 +0100
> Frederic Weisbecker <[email protected]> wrote:
>
> > When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> > keep the scheduler stats alive. However, this residual tick is a burden
> > for bare-metal tasks that can't stand any interruptions at all, or want
> > to minimize them.
> >
> > Adding the boot parameter "isolcpus=nohz_offload" will now outsource
> > these scheduler ticks to the global workqueue so that a housekeeping CPU
> > handles that tick remotely.
> >
> > Note it's still up to the user to affine the global workqueues to the
> > housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or
> > domains isolation.
> >
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Chris Metcalf <[email protected]>
> > Cc: Christoph Lameter <[email protected]>
> > Cc: Luiz Capitulino <[email protected]>
> > Cc: Mike Galbraith <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Wanpeng Li <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > ---
> > kernel/sched/core.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
> > kernel/sched/isolation.c | 4 +++
> > kernel/sched/sched.h | 2 ++
> > 3 files changed, 91 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d72d0e9..b964890 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3052,9 +3052,14 @@ void scheduler_tick(void)
> > */
> > u64 scheduler_tick_max_deferment(void)
> > {
> > - struct rq *rq = this_rq();
> > - unsigned long next, now = READ_ONCE(jiffies);
> > + struct rq *rq;
> > + unsigned long next, now;
> >
> > + if (!housekeeping_cpu(smp_processor_id(), HK_FLAG_TICK_SCHED))
> > + return ktime_to_ns(KTIME_MAX);
> > +
> > + rq = this_rq();
> > + now = READ_ONCE(jiffies);
> > next = rq->last_sched_tick + HZ;
> >
> > if (time_before_eq(next, now))
> > @@ -3062,7 +3067,82 @@ u64 scheduler_tick_max_deferment(void)
> >
> > return jiffies_to_nsecs(next - now);
> > }
> > -#endif
> > +
> > +struct tick_work {
> > + int cpu;
> > + struct delayed_work work;
> > +};
> > +
> > +static struct tick_work __percpu *tick_work_cpu;
> > +
> > +static void sched_tick_remote(struct work_struct *work)
> > +{
> > + struct delayed_work *dwork = to_delayed_work(work);
> > + struct tick_work *twork = container_of(dwork, struct tick_work, work);
> > + int cpu = twork->cpu;
> > + struct rq *rq = cpu_rq(cpu);
> > + struct rq_flags rf;
> > +
> > + /*
> > + * Handle the tick only if it appears the remote CPU is running
> > + * in full dynticks mode. The check is racy by nature, but
> > + * missing a tick or having one too much is no big deal.
> > + */
> > + if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> > + rq_lock_irq(rq, &rf);
> > + update_rq_clock(rq);
> > + rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> > + rq_unlock_irq(rq, &rf);
> > + }
>
> OK, so this executes task_tick() remotely. What about account_process_tick()?
> Don't we need it as well?
Nope, tasks in nohz_full mode have their own special accounting
(CONFIG_VIRT_CPU_ACCOUNTING_GEN) that doesn't rely on the tick.
>
> In particular, when I run a hog application on a nohz_full core configured
> with tick offload, I can see in top that the CPU usage goes from 100%
> to idle for a few seconds every couple of seconds. Could this be related?
>
> Also, in my testing I'm sometimes seeing the tick, sometimes at 10- or
> 20-second intervals. Is this expected? I'll dig deeper next week.
That's expected, see the changelog: the offload is not affined by default.
You need to either also isolate the domains:
isolcpus=nohz_offload,domain
or tweak the workqueue cpumask through:
/sys/devices/virtual/workqueue/cpumask
Thanks.
On Tue, 16 Jan 2018 16:41:00 +0100
Frederic Weisbecker <[email protected]> wrote:
> On Fri, Jan 12, 2018 at 02:18:13PM -0500, Luiz Capitulino wrote:
> > On Thu, 4 Jan 2018 05:25:32 +0100
> > Frederic Weisbecker <[email protected]> wrote:
> >
> > > Ingo,
> > >
> > > Please pull the sched/0hz branch that can be found at:
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > > sched/0hz
> > >
> > > HEAD: 9e932b2cc707209febd130978a5eb9f4a943a3f4
> > >
> > > --
> > > Now that scheduler_tick() has become resilient to the absence of
> > > ticks, current->sched_class->task_tick() is the last piece that needs
> > > at least a 1Hz tick to keep the scheduler stats alive.
> > >
> > > This patchset adds a flag to the isolcpus boot option to offload the
> > > residual 1Hz tick. This way the nohz_full CPUs no longer have any tick
> > > (assuming nothing else requires it), as their residual 1Hz tick is
> > > offloaded to the housekeepers.
> > >
> > > For quick testing, say on CPUs 1-7:
> > >
> > > "isolcpus=nohz_offload,domain,1-7"
> >
> > Sorry for being very late to this series, but I've a few comments to
> > make (one right now and others in individual patches).
> >
> > Why are we extending isolcpus= given that it's a deprecated interface?
> > Some people have already moved away from isolcpus= now, but with this
> > new feature they will be forced back to using it.
>
> I tried to remove isolcpus or at least change the way it works so that its
> effects are reversible (ie: affine the init task instead of isolating domains)
> but that got nacked due to userspace's expectations about the behaviour.
>
> That's when I realized that kernel parameters are like userspace ABIs,
> they can't be removed easily whether we deprecate them or not.
>
> Also I needed to be able to control the various isolation features, and
> nohz_full is the wrong place to do that as nohz_full is really just an
> isolation feature like the others; nohz_full= should really just imply
> full dynticks and not watchdog, workqueue or tilegx NAPI isolation...
Yeah, I completely agree with that.
> So isolcpus= is now the place where we control the isolation features
> and nohz is one of them.
That's the part I'm not very sure about. We've been advising users to
move away from isolcpus= when possible, but this much-wanted nohz_offload
feature will force everyone back to using isolcpus= again.
I have the impression this series is trying to solve two problems:
1. How (and where) we control the various isolation features in the
kernel
2. Where we add the control for the tick offload feature
I think item 1 is too complex to solve right now. IMHO, this series
should focus on item 2. And regarding item 2, I think we have two
choices to make:
1. Make tick offload a first-class citizen by making it the default for
nohz_full=. If there are regressions, we handle them
2. Add a new option to nohz_full=, like nohz_full=tick_offload
As an avid user of nohz_full I'm dying to see option 1 happening,
but I'm not totally sure what the consequences can be.
Another idea is to add CONFIG_NOHZ_TICK_OFFLOAD as an experimental
feature.
> The complaint about isolcpus is that its result is immutable. I'm thinking
> about making it modifiable via cpusets, but I only see two possible solutions:
>
> - Make the root cpuset modifiable
> - Create a directory called "isolcpus" visible on the first cpuset mount
> and move all processes there.
So, if we move the control of the tick offload to nohz_full= itself,
we can completely ditch any isolcpus= change in this series.
I think this should give you a great relief :)
> > What about just adding the new functionality to nohz_full=? That is,
> > no new options, just make the tick go away since this has always been
> > what nohz_full= was intended to do?
>
> We can, or have "isolcpus=nohz" do it, as both do almost the same thing.
>
> But I'm afraid of the overhead for people used to nohz_full= once
> they upgrade their kernels and see those workqueues once per second.
>
> We can still affine those workqueues (in fact the whole unbound workqueue
> mask) outside the nohz_full range. Still, current users may be surprised
> by that new overhead on the housekeeping CPUs...
>
On Tue, 16 Jan 2018 16:57:45 +0100
Frederic Weisbecker <[email protected]> wrote:
> On Fri, Jan 12, 2018 at 02:22:58PM -0500, Luiz Capitulino wrote:
> > On Thu, 4 Jan 2018 05:25:36 +0100
> > Frederic Weisbecker <[email protected]> wrote:
> >
> > > When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> > > keep the scheduler stats alive. However, this residual tick is a burden
> > > for bare-metal tasks that can't stand any interruptions at all, or want
> > > to minimize them.
> > >
> > > Adding the boot parameter "isolcpus=nohz_offload" will now outsource
> > > these scheduler ticks to the global workqueue so that a housekeeping CPU
> > > handles that tick remotely.
> > >
> > > Note it's still up to the user to affine the global workqueues to the
> > > housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or
> > > domains isolation.
> > >
> > > Signed-off-by: Frederic Weisbecker <[email protected]>
> > > Cc: Chris Metcalf <[email protected]>
> > > Cc: Christoph Lameter <[email protected]>
> > > Cc: Luiz Capitulino <[email protected]>
> > > Cc: Mike Galbraith <[email protected]>
> > > Cc: Paul E. McKenney <[email protected]>
> > > Cc: Peter Zijlstra <[email protected]>
> > > Cc: Rik van Riel <[email protected]>
> > > Cc: Thomas Gleixner <[email protected]>
> > > Cc: Wanpeng Li <[email protected]>
> > > Cc: Ingo Molnar <[email protected]>
> > > ---
> > > kernel/sched/core.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
> > > kernel/sched/isolation.c | 4 +++
> > > kernel/sched/sched.h | 2 ++
> > > 3 files changed, 91 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index d72d0e9..b964890 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3052,9 +3052,14 @@ void scheduler_tick(void)
> > > */
> > > u64 scheduler_tick_max_deferment(void)
> > > {
> > > - struct rq *rq = this_rq();
> > > - unsigned long next, now = READ_ONCE(jiffies);
> > > + struct rq *rq;
> > > + unsigned long next, now;
> > >
> > > + if (!housekeeping_cpu(smp_processor_id(), HK_FLAG_TICK_SCHED))
> > > + return ktime_to_ns(KTIME_MAX);
> > > +
> > > + rq = this_rq();
> > > + now = READ_ONCE(jiffies);
> > > next = rq->last_sched_tick + HZ;
> > >
> > > if (time_before_eq(next, now))
> > > @@ -3062,7 +3067,82 @@ u64 scheduler_tick_max_deferment(void)
> > >
> > > return jiffies_to_nsecs(next - now);
> > > }
> > > -#endif
> > > +
> > > +struct tick_work {
> > > + int cpu;
> > > + struct delayed_work work;
> > > +};
> > > +
> > > +static struct tick_work __percpu *tick_work_cpu;
> > > +
> > > +static void sched_tick_remote(struct work_struct *work)
> > > +{
> > > + struct delayed_work *dwork = to_delayed_work(work);
> > > + struct tick_work *twork = container_of(dwork, struct tick_work, work);
> > > + int cpu = twork->cpu;
> > > + struct rq *rq = cpu_rq(cpu);
> > > + struct rq_flags rf;
> > > +
> > > + /*
> > > + * Handle the tick only if it appears the remote CPU is running
> > > + * in full dynticks mode. The check is racy by nature, but
> > > + * missing a tick or having one too much is no big deal.
> > > + */
> > > + if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> > > + rq_lock_irq(rq, &rf);
> > > + update_rq_clock(rq);
> > > + rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> > > + rq_unlock_irq(rq, &rf);
> > > + }
> >
> > OK, so this executes task_tick() remotely. What about account_process_tick()?
> > Don't we need it as well?
>
> Nope, tasks in nohz_full mode have their own special accounting
> (CONFIG_VIRT_CPU_ACCOUNTING_GEN) that doesn't rely on the tick.
OK, excellent.
> > In particular, when I run a hog application on a nohz_full core configured
> > with tick offload, I can see in top that the CPU usage goes from 100%
> > to idle for a few seconds every couple of seconds. Could this be related?
> >
> > Also, in my testing I'm sometimes seeing the tick, sometimes at 10- or
> > 20-second intervals. Is this expected? I'll dig deeper next week.
>
> That's expected, see the changelog: the offload is not affined by default.
> You need to either also isolate the domains:
>
> isolcpus=nohz_offload,domain
>
> or tweak the workqueue cpumask through:
>
> /sys/devices/virtual/workqueue/cpumask
Yeah, I already do that. Later today or tomorrow I'll debug this to
see if the problem is in my setup or not.
>
> Thanks.
>
On Tue, 2018-01-16 at 16:41 +0100, Frederic Weisbecker wrote:
> On Fri, Jan 12, 2018 at 02:18:13PM -0500, Luiz Capitulino wrote:
>
> > Why are we extending isolcpus= given that it's a deprecated interface?
> > Some people have already moved away from isolcpus= now, but with this
> > new feature they will be forced back to using it.
>
> I tried to remove isolcpus or at least change the way it works so that its
> effects are reversible (ie: affine the init task instead of isolating domains)
> but that got nacked due to userspace's expectations about the behaviour.
So we paint ourselves into a static corner forever more, despite every
bit of this being all about "properties of sets of cpus", ie precisely
what cpusets was born to do. That's sad, dynamic wasn't that far away.
-Mike
On Tue, Jan 16, 2018 at 11:52:11AM -0500, Luiz Capitulino wrote:
> On Tue, 16 Jan 2018 16:41:00 +0100
> Frederic Weisbecker <[email protected]> wrote:
> > So isolcpus= is now the place where we control the isolation features
> > and nohz is one of them.
>
> That's the part I'm not very sure about. We've been advising users to
> move away from isolcpus= when possible, but this much-wanted nohz_offload
> feature will force everyone back to using isolcpus= again.
Note "isolcpus=nohz" only implies nohz. You need to add "domain" to get
the behaviour that you've been advising users against. We are simply
reusing an abandoned kernel parameter to control the isolation
features that were disorganized and opaque behind nohz.
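For instance (illustration only): "isolcpus=nohz,1-7" only implies full
dynticks on CPUs 1-7, while "isolcpus=nohz,domain,1-7" adds the old
domain isolation on top of it.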
>
> I have the impression this series is trying to solve two problems:
>
> 1. How (and where) we control the various isolation features in the
> kernel
No, that has already been done in the previous merge window. We have a
dedicated isolation subsystem now (kernel/sched/isolation.c) and
an interface to control all these isolation features that were abusively implied
by nohz. The initial plan was to introduce "cpu_isolation=" but it looked too much like
"isolcpus=". Then in fact, why not using "isolcpus=" and give it a second life.
And there we are.
In the end the goal is to propagate what is passed to "isolcpus=" to cpusets.
>
> 2. Where we add the control for the tick offload feature
>
> I think item 1 is too complex to solve right now. IMHO, this series
> should focus on item 2. And regarding item 2, I think we have two
> choices to make:
>
> 1. Make tick offload a first-class citizen by making it the default for
> nohz_full=. If there are regressions, we handle them
That's a possible way to go.
>
> 2. Add a new option to nohz_full=, like nohz_full=tick_offload
>
> As an avid user of nohz_full I'm dying to see option 1 happening,
> but I'm not totally sure what the consequences can be.
"nohz_full=" parameter has been badly designed as it implies much more
than just full dynticks. So I'm not really looking forward to expanding
it.
> Another idea is to add CONFIG_NOHZ_TICK_OFFLOAD as an experimental
> feature.
I fear it's way too distro-unfriendly. They will want to have it as a
capability without necessarily running it. Just like they do with
CONFIG_NO_HZ_FULL.
>
> > The complaint about isolcpus is that its result is immutable. I'm thinking
> > about making it modifiable via cpusets, but I only see two possible solutions:
> >
> > - Make the root cpuset modifiable
> > - Create a directory called "isolcpus" visible on the first cpuset mount
> > and move all processes there.
>
> So, if we move the control of the tick offload to nohz_full= itself,
> we can completely ditch any isolcpus= change in this series.
>
> I think this should give you a great relief :)
Not at all :)
What would be a great relief to me is if we could finally propagate isolcpus=
to cpusets so that we can continue to expand it without a second thought.
On Tue, Jan 16, 2018 at 06:58:18PM +0100, Mike Galbraith wrote:
> On Tue, 2018-01-16 at 16:41 +0100, Frederic Weisbecker wrote:
> > On Fri, Jan 12, 2018 at 02:18:13PM -0500, Luiz Capitulino wrote:
> >
> > > Why are we extending isolcpus= given that it's a deprecated interface?
> > > Some people have already moved away from isolcpus= now, but with this
> > > new feature they will be forced back to using it.
> >
> > I tried to remove isolcpus or at least change the way it works so that its
> > effects are reversible (ie: affine the init task instead of isolating domains)
> > but that got nacked due to userspace's expectations about the behaviour.
>
> So we paint ourselves into a static corner forever more, despite every
> bit of this being all about "properties of sets of cpus", ie precisely
> what cpusets was born to do. That's sad, dynamic wasn't that far away.
Hence the need to propagate "isolcpus=" to cpusets.
On Tue, 16 Jan 2018, Mike Galbraith wrote:
> > I tried to remove isolcpus or at least change the way it works so that its
> > effects are reversible (ie: affine the init task instead of isolating domains)
> > but that got nacked due to userspace's expectations about the behaviour.
>
> So we paint ourselves into a static corner forever more, despite every
> bit of this being all about "properties of sets of cpus", ie precisely
> what cpusets was born to do. That's sad, dynamic wasn't that far away.
cpusets was born in order to isolate applications to sets of processors.
The properties of sets of cpus were not on the horizon when SGI started
this.
We have sets of cpus associated with affinity masks in the form of bitmaps
etc etc, which is much more lightweight than having to lug around the cgroup
overhead everywhere.
A simple bitmask is much better if you have to control detailed system
behavior for each core and are planning each process's role because you
need to make full use of the hardware resources available.
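For what it's worth, a minimal sketch of that kind of direct bitmask
control from userspace (nothing specific to this series, just the plain
affinity API):

	#define _GNU_SOURCE
	#include <sched.h>

	int main(void)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(3, &set);	/* plan: this process owns CPU 3 */
		/* pid 0 == the calling thread */
		return sched_setaffinity(0, sizeof(set), &set) ? 1 : 0;
	}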
On Wed, 2018-01-17 at 08:51 -0600, Christopher Lameter wrote:
> On Tue, 16 Jan 2018, Mike Galbraith wrote:
>
> > > I tried to remove isolcpus or at least change the way it works so that its
> > > effects are reversible (ie: affine the init task instead of isolating domains)
> > > but that got nacked due to userspace's expectations about the behaviour.
> >
> > So we paint ourselves into a static corner forever more, despite every
> > bit of this being all about "properties of sets of cpus", ie precisely
> > what cpusets was born to do. That's sad, dynamic wasn't that far away.
>
> cpusets was born in order to isolate applications to sets of processors.
> The properties of sets of cpus were not on the horizon when SGI started
> this.
Domain connectivity very much is a property of a set of CPUs, a rather
important one, and one managed by cpusets. NOHZ_FULL is a property of
a set of cpus, thus a most excellent fit. Other things are as well.
> We have sets of cpus associated with affinity masks in the form of bitmaps
> etc etc, which is much more lightweight than having to lug around the cgroup
> overhead everywhere.
What does everywhere mean, set creation time?
> A simple bitmask is much better if you have to control detailed system
> behavior for each core and are planning each process's role because you
> need to make full use of the hardware resources available.
If you live in a static world, maybe.
I like the flexibility of being able to configure on the fly. One tiny
example: for a high performance aircraft manufacturer, having military
simulation background, I know that simulators frequently have to be
ready to go at the drop of a hat, so I twiddled cpusets to let them
flip their extra fancy video game (80 cores, real controls/avionics...
"game over, insert one gold bar to continue" kind of fancy) from low
power idle to full bore hard realtime with one poke to a cpuset file.
Static may be fine for some, for others, dynamic is much better.
-Mike
On Wed, 17 Jan 2018, Mike Galbraith wrote:
> Domain connectivity very much is a property of a set of CPUs, a rather
> important one, and one managed by cpusets. NOHZ_FULL is a property of
> a set of cpus, thus a most excellent fit. Other things are as well.
Not sure what domain refers to in this context.
> > We have sets of cpus associated with affinity masks in the form of bitmaps
> > etc etc, which is much more lightweight than having to lug around the cgroup
> > overhead everywhere.
>
> What does everywhere mean, set creation time?
You would need to create multiple cgroups to create what you want. Those
will "inherit" characteristics from higher levels etc etc. It gets
needlessly complicated and difficult to debug if something goes wrong.
> > A simple bitmask is much better if you have to control detailed system
> > behavior for each core and are planning each process's role because you
> > need to make full use of the hardware resources available.
>
> If you live in a static world, maybe.
Why would that be restricted to a static world?
> I like the flexibility of being able to configure on the fly. One tiny
> example: for a high performance aircraft manufacturer, having military
> simulation background, I know that simulators frequently have to be
> ready to go at the drop of a hat, so I twiddled cpusets to let them
> flip their extra fancy video game (80 cores, real controls/avionics...
> "game over, insert one gold bar to continue" kind of fancy) from low
> power idle to full bore hard realtime with one poke to a cpuset file.
>
> Static may be fine for some, for others, dynamic is much better.
The problem is that I may be flipping a flag in a cpuset to enable
something but some other cpuset somewhere in the complex hierarchy does
something different that causes a conflict. The directness of control is
lost. Instead there is the fog of complexity created by the cgroups that
have various plugins and whatnot.
On Wed, 2018-01-17 at 10:32 -0600, Christopher Lameter wrote:
> On Wed, 17 Jan 2018, Mike Galbraith wrote:
>
> > Domain connectivity very much is a property of a set of CPUs, a rather
> > important one, and one managed by cpusets. NOHZ_FULL is a property of
> > a set of cpus, thus a most excellent fit. Other things are as well.
>
> Not sure to what domain refers to in this context.
Scheduler domains, load balancing.
> > > We have sets of cpus associated with affinity masks in the form of bitmaps
> > > etc etc, which is much more lightweight than having to lug around the cgroup
> > > overhead everywhere.
> >
> > What does everywhere mean, set creation time?
>
> You would need to create multiple cgroups to create what you want. Those
> will "inherit" characteristics from higher levels etc etc. It gets
> needlessly complicated and difficult to debug if something goes wrong.
It's only as complicated as you make it. What I create is dirt simple,
an exclusive system set and an exclusive realtime set, both directly
under root. It doesn't get any simpler than that.
> > > A simple bitmask is much better if you have to control detailed system
> > > behavior for each core and are planning each process's role because you
> > > need to make full use of the hardware resources available.
> >
> > If you live in a static world, maybe.
>
> Why would that be restricted to a static world?
Guess I misunderstood, unimportant.
> > I like the flexibility of being able to configure on the fly. One tiny
> > example: for a high performance aircraft manufacturer, having military
> > simulation background, I know that simulators frequently have to be
> > ready to go at the drop of a hat, so I twiddled cpusets to let them
> > flip their extra fancy video game (80 cores, real controls/avionics...
> > "game over, insert one gold bar to continue" kind of fancy) from low
> > power idle to full bore hard realtime with one poke to a cpuset file.
> >
> > Static may be fine for some, for others, dynamic is much better.
>
> The problem is that I may be flipping a flag in a cpuset to enable
> something but some other cpuset somewhere in the complex hierarchy does
> something different that causes a conflict.
That's what exclusive sets are for, zero set overlap. It would be very
difficult to both connect and disconnect scheduler domains :)
> The directness of control is
> lost. Instead there is the fog of complexity created by the cgroups that
> have various plugins and whatnot.
You don't have to use any of the other controllers, I don't, just tell
systemthing to pretty please NOT co-mount controllers, and whatever to
ensure it keeps its tentacles off of your toys, and you're fine.
-Mike
On Tue, 16 Jan 2018 23:51:29 +0100
Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jan 16, 2018 at 11:52:11AM -0500, Luiz Capitulino wrote:
> > On Tue, 16 Jan 2018 16:41:00 +0100
> > Frederic Weisbecker <[email protected]> wrote:
> > > So isolcpus= is now the place where we control the isolation features
> > > and nohz is one of them.
> >
> > That's the part I'm not very sure about. We've been advising users to
> > move away from isolcpus= when possible, but this much-wanted nohz_offload
> > feature will force everyone back to using isolcpus= again.
>
> Note "isolcpus=nohz" only implies nohz. You need to add "domain" to get
> the behaviour that you've been advising users against. We are simply
> reusing an abandoned kernel parameter to control the isolation
> features that were disorganized and opaque behind nohz.
>
> >
> > I have the impression this series is trying to solve two problems:
> >
> > 1. How (and where) we control the various isolation features in the
> > kernel
>
> No, that has already been done in the previous merge window. We have a
> dedicated isolation subsystem now (kernel/sched/isolation.c) and
> an interface to control all these isolation features that were abusively implied
> by nohz. The initial plan was to introduce "cpu_isolation=" but it looked too much like
> "isolcpus=". Then in fact, why not using "isolcpus=" and give it a second life.
> And there we are.
OK, I get it now. But then the series has to un-deprecate isolcpus=, otherwise
it doesn't make sense to use it.
On Wed, Jan 17, 2018 at 12:38:01PM -0500, Luiz Capitulino wrote:
> On Tue, 16 Jan 2018 23:51:29 +0100
> Frederic Weisbecker <[email protected]> wrote:
>
> > On Tue, Jan 16, 2018 at 11:52:11AM -0500, Luiz Capitulino wrote:
> > > On Tue, 16 Jan 2018 16:41:00 +0100
> > > Frederic Weisbecker <[email protected]> wrote:
> > > > So isolcpus= is now the place where we control the isolation features
> > > > and nohz is one of them.
> > >
> > > That's the part I'm not very sure about. We've been advising users to
> > > move away from isolcpus= when possible, but this much-wanted nohz_offload
> > > feature will force everyone back to using isolcpus= again.
> >
> > Note "isolcpus=nohz" only implies nohz. You need to add "domain" to get
> > the behaviour that you've been advising users against. We are simply
> > reusing an abandoned kernel parameter to control the isolation
> > features that were disorganized and opaque behind nohz.
> >
> > >
> > > I have the impression this series is trying to solve two problems:
> > >
> > > 1. How (and where) we control the various isolation features in the
> > > kernel
> >
> > No, that has already been done in the previous merge window. We have a
> > dedicated isolation subsystem now (kernel/sched/isolation.c) and
> > an interface to control all these isolation features that were abusively implied
> > by nohz. The initial plan was to introduce "cpu_isolation=" but it looked too much like
> > "isolcpus=". Then in fact, why not using "isolcpus=" and give it a second life.
> > And there we are.
>
> OK, I get it now. But then the series has to un-deprecate isolcpus=, otherwise
> it doesn't make sense to use it.
Good point. Also I think you convinced me to just apply that tick offload
to the existing nohz kernel parameters right away, that is, to both the
existing "nohz_full=" and "isolcpus=nohz".
After all, that tick offload is an implementation detail.
Like you said, if people complain about a regression, we can still fix it
with a new option. But in the end I doubt this will be needed.
I'll respin with that.
Thanks!
On Thu, 18 Jan 2018 04:04:43 +0100
Frederic Weisbecker <[email protected]> wrote:
> On Wed, Jan 17, 2018 at 12:38:01PM -0500, Luiz Capitulino wrote:
> > On Tue, 16 Jan 2018 23:51:29 +0100
> > Frederic Weisbecker <[email protected]> wrote:
> >
> > > On Tue, Jan 16, 2018 at 11:52:11AM -0500, Luiz Capitulino wrote:
> > > > On Tue, 16 Jan 2018 16:41:00 +0100
> > > > Frederic Weisbecker <[email protected]> wrote:
> > > > > So isolcpus= is now the place where we control the isolation features
> > > > > and nohz is one of them.
> > > >
> > > > That's the part I'm not very sure about. We've been advising users to
> > > > move away from isolcpus= when possible, but this much-wanted nohz_offload
> > > > feature will force everyone back to using isolcpus= again.
> > >
> > > Note "isolcpus=nohz" only implies nohz. You need to add "domain" to get
> > > the behaviour that you've been advising users against. We are simply
> > > reusing an abandoned kernel parameter to control the isolation
> > > features that were disorganized and opaque behind nohz.
> > >
> > > >
> > > > I have the impression this series is trying to solve two problems:
> > > >
> > > > 1. How (and where) we control the various isolation features in the
> > > > kernel
> > >
> > > No, that has already been done in the previous merge window. We have a
> > > dedicated isolation subsystem now (kernel/sched/isolation.c) and
> > > an interface to control all these isolation features that were abusively implied
> > > by nohz. The initial plan was to introduce "cpu_isolation=" but it looked too much like
> > > "isolcpus=". Then in fact, why not using "isolcpus=" and give it a second life.
> > > And there we are.
> >
> > OK, I get it now. But then the series has to un-deprecate isolcpus=, otherwise
> > it doesn't make sense to use it.
>
> Good point. Also I think you convinced me to just apply that tick offload
> to the existing nohz kernel parameters right away, that is, to both the
> existing "nohz_full=" and "isolcpus=nohz".
>
> After all, that tick offload is an implementation detail.
>
> Like you said, if people complain about a regression, we can still fix it
> with a new option. But in the end I doubt this will be needed.
>
> I'll respin with that.
Exciting times!
Btw, I do have this problem where I have a hog app on an isolated core
with isolcpus=nohz_offload,domain,... and I see top -d1 going from 100%
to 0% and then back from 0% to 100% every few seconds or so. I'll debug
it when you post the next version.