2020-11-19 12:17:56

by Li, Aubrey

[permalink] [raw]
Subject: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

Add idle cpumask to track idle cpus in sched domain. When a CPU
enters idle, if the idle driver indicates to stop tick, this CPU
is set in the idle cpumask to be a wakeup target. And if the CPU
is not in idle, the CPU is cleared in idle cpumask during scheduler
tick to ratelimit idle cpumask update.

When a task wakes up to select an idle cpu, scanning idle cpumask
has low cost than scanning all the cpus in last level cache domain,
especially when the system is heavily loaded.

Benchmarks were tested on a x86 4 socket system with 24 cores per
socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
schbench have no notable change, uperf has:

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

threads baseline-avg %std patch-avg %std
96 1 0.83 1.23 3.27
144 1 1.03 1.67 2.67
192 1 0.69 1.81 3.59
240 1 2.84 1.51 2.67

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
if cpu driver is available.
- move clearing idle cpumask to scheduler_tick to decouple nohz mode.

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
has a regression of 99th percentile latency.
- change clearing idle cpumask to nohz_balancer_kick(), so updating
idle cpumask is ratelimited in the idle exiting path.
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
cpumask with sched_domain_span(sd), nohz=off case remains the original
behavior.

Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Jiang Biao <[email protected]>
Cc: Tim Chen <[email protected]>
Signed-off-by: Aubrey Li <[email protected]>
---
include/linux/sched/topology.h | 13 +++++++++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 52 +++++++++++++++++++++++++++++++++-
kernel/sched/idle.c | 8 ++++--
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 3 +-
6 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ /*
+ * Span of all idle CPUs in this domain.
+ *
+ * NOTE: this field is variable length. (Allocated dynamically
+ * by attaching extra space to the end of the structure,
+ * depending on how many CPUs the kernel has booted up with)
+ */
+ unsigned long idle_cpus_span[];
};

+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+ return to_cpumask(sds->idle_cpus_span);
+}
+
struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent; /* top domain must be null terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b1e0da56abca..c86ae0495163 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3994,6 +3994,7 @@ void scheduler_tick(void)
rq_lock(rq, &rf);

update_rq_clock(rq);
+ update_idle_cpumask(rq, false);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
@@ -7192,6 +7193,7 @@ void __init sched_init(void)
rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
#endif
#endif /* CONFIG_SMP */
+ rq->last_idle_state = 1;
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48a6d442b444..d67fba5e406b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t

time = cpu_clock(this);

- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ /*
+ * sched_domain_shared is set only at shared cache level,
+ * this works only because select_idle_cpu is called with
+ * sd_llc.
+ */
+ cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6807,6 +6812,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
#endif /* CONFIG_SMP */

+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(struct rq *rq, bool set_idle)
+{
+ struct sched_domain *sd;
+ int cpu = cpu_of(rq);
+ int idle_state;
+
+ /*
+ * If called from scheduler tick, only update
+ * idle cpumask if the CPU is busy, as idle
+ * cpumask is also updated on idle entry.
+ *
+ */
+ if (!set_idle && idle_cpu(cpu))
+ return;
+ /*
+ * Also set SCHED_IDLE cpu in idle cpumask to
+ * allow SCHED_IDLE cpu as a wakeup target
+ */
+ idle_state = set_idle || sched_idle_cpu(cpu);
+ /*
+ * No need to update idle cpumask if the state
+ * does not change.
+ */
+ if (rq->last_idle_state == idle_state)
+ return;
+
+ rcu_read_lock();
+ sd = rcu_dereference(per_cpu(sd_llc, cpu));
+ if (!sd || !sd->shared)
+ goto unlock;
+
+ if (idle_state)
+ cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+ else
+ cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
+
+ rq->last_idle_state = idle_state;
+unlock:
+ rcu_read_unlock();
+}
+
static unsigned long wakeup_gran(struct sched_entity *se)
{
unsigned long gran = sysctl_sched_wakeup_granularity;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f324dc36fc43..6f5947673e66 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -163,6 +163,7 @@ static void cpuidle_idle_call(void)
*/

if (cpuidle_not_available(drv, dev)) {
+ update_idle_cpumask(this_rq(), true);
tick_nohz_idle_stop_tick();

default_idle_call();
@@ -193,6 +194,7 @@ static void cpuidle_idle_call(void)
max_latency_ns = dev->forced_idle_latency_limit_ns;
}

+ update_idle_cpumask(this_rq(), true);
tick_nohz_idle_stop_tick();

next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
@@ -205,10 +207,12 @@ static void cpuidle_idle_call(void)
*/
next_state = cpuidle_select(drv, dev, &stop_tick);

- if (stop_tick || tick_nohz_tick_stopped())
+ if (stop_tick || tick_nohz_tick_stopped()) {
+ update_idle_cpumask(this_rq(), true);
tick_nohz_idle_stop_tick();
- else
+ } else {
tick_nohz_idle_retain_tick();
+ }

entered_state = call_cpuidle(drv, dev, next_state);
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8d1ca65db3b0..db460b20217a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1004,6 +1004,7 @@ struct rq {
/* This is used to determine avg_idle's max value */
u64 max_idle_balance_cost;
#endif /* CONFIG_SMP */
+ unsigned char last_idle_state;

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
@@ -1088,6 +1089,7 @@ static inline void update_idle_core(struct rq *rq)
#else
static inline void update_idle_core(struct rq *rq) { }
#endif
+void update_idle_cpumask(struct rq *rq, bool set_idle);

DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1bd7e3af904f..541bd3a7de4d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+ cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
}

sd->private = sdd;
@@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)

*per_cpu_ptr(sdd->sd, j) = sd;

- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
--
2.25.1


2020-11-23 09:33:17

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

Hi Aubrey,

On Thu, 19 Nov 2020 at 13:15, Aubrey Li <[email protected]> wrote:
>
> Add idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, if the idle driver indicates to stop tick, this CPU
> is set in the idle cpumask to be a wakeup target. And if the CPU
> is not in idle, the CPU is cleared in idle cpumask during scheduler
> tick to ratelimit idle cpumask update.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has low cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> Benchmarks were tested on a x86 4 socket system with 24 cores per
> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
> schbench have no notable change, uperf has:
>
> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>
> threads baseline-avg %std patch-avg %std
> 96 1 0.83 1.23 3.27
> 144 1 1.03 1.67 2.67
> 192 1 0.69 1.81 3.59
> 240 1 2.84 1.51 2.67
>
> v4->v5:
> - add update_idle_cpumask for s2idle case
> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> idle_cpumask() everywhere
>
> v3->v4:
> - change setting idle cpumask from every idle entry to tickless idle
> if cpu driver is available.

Could you remind me why you did this change ? Clearing the cpumask is
done during the tick to rate limit the number of updates of the
cpumask but It's not clear for me why you have associated the set with
the tick stop condition too.

This change means that a cpu will not be part of the idle mask if the
tick is not stopped. On some arm/arm64 platforms, the tick stops only
if the idle duration is expected to be higher than 1-2ms which starts
to be significantly long. Also, the cpuidle governor can easily
mis-predict a short idle duration whereas it will be finally a long
idle duration; In this case, the next tick will correct the situation
and select a deeper state, but this can happen up to 4ms later on
arm/arm64.

So I would prefer to keep trying to set the idle mask everytime the
cpu enters idle. If a tick has not happened between 2 idle phases, the
cpumask will not be updated and the overhead will be mostly testing if
(rq->last_idle_state == idle_state).


> - move clearing idle cpumask to scheduler_tick to decouple nohz mode.
>
> v2->v3:
> - change setting idle cpumask to every idle entry, otherwise schbench
> has a regression of 99th percentile latency.
> - change clearing idle cpumask to nohz_balancer_kick(), so updating
> idle cpumask is ratelimited in the idle exiting path.
> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
>
> v1->v2:
> - idle cpumask is updated in the nohz routines, by initializing idle
> cpumask with sched_domain_span(sd), nohz=off case remains the original
> behavior.
>
> Cc: Mel Gorman <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Jiang Biao <[email protected]>
> Cc: Tim Chen <[email protected]>
> Signed-off-by: Aubrey Li <[email protected]>
> ---
> include/linux/sched/topology.h | 13 +++++++++
> kernel/sched/core.c | 2 ++
> kernel/sched/fair.c | 52 +++++++++++++++++++++++++++++++++-
> kernel/sched/idle.c | 8 ++++--
> kernel/sched/sched.h | 2 ++
> kernel/sched/topology.c | 3 +-
> 6 files changed, 76 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 820511289857..b47b85163607 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + /*
> + * Span of all idle CPUs in this domain.
> + *
> + * NOTE: this field is variable length. (Allocated dynamically
> + * by attaching extra space to the end of the structure,
> + * depending on how many CPUs the kernel has booted up with)
> + */
> + unsigned long idle_cpus_span[];
> };
>
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> + return to_cpumask(sds->idle_cpus_span);
> +}
> +
> struct sched_domain {
> /* These fields must be setup */
> struct sched_domain __rcu *parent; /* top domain must be null terminated */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b1e0da56abca..c86ae0495163 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3994,6 +3994,7 @@ void scheduler_tick(void)
> rq_lock(rq, &rf);
>
> update_rq_clock(rq);
> + update_idle_cpumask(rq, false);
> thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
> curr->sched_class->task_tick(rq, curr, 0);
> @@ -7192,6 +7193,7 @@ void __init sched_init(void)
> rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
> #endif
> #endif /* CONFIG_SMP */
> + rq->last_idle_state = 1;
> hrtick_rq_init(rq);
> atomic_set(&rq->nr_iowait, 0);
> }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48a6d442b444..d67fba5e406b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6145,7 +6145,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> time = cpu_clock(this);
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + /*
> + * sched_domain_shared is set only at shared cache level,
> + * this works only because select_idle_cpu is called with
> + * sd_llc.
> + */
> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> @@ -6807,6 +6812,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> }
> #endif /* CONFIG_SMP */
>
> +/*
> + * Update cpu idle state and record this information
> + * in sd_llc_shared->idle_cpus_span.
> + */
> +void update_idle_cpumask(struct rq *rq, bool set_idle)
> +{
> + struct sched_domain *sd;
> + int cpu = cpu_of(rq);
> + int idle_state;
> +
> + /*
> + * If called from scheduler tick, only update
> + * idle cpumask if the CPU is busy, as idle
> + * cpumask is also updated on idle entry.
> + *
> + */
> + if (!set_idle && idle_cpu(cpu))
> + return;
> + /*
> + * Also set SCHED_IDLE cpu in idle cpumask to
> + * allow SCHED_IDLE cpu as a wakeup target
> + */
> + idle_state = set_idle || sched_idle_cpu(cpu);
> + /*
> + * No need to update idle cpumask if the state
> + * does not change.
> + */
> + if (rq->last_idle_state == idle_state)
> + return;
> +
> + rcu_read_lock();
> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> + if (!sd || !sd->shared)
> + goto unlock;
> +
> + if (idle_state)
> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> + else
> + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> +
> + rq->last_idle_state = idle_state;
> +unlock:
> + rcu_read_unlock();
> +}
> +
> static unsigned long wakeup_gran(struct sched_entity *se)
> {
> unsigned long gran = sysctl_sched_wakeup_granularity;
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index f324dc36fc43..6f5947673e66 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -163,6 +163,7 @@ static void cpuidle_idle_call(void)
> */
>
> if (cpuidle_not_available(drv, dev)) {
> + update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
>
> default_idle_call();
> @@ -193,6 +194,7 @@ static void cpuidle_idle_call(void)
> max_latency_ns = dev->forced_idle_latency_limit_ns;
> }
>
> + update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
>
> next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> @@ -205,10 +207,12 @@ static void cpuidle_idle_call(void)
> */
> next_state = cpuidle_select(drv, dev, &stop_tick);
>
> - if (stop_tick || tick_nohz_tick_stopped())
> + if (stop_tick || tick_nohz_tick_stopped()) {
> + update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
> - else
> + } else {
> tick_nohz_idle_retain_tick();
> + }
>
> entered_state = call_cpuidle(drv, dev, next_state);
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 8d1ca65db3b0..db460b20217a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1004,6 +1004,7 @@ struct rq {
> /* This is used to determine avg_idle's max value */
> u64 max_idle_balance_cost;
> #endif /* CONFIG_SMP */
> + unsigned char last_idle_state;
>
> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> u64 prev_irq_time;
> @@ -1088,6 +1089,7 @@ static inline void update_idle_core(struct rq *rq)
> #else
> static inline void update_idle_core(struct rq *rq) { }
> #endif
> +void update_idle_cpumask(struct rq *rq, bool set_idle);
>
> DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 1bd7e3af904f..541bd3a7de4d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> atomic_inc(&sd->shared->ref);
> atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> + cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> }
>
> sd->private = sdd;
> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>
> *per_cpu_ptr(sdd->sd, j) = sd;
>
> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
> if (!sds)
> return -ENOMEM;
> --
> 2.25.1
>

2020-11-24 23:44:01

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

Hi Aubrey,

Le mardi 24 nov. 2020 ? 15:01:38 (+0800), Li, Aubrey a ?crit :
> Hi Vincent,
>
> On 2020/11/23 17:27, Vincent Guittot wrote:
> > Hi Aubrey,
> >
> > On Thu, 19 Nov 2020 at 13:15, Aubrey Li <[email protected]> wrote:
> >>
> >> Add idle cpumask to track idle cpus in sched domain. When a CPU
> >> enters idle, if the idle driver indicates to stop tick, this CPU
> >> is set in the idle cpumask to be a wakeup target. And if the CPU
> >> is not in idle, the CPU is cleared in idle cpumask during scheduler
> >> tick to ratelimit idle cpumask update.
> >>
> >> When a task wakes up to select an idle cpu, scanning idle cpumask
> >> has low cost than scanning all the cpus in last level cache domain,
> >> especially when the system is heavily loaded.
> >>
> >> Benchmarks were tested on a x86 4 socket system with 24 cores per
> >> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
> >> schbench have no notable change, uperf has:
> >>
> >> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
> >>
> >> threads baseline-avg %std patch-avg %std
> >> 96 1 0.83 1.23 3.27
> >> 144 1 1.03 1.67 2.67
> >> 192 1 0.69 1.81 3.59
> >> 240 1 2.84 1.51 2.67
> >>
> >> v4->v5:
> >> - add update_idle_cpumask for s2idle case
> >> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> >> idle_cpumask() everywhere
> >>
> >> v3->v4:
> >> - change setting idle cpumask from every idle entry to tickless idle
> >> if cpu driver is available.
> >
> > Could you remind me why you did this change ? Clearing the cpumask is
> > done during the tick to rate limit the number of updates of the
> > cpumask but It's not clear for me why you have associated the set with
> > the tick stop condition too.
>
> I found the current implementation has better performance at a more
> suitable load range.
>
> The two kinds of implementions(v4 and v5) have the same rate(scheduler
> tick) to shrink idle cpumask when the system is busy, but

I'm ok with the part above

>
> - Setting the idle mask everytime the cpu enters idle requires a much
> heavier load level to preserve the idle cpumask(not call into idle),
> otherwise the bits cleared in scheduler tick will be restored when the
> cpu enters idle. That is, idle cpumask is almost equal to the domain
> cpumask during task wakeup if the system load is not heavy enough.

But setting the?idle cpumask is useful?because it helps to select an idle
cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
the idle cpu mask is useful in heavy cases because a system, which is
already fully busy with work,?doesn't want to waste time looking for an
idle cpu that doesn't exist. But if there is an idle cpu, we should?still
looks for it.

>
>
> - Associating with tick stop tolerates idle to preserve the idle cpumask
> but only short idle, which causes tick retains. This is more fitable for
> the real workload.

I don't agree with this and real use cases with interaction will probably
not agree as well as they want to run on an idle cpu if any but not wait
on an already busy one.
Also keep in mind that a tick can be up to 10ms long

>
> >
> > This change means that a cpu will not be part of the idle mask if the
> > tick is not stopped. On some arm/arm64 platforms, the tick stops only
> > if the idle duration is expected to be higher than 1-2ms which starts
> > to be significantly long. Also, the cpuidle governor can easily
> > mis-predict a short idle duration whereas it will be finally a long
> > idle duration; In this case, the next tick will correct the situation
> > and select a deeper state, but this can happen up to 4ms later on
> > arm/arm64.
>
> Yes this is intented. If the tick is not stopped, that indicates the
> CPU is very busy, cpu idle governor selected the polling idle state, and/or
> the expected idle duration is shorter than the tick period length. For

As mentioned above a tick can be up to 10ms long which is not a short idle
duration.

Then the?governor also mispredicts the idle duration and this is one
reason that the tick is not stopped because it will give the opportunity
to reevaluate the idle state in case of misprediction.

> example, uperf enters and exits 80 times between two ticks when utilizes
> 100% CPU, and the average idle residency < 50us.

But scheduler looks for idle state of prev cpu before looping the idle cpu
mask so i'm not sure that uperf is impacted in this case because scheduler
will select prev cpu before loop idle cpu mask.

>
> If this CPU is added to idle cpumask, the wakeup task likely needs to
> wait in the runqueue as this CPU will run its current task very soon.
>
> >
> > So I would prefer to keep trying to set the idle mask everytime the
> > cpu enters idle. If a tick has not happened between 2 idle phases, the
> > cpumask will not be updated and the overhead will be mostly testing if
> > (rq->last_idle_state == idle_state).
>
> Not sure if I addressed your concern, did you see any workloads any cases
> v4 performs better than v5?

Yes, I see some perf regression on my octo arm64 system for hackbench with
only 1 group (and for few ther ones but it's less obvious). There is no
perf impact with more groups most probably because the cpus are no more idle.

The regression happens even if the shallowest idle state is the only one to
be enabled.

- 2 x 4 cores arm64 system

12 iterations of hackbench -l (256000/#grp) -g #grp

Only the shallowest state enabled
(as a sidenote, we don't have polling mode on arm64)

grp tip/sched/core + your patchset v5 + the change below
1 1.296(+/- 2.06 %) 1.890(+/-35,10 %) -45% 1.311(+/- 2.63%) -1.19%
4 1.249(+/- 2.67 %) 1.255(+/- 3.10 %) - 0.46 % 1.235(+/- 4.77%) -0.28%
8 1.175(+/- 1.20 %) 1.180(+/- 2.33 %) - 0.47 % 1.179(+/- 1.60%) -0.38%
16 1.213(+/- 1.10 %) 1.218(+/- 0.68 %) - 0.41 % 1.219(+/- 2.15%) -0.52%

All idle states enabled (3 states)

grp tip/sched/core + your patchset v5 + the change below
1 1.429(+/-19.36 %) 1.678(+/-34.29 %) -17% 1.426(+/-17.18%) +0.21%
4 1.244(+/- 3.04 %) 1.261(+/- 3.21 %) - 1.39 % 1.260(+/- 2.55%) -1.29%
8 1.169(+/- 1.43 %) 1.183(+/- 2.09 %) - 1.23 % 1.198(+/- 2.49%) -2.51%
16 1.219(+/- 1.23 %) 1.218(+/- 0.68 %) - 0.59 % 1.225(+/- 1.43%) -0.49%

The change below is a bit different from your v3 because I wanted to check if
the root cause was the set of cpuilde mask only when tick is stopped.

---
kernel/sched/idle.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a38d8822ce0d..ca32197778b0 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -156,6 +156,7 @@ static void cpuidle_idle_call(void)
return;
}

+ update_idle_cpumask(this_rq(), true);
/*
* The RCU framework needs to be told that we are entering an idle
* section, so no more rcu read side critical sections and one more
@@ -163,7 +164,6 @@ static void cpuidle_idle_call(void)
*/

if (cpuidle_not_available(drv, dev)) {
- update_idle_cpumask(this_rq(), true);
tick_nohz_idle_stop_tick();

default_idle_call();
@@ -194,7 +194,6 @@ static void cpuidle_idle_call(void)
max_latency_ns = dev->forced_idle_latency_limit_ns;
}

- update_idle_cpumask(this_rq(), true);
tick_nohz_idle_stop_tick();

next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
@@ -208,7 +207,6 @@ static void cpuidle_idle_call(void)
next_state = cpuidle_select(drv, dev, &stop_tick);

if (stop_tick || tick_nohz_tick_stopped()) {
- update_idle_cpumask(this_rq(), true);
tick_nohz_idle_stop_tick();
} else {
tick_nohz_idle_retain_tick();
--
2.17.1


>
> Thanks,
> -Aubrey
>
> >
> >
> >> - move clearing idle cpumask to scheduler_tick to decouple nohz mode.
> >>
> >> v2->v3:
> >> - change setting idle cpumask to every idle entry, otherwise schbench
> >> has a regression of 99th percentile latency.
> >> - change clearing idle cpumask to nohz_balancer_kick(), so updating
> >> idle cpumask is ratelimited in the idle exiting path.
> >> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
> >>
> >> v1->v2:
> >> - idle cpumask is updated in the nohz routines, by initializing idle
> >> cpumask with sched_domain_span(sd), nohz=off case remains the original
> >> behavior.
> >>

2020-11-25 01:54:25

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

Hi Vincent,

On 2020/11/23 17:27, Vincent Guittot wrote:
> Hi Aubrey,
>
> On Thu, 19 Nov 2020 at 13:15, Aubrey Li <[email protected]> wrote:
>>
>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, if the idle driver indicates to stop tick, this CPU
>> is set in the idle cpumask to be a wakeup target. And if the CPU
>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>> tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>> schbench have no notable change, uperf has:
>>
>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>
>> threads baseline-avg %std patch-avg %std
>> 96 1 0.83 1.23 3.27
>> 144 1 1.03 1.67 2.67
>> 192 1 0.69 1.81 3.59
>> 240 1 2.84 1.51 2.67
>>
>> v4->v5:
>> - add update_idle_cpumask for s2idle case
>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>> idle_cpumask() everywhere
>>
>> v3->v4:
>> - change setting idle cpumask from every idle entry to tickless idle
>> if cpu driver is available.
>
> Could you remind me why you did this change ? Clearing the cpumask is
> done during the tick to rate limit the number of updates of the
> cpumask but It's not clear for me why you have associated the set with
> the tick stop condition too.

I found the current implementation has better performance at a more
suitable load range.

The two kinds of implementions(v4 and v5) have the same rate(scheduler
tick) to shrink idle cpumask when the system is busy, but

- Setting the idle mask everytime the cpu enters idle requires a much
heavier load level to preserve the idle cpumask(not call into idle),
otherwise the bits cleared in scheduler tick will be restored when the
cpu enters idle. That is, idle cpumask is almost equal to the domain
cpumask during task wakeup if the system load is not heavy enough.

- Associating with tick stop tolerates idle to preserve the idle cpumask
but only short idle, which causes tick retains. This is more fitable for
the real workload.

>
> This change means that a cpu will not be part of the idle mask if the
> tick is not stopped. On some arm/arm64 platforms, the tick stops only
> if the idle duration is expected to be higher than 1-2ms which starts
> to be significantly long. Also, the cpuidle governor can easily
> mis-predict a short idle duration whereas it will be finally a long
> idle duration; In this case, the next tick will correct the situation
> and select a deeper state, but this can happen up to 4ms later on
> arm/arm64.

Yes this is intented. If the tick is not stopped, that indicates the
CPU is very busy, cpu idle governor selected the polling idle state, and/or
the expected idle duration is shorter than the tick period length. For
example, uperf enters and exits 80 times between two ticks when utilizes
100% CPU, and the average idle residency < 50us.

If this CPU is added to idle cpumask, the wakeup task likely needs to
wait in the runqueue as this CPU will run its current task very soon.

>
> So I would prefer to keep trying to set the idle mask everytime the
> cpu enters idle. If a tick has not happened between 2 idle phases, the
> cpumask will not be updated and the overhead will be mostly testing if
> (rq->last_idle_state == idle_state).

Not sure if I addressed your concern, did you see any workloads any cases
v4 performs better than v5?

Thanks,
-Aubrey

>
>
>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode.
>>
>> v2->v3:
>> - change setting idle cpumask to every idle entry, otherwise schbench
>> has a regression of 99th percentile latency.
>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>> idle cpumask is ratelimited in the idle exiting path.
>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
>>
>> v1->v2:
>> - idle cpumask is updated in the nohz routines, by initializing idle
>> cpumask with sched_domain_span(sd), nohz=off case remains the original
>> behavior.
>>

2020-11-25 02:13:02

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

On 2020/11/25 1:01, Vincent Guittot wrote:
> Hi Aubrey,
>
> Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit :
>> Hi Vincent,
>>
>> On 2020/11/23 17:27, Vincent Guittot wrote:
>>> Hi Aubrey,
>>>
>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li <[email protected]> wrote:
>>>>
>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>>>> enters idle, if the idle driver indicates to stop tick, this CPU
>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>>>> tick to ratelimit idle cpumask update.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>> has low cost than scanning all the cpus in last level cache domain,
>>>> especially when the system is heavily loaded.
>>>>
>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>>>> schbench have no notable change, uperf has:
>>>>
>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>>>
>>>> threads baseline-avg %std patch-avg %std
>>>> 96 1 0.83 1.23 3.27
>>>> 144 1 1.03 1.67 2.67
>>>> 192 1 0.69 1.81 3.59
>>>> 240 1 2.84 1.51 2.67
>>>>
>>>> v4->v5:
>>>> - add update_idle_cpumask for s2idle case
>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>> idle_cpumask() everywhere
>>>>
>>>> v3->v4:
>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>> if cpu driver is available.
>>>
>>> Could you remind me why you did this change ? Clearing the cpumask is
>>> done during the tick to rate limit the number of updates of the
>>> cpumask but It's not clear for me why you have associated the set with
>>> the tick stop condition too.
>>
>> I found the current implementation has better performance at a more
>> suitable load range.
>>
>> The two kinds of implementions(v4 and v5) have the same rate(scheduler
>> tick) to shrink idle cpumask when the system is busy, but
>
> I'm ok with the part above
>
>>
>> - Setting the idle mask everytime the cpu enters idle requires a much
>> heavier load level to preserve the idle cpumask(not call into idle),
>> otherwise the bits cleared in scheduler tick will be restored when the
>> cpu enters idle. That is, idle cpumask is almost equal to the domain
>> cpumask during task wakeup if the system load is not heavy enough.
>
> But setting the idle cpumask is useful because it helps to select an idle
> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
> the idle cpu mask is useful in heavy cases because a system, which is
> already fully busy with work, doesn't want to waste time looking for an
> idle cpu that doesn't exist.

Yes, this is what v3 does.

> But if there is an idle cpu, we should still looks for it.

IMHO, this is a potential opportunity can be improved. The idle cpu could be
in different idle state, the idle duration could be long or could be very short.
For example, if there are two idle cpus:

- CPU1 is very busy, the pattern is 50us idle and 950us work.
- CPU2 is in idle for a tick length and wake up to do the regular work

If both added to the idle cpumask, we want the latter one, or we can just add
the later one into the idle cpumask. That's why I want to associate tick stop
signal with it.

>
>>
>>
>> - Associating with tick stop tolerates idle to preserve the idle cpumask
>> but only short idle, which causes tick retains. This is more fitable for
>> the real workload.
>
> I don't agree with this and real use cases with interaction will probably
> not agree as well as they want to run on an idle cpu if any but not wait
> on an already busy one.

The problem is scan overhead, scanning idle cpu need time. If an idle cpu
is in the short idle mode, it's very likely that when it's picked up for a
wakeup task, it goes back to work again, and the wakeup task has to wait too,
maybe longer because the running task just starts.

One benefit of waiting on the previous one is warm cache.

> Also keep in mind that a tick can be up to 10ms long

Right, but the point here is, if this 10ms tick retains, the CPU should be
in the short idle mode.

>
>>
>>>
>>> This change means that a cpu will not be part of the idle mask if the
>>> tick is not stopped. On some arm/arm64 platforms, the tick stops only
>>> if the idle duration is expected to be higher than 1-2ms which starts
>>> to be significantly long. Also, the cpuidle governor can easily
>>> mis-predict a short idle duration whereas it will be finally a long
>>> idle duration; In this case, the next tick will correct the situation
>>> and select a deeper state, but this can happen up to 4ms later on
>>> arm/arm64.
>>
>> Yes this is intented. If the tick is not stopped, that indicates the
>> CPU is very busy, cpu idle governor selected the polling idle state, and/or
>> the expected idle duration is shorter than the tick period length. For
>
> As mentioned above a tick can be up to 10ms long which is not a short idle
> duration.

Usually when the tick retains, the CPU is in the short idle mode or even polling
instead of idle.

>
> Then the governor also mispredicts the idle duration and this is one
> reason that the tick is not stopped because it will give the opportunity
> to reevaluate the idle state in case of misprediction.
>
We always predict the next state based on the past states, so misprediction
does happen. This is not what this patch is trying to solve. I'm certainly
open if there is a better signal instead of stop_tick from idle governor.


>> example, uperf enters and exits 80 times between two ticks when utilizes
>> 100% CPU, and the average idle residency < 50us.
>
> But scheduler looks for idle state of prev cpu before looping the idle cpu
> mask so i'm not sure that uperf is impacted in this case because scheduler
> will select prev cpu before loop idle cpu mask.
>
>>
>> If this CPU is added to idle cpumask, the wakeup task likely needs to
>> wait in the runqueue as this CPU will run its current task very soon.
>>
>>>
>>> So I would prefer to keep trying to set the idle mask everytime the
>>> cpu enters idle. If a tick has not happened between 2 idle phases, the
>>> cpumask will not be updated and the overhead will be mostly testing if
>>> (rq->last_idle_state == idle_state).
>>
>> Not sure if I addressed your concern, did you see any workloads any cases
>> v4 performs better than v5?
>
> Yes, I see some perf regression on my octo arm64 system for hackbench with
> only 1 group (and for few ther ones but it's less obvious). There is no
> perf impact with more groups most probably because the cpus are no more idle.
>
> The regression happens even if the shallowest idle state is the only one to
> be enabled.

Thanks for the data.

>
> - 2 x 4 cores arm64 system
>
> 12 iterations of hackbench -l (256000/#grp) -g #grp
>
> Only the shallowest state enabled

> (as a sidenote, we don't have polling mode on arm64)
Okay, this might be the cause of the difference between yours and mine. So do you
think if it makes sense to let idle governor to return a polling flag and associate
it with idle cpumask update instead of stop_tick? A CPU is idle but actually polling
may not be suitable for the wake up target.

>
> grp tip/sched/core + your patchset v5 + the change below
> 1 1.296(+/- 2.06 %) 1.890(+/-35,10 %) -45% 1.311(+/- 2.63%) -1.19%
> 4 1.249(+/- 2.67 %) 1.255(+/- 3.10 %) - 0.46 % 1.235(+/- 4.77%) -0.28%
> 8 1.175(+/- 1.20 %) 1.180(+/- 2.33 %) - 0.47 % 1.179(+/- 1.60%) -0.38%
> 16 1.213(+/- 1.10 %) 1.218(+/- 0.68 %) - 0.41 % 1.219(+/- 2.15%) -0.52%
>
> All idle states enabled (3 states)
>
> grp tip/sched/core + your patchset v5 + the change below
> 1 1.429(+/-19.36 %) 1.678(+/-34.29 %) -17% 1.426(+/-17.18%) +0.21%
> 4 1.244(+/- 3.04 %) 1.261(+/- 3.21 %) - 1.39 % 1.260(+/- 2.55%) -1.29%
> 8 1.169(+/- 1.43 %) 1.183(+/- 2.09 %) - 1.23 % 1.198(+/- 2.49%) -2.51%
> 16 1.219(+/- 1.23 %) 1.218(+/- 0.68 %) - 0.59 % 1.225(+/- 1.43%) -0.49%
>
> The change below is a bit different from your v3 because I wanted to check if
> the root cause was the set of cpuilde mask only when tick is stopped.

Okay, I see your point.

Thanks,
-Aubrey

2020-11-25 08:33:41

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

On Wed, 25 Nov 2020 at 03:03, Li, Aubrey <[email protected]> wrote:
>
> On 2020/11/25 1:01, Vincent Guittot wrote:
> > Hi Aubrey,
> >
> > Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit :
> >> Hi Vincent,
> >>
> >> On 2020/11/23 17:27, Vincent Guittot wrote:
> >>> Hi Aubrey,
> >>>
> >>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li <[email protected]> wrote:
> >>>>
> >>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
> >>>> enters idle, if the idle driver indicates to stop tick, this CPU
> >>>> is set in the idle cpumask to be a wakeup target. And if the CPU
> >>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
> >>>> tick to ratelimit idle cpumask update.
> >>>>
> >>>> When a task wakes up to select an idle cpu, scanning idle cpumask
> >>>> has low cost than scanning all the cpus in last level cache domain,
> >>>> especially when the system is heavily loaded.
> >>>>
> >>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
> >>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
> >>>> schbench have no notable change, uperf has:
> >>>>
> >>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
> >>>>
> >>>> threads baseline-avg %std patch-avg %std
> >>>> 96 1 0.83 1.23 3.27
> >>>> 144 1 1.03 1.67 2.67
> >>>> 192 1 0.69 1.81 3.59
> >>>> 240 1 2.84 1.51 2.67
> >>>>
> >>>> v4->v5:
> >>>> - add update_idle_cpumask for s2idle case
> >>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> >>>> idle_cpumask() everywhere
> >>>>
> >>>> v3->v4:
> >>>> - change setting idle cpumask from every idle entry to tickless idle
> >>>> if cpu driver is available.
> >>>
> >>> Could you remind me why you did this change ? Clearing the cpumask is
> >>> done during the tick to rate limit the number of updates of the
> >>> cpumask but It's not clear for me why you have associated the set with
> >>> the tick stop condition too.
> >>
> >> I found the current implementation has better performance at a more
> >> suitable load range.
> >>
> >> The two kinds of implementions(v4 and v5) have the same rate(scheduler
> >> tick) to shrink idle cpumask when the system is busy, but
> >
> > I'm ok with the part above
> >
> >>
> >> - Setting the idle mask everytime the cpu enters idle requires a much
> >> heavier load level to preserve the idle cpumask(not call into idle),
> >> otherwise the bits cleared in scheduler tick will be restored when the
> >> cpu enters idle. That is, idle cpumask is almost equal to the domain
> >> cpumask during task wakeup if the system load is not heavy enough.
> >
> > But setting the idle cpumask is useful because it helps to select an idle
> > cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
> > the idle cpu mask is useful in heavy cases because a system, which is
> > already fully busy with work, doesn't want to waste time looking for an
> > idle cpu that doesn't exist.
>
> Yes, this is what v3 does.
>
> > But if there is an idle cpu, we should still looks for it.
>
> IMHO, this is a potential opportunity can be improved. The idle cpu could be
> in different idle state, the idle duration could be long or could be very short.
> For example, if there are two idle cpus:
>
> - CPU1 is very busy, the pattern is 50us idle and 950us work.
> - CPU2 is in idle for a tick length and wake up to do the regular work
>
> If both added to the idle cpumask, we want the latter one, or we can just add
> the later one into the idle cpumask. That's why I want to associate tick stop
> signal with it.
>
> >
> >>
> >>
> >> - Associating with tick stop tolerates idle to preserve the idle cpumask
> >> but only short idle, which causes tick retains. This is more fitable for
> >> the real workload.
> >
> > I don't agree with this and real use cases with interaction will probably
> > not agree as well as they want to run on an idle cpu if any but not wait
> > on an already busy one.
>
> The problem is scan overhead, scanning idle cpu need time. If an idle cpu
> is in the short idle mode, it's very likely that when it's picked up for a
> wakeup task, it goes back to work again, and the wakeup task has to wait too,
> maybe longer because the running task just starts.
>
> One benefit of waiting on the previous one is warm cache.
>
> > Also keep in mind that a tick can be up to 10ms long
>
> Right, but the point here is, if this 10ms tick retains, the CPU should be
> in the short idle mode.

But 10, 4 or even 1ms is quite long for a system and that's even more
true compared to scanning the idle cpu mask

>
> >
> >>
> >>>
> >>> This change means that a cpu will not be part of the idle mask if the
> >>> tick is not stopped. On some arm/arm64 platforms, the tick stops only
> >>> if the idle duration is expected to be higher than 1-2ms which starts
> >>> to be significantly long. Also, the cpuidle governor can easily
> >>> mis-predict a short idle duration whereas it will be finally a long
> >>> idle duration; In this case, the next tick will correct the situation
> >>> and select a deeper state, but this can happen up to 4ms later on
> >>> arm/arm64.
> >>
> >> Yes this is intented. If the tick is not stopped, that indicates the
> >> CPU is very busy, cpu idle governor selected the polling idle state, and/or
> >> the expected idle duration is shorter than the tick period length. For
> >
> > As mentioned above a tick can be up to 10ms long which is not a short idle
> > duration.
>
> Usually when the tick retains, the CPU is in the short idle mode or even polling
> instead of idle.

Also keep in mind that cpuidle can select a shallow state and retains
tick because of the wake up latency constraint and not the idle
duration. So you can't really make the assumption that retaining tick
means short idle duration

>
> >
> > Then the governor also mispredicts the idle duration and this is one
> > reason that the tick is not stopped because it will give the opportunity
> > to reevaluate the idle state in case of misprediction.
> >
> We always predict the next state based on the past states, so misprediction
> does happen. This is not what this patch is trying to solve. I'm certainly

My point here was to say that one original goal of cpuidle for
retaining the tick was to handle case where the governor mispredicts a
short idle time. Retaining the tick prevents the cpu to stay too long
in this shallow idle state and to waste power which seems to happen
often enough to be raised by people

> open if there is a better signal instead of stop_tick from idle governor.
>
>
> >> example, uperf enters and exits 80 times between two ticks when utilizes
> >> 100% CPU, and the average idle residency < 50us.
> >
> > But scheduler looks for idle state of prev cpu before looping the idle cpu
> > mask so i'm not sure that uperf is impacted in this case because scheduler
> > will select prev cpu before loop idle cpu mask.
> >
> >>
> >> If this CPU is added to idle cpumask, the wakeup task likely needs to
> >> wait in the runqueue as this CPU will run its current task very soon.
> >>
> >>>
> >>> So I would prefer to keep trying to set the idle mask everytime the
> >>> cpu enters idle. If a tick has not happened between 2 idle phases, the
> >>> cpumask will not be updated and the overhead will be mostly testing if
> >>> (rq->last_idle_state == idle_state).
> >>
> >> Not sure if I addressed your concern, did you see any workloads any cases
> >> v4 performs better than v5?
> >
> > Yes, I see some perf regression on my octo arm64 system for hackbench with
> > only 1 group (and for few ther ones but it's less obvious). There is no
> > perf impact with more groups most probably because the cpus are no more idle.
> >
> > The regression happens even if the shallowest idle state is the only one to
> > be enabled.
>
> Thanks for the data.
>
> >
> > - 2 x 4 cores arm64 system
> >
> > 12 iterations of hackbench -l (256000/#grp) -g #grp
> >
> > Only the shallowest state enabled
>
> > (as a sidenote, we don't have polling mode on arm64)
> Okay, this might be the cause of the difference between yours and mine. So do you
> think if it makes sense to let idle governor to return a polling flag and associate
> it with idle cpumask update instead of stop_tick? A CPU is idle but actually polling
> may not be suitable for the wake up target.

I don't know much about polling but can't this mode be used up to a tick too ?

>
> >
> > grp tip/sched/core + your patchset v5 + the change below
> > 1 1.296(+/- 2.06 %) 1.890(+/-35,10 %) -45% 1.311(+/- 2.63%) -1.19%
> > 4 1.249(+/- 2.67 %) 1.255(+/- 3.10 %) - 0.46 % 1.235(+/- 4.77%) -0.28%
> > 8 1.175(+/- 1.20 %) 1.180(+/- 2.33 %) - 0.47 % 1.179(+/- 1.60%) -0.38%
> > 16 1.213(+/- 1.10 %) 1.218(+/- 0.68 %) - 0.41 % 1.219(+/- 2.15%) -0.52%
> >
> > All idle states enabled (3 states)
> >
> > grp tip/sched/core + your patchset v5 + the change below
> > 1 1.429(+/-19.36 %) 1.678(+/-34.29 %) -17% 1.426(+/-17.18%) +0.21%
> > 4 1.244(+/- 3.04 %) 1.261(+/- 3.21 %) - 1.39 % 1.260(+/- 2.55%) -1.29%
> > 8 1.169(+/- 1.43 %) 1.183(+/- 2.09 %) - 1.23 % 1.198(+/- 2.49%) -2.51%
> > 16 1.219(+/- 1.23 %) 1.218(+/- 0.68 %) - 0.59 % 1.225(+/- 1.43%) -0.49%
> >
> > The change below is a bit different from your v3 because I wanted to check if
> > the root cause was the set of cpuilde mask only when tick is stopped.
>
> Okay, I see your point.
>
> Thanks,
> -Aubrey

2020-11-25 13:40:03

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

On 2020/11/25 16:31, Vincent Guittot wrote:
> On Wed, 25 Nov 2020 at 03:03, Li, Aubrey <[email protected]> wrote:
>>
>> On 2020/11/25 1:01, Vincent Guittot wrote:
>>> Hi Aubrey,
>>>
>>> Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit :
>>>> Hi Vincent,
>>>>
>>>> On 2020/11/23 17:27, Vincent Guittot wrote:
>>>>> Hi Aubrey,
>>>>>
>>>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li <[email protected]> wrote:
>>>>>>
>>>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>>>>>> enters idle, if the idle driver indicates to stop tick, this CPU
>>>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
>>>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>>>>>> tick to ratelimit idle cpumask update.
>>>>>>
>>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>>>> has low cost than scanning all the cpus in last level cache domain,
>>>>>> especially when the system is heavily loaded.
>>>>>>
>>>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>>>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>>>>>> schbench have no notable change, uperf has:
>>>>>>
>>>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>>>>>
>>>>>> threads baseline-avg %std patch-avg %std
>>>>>> 96 1 0.83 1.23 3.27
>>>>>> 144 1 1.03 1.67 2.67
>>>>>> 192 1 0.69 1.81 3.59
>>>>>> 240 1 2.84 1.51 2.67
>>>>>>
>>>>>> v4->v5:
>>>>>> - add update_idle_cpumask for s2idle case
>>>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>>>> idle_cpumask() everywhere
>>>>>>
>>>>>> v3->v4:
>>>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>>>> if cpu driver is available.
>>>>>
>>>>> Could you remind me why you did this change ? Clearing the cpumask is
>>>>> done during the tick to rate limit the number of updates of the
>>>>> cpumask but It's not clear for me why you have associated the set with
>>>>> the tick stop condition too.
>>>>
>>>> I found the current implementation has better performance at a more
>>>> suitable load range.
>>>>
>>>> The two kinds of implementions(v4 and v5) have the same rate(scheduler
>>>> tick) to shrink idle cpumask when the system is busy, but
>>>
>>> I'm ok with the part above
>>>
>>>>
>>>> - Setting the idle mask everytime the cpu enters idle requires a much
>>>> heavier load level to preserve the idle cpumask(not call into idle),
>>>> otherwise the bits cleared in scheduler tick will be restored when the
>>>> cpu enters idle. That is, idle cpumask is almost equal to the domain
>>>> cpumask during task wakeup if the system load is not heavy enough.
>>>
>>> But setting the idle cpumask is useful because it helps to select an idle
>>> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
>>> the idle cpu mask is useful in heavy cases because a system, which is
>>> already fully busy with work, doesn't want to waste time looking for an
>>> idle cpu that doesn't exist.
>>
>> Yes, this is what v3 does.
>>
>>> But if there is an idle cpu, we should still looks for it.
>>
>> IMHO, this is a potential opportunity can be improved. The idle cpu could be
>> in different idle state, the idle duration could be long or could be very short.
>> For example, if there are two idle cpus:
>>
>> - CPU1 is very busy, the pattern is 50us idle and 950us work.
>> - CPU2 is in idle for a tick length and wake up to do the regular work
>>
>> If both added to the idle cpumask, we want the latter one, or we can just add
>> the later one into the idle cpumask. That's why I want to associate tick stop
>> signal with it.
>>
>>>
>>>>
>>>>
>>>> - Associating with tick stop tolerates idle to preserve the idle cpumask
>>>> but only short idle, which causes tick retains. This is more fitable for
>>>> the real workload.
>>>
>>> I don't agree with this and real use cases with interaction will probably
>>> not agree as well as they want to run on an idle cpu if any but not wait
>>> on an already busy one.
>>
>> The problem is scan overhead, scanning idle cpu need time. If an idle cpu
>> is in the short idle mode, it's very likely that when it's picked up for a
>> wakeup task, it goes back to work again, and the wakeup task has to wait too,
>> maybe longer because the running task just starts.
>>
>> One benefit of waiting on the previous one is warm cache.
>>
>>> Also keep in mind that a tick can be up to 10ms long
>>
>> Right, but the point here is, if this 10ms tick retains, the CPU should be
>> in the short idle mode.
>
> But 10, 4 or even 1ms is quite long for a system and that's even more
> true compared to scanning the idle cpu mask
>
>>
>>>
>>>>
>>>>>
>>>>> This change means that a cpu will not be part of the idle mask if the
>>>>> tick is not stopped. On some arm/arm64 platforms, the tick stops only
>>>>> if the idle duration is expected to be higher than 1-2ms which starts
>>>>> to be significantly long. Also, the cpuidle governor can easily
>>>>> mis-predict a short idle duration whereas it will be finally a long
>>>>> idle duration; In this case, the next tick will correct the situation
>>>>> and select a deeper state, but this can happen up to 4ms later on
>>>>> arm/arm64.
>>>>
>>>> Yes this is intented. If the tick is not stopped, that indicates the
>>>> CPU is very busy, cpu idle governor selected the polling idle state, and/or
>>>> the expected idle duration is shorter than the tick period length. For
>>>
>>> As mentioned above a tick can be up to 10ms long which is not a short idle
>>> duration.
>>
>> Usually when the tick retains, the CPU is in the short idle mode or even polling
>> instead of idle.
>
> Also keep in mind that cpuidle can select a shallow state and retains
> tick because of the wake up latency constraint and not the idle
> duration. So you can't really make the assumption that retaining tick
> means short idle duration
>
idle governor has short idle information, probably can let idle governor
expose a short idle indicator?

>>
>>>
>>> Then the governor also mispredicts the idle duration and this is one
>>> reason that the tick is not stopped because it will give the opportunity
>>> to reevaluate the idle state in case of misprediction.
>>>
>> We always predict the next state based on the past states, so misprediction
>> does happen. This is not what this patch is trying to solve. I'm certainly
>
> My point here was to say that one original goal of cpuidle for
> retaining the tick was to handle case where the governor mispredicts a
> short idle time. Retaining the tick prevents the cpu to stay too long
> in this shallow idle state and to waste power which seems to happen
> often enough to be raised by people

I see, thanks!

>
>> open if there is a better signal instead of stop_tick from idle governor.
>>
>>
>>>> example, uperf enters and exits 80 times between two ticks when utilizes
>>>> 100% CPU, and the average idle residency < 50us.
>>>
>>> But scheduler looks for idle state of prev cpu before looping the idle cpu
>>> mask so i'm not sure that uperf is impacted in this case because scheduler
>>> will select prev cpu before loop idle cpu mask.
>>>
>>>>
>>>> If this CPU is added to idle cpumask, the wakeup task likely needs to
>>>> wait in the runqueue as this CPU will run its current task very soon.
>>>>
>>>>>
>>>>> So I would prefer to keep trying to set the idle mask everytime the
>>>>> cpu enters idle. If a tick has not happened between 2 idle phases, the
>>>>> cpumask will not be updated and the overhead will be mostly testing if
>>>>> (rq->last_idle_state == idle_state).
>>>>
>>>> Not sure if I addressed your concern, did you see any workloads any cases
>>>> v4 performs better than v5?
>>>
>>> Yes, I see some perf regression on my octo arm64 system for hackbench with
>>> only 1 group (and for few ther ones but it's less obvious). There is no
>>> perf impact with more groups most probably because the cpus are no more idle.
>>>
>>> The regression happens even if the shallowest idle state is the only one to
>>> be enabled.
>>
>> Thanks for the data.
>>
>>>
>>> - 2 x 4 cores arm64 system
>>>
>>> 12 iterations of hackbench -l (256000/#grp) -g #grp
>>>
>>> Only the shallowest state enabled
>>
>>> (as a sidenote, we don't have polling mode on arm64)
>> Okay, this might be the cause of the difference between yours and mine. So do you
>> think if it makes sense to let idle governor to return a polling flag and associate
>> it with idle cpumask update instead of stop_tick? A CPU is idle but actually polling
>> may not be suitable for the wake up target.
>
> I don't know much about polling but can't this mode be used up to a tick too ?
>I think so. So short idle need a definition. I'm not sure if it's a good idea to define
the short idle as a tunable and default set it to tick >> 2?

Updating idle cpumask everytime cpu enters idle works for me, as we have state change
check, so we won't actually update idle cpumask everytime the cpu enters idle.

But I'm still willing to exclude short idle case, what do you think?

Thanks,
-Aubrey


2020-11-26 11:30:46

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

On 2020/11/26 16:14, Vincent Guittot wrote:
> On Wed, 25 Nov 2020 at 14:37, Li, Aubrey <[email protected]> wrote:
>>
>> On 2020/11/25 16:31, Vincent Guittot wrote:
>>> On Wed, 25 Nov 2020 at 03:03, Li, Aubrey <[email protected]> wrote:
>>>>
>>>> On 2020/11/25 1:01, Vincent Guittot wrote:
>>>>> Hi Aubrey,
>>>>>
>>>>> Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit :
>>>>>> Hi Vincent,
>>>>>>
>>>>>> On 2020/11/23 17:27, Vincent Guittot wrote:
>>>>>>> Hi Aubrey,
>>>>>>>
>>>>>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>>>>>>>> enters idle, if the idle driver indicates to stop tick, this CPU
>>>>>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
>>>>>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>>>>>>>> tick to ratelimit idle cpumask update.
>>>>>>>>
>>>>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>>>>>> has low cost than scanning all the cpus in last level cache domain,
>>>>>>>> especially when the system is heavily loaded.
>>>>>>>>
>>>>>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>>>>>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>>>>>>>> schbench have no notable change, uperf has:
>>>>>>>>
>>>>>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>>>>>>>
>>>>>>>> threads baseline-avg %std patch-avg %std
>>>>>>>> 96 1 0.83 1.23 3.27
>>>>>>>> 144 1 1.03 1.67 2.67
>>>>>>>> 192 1 0.69 1.81 3.59
>>>>>>>> 240 1 2.84 1.51 2.67
>>>>>>>>
>>>>>>>> v4->v5:
>>>>>>>> - add update_idle_cpumask for s2idle case
>>>>>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>>>>>> idle_cpumask() everywhere
>>>>>>>>
>>>>>>>> v3->v4:
>>>>>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>>>>>> if cpu driver is available.
>>>>>>>
>>>>>>> Could you remind me why you did this change ? Clearing the cpumask is
>>>>>>> done during the tick to rate limit the number of updates of the
>>>>>>> cpumask but It's not clear for me why you have associated the set with
>>>>>>> the tick stop condition too.
>>>>>>
>>>>>> I found the current implementation has better performance at a more
>>>>>> suitable load range.
>>>>>>
>>>>>> The two kinds of implementions(v4 and v5) have the same rate(scheduler
>>>>>> tick) to shrink idle cpumask when the system is busy, but
>>>>>
>>>>> I'm ok with the part above
>>>>>
>>>>>>
>>>>>> - Setting the idle mask everytime the cpu enters idle requires a much
>>>>>> heavier load level to preserve the idle cpumask(not call into idle),
>>>>>> otherwise the bits cleared in scheduler tick will be restored when the
>>>>>> cpu enters idle. That is, idle cpumask is almost equal to the domain
>>>>>> cpumask during task wakeup if the system load is not heavy enough.
>>>>>
>>>>> But setting the idle cpumask is useful because it helps to select an idle
>>>>> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
>>>>> the idle cpu mask is useful in heavy cases because a system, which is
>>>>> already fully busy with work, doesn't want to waste time looking for an
>>>>> idle cpu that doesn't exist.
>>>>
>>>> Yes, this is what v3 does.
>>>>
>>>>> But if there is an idle cpu, we should still looks for it.
>>>>
>>>> IMHO, this is a potential opportunity can be improved. The idle cpu could be
>>>> in different idle state, the idle duration could be long or could be very short.
>>>> For example, if there are two idle cpus:
>>>>
>>>> - CPU1 is very busy, the pattern is 50us idle and 950us work.
>>>> - CPU2 is in idle for a tick length and wake up to do the regular work
>>>>
>>>> If both added to the idle cpumask, we want the latter one, or we can just add
>>>> the later one into the idle cpumask. That's why I want to associate tick stop
>>>> signal with it.
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> - Associating with tick stop tolerates idle to preserve the idle cpumask
>>>>>> but only short idle, which causes tick retains. This is more fitable for
>>>>>> the real workload.
>>>>>
>>>>> I don't agree with this and real use cases with interaction will probably
>>>>> not agree as well as they want to run on an idle cpu if any but not wait
>>>>> on an already busy one.
>>>>
>>>> The problem is scan overhead, scanning idle cpu need time. If an idle cpu
>>>> is in the short idle mode, it's very likely that when it's picked up for a
>>>> wakeup task, it goes back to work again, and the wakeup task has to wait too,
>>>> maybe longer because the running task just starts.
>>>>
>>>> One benefit of waiting on the previous one is warm cache.
>>>>
>>>>> Also keep in mind that a tick can be up to 10ms long
>>>>
>>>> Right, but the point here is, if this 10ms tick retains, the CPU should be
>>>> in the short idle mode.
>>>
>>> But 10, 4 or even 1ms is quite long for a system and that's even more
>>> true compared to scanning the idle cpu mask
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> This change means that a cpu will not be part of the idle mask if the
>>>>>>> tick is not stopped. On some arm/arm64 platforms, the tick stops only
>>>>>>> if the idle duration is expected to be higher than 1-2ms which starts
>>>>>>> to be significantly long. Also, the cpuidle governor can easily
>>>>>>> mis-predict a short idle duration whereas it will be finally a long
>>>>>>> idle duration; In this case, the next tick will correct the situation
>>>>>>> and select a deeper state, but this can happen up to 4ms later on
>>>>>>> arm/arm64.
>>>>>>
>>>>>> Yes this is intented. If the tick is not stopped, that indicates the
>>>>>> CPU is very busy, cpu idle governor selected the polling idle state, and/or
>>>>>> the expected idle duration is shorter than the tick period length. For
>>>>>
>>>>> As mentioned above a tick can be up to 10ms long which is not a short idle
>>>>> duration.
>>>>
>>>> Usually when the tick retains, the CPU is in the short idle mode or even polling
>>>> instead of idle.
>>>
>>> Also keep in mind that cpuidle can select a shallow state and retains
>>> tick because of the wake up latency constraint and not the idle
>>> duration. So you can't really make the assumption that retaining tick
>>> means short idle duration
>>>
>> idle governor has short idle information, probably can let idle governor
>> expose a short idle indicator?
>>
>>>>
>>>>>
>>>>> Then the governor also mispredicts the idle duration and this is one
>>>>> reason that the tick is not stopped because it will give the opportunity
>>>>> to reevaluate the idle state in case of misprediction.
>>>>>
>>>> We always predict the next state based on the past states, so misprediction
>>>> does happen. This is not what this patch is trying to solve. I'm certainly
>>>
>>> My point here was to say that one original goal of cpuidle for
>>> retaining the tick was to handle case where the governor mispredicts a
>>> short idle time. Retaining the tick prevents the cpu to stay too long
>>> in this shallow idle state and to waste power which seems to happen
>>> often enough to be raised by people
>>
>> I see, thanks!
>>
>>>
>>>> open if there is a better signal instead of stop_tick from idle governor.
>>>>
>>>>
>>>>>> example, uperf enters and exits 80 times between two ticks when utilizes
>>>>>> 100% CPU, and the average idle residency < 50us.
>>>>>
>>>>> But scheduler looks for idle state of prev cpu before looping the idle cpu
>>>>> mask so i'm not sure that uperf is impacted in this case because scheduler
>>>>> will select prev cpu before loop idle cpu mask.
>>>>>
>>>>>>
>>>>>> If this CPU is added to idle cpumask, the wakeup task likely needs to
>>>>>> wait in the runqueue as this CPU will run its current task very soon.
>>>>>>
>>>>>>>
>>>>>>> So I would prefer to keep trying to set the idle mask everytime the
>>>>>>> cpu enters idle. If a tick has not happened between 2 idle phases, the
>>>>>>> cpumask will not be updated and the overhead will be mostly testing if
>>>>>>> (rq->last_idle_state == idle_state).
>>>>>>
>>>>>> Not sure if I addressed your concern, did you see any workloads any cases
>>>>>> v4 performs better than v5?
>>>>>
>>>>> Yes, I see some perf regression on my octo arm64 system for hackbench with
>>>>> only 1 group (and for few ther ones but it's less obvious). There is no
>>>>> perf impact with more groups most probably because the cpus are no more idle.
>>>>>
>>>>> The regression happens even if the shallowest idle state is the only one to
>>>>> be enabled.
>>>>
>>>> Thanks for the data.
>>>>
>>>>>
>>>>> - 2 x 4 cores arm64 system
>>>>>
>>>>> 12 iterations of hackbench -l (256000/#grp) -g #grp
>>>>>
>>>>> Only the shallowest state enabled
>>>>
>>>>> (as a sidenote, we don't have polling mode on arm64)
>>>> Okay, this might be the cause of the difference between yours and mine. So do you
>>>> think if it makes sense to let idle governor to return a polling flag and associate
>>>> it with idle cpumask update instead of stop_tick? A CPU is idle but actually polling
>>>> may not be suitable for the wake up target.
>>>
>>> I don't know much about polling but can't this mode be used up to a tick too ?
>>> I think so. So short idle need a definition. I'm not sure if it's a good idea to define
>> the short idle as a tunable and default set it to tick >> 2?
>
> I have never been fond of heuristic like tick << 2 or sys tunable
>
> TBH, I'm not sure that using the tick is a good idea. And such kind of
> parameter need more thought
>
>>
>> Updating idle cpumask everytime cpu enters idle works for me, as we have state change
>> check, so we won't actually update idle cpumask everytime the cpu enters idle.
>
> Yes, In this case, the overhead stays reasonable and is limited to the
> test of a rq->last_idle_state
> This will benefit heavy use a case by reducing the scanning time and
> will not regress other use case.
>
>>
>> But I'm still willing to exclude short idle case, what do you think?
>
> something similar to patch v3 or patch v5 + my changes seems to be a
> good 1st step that will benefit heavy use cases with regressing other
> ones.
>
> Trying to exclude short idle case will need more thoughts and changes
> especially about to how to get this information and if it is reliable.
>

okay, I'll post a v6 with v5 + your change below after data measurement.
May I add you a signed-off-by to the patch?

Thanks,
-Aubrey

---
kernel/sched/idle.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a38d8822ce0d..ca32197778b0 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -156,6 +156,7 @@ static void cpuidle_idle_call(void)
return;
}

+ update_idle_cpumask(this_rq(), true);
/*
* The RCU framework needs to be told that we are entering an idle
* section, so no more rcu read side critical sections and one more
@@ -163,7 +164,6 @@ static void cpuidle_idle_call(void)
*/

if (cpuidle_not_available(drv, dev)) {
- update_idle_cpumask(this_rq(), true);
tick_nohz_idle_stop_tick();

default_idle_call();
@@ -194,7 +194,6 @@ static void cpuidle_idle_call(void)
max_latency_ns = dev->forced_idle_latency_limit_ns;
}

- update_idle_cpumask(this_rq(), true);
tick_nohz_idle_stop_tick();

next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
@@ -208,7 +207,6 @@ static void cpuidle_idle_call(void)
next_state = cpuidle_select(drv, dev, &stop_tick);

if (stop_tick || tick_nohz_tick_stopped()) {
- update_idle_cpumask(this_rq(), true);
tick_nohz_idle_stop_tick();
} else {
tick_nohz_idle_retain_tick();
--
2.17.1

2020-11-26 11:39:39

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

On Thu, 26 Nov 2020 at 10:35, Li, Aubrey <[email protected]> wrote:
>
> On 2020/11/26 16:14, Vincent Guittot wrote:
> > On Wed, 25 Nov 2020 at 14:37, Li, Aubrey <[email protected]> wrote:
> >>
> >> On 2020/11/25 16:31, Vincent Guittot wrote:
> >>> On Wed, 25 Nov 2020 at 03:03, Li, Aubrey <[email protected]> wrote:
> >>>>
> >>>> On 2020/11/25 1:01, Vincent Guittot wrote:
> >>>>> Hi Aubrey,
> >>>>>
> >>>>> Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit :
> >>>>>> Hi Vincent,
> >>>>>>
> >>>>>> On 2020/11/23 17:27, Vincent Guittot wrote:
> >>>>>>> Hi Aubrey,
> >>>>>>>
> >>>>>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
> >>>>>>>> enters idle, if the idle driver indicates to stop tick, this CPU
> >>>>>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
> >>>>>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
> >>>>>>>> tick to ratelimit idle cpumask update.
> >>>>>>>>
> >>>>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
> >>>>>>>> has low cost than scanning all the cpus in last level cache domain,
> >>>>>>>> especially when the system is heavily loaded.
> >>>>>>>>
> >>>>>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
> >>>>>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
> >>>>>>>> schbench have no notable change, uperf has:
> >>>>>>>>
> >>>>>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
> >>>>>>>>
> >>>>>>>> threads baseline-avg %std patch-avg %std
> >>>>>>>> 96 1 0.83 1.23 3.27
> >>>>>>>> 144 1 1.03 1.67 2.67
> >>>>>>>> 192 1 0.69 1.81 3.59
> >>>>>>>> 240 1 2.84 1.51 2.67
> >>>>>>>>
> >>>>>>>> v4->v5:
> >>>>>>>> - add update_idle_cpumask for s2idle case
> >>>>>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> >>>>>>>> idle_cpumask() everywhere
> >>>>>>>>
> >>>>>>>> v3->v4:
> >>>>>>>> - change setting idle cpumask from every idle entry to tickless idle
> >>>>>>>> if cpu driver is available.
> >>>>>>>
> >>>>>>> Could you remind me why you did this change ? Clearing the cpumask is
> >>>>>>> done during the tick to rate limit the number of updates of the
> >>>>>>> cpumask but It's not clear for me why you have associated the set with
> >>>>>>> the tick stop condition too.
> >>>>>>
> >>>>>> I found the current implementation has better performance at a more
> >>>>>> suitable load range.
> >>>>>>
> >>>>>> The two kinds of implementions(v4 and v5) have the same rate(scheduler
> >>>>>> tick) to shrink idle cpumask when the system is busy, but
> >>>>>
> >>>>> I'm ok with the part above
> >>>>>
> >>>>>>
> >>>>>> - Setting the idle mask everytime the cpu enters idle requires a much
> >>>>>> heavier load level to preserve the idle cpumask(not call into idle),
> >>>>>> otherwise the bits cleared in scheduler tick will be restored when the
> >>>>>> cpu enters idle. That is, idle cpumask is almost equal to the domain
> >>>>>> cpumask during task wakeup if the system load is not heavy enough.
> >>>>>
> >>>>> But setting the idle cpumask is useful because it helps to select an idle
> >>>>> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
> >>>>> the idle cpu mask is useful in heavy cases because a system, which is
> >>>>> already fully busy with work, doesn't want to waste time looking for an
> >>>>> idle cpu that doesn't exist.
> >>>>
> >>>> Yes, this is what v3 does.
> >>>>
> >>>>> But if there is an idle cpu, we should still looks for it.
> >>>>
> >>>> IMHO, this is a potential opportunity can be improved. The idle cpu could be
> >>>> in different idle state, the idle duration could be long or could be very short.
> >>>> For example, if there are two idle cpus:
> >>>>
> >>>> - CPU1 is very busy, the pattern is 50us idle and 950us work.
> >>>> - CPU2 is in idle for a tick length and wake up to do the regular work
> >>>>
> >>>> If both added to the idle cpumask, we want the latter one, or we can just add
> >>>> the later one into the idle cpumask. That's why I want to associate tick stop
> >>>> signal with it.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> - Associating with tick stop tolerates idle to preserve the idle cpumask
> >>>>>> but only short idle, which causes tick retains. This is more fitable for
> >>>>>> the real workload.
> >>>>>
> >>>>> I don't agree with this and real use cases with interaction will probably
> >>>>> not agree as well as they want to run on an idle cpu if any but not wait
> >>>>> on an already busy one.
> >>>>
> >>>> The problem is scan overhead, scanning idle cpu need time. If an idle cpu
> >>>> is in the short idle mode, it's very likely that when it's picked up for a
> >>>> wakeup task, it goes back to work again, and the wakeup task has to wait too,
> >>>> maybe longer because the running task just starts.
> >>>>
> >>>> One benefit of waiting on the previous one is warm cache.
> >>>>
> >>>>> Also keep in mind that a tick can be up to 10ms long
> >>>>
> >>>> Right, but the point here is, if this 10ms tick retains, the CPU should be
> >>>> in the short idle mode.
> >>>
> >>> But 10, 4 or even 1ms is quite long for a system and that's even more
> >>> true compared to scanning the idle cpu mask
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> This change means that a cpu will not be part of the idle mask if the
> >>>>>>> tick is not stopped. On some arm/arm64 platforms, the tick stops only
> >>>>>>> if the idle duration is expected to be higher than 1-2ms which starts
> >>>>>>> to be significantly long. Also, the cpuidle governor can easily
> >>>>>>> mis-predict a short idle duration whereas it will be finally a long
> >>>>>>> idle duration; In this case, the next tick will correct the situation
> >>>>>>> and select a deeper state, but this can happen up to 4ms later on
> >>>>>>> arm/arm64.
> >>>>>>
> >>>>>> Yes this is intented. If the tick is not stopped, that indicates the
> >>>>>> CPU is very busy, cpu idle governor selected the polling idle state, and/or
> >>>>>> the expected idle duration is shorter than the tick period length. For
> >>>>>
> >>>>> As mentioned above a tick can be up to 10ms long which is not a short idle
> >>>>> duration.
> >>>>
> >>>> Usually when the tick retains, the CPU is in the short idle mode or even polling
> >>>> instead of idle.
> >>>
> >>> Also keep in mind that cpuidle can select a shallow state and retains
> >>> tick because of the wake up latency constraint and not the idle
> >>> duration. So you can't really make the assumption that retaining tick
> >>> means short idle duration
> >>>
> >> idle governor has short idle information, probably can let idle governor
> >> expose a short idle indicator?
> >>
> >>>>
> >>>>>
> >>>>> Then the governor also mispredicts the idle duration and this is one
> >>>>> reason that the tick is not stopped because it will give the opportunity
> >>>>> to reevaluate the idle state in case of misprediction.
> >>>>>
> >>>> We always predict the next state based on the past states, so misprediction
> >>>> does happen. This is not what this patch is trying to solve. I'm certainly
> >>>
> >>> My point here was to say that one original goal of cpuidle for
> >>> retaining the tick was to handle case where the governor mispredicts a
> >>> short idle time. Retaining the tick prevents the cpu to stay too long
> >>> in this shallow idle state and to waste power which seems to happen
> >>> often enough to be raised by people
> >>
> >> I see, thanks!
> >>
> >>>
> >>>> open if there is a better signal instead of stop_tick from idle governor.
> >>>>
> >>>>
> >>>>>> example, uperf enters and exits 80 times between two ticks when utilizes
> >>>>>> 100% CPU, and the average idle residency < 50us.
> >>>>>
> >>>>> But scheduler looks for idle state of prev cpu before looping the idle cpu
> >>>>> mask so i'm not sure that uperf is impacted in this case because scheduler
> >>>>> will select prev cpu before loop idle cpu mask.
> >>>>>
> >>>>>>
> >>>>>> If this CPU is added to idle cpumask, the wakeup task likely needs to
> >>>>>> wait in the runqueue as this CPU will run its current task very soon.
> >>>>>>
> >>>>>>>
> >>>>>>> So I would prefer to keep trying to set the idle mask everytime the
> >>>>>>> cpu enters idle. If a tick has not happened between 2 idle phases, the
> >>>>>>> cpumask will not be updated and the overhead will be mostly testing if
> >>>>>>> (rq->last_idle_state == idle_state).
> >>>>>>
> >>>>>> Not sure if I addressed your concern, did you see any workloads any cases
> >>>>>> v4 performs better than v5?
> >>>>>
> >>>>> Yes, I see some perf regression on my octo arm64 system for hackbench with
> >>>>> only 1 group (and for few ther ones but it's less obvious). There is no
> >>>>> perf impact with more groups most probably because the cpus are no more idle.
> >>>>>
> >>>>> The regression happens even if the shallowest idle state is the only one to
> >>>>> be enabled.
> >>>>
> >>>> Thanks for the data.
> >>>>
> >>>>>
> >>>>> - 2 x 4 cores arm64 system
> >>>>>
> >>>>> 12 iterations of hackbench -l (256000/#grp) -g #grp
> >>>>>
> >>>>> Only the shallowest state enabled
> >>>>
> >>>>> (as a sidenote, we don't have polling mode on arm64)
> >>>> Okay, this might be the cause of the difference between yours and mine. So do you
> >>>> think if it makes sense to let idle governor to return a polling flag and associate
> >>>> it with idle cpumask update instead of stop_tick? A CPU is idle but actually polling
> >>>> may not be suitable for the wake up target.
> >>>
> >>> I don't know much about polling but can't this mode be used up to a tick too ?
> >>> I think so. So short idle need a definition. I'm not sure if it's a good idea to define
> >> the short idle as a tunable and default set it to tick >> 2?
> >
> > I have never been fond of heuristic like tick << 2 or sys tunable
> >
> > TBH, I'm not sure that using the tick is a good idea. And such kind of
> > parameter need more thought
> >
> >>
> >> Updating idle cpumask everytime cpu enters idle works for me, as we have state change
> >> check, so we won't actually update idle cpumask everytime the cpu enters idle.
> >
> > Yes, In this case, the overhead stays reasonable and is limited to the
> > test of a rq->last_idle_state
> > This will benefit heavy use a case by reducing the scanning time and
> > will not regress other use case.
> >
> >>
> >> But I'm still willing to exclude short idle case, what do you think?
> >
> > something similar to patch v3 or patch v5 + my changes seems to be a
> > good 1st step that will benefit heavy use cases with regressing other
> > ones.
> >
> > Trying to exclude short idle case will need more thoughts and changes
> > especially about to how to get this information and if it is reliable.
> >
>
> okay, I'll post a v6 with v5 + your change below after data measurement.
> May I add you a signed-off-by to the patch?

my signed-off is not needed. My changes should be considered as
comments of your patch but sometimes a code example is easier than a
long comment

>
> Thanks,
> -Aubrey
>
> ---
> kernel/sched/idle.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index a38d8822ce0d..ca32197778b0 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -156,6 +156,7 @@ static void cpuidle_idle_call(void)
> return;
> }
>
> + update_idle_cpumask(this_rq(), true);
> /*
> * The RCU framework needs to be told that we are entering an idle
> * section, so no more rcu read side critical sections and one more
> @@ -163,7 +164,6 @@ static void cpuidle_idle_call(void)
> */
>
> if (cpuidle_not_available(drv, dev)) {
> - update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
>
> default_idle_call();
> @@ -194,7 +194,6 @@ static void cpuidle_idle_call(void)
> max_latency_ns = dev->forced_idle_latency_limit_ns;
> }
>
> - update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
>
> next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> @@ -208,7 +207,6 @@ static void cpuidle_idle_call(void)
> next_state = cpuidle_select(drv, dev, &stop_tick);
>
> if (stop_tick || tick_nohz_tick_stopped()) {
> - update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
> } else {
> tick_nohz_idle_retain_tick();
> --
> 2.17.1
>

2020-11-27 03:14:54

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

On Wed, 25 Nov 2020 at 14:37, Li, Aubrey <[email protected]> wrote:
>
> On 2020/11/25 16:31, Vincent Guittot wrote:
> > On Wed, 25 Nov 2020 at 03:03, Li, Aubrey <[email protected]> wrote:
> >>
> >> On 2020/11/25 1:01, Vincent Guittot wrote:
> >>> Hi Aubrey,
> >>>
> >>> Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit :
> >>>> Hi Vincent,
> >>>>
> >>>> On 2020/11/23 17:27, Vincent Guittot wrote:
> >>>>> Hi Aubrey,
> >>>>>
> >>>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li <[email protected]> wrote:
> >>>>>>
> >>>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
> >>>>>> enters idle, if the idle driver indicates to stop tick, this CPU
> >>>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
> >>>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
> >>>>>> tick to ratelimit idle cpumask update.
> >>>>>>
> >>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
> >>>>>> has low cost than scanning all the cpus in last level cache domain,
> >>>>>> especially when the system is heavily loaded.
> >>>>>>
> >>>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
> >>>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
> >>>>>> schbench have no notable change, uperf has:
> >>>>>>
> >>>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
> >>>>>>
> >>>>>> threads baseline-avg %std patch-avg %std
> >>>>>> 96 1 0.83 1.23 3.27
> >>>>>> 144 1 1.03 1.67 2.67
> >>>>>> 192 1 0.69 1.81 3.59
> >>>>>> 240 1 2.84 1.51 2.67
> >>>>>>
> >>>>>> v4->v5:
> >>>>>> - add update_idle_cpumask for s2idle case
> >>>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> >>>>>> idle_cpumask() everywhere
> >>>>>>
> >>>>>> v3->v4:
> >>>>>> - change setting idle cpumask from every idle entry to tickless idle
> >>>>>> if cpu driver is available.
> >>>>>
> >>>>> Could you remind me why you did this change ? Clearing the cpumask is
> >>>>> done during the tick to rate limit the number of updates of the
> >>>>> cpumask but It's not clear for me why you have associated the set with
> >>>>> the tick stop condition too.
> >>>>
> >>>> I found the current implementation has better performance at a more
> >>>> suitable load range.
> >>>>
> >>>> The two kinds of implementions(v4 and v5) have the same rate(scheduler
> >>>> tick) to shrink idle cpumask when the system is busy, but
> >>>
> >>> I'm ok with the part above
> >>>
> >>>>
> >>>> - Setting the idle mask everytime the cpu enters idle requires a much
> >>>> heavier load level to preserve the idle cpumask(not call into idle),
> >>>> otherwise the bits cleared in scheduler tick will be restored when the
> >>>> cpu enters idle. That is, idle cpumask is almost equal to the domain
> >>>> cpumask during task wakeup if the system load is not heavy enough.
> >>>
> >>> But setting the idle cpumask is useful because it helps to select an idle
> >>> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
> >>> the idle cpu mask is useful in heavy cases because a system, which is
> >>> already fully busy with work, doesn't want to waste time looking for an
> >>> idle cpu that doesn't exist.
> >>
> >> Yes, this is what v3 does.
> >>
> >>> But if there is an idle cpu, we should still looks for it.
> >>
> >> IMHO, this is a potential opportunity can be improved. The idle cpu could be
> >> in different idle state, the idle duration could be long or could be very short.
> >> For example, if there are two idle cpus:
> >>
> >> - CPU1 is very busy, the pattern is 50us idle and 950us work.
> >> - CPU2 is in idle for a tick length and wake up to do the regular work
> >>
> >> If both added to the idle cpumask, we want the latter one, or we can just add
> >> the later one into the idle cpumask. That's why I want to associate tick stop
> >> signal with it.
> >>
> >>>
> >>>>
> >>>>
> >>>> - Associating with tick stop tolerates idle to preserve the idle cpumask
> >>>> but only short idle, which causes tick retains. This is more fitable for
> >>>> the real workload.
> >>>
> >>> I don't agree with this and real use cases with interaction will probably
> >>> not agree as well as they want to run on an idle cpu if any but not wait
> >>> on an already busy one.
> >>
> >> The problem is scan overhead, scanning idle cpu need time. If an idle cpu
> >> is in the short idle mode, it's very likely that when it's picked up for a
> >> wakeup task, it goes back to work again, and the wakeup task has to wait too,
> >> maybe longer because the running task just starts.
> >>
> >> One benefit of waiting on the previous one is warm cache.
> >>
> >>> Also keep in mind that a tick can be up to 10ms long
> >>
> >> Right, but the point here is, if this 10ms tick retains, the CPU should be
> >> in the short idle mode.
> >
> > But 10, 4 or even 1ms is quite long for a system and that's even more
> > true compared to scanning the idle cpu mask
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> This change means that a cpu will not be part of the idle mask if the
> >>>>> tick is not stopped. On some arm/arm64 platforms, the tick stops only
> >>>>> if the idle duration is expected to be higher than 1-2ms which starts
> >>>>> to be significantly long. Also, the cpuidle governor can easily
> >>>>> mis-predict a short idle duration whereas it will be finally a long
> >>>>> idle duration; In this case, the next tick will correct the situation
> >>>>> and select a deeper state, but this can happen up to 4ms later on
> >>>>> arm/arm64.
> >>>>
> >>>> Yes this is intented. If the tick is not stopped, that indicates the
> >>>> CPU is very busy, cpu idle governor selected the polling idle state, and/or
> >>>> the expected idle duration is shorter than the tick period length. For
> >>>
> >>> As mentioned above a tick can be up to 10ms long which is not a short idle
> >>> duration.
> >>
> >> Usually when the tick retains, the CPU is in the short idle mode or even polling
> >> instead of idle.
> >
> > Also keep in mind that cpuidle can select a shallow state and retains
> > tick because of the wake up latency constraint and not the idle
> > duration. So you can't really make the assumption that retaining tick
> > means short idle duration
> >
> idle governor has short idle information, probably can let idle governor
> expose a short idle indicator?
>
> >>
> >>>
> >>> Then the governor also mispredicts the idle duration and this is one
> >>> reason that the tick is not stopped because it will give the opportunity
> >>> to reevaluate the idle state in case of misprediction.
> >>>
> >> We always predict the next state based on the past states, so misprediction
> >> does happen. This is not what this patch is trying to solve. I'm certainly
> >
> > My point here was to say that one original goal of cpuidle for
> > retaining the tick was to handle case where the governor mispredicts a
> > short idle time. Retaining the tick prevents the cpu to stay too long
> > in this shallow idle state and to waste power which seems to happen
> > often enough to be raised by people
>
> I see, thanks!
>
> >
> >> open if there is a better signal instead of stop_tick from idle governor.
> >>
> >>
> >>>> example, uperf enters and exits 80 times between two ticks when utilizes
> >>>> 100% CPU, and the average idle residency < 50us.
> >>>
> >>> But scheduler looks for idle state of prev cpu before looping the idle cpu
> >>> mask so i'm not sure that uperf is impacted in this case because scheduler
> >>> will select prev cpu before loop idle cpu mask.
> >>>
> >>>>
> >>>> If this CPU is added to idle cpumask, the wakeup task likely needs to
> >>>> wait in the runqueue as this CPU will run its current task very soon.
> >>>>
> >>>>>
> >>>>> So I would prefer to keep trying to set the idle mask everytime the
> >>>>> cpu enters idle. If a tick has not happened between 2 idle phases, the
> >>>>> cpumask will not be updated and the overhead will be mostly testing if
> >>>>> (rq->last_idle_state == idle_state).
> >>>>
> >>>> Not sure if I addressed your concern, did you see any workloads any cases
> >>>> v4 performs better than v5?
> >>>
> >>> Yes, I see some perf regression on my octo arm64 system for hackbench with
> >>> only 1 group (and for few ther ones but it's less obvious). There is no
> >>> perf impact with more groups most probably because the cpus are no more idle.
> >>>
> >>> The regression happens even if the shallowest idle state is the only one to
> >>> be enabled.
> >>
> >> Thanks for the data.
> >>
> >>>
> >>> - 2 x 4 cores arm64 system
> >>>
> >>> 12 iterations of hackbench -l (256000/#grp) -g #grp
> >>>
> >>> Only the shallowest state enabled
> >>
> >>> (as a sidenote, we don't have polling mode on arm64)
> >> Okay, this might be the cause of the difference between yours and mine. So do you
> >> think if it makes sense to let idle governor to return a polling flag and associate
> >> it with idle cpumask update instead of stop_tick? A CPU is idle but actually polling
> >> may not be suitable for the wake up target.
> >
> > I don't know much about polling but can't this mode be used up to a tick too ?
> >I think so. So short idle need a definition. I'm not sure if it's a good idea to define
> the short idle as a tunable and default set it to tick >> 2?

I have never been fond of heuristic like tick << 2 or sys tunable

TBH, I'm not sure that using the tick is a good idea. And such kind of
parameter need more thought

>
> Updating idle cpumask everytime cpu enters idle works for me, as we have state change
> check, so we won't actually update idle cpumask everytime the cpu enters idle.

Yes, In this case, the overhead stays reasonable and is limited to the
test of a rq->last_idle_state
This will benefit heavy use a case by reducing the scanning time and
will not regress other use case.

>
> But I'm still willing to exclude short idle case, what do you think?

something similar to patch v3 or patch v5 + my changes seems to be a
good 1st step that will benefit heavy use cases with regressing other
ones.

Trying to exclude short idle case will need more thoughts and changes
especially about to how to get this information and if it is reliable.

>
> Thanks,
> -Aubrey
>
>

2020-12-07 15:52:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

On Wed, Nov 18, 2020 at 12:31:13PM +0800, Aubrey Li wrote:
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index f324dc36fc43..6f5947673e66 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -163,6 +163,7 @@ static void cpuidle_idle_call(void)
> */
>
> if (cpuidle_not_available(drv, dev)) {
> + update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
>
> default_idle_call();
> @@ -193,6 +194,7 @@ static void cpuidle_idle_call(void)
> max_latency_ns = dev->forced_idle_latency_limit_ns;
> }
>
> + update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
>
> next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> @@ -205,10 +207,12 @@ static void cpuidle_idle_call(void)
> */
> next_state = cpuidle_select(drv, dev, &stop_tick);
>
> - if (stop_tick || tick_nohz_tick_stopped())
> + if (stop_tick || tick_nohz_tick_stopped()) {
> + update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();

We already have a callback in tick_nohz_idle_stop_tick(), namely
nohz_balance_enter_idle().

2020-12-07 15:54:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

On Wed, Nov 18, 2020 at 12:31:13PM +0800, Aubrey Li wrote:
> Add idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, if the idle driver indicates to stop tick, this CPU
> is set in the idle cpumask to be a wakeup target. And if the CPU
> is not in idle, the CPU is cleared in idle cpumask during scheduler
> tick to ratelimit idle cpumask update.

So this will only have SIS consider CPUs that have the tick stopped?
That might affect the facebook tail-latency workloads quite a bit.

2020-12-07 16:57:04

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup


On 07/12/20 15:48, Peter Zijlstra wrote:
> On Wed, Nov 18, 2020 at 12:31:13PM +0800, Aubrey Li wrote:
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index f324dc36fc43..6f5947673e66 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -163,6 +163,7 @@ static void cpuidle_idle_call(void)
>> */
>>
>> if (cpuidle_not_available(drv, dev)) {
>> + update_idle_cpumask(this_rq(), true);
>> tick_nohz_idle_stop_tick();
>>
>> default_idle_call();
>> @@ -193,6 +194,7 @@ static void cpuidle_idle_call(void)
>> max_latency_ns = dev->forced_idle_latency_limit_ns;
>> }
>>
>> + update_idle_cpumask(this_rq(), true);
>> tick_nohz_idle_stop_tick();
>>
>> next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
>> @@ -205,10 +207,12 @@ static void cpuidle_idle_call(void)
>> */
>> next_state = cpuidle_select(drv, dev, &stop_tick);
>>
>> - if (stop_tick || tick_nohz_tick_stopped())
>> + if (stop_tick || tick_nohz_tick_stopped()) {
>> + update_idle_cpumask(this_rq(), true);
>> tick_nohz_idle_stop_tick();
>
> We already have a callback in tick_nohz_idle_stop_tick(), namely
> nohz_balance_enter_idle().

That's a no-op for !NO_HZ_COMMON though. For similar reasons, Aubrey moved
the clearing of the cpumask to scheduler_tick().

Are you saying this mechanism should only be driven for NO_HZ kernels? I
would tend to agree with Vincent that this could still be useful for idling
without cutting the tick (!NO_HZ or shallow idle state); see:

20201124170136.GA26613@vingu-book

2020-12-07 17:16:48

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

On Mon, Dec 07, 2020 at 04:50:15PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 18, 2020 at 12:31:13PM +0800, Aubrey Li wrote:
> > Add idle cpumask to track idle cpus in sched domain. When a CPU
> > enters idle, if the idle driver indicates to stop tick, this CPU
> > is set in the idle cpumask to be a wakeup target. And if the CPU
> > is not in idle, the CPU is cleared in idle cpumask during scheduler
> > tick to ratelimit idle cpumask update.
>
> So this will only have SIS consider CPUs that have the tick stopped?
> That might affect the facebook tail-latency workloads quite a bit.
>

This is the concern I had as well. That's why patch "sched/fair: Avoid
revisiting CPUs multiple times during select_idle_sibling" was created. The
intent was that a mask could be used as a hint about what runqueues to
look at first but continue looking at other runqueues without examining
the same runqueue twice. While the patch would not be merged on its own,
it may still be relevant in the context of an idle CPU mask depending on
how up to date it is.

--
Mel Gorman
SUSE Labs

2020-12-07 18:10:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

On Mon, Dec 07, 2020 at 04:52:24PM +0000, Valentin Schneider wrote:
>
> On 07/12/20 15:48, Peter Zijlstra wrote:
> > On Wed, Nov 18, 2020 at 12:31:13PM +0800, Aubrey Li wrote:
> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index f324dc36fc43..6f5947673e66 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -163,6 +163,7 @@ static void cpuidle_idle_call(void)
> >> */
> >>
> >> if (cpuidle_not_available(drv, dev)) {
> >> + update_idle_cpumask(this_rq(), true);
> >> tick_nohz_idle_stop_tick();
> >>
> >> default_idle_call();
> >> @@ -193,6 +194,7 @@ static void cpuidle_idle_call(void)
> >> max_latency_ns = dev->forced_idle_latency_limit_ns;
> >> }
> >>
> >> + update_idle_cpumask(this_rq(), true);
> >> tick_nohz_idle_stop_tick();
> >>
> >> next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> >> @@ -205,10 +207,12 @@ static void cpuidle_idle_call(void)
> >> */
> >> next_state = cpuidle_select(drv, dev, &stop_tick);
> >>
> >> - if (stop_tick || tick_nohz_tick_stopped())
> >> + if (stop_tick || tick_nohz_tick_stopped()) {
> >> + update_idle_cpumask(this_rq(), true);
> >> tick_nohz_idle_stop_tick();
> >
> > We already have a callback in tick_nohz_idle_stop_tick(), namely
> > nohz_balance_enter_idle().
>
> That's a no-op for !NO_HZ_COMMON though. For similar reasons, Aubrey moved
> the clearing of the cpumask to scheduler_tick().
>
> Are you saying this mechanism should only be driven for NO_HZ kernels?

IFF it keys off of the tick being stopped, then yes. But as said in the
other email, I think that's a very dubious thing to begin with.