2021-07-26 10:26:42

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup

From: Aubrey Li <[email protected]>

Add idle cpumask to track idle cpus in sched domain. Every time
a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
target. And if the CPU is not in idle, the CPU is cleared in idle
cpumask during scheduler tick to ratelimit idle cpumask update.

When a task wakes up to select an idle cpu, scanning idle cpumask
has lower cost than scanning all the cpus in last level cache domain,
especially when the system is heavily loaded.

v10-v11:
- Forward port to 5.13-rc5 based kernel

v9->v10:
- Update scan cost only when the idle cpumask is scanned, i.e, the
idle cpumask is not empty

v8->v9:
- rebase on top of tip/sched/core, no code change

v7->v8:
- refine update_idle_cpumask, no functionality change
- fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
if cpu driver is available
- move clearing idle cpumask to scheduler_tick to decouple nohz mode

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
has a regression of 99th percentile latency
- change clearing idle cpumask to nohz_balancer_kick(), so updating
idle cpumask is ratelimited in the idle exiting path
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
cpumask with sched_domain_span(sd), nohz=off case remains the original
behavior

[[email protected]: RCU protection in update_idle_cpumask]
Cc: Peter Zijlstra <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Jiang Biao <[email protected]>
Cc: Tim Chen <[email protected]>
Signed-off-by: Aubrey Li <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched/topology.h | 13 ++++++++++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++-
kernel/sched/idle.c | 5 ++++
kernel/sched/sched.h | 4 +++
kernel/sched/topology.c | 3 ++-
6 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..905e382ece1a 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,8 +74,21 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ /*
+ * Span of all idle CPUs in this domain.
+ *
+ * NOTE: this field is variable length. (Allocated dynamically
+ * by attaching extra space to the end of the structure,
+ * depending on how many CPUs the kernel has booted up with)
+ */
+ unsigned long idle_cpus_span[];
};

+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+ return to_cpumask(sds->idle_cpus_span);
+}
+
struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent; /* top domain must be null terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7c073b67f1ea..2751614ce0cb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4965,6 +4965,7 @@ void scheduler_tick(void)

#ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
+ update_idle_cpumask(cpu, rq->idle_balance);
trigger_load_balance(rq);
#endif
}
@@ -9031,6 +9032,7 @@ void __init sched_init(void)
rq->wake_stamp = jiffies;
rq->wake_avg_idle = rq->avg_idle;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+ rq->last_idle_state = 1;

INIT_LIST_HEAD(&rq->cfs_tasks);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b180205e6b25..fe87af2ccc80 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6230,7 +6230,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
if (!this_sd)
return -1;

- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ /*
+ * sched_domain_shared is set only at shared cache level,
+ * this works only because select_idle_cpu is called with
+ * sd_llc.
+ */
+ cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

if (sched_feat(SIS_PROP)) {
u64 avg_cost, avg_idle, span_avg;
@@ -7018,6 +7023,45 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

return newidle_balance(rq, rf) != 0;
}
+
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ *
+ * This function is called with interrupts disabled.
+ */
+void update_idle_cpumask(int cpu, bool idle)
+{
+ struct sched_domain *sd;
+ struct rq *rq = cpu_rq(cpu);
+ int idle_state;
+
+ /*
+ * Also set SCHED_IDLE cpu in idle cpumask to
+ * allow SCHED_IDLE cpu as a wakeup target.
+ */
+ idle_state = idle || sched_idle_cpu(cpu);
+ /*
+ * No need to update idle cpumask if the state
+ * does not change.
+ */
+ if (rq->last_idle_state == idle_state)
+ return;
+
+ rcu_read_lock();
+ sd = per_cpu(sd_llc, cpu);
+ if (unlikely(!sd))
+ goto unlock;
+
+ if (idle_state)
+ cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+ else
+ cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
+
+ rq->last_idle_state = idle_state;
+unlock:
+ rcu_read_unlock();
+}
#endif /* CONFIG_SMP */

static unsigned long wakeup_gran(struct sched_entity *se)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 912b47aa99d8..86bfe81cc280 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -289,6 +289,11 @@ static void do_idle(void)
cpuhp_report_idle_dead();
arch_cpu_idle_dead();
}
+ /*
+ * The CPU is about to go idle, set it in idle cpumask
+ * to be a wake up target.
+ */
+ update_idle_cpumask(cpu, true);

arch_cpu_idle_enter();
rcu_nocb_flush_deferred_wakeup();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4d47a0969710..2d6456fa15cb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -998,6 +998,7 @@ struct rq {

unsigned char nohz_idle_balance;
unsigned char idle_balance;
+ unsigned char last_idle_state;

unsigned long misfit_task_load;

@@ -1846,6 +1847,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)

extern int group_balance_cpu(struct sched_group *sg);

+void update_idle_cpumask(int cpu, bool idle);
+
#ifdef CONFIG_SCHED_DEBUG
void update_sched_domain_debugfs(void);
void dirty_sched_domain_sysctl(int cpu);
@@ -1864,6 +1867,7 @@ extern void flush_smp_call_function_from_idle(void);

#else /* !CONFIG_SMP: */
static inline void flush_smp_call_function_from_idle(void) { }
+static inline void update_idle_cpumask(int cpu, bool idle) { }
#endif

#include "stats.h"
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..2075bc417b90 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1611,6 +1611,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+ cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
}

sd->private = sdd;
@@ -1970,7 +1971,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)

*per_cpu_ptr(sdd->sd, j) = sd;

- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
--
2.26.2


Subject: RE: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup



> -----Original Message-----
> From: Mel Gorman [mailto:[email protected]]
> Sent: Monday, July 26, 2021 10:23 PM
> To: LKML <[email protected]>
> Cc: Ingo Molnar <[email protected]>; Peter Zijlstra <[email protected]>;
> Vincent Guittot <[email protected]>; Valentin Schneider
> <[email protected]>; Aubrey Li <[email protected]>; Mel
> Gorman <[email protected]>
> Subject: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task
> wakeup
>
> From: Aubrey Li <[email protected]>
>
> Add idle cpumask to track idle cpus in sched domain. Every time
> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> target. And if the CPU is not in idle, the CPU is cleared in idle
> cpumask during scheduler tick to ratelimit idle cpumask update.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has lower cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> v10-v11:
> - Forward port to 5.13-rc5 based kernel
>
> v9->v10:
> - Update scan cost only when the idle cpumask is scanned, i.e, the
> idle cpumask is not empty
>
> v8->v9:
> - rebase on top of tip/sched/core, no code change
>
> v7->v8:
> - refine update_idle_cpumask, no functionality change
> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y
>
> v6->v7:
> - place the whole idle cpumask mechanism under CONFIG_SMP
>
> v5->v6:
> - decouple idle cpumask update from stop_tick signal, set idle CPU
> in idle cpumask every time the CPU enters idle
>
> v4->v5:
> - add update_idle_cpumask for s2idle case
> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> idle_cpumask() everywhere
>
> v3->v4:
> - change setting idle cpumask from every idle entry to tickless idle
> if cpu driver is available
> - move clearing idle cpumask to scheduler_tick to decouple nohz mode
>
> v2->v3:
> - change setting idle cpumask to every idle entry, otherwise schbench
> has a regression of 99th percentile latency
> - change clearing idle cpumask to nohz_balancer_kick(), so updating
> idle cpumask is ratelimited in the idle exiting path
> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target
>
> v1->v2:
> - idle cpumask is updated in the nohz routines, by initializing idle
> cpumask with sched_domain_span(sd), nohz=off case remains the original
> behavior
>
> [[email protected]: RCU protection in update_idle_cpumask]
> Cc: Peter Zijlstra <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Jiang Biao <[email protected]>
> Cc: Tim Chen <[email protected]>
> Signed-off-by: Aubrey Li <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/sched/topology.h | 13 ++++++++++
> kernel/sched/core.c | 2 ++
> kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++-
> kernel/sched/idle.c | 5 ++++
> kernel/sched/sched.h | 4 +++
> kernel/sched/topology.c | 3 ++-
> 6 files changed, 71 insertions(+), 2 deletions(-)
>

Hi Mel, Aubrey,
A similar thing Yicong and me has discussed is having a mask or a count for
idle cores. And then we can only scan idle cores in this mask in
select_idle_cpu().

A obvious problem is that has_idle_cores is a bool, it can seriously lag
from the real status. I mean, after system enters the status without idle
cores, has_idle_cores could be still true.

Right now, we are setting has_idle_cores to true while cpu enters idle
and its smt sibling is also idle. But we are setting has_idle_cores to
false only after we scan all cores in a llc.

So we have thought for a while to provide an idle core mask. But never
really made a workable patch.

Mel's patch7/9 limits the number of cores which will be scanned in
select_idle_cpu(), it might somehow alleviate the problem we redundantly
scan all cores while we actually have no idle core even has_idle_cores
is true.

However, if we can get idle core mask, it could be also good to
select_idle_core()? Maybe after that, we don't have to enforce
proportional scan limits while scanning for an idle core?

> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8f0f778b7c91..905e382ece1a 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -74,8 +74,21 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + /*
> + * Span of all idle CPUs in this domain.
> + *
> + * NOTE: this field is variable length. (Allocated dynamically
> + * by attaching extra space to the end of the structure,
> + * depending on how many CPUs the kernel has booted up with)
> + */
> + unsigned long idle_cpus_span[];
> };
>
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> + return to_cpumask(sds->idle_cpus_span);
> +}
> +
> struct sched_domain {
> /* These fields must be setup */
> struct sched_domain __rcu *parent; /* top domain must be null terminated
> */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7c073b67f1ea..2751614ce0cb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4965,6 +4965,7 @@ void scheduler_tick(void)
>
> #ifdef CONFIG_SMP
> rq->idle_balance = idle_cpu(cpu);
> + update_idle_cpumask(cpu, rq->idle_balance);
> trigger_load_balance(rq);
> #endif
> }
> @@ -9031,6 +9032,7 @@ void __init sched_init(void)
> rq->wake_stamp = jiffies;
> rq->wake_avg_idle = rq->avg_idle;
> rq->max_idle_balance_cost = sysctl_sched_migration_cost;
> + rq->last_idle_state = 1;
>
> INIT_LIST_HEAD(&rq->cfs_tasks);
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b180205e6b25..fe87af2ccc80 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6230,7 +6230,12 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, bool
> if (!this_sd)
> return -1;
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + /*
> + * sched_domain_shared is set only at shared cache level,
> + * this works only because select_idle_cpu is called with
> + * sd_llc.
> + */
> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>
> if (sched_feat(SIS_PROP)) {
> u64 avg_cost, avg_idle, span_avg;
> @@ -7018,6 +7023,45 @@ balance_fair(struct rq *rq, struct task_struct *prev,
> struct rq_flags *rf)
>
> return newidle_balance(rq, rf) != 0;
> }
> +
> +/*
> + * Update cpu idle state and record this information
> + * in sd_llc_shared->idle_cpus_span.
> + *
> + * This function is called with interrupts disabled.
> + */
> +void update_idle_cpumask(int cpu, bool idle)
> +{
> + struct sched_domain *sd;
> + struct rq *rq = cpu_rq(cpu);
> + int idle_state;
> +
> + /*
> + * Also set SCHED_IDLE cpu in idle cpumask to
> + * allow SCHED_IDLE cpu as a wakeup target.
> + */
> + idle_state = idle || sched_idle_cpu(cpu);
> + /*
> + * No need to update idle cpumask if the state
> + * does not change.
> + */
> + if (rq->last_idle_state == idle_state)
> + return;
> +
> + rcu_read_lock();
> + sd = per_cpu(sd_llc, cpu);
> + if (unlikely(!sd))
> + goto unlock;
> +
> + if (idle_state)
> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> + else
> + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> +
> + rq->last_idle_state = idle_state;
> +unlock:
> + rcu_read_unlock();
> +}
> #endif /* CONFIG_SMP */
>
> static unsigned long wakeup_gran(struct sched_entity *se)
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 912b47aa99d8..86bfe81cc280 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -289,6 +289,11 @@ static void do_idle(void)
> cpuhp_report_idle_dead();
> arch_cpu_idle_dead();
> }
> + /*
> + * The CPU is about to go idle, set it in idle cpumask
> + * to be a wake up target.
> + */
> + update_idle_cpumask(cpu, true);
>
> arch_cpu_idle_enter();
> rcu_nocb_flush_deferred_wakeup();
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4d47a0969710..2d6456fa15cb 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -998,6 +998,7 @@ struct rq {
>
> unsigned char nohz_idle_balance;
> unsigned char idle_balance;
> + unsigned char last_idle_state;
>
> unsigned long misfit_task_load;
>
> @@ -1846,6 +1847,8 @@ static inline unsigned int group_first_cpu(struct
> sched_group *group)
>
> extern int group_balance_cpu(struct sched_group *sg);
>
> +void update_idle_cpumask(int cpu, bool idle);
> +
> #ifdef CONFIG_SCHED_DEBUG
> void update_sched_domain_debugfs(void);
> void dirty_sched_domain_sysctl(int cpu);
> @@ -1864,6 +1867,7 @@ extern void flush_smp_call_function_from_idle(void);
>
> #else /* !CONFIG_SMP: */
> static inline void flush_smp_call_function_from_idle(void) { }
> +static inline void update_idle_cpumask(int cpu, bool idle) { }
> #endif
>
> #include "stats.h"
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..2075bc417b90 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1611,6 +1611,7 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> atomic_inc(&sd->shared->ref);
> atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> + cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> }
>
> sd->private = sdd;
> @@ -1970,7 +1971,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>
> *per_cpu_ptr(sdd->sd, j) = sd;
>
> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> + sds = kzalloc_node(sizeof(struct sched_domain_shared) +
> cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
> if (!sds)
> return -ENOMEM;
> --
> 2.26.2


Thanks
Barry


2021-08-04 11:08:57

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup

On Mon, Aug 02, 2021 at 10:41:13AM +0000, Song Bao Hua (Barry Song) wrote:
> Hi Mel, Aubrey,
> A similar thing Yicong and me has discussed is having a mask or a count for
> idle cores. And then we can only scan idle cores in this mask in
> select_idle_cpu().
>

Either approach would require a lot of updates.

> A obvious problem is that has_idle_cores is a bool, it can seriously lag
> from the real status. I mean, after system enters the status without idle
> cores, has_idle_cores could be still true.
>

True.

> Right now, we are setting has_idle_cores to true while cpu enters idle
> and its smt sibling is also idle. But we are setting has_idle_cores to
> false only after we scan all cores in a llc.
>
> So we have thought for a while to provide an idle core mask. But never
> really made a workable patch.
>
> Mel's patch7/9 limits the number of cores which will be scanned in
> select_idle_cpu(), it might somehow alleviate the problem we redundantly
> scan all cores while we actually have no idle core even has_idle_cores
> is true.
>

I prototyped a patch that tracked the location of a known idle core and
use it as a starting hint for a search. It's cleared if the core is
selected for use. Unfortunately, it was not a universal win so was
dropped for the moment but either way, improving the accurate of
has_idle_cores without being too expensive would be niuce.

> However, if we can get idle core mask, it could be also good to
> select_idle_core()? Maybe after that, we don't have to enforce
> proportional scan limits while scanning for an idle core?
>

To remove the limits entirely, it would have to be very accurate and
it's hard to see how that can be done without having a heavily updated
shared cache line.

--
Mel Gorman
SUSE Labs

2021-08-04 18:46:17

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup

On 8/4/21 6:26 PM, Mel Gorman wrote:
> On Mon, Aug 02, 2021 at 10:41:13AM +0000, Song Bao Hua (Barry Song) wrote:
>> Hi Mel, Aubrey,
>> A similar thing Yicong and me has discussed is having a mask or a count for
>> idle cores. And then we can only scan idle cores in this mask in
>> select_idle_cpu().
>>
>
> Either approach would require a lot of updates.
>
>> A obvious problem is that has_idle_cores is a bool, it can seriously lag
>> from the real status. I mean, after system enters the status without idle
>> cores, has_idle_cores could be still true.
>>
>
> True.
>
>> Right now, we are setting has_idle_cores to true while cpu enters idle
>> and its smt sibling is also idle. But we are setting has_idle_cores to
>> false only after we scan all cores in a llc.
>>
>> So we have thought for a while to provide an idle core mask. But never
>> really made a workable patch.
>>
>> Mel's patch7/9 limits the number of cores which will be scanned in
>> select_idle_cpu(), it might somehow alleviate the problem we redundantly
>> scan all cores while we actually have no idle core even has_idle_cores
>> is true.
>>
>
> I prototyped a patch that tracked the location of a known idle core and
> use it as a starting hint for a search. It's cleared if the core is
> selected for use. Unfortunately, it was not a universal win so was
> dropped for the moment but either way, improving the accurate of
> has_idle_cores without being too expensive would be niuce.

The idle core information is already in idle cpu mask, do we have a quick
and cheap way to extract out?


>
>> However, if we can get idle core mask, it could be also good to
>> select_idle_core()? Maybe after that, we don't have to enforce
>> proportional scan limits while scanning for an idle core?
>>
>
> To remove the limits entirely, it would have to be very accurate and
> it's hard to see how that can be done without having a heavily updated
> shared cache line.
>

Bad idea, is it helpful to make a sparse cpu mask for this? :p

Thanks,
-Aubrey

2021-09-17 12:13:29

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup

> @@ -4965,6 +4965,7 @@ void scheduler_tick(void)
>
> #ifdef CONFIG_SMP
> rq->idle_balance = idle_cpu(cpu);
> + update_idle_cpumask(cpu, rq->idle_balance);
> trigger_load_balance(rq);
> #endif
> }

might be stupid, a question bothering yicong and me is that why don't we
choose to update_idle_cpumask() while idle task exits and switches to a
normal task?
for example, before tick comes, cpu has exited from idle, but we are only
able to update it in tick. this makes idle_cpus_span inaccurate, thus we
will scan cpu which isn't actually idle in select_idle_sibling.
is it because of the huge update overhead?

Thanks
barry

2021-09-17 13:33:10

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup

On 9/17/21 12:15 PM, Barry Song wrote:
>> @@ -4965,6 +4965,7 @@ void scheduler_tick(void)
>>
>> #ifdef CONFIG_SMP
>> rq->idle_balance = idle_cpu(cpu);
>> + update_idle_cpumask(cpu, rq->idle_balance);
>> trigger_load_balance(rq);
>> #endif
>> }
>
> might be stupid, a question bothering yicong and me is that why don't we
> choose to update_idle_cpumask() while idle task exits and switches to a
> normal task?

I implemented that way and we discussed before(RFC v1 ?), updating a cpumask
at every enter/exit idle is more expensive than we expected, though it's
per LLC domain, Vincent saw a significant regression IIRC. You can also
take a look at nohz.idle_cpus_mask as a reference.

> for example, before tick comes, cpu has exited from idle, but we are only
> able to update it in tick. this makes idle_cpus_span inaccurate, thus we
> will scan cpu which isn't actually idle in select_idle_sibling.
> is it because of the huge update overhead?
>

Yes, we'll have false positive but we don't miss true positive. So things
won't be worse than the current way.

Thanks,
-Aubrey

2021-09-17 15:15:53

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup

On Fri, Sep 17, 2021 at 05:11:11PM +0800, Aubrey Li wrote:
> On 9/17/21 12:15 PM, Barry Song wrote:
> >> @@ -4965,6 +4965,7 @@ void scheduler_tick(void)
> >>
> >> #ifdef CONFIG_SMP
> >> rq->idle_balance = idle_cpu(cpu);
> >> + update_idle_cpumask(cpu, rq->idle_balance);
> >> trigger_load_balance(rq);
> >> #endif
> >> }
> >
> > might be stupid, a question bothering yicong and me is that why don't we
> > choose to update_idle_cpumask() while idle task exits and switches to a
> > normal task?
>
> I implemented that way and we discussed before(RFC v1 ?), updating a cpumask
> at every enter/exit idle is more expensive than we expected, though it's
> per LLC domain, Vincent saw a significant regression IIRC. You can also
> take a look at nohz.idle_cpus_mask as a reference.
>

It's possible to track it differently and I prototyped it some time
back. The results were mixed at the time. It helped some workloads
and was marginal on others. It appeared to help hackbench but I found
that hackbench is much more vulnerable to the wakeup_granularity and
overscheduling. For hackbench, it makes more sense to target that directly
before revisiting the alt-idlecore to see what it really helps. I'm waiting
on test results on various ways wakeup_gran can be scaled depending on
rq activity.

For alternative idle core tracking, the current 5.15-rc1 rebase
prototype looks like this

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/commit/?h=sched-altidlecore-v2r8&id=b2af1a88245f6cbeb28343e89f3183a77b29d52d

Test results still pending and as usual the queue is busy. I swear, my
primary bottleneck for doing anything is benchmark and validation :(

--
Mel Gorman
SUSE Labs

2021-09-17 16:58:31

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup

[Aubrey wrote]
> The idle core information is already in idle cpu mask, do we have a quick
> and cheap way to extract out?

seems hard.

since we need an intermediate cpumask memory to save the result of cpumask
bitmap operations, the only way i can get the idle core and make sure "cpus"
isn't broken is:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 361927803e31..abd844bcfb86 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6187,8 +6187,14 @@ void __update_idle_core(struct rq *rq)
*/
static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
- bool idle = true;
- int cpu;
+ bool idle = true, possible_idle;
+ int cpu, wb, wa;
+
+ wb = cpumask_weight(cpus);
+ cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+ wa = cpumask_weight(cpus);
+
+ possible_idle = (wa - wb == cpumask_weight(cpu_smt_mask(core)));

if (!static_branch_likely(&sched_smt_present))
return __select_idle_cpu(core, p);
@@ -6212,7 +6218,6 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
if (idle)
return core;

- cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
return -1;
}

but the questions are:
1. is it really cheap?
2. how to use possible_idle? right now, select_idle_core() is also selecting idle cpu. we shouldn't simply return when possible_idle is false.

Thanks
Barry