2019-07-08 07:58:16

by Parth Shah

Subject: [RFC 0/2] Optimize the idle CPU search

When searching for an idle sibling, the scheduler first iterates to search
for an idle core and then for an idle CPU. By maintaining an idle-CPU mask
while iterating through the cores, we can mark the non-idle CPUs that the
idle-CPU search then no longer has to iterate through. This is especially
beneficial on a moderately loaded system.

Optimize the idle-CPU search by marking the non-idle CPUs already found
during the idle-core search. This reduces the iteration count of the
subsequent idle-CPU search.
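
Condensed from the hunks in patch 2/2, the flow looks roughly like this
(an illustrative sketch only, not a standalone function; 'idle_cpus' is the
per-CPU select_idle_mask that patch 1/2 frees up for reuse):

	/*
	 * select_idle_sibling(): seed the shared per-CPU mask with the
	 * LLC span, restricted to the task's allowed CPUs.
	 */
	cpumask_and(idle_cpus, sched_domain_span(sd), &p->cpus_allowed);

	/* select_idle_core(): remember every CPU found to be busy. */
	for_each_cpu(cpu, cpu_smt_mask(core)) {
		__cpumask_clear_cpu(cpu, cpus);
		if (!available_idle_cpu(cpu)) {
			idle = false;
			__cpumask_clear_cpu(cpu, idle_cpus);
		}
	}

	/* select_idle_cpu(): walk only what may still be idle. */
	for_each_cpu_wrap(cpu, idle_cpus, target) {
		if (!--nr)
			return -1;
		if (available_idle_cpu(cpu))
			break;
	}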

The results show that the time spent in select_idle_cpu() decreases, with
no regression in the search time of select_idle_core() and almost no
regression on schbench either. With proper tuning, schbench shows a benefit
as well when the idle-core search fails most of the time.

To do this, rename the per-CPU cpumask 'select_idle_mask', which is
currently used only as a local iterator, so that the existing name can be
reused for this optimization.

Patch set based on tip/core/core

Results
===========
IBM POWER9 system: 2-socket, 44 cores, 176 CPUs

Function latency (with tb tick):
(lower is better)
+--------------+----------+--------+-------------+--------+
| select_idle_ | Baseline | stddev | Patch       | stddev |
+--------------+----------+--------+-------------+--------+
| core         | 2080     | 1307   | 1975(+5.3%) | 1286   |
| cpu          | 834      | 393    | 91(+89%)    | 64     |
| sibling      | 0.96     | 0.003  | 0.89(+7%)   | 0.02   |
+--------------+----------+--------+-------------+--------+

Schbench:
- schbench -m 44 -t 1
(lower is better)
+------+----------+--------+------------+--------+
| %ile | Baseline | stddev | Patch      | stddev |
+------+----------+--------+------------+--------+
| 50   | 9.9      | 2      | 10(-1.01%) | 1.4    |
| 95   | 465      | 3.9    | 465(0%)    | 2      |
| 99   | 561      | 24     | 483(-1.0%) | 14     |
| 99.5 | 631      | 29     | 635(-0.6%) | 32     |
| 99.9 | 801      | 41     | 763(+4.7%) | 125    |
+------+----------+--------+------------+--------+

- 44 threads spread across cores so that select_idle_core() returns -1 most
  of the time
- schbench -m 44 -t 1
+-------+----------+--------+-----------+--------+
| %ile  | Baseline | stddev | Patch     | stddev |
+-------+----------+--------+-----------+--------+
| 50    | 10       | 9      | 12(-20%)  | 1      |
| 95    | 468      | 3      | 31(+93%)  | 1      |
| 99    | 577      | 16     | 477(+17%) | 38     |
| 99.95 | 647      | 26     | 482(+25%) | 2      |
| 99.99 | 835      | 61     | 492(+41%) | 2      |
+-------+----------+--------+-----------+--------+


Hackbench:
- 44 threads spread across cores so that select_idle_core() returns -1 most
  of the time
- perf bench sched messaging -g 1 -l 100000
(lower is better)
+----------+--------+--------------+--------+
| Baseline | stddev | Patch        | stddev |
+----------+--------+--------------+--------+
| 16.107   | 0.62   | 16.02(+0.5%) | 0.32   |
+----------+--------+--------------+--------+


Series:
- Patch 01: Rename select_idle_mask so the name can be reused in the next patch
- Patch 02: Optimize the wakeup fast path


Parth Shah (2):
sched/fair: Rename select_idle_mask to iterator_mask
sched/fair: Optimize idle CPU search

kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 15 ++++++++++-----
2 files changed, 13 insertions(+), 5 deletions(-)

--
2.17.1


2019-07-08 08:25:32

by Parth Shah

Subject: [RFC 2/2] sched/fair: Optimize the idle CPU search

Optimize the idle-CPU search by marking the non-idle CPUs already found
during the idle-core search. This reduces the iteration count of the
subsequent idle-CPU search.

Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 13 +++++++++----
2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d5a6bdc956c8..196e4eaca66e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5951,6 +5951,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, iterator_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
 
 void __init sched_init(void)
 {
@@ -5991,6 +5992,8 @@ void __init sched_init(void)
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 		per_cpu(iterator_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 20affe03379d..2b70b94b3e66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5295,6 +5295,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
 DEFINE_PER_CPU(cpumask_var_t, iterator_mask);
+DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 /*
@@ -6084,6 +6085,7 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(iterator_mask);
+	struct cpumask *idle_cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int core, cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
@@ -6099,8 +6101,10 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
 			__cpumask_clear_cpu(cpu, cpus);
-			if (!available_idle_cpu(cpu))
+			if (!available_idle_cpu(cpu)) {
 				idle = false;
+				__cpumask_clear_cpu(cpu, idle_cpus);
+			}
 		}
 
 		if (idle)
@@ -6161,6 +6165,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
+	struct cpumask *idle_cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6186,11 +6191,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+	for_each_cpu_wrap(cpu, idle_cpus, target) {
 		if (!--nr)
 			return -1;
-		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-			continue;
 		if (available_idle_cpu(cpu))
 			break;
 	}
@@ -6210,6 +6213,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
+	struct cpumask *idle_cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 
 	if (available_idle_cpu(target))
 		return target;
@@ -6239,6 +6243,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
+	cpumask_and(idle_cpus, sched_domain_span(sd), &p->cpus_allowed);
 	i = select_idle_core(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
--
2.17.1

2019-07-08 08:25:44

by Parth Shah

Subject: [RFC 1/2] sched/fair: Rename select_idle_mask to iterator_mask

The per-CPU variable 'select_idle_mask' serves only as an iterator inside
select_idle_core(). The subsequent patch optimizes the idle-CPU search and
reuses this mask's name for that purpose. Hence, rename the per-CPU
variable to 'iterator_mask', which continues to be used locally for CPU
iteration.

The subsequent patch uses the name 'select_idle_mask' for a mask that keeps
track of the idle CPUs and can be shared across function calls.


Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4778c48a7fda..d5a6bdc956c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5950,7 +5950,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
-DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, iterator_mask);
 
 void __init sched_init(void)
 {
@@ -5989,7 +5989,7 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
-		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+		per_cpu(iterator_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdab7eb6f351..20affe03379d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5294,7 +5294,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
-DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+DEFINE_PER_CPU(cpumask_var_t, iterator_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 /*
@@ -6083,7 +6083,7 @@ void __update_idle_core(struct rq *rq)
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(iterator_mask);
 	int core, cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
--
2.17.1

2019-07-08 11:04:21

by Peter Zijlstra

Subject: Re: [RFC 0/2] Optimize the idle CPU search

On Mon, Jul 08, 2019 at 10:24:30AM +0530, Parth Shah wrote:
> When searching for an idle sibling, the scheduler first iterates to search
> for an idle core and then for an idle CPU. By maintaining an idle-CPU mask
> while iterating through the cores, we can mark the non-idle CPUs that the
> idle-CPU search then no longer has to iterate through. This is especially
> beneficial on a moderately loaded system.
>
> Optimize the idle-CPU search by marking the non-idle CPUs already found
> during the idle-core search. This reduces the iteration count of the
> subsequent idle-CPU search.

Have you seen these patches:

https://lkml.kernel.org/r/[email protected]

I've meant to get back to that, but never quite had the time :/

2019-07-08 13:05:06

by Parth Shah

Subject: Re: [RFC 0/2] Optimize the idle CPU search



On 7/8/19 1:38 PM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 10:24:30AM +0530, Parth Shah wrote:
>> When searching for an idle sibling, the scheduler first iterates to search
>> for an idle core and then for an idle CPU. By maintaining an idle-CPU mask
>> while iterating through the cores, we can mark the non-idle CPUs that the
>> idle-CPU search then no longer has to iterate through. This is especially
>> beneficial on a moderately loaded system.
>>
>> Optimize the idle-CPU search by marking the non-idle CPUs already found
>> during the idle-core search. This reduces the iteration count of the
>> subsequent idle-CPU search.
>
> Have you seen these patches:
>
> https://lkml.kernel.org/r/[email protected]
>
> I've meant to get back to that, but never quite had the time :/
>

Ok, I was not aware of this patch-set.
It seems interesting and I will evaluate it.


Thanks

2019-07-09 00:12:32

by Subhra Mazumdar

Subject: Re: [RFC 0/2] Optimize the idle CPU search


On 7/8/19 10:24 AM, Parth Shah wrote:
> When searching for an idle sibling, the scheduler first iterates to search
> for an idle core and then for an idle CPU. By maintaining an idle-CPU mask
> while iterating through the cores, we can mark the non-idle CPUs that the
> idle-CPU search then no longer has to iterate through. This is especially
> beneficial on a moderately loaded system.
>
> Optimize the idle-CPU search by marking the non-idle CPUs already found
> during the idle-core search. This reduces the iteration count of the
> subsequent idle-CPU search.
>
I believe this can co-exist with latency-nice? We can derive the 'nr' in
select_idle_cpu from latency-nice and use the new mask to iterate.

2019-07-09 05:39:37

by Parth Shah

Subject: Re: [RFC 0/2] Optimize the idle CPU search



On 7/9/19 5:38 AM, Subhra Mazumdar wrote:
>
> On 7/8/19 10:24 AM, Parth Shah wrote:
>> When searching for an idle sibling, the scheduler first iterates to search
>> for an idle core and then for an idle CPU. By maintaining an idle-CPU mask
>> while iterating through the cores, we can mark the non-idle CPUs that the
>> idle-CPU search then no longer has to iterate through. This is especially
>> beneficial on a moderately loaded system.
>>
>> Optimize the idle-CPU search by marking the non-idle CPUs already found
>> during the idle-core search. This reduces the iteration count of the
>> subsequent idle-CPU search.
>>
> I believe this can co-exist with latency-nice? We can derive the 'nr' in
> select_idle_cpu from latency-nice and use the new mask to iterate.
>

I agree, can be done with latency-nice.

Maybe something like below?
smt = nr_cpus / nr_cores
nr = smt + (p->latency_nice * (total_cpus-smt) / max_latency_nice)

This limits the lower bound to one core's worth of CPUs and goes through
all the cores if latency_nice is maximum for the task.
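
For concreteness, a rough sketch of that bound (hypothetical helper name;
'latency_nice' and 'max_latency_nice' stand for the proposed attribute and
its range, and total_cpus is taken as the CPUs in the searched LLC):

	/*
	 * On the 2-socket POWER9 above, total_cpus = 176 and nr_cores = 44,
	 * so smt = 4: latency_nice = 0 gives nr = 4 (one core's worth of
	 * CPUs), latency_nice = max_latency_nice gives nr = 176.
	 */
	static inline int sic_nr_limit(int latency_nice, int max_latency_nice,
				       int total_cpus, int nr_cores)
	{
		int smt = total_cpus / nr_cores;

		return smt + latency_nice * (total_cpus - smt) / max_latency_nice;
	}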


Thanks
Parth

2019-07-09 06:49:55

by Subhra Mazumdar

Subject: Re: [RFC 0/2] Optimize the idle CPU search


On 7/8/19 1:38 PM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 10:24:30AM +0530, Parth Shah wrote:
>> When searching for an idle sibling, the scheduler first iterates to search
>> for an idle core and then for an idle CPU. By maintaining an idle-CPU mask
>> while iterating through the cores, we can mark the non-idle CPUs that the
>> idle-CPU search then no longer has to iterate through. This is especially
>> beneficial on a moderately loaded system.
>>
>> Optimize the idle-CPU search by marking the non-idle CPUs already found
>> during the idle-core search. This reduces the iteration count of the
>> subsequent idle-CPU search.
> Have you seen these patches:
>
> https://lkml.kernel.org/r/[email protected]
>
> I've meant to get back to that, but never quite had the time :/
The most relevant bit of that series was folding select_idle_core and
select_idle_cpu together. But it may be good to keep them separate, as
workloads that just want any idle CPU can then find one and break out early
by disabling the idle-core search. And this can still work with
latency-nice, which can moderate both the idle-core and the idle-CPU search.
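
Something along those lines could look like the sketch below (purely
hypothetical; 'latency_nice' and the threshold are not from any posted
patch, only the select_idle_core()/select_idle_cpu() calls are as used in
this series):

	static int select_idle_sibling_sketch(struct task_struct *p,
					      struct sched_domain *sd, int target)
	{
		int i = -1;

		/*
		 * Tasks below the (assumed) threshold just want any idle CPU
		 * quickly, so skip the idle-core scan for them.
		 */
		if (p->latency_nice >= LATENCY_NICE_CORE_SCAN)
			i = select_idle_core(p, sd, target);
		if ((unsigned int)i < nr_cpumask_bits)
			return i;

		/* Fall back to any idle CPU, bounded by 'nr' internally. */
		return select_idle_cpu(p, sd, target);
	}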

2019-07-09 06:51:43

by Subhra Mazumdar

Subject: Re: [RFC 0/2] Optimize the idle CPU search


On 7/9/19 11:08 AM, Parth Shah wrote:
>
> On 7/9/19 5:38 AM, Subhra Mazumdar wrote:
>> On 7/8/19 10:24 AM, Parth Shah wrote:
>>> When searching for an idle sibling, the scheduler first iterates to search
>>> for an idle core and then for an idle CPU. By maintaining an idle-CPU mask
>>> while iterating through the cores, we can mark the non-idle CPUs that the
>>> idle-CPU search then no longer has to iterate through. This is especially
>>> beneficial on a moderately loaded system.
>>>
>>> Optimize the idle-CPU search by marking the non-idle CPUs already found
>>> during the idle-core search. This reduces the iteration count of the
>>> subsequent idle-CPU search.
>>>
>> I believe this can co-exist with latency-nice? We can derive the 'nr' in
>> select_idle_cpu from latency-nice and use the new mask to iterate.
>>
> I agree, can be done with latency-nice.
>
> Maybe something like below?
> smt = nr_cpus / nr_cores
> nr = smt + (p->latency_nice * (total_cpus-smt) / max_latency_nice)
>
> This limits the lower bound to one core's worth of CPUs and goes through
> all the cores if latency_nice is maximum for the task.
Yes, I had something similar in mind.