Use SIS_CORE to enable/disable the idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, and removing it improves
throughput. There are also workloads where disabling it can hurt latency,
so an option is needed.
Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1ca88e..6a74808 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;
- i = select_idle_core(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
+ if (sched_feat(SIS_CORE)) {
+ i = select_idle_core(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
--
2.9.3
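For reference, the SIS_CORE knob tested above is presumably defined elsewhere in
the series as a sched_feat() entry. A minimal sketch of what such a definition
would look like in kernel/sched/features.h (default on, matching the discussion
below; its placement next to the existing SIS_* entries is an assumption):

/*
 * Sketch only, not the posted patch: a sched_feat() entry named SIS_CORE,
 * default-enabled, alongside the existing SIS_* wake-up heuristics.
 */
SCHED_FEAT(SIS_AVG_CPU, false)
SCHED_FEAT(SIS_PROP, true)

/*
 * Scan for a fully idle core during wake-up; can be disabled for
 * throughput-bound workloads that hit the scalability bottleneck
 * described in the changelog.
 */
SCHED_FEAT(SIS_CORE, true)

With CONFIG_SCHED_DEBUG enabled, such a feature can then be toggled at runtime by
writing SIS_CORE or NO_SIS_CORE to /sys/kernel/debug/sched_features.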
On 6/27/19 6:59 AM, subhra mazumdar wrote:
> Use SIS_CORE to disable idle core search. For some workloads
> select_idle_core becomes a scalability bottleneck, removing it improves
> throughput. Also there are workloads where disabling it can hurt latency,
> so need to have an option.
>
> Signed-off-by: subhra mazumdar <[email protected]>
> ---
> kernel/sched/fair.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c1ca88e..6a74808 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if (!sd)
> return target;
>
> - i = select_idle_core(p, sd, target);
> - if ((unsigned)i < nr_cpumask_bits)
> - return i;
> + if (sched_feat(SIS_CORE)) {
> + i = select_idle_core(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
Disabling this can cause a significant performance loss. select_idle_core spreads
workloads quickly across the cores, so disabling it leaves much of that work to the
load balancer, which must then move tasks across cores. Latency-sensitive and
long-running multi-threaded workloads are likely to regress under these conditions.
Also, systems like POWER9 have sd_llc spanning only a pair of cores, so they won't
benefit from the limits; hiding your code in select_idle_cpu behind static keys
would therefore be much preferred.
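For illustration, the static-key variant being suggested would look roughly like
this; the key name and the idea of flipping it from a boot parameter or sysctl are
assumptions, not part of either posted patch:

/*
 * Sketch of the static-key alternative (not the posted patch): guard the
 * idle-core scan with a global static branch instead of a sched_feat()
 * test, so the disabled case is patched out of the hot path entirely.
 */
DEFINE_STATIC_KEY_TRUE(sched_sis_core);

/* Inside select_idle_sibling(), after sd has been looked up: */
	if (static_branch_likely(&sched_sis_core)) {
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}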
>
> i = select_idle_cpu(p, sd, target);
> if ((unsigned)i < nr_cpumask_bits)
>
Best,
Parth
On 6/28/19 12:01 PM, Parth Shah wrote:
>
> On 6/27/19 6:59 AM, subhra mazumdar wrote:
>> Use SIS_CORE to disable idle core search. For some workloads
>> select_idle_core becomes a scalability bottleneck, removing it improves
>> throughput. Also there are workloads where disabling it can hurt latency,
>> so need to have an option.
>>
>> Signed-off-by: subhra mazumdar <[email protected]>
>> ---
>> kernel/sched/fair.c | 8 +++++---
>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c1ca88e..6a74808 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> if (!sd)
>> return target;
>>
>> - i = select_idle_core(p, sd, target);
>> - if ((unsigned)i < nr_cpumask_bits)
>> - return i;
>> + if (sched_feat(SIS_CORE)) {
>> + i = select_idle_core(p, sd, target);
>> + if ((unsigned)i < nr_cpumask_bits)
>> + return i;
>> + }
> This can have significant performance loss if disabled. The select_idle_core spreads
> workloads quickly across the cores, hence disabling this leaves much of the work to
> be offloaded to load balancer to move task across the cores. Latency sensitive
> and long running multi-threaded workload should see the regression under this conditions.
Yes, in the case of SPARC SMT8 I did notice that (see the cover letter). That's why
it is a feature that is ON by default, but can be turned OFF for specific
workloads on x86 SMT2 that benefit from it.
> Also, systems like POWER9 has sd_llc as a pair of core only. So it
> won't benefit from the limits and hence also hiding your code in select_idle_cpu
> behind static keys will be much preferred.
If it doesn't hurt then I don't see the point.
Thanks,
Subhra
On 6/29/19 3:59 AM, Subhra Mazumdar wrote:
>
> On 6/28/19 12:01 PM, Parth Shah wrote:
>>
>> On 6/27/19 6:59 AM, subhra mazumdar wrote:
>>> Use SIS_CORE to disable idle core search. For some workloads
>>> select_idle_core becomes a scalability bottleneck, removing it improves
>>> throughput. Also there are workloads where disabling it can hurt latency,
>>> so need to have an option.
>>>
>>> Signed-off-by: subhra mazumdar <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 8 +++++---
>>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index c1ca88e..6a74808 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>> if (!sd)
>>> return target;
>>>
>>> - i = select_idle_core(p, sd, target);
>>> - if ((unsigned)i < nr_cpumask_bits)
>>> - return i;
>>> + if (sched_feat(SIS_CORE)) {
>>> + i = select_idle_core(p, sd, target);
>>> + if ((unsigned)i < nr_cpumask_bits)
>>> + return i;
>>> + }
>> This can have significant performance loss if disabled. The select_idle_core spreads
>> workloads quickly across the cores, hence disabling this leaves much of the work to
>> be offloaded to load balancer to move task across the cores. Latency sensitive
>> and long running multi-threaded workload should see the regression under this conditions.
> Yes in case of SPARC SMT8 I did notice that (see cover letter). That's why
> it is a feature that is ON by default, but can be turned OFF for specific
> workloads on x86 SMT2 that can benefit from it.
>> Also, systems like POWER9 has sd_llc as a pair of core only. So it
>> won't benefit from the limits and hence also hiding your code in select_idle_cpu
>> behind static keys will be much preferred.
> If it doesn't hurt then I don't see the point.
>
So these are the results from a POWER9 system with your patches:
System configuration: 2 Socket, 44 cores, 176 CPUs
Experiment setup:
===========
=> Setup 1:
- 44 tasks doing just while(1); this is to make select_idle_core return -1 most of the time
- perf bench sched messaging -g 1 -l 1000000
+-----------+--------+--------------+--------+
| Baseline | stddev | Patch | stddev |
+-----------+--------+--------------+--------+
| 135 | 3.21 | 158(-17.03%) | 4.69 |
+-----------+--------+--------------+--------+
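The while(1) background load in Setup 1 can be reproduced with a trivial forking
spinner along these lines (a reconstruction of the described setup, not the actual
test harness used):

/* spin.c -- reconstruction of the Setup 1 background load: fork N
 * busy-looping children so that select_idle_core() mostly fails to find
 * a fully idle core. N defaults to 44 to match the test system's core
 * count; this is not the original harness. */
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int i, n = (argc > 1) ? atoi(argv[1]) : 44;

	for (i = 0; i < n; i++) {
		if (fork() == 0)
			for (;;)
				;	/* burn one CPU per child */
	}
	pause();	/* parent sleeps; kill the process group to clean up */
	return 0;
}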
=> Setup 2:
- schbench -m44 -t 1
+=======+==========+=========+=========+==========+
| %ile | Baseline | stddev | patch | stddev |
+=======+==========+=========+=========+==========+
| 50 | 10 | 3.49 | 10 | 2.29 |
+-------+----------+---------+---------+----------+
| 95 | 467 | 4.47 | 469 | 0.81 |
+-------+----------+---------+---------+----------+
| 99 | 571 | 21.32 | 584 | 18.69 |
+-------+----------+---------+---------+----------+
| 99.5 | 629 | 30.05 | 641 | 20.95 |
+-------+----------+---------+---------+----------+
| 99.9 | 780 | 40.38 | 773 | 44.2 |
+-------+----------+---------+---------+----------+
I guess it doesn't make much difference in the schbench results, but hackbench (perf
bench sched messaging) shows an observable regression.
Best,
Parth
>>> Also, systems like POWER9 has sd_llc as a pair of core only. So it
>>> won't benefit from the limits and hence also hiding your code in select_idle_cpu
>>> behind static keys will be much preferred.
>> If it doesn't hurt then I don't see the point.
>>
> So these is the result from POWER9 system with your patches:
> System configuration: 2 Socket, 44 cores, 176 CPUs
>
> Experiment setup:
> ===========
> => Setup 1:
> - 44 tasks doing just while(1), this is to make select_idle_core return -1 most times
> - perf bench sched messaging -g 1 -l 1000000
> +-----------+--------+--------------+--------+
> | Baseline | stddev | Patch | stddev |
> +-----------+--------+--------------+--------+
> | 135 | 3.21 | 158(-17.03%) | 4.69 |
> +-----------+--------+--------------+--------+
>
> => Setup 2:
> - schbench -m44 -t 1
> +=======+==========+=========+=========+==========+
> | %ile | Baseline | stddev | patch | stddev |
> +=======+==========+=========+=========+==========+
> | 50 | 10 | 3.49 | 10 | 2.29 |
> +-------+----------+---------+---------+----------+
> | 95 | 467 | 4.47 | 469 | 0.81 |
> +-------+----------+---------+---------+----------+
> | 99 | 571 | 21.32 | 584 | 18.69 |
> +-------+----------+---------+---------+----------+
> | 99.5 | 629 | 30.05 | 641 | 20.95 |
> +-------+----------+---------+---------+----------+
> | 99.9 | 780 | 40.38 | 773 | 44.2 |
> +-------+----------+---------+---------+----------+
>
> I guess it doesn't make much difference in schbench results but hackbench (perf bench)
> seems to have an observable regression.
>
>
> Best,
> Parth
>
If POWER9 sd_llc has only 2 cores, the behavior shouldn't change much with
the select_idle_cpu changes, as the limits are 1 and 2 cores. Previously the
lower bound was 4 CPUs and the upper bound was calculated proportionally. Now it is
1 core (4 CPUs on SMT4) and the upper bound is 2 cores. Could it be the extra
computation of cpumask_weight causing the regression rather than the
sliding window itself (one way to check this would be to hardcode 4 in place
of topology_sibling_weight)? Or is it L1 cache coherency? I am a bit
surprised because SPARC SMT8, which has more cores in sd_llc and an L1 cache per
core, showed improvement with hackbench.
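For illustration, that check would look roughly like the following; the exact
expression being replaced is an assumption, since the select_idle_cpu() patch
that introduces floor/topology_sibling_weight is not quoted in this thread:

/*
 * Rough sketch of the proposed check (assumed code, not a posted patch):
 * bypass the per-wakeup SMT-width computation with a constant to see
 * whether that lookup, rather than the sliding window itself, is what
 * costs the cycles.
 */
#ifdef CHECK_CPUMASK_WEIGHT_COST
	floor = 4;	/* assumed SMT4 width on the POWER9 test box */
#else
	floor = cpumask_weight(topology_sibling_cpumask(target));
#endif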
Thanks,
Subhra
On 7/2/19 2:07 AM, Subhra Mazumdar wrote:
>
>>>> Also, systems like POWER9 has sd_llc as a pair of core only. So it
>>>> won't benefit from the limits and hence also hiding your code in select_idle_cpu
>>>> behind static keys will be much preferred.
>>> If it doesn't hurt then I don't see the point.
>>>
>> So these is the result from POWER9 system with your patches:
>> System configuration: 2 Socket, 44 cores, 176 CPUs
>>
>> Experiment setup:
>> ===========
>> => Setup 1:
>> - 44 tasks doing just while(1), this is to make select_idle_core return -1 most times
>> - perf bench sched messaging -g 1 -l 1000000
>> +-----------+--------+--------------+--------+
>> | Baseline | stddev | Patch | stddev |
>> +-----------+--------+--------------+--------+
>> | 135 | 3.21 | 158(-17.03%) | 4.69 |
>> +-----------+--------+--------------+--------+
>>
>> => Setup 2:
>> - schbench -m44 -t 1
>> +=======+==========+=========+=========+==========+
>> | %ile | Baseline | stddev | patch | stddev |
>> +=======+==========+=========+=========+==========+
>> | 50 | 10 | 3.49 | 10 | 2.29 |
>> +-------+----------+---------+---------+----------+
>> | 95 | 467 | 4.47 | 469 | 0.81 |
>> +-------+----------+---------+---------+----------+
>> | 99 | 571 | 21.32 | 584 | 18.69 |
>> +-------+----------+---------+---------+----------+
>> | 99.5 | 629 | 30.05 | 641 | 20.95 |
>> +-------+----------+---------+---------+----------+
>> | 99.9 | 780 | 40.38 | 773 | 44.2 |
>> +-------+----------+---------+---------+----------+
>>
>> I guess it doesn't make much difference in schbench results but hackbench (perf bench)
>> seems to have an observable regression.
>>
>>
>> Best,
>> Parth
>>
> If POWER9 sd_llc has only 2 cores, the behavior shouldn't change much with
> the select_idle_cpu changes as the limits are 1 and 2 core. Previously the
> lower bound was 4 cpus and upper bound calculated by the prop. Now it is
> 1 core (4 cpus on SMT4) and upper bound 2 cores. Could it be the extra
> computation of cpumask_weight causing the regression rather than the
> sliding window itself (one way to check this would be hardcode 4 in place
> of topology_sibling_weight)? Or is it the L1 cache coherency? I am a bit
> suprised because SPARC SMT8 which has more cores in sd_llc and L1 cache per
> core showed improvement with Hackbench.
>
The same experiment with hackbench, analyzed with perf, shows an increase in the L1
cache miss rate with these patches:
(Lower is better)
+-----------------------+-------------+------------+
|                       | Baseline(%) | Patch(%)   |
+-----------------------+-------------+------------+
| Total cache miss rate | 17.01       | 19 (-11%)  |
| L1 icache miss rate   | 5.45        | 6.7 (-22%) |
+-----------------------+-------------+------------+
So is it possible for the idle_cpu search to check the target CPU's SMT siblings first
and only then go to the sliding window if no idle CPU is found there?
The diff below works as expected on an IBM POWER9 system and resolves the problem
of far wakeups to a large extent.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff2e9b5c3ac5..fae035ce1162 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6161,6 +6161,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
u64 time, cost;
s64 delta;
int cpu, limit, floor, target_tmp, nr = INT_MAX;
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6198,16 +6199,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
time = local_clock();
- for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+ cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
+ for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
+ __cpumask_clear_cpu(cpu, cpus);
+ if (available_idle_cpu(cpu))
+ goto idle_cpu_exit;
+ }
+
+ for_each_cpu_wrap(cpu, cpus, target_tmp) {
per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
- if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
- continue;
if (available_idle_cpu(cpu))
break;
}
+idle_cpu_exit:
time = local_clock() - time;
cost = this_sd->avg_scan_cost;
delta = (s64)(time - cost) / 8;
Best,
Parth
On 7/4/19 6:04 PM, Parth Shah wrote:
> Same experiment with hackbench and with perf analysis shows increase in L1
> cache miss rate with these patches
> (Lower is better)
> Baseline(%) Patch(%)
> ----------------------- ------------- -----------
> Total Cache miss rate 17.01 19(-11%)
> L1 icache miss rate 5.45 6.7(-22%)
>
>
>
> So is is possible for idle_cpu search to try checking target_cpu first and
> then goto sliding window if not found.
> Below diff works as expected in IBM POWER9 system and resolves the problem
> of far wakeup upto large extent.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff2e9b5c3ac5..fae035ce1162 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6161,6 +6161,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> u64 time, cost;
> s64 delta;
> int cpu, limit, floor, target_tmp, nr = INT_MAX;
> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>
> this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> if (!this_sd)
> @@ -6198,16 +6199,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> time = local_clock();
>
> - for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
> + cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
> + for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
> + __cpumask_clear_cpu(cpu, cpus);
> + if (available_idle_cpu(cpu))
> + goto idle_cpu_exit;
> + }
> +
> + for_each_cpu_wrap(cpu, cpus, target_tmp) {
> per_cpu(next_cpu, target) = cpu;
> if (!--nr)
> return -1;
> - if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> - continue;
> if (available_idle_cpu(cpu))
> break;
> }
>
> +idle_cpu_exit:
> time = local_clock() - time;
> cost = this_sd->avg_scan_cost;
> delta = (s64)(time - cost) / 8;
>
>
>
> Best,
> Parth
How about calling select_idle_smt before select_idle_cpu from
select_idle_sibling? That should have the same effect.
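For illustration, that reordering would look roughly like this inside
select_idle_sibling(), assuming the two-argument select_idle_smt(p, target) helper
present in kernels of this era (a sketch of the suggestion, not a posted patch):

/*
 * Sketch of the suggested ordering (not a posted patch): try the
 * target's own SMT siblings via select_idle_smt() before paying for the
 * wider select_idle_cpu() scan, keeping wake-ups close to the waker in
 * much the same way as the diff above.
 */
	if (sched_feat(SIS_CORE)) {
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}

	i = select_idle_smt(p, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_cpu(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;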