2020-12-08 15:39:27

by Mel Gorman

Subject: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

Changelog since v1
o Drop single-pass patch (vincent)
o Scope variables used for SIS_AVG_CPU (dietmar)
o Remove redundant assignment (dietmar)

This reduces the amount of runqueue scanning in select_idle_sibling in
the worst case.

Patch 1 removes SIS_AVG_CPU because it's unused.

Patch 2 moves all SIS_PROP-related calculations under SIS_PROP.

Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
of scanning. It should be relatively uncontroversial.

Patch 4 returns an idle candidate if one is found while scanning for a
free core.

--
2.26.2

Mel Gorman (4):
sched/fair: Remove SIS_AVG_CPU
sched/fair: Move avg_scan_cost calculations under SIS_PROP
sched/fair: Do not replace recent_used_cpu with the new target
sched/fair: Return an idle cpu if one is found after a failed search
for an idle core

kernel/sched/fair.c | 51 ++++++++++++++++++++---------------------
kernel/sched/features.h | 1 -
2 files changed, 25 insertions(+), 27 deletions(-)

--
2.26.2


2020-12-08 15:39:27

by Mel Gorman

Subject: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It
was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.
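
As an aside, and purely as a hedged illustration (this is not kernel code
and the input values are invented), the scan-depth calculation that
survives this patch boils down to the sketch below. The removed
SIS_AVG_CPU check, by contrast, simply aborted the whole search whenever
avg_idle < avg_cost.

#include <stdio.h>
#include <stdint.h>

/* Mirrors the formula in the hunk below; a userspace sketch only. */
static int sis_prop_depth(uint64_t rq_avg_idle, uint64_t sd_avg_scan_cost,
                          unsigned int span_weight)
{
        uint64_t avg_idle = rq_avg_idle / 512;          /* large fuzz factor */
        uint64_t avg_cost = sd_avg_scan_cost + 1;
        uint64_t span_avg = span_weight * avg_idle;

        if (span_avg > 4 * avg_cost)
                return (int)(span_avg / avg_cost);      /* div_u64() in the kernel */

        return 4;
}

int main(void)
{
        /* e.g. 250us average idle, 900ns average scan cost, 40-CPU LLC */
        printf("scan at most %d CPUs\n", sis_prop_depth(250000, 900, 40));
        return 0;
}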

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 20 +++++++++-----------
kernel/sched/features.h | 1 -
2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98075f9ea9a8..ac7b34e7372b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
struct sched_domain *this_sd;
- u64 avg_cost, avg_idle;
u64 time;
int this = smp_processor_id();
int cpu, nr = INT_MAX;
@@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (!this_sd)
return -1;

- /*
- * Due to large variance we need a large fuzz factor; hackbench in
- * particularly is sensitive here.
- */
- avg_idle = this_rq()->avg_idle / 512;
- avg_cost = this_sd->avg_scan_cost + 1;
+ if (sched_feat(SIS_PROP)) {
+ u64 avg_cost, avg_idle, span_avg;

- if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
- return -1;
+ /*
+ * Due to large variance we need a large fuzz factor;
+ * hackbench in particularly is sensitive here.
+ */
+ avg_idle = this_rq()->avg_idle / 512;
+ avg_cost = this_sd->avg_scan_cost + 1;

- if (sched_feat(SIS_PROP)) {
- u64 span_avg = sd->span_weight * avg_idle;
+ span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
/*
* When doing wakeups, attempt to limit superfluous scans of the LLC domain.
*/
-SCHED_FEAT(SIS_AVG_CPU, false)
SCHED_FEAT(SIS_PROP, true)

/*
--
2.26.2

2020-12-08 15:41:26

by Mel Gorman

Subject: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.
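
Putting the hunks together, the resulting control flow is roughly the
userspace model below (a sketch only: all names are local stand-ins for
the kernel's and the per-CPU idle check is stubbed out). The point is
that the scan-cost timing now brackets only the loop, only when the
feature flag is set, and no longer includes the cpumask setup.

#include <stdbool.h>
#include <stdio.h>
#include <limits.h>

static bool sis_prop = true;    /* stands in for sched_feat(SIS_PROP) */
static long avg_scan_cost;      /* stands in for this_sd->avg_scan_cost */
static long fake_clock;         /* stands in for cpu_clock(this) */

static void update_avg(long *avg, long sample)
{
        *avg += (sample - *avg) / 8;    /* same shape as the kernel helper */
}

static int scan_llc(int ncpus)
{
        int nr = INT_MAX;
        long time = 0;

        /* the cpumask_and() setup happens here, outside the timed region */

        if (sis_prop) {
                nr = 8;         /* depth from the span_avg/avg_cost formula */
                time = fake_clock;
        }

        for (int cpu = 0; cpu < ncpus; cpu++) {
                if (!--nr)
                        return -1;
                fake_clock += 100;      /* pretend a runqueue check costs 100ns */
                /* pretend no idle CPU is found, so keep scanning */
        }

        if (sis_prop)
                update_avg(&avg_scan_cost, fake_clock - time);

        return -1;
}

int main(void)
{
        scan_llc(4);
        printf("avg_scan_cost after one short scan: %ld\n", avg_scan_cost);
        return 0;
}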

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac7b34e7372b..5c41875aec23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (!this_sd)
return -1;

+ cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
if (sched_feat(SIS_PROP)) {
u64 avg_cost, avg_idle, span_avg;

@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
nr = div_u64(span_avg, avg_cost);
else
nr = 4;
- }
-
- time = cpu_clock(this);

- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ time = cpu_clock(this);
+ }

for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
break;
}

- time = cpu_clock(this) - time;
- update_avg(&this_sd->avg_scan_cost, time);
+ if (sched_feat(SIS_PROP)) {
+ time = cpu_clock(this) - time;
+ update_avg(&this_sd->avg_scan_cost, time);
+ }

return cpu;
}
--
2.26.2

2020-12-08 15:42:55

by Mel Gorman

Subject: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target

After select_idle_sibling, p->recent_used_cpu is set to the
new target. However on the next wakeup, prev will be the same as
recent_used_cpu unless the load balancer has moved the task since the last
wakeup. It still works, but it is less efficient than it could be given all
the changes that have gone in since then to reduce unnecessary migrations,
the load balancer changes, etc. This patch preserves recent_used_cpu for longer.
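
A minimal userspace model of the new bookkeeping follows (the task struct
and CPU numbers are invented and the idle/capacity/cpumask checks are
elided). The point is simply that recent_used_cpu is now refreshed with
prev before it is tested rather than only inside the success branch, and
the separate update in select_task_rq_fair() goes away.

#include <stdio.h>

struct task {
        int recent_used_cpu;
};

static int pick_cpu(struct task *p, int prev, int target)
{
        int recent = p->recent_used_cpu;

        p->recent_used_cpu = prev;      /* always remember prev for the next wakeup */

        if (recent != prev && recent != target
            /* && recent shares cache with target, is idle, is allowed, ... */)
                return recent;

        return target;          /* otherwise carry on with the normal search */
}

int main(void)
{
        struct task p = { .recent_used_cpu = 3 };

        /* the task last ran on CPU 5 and the wakeup initially targets CPU 7 */
        printf("picked CPU %d\n", pick_cpu(&p, 5, 7));
        printf("recent_used_cpu is now %d\n", p.recent_used_cpu);
        return 0;
}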

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

5.10.0-rc6 5.10.0-rc6
baseline-v2 altrecent-v2
Hmean 1 508.39 ( 0.00%) 502.05 * -1.25%*
Hmean 2 986.70 ( 0.00%) 983.65 * -0.31%*
Hmean 4 1914.55 ( 0.00%) 1920.24 * 0.30%*
Hmean 8 3702.37 ( 0.00%) 3663.96 * -1.04%*
Hmean 16 6573.11 ( 0.00%) 6545.58 * -0.42%*
Hmean 32 10142.57 ( 0.00%) 10253.73 * 1.10%*
Hmean 64 14348.40 ( 0.00%) 12506.31 * -12.84%*
Hmean 128 21842.59 ( 0.00%) 21967.13 * 0.57%*
Hmean 256 20813.75 ( 0.00%) 21534.52 * 3.46%*
Hmean 320 20684.33 ( 0.00%) 21070.14 * 1.87%*

The difference was marginal except for 64 threads, where the baseline
result was very unstable whereas the patch was much more stable. This is
somewhat machine specific; on a separate 80-cpu Broadwell machine the
same test reported:

5.10.0-rc6 5.10.0-rc6
baseline-v2 altrecent-v2
Hmean 1 310.36 ( 0.00%) 291.81 * -5.98%*
Hmean 2 340.86 ( 0.00%) 547.22 * 60.54%*
Hmean 4 912.29 ( 0.00%) 1063.21 * 16.54%*
Hmean 8 2116.40 ( 0.00%) 2103.60 * -0.60%*
Hmean 16 4232.90 ( 0.00%) 4362.92 * 3.07%*
Hmean 32 8442.03 ( 0.00%) 8642.10 * 2.37%*
Hmean 64 11733.91 ( 0.00%) 11473.66 * -2.22%*
Hmean 128 17727.24 ( 0.00%) 16784.23 * -5.32%*
Hmean 256 16089.23 ( 0.00%) 16110.79 * 0.13%*
Hmean 320 15992.60 ( 0.00%) 16071.64 * 0.49%*

schedstats were not used in this series but from an earlier debugging
effort, the schedstats after the test run were as follows;

Ops SIS Search 5653107942.00 5726545742.00
Ops SIS Domain Search 3365067916.00 3319768543.00
Ops SIS Scanned 112173512543.00 99194352541.00
Ops SIS Domain Scanned 109885472517.00 96787575342.00
Ops SIS Failures 2923185114.00 2950166441.00
Ops SIS Recent Used Hit 56547.00 118064916.00
Ops SIS Recent Used Miss 1590899250.00 354942791.00
Ops SIS Recent Attempts 1590955797.00 473007707.00
Ops SIS Search Efficiency 5.04 5.77
Ops SIS Domain Search Eff 3.06 3.43
Ops SIS Fast Success Rate 40.47 42.03
Ops SIS Success Rate 48.29 48.48
Ops SIS Recent Success Rate 0.00 24.96

The first interesting point is the ridiculous number of times runqueues
are scanned -- almost 97 billion times over the course of 40 minutes.

With the patch, "Recent Used Hit" is over 2000 times more likely to
succeed. The failure rate also increases by quite a lot but the cost is
marginal even if the "Fast Success Rate" only increases by 2% overall. What
cannot be observed from these stats is where the biggest impact is, as these
stats cover everything from low utilisation to over-saturation.
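
(For reference, the "over 2000 times" figure falls straight out of the
counters above: 118064916 hits versus 56547 is a ratio of roughly 2088,
while the number of attempts drops from about 1.59 billion to 0.47
billion, which is why the "Recent Success Rate" line moves from ~0% to
24.96%.)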

If graphed over time, the graphs show that the sched domain is only
scanned at negligible rates until the machine is fully busy. With
low utilisation, the "Fast Success Rate" is almost 100% until the
machine is fully busy. For 320 clients, the success rate is close to
0% which is unsurprising.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c41875aec23..413d895bbbf8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)

/* Check a recently used CPU as a potential idle candidate: */
recent_used_cpu = p->recent_used_cpu;
+ p->recent_used_cpu = prev;
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
asym_fits_capacity(task_util, recent_used_cpu)) {
- /*
- * Replace recent_used_cpu with prev as it is a potential
- * candidate for the next wake:
- */
- p->recent_used_cpu = prev;
return recent_used_cpu;
}

@@ -6768,9 +6764,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
- if (want_affine)
- current->recent_used_cpu = cpu;
}
rcu_read_unlock();

--
2.26.2

2020-12-08 15:44:18

by Mel Gorman

Subject: [PATCH 4/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core

select_idle_core is called when SMT is active and there is likely a free
core available. It may find idle CPUs but this information is simply
discarded and the scan starts over again with select_idle_cpu.

This patch caches information on idle CPUs found during the search for
a core and uses one if no core is found. This is a tradeoff. There may
be a slight impact when utilisation is low and an idle core can be
found quickly. It provides improvements as the number of busy CPUs
approaches 50% of the domain size when SMT is enabled.
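
As a toy illustration (a userspace model, not the kernel function: the
idle map, core count and SMT width are invented and the p->cpus_ptr test
is dropped), the scan below remembers the first idle CPU it happens to
see and falls back to it when no fully idle core exists:

#include <stdbool.h>
#include <stdio.h>

#define NR_CORES        4
#define SMT             2

/* cpu_idle[core][thread]: no core has both threads idle in this example */
static const bool cpu_idle[NR_CORES][SMT] = {
        { false, true  },
        { true,  false },
        { false, false },
        { false, true  },
};

static int select_idle_core_model(void)
{
        int idle_candidate = -1;

        for (int core = 0; core < NR_CORES; core++) {
                bool core_is_idle = true;

                for (int thr = 0; thr < SMT; thr++) {
                        int cpu = core * SMT + thr;

                        if (!cpu_idle[core][thr]) {
                                core_is_idle = false;
                                break;  /* a busy sibling disqualifies the core */
                        }
                        if (idle_candidate == -1)
                                idle_candidate = cpu;   /* cache the first idle CPU */
                }

                if (core_is_idle)
                        return core * SMT;      /* an idle core is still preferred */
        }

        return idle_candidate;  /* unconditionally -1 before this patch */
}

int main(void)
{
        printf("selected CPU %d\n", select_idle_core_model());
        return 0;
}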

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

5.10.0-rc6 5.10.0-rc6
schedstat idlecandidate
Hmean 1 500.06 ( 0.00%) 505.67 * 1.12%*
Hmean 2 975.90 ( 0.00%) 974.06 * -0.19%*
Hmean 4 1902.95 ( 0.00%) 1904.43 * 0.08%*
Hmean 8 3761.73 ( 0.00%) 3721.02 * -1.08%*
Hmean 16 6713.93 ( 0.00%) 6769.17 * 0.82%*
Hmean 32 10435.31 ( 0.00%) 10312.58 * -1.18%*
Hmean 64 12325.51 ( 0.00%) 13792.01 * 11.90%*
Hmean 128 21225.21 ( 0.00%) 20963.44 * -1.23%*
Hmean 256 20532.83 ( 0.00%) 20335.62 * -0.96%*
Hmean 320 20334.81 ( 0.00%) 20147.25 * -0.92%*

Note that there is a significant corner case. As the SMT scan may be
terminated early, not all CPUs have been visited and select_idle_cpu()
is still called for a full scan. This case is handled in the next
patch.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 413d895bbbf8..ed6f45832d97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
*/
static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
{
+ int idle_candidate = -1;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int core, cpu;

@@ -6085,6 +6086,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
idle = false;
break;
}
+
+ if (idle_candidate == -1 &&
+ cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ idle_candidate = cpu;
+ }
}

if (idle)
@@ -6098,7 +6104,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
*/
set_idle_cores(target, 0);

- return -1;
+ return idle_candidate;
}

/*
--
2.26.2

2020-12-08 16:07:00

by Vincent Guittot

Subject: Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

On Tue, 8 Dec 2020 at 16:35, Mel Gorman <[email protected]> wrote:
>
> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> check and while we are at it, exclude the cost of initialising the CPU
> mask from the average scan cost.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> kernel/sched/fair.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ac7b34e7372b..5c41875aec23 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> if (!this_sd)
> return -1;

Just noticed while reviewing the patch that the above related to
this_sd can also go under sched_feat(SIS_PROP)

>
> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
> if (sched_feat(SIS_PROP)) {
> u64 avg_cost, avg_idle, span_avg;
>
> @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> nr = div_u64(span_avg, avg_cost);
> else
> nr = 4;
> - }
> -
> - time = cpu_clock(this);
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + time = cpu_clock(this);
> + }
>
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> @@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> break;
> }
>
> - time = cpu_clock(this) - time;
> - update_avg(&this_sd->avg_scan_cost, time);
> + if (sched_feat(SIS_PROP)) {
> + time = cpu_clock(this) - time;
> + update_avg(&this_sd->avg_scan_cost, time);
> + }
>
> return cpu;
> }
> --
> 2.26.2
>

2020-12-08 16:16:20

by Vincent Guittot

Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On Tue, 8 Dec 2020 at 16:35, Mel Gorman <[email protected]> wrote:
>
> SIS_AVG_CPU was introduced as a means of avoiding a search when the
> average search cost indicated that the search would likely fail. It
> was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
> select_idle_cpu() more aggressive") and later replaced with a proportional
> search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
> scale select_idle_cpu()").
>
> While there are corner cases where SIS_AVG_CPU is better, it has now been
> disabled for almost three years. As the intent of SIS_PROP is to reduce
> the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
> on SIS_PROP as a throttling mechanism.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 20 +++++++++-----------
> kernel/sched/features.h | 1 -
> 2 files changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 98075f9ea9a8..ac7b34e7372b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> struct sched_domain *this_sd;
> - u64 avg_cost, avg_idle;
> u64 time;
> int this = smp_processor_id();
> int cpu, nr = INT_MAX;
> @@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> if (!this_sd)
> return -1;
>
> - /*
> - * Due to large variance we need a large fuzz factor; hackbench in
> - * particularly is sensitive here.
> - */
> - avg_idle = this_rq()->avg_idle / 512;
> - avg_cost = this_sd->avg_scan_cost + 1;
> + if (sched_feat(SIS_PROP)) {
> + u64 avg_cost, avg_idle, span_avg;
>
> - if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
> - return -1;
> + /*
> + * Due to large variance we need a large fuzz factor;
> + * hackbench in particularly is sensitive here.
> + */
> + avg_idle = this_rq()->avg_idle / 512;
> + avg_cost = this_sd->avg_scan_cost + 1;
>
> - if (sched_feat(SIS_PROP)) {
> - u64 span_avg = sd->span_weight * avg_idle;
> + span_avg = sd->span_weight * avg_idle;
> if (span_avg > 4*avg_cost)
> nr = div_u64(span_avg, avg_cost);
> else
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 68d369cba9e4..e875eabb6600 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
> /*
> * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> */
> -SCHED_FEAT(SIS_AVG_CPU, false)
> SCHED_FEAT(SIS_PROP, true)
>
> /*
> --
> 2.26.2
>

2020-12-08 16:19:59

by Vincent Guittot

Subject: Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target

On Tue, 8 Dec 2020 at 16:35, Mel Gorman <[email protected]> wrote:
>
> After select_idle_sibling, p->recent_used_cpu is set to the
> new target. However on the next wakeup, prev will be the same as
> recent_used_cpu unless the load balancer has moved the task since the last
> wakeup. It still works, but is less efficient than it can be after all
> the changes that went in since that reduce unnecessary migrations, load
> balancer changes etc. This patch preserves recent_used_cpu for longer.
>
> With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled
>
> 5.10.0-rc6 5.10.0-rc6
> baseline-v2 altrecent-v2
> Hmean 1 508.39 ( 0.00%) 502.05 * -1.25%*
> Hmean 2 986.70 ( 0.00%) 983.65 * -0.31%*
> Hmean 4 1914.55 ( 0.00%) 1920.24 * 0.30%*
> Hmean 8 3702.37 ( 0.00%) 3663.96 * -1.04%*
> Hmean 16 6573.11 ( 0.00%) 6545.58 * -0.42%*
> Hmean 32 10142.57 ( 0.00%) 10253.73 * 1.10%*
> Hmean 64 14348.40 ( 0.00%) 12506.31 * -12.84%*
> Hmean 128 21842.59 ( 0.00%) 21967.13 * 0.57%*
> Hmean 256 20813.75 ( 0.00%) 21534.52 * 3.46%*
> Hmean 320 20684.33 ( 0.00%) 21070.14 * 1.87%*
>
> The different was marginal except for 64 threads which showed in the
> baseline that the result was very unstable where as the patch was much
> more stable. This is somewhat machine specific as on a separate 80-cpu
> Broadwell machine the same test reported.
>
> 5.10.0-rc6 5.10.0-rc6
> baseline-v2 altrecent-v2
> Hmean 1 310.36 ( 0.00%) 291.81 * -5.98%*
> Hmean 2 340.86 ( 0.00%) 547.22 * 60.54%*
> Hmean 4 912.29 ( 0.00%) 1063.21 * 16.54%*
> Hmean 8 2116.40 ( 0.00%) 2103.60 * -0.60%*
> Hmean 16 4232.90 ( 0.00%) 4362.92 * 3.07%*
> Hmean 32 8442.03 ( 0.00%) 8642.10 * 2.37%*
> Hmean 64 11733.91 ( 0.00%) 11473.66 * -2.22%*
> Hmean 128 17727.24 ( 0.00%) 16784.23 * -5.32%*
> Hmean 256 16089.23 ( 0.00%) 16110.79 * 0.13%*
> Hmean 320 15992.60 ( 0.00%) 16071.64 * 0.49%*
>
> schedstats were not used in this series but from an earlier debugging
> effort, the schedstats after the test run were as follows;
>
> Ops SIS Search 5653107942.00 5726545742.00
> Ops SIS Domain Search 3365067916.00 3319768543.00
> Ops SIS Scanned 112173512543.00 99194352541.00
> Ops SIS Domain Scanned 109885472517.00 96787575342.00
> Ops SIS Failures 2923185114.00 2950166441.00
> Ops SIS Recent Used Hit 56547.00 118064916.00
> Ops SIS Recent Used Miss 1590899250.00 354942791.00
> Ops SIS Recent Attempts 1590955797.00 473007707.00
> Ops SIS Search Efficiency 5.04 5.77
> Ops SIS Domain Search Eff 3.06 3.43
> Ops SIS Fast Success Rate 40.47 42.03
> Ops SIS Success Rate 48.29 48.48
> Ops SIS Recent Success Rate 0.00 24.96
>
> First interesting point is the ridiculous number of times runqueues are
> enabled -- almost 97 billion times over the course of 40 minutes
>
> With the patch, "Recent Used Hit" is over 2000 times more likely to
> succeed. The failure rate also increases by quite a lot but the cost is
> marginal even if the "Fast Success Rate" only increases by 2% overall. What
> cannot be observed from these stats is where the biggest impact as these
> stats cover low utilisation to over saturation.
>
> If graphed over time, the graphs show that the sched domain is only
> scanned at negligible rates until the machine is fully busy. With
> low utilisation, the "Fast Success Rate" is almost 100% until the
> machine is fully busy. For 320 clients, the success rate is close to
> 0% which is unsurprising.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 9 +--------
> 1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c41875aec23..413d895bbbf8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>
> /* Check a recently used CPU as a potential idle candidate: */
> recent_used_cpu = p->recent_used_cpu;
> + p->recent_used_cpu = prev;
> if (recent_used_cpu != prev &&
> recent_used_cpu != target &&
> cpus_share_cache(recent_used_cpu, target) &&
> (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> asym_fits_capacity(task_util, recent_used_cpu)) {
> - /*
> - * Replace recent_used_cpu with prev as it is a potential
> - * candidate for the next wake:
> - */
> - p->recent_used_cpu = prev;
> return recent_used_cpu;
> }
>
> @@ -6768,9 +6764,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> /* Fast path */
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> -
> - if (want_affine)
> - current->recent_used_cpu = cpu;
> }
> rcu_read_unlock();
>
> --
> 2.26.2
>

2020-12-08 16:21:12

by Vincent Guittot

Subject: Re: [PATCH 4/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core

On Tue, 8 Dec 2020 at 16:35, Mel Gorman <[email protected]> wrote:
>
> select_idle_core is called when SMT is active and there is likely a free
> core available. It may find idle CPUs but this information is simply
> discarded and the scan starts over again with select_idle_cpu.
>
> This patch caches information on idle CPUs found during the search for
> a core and uses one if no core is found. This is a tradeoff. There may
> be a slight impact when utilisation is low and an idle core can be
> found quickly. It provides improvements as the number of busy CPUs
> approaches 50% of the domain size when SMT is enabled.
>
> With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled
>
> 5.10.0-rc6 5.10.0-rc6
> schedstat idlecandidate
> Hmean 1 500.06 ( 0.00%) 505.67 * 1.12%*
> Hmean 2 975.90 ( 0.00%) 974.06 * -0.19%*
> Hmean 4 1902.95 ( 0.00%) 1904.43 * 0.08%*
> Hmean 8 3761.73 ( 0.00%) 3721.02 * -1.08%*
> Hmean 16 6713.93 ( 0.00%) 6769.17 * 0.82%*
> Hmean 32 10435.31 ( 0.00%) 10312.58 * -1.18%*
> Hmean 64 12325.51 ( 0.00%) 13792.01 * 11.90%*
> Hmean 128 21225.21 ( 0.00%) 20963.44 * -1.23%*
> Hmean 256 20532.83 ( 0.00%) 20335.62 * -0.96%*
> Hmean 320 20334.81 ( 0.00%) 20147.25 * -0.92%*
>
> Note that there is a significant corner case. As the SMT scan may be
> terminated early, not all CPUs have been visited and select_idle_cpu()
> is still called for a full scan. This case is handled in the next
> patch.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 413d895bbbf8..ed6f45832d97 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
> */
> static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + int idle_candidate = -1;
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> int core, cpu;
>
> @@ -6085,6 +6086,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
> idle = false;
> break;
> }
> +
> + if (idle_candidate == -1 &&
> + cpumask_test_cpu(cpu, p->cpus_ptr)) {
> + idle_candidate = cpu;
> + }
> }
>
> if (idle)
> @@ -6098,7 +6104,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
> */
> set_idle_cores(target, 0);
>
> - return -1;
> + return idle_candidate;
> }
>
> /*
> --
> 2.26.2
>

2020-12-08 16:35:20

by Mel Gorman

Subject: Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

On Tue, Dec 08, 2020 at 05:03:21PM +0100, Vincent Guittot wrote:
> On Tue, 8 Dec 2020 at 16:35, Mel Gorman <[email protected]> wrote:
> >
> > As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> > even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> > check and while we are at it, exclude the cost of initialising the CPU
> > mask from the average scan cost.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > kernel/sched/fair.c | 14 ++++++++------
> > 1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac7b34e7372b..5c41875aec23 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > if (!this_sd)
> > return -1;
>
> Just noticed while reviewing the patch that the above related to
> this_sd can also go under sched_feat(SIS_PROP)
>

Technically yes but I also decided against it. It's a functional difference
depending on whether SIS_PROP is set in the highly unlikely case that
this_sd == NULL. I was also thinking in terms of what happens if SIS_PROP
was disabled and enabled while a search is in progress even if it's very
unlikely. In that case, this_sd would be uninitialised. That might be
impossible in practice depending on how static branching is implemented,
but I don't think we should rely on the internals of jump labels; better to
play it safe. I can move it in if you feel strongly about it but I think the
disable/enable race is enough of a concern to leave it alone.

--
Mel Gorman
SUSE Labs

2020-12-09 05:35:02

by Li, Aubrey

Subject: Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

On 2020/12/9 0:03, Vincent Guittot wrote:
> On Tue, 8 Dec 2020 at 16:35, Mel Gorman <[email protected]> wrote:
>>
>> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
>> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
>> check and while we are at it, exclude the cost of initialising the CPU
>> mask from the average scan cost.
>>
>> Signed-off-by: Mel Gorman <[email protected]>
>> ---
>> kernel/sched/fair.c | 14 ++++++++------
>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ac7b34e7372b..5c41875aec23 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>> if (!this_sd)
>> return -1;
>
> Just noticed while reviewing the patch that the above related to
> this_sd can also go under sched_feat(SIS_PROP)
>
>>
>> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +
>> if (sched_feat(SIS_PROP)) {
>> u64 avg_cost, avg_idle, span_avg;
>>
>> @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>> nr = div_u64(span_avg, avg_cost);
>> else
>> nr = 4;
>> - }
>> -
>> - time = cpu_clock(this);
>>
>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> + time = cpu_clock(this);
>> + }
>>
>> for_each_cpu_wrap(cpu, cpus, target) {
>> if (!--nr)

nr is the key to this throttling mechanism and needs to be placed under sched_feat(SIS_PROP) as well.

Thanks,
-Aubrey

2020-12-09 09:08:07

by Mel Gorman

Subject: Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
> >> nr = div_u64(span_avg, avg_cost);
> >> else
> >> nr = 4;
> >> - }
> >> -
> >> - time = cpu_clock(this);
> >>
> >> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >> + time = cpu_clock(this);
> >> + }
> >>
> >> for_each_cpu_wrap(cpu, cpus, target) {
> >> if (!--nr)
>
> nr is the key of this throttling mechanism, need to be placed under sched_feat(SIS_PROP) as well.
>

It isn't necessary as nr is initialised to INT_MAX if !SIS_PROP.
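
A trivial standalone check of that point (the 256-CPU span is an
arbitrary illustration): with nr starting at INT_MAX, the "if (!--nr)"
cut-off can never fire on a realistic LLC, so leaving it outside the
SIS_PROP guard only costs a decrement and a branch per CPU.

#include <limits.h>
#include <stdio.h>

int main(void)
{
        int nr = INT_MAX;       /* the value used when SIS_PROP is disabled */
        int scanned = 0;

        for (int cpu = 0; cpu < 256; cpu++) {   /* even an unusually large LLC */
                if (!--nr)
                        break;
                scanned++;
        }

        printf("scanned %d CPUs, nr is still %d\n", scanned, nr);
        return 0;
}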

--
Mel Gorman
SUSE Labs

2020-12-09 11:12:00

by Li, Aubrey

Subject: Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

On 2020/12/9 17:05, Mel Gorman wrote:
> On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
>>>> nr = div_u64(span_avg, avg_cost);
>>>> else
>>>> nr = 4;
>>>> - }
>>>> -
>>>> - time = cpu_clock(this);
>>>>
>>>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>> + time = cpu_clock(this);
>>>> + }
>>>>
>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>> if (!--nr)
>>
>> nr is the key of this throttling mechanism, need to be placed under sched_feat(SIS_PROP) as well.
>>
>
> It isn't necessary as nr is initialised to INT_MAX if !SIS_PROP.

If !SIS_PROP, nr needs to be decremented and then tested in the loop,
instead of being tested directly. But with SIS_PROP, a test would need to
be added in the loop. Since SIS_PROP is default true, I think it's okay to
keep the current way.

Thanks,
-Aubrey

2020-12-09 11:36:28

by Mel Gorman

Subject: Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

On Wed, Dec 09, 2020 at 07:07:11PM +0800, Li, Aubrey wrote:
> On 2020/12/9 17:05, Mel Gorman wrote:
> > On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
> >>>> nr = div_u64(span_avg, avg_cost);
> >>>> else
> >>>> nr = 4;
> >>>> - }
> >>>> -
> >>>> - time = cpu_clock(this);
> >>>>
> >>>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>>> + time = cpu_clock(this);
> >>>> + }
> >>>>
> >>>> for_each_cpu_wrap(cpu, cpus, target) {
> >>>> if (!--nr)
> >>
> >> nr is the key of this throttling mechanism, need to be placed under sched_feat(SIS_PROP) as well.
> >>
> >
> > It isn't necessary as nr in initialised to INT_MAX if !SIS_PROP.
> >If !SIS_PROP, nr need to -1 then tested in the loop, instead of testing directly.
> But with SIS_PROP, need adding a test in the loop.
> Since SIS_PROP is default true, I think it's okay to keep the current way.
>

It's because SIS_PROP is default true that I'm leaving it alone. The branch
and nr accounting costs are negligible and leaving them unconditional avoids
peppering select_idle_cpu() with too many SIS_PROP checks.

--
Mel Gorman
SUSE Labs

2020-12-09 15:22:52

by Mel Gorman

Subject: Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

On Tue, Dec 08, 2020 at 03:34:57PM +0000, Mel Gorman wrote:
> Changelog since v1
> o Drop single-pass patch (vincent)
> o Scope variables used for SIS_AVG_CPU (dietmar)
> o Remove redundant assignment (dietmar
>
> This reduces the amount of runqueue scanning in select_idle_sibling in
> the worst case.
>
> Patch 1 removes SIS_AVG_CPU because it's unused.
>
> Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
>
> Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
> of scanning. It should be relatively uncontroversial
>
> Patch 4 returns an idle candidate if one is found while scanning for a
> free core.
>

Any other objections to the series? Vincent marked 1, 3 and 4 as
reviewed. While patch 2 had some mild cosmetic concerns, I think the
version and how it treats SIS_PROP is fine as it is, keeping it
functionally equivalent to !SIS_PROP without adding too many
SIS_PROP checks.

--
Mel Gorman
SUSE Labs

2020-12-10 08:05:29

by Vincent Guittot

Subject: Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

On Wed, 9 Dec 2020 at 15:37, Mel Gorman <[email protected]> wrote:
>
> On Tue, Dec 08, 2020 at 03:34:57PM +0000, Mel Gorman wrote:
> > Changelog since v1
> > o Drop single-pass patch (vincent)
> > o Scope variables used for SIS_AVG_CPU (dietmar)
> > o Remove redundant assignment (dietmar
> >
> > This reduces the amount of runqueue scanning in select_idle_sibling in
> > the worst case.
> >
> > Patch 1 removes SIS_AVG_CPU because it's unused.
> >
> > Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
> >
> > Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
> > of scanning. It should be relatively uncontroversial
> >
> > Patch 4 returns an idle candidate if one is found while scanning for a
> > free core.
> >
>
> Any other objections to the series? Vincent marked 1, 3 and 4 as
> reviewed. While patch 2 had some mild cosmetic concerns, I think the
> version and how it treats SIS_PROP is fine as it is to keep it
> functionally equivalent to !SIS_PROP and without adding too many
> SIS_PROP checks.

While testing your patchset and Aubrey's one on top of tip, I'm facing
some perf regressions on my arm64 NUMA system on hackbench and reaim.
The regression seems to come from your patchset but I don't know
which patch in particular yet.

hackbench -l 256000 -g 1

v5.10-rc7 + tip/sched/core     13.255 (+/- 3.22%)
with your patchset             15.368 (+/- 2.74%)   -15.9%

I'm also seeing a perf regression on reaim but this one needs more
investigation before confirming.

TBH, I was not expecting regressions. I'm running more tests to find
which patch is the culprit.


>
> --
> Mel Gorman
> SUSE Labs

2020-12-10 09:36:07

by Mel Gorman

Subject: Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

On Thu, Dec 10, 2020 at 01:18:05PM +0800, Li, Aubrey wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac7b34e7372b..5c41875aec23 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > if (!this_sd)
> > return -1;
> >
> > + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +
> > if (sched_feat(SIS_PROP)) {
> > u64 avg_cost, avg_idle, span_avg;
> >
> > @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > nr = div_u64(span_avg, avg_cost);
> > else
> > nr = 4;
> > - }
> > -
> > - time = cpu_clock(this);
> >
> > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > + time = cpu_clock(this);
> > + }
> >
> > for_each_cpu_wrap(cpu, cpus, target) {
> > if (!--nr)
> > return -1;
>
> I thought about this again and here seems not to be consistent:
> - even if nr reduces to 0, shouldn't avg_scan_cost be updated as well before return -1?

You're right, but it's outside the scope of this patch. I noted that
this was a problem in
lore.kernel.org/r/[email protected]
Always accounting for it is neither a consistent win nor a consistent
loss, so it was dropped for this series to keep the number of
controversial patches to a minimum.

> - if avg_scan_cost is not updated because nr is throttled, the first
> time = cpu_clock(this);
> can be optimized. As nr is calculated and we already know which of the weight of cpumask and nr is greater.
>

That is also outside the scope of this patch. To do that, cpumask_weight()
would have to be calculated but it's likely to be a net loss. Even under
light load, nr will be smaller than the domain weight, incurring both the
cost of cpumask_weight() and the clock read in the common case.

--
Mel Gorman
SUSE Labs

2020-12-10 09:43:39

by Vincent Guittot

Subject: Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

On Thu, 10 Dec 2020 at 09:00, Vincent Guittot
<[email protected]> wrote:
>
> On Wed, 9 Dec 2020 at 15:37, Mel Gorman <[email protected]> wrote:
> >
> > On Tue, Dec 08, 2020 at 03:34:57PM +0000, Mel Gorman wrote:
> > > Changelog since v1
> > > o Drop single-pass patch (vincent)
> > > o Scope variables used for SIS_AVG_CPU (dietmar)
> > > o Remove redundant assignment (dietmar
> > >
> > > This reduces the amount of runqueue scanning in select_idle_sibling in
> > > the worst case.
> > >
> > > Patch 1 removes SIS_AVG_CPU because it's unused.
> > >
> > > Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
> > >
> > > Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
> > > of scanning. It should be relatively uncontroversial
> > >
> > > Patch 4 returns an idle candidate if one is found while scanning for a
> > > free core.
> > >
> >
> > Any other objections to the series? Vincent marked 1, 3 and 4 as
> > reviewed. While patch 2 had some mild cosmetic concerns, I think the
> > version and how it treats SIS_PROP is fine as it is to keep it
> > functionally equivalent to !SIS_PROP and without adding too many
> > SIS_PROP checks.
>
> while testing your patchset and Aubrey one on top of tip, I'm facing
> some perf regression on my arm64 numa system on hackbench and reaim.
> The regression seems to comes from your patchset but i don't know
> which patch in particular yet
>
> hackbench -l 256000 -g 1
>
> v5.10-rc7 + tip/sched/core 13,255(+/- 3.22%)
> with your patchset 15.368(+/- 2.74) -15.9%
>
> I'm also seeing perf regression on reaim but this one needs more
> investigation before confirming
>
> TBH, I was not expecting regressions. I'm running more test to find
> which patch is the culprit

The regression comes from patch 3: sched/fair: Do not replace
recent_used_cpu with the new target


>
>
> >
> > --
> > Mel Gorman
> > SUSE Labs

2020-12-10 09:45:13

by Vincent Guittot

Subject: Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target

On Tue, 8 Dec 2020 at 17:14, Vincent Guittot <[email protected]> wrote:
>
> On Tue, 8 Dec 2020 at 16:35, Mel Gorman <[email protected]> wrote:
> >
> > After select_idle_sibling, p->recent_used_cpu is set to the
> > new target. However on the next wakeup, prev will be the same as
> > recent_used_cpu unless the load balancer has moved the task since the last
> > wakeup. It still works, but is less efficient than it can be after all
> > the changes that went in since that reduce unnecessary migrations, load
> > balancer changes etc. This patch preserves recent_used_cpu for longer.
> >
> > With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled
> >
> > 5.10.0-rc6 5.10.0-rc6
> > baseline-v2 altrecent-v2
> > Hmean 1 508.39 ( 0.00%) 502.05 * -1.25%*
> > Hmean 2 986.70 ( 0.00%) 983.65 * -0.31%*
> > Hmean 4 1914.55 ( 0.00%) 1920.24 * 0.30%*
> > Hmean 8 3702.37 ( 0.00%) 3663.96 * -1.04%*
> > Hmean 16 6573.11 ( 0.00%) 6545.58 * -0.42%*
> > Hmean 32 10142.57 ( 0.00%) 10253.73 * 1.10%*
> > Hmean 64 14348.40 ( 0.00%) 12506.31 * -12.84%*
> > Hmean 128 21842.59 ( 0.00%) 21967.13 * 0.57%*
> > Hmean 256 20813.75 ( 0.00%) 21534.52 * 3.46%*
> > Hmean 320 20684.33 ( 0.00%) 21070.14 * 1.87%*
> >
> > The different was marginal except for 64 threads which showed in the
> > baseline that the result was very unstable where as the patch was much
> > more stable. This is somewhat machine specific as on a separate 80-cpu
> > Broadwell machine the same test reported.
> >
> > 5.10.0-rc6 5.10.0-rc6
> > baseline-v2 altrecent-v2
> > Hmean 1 310.36 ( 0.00%) 291.81 * -5.98%*
> > Hmean 2 340.86 ( 0.00%) 547.22 * 60.54%*
> > Hmean 4 912.29 ( 0.00%) 1063.21 * 16.54%*
> > Hmean 8 2116.40 ( 0.00%) 2103.60 * -0.60%*
> > Hmean 16 4232.90 ( 0.00%) 4362.92 * 3.07%*
> > Hmean 32 8442.03 ( 0.00%) 8642.10 * 2.37%*
> > Hmean 64 11733.91 ( 0.00%) 11473.66 * -2.22%*
> > Hmean 128 17727.24 ( 0.00%) 16784.23 * -5.32%*
> > Hmean 256 16089.23 ( 0.00%) 16110.79 * 0.13%*
> > Hmean 320 15992.60 ( 0.00%) 16071.64 * 0.49%*
> >
> > schedstats were not used in this series but from an earlier debugging
> > effort, the schedstats after the test run were as follows;
> >
> > Ops SIS Search 5653107942.00 5726545742.00
> > Ops SIS Domain Search 3365067916.00 3319768543.00
> > Ops SIS Scanned 112173512543.00 99194352541.00
> > Ops SIS Domain Scanned 109885472517.00 96787575342.00
> > Ops SIS Failures 2923185114.00 2950166441.00
> > Ops SIS Recent Used Hit 56547.00 118064916.00
> > Ops SIS Recent Used Miss 1590899250.00 354942791.00
> > Ops SIS Recent Attempts 1590955797.00 473007707.00
> > Ops SIS Search Efficiency 5.04 5.77
> > Ops SIS Domain Search Eff 3.06 3.43
> > Ops SIS Fast Success Rate 40.47 42.03
> > Ops SIS Success Rate 48.29 48.48
> > Ops SIS Recent Success Rate 0.00 24.96
> >
> > First interesting point is the ridiculous number of times runqueues are
> > enabled -- almost 97 billion times over the course of 40 minutes
> >
> > With the patch, "Recent Used Hit" is over 2000 times more likely to
> > succeed. The failure rate also increases by quite a lot but the cost is
> > marginal even if the "Fast Success Rate" only increases by 2% overall. What
> > cannot be observed from these stats is where the biggest impact as these
> > stats cover low utilisation to over saturation.
> >
> > If graphed over time, the graphs show that the sched domain is only
> > scanned at negligible rates until the machine is fully busy. With
> > low utilisation, the "Fast Success Rate" is almost 100% until the
> > machine is fully busy. For 320 clients, the success rate is close to
> > 0% which is unsurprising.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Reviewed-by: Vincent Guittot <[email protected]>

This patch is responsible for a performance regression on my thx2 with
hackbench. So although I reviewed it, it should not be applied, as the
change in behaviour is far deeper than expected.

>
> > ---
> > kernel/sched/fair.c | 9 +--------
> > 1 file changed, 1 insertion(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 5c41875aec23..413d895bbbf8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >
> > /* Check a recently used CPU as a potential idle candidate: */
> > recent_used_cpu = p->recent_used_cpu;
> > + p->recent_used_cpu = prev;
> > if (recent_used_cpu != prev &&
> > recent_used_cpu != target &&
> > cpus_share_cache(recent_used_cpu, target) &&
> > (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> > cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> > asym_fits_capacity(task_util, recent_used_cpu)) {
> > - /*
> > - * Replace recent_used_cpu with prev as it is a potential
> > - * candidate for the next wake:
> > - */
> > - p->recent_used_cpu = prev;
> > return recent_used_cpu;
> > }
> >
> > @@ -6768,9 +6764,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> > /* Fast path */
> > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > -
> > - if (want_affine)
> > - current->recent_used_cpu = cpu;
> > }
> > rcu_read_unlock();
> >
> > --
> > 2.26.2
> >

2020-12-10 10:59:39

by Li, Aubrey

Subject: Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

On 2020/12/8 23:34, Mel Gorman wrote:
> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> check and while we are at it, exclude the cost of initialising the CPU
> mask from the average scan cost.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> kernel/sched/fair.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ac7b34e7372b..5c41875aec23 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> if (!this_sd)
> return -1;
>
> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
> if (sched_feat(SIS_PROP)) {
> u64 avg_cost, avg_idle, span_avg;
>
> @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> nr = div_u64(span_avg, avg_cost);
> else
> nr = 4;
> - }
> -
> - time = cpu_clock(this);
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + time = cpu_clock(this);
> + }
>
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> return -1;

I thought about this again and it seems inconsistent here:
- even if nr reduces to 0, shouldn't avg_scan_cost be updated as well
  before returning -1?
- if avg_scan_cost is not updated because nr is throttled, the first
  time = cpu_clock(this);
  can be optimized, as nr is already calculated and we know which of the
  cpumask weight and nr is greater.

Thanks,
-Aubrey

2020-12-10 11:56:11

by Mel Gorman

Subject: Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > while testing your patchset and Aubrey one on top of tip, I'm facing
> > some perf regression on my arm64 numa system on hackbench and reaim.
> > The regression seems to comes from your patchset but i don't know
> > which patch in particular yet
> >
> > hackbench -l 256000 -g 1
> >
> > v5.10-rc7 + tip/sched/core 13,255(+/- 3.22%)
> > with your patchset 15.368(+/- 2.74) -15.9%
> >
> > I'm also seeing perf regression on reaim but this one needs more
> > investigation before confirming
> >
> > TBH, I was not expecting regressions. I'm running more test to find
> > which patch is the culprit
>
> The regression comes from patch 3: sched/fair: Do not replace
> recent_used_cpu with the new target
>

That's not entirely surprising. The intent of the patch is to increase the
hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
corner cases. If multiple tasks have the same p->recent_used_cpu, they can
race to use that CPU and stack as a result instead of searching the domain.
If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
a busy sibling which the search would have avoided in select_idle_core().

I think you are using processes and sockets for hackbench but as you'll
see later, hackbench can be used both to show losses and gains.

I originally tested with 6 machines covering Broadwell (2 socket), Haswell
(2 socket), Skylake (1 socket), Cascadelake (2 socket), EPYC (2 socket)
and EPYC 2 (2 socket) with a range of workloads including hackbench. Of
those, just one reported a major problem with 1 group -- the EPYC 1 machine

EPYC hackbench process-sockets
5.10.0-rc6 5.10.0-rc6
baseline-v2r2 altrecent-v2r5
Amean 1 1.0607 ( 0.00%) 1.1480 ( -8.23%)
Amean 4 1.3277 ( 0.00%) 1.3117 ( 1.21%)
Amean 7 1.6940 ( 0.00%) 1.6950 ( -0.06%)
Amean 12 2.1600 ( 0.00%) 2.1367 ( 1.08%)
Amean 21 3.2450 ( 0.00%) 3.5883 ( -10.58%)
Amean 30 4.1673 ( 0.00%) 3.9653 ( 4.85%)
Amean 48 4.9257 ( 0.00%) 5.0000 ( -1.51%)
Amean 79 7.4950 ( 0.00%) 7.4563 ( 0.52%)
Amean 110 10.4233 ( 0.00%) 10.4727 ( -0.47%)
Amean 141 13.4690 ( 0.00%) 13.4563 ( 0.09%)
Amean 172 16.6450 ( 0.00%) 16.6033 ( 0.25%)
Amean 203 19.4873 ( 0.00%) 19.7893 * -1.55%*
Amean 234 22.5507 ( 0.00%) 22.8033 ( -1.12%)
Amean 265 25.3380 ( 0.00%) 25.6490 ( -1.23%)
Amean 296 28.0070 ( 0.00%) 28.1270 ( -0.43%)

That's showing an 8% loss for 1 group and also a problem with 21 groups.
Otherwise, it was more or less flat. EPYC 2 also showed a 2% loss for 1
group and 9% loss for 21 groups (probably related to the size of the LLC
domain as there are many LLCs per socket on EPYC*).

For the *same* machine running hackbench using pipes instead of sockets
we get

EPYC hackbench process-pipes
Amean 1 0.9497 ( 0.00%) 0.9517 ( -0.21%)
Amean 4 1.2253 ( 0.00%) 1.1387 ( 7.07%)
Amean 7 2.0677 ( 0.00%) 1.7460 * 15.56%*
Amean 12 2.8717 ( 0.00%) 2.4797 * 13.65%*
Amean 21 4.4053 ( 0.00%) 3.7463 * 14.96%*
Amean 30 5.3983 ( 0.00%) 4.1097 * 23.87%*
Amean 48 6.1050 ( 0.00%) 4.6873 * 23.22%*
Amean 79 7.5640 ( 0.00%) 6.8493 ( 9.45%)
Amean 110 12.2627 ( 0.00%) 9.4613 * 22.84%*
Amean 141 16.9980 ( 0.00%) 13.8327 * 18.62%*
Amean 172 21.5280 ( 0.00%) 17.3693 * 19.32%*
Amean 203 25.4480 ( 0.00%) 20.9947 * 17.50%*
Amean 234 29.6570 ( 0.00%) 24.9613 * 15.83%*
Amean 265 33.0713 ( 0.00%) 28.1103 * 15.00%*
Amean 296 37.4443 ( 0.00%) 31.8757 * 14.87%*

So even on the *same hardware*, hackbench can show very different results
depending on how it is run.

The rest of the machines were more or less neutral for this patch. Once
hackbench saturates the machine, the hit rate on recent_used_cpu is going
to be low

1-socket skylake
Amean 1 1.3183 ( 0.00%) 1.2827 * 2.71%*
Amean 3 3.6750 ( 0.00%) 3.6610 ( 0.38%)
Amean 5 6.1003 ( 0.00%) 6.0190 * 1.33%*
Amean 7 8.6063 ( 0.00%) 8.6047 ( 0.02%)
Amean 12 14.9480 ( 0.00%) 15.0327 ( -0.57%)
Amean 18 22.3430 ( 0.00%) 22.6680 ( -1.45%)
Amean 24 29.4970 ( 0.00%) 29.6677 ( -0.58%)
Amean 30 36.7373 ( 0.00%) 36.3687 ( 1.00%)
Amean 32 39.0973 ( 0.00%) 39.4417 ( -0.88%)

Shows a 2.71% gain for one group, otherwise more or less neutral

2-socket CascadeLake

Amean 1 0.3663 ( 0.00%) 0.3657 ( 0.18%)
Amean 4 0.7510 ( 0.00%) 0.7793 ( -3.77%)
Amean 7 1.2650 ( 0.00%) 1.2583 ( 0.53%)
Amean 12 1.9510 ( 0.00%) 1.9450 ( 0.31%)
Amean 21 2.9677 ( 0.00%) 3.0277 ( -2.02%)
Amean 30 4.2993 ( 0.00%) 4.0237 * 6.41%*
Amean 48 6.5373 ( 0.00%) 6.2987 * 3.65%*
Amean 79 10.5513 ( 0.00%) 10.3280 ( 2.12%)
Amean 110 15.8567 ( 0.00%) 13.9817 ( 11.82%)
Amean 141 17.4243 ( 0.00%) 17.3177 ( 0.61%)
Amean 172 21.0473 ( 0.00%) 20.9760 ( 0.34%)
Amean 203 25.1070 ( 0.00%) 25.1150 ( -0.03%)
Amean 234 28.6753 ( 0.00%) 28.9383 ( -0.92%)
Amean 265 32.7970 ( 0.00%) 32.9663 ( -0.52%)
Amean 296 36.6510 ( 0.00%) 36.6753 ( -0.07%)

Neutral for 1 group, small regression for 4 groups, few gains around the
middle, neutral when over-saturated.

select_idle_sibling is a curse because it's very rare that a change to
it is a universal win. On balance, I think it's better to avoid searching
the domain at all where possible even if there are cases where searching
can have a benefit such as finding an idle core instead of picking an
idle CPU with a busy sibling via p->recent_used_cpu.

--
Mel Gorman
SUSE Labs

2020-12-11 13:23:26

by Vincent Guittot

Subject: Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

On Thu, 10 Dec 2020 at 12:04, Mel Gorman <[email protected]> wrote:
>
> On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > > while testing your patchset and Aubrey one on top of tip, I'm facing
> > > some perf regression on my arm64 numa system on hackbench and reaim.
> > > The regression seems to comes from your patchset but i don't know
> > > which patch in particular yet
> > >
> > > hackbench -l 256000 -g 1
> > >
> > > v5.10-rc7 + tip/sched/core 13,255(+/- 3.22%)
> > > with your patchset 15.368(+/- 2.74) -15.9%
> > >
> > > I'm also seeing perf regression on reaim but this one needs more
> > > investigation before confirming
> > >
> > > TBH, I was not expecting regressions. I'm running more test to find
> > > which patch is the culprit
> >
> > The regression comes from patch 3: sched/fair: Do not replace
> > recent_used_cpu with the new target
> >
>
> That's not entirely surprising. The intent of the patch is to increase the
> hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
> corner cases. If multiple tasks have the same p->recent_used_cpu, they can
> race to use that CPU and stack as a result instead of searching the domain.
> If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
> a busy sibling which the search would have avoided in select_idle_core().
>
> I think you are using processes and sockets for hackbench but as you'll
> see later, hackbench can be used both to show losses and gains.

I ran more hackbench tests with pipes and sockets and both show a
regression with patch 3, whereas there is a significant improvement with
the other patches and Aubrey's one.

>
> I originally tested with 6 machines covering Broadwell (2 socket), Haswell
> (2 socket), Skylake (1 socket), Cascadelake (2 socket), EPYC (2 socket)
> and EPYC 2 (2 socket) with a range of workloads including hackbench. Of
> those, just one reported a major problem with 1 group -- the EPYC 1 machine
>
> EPYC hackbench process-sockets
> 5.10.0-rc6 5.10.0-rc6
> baseline-v2r2 altrecent-v2r5
> Amean 1 1.0607 ( 0.00%) 1.1480 ( -8.23%)
> Amean 4 1.3277 ( 0.00%) 1.3117 ( 1.21%)
> Amean 7 1.6940 ( 0.00%) 1.6950 ( -0.06%)
> Amean 12 2.1600 ( 0.00%) 2.1367 ( 1.08%)
> Amean 21 3.2450 ( 0.00%) 3.5883 ( -10.58%)
> Amean 30 4.1673 ( 0.00%) 3.9653 ( 4.85%)
> Amean 48 4.9257 ( 0.00%) 5.0000 ( -1.51%)
> Amean 79 7.4950 ( 0.00%) 7.4563 ( 0.52%)
> Amean 110 10.4233 ( 0.00%) 10.4727 ( -0.47%)
> Amean 141 13.4690 ( 0.00%) 13.4563 ( 0.09%)
> Amean 172 16.6450 ( 0.00%) 16.6033 ( 0.25%)
> Amean 203 19.4873 ( 0.00%) 19.7893 * -1.55%*
> Amean 234 22.5507 ( 0.00%) 22.8033 ( -1.12%)
> Amean 265 25.3380 ( 0.00%) 25.6490 ( -1.23%)
> Amean 296 28.0070 ( 0.00%) 28.1270 ( -0.43%)
>
> That's showing an 8% loss for 1 group and also a problem with 21 groups.
> Otherwise, it was more or less flat. EPYC 2 also showed a 2% loss for 1
> group and 9% loss for 21 groups (probably related to the size of the LLC
> domain as there are many LLCs per socket on EPYC*).
>
> For the *same* machine running hackbench using pipes instead of sockets
> we get
>
> EPYC hackbench process-pipes
> Amean 1 0.9497 ( 0.00%) 0.9517 ( -0.21%)
> Amean 4 1.2253 ( 0.00%) 1.1387 ( 7.07%)
> Amean 7 2.0677 ( 0.00%) 1.7460 * 15.56%*
> Amean 12 2.8717 ( 0.00%) 2.4797 * 13.65%*
> Amean 21 4.4053 ( 0.00%) 3.7463 * 14.96%*
> Amean 30 5.3983 ( 0.00%) 4.1097 * 23.87%*
> Amean 48 6.1050 ( 0.00%) 4.6873 * 23.22%*
> Amean 79 7.5640 ( 0.00%) 6.8493 ( 9.45%)
> Amean 110 12.2627 ( 0.00%) 9.4613 * 22.84%*
> Amean 141 16.9980 ( 0.00%) 13.8327 * 18.62%*
> Amean 172 21.5280 ( 0.00%) 17.3693 * 19.32%*
> Amean 203 25.4480 ( 0.00%) 20.9947 * 17.50%*
> Amean 234 29.6570 ( 0.00%) 24.9613 * 15.83%*
> Amean 265 33.0713 ( 0.00%) 28.1103 * 15.00%*
> Amean 296 37.4443 ( 0.00%) 31.8757 * 14.87%*
>
> So even on the *same hardware*, hackbench can show very different results
> depending on how it is run.
>
> The rest of the machines were more or less neutral for this patch. Once
> hackbench saturates the machine, the hit rate on recent_used_cpu is going
> to be low
>
> 1-socket skylake
> Amean 1 1.3183 ( 0.00%) 1.2827 * 2.71%*
> Amean 3 3.6750 ( 0.00%) 3.6610 ( 0.38%)
> Amean 5 6.1003 ( 0.00%) 6.0190 * 1.33%*
> Amean 7 8.6063 ( 0.00%) 8.6047 ( 0.02%)
> Amean 12 14.9480 ( 0.00%) 15.0327 ( -0.57%)
> Amean 18 22.3430 ( 0.00%) 22.6680 ( -1.45%)
> Amean 24 29.4970 ( 0.00%) 29.6677 ( -0.58%)
> Amean 30 36.7373 ( 0.00%) 36.3687 ( 1.00%)
> Amean 32 39.0973 ( 0.00%) 39.4417 ( -0.88%)
>
> Shows a 2.71% gain for one group, otherwise more or less neutral
>
> 2-socket CascadeLake
>
> Amean 1 0.3663 ( 0.00%) 0.3657 ( 0.18%)
> Amean 4 0.7510 ( 0.00%) 0.7793 ( -3.77%)
> Amean 7 1.2650 ( 0.00%) 1.2583 ( 0.53%)
> Amean 12 1.9510 ( 0.00%) 1.9450 ( 0.31%)
> Amean 21 2.9677 ( 0.00%) 3.0277 ( -2.02%)
> Amean 30 4.2993 ( 0.00%) 4.0237 * 6.41%*
> Amean 48 6.5373 ( 0.00%) 6.2987 * 3.65%*
> Amean 79 10.5513 ( 0.00%) 10.3280 ( 2.12%)
> Amean 110 15.8567 ( 0.00%) 13.9817 ( 11.82%)
> Amean 141 17.4243 ( 0.00%) 17.3177 ( 0.61%)
> Amean 172 21.0473 ( 0.00%) 20.9760 ( 0.34%)
> Amean 203 25.1070 ( 0.00%) 25.1150 ( -0.03%)
> Amean 234 28.6753 ( 0.00%) 28.9383 ( -0.92%)
> Amean 265 32.7970 ( 0.00%) 32.9663 ( -0.52%)
> Amean 296 36.6510 ( 0.00%) 36.6753 ( -0.07%)
>
> Neutral for 1 group, small regression for 4 groups, few gains around the
> middle, neutral when over-saturated.
>
> select_idle_sibling is a curse because it's very rare that a change to
> it is a universal win. On balance, I think it's better to avoid searching
> the domain at all where possible even if there are cases where searching
> can have a benefit such as finding an idle core instead of picking an
> idle CPU with a busy sibling via p->recent_used_cpu.
>
> --
> Mel Gorman
> SUSE Labs

2020-12-11 13:35:57

by Mel Gorman

Subject: Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

On Fri, Dec 11, 2020 at 10:51:17AM +0100, Vincent Guittot wrote:
> On Thu, 10 Dec 2020 at 12:04, Mel Gorman <[email protected]> wrote:
> >
> > On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > > > While testing your patchset and Aubrey's one on top of tip, I'm seeing
> > > > some perf regressions on my arm64 NUMA system with hackbench and reaim.
> > > > The regression seems to come from your patchset, but I don't know
> > > > which patch in particular yet.
> > > >
> > > > hackbench -l 256000 -g 1
> > > >
> > > > v5.10-rc7 + tip/sched/core    13.255 (+/- 3.22%)
> > > > with your patchset            15.368 (+/- 2.74%)   -15.9%
> > > >
> > > > I'm also seeing a perf regression on reaim but this one needs more
> > > > investigation before confirming
> > > >
> > > > TBH, I was not expecting regressions. I'm running more tests to find
> > > > which patch is the culprit
> > >
> > > The regression comes from patch 3: sched/fair: Do not replace
> > > recent_used_cpu with the new target
> > >
> >
> > That's not entirely surprising. The intent of the patch is to increase the
> > hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
> > corner cases. If multiple tasks have the same p->recent_used_cpu, they can
> > race to use that CPU and stack as a result instead of searching the domain.
> > If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
> > a busy sibling which the search would have avoided in select_idle_core().
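> >
> > Purely as an illustration of that second corner case (not something in
> > this series), the core-level check that select_idle_core() provides and
> > that the recent_used_cpu shortcut skips would look roughly like this,
> > assuming CONFIG_SCHED_SMT and the 5.10-era helpers:
> >
> > 	if (static_branch_likely(&sched_smt_present)) {
> > 		int sibling;
> >
> > 		for_each_cpu(sibling, cpu_smt_mask(recent_used_cpu)) {
> > 			if (!available_idle_cpu(sibling)) {
> > 				/* Busy sibling: fall back to the domain scan */
> > 				recent_used_cpu = -1;
> > 				break;
> > 			}
> > 		}
> > 	}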
> >
> > I think you are using processes and sockets for hackbench but, as you'll
> > see later, hackbench can be used to show both losses and gains.
>
> I ran more hackbench tests with pipes and sockets and both show a
> regression with patch 3, whereas there is a significant improvement with
> the other patches and Aubrey's one
>

Is SMT enabled on your test machine? If not, then patch 4 should make no
difference, but if SMT is enabled, I wonder how this untested version of
patch 3 behaves for you. The main difference is that the recently used
CPU is treated as a search target, so the search still checks whether it
is part of an idle core and, if not, falls through so it can be used as
an idle CPU once the later search confirms it is allowed by p->cpus_ptr.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c41875aec23..63980bcf6e70 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6275,21 +6275,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return prev;
}

- /* Check a recently used CPU as a potential idle candidate: */
+ /* Check a recently used CPU as a search target: */
recent_used_cpu = p->recent_used_cpu;
+ p->recent_used_cpu = prev;
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
- (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
- cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
- asym_fits_capacity(task_util, recent_used_cpu)) {
- /*
- * Replace recent_used_cpu with prev as it is a potential
- * candidate for the next wake:
- */
- p->recent_used_cpu = prev;
- return recent_used_cpu;
- }
+ (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
+ target = recent_used_cpu;

/*
* For asymmetric CPU capacity systems, our domain of interest is
@@ -6768,9 +6761,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
- if (want_affine)
- current->recent_used_cpu = cpu;
}
rcu_read_unlock();


--
Mel Gorman
SUSE Labs

2020-12-12 10:47:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target

On Fri, Dec 11, 2020 at 02:25:42PM +0800, Hillf Danton wrote:
> On Tue, 8 Dec 2020 15:35:00 +0000 Mel Gorman wrote:
> > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >
> > /* Check a recently used CPU as a potential idle candidate: */
> > recent_used_cpu = p->recent_used_cpu;
> > + p->recent_used_cpu = prev;
> > if (recent_used_cpu != prev &&
> > recent_used_cpu != target &&
> > cpus_share_cache(recent_used_cpu, target) &&
> > (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> > cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
>
> Typo? Fix it in a respin if so.
>

What typo?

> > asym_fits_capacity(task_util, recent_used_cpu)) {
> > - /*
> > - * Replace recent_used_cpu with prev as it is a potential
> > - * candidate for the next wake:
> > - */
> > - p->recent_used_cpu = prev;
> > return recent_used_cpu;
>
> I prefer to update the recent CPU after the LLC check.
>

That would prevent recent_used_cpu from leaving the LLC the task first
started on.
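
To make that concrete, a rough sketch of the suggested ordering, reusing
the condition from the hunk quoted above (illustrative only, not a
proposed patch):

	if (recent_used_cpu != prev &&
	    recent_used_cpu != target &&
	    cpus_share_cache(recent_used_cpu, target) &&
	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
	    cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
	    asym_fits_capacity(task_util, recent_used_cpu)) {
		/* Only refreshed when the old value passes the LLC check */
		p->recent_used_cpu = prev;
		return recent_used_cpu;
	}

If the task has since migrated to a different LLC, cpus_share_cache()
keeps failing, p->recent_used_cpu is never refreshed and it stays pinned
to the LLC the task started on.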

--
Mel Gorman
SUSE Labs

2020-12-12 10:49:56

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target

On Fri, Dec 11, 2020 at 05:34:43PM +0800, Hillf Danton wrote:
> On Fri, 11 Dec 2020 09:02:28 +0000 Mel Gorman wrote:
> >On Fri, Dec 11, 2020 at 02:25:42PM +0800, Hillf Danton wrote:
> >> On Tue, 8 Dec 2020 15:35:00 +0000 Mel Gorman wrote:
> >> > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >> >
> >> > /* Check a recently used CPU as a potential idle candidate: */
> >> > recent_used_cpu = p->recent_used_cpu;
> >> > + p->recent_used_cpu = prev;
> >> > if (recent_used_cpu != prev &&
> >> > recent_used_cpu != target &&
> >> > cpus_share_cache(recent_used_cpu, target) &&
> >> > (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> >> > cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> >>
> >> Typo? Fix it in a respin if so.
> >>
> >
> >What typo?
>
> After your change it is prev that we check against p->cpus_ptr instead of
> the recent CPU. I wonder what the point is of doing such a check when
> returning the recent one.

Ah... yes, this is indeed wrong. It wouldn't affect Vincent's case
that showed a problem with a hackbench configuration (which I'm still
disappointed about as it's a trade-off depending on machine and workload)
but it would allow a task to run on the wrong CPU if sched_setaffinity()
was called between wakeup events.
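
For reference, this is roughly what the corrected hunk looks like
(untested, same structure as the posted patch but testing the candidate
CPU rather than prev against p->cpus_ptr):

	/* Check a recently used CPU as a potential idle candidate: */
	recent_used_cpu = p->recent_used_cpu;
	p->recent_used_cpu = prev;
	if (recent_used_cpu != prev &&
	    recent_used_cpu != target &&
	    cpus_share_cache(recent_used_cpu, target) &&
	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
	    cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
	    asym_fits_capacity(task_util, recent_used_cpu))
		return recent_used_cpu;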

--
Mel Gorman
SUSE Labs

2020-12-13 23:29:46

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

On Fri, 11 Dec 2020 at 11:23, Mel Gorman <[email protected]> wrote:
>
> Is SMT enabled on your test machine?

Yes, I have SMT on my system: 2 nodes x 28 cores x 4 SMT.


