2020-12-07 09:19:00

by Mel Gorman

[permalink] [raw]
Subject: [RFC PATCH 0/4] Reduce worst-case scanning of runqueues in select_idle_sibling

This is a minimal series to reduce the amount of runqueue scanning in
select_idle_sibling in the worst case.

Patch 1 removes SIS_AVG_CPU because it's unused.

Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
of scanning. It should be relatively uncontroversial

Patch 3-4 scans the runqueues in a single pass for select_idle_core()
and select_idle_cpu() so runqueues are not scanned twice. It's
a tradeoff because it benefits deep scans but introduces overhead
for shallow scans.

Even if patch 3-4 is rejected to allow more time for Aubrey's idle cpu mask
approach to stand on its own, patches 1-2 should be fine. The main decision
with patch 4 is whether select_idle_core() should do a full scan when searching
for an idle core, whether it should be throttled in some other fashion or
whether it should be just left alone.

--
2.26.2


2020-12-07 09:19:15

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 2/4] sched/fair: Do not replace recent_used_cpu with the new target

After select_idle_sibling, p->recent_used_cpu is set to the
new target. However on the next wakeup, prev will be the same as
recent_used_cpu unless the load balancer has moved the task since the last
wakeup. It still works, but is less efficient than it can be after all
the changes that went in since that reduce unnecessary migrations, load
balancer changes etc. This patch preserves recent_used_cpu for longer.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

5.10.0-rc6 5.10.0-rc6
baseline-v2 altrecent-v2
Hmean 1 508.39 ( 0.00%) 502.05 * -1.25%*
Hmean 2 986.70 ( 0.00%) 983.65 * -0.31%*
Hmean 4 1914.55 ( 0.00%) 1920.24 * 0.30%*
Hmean 8 3702.37 ( 0.00%) 3663.96 * -1.04%*
Hmean 16 6573.11 ( 0.00%) 6545.58 * -0.42%*
Hmean 32 10142.57 ( 0.00%) 10253.73 * 1.10%*
Hmean 64 14348.40 ( 0.00%) 12506.31 * -12.84%*
Hmean 128 21842.59 ( 0.00%) 21967.13 * 0.57%*
Hmean 256 20813.75 ( 0.00%) 21534.52 * 3.46%*
Hmean 320 20684.33 ( 0.00%) 21070.14 * 1.87%*

The different was marginal except for 64 threads which showed in the
baseline that the result was very unstable where as the patch was much
more stable. This is somewhat machine specific as on a separate 80-cpu
Broadwell machine the same test reported.

5.10.0-rc6 5.10.0-rc6
baseline-v2 altrecent-v2
Hmean 1 310.36 ( 0.00%) 291.81 * -5.98%*
Hmean 2 340.86 ( 0.00%) 547.22 * 60.54%*
Hmean 4 912.29 ( 0.00%) 1063.21 * 16.54%*
Hmean 8 2116.40 ( 0.00%) 2103.60 * -0.60%*
Hmean 16 4232.90 ( 0.00%) 4362.92 * 3.07%*
Hmean 32 8442.03 ( 0.00%) 8642.10 * 2.37%*
Hmean 64 11733.91 ( 0.00%) 11473.66 * -2.22%*
Hmean 128 17727.24 ( 0.00%) 16784.23 * -5.32%*
Hmean 256 16089.23 ( 0.00%) 16110.79 * 0.13%*
Hmean 320 15992.60 ( 0.00%) 16071.64 * 0.49%*

schedstats were not used in this series but from an earlier debugging
effort, the schedstats after the test run were as follows;

Ops SIS Search 5653107942.00 5726545742.00
Ops SIS Domain Search 3365067916.00 3319768543.00
Ops SIS Scanned 112173512543.00 99194352541.00
Ops SIS Domain Scanned 109885472517.00 96787575342.00
Ops SIS Failures 2923185114.00 2950166441.00
Ops SIS Recent Used Hit 56547.00 118064916.00
Ops SIS Recent Used Miss 1590899250.00 354942791.00
Ops SIS Recent Attempts 1590955797.00 473007707.00
Ops SIS Search Efficiency 5.04 5.77
Ops SIS Domain Search Eff 3.06 3.43
Ops SIS Fast Success Rate 40.47 42.03
Ops SIS Success Rate 48.29 48.48
Ops SIS Recent Success Rate 0.00 24.96

First interesting point is the ridiculous number of times runqueues are
enabled -- almost 97 billion times over the course of 40 minutes

With the patch, "Recent Used Hit" is over 2000 times more likely to
succeed. The failure rate also increases by quite a lot but the cost is
marginal even if the "Fast Success Rate" only increases by 2% overall. What
cannot be observed from these stats is where the biggest impact as these
stats cover low utilisation to over saturation.

If graphed over time, the graphs show that the sched domain is only
scanned at negligible rates until the machine is fully busy. With
low utilisation, the "Fast Success Rate" is almost 100% until the
machine is fully busy. For 320 clients, the success rate is close to
0% which is unsurprising.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23934dbac635..01b38fc17bca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,6 +6274,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)

/* Check a recently used CPU as a potential idle candidate: */
recent_used_cpu = p->recent_used_cpu;
+ p->recent_used_cpu = prev;
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
@@ -6765,9 +6766,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
- if (want_affine)
- current->recent_used_cpu = cpu;
}
rcu_read_unlock();

--
2.26.2

2020-12-07 09:19:26

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It
was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 3 ---
kernel/sched/features.h | 1 -
2 files changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98075f9ea9a8..23934dbac635 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6161,9 +6161,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
avg_idle = this_rq()->avg_idle / 512;
avg_cost = this_sd->avg_scan_cost + 1;

- if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
- return -1;
-
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
/*
* When doing wakeups, attempt to limit superfluous scans of the LLC domain.
*/
-SCHED_FEAT(SIS_AVG_CPU, false)
SCHED_FEAT(SIS_PROP, true)

/*
--
2.26.2

2020-12-07 09:19:31

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 3/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core

select_idle_core is called when SMT is active and there is likely a free
core available. It may find idle CPUs but this information is simply
discarded and the scan starts over again with select_idle_cpu.

This patch caches information on idle CPUs found during the search for
a core and uses one if no core is found. This is a tradeoff. There may
be a slight impact when utilisation is low and an idle core can be
found quickly. It provides improvements as the number of busy CPUs
approaches 50% of the domain size when SMT is enabled.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

5.10.0-rc6 5.10.0-rc6
schedstat idlecandidate
Hmean 1 500.06 ( 0.00%) 505.67 * 1.12%*
Hmean 2 975.90 ( 0.00%) 974.06 * -0.19%*
Hmean 4 1902.95 ( 0.00%) 1904.43 * 0.08%*
Hmean 8 3761.73 ( 0.00%) 3721.02 * -1.08%*
Hmean 16 6713.93 ( 0.00%) 6769.17 * 0.82%*
Hmean 32 10435.31 ( 0.00%) 10312.58 * -1.18%*
Hmean 64 12325.51 ( 0.00%) 13792.01 * 11.90%*
Hmean 128 21225.21 ( 0.00%) 20963.44 * -1.23%*
Hmean 256 20532.83 ( 0.00%) 20335.62 * -0.96%*
Hmean 320 20334.81 ( 0.00%) 20147.25 * -0.92%*

Note that there is a significant corner case. As the SMT scan may be
terminated early, not all CPUs have been visited and select_idle_cpu()
is still called for a full scan. This case is handled in the next
patch.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01b38fc17bca..00c3b526a5bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
*/
static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
{
+ int idle_candidate = -1;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int core, cpu;

@@ -6085,6 +6086,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
idle = false;
break;
}
+
+ if (idle_candidate == -1 &&
+ cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ idle_candidate = cpu;
+ }
}

if (idle)
@@ -6098,7 +6104,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
*/
set_idle_cores(target, 0);

- return -1;
+ return idle_candidate;
}

/*
--
2.26.2

2020-12-07 09:19:38

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 4/4] sched/fair: Avoid revisiting CPUs multiple times during select_idle_sibling

select_idle_core() potentially searches a number of CPUs for idle candidates
before select_idle_cpu() clears the mask and revisits the same CPUs. This
patch moves the initialisation of select_idle_mask to the top-level and
reuses the same mask across both select_idle_core and select_idle_cpu.
select_idle_smt() is left alone as the cost of checking one SMT sibling
is marginal relative to calling __clear_cpumask_cpu() for every CPU
visited by select_idle_core().

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

5.10.0-rc6 5.10.0-rc6
altrecent-v2 singlepass-v2
Hmean 1 502.05 ( 0.00%) 498.53 * -0.70%*
Hmean 2 983.65 ( 0.00%) 982.33 * -0.13%*
Hmean 4 1920.24 ( 0.00%) 1929.34 * 0.47%*
Hmean 8 3663.96 ( 0.00%) 3536.94 * -3.47%*
Hmean 16 6545.58 ( 0.00%) 6560.21 * 0.22%*
Hmean 32 10253.73 ( 0.00%) 10374.30 * 1.18%*
Hmean 64 12506.31 ( 0.00%) 11692.26 * -6.51%*
Hmean 128 21967.13 ( 0.00%) 21705.80 * -1.19%*
Hmean 256 21534.52 ( 0.00%) 21223.50 * -1.44%*
Hmean 320 21070.14 ( 0.00%) 21023.31 * -0.22%*

As before, results are somewhat workload and machine-specific with a mix
of good and bad. For example, netperf UDP_STREAM on an 80-cpu Broadwell
machine shows;

5.10.0-rc6 5.10.0-rc6
altrecent-v2 singlepass-v2
Hmean send-64 232.78 ( 0.00%) 258.02 * 10.85%*
Hmean send-128 464.69 ( 0.00%) 511.53 * 10.08%*
Hmean send-256 904.72 ( 0.00%) 999.40 * 10.46%*
Hmean send-1024 3512.08 ( 0.00%) 3868.86 * 10.16%*
Hmean send-2048 6777.61 ( 0.00%) 7435.05 * 9.70%*
Hmean send-3312 10352.45 ( 0.00%) 11321.43 * 9.36%*
Hmean send-4096 12660.58 ( 0.00%) 13577.08 * 7.24%*
Hmean send-8192 19240.36 ( 0.00%) 21321.34 * 10.82%*
Hmean send-16384 29691.27 ( 0.00%) 31296.47 * 5.41%*

The real question is -- is it better to scan CPU runqueues just once
where possible or scan multiple times? This patch scans once but it must
do additional work to track that scanning so for shallow scans it's a
loss and or deep scans, it's a win.

The main downside to this patch is that cpumask manipulations are
more complex because two cpumasks are involved. The alternative is that
select_idle_core() would scan the full domain and always return a CPU which
was previously considered. Alternatively, the depth of select_idle_core()
could be limited. Similarly, select_idle_core() often scans when no idle
core is available as test_idle_cores is very race-prone and maybe there
is a better approach to determining if select_idle_core() is likely to
succeed or not.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 40 +++++++++++++++++++++++++++++-----------
1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 00c3b526a5bd..3b1736dc5bde 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6064,10 +6064,11 @@ void __update_idle_core(struct rq *rq)
* there are no idle cores left in the system; tracked through
* sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
*/
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_core(struct task_struct *p, struct sched_domain *sd,
+ int target, struct cpumask *cpus_scan)
{
int idle_candidate = -1;
- struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ struct cpumask *cpus;
int core, cpu;

if (!static_branch_likely(&sched_smt_present))
@@ -6076,12 +6077,21 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
if (!test_idle_cores(target, false))
return -1;

- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ /*
+ * Repurpose load_balance_mask to avoid rescanning cores while
+ * cpus_scan tracks the cpu if select_idle_cpu() is necessary.
+ * In this context, it should be impossible to enter LB and
+ * clobber load_balance_mask.
+ */
+ cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+ cpumask_copy(cpus, cpus_scan);

for_each_cpu_wrap(core, cpus, target) {
bool idle = true;

for_each_cpu(cpu, cpu_smt_mask(core)) {
+ __cpumask_clear_cpu(cpu, cpus_scan);
+
if (!available_idle_cpu(cpu)) {
idle = false;
break;
@@ -6130,7 +6140,8 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t

#else /* CONFIG_SCHED_SMT */

-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd,
+ int target, struct cpumask *cpus)
{
return -1;
}
@@ -6147,9 +6158,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
* average idle time for this rq (as found in rq->avg_idle).
*/
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
+ int target, struct cpumask *cpus)
{
- struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
struct sched_domain *this_sd;
u64 avg_cost, avg_idle;
u64 time;
@@ -6176,9 +6187,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
}

time = cpu_clock(this);
-
- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
@@ -6240,6 +6248,7 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
struct sched_domain *sd;
+ struct cpumask *cpus_scan;
unsigned long task_util;
int i, recent_used_cpu;

@@ -6319,11 +6328,20 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;

- i = select_idle_core(p, sd, target);
+ /*
+ * Init the select_idle_mask. select_idle_core() will mask
+ * out the CPUs that have already been limited to limit the
+ * search in select_idle_cpu(). Further clearing is not
+ * done as select_idle_smt checks only one CPU.
+ */
+ cpus_scan = this_cpu_cpumask_var_ptr(select_idle_mask);
+ cpumask_and(cpus_scan, sched_domain_span(sd), p->cpus_ptr);
+
+ i = select_idle_core(p, sd, target, cpus_scan);
if ((unsigned)i < nr_cpumask_bits)
return i;

- i = select_idle_cpu(p, sd, target);
+ i = select_idle_cpu(p, sd, target, cpus_scan);
if ((unsigned)i < nr_cpumask_bits)
return i;

--
2.26.2

2020-12-07 15:08:56

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Reduce worst-case scanning of runqueues in select_idle_sibling

On Mon, 7 Dec 2020 at 10:15, Mel Gorman <[email protected]> wrote:
>
> This is a minimal series to reduce the amount of runqueue scanning in
> select_idle_sibling in the worst case.
>
> Patch 1 removes SIS_AVG_CPU because it's unused.
>
> Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
> of scanning. It should be relatively uncontroversial
>
> Patch 3-4 scans the runqueues in a single pass for select_idle_core()
> and select_idle_cpu() so runqueues are not scanned twice. It's
> a tradeoff because it benefits deep scans but introduces overhead
> for shallow scans.
>
> Even if patch 3-4 is rejected to allow more time for Aubrey's idle cpu mask

patch 3 looks fine and doesn't collide with Aubrey's work. But I don't
like patch 4 which manipulates different cpumask including
load_balance_mask out of LB and I prefer to wait for v6 of Aubrey's
patchset which should fix the problem of possibly scanning twice busy
cpus in select_idle_core and select_idle_cpu



> approach to stand on its own, patches 1-2 should be fine. The main decision
> with patch 4 is whether select_idle_core() should do a full scan when searching
> for an idle core, whether it should be throttled in some other fashion or
> whether it should be just left alone.
>
> --
> 2.26.2
>

2020-12-07 15:10:46

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core

On Mon, 7 Dec 2020 at 10:15, Mel Gorman <[email protected]> wrote:
>
> select_idle_core is called when SMT is active and there is likely a free
> core available. It may find idle CPUs but this information is simply
> discarded and the scan starts over again with select_idle_cpu.
>
> This patch caches information on idle CPUs found during the search for
> a core and uses one if no core is found. This is a tradeoff. There may
> be a slight impact when utilisation is low and an idle core can be
> found quickly. It provides improvements as the number of busy CPUs
> approaches 50% of the domain size when SMT is enabled.
>
> With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled
>
> 5.10.0-rc6 5.10.0-rc6
> schedstat idlecandidate
> Hmean 1 500.06 ( 0.00%) 505.67 * 1.12%*
> Hmean 2 975.90 ( 0.00%) 974.06 * -0.19%*
> Hmean 4 1902.95 ( 0.00%) 1904.43 * 0.08%*
> Hmean 8 3761.73 ( 0.00%) 3721.02 * -1.08%*
> Hmean 16 6713.93 ( 0.00%) 6769.17 * 0.82%*
> Hmean 32 10435.31 ( 0.00%) 10312.58 * -1.18%*
> Hmean 64 12325.51 ( 0.00%) 13792.01 * 11.90%*
> Hmean 128 21225.21 ( 0.00%) 20963.44 * -1.23%*
> Hmean 256 20532.83 ( 0.00%) 20335.62 * -0.96%*
> Hmean 320 20334.81 ( 0.00%) 20147.25 * -0.92%*
>
> Note that there is a significant corner case. As the SMT scan may be
> terminated early, not all CPUs have been visited and select_idle_cpu()
> is still called for a full scan. This case is handled in the next
> patch.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01b38fc17bca..00c3b526a5bd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
> */
> static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + int idle_candidate = -1;
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> int core, cpu;
>
> @@ -6085,6 +6086,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
> idle = false;
> break;
> }
> +
> + if (idle_candidate == -1 &&
> + cpumask_test_cpu(cpu, p->cpus_ptr)) {
> + idle_candidate = cpu;
> + }
> }
>
> if (idle)
> @@ -6098,7 +6104,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
> */
> set_idle_cores(target, 0);
>
> - return -1;
> + return idle_candidate;
> }
>
> /*
> --
> 2.26.2
>

2020-12-07 15:11:08

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On Mon, 7 Dec 2020 at 10:15, Mel Gorman <[email protected]> wrote:
>
> SIS_AVG_CPU was introduced as a means of avoiding a search when the
> average search cost indicated that the search would likely fail. It
> was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
> select_idle_cpu() more aggressive") and later replaced with a proportional
> search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
> scale select_idle_cpu()").
>
> While there are corner cases where SIS_AVG_CPU is better, it has now been
> disabled for almost three years. As the intent of SIS_PROP is to reduce
> the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
> on SIS_PROP as a throttling mechanism.
>
> Signed-off-by: Mel Gorman <[email protected]>

Let see if someone complains but looks reasonable

> ---
> kernel/sched/fair.c | 3 ---
> kernel/sched/features.h | 1 -
> 2 files changed, 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 98075f9ea9a8..23934dbac635 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6161,9 +6161,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> avg_idle = this_rq()->avg_idle / 512;
> avg_cost = this_sd->avg_scan_cost + 1;
>
> - if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
> - return -1;
> -
> if (sched_feat(SIS_PROP)) {
> u64 span_avg = sd->span_weight * avg_idle;
> if (span_avg > 4*avg_cost)
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 68d369cba9e4..e875eabb6600 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
> /*
> * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> */
> -SCHED_FEAT(SIS_AVG_CPU, false)
> SCHED_FEAT(SIS_PROP, true)
>
> /*
> --
> 2.26.2
>

2020-12-07 15:47:09

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Reduce worst-case scanning of runqueues in select_idle_sibling

On Mon, Dec 07, 2020 at 04:04:41PM +0100, Vincent Guittot wrote:
> On Mon, 7 Dec 2020 at 10:15, Mel Gorman <[email protected]> wrote:
> >
> > This is a minimal series to reduce the amount of runqueue scanning in
> > select_idle_sibling in the worst case.
> >
> > Patch 1 removes SIS_AVG_CPU because it's unused.
> >
> > Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
> > of scanning. It should be relatively uncontroversial
> >
> > Patch 3-4 scans the runqueues in a single pass for select_idle_core()
> > and select_idle_cpu() so runqueues are not scanned twice. It's
> > a tradeoff because it benefits deep scans but introduces overhead
> > for shallow scans.
> >
> > Even if patch 3-4 is rejected to allow more time for Aubrey's idle cpu mask
>
> patch 3 looks fine and doesn't collide with Aubrey's work. But I don't
> like patch 4 which manipulates different cpumask including
> load_balance_mask out of LB and I prefer to wait for v6 of Aubrey's
> patchset which should fix the problem of possibly scanning twice busy
> cpus in select_idle_core and select_idle_cpu
>

Seems fair, we can see where we stand after V6 of Aubrey's work. A lot
of the motivation for patch 4 would go away if we managed to avoid calling
select_idle_core() unnecessarily. As it stands, we can call it a lot from
hackbench even though the chance of getting an idle core are minimal.

Assuming I revisit it, I'll update the schedstat debug patches to include
the times select_idle_core() starts versus how many times it fails and
see can I think of a useful heuristic.

I'll wait for more review on patches 1-3 and if I hear nothing, I'll
resend just those.

Thanks Vincent.

--
Mel Gorman
SUSE Labs

2020-12-08 03:33:29

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Reduce worst-case scanning of runqueues in select_idle_sibling

On 2020/12/7 23:42, Mel Gorman wrote:
> On Mon, Dec 07, 2020 at 04:04:41PM +0100, Vincent Guittot wrote:
>> On Mon, 7 Dec 2020 at 10:15, Mel Gorman <[email protected]> wrote:
>>>
>>> This is a minimal series to reduce the amount of runqueue scanning in
>>> select_idle_sibling in the worst case.
>>>
>>> Patch 1 removes SIS_AVG_CPU because it's unused.
>>>
>>> Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
>>> of scanning. It should be relatively uncontroversial
>>>
>>> Patch 3-4 scans the runqueues in a single pass for select_idle_core()
>>> and select_idle_cpu() so runqueues are not scanned twice. It's
>>> a tradeoff because it benefits deep scans but introduces overhead
>>> for shallow scans.
>>>
>>> Even if patch 3-4 is rejected to allow more time for Aubrey's idle cpu mask
>>
>> patch 3 looks fine and doesn't collide with Aubrey's work. But I don't
>> like patch 4 which manipulates different cpumask including
>> load_balance_mask out of LB and I prefer to wait for v6 of Aubrey's
>> patchset which should fix the problem of possibly scanning twice busy
>> cpus in select_idle_core and select_idle_cpu
>>
>
> Seems fair, we can see where we stand after V6 of Aubrey's work. A lot
> of the motivation for patch 4 would go away if we managed to avoid calling
> select_idle_core() unnecessarily. As it stands, we can call it a lot from
> hackbench even though the chance of getting an idle core are minimal.
>

Sorry for the delay, I sent v6 out just now. Comparing to v5, v6 followed Vincent's
suggestion to decouple idle cpumask update from stop_tick signal, that is, the
CPU is set in idle cpumask every time the CPU enters idle, this should address
Peter's concern about the facebook trail-latency workload, as I didn't see
any regression in schbench workload 99.0000th latency report.

However, I also didn't see any significant benefit so far, probably I should
put more load on the system. I'll do more characterization of uperf workload
to see if I can find anything.

Thanks,
-Aubrey

2020-12-08 10:13:27

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On 07/12/2020 10:15, Mel Gorman wrote:
> SIS_AVG_CPU was introduced as a means of avoiding a search when the
> average search cost indicated that the search would likely fail. It
> was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
> select_idle_cpu() more aggressive") and later replaced with a proportional
> search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
> scale select_idle_cpu()").
>
> While there are corner cases where SIS_AVG_CPU is better, it has now been
> disabled for almost three years. As the intent of SIS_PROP is to reduce
> the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
> on SIS_PROP as a throttling mechanism.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> kernel/sched/fair.c | 3 ---
> kernel/sched/features.h | 1 -
> 2 files changed, 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 98075f9ea9a8..23934dbac635 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6161,9 +6161,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> avg_idle = this_rq()->avg_idle / 512;
> avg_cost = this_sd->avg_scan_cost + 1;
>
> - if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
> - return -1;
> -
> if (sched_feat(SIS_PROP)) {
> u64 span_avg = sd->span_weight * avg_idle;
> if (span_avg > 4*avg_cost)

Nitpick:

Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go
completely into the SIS_PROP if condition.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09f6f0edead4..fce9457cccb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6121,7 +6121,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
struct sched_domain *this_sd;
- u64 avg_cost, avg_idle;
u64 time;
int this = smp_processor_id();
int cpu, nr = INT_MAX;
@@ -6130,14 +6129,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (!this_sd)
return -1;

- /*
- * Due to large variance we need a large fuzz factor; hackbench in
- * particularly is sensitive here.
- */
- avg_idle = this_rq()->avg_idle / 512;
- avg_cost = this_sd->avg_scan_cost + 1;
-
if (sched_feat(SIS_PROP)) {
+ /*
+ * Due to large variance we need a large fuzz factor; hackbench in
+ * particularly is sensitive here.
+ */
+ u64 avg_idle = this_rq()->avg_idle / 512;
+ u64 avg_cost = this_sd->avg_scan_cost + 1;
u64 span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
--
2.17.1

2020-12-08 11:02:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On Tue, Dec 08, 2020 at 11:07:19AM +0100, Dietmar Eggemann wrote:
> On 07/12/2020 10:15, Mel Gorman wrote:
> > SIS_AVG_CPU was introduced as a means of avoiding a search when the
> > average search cost indicated that the search would likely fail. It
> > was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
> > select_idle_cpu() more aggressive") and later replaced with a proportional
> > search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
> > scale select_idle_cpu()").
> >
> > While there are corner cases where SIS_AVG_CPU is better, it has now been
> > disabled for almost three years. As the intent of SIS_PROP is to reduce
> > the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
> > on SIS_PROP as a throttling mechanism.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > kernel/sched/fair.c | 3 ---
> > kernel/sched/features.h | 1 -
> > 2 files changed, 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 98075f9ea9a8..23934dbac635 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6161,9 +6161,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > avg_idle = this_rq()->avg_idle / 512;
> > avg_cost = this_sd->avg_scan_cost + 1;
> >
> > - if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
> > - return -1;
> > -
> > if (sched_feat(SIS_PROP)) {
> > u64 span_avg = sd->span_weight * avg_idle;
> > if (span_avg > 4*avg_cost)
>
> Nitpick:
>
> Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go
> completely into the SIS_PROP if condition.
>

Yeah, I can do that. In the initial prototype, that happened in a
separate patch that split out SIS_PROP into a helper function and I
never merged it back. It's a trivial change.

Thanks.

--
Mel Gorman
SUSE Labs

2020-12-08 11:06:22

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Do not replace recent_used_cpu with the new target

On Tue, Dec 08, 2020 at 10:57:29AM +0100, Dietmar Eggemann wrote:
> On 07/12/2020 10:15, Mel Gorman wrote:
> > After select_idle_sibling, p->recent_used_cpu is set to the
> > new target. However on the next wakeup, prev will be the same as
>
> I'm confused here. Isn't current->recent_used_cpu set to 'cpu =
> smp_processor_id()' after sis()? Looking at v5.10-rc6.

If you are referring to this;

if (want_affine)
current->recent_used_cpu = cpu;

then it gets removed by the path. That replaces recent_used_cpu with the
wakers CPU which still works but the hit rate is lower.

>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 23934dbac635..01b38fc17bca 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6274,6 +6274,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >
> > /* Check a recently used CPU as a potential idle candidate: */
> > recent_used_cpu = p->recent_used_cpu;
> > + p->recent_used_cpu = prev;
> > if (recent_used_cpu != prev &&
> > recent_used_cpu != target &&
> > cpus_share_cache(recent_used_cpu, target) &&
>
> p->recent_used_cpu is already set to prev in this if condition.
>

That can be removed as redundant, I'll fix it.

--
Mel Gorman
SUSE Labs

2020-12-08 12:25:34

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Do not replace recent_used_cpu with the new target

On 07/12/2020 10:15, Mel Gorman wrote:
> After select_idle_sibling, p->recent_used_cpu is set to the
> new target. However on the next wakeup, prev will be the same as

I'm confused here. Isn't current->recent_used_cpu set to 'cpu =
smp_processor_id()' after sis()? Looking at v5.10-rc6.

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 23934dbac635..01b38fc17bca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6274,6 +6274,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>
> /* Check a recently used CPU as a potential idle candidate: */
> recent_used_cpu = p->recent_used_cpu;
> + p->recent_used_cpu = prev;
> if (recent_used_cpu != prev &&
> recent_used_cpu != target &&
> cpus_share_cache(recent_used_cpu, target) &&

p->recent_used_cpu is already set to prev in this if condition.

asym_fits_capacity(task_util, recent_used_cpu)) {
/*
* Replace recent_used_cpu with prev as it is a potential
* candidate for the next wake:
*/
p->recent_used_cpu = prev;
return recent_used_cpu;

> @@ -6765,9 +6766,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> /* Fast path */
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> -
> - if (want_affine)
> - current->recent_used_cpu = cpu;
> }
> rcu_read_unlock();

2020-12-08 13:28:17

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On Tue, 8 Dec 2020 at 11:59, Mel Gorman <[email protected]> wrote:
>
> On Tue, Dec 08, 2020 at 11:07:19AM +0100, Dietmar Eggemann wrote:
> > On 07/12/2020 10:15, Mel Gorman wrote:
> > > SIS_AVG_CPU was introduced as a means of avoiding a search when the
> > > average search cost indicated that the search would likely fail. It
> > > was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
> > > select_idle_cpu() more aggressive") and later replaced with a proportional
> > > search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
> > > scale select_idle_cpu()").
> > >
> > > While there are corner cases where SIS_AVG_CPU is better, it has now been
> > > disabled for almost three years. As the intent of SIS_PROP is to reduce
> > > the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
> > > on SIS_PROP as a throttling mechanism.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 3 ---
> > > kernel/sched/features.h | 1 -
> > > 2 files changed, 4 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 98075f9ea9a8..23934dbac635 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6161,9 +6161,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > avg_idle = this_rq()->avg_idle / 512;
> > > avg_cost = this_sd->avg_scan_cost + 1;
> > >
> > > - if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
> > > - return -1;
> > > -
> > > if (sched_feat(SIS_PROP)) {
> > > u64 span_avg = sd->span_weight * avg_idle;
> > > if (span_avg > 4*avg_cost)
> >
> > Nitpick:
> >
> > Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go
> > completely into the SIS_PROP if condition.
> >
>
> Yeah, I can do that. In the initial prototype, that happened in a
> separate patch that split out SIS_PROP into a helper function and I
> never merged it back. It's a trivial change.

while doing this, should you also put the update of
this_sd->avg_scan_cost under the SIS_PROP feature ?

>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs

2020-12-08 13:41:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On Tue, Dec 08, 2020 at 02:24:32PM +0100, Vincent Guittot wrote:
> > > Nitpick:
> > >
> > > Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go
> > > completely into the SIS_PROP if condition.
> > >
> >
> > Yeah, I can do that. In the initial prototype, that happened in a
> > separate patch that split out SIS_PROP into a helper function and I
> > never merged it back. It's a trivial change.
>
> while doing this, should you also put the update of
> this_sd->avg_scan_cost under the SIS_PROP feature ?
>

It's outside the scope of the series but why not. This?

--8<--
sched/fair: Move avg_scan_cost calculations under SIS_PROP

As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.

Signed-off-by: Mel Gorman <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19ca0265f8aa..0fee53b1aae4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6176,10 +6176,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
nr = 4;
}

- time = cpu_clock(this);
-
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

+ if (sched_feat(SIS_PROP))
+ time = cpu_clock(this);
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
@@ -6187,8 +6187,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
break;
}

- time = cpu_clock(this) - time;
- update_avg(&this_sd->avg_scan_cost, time);
+ if (sched_feat(SIS_PROP)) {
+ time = cpu_clock(this) - time;
+ update_avg(&this_sd->avg_scan_cost, time);
+ }

return cpu;
}

2020-12-08 13:58:51

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On Tue, Dec 08, 2020 at 02:43:10PM +0100, Vincent Guittot wrote:
> On Tue, 8 Dec 2020 at 14:36, Mel Gorman <[email protected]> wrote:
> >
> > On Tue, Dec 08, 2020 at 02:24:32PM +0100, Vincent Guittot wrote:
> > > > > Nitpick:
> > > > >
> > > > > Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go
> > > > > completely into the SIS_PROP if condition.
> > > > >
> > > >
> > > > Yeah, I can do that. In the initial prototype, that happened in a
> > > > separate patch that split out SIS_PROP into a helper function and I
> > > > never merged it back. It's a trivial change.
> > >
> > > while doing this, should you also put the update of
> > > this_sd->avg_scan_cost under the SIS_PROP feature ?
> > >
> >
> > It's outside the scope of the series but why not. This?
> >
> > --8<--
> > sched/fair: Move avg_scan_cost calculations under SIS_PROP
> >
> > As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> > even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> > check and while we are at it, exclude the cost of initialising the CPU
> > mask from the average scan cost.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 19ca0265f8aa..0fee53b1aae4 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6176,10 +6176,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > nr = 4;
> > }
> >
> > - time = cpu_clock(this);
>
> I would move it in the if (sched_feat(SIS_PROP)) above.
>

I considered it but made the choice to exclude the cost of cpumask_and()
from the avg_scan_cost instead. It's minor but when doing the original
prototype, I didn't think it was appropriate to count the cpumask
clearing as part of the scan cost as it's not directly related.

--
Mel Gorman
SUSE Labs

2020-12-08 14:53:09

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On Tue, 8 Dec 2020 at 14:54, Mel Gorman <[email protected]> wrote:
>
> On Tue, Dec 08, 2020 at 02:43:10PM +0100, Vincent Guittot wrote:
> > On Tue, 8 Dec 2020 at 14:36, Mel Gorman <[email protected]> wrote:
> > >
> > > On Tue, Dec 08, 2020 at 02:24:32PM +0100, Vincent Guittot wrote:
> > > > > > Nitpick:
> > > > > >
> > > > > > Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go
> > > > > > completely into the SIS_PROP if condition.
> > > > > >
> > > > >
> > > > > Yeah, I can do that. In the initial prototype, that happened in a
> > > > > separate patch that split out SIS_PROP into a helper function and I
> > > > > never merged it back. It's a trivial change.
> > > >
> > > > while doing this, should you also put the update of
> > > > this_sd->avg_scan_cost under the SIS_PROP feature ?
> > > >
> > >
> > > It's outside the scope of the series but why not. This?
> > >
> > > --8<--
> > > sched/fair: Move avg_scan_cost calculations under SIS_PROP
> > >
> > > As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> > > even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> > > check and while we are at it, exclude the cost of initialising the CPU
> > > mask from the average scan cost.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 19ca0265f8aa..0fee53b1aae4 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6176,10 +6176,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > nr = 4;
> > > }
> > >
> > > - time = cpu_clock(this);
> >
> > I would move it in the if (sched_feat(SIS_PROP)) above.
> >
>
> I considered it but made the choice to exclude the cost of cpumask_and()
> from the avg_scan_cost instead. It's minor but when doing the original

At the cost of a less readable code

> prototype, I didn't think it was appropriate to count the cpumask
> clearing as part of the scan cost as it's not directly related.

hmm... I think it is because the number of loop is directly related to
the allowed cpus


>
> --
> Mel Gorman
> SUSE Labs

2020-12-08 15:16:25

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On Tue, Dec 08, 2020 at 03:47:40PM +0100, Vincent Guittot wrote:
> > I considered it but made the choice to exclude the cost of cpumask_and()
> > from the avg_scan_cost instead. It's minor but when doing the original
>
> At the cost of a less readable code
>

Slightly less readable, yes.

> > prototype, I didn't think it was appropriate to count the cpumask
> > clearing as part of the scan cost as it's not directly related.
>
> hmm... I think it is because the number of loop is directly related to
> the allowed cpus
>

While that is true, the cost of initialising the map is constant and
what is most important is tracking the scan cost which is variable.
Without SIS_AVG_CPU, the cpumask init can go before SIS_PROP without any
penalty so is this version preferable?

--8<--
sched/fair: Move avg_scan_cost calculations under SIS_PROP

As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac7b34e7372b..5c41875aec23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (!this_sd)
return -1;

+ cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
if (sched_feat(SIS_PROP)) {
u64 avg_cost, avg_idle, span_avg;

@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
nr = div_u64(span_avg, avg_cost);
else
nr = 4;
- }
-
- time = cpu_clock(this);

- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ time = cpu_clock(this);
+ }

for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
break;
}

- time = cpu_clock(this) - time;
- update_avg(&this_sd->avg_scan_cost, time);
+ if (sched_feat(SIS_PROP)) {
+ time = cpu_clock(this) - time;
+ update_avg(&this_sd->avg_scan_cost, time);
+ }

return cpu;
}

2020-12-08 20:25:13

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On Tue, 8 Dec 2020 at 14:36, Mel Gorman <[email protected]> wrote:
>
> On Tue, Dec 08, 2020 at 02:24:32PM +0100, Vincent Guittot wrote:
> > > > Nitpick:
> > > >
> > > > Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go
> > > > completely into the SIS_PROP if condition.
> > > >
> > >
> > > Yeah, I can do that. In the initial prototype, that happened in a
> > > separate patch that split out SIS_PROP into a helper function and I
> > > never merged it back. It's a trivial change.
> >
> > while doing this, should you also put the update of
> > this_sd->avg_scan_cost under the SIS_PROP feature ?
> >
>
> It's outside the scope of the series but why not. This?
>
> --8<--
> sched/fair: Move avg_scan_cost calculations under SIS_PROP
>
> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> check and while we are at it, exclude the cost of initialising the CPU
> mask from the average scan cost.
>
> Signed-off-by: Mel Gorman <[email protected]>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 19ca0265f8aa..0fee53b1aae4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6176,10 +6176,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> nr = 4;
> }
>
> - time = cpu_clock(this);

I would move it in the if (sched_feat(SIS_PROP)) above.

> -
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> + if (sched_feat(SIS_PROP))
> + time = cpu_clock(this);
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> return -1;
> @@ -6187,8 +6187,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> break;
> }
>
> - time = cpu_clock(this) - time;
> - update_avg(&this_sd->avg_scan_cost, time);
> + if (sched_feat(SIS_PROP)) {
> + time = cpu_clock(this) - time;
> + update_avg(&this_sd->avg_scan_cost, time);
> + }
>
> return cpu;
> }

2020-12-08 20:25:32

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

On Tue, 8 Dec 2020 at 16:12, Mel Gorman <[email protected]> wrote:
>
> On Tue, Dec 08, 2020 at 03:47:40PM +0100, Vincent Guittot wrote:
> > > I considered it but made the choice to exclude the cost of cpumask_and()
> > > from the avg_scan_cost instead. It's minor but when doing the original
> >
> > At the cost of a less readable code
> >
>
> Slightly less readable, yes.
>
> > > prototype, I didn't think it was appropriate to count the cpumask
> > > clearing as part of the scan cost as it's not directly related.
> >
> > hmm... I think it is because the number of loop is directly related to
> > the allowed cpus
> >
>
> While that is true, the cost of initialising the map is constant and
> what is most important is tracking the scan cost which is variable.
> Without SIS_AVG_CPU, the cpumask init can go before SIS_PROP without any
> penalty so is this version preferable?

yes looks good to me

>
> --8<--
> sched/fair: Move avg_scan_cost calculations under SIS_PROP
>
> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> check and while we are at it, exclude the cost of initialising the CPU
> mask from the average scan cost.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> kernel/sched/fair.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ac7b34e7372b..5c41875aec23 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> if (!this_sd)
> return -1;
>
> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
> if (sched_feat(SIS_PROP)) {
> u64 avg_cost, avg_idle, span_avg;
>
> @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> nr = div_u64(span_avg, avg_cost);
> else
> nr = 4;
> - }
> -
> - time = cpu_clock(this);
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + time = cpu_clock(this);
> + }
>
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> @@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> break;
> }
>
> - time = cpu_clock(this) - time;
> - update_avg(&this_sd->avg_scan_cost, time);
> + if (sched_feat(SIS_PROP)) {
> + time = cpu_clock(this) - time;
> + update_avg(&this_sd->avg_scan_cost, time);
> + }
>
> return cpu;
> }