2021-01-19 12:23:53

by Mel Gorman

Subject: [PATCH v3 0/5] Scan for an idle sibling in a single pass

Changelog since v2
o Remove unnecessary parameters
o Update nr during scan only when scanning for cpus

Changelog since v1
o Move extern declaration to header for coding style
o Remove unnecessary parameter from __select_idle_cpu

This series of 5 patches reposts three patches from Peter entitled
"select_idle_sibling() wreckage". With the series applied, the runqueues
are scanned only in a single pass when searching for an idle sibling.

Two patches from Peter were dropped. The first patch altered how scan
depth was calculated. The scan depth calculation is essentially a random
number generator with two major limitations. The avg_idle time is based
on the time between a CPU going idle and being woken up, clamped
approximately by 2*sysctl_sched_migration_cost. This is difficult to
compare in a sensible fashion to avg_scan_cost. The second issue is that
only the avg_scan_cost of scan failures is recorded and it does not
decay. Fixing this requires deeper surgery that would justify a patch on
its own, although Peter notes that
https://lkml.kernel.org/r/[email protected] is
potentially useful for an alternative avg_idle metric.
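
For reference, the depth calculation being discussed looks roughly like the
following once patch 1 is applied (a condensed sketch of the select_idle_cpu()
hunks later in the series, not a literal excerpt):

        if (sched_feat(SIS_PROP)) {
                /* rq->avg_idle is clamped elsewhere to ~2*sysctl_sched_migration_cost */
                avg_idle = this_rq()->avg_idle / 512;
                avg_cost = this_sd->avg_scan_cost + 1;

                span_avg = sd->span_weight * avg_idle;
                if (span_avg > 4*avg_cost)
                        nr = div_u64(span_avg, avg_cost);       /* scan depth in CPUs */
                else
                        nr = 4;
        }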

The second dropped patch converted the idle core scan throttling
mechanism to SIS_PROP. While this would unify the throttling of core
and CPU scanning, it was not free of regressions, and has_idle_cores is
a fairly effective throttling mechanism, with the caveat that it can have
a lot of false positives for workloads like hackbench.
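
For context, the has_idle_cores throttle works roughly as follows. This is a
simplified sketch rather than a literal excerpt, and scan_for_idle_core() /
whole_core_idle() are placeholder names for illustration only:

        /* Wakeup side: only pay for a core scan while an idle core might exist */
        if (test_idle_cores(target, false)) {
                core = scan_for_idle_core(p, sd, target);       /* placeholder */
                if (core >= 0)
                        return core;
                set_idle_cores(target, 0);      /* scan failed: throttle further core scans */
        }

        /* Idle side (__update_idle_core()): re-arm the flag when a whole core goes idle */
        if (whole_core_idle(core))              /* placeholder */
                set_idle_cores(core, 1);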

Peter's series tried to solve three problems at once; this subset addresses
one of them. As with anything select_idle_sibling related, it's a mix of wins
and losses, but it won more than it lost across a range of workloads and
machines.

kernel/sched/core.c | 18 +++--
kernel/sched/fair.c | 161 ++++++++++++++++++++--------------------
kernel/sched/features.h | 1 -
kernel/sched/sched.h | 2 +
4 files changed, 95 insertions(+), 87 deletions(-)

--
2.26.2


2021-01-19 12:24:23

by Mel Gorman

Subject: [PATCH 4/5] sched/fair: Remove select_idle_smt()

From: Peter Zijlstra (Intel) <[email protected]>

In order to make the next patch more readable, and to quantify the
actual effectiveness of this pass, start by removing it.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 30 ------------------------------
1 file changed, 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0811e2fe4f19..12e08da90024 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6103,27 +6103,6 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
return -1;
}

-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
- int cpu;
-
- if (!static_branch_likely(&sched_smt_present))
- return -1;
-
- for_each_cpu(cpu, cpu_smt_mask(target)) {
- if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
- !cpumask_test_cpu(cpu, sched_domain_span(sd)))
- continue;
- if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
- return cpu;
- }
-
- return -1;
-}
-
#else /* CONFIG_SCHED_SMT */

#define sched_smt_weight 1
@@ -6133,11 +6112,6 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s
return -1;
}

-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
- return -1;
-}
-
#endif /* CONFIG_SCHED_SMT */

#define sis_min_cores 2
@@ -6331,10 +6305,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;

- i = select_idle_smt(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
-
return target;
}

--
2.26.2

2021-01-19 12:24:24

by Mel Gorman

Subject: [PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores

From: Peter Zijlstra (Intel) <[email protected]>

Instead of calculating how many (logical) CPUs to scan, compute how
many cores to scan.

This changes behaviour for anything !SMT2.
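
As a hedged illustration with hypothetical numbers (not taken from this
changelog): when the minimum scan depth applies, the old code scanned at
least 4 logical CPUs regardless of topology, whereas with this patch

        nr = sis_min_cores;             /* 2 cores when the minimum applies */
        nr *= sched_smt_weight;         /* SMT4: 8 CPUs, SMT2: 4 CPUs, !SMT: 2 CPUs */

so only SMT2 topologies keep the old effective minimum of 4 logical CPUs.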

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/core.c | 18 +++++++++++++-----
kernel/sched/fair.c | 12 ++++++++++--
kernel/sched/sched.h | 2 ++
3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15d2562118d1..ada8faac2e4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7444,11 +7444,19 @@ int sched_cpu_activate(unsigned int cpu)
balance_push_set(cpu, false);

#ifdef CONFIG_SCHED_SMT
- /*
- * When going up, increment the number of cores with SMT present.
- */
- if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
- static_branch_inc_cpuslocked(&sched_smt_present);
+ do {
+ int weight = cpumask_weight(cpu_smt_mask(cpu));
+
+ if (weight > sched_smt_weight)
+ sched_smt_weight = weight;
+
+ /*
+ * When going up, increment the number of cores with SMT present.
+ */
+ if (weight == 2)
+ static_branch_inc_cpuslocked(&sched_smt_present);
+
+ } while (0);
#endif
set_cpu_active(cpu, true);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8d8e185cf3b..0811e2fe4f19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6010,6 +6010,8 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
DEFINE_STATIC_KEY_FALSE(sched_smt_present);
EXPORT_SYMBOL_GPL(sched_smt_present);

+int sched_smt_weight __read_mostly = 1;
+
static inline void set_idle_cores(int cpu, int val)
{
struct sched_domain_shared *sds;
@@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t

#else /* CONFIG_SCHED_SMT */

+#define sched_smt_weight 1
+
static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
{
return -1;
@@ -6136,6 +6140,8 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd

#endif /* CONFIG_SCHED_SMT */

+#define sis_min_cores 2
+
/*
* Scan the LLC domain for idle CPUs; this is dynamically regulated by
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
avg_cost = this_sd->avg_scan_cost + 1;

span_avg = sd->span_weight * avg_idle;
- if (span_avg > 4*avg_cost)
+ if (span_avg > sis_min_cores*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
- nr = 4;
+ nr = sis_min_cores;
+
+ nr *= sched_smt_weight;

time = cpu_clock(this);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12ada79d40f3..29aabe98dd1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1107,6 +1107,8 @@ static inline void update_idle_core(struct rq *rq)
__update_idle_core(rq);
}

+extern int sched_smt_weight;
+
#else
static inline void update_idle_core(struct rq *rq) { }
#endif
--
2.26.2

2021-01-19 12:29:12

by Vincent Guittot

Subject: Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

On Tue, 19 Jan 2021 at 12:22, Mel Gorman <[email protected]> wrote:
>
> Changelog since v2
> o Remove unnecessary parameters
> o Update nr during scan only when scanning for cpus

Hi Mel,

I haven't looked at your previous version mainly because I'm chasing a
performance regression on v5.11-rcx which prevents me from testing the
impact of your patchset on my !SMT2 system.
Will do this as soon as this problem is fixed

>
> Changlog since v1
> o Move extern declaration to header for coding style
> o Remove unnecessary parameter from __select_idle_cpu
>
> This series of 5 patches reposts three patches from Peter entitled
> "select_idle_sibling() wreckage". It only scans the runqueues in a single
> pass when searching for an idle sibling.
>
> Two patches from Peter were dropped. The first patch altered how scan
> depth was calculated. Scan depth deletion is a random number generator
> with two major limitations. The avg_idle time is based on the time
> between a CPU going idle and being woken up clamped approximately by
> 2*sysctl_sched_migration_cost. This is difficult to compare in a sensible
> fashion to avg_scan_cost. The second issue is that only the avg_scan_cost
> of scan failures is recorded and it does not decay. This requires deeper
> surgery that would justify a patch on its own although Peter notes that
> https://lkml.kernel.org/r/[email protected] is
> potentially useful for an alternative avg_idle metric.
>
> The second patch dropped converted the idle core scan throttling
> mechanism to SIS_PROP. While this would unify the throttling of core
> and CPU scanning, it was not free of regressions and has_idle_cores is
> a fairly effective throttling mechanism with the caveat that it can have
> a lot of false positives for workloads like hackbench.
>
> Peter's series tried to solve three problems at once, this subset addresses
> one problem. As with anything select_idle_sibling, it's a mix of wins and
> losses but won more than it lost across a range of workloads and machines.
>
> kernel/sched/core.c | 18 +++--
> kernel/sched/fair.c | 161 ++++++++++++++++++++--------------------
> kernel/sched/features.h | 1 -
> kernel/sched/sched.h | 2 +
> 4 files changed, 95 insertions(+), 87 deletions(-)
>
> --
> 2.26.2
>

2021-01-19 12:30:07

by Mel Gorman

Subject: [PATCH 2/5] sched/fair: Move avg_scan_cost calculations under SIS_PROP

As noted by Vincent Guittot, the avg_scan_cost is calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and, while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.
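
The resulting shape of select_idle_cpu() is roughly as follows (a condensed
sketch of the diff below, not a literal excerpt):

        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);  /* mask setup, no longer timed */

        if (sched_feat(SIS_PROP)) {
                /* ... compute nr from avg_idle and avg_scan_cost ... */
                time = cpu_clock(this);         /* start the clock only when SIS_PROP is enabled */
        }

        for_each_cpu_wrap(cpu, cpus, target) {
                /* ... scan up to nr CPUs for an idle one ... */
        }

        if (sched_feat(SIS_PROP)) {
                time = cpu_clock(this) - time;
                update_avg(&this_sd->avg_scan_cost, time);
        }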

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f5682aeda2e..c8d8e185cf3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (!this_sd)
return -1;

+ cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
if (sched_feat(SIS_PROP)) {
u64 avg_cost, avg_idle, span_avg;

@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
nr = div_u64(span_avg, avg_cost);
else
nr = 4;
- }
-
- time = cpu_clock(this);

- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ time = cpu_clock(this);
+ }

for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
break;
}

- time = cpu_clock(this) - time;
- update_avg(&this_sd->avg_scan_cost, time);
+ if (sched_feat(SIS_PROP)) {
+ time = cpu_clock(this) - time;
+ update_avg(&this_sd->avg_scan_cost, time);
+ }

return cpu;
}
--
2.26.2

2021-01-19 12:30:16

by Mel Gorman

Subject: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

From: Peter Zijlstra (Intel) <[email protected]>

Both select_idle_core() and select_idle_cpu() do a loop over the same
cpumask. Observe that by clearing the already visited CPUs, we can
fold the iteration and iterate a core at a time.

All we need to do is remember any non-idle CPU we encountered while
scanning for an idle core. This way we'll only iterate every CPU once.
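
In outline, the merged scan becomes the following (a condensed sketch of the
diff below, not a literal excerpt):

        smt = test_idle_cores(target, false);

        for_each_cpu_wrap(cpu, cpus, target) {
                if (smt) {
                        /* Scan the whole core; remember any idle CPU seen on the way */
                        i = select_idle_core(p, cpu, cpus, &idle_cpu);
                        if ((unsigned int)i < nr_cpumask_bits)
                                return i;       /* found a fully idle core */
                        /* select_idle_core() clears the visited siblings from cpus */
                } else {
                        if (!--nr)
                                return -1;      /* SIS_PROP depth limit */
                        i = __select_idle_cpu(cpu);
                        if ((unsigned int)i < nr_cpumask_bits) {
                                idle_cpu = i;   /* a single idle CPU is enough */
                                break;
                        }
                }
        }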

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 101 ++++++++++++++++++++++++++------------------
1 file changed, 61 insertions(+), 40 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12e08da90024..822851b39b65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
return new_cpu;
}

+static inline int __select_idle_cpu(int cpu)
+{
+ if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+ return cpu;
+
+ return -1;
+}
+
#ifdef CONFIG_SCHED_SMT
DEFINE_STATIC_KEY_FALSE(sched_smt_present);
EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6066,40 +6074,34 @@ void __update_idle_core(struct rq *rq)
* there are no idle cores left in the system; tracked through
* sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
*/
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
- struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int core, cpu;
+ bool idle = true;
+ int cpu;

if (!static_branch_likely(&sched_smt_present))
- return -1;
-
- if (!test_idle_cores(target, false))
- return -1;
-
- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ return __select_idle_cpu(core);

- for_each_cpu_wrap(core, cpus, target) {
- bool idle = true;
-
- for_each_cpu(cpu, cpu_smt_mask(core)) {
- if (!available_idle_cpu(cpu)) {
- idle = false;
- break;
+ for_each_cpu(cpu, cpu_smt_mask(core)) {
+ if (!available_idle_cpu(cpu)) {
+ idle = false;
+ if (*idle_cpu == -1) {
+ if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ *idle_cpu = cpu;
+ break;
+ }
+ continue;
}
+ break;
}
-
- if (idle)
- return core;
-
- cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+ if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+ *idle_cpu = cpu;
}

- /*
- * Failed to find an idle core; stop looking for one.
- */
- set_idle_cores(target, 0);
+ if (idle)
+ return core;

+ cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
return -1;
}

@@ -6107,9 +6109,18 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int

#define sched_smt_weight 1

-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static inline void set_idle_cores(int cpu, int val)
{
- return -1;
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+ return def;
+}
+
+static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
+{
+ return __select_idle_cpu(core);
}

#endif /* CONFIG_SCHED_SMT */
@@ -6124,10 +6135,11 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ int i, cpu, idle_cpu = -1, nr = INT_MAX;
+ bool smt = test_idle_cores(target, false);
+ int this = smp_processor_id();
struct sched_domain *this_sd;
u64 time;
- int this = smp_processor_id();
- int cpu, nr = INT_MAX;

this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t

cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

- if (sched_feat(SIS_PROP)) {
+ if (sched_feat(SIS_PROP) && !smt) {
u64 avg_cost, avg_idle, span_avg;

/*
@@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
}

for_each_cpu_wrap(cpu, cpus, target) {
- if (!--nr)
- return -1;
- if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
- break;
+ if (smt) {
+ i = select_idle_core(p, cpu, cpus, &idle_cpu);
+ if ((unsigned int)i < nr_cpumask_bits)
+ return i;
+
+ } else {
+ if (!--nr)
+ return -1;
+ i = __select_idle_cpu(cpu);
+ if ((unsigned int)i < nr_cpumask_bits) {
+ idle_cpu = i;
+ break;
+ }
+ }
}

- if (sched_feat(SIS_PROP)) {
+ if (smt)
+ set_idle_cores(this, false);
+
+ if (sched_feat(SIS_PROP) && !smt) {
time = cpu_clock(this) - time;
update_avg(&this_sd->avg_scan_cost, time);
}

- return cpu;
+ return idle_cpu;
}

/*
@@ -6297,10 +6322,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;

- i = select_idle_core(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
-
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
--
2.26.2

2021-01-19 12:30:48

by Mel Gorman

Subject: [PATCH 1/5] sched/fair: Remove SIS_AVG_CPU

SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It was
a blunt instrument and disabled by commit 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by commit 1ad3aaf3fcd2 ("sched/core: Implement new approach
to scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 20 +++++++++-----------
kernel/sched/features.h | 1 -
2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..9f5682aeda2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
struct sched_domain *this_sd;
- u64 avg_cost, avg_idle;
u64 time;
int this = smp_processor_id();
int cpu, nr = INT_MAX;
@@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (!this_sd)
return -1;

- /*
- * Due to large variance we need a large fuzz factor; hackbench in
- * particularly is sensitive here.
- */
- avg_idle = this_rq()->avg_idle / 512;
- avg_cost = this_sd->avg_scan_cost + 1;
+ if (sched_feat(SIS_PROP)) {
+ u64 avg_cost, avg_idle, span_avg;

- if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
- return -1;
+ /*
+ * Due to large variance we need a large fuzz factor;
+ * hackbench in particularly is sensitive here.
+ */
+ avg_idle = this_rq()->avg_idle / 512;
+ avg_cost = this_sd->avg_scan_cost + 1;

- if (sched_feat(SIS_PROP)) {
- u64 span_avg = sd->span_weight * avg_idle;
+ span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
/*
* When doing wakeups, attempt to limit superfluous scans of the LLC domain.
*/
-SCHED_FEAT(SIS_AVG_CPU, false)
SCHED_FEAT(SIS_PROP, true)

/*
--
2.26.2

2021-01-19 12:38:25

by Mel Gorman

Subject: Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

On Tue, Jan 19, 2021 at 12:33:04PM +0100, Vincent Guittot wrote:
> On Tue, 19 Jan 2021 at 12:22, Mel Gorman <[email protected]> wrote:
> >
> > Changelog since v2
> > o Remove unnecessary parameters
> > o Update nr during scan only when scanning for cpus
>
> Hi Mel,
>
> I haven't looked at your previous version mainly because I'm chasing a
> performance regression on v5.11-rcx which prevents me from testing the
> impact of your patchset on my !SMT2 system.
> Will do this as soon as this problem is fixed
>

Thanks, that would be appreciated as I do not have access to a !SMT2
system to do my own evaluation.

--
Mel Gorman
SUSE Labs

2021-01-20 08:35:45

by Gautham R Shenoy

Subject: Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

Hello Mel, Peter,

On Tue, Jan 19, 2021 at 11:22:11AM +0000, Mel Gorman wrote:
> From: Peter Zijlstra (Intel) <[email protected]>
>
> Both select_idle_core() and select_idle_cpu() do a loop over the same
> cpumask. Observe that by clearing the already visited CPUs, we can
> fold the iteration and iterate a core at a time.
>
> All we need to do is remember any non-idle CPU we encountered while
> scanning for an idle core. This way we'll only iterate every CPU once.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> kernel/sched/fair.c | 101 ++++++++++++++++++++++++++------------------
> 1 file changed, 61 insertions(+), 40 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 12e08da90024..822851b39b65 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

[..snip..]


> @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> }
>
> for_each_cpu_wrap(cpu, cpus, target) {
> - if (!--nr)
> - return -1;
> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> - break;
> + if (smt) {
> + i = select_idle_core(p, cpu, cpus, &idle_cpu);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> +
> + } else {
> + if (!--nr)
> + return -1;
> + i = __select_idle_cpu(cpu);
> + if ((unsigned int)i < nr_cpumask_bits) {
> + idle_cpu = i;
> + break;
> + }
> + }
> }
>
> - if (sched_feat(SIS_PROP)) {
> + if (smt)
> + set_idle_cores(this, false);

Shouldn't we set_idle_cores(false) only if this was the last idle
core in the LLC ?

--
Thanks and Regards
gautham.

2021-01-20 10:34:07

by Vincent Guittot

Subject: Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

On Wed, 20 Jan 2021 at 10:12, Mel Gorman <[email protected]> wrote:
>
> On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > }
> > >
> > > for_each_cpu_wrap(cpu, cpus, target) {
> > > - if (!--nr)
> > > - return -1;
> > > - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > > - break;
> > > + if (smt) {
> > > + i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > + if ((unsigned int)i < nr_cpumask_bits)
> > > + return i;
> > > +
> > > + } else {
> > > + if (!--nr)
> > > + return -1;
> > > + i = __select_idle_cpu(cpu);
> > > + if ((unsigned int)i < nr_cpumask_bits) {
> > > + idle_cpu = i;
> > > + break;
> > > + }
> > > + }
> > > }
> > >
> > > - if (sched_feat(SIS_PROP)) {
> > > + if (smt)
> > > + set_idle_cores(this, false);
> >
> > Shouldn't we set_idle_cores(false) only if this was the last idle
> > core in the LLC ?
> >
>
> That would involve rechecking the cpumask bits that have not been
> scanned to see if any of them are an idle core. As the existance of idle
> cores can change very rapidly, it's not worth the cost.

But don't we reach this point only if we scanned all CPUs and didn't
find an idle core?
If there is an idle core, we return early.
And in the smt == true case, nr is set to INT_MAX and we loop over all CPUs/cores.

>
> --
> Mel Gorman
> SUSE Labs

2021-01-20 10:35:39

by Mel Gorman

Subject: Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > }
> >
> > for_each_cpu_wrap(cpu, cpus, target) {
> > - if (!--nr)
> > - return -1;
> > - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > - break;
> > + if (smt) {
> > + i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > + if ((unsigned int)i < nr_cpumask_bits)
> > + return i;
> > +
> > + } else {
> > + if (!--nr)
> > + return -1;
> > + i = __select_idle_cpu(cpu);
> > + if ((unsigned int)i < nr_cpumask_bits) {
> > + idle_cpu = i;
> > + break;
> > + }
> > + }
> > }
> >
> > - if (sched_feat(SIS_PROP)) {
> > + if (smt)
> > + set_idle_cores(this, false);
>
> Shouldn't we set_idle_cores(false) only if this was the last idle
> core in the LLC ?
>

That would involve rechecking the cpumask bits that have not been
scanned to see if any of them are an idle core. As the existence of idle
cores can change very rapidly, it's not worth the cost.

--
Mel Gorman
SUSE Labs

2021-01-20 10:35:41

by Vincent Guittot

Subject: Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

On Wed, 20 Jan 2021 at 10:54, Mel Gorman <[email protected]> wrote:
>
> On Wed, Jan 20, 2021 at 10:21:47AM +0100, Vincent Guittot wrote:
> > On Wed, 20 Jan 2021 at 10:12, Mel Gorman <[email protected]> wrote:
> > >
> > > On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > > > > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > > > }
> > > > >
> > > > > for_each_cpu_wrap(cpu, cpus, target) {
> > > > > - if (!--nr)
> > > > > - return -1;
> > > > > - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > > > > - break;
> > > > > + if (smt) {
> > > > > + i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > > > + if ((unsigned int)i < nr_cpumask_bits)
> > > > > + return i;
> > > > > +
> > > > > + } else {
> > > > > + if (!--nr)
> > > > > + return -1;
> > > > > + i = __select_idle_cpu(cpu);
> > > > > + if ((unsigned int)i < nr_cpumask_bits) {
> > > > > + idle_cpu = i;
> > > > > + break;
> > > > > + }
> > > > > + }
> > > > > }
> > > > >
> > > > > - if (sched_feat(SIS_PROP)) {
> > > > > + if (smt)
> > > > > + set_idle_cores(this, false);
> > > >
> > > > Shouldn't we set_idle_cores(false) only if this was the last idle
> > > > core in the LLC ?
> > > >
> > >
> > > That would involve rechecking the cpumask bits that have not been
> > > scanned to see if any of them are an idle core. As the existance of idle
> > > cores can change very rapidly, it's not worth the cost.
> >
> > But don't we reach this point only if we scanned all CPUs and didn't
> > find an idle core ?
>
> Yes, but my understanding of Gauthams suggestion was to check if an
> idle core found was the last idle core available and set has_idle_cores

ok get it now

> to false in that case. I think this would be relatively expensive and
> possibly futile as returning the last idle core for this wakeup does not
> mean there will be no idle core on the next wakeup as other cores may go
> idle between wakeups.

I agree, this doesn't worth the added complexity

>
> --
> Mel Gorman
> SUSE Labs

2021-01-20 10:47:27

by Mel Gorman

Subject: Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

On Wed, Jan 20, 2021 at 10:21:47AM +0100, Vincent Guittot wrote:
> On Wed, 20 Jan 2021 at 10:12, Mel Gorman <[email protected]> wrote:
> >
> > On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > > > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > > }
> > > >
> > > > for_each_cpu_wrap(cpu, cpus, target) {
> > > > - if (!--nr)
> > > > - return -1;
> > > > - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > > > - break;
> > > > + if (smt) {
> > > > + i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > > + if ((unsigned int)i < nr_cpumask_bits)
> > > > + return i;
> > > > +
> > > > + } else {
> > > > + if (!--nr)
> > > > + return -1;
> > > > + i = __select_idle_cpu(cpu);
> > > > + if ((unsigned int)i < nr_cpumask_bits) {
> > > > + idle_cpu = i;
> > > > + break;
> > > > + }
> > > > + }
> > > > }
> > > >
> > > > - if (sched_feat(SIS_PROP)) {
> > > > + if (smt)
> > > > + set_idle_cores(this, false);
> > >
> > > Shouldn't we set_idle_cores(false) only if this was the last idle
> > > core in the LLC ?
> > >
> >
> > That would involve rechecking the cpumask bits that have not been
> > scanned to see if any of them are an idle core. As the existance of idle
> > cores can change very rapidly, it's not worth the cost.
>
> But don't we reach this point only if we scanned all CPUs and didn't
> find an idle core ?

Yes, but my understanding of Gautham's suggestion was to check whether an
idle core that was just found is the last idle core available, and to set
has_idle_cores to false in that case. I think this would be relatively
expensive and possibly futile, as returning the last idle core for this
wakeup does not mean there will be no idle core on the next wakeup; other
cores may go idle between wakeups.

--
Mel Gorman
SUSE Labs

2021-01-20 20:33:38

by Gautham R Shenoy

Subject: Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()


On Wed, Jan 20, 2021 at 09:54:20AM +0000, Mel Gorman wrote:
> On Wed, Jan 20, 2021 at 10:21:47AM +0100, Vincent Guittot wrote:
> > On Wed, 20 Jan 2021 at 10:12, Mel Gorman <[email protected]> wrote:
> > >
> > > On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > > > > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > > > }
> > > > >
> > > > > for_each_cpu_wrap(cpu, cpus, target) {
> > > > > - if (!--nr)
> > > > > - return -1;
> > > > > - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > > > > - break;
> > > > > + if (smt) {
> > > > > + i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > > > + if ((unsigned int)i < nr_cpumask_bits)
> > > > > + return i;
> > > > > +
> > > > > + } else {
> > > > > + if (!--nr)
> > > > > + return -1;
> > > > > + i = __select_idle_cpu(cpu);
> > > > > + if ((unsigned int)i < nr_cpumask_bits) {
> > > > > + idle_cpu = i;
> > > > > + break;
> > > > > + }
> > > > > + }
> > > > > }
> > > > >
> > > > > - if (sched_feat(SIS_PROP)) {
> > > > > + if (smt)
> > > > > + set_idle_cores(this, false);
> > > >
> > > > Shouldn't we set_idle_cores(false) only if this was the last idle
> > > > core in the LLC ?
> > > >
> > >
> > > That would involve rechecking the cpumask bits that have not been
> > > scanned to see if any of them are an idle core. As the existance of idle
> > > cores can change very rapidly, it's not worth the cost.
> >
> > But don't we reach this point only if we scanned all CPUs and didn't
> > find an idle core ?

Indeed, I missed the part where we return as soon as we find an idle
core in the for_each_cpu_wrap() loop above. So here we clear
"has_idle_cores" when there are no longer any idle cores. Sorry for
the noise!


>
> Yes, but my understanding of Gauthams suggestion was to check if an
> idle core found was the last idle core available and set has_idle_cores
> to false in that case.

That would have been nice, but since we do not keep a count of idle
cores, it is probably not worth the effort as you note below.

> I think this would be relatively expensive and
> possibly futile as returning the last idle core for this wakeup does not
> mean there will be no idle core on the next wakeup as other cores may go
> idle between wakeups.


>
> --
> Mel Gorman
> SUSE Labs

--
Thanks and Regards
gautham.

2021-01-22 10:27:56

by Mel Gorman

Subject: Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

On Fri, Jan 22, 2021 at 10:30:52AM +0100, Vincent Guittot wrote:
> Hi Mel,
>
> On Tue, 19 Jan 2021 at 13:02, Mel Gorman <[email protected]> wrote:
> >
> > On Tue, Jan 19, 2021 at 12:33:04PM +0100, Vincent Guittot wrote:
> > > On Tue, 19 Jan 2021 at 12:22, Mel Gorman <[email protected]> wrote:
> > > >
> > > > Changelog since v2
> > > > o Remove unnecessary parameters
> > > > o Update nr during scan only when scanning for cpus
> > >
> > > Hi Mel,
> > >
> > > I haven't looked at your previous version mainly because I'm chasing a
> > > performance regression on v5.11-rcx which prevents me from testing the
> > > impact of your patchset on my !SMT2 system.
> > > Will do this as soon as this problem is fixed
> > >
> >
> > Thanks, that would be appreciated as I do not have access to a !SMT2
> > system to do my own evaluation.
>
> I have been able to run tests with your patchset on both large arm64
> SMT4 system and small arm64 !SMT system and patch 3 is still a source
> of regression on both. Decreasing min number of loops to 2 instead of
> 4 and scaling it with smt weight doesn't seem to be a good option as
> regressions disappear when I remove them as I tested with the patch
> below
>
> hackbench -l 2560 -g 1 on 8 cores arm64
> v5.11-rc4 : 1.355 (+/- 7.96)
> + sis improvement : 1.923 (+/- 25%)
> + the patch below : 1.332 (+/- 4.95)
>
> hackbench -l 2560 -g 256 on 8 cores arm64
> v5.11-rc4 : 2.116 (+/- 4.62%)
> + sis improvement : 2.216 (+/- 3.84%)
> + the patch below : 2.113 (+/- 3.01%)
>
> So starting with a min of 2 loops instead of 4 currently and scaling
> nr loop with smt weight doesn't seem to be a good option and we should
> remove it for now
>

Ok

Note that this is essentially reverting the patch. As you remove "nr *=
sched_smt_weight", the scan is no longer proportional to cores, it's
proportional to logical CPUs, and the rest of the patch and changelog
become meaningless. On that basis, I'll queue tests over the weekend that
remove this patch entirely and keep the CPU scan as a single pass.

--
Mel Gorman
SUSE Labs

2021-01-22 10:36:40

by Vincent Guittot

Subject: Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

Hi Mel,

On Tue, 19 Jan 2021 at 13:02, Mel Gorman <[email protected]> wrote:
>
> On Tue, Jan 19, 2021 at 12:33:04PM +0100, Vincent Guittot wrote:
> > On Tue, 19 Jan 2021 at 12:22, Mel Gorman <[email protected]> wrote:
> > >
> > > Changelog since v2
> > > o Remove unnecessary parameters
> > > o Update nr during scan only when scanning for cpus
> >
> > Hi Mel,
> >
> > I haven't looked at your previous version mainly because I'm chasing a
> > performance regression on v5.11-rcx which prevents me from testing the
> > impact of your patchset on my !SMT2 system.
> > Will do this as soon as this problem is fixed
> >
>
> Thanks, that would be appreciated as I do not have access to a !SMT2
> system to do my own evaluation.

I have been able to run tests with your patchset on both a large arm64
SMT4 system and a small arm64 !SMT system, and patch 3 is still a source
of regression on both. Decreasing the minimum number of loops to 2 instead
of 4 and scaling it with the SMT weight doesn't seem to be a good option:
the regressions disappear when I remove those changes, which I tested with
the patch below.

hackbench -l 2560 -g 1 on 8 cores arm64
v5.11-rc4 : 1.355 (+/- 7.96)
+ sis improvement : 1.923 (+/- 25%)
+ the patch below : 1.332 (+/- 4.95)

hackbench -l 2560 -g 256 on 8 cores arm64
v5.11-rc4 : 2.116 (+/- 4.62%)
+ sis improvement : 2.216 (+/- 3.84%)
+ the patch below : 2.113 (+/- 3.01%)

So starting with a minimum of 2 loops instead of the current 4 and scaling
the number of loops with the SMT weight doesn't seem to be a good option,
and we should remove it for now.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 808e40d0439d..c4de33574e6e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6138,7 +6138,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma

#endif /* CONFIG_SCHED_SMT */

-#define sis_min_cores 2
+#define sis_min_cores 4

/*
* Scan the LLC domain for idle CPUs; this is dynamically regulated by
@@ -6176,8 +6176,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
else
nr = sis_min_cores;

- nr *= sched_smt_weight;
-
time = cpu_clock(this);
}



>
> --
> Mel Gorman
> SUSE Labs

2021-01-22 13:25:32

by Vincent Guittot

Subject: Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

On Fri, 22 Jan 2021 at 11:14, Mel Gorman <[email protected]> wrote:
>
> On Fri, Jan 22, 2021 at 10:30:52AM +0100, Vincent Guittot wrote:
> > Hi Mel,
> >
> > On Tue, 19 Jan 2021 at 13:02, Mel Gorman <[email protected]> wrote:
> > >
> > > On Tue, Jan 19, 2021 at 12:33:04PM +0100, Vincent Guittot wrote:
> > > > On Tue, 19 Jan 2021 at 12:22, Mel Gorman <[email protected]> wrote:
> > > > >
> > > > > Changelog since v2
> > > > > o Remove unnecessary parameters
> > > > > o Update nr during scan only when scanning for cpus
> > > >
> > > > Hi Mel,
> > > >
> > > > I haven't looked at your previous version mainly because I'm chasing a
> > > > performance regression on v5.11-rcx which prevents me from testing the
> > > > impact of your patchset on my !SMT2 system.
> > > > Will do this as soon as this problem is fixed
> > > >
> > >
> > > Thanks, that would be appreciated as I do not have access to a !SMT2
> > > system to do my own evaluation.
> >
> > I have been able to run tests with your patchset on both large arm64
> > SMT4 system and small arm64 !SMT system and patch 3 is still a source
> > of regression on both. Decreasing min number of loops to 2 instead of
> > 4 and scaling it with smt weight doesn't seem to be a good option as
> > regressions disappear when I remove them as I tested with the patch
> > below
> >
> > hackbench -l 2560 -g 1 on 8 cores arm64
> > v5.11-rc4 : 1.355 (+/- 7.96)
> > + sis improvement : 1.923 (+/- 25%)
> > + the patch below : 1.332 (+/- 4.95)
> >
> > hackbench -l 2560 -g 256 on 8 cores arm64
> > v5.11-rc4 : 2.116 (+/- 4.62%)
> > + sis improvement : 2.216 (+/- 3.84%)
> > + the patch below : 2.113 (+/- 3.01%)
> >
> > So starting with a min of 2 loops instead of 4 currently and scaling
> > nr loop with smt weight doesn't seem to be a good option and we should
> > remove it for now
> >
>
> Ok
>
> Note that this is essentially reverting the patch. As you remove "nr *=
> sched_smt_weight", the scan is no longer proportional to cores, it's

Yes. My goal above was to narrow the changes to only the lines that
generate the regressions, but I agree that removing patch 3 is the
right solution.


> proportial to logical CPUs and the rest of the patch and changelog becomes
> meaningless. On that basis, I'll queue tests over the weekend that remove
> this patch entirely and keep the CPU scan as a single pass.
>
> --
> Mel Gorman
> SUSE Labs

2021-01-25 04:33:02

by Li, Aubrey

Subject: Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

On 2021/1/22 21:22, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 11:14, Mel Gorman <[email protected]> wrote:
>>
>> On Fri, Jan 22, 2021 at 10:30:52AM +0100, Vincent Guittot wrote:
>>> Hi Mel,
>>>
>>> On Tue, 19 Jan 2021 at 13:02, Mel Gorman <[email protected]> wrote:
>>>>
>>>> On Tue, Jan 19, 2021 at 12:33:04PM +0100, Vincent Guittot wrote:
>>>>> On Tue, 19 Jan 2021 at 12:22, Mel Gorman <[email protected]> wrote:
>>>>>>
>>>>>> Changelog since v2
>>>>>> o Remove unnecessary parameters
>>>>>> o Update nr during scan only when scanning for cpus
>>>>>
>>>>> Hi Mel,
>>>>>
>>>>> I haven't looked at your previous version mainly because I'm chasing a
>>>>> performance regression on v5.11-rcx which prevents me from testing the
>>>>> impact of your patchset on my !SMT2 system.
>>>>> Will do this as soon as this problem is fixed
>>>>>
>>>>
>>>> Thanks, that would be appreciated as I do not have access to a !SMT2
>>>> system to do my own evaluation.
>>>
>>> I have been able to run tests with your patchset on both large arm64
>>> SMT4 system and small arm64 !SMT system and patch 3 is still a source
>>> of regression on both. Decreasing min number of loops to 2 instead of
>>> 4 and scaling it with smt weight doesn't seem to be a good option as
>>> regressions disappear when I remove them as I tested with the patch
>>> below
>>>
>>> hackbench -l 2560 -g 1 on 8 cores arm64
>>> v5.11-rc4 : 1.355 (+/- 7.96)
>>> + sis improvement : 1.923 (+/- 25%)
>>> + the patch below : 1.332 (+/- 4.95)
>>>
>>> hackbench -l 2560 -g 256 on 8 cores arm64
>>> v5.11-rc4 : 2.116 (+/- 4.62%)
>>> + sis improvement : 2.216 (+/- 3.84%)
>>> + the patch below : 2.113 (+/- 3.01%)
>>>

Results from 4 benchmarks were collected over the weekend with patch 3
applied, on an x86 4-socket system with 24 cores per socket and 2 HT per
core, 192 CPUs in total.

It looks like mid-load shows notable changes on my side:
- netperf at 50% of the number of threads in TCP mode improved by 27.25%
- tbench at 50% of the number of threads has a 9.52% regression

Details below:

hackbench: 10 iterations, 10000 loops, 40 fds per group
======================================================

- pipe process

group      base    %std     patch     %std
6          1       5.27     1.0469    8.53
12         1       1.03     1.0398    1.44
24         1       2.36     1.0275    3.34

- pipe thread

group      base    %std     patch     %std
6          1       7.48     1.0747    5.25
12         1       0.97     1.0432    1.95
24         1       7.01     1.0299    6.81

- socket process

group      base    %std     patch     %std
6          1       1.01     0.9656    1.09
12         1       0.35     0.9853    0.49
24         1       1.33     0.9877    1.20

- socket thread

group      base    %std     patch     %std
6          1       2.52     0.9346    2.75
12         1       0.86     0.9830    0.66
24         1       1.17     0.9791    1.23

netperf: 10 iterations x 100 seconds, transactions rate / sec
=============================================================

- tcp request/response performance

thread     base    %std     patch     %std
50%        1       3.98     1.2725    7.52
100%       1       2.73     0.9446    2.86
200%       1       39.36    0.9955    29.45

- udp request/response performance

thread     base    %std     patch     %std
50%        1       6.18     1.0704    11.99
100%       1       47.85    0.9637    45.83
200%       1       45.74    1.0162    36.99

tbench: 10 iterations x 100 seconds, throughput / sec
=====================================================

thread     base    %std     patch     %std
50%        1       1.38     0.9048    2.46
100%       1       1.05     0.9640    0.68
200%       1       6.76     0.9886    2.86

schbench: 10 iterations x 100 seconds, 99th percentile latency
==============================================================

mthread    base    %std     patch     %std
6          1       29.07    0.8714    25.73
12         1       15.32    1.0000    12.39
24         1       0.08     0.9996    0.01

>>> So starting with a min of 2 loops instead of 4 currently and scaling
>>> nr loop with smt weight doesn't seem to be a good option and we should
>>> remove it for now
>>>
>> Note that this is essentially reverting the patch. As you remove "nr *=
>> sched_smt_weight", the scan is no longer proportional to cores, it's
>
> Yes. My goal above was to narrow the changes only to lines that
> generate the regressions but i agree that removing patch 3 is the
> right solution
>
>> proportial to logical CPUs and the rest of the patch and changelog becomes
>> meaningless. On that basis, I'll queue tests over the weekend that remove
>> this patch entirely and keep the CPU scan as a single pass.
>>
>> --
>> Mel Gorman
>> SUSE Labs

2021-01-26 04:38:17

by Mel Gorman

Subject: Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

On Mon, Jan 25, 2021 at 07:37:55PM +0800, Li, Aubrey wrote:
> > It's interesting that patch 3 would make any difference on x64 given that
> > it's SMT2. The scan depth should have been similar. It's somewhat expected
> > that it will not be a universal win, particularly once the utilisation
> > is high enough to spill over in sched domains (25%, 50%, 75% utilisation
> > being interesting on 4-socket systems). In such cases, double scanning can
> > still show improvements for workloads that idle rapidly like tbench and
> > hackbench even though it's expensive. The extra scanning gives more time
> > for a CPU to go idle enough to be selected which can improve throughput
> > but at the cost of wake-up latency,
>
> aha, sorry for the confusion. Since you and Vincent discussed to drop
> patch3, I just mentioned I tested 5 patches with patch3, not patch3 alone.
>

Ah, that makes more sense.

> >
> > Hopefully v4 can be tested as well which is now just a single scan.
> >
>
> Sure, may I know the baseline of v4?
>

5.11-rc4.

--
Mel Gorman
SUSE Labs

2021-01-26 21:32:45

by Li, Aubrey

Subject: Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

On 2021/1/25 17:04, Mel Gorman wrote:
> On Mon, Jan 25, 2021 at 12:29:47PM +0800, Li, Aubrey wrote:
>>>>> hackbench -l 2560 -g 1 on 8 cores arm64
>>>>> v5.11-rc4 : 1.355 (+/- 7.96)
>>>>> + sis improvement : 1.923 (+/- 25%)
>>>>> + the patch below : 1.332 (+/- 4.95)
>>>>>
>>>>> hackbench -l 2560 -g 256 on 8 cores arm64
>>>>> v5.11-rc4 : 2.116 (+/- 4.62%)
>>>>> + sis improvement : 2.216 (+/- 3.84%)
>>>>> + the patch below : 2.113 (+/- 3.01%)
>>>>>
>>
>> 4 benchmarks reported out during weekend, with patch 3 on a x86 4s system
>> with 24 cores per socket and 2 HT per core, total 192 CPUs.
>>
>> It looks like mid-load has notable changes on my side:
>> - netperf 50% num of threads in TCP mode has 27.25% improved
>> - tbench 50% num of threads has 9.52% regression
>>
>
> It's interesting that patch 3 would make any difference on x64 given that
> it's SMT2. The scan depth should have been similar. It's somewhat expected
> that it will not be a universal win, particularly once the utilisation
> is high enough to spill over in sched domains (25%, 50%, 75% utilisation
> being interesting on 4-socket systems). In such cases, double scanning can
> still show improvements for workloads that idle rapidly like tbench and
> hackbench even though it's expensive. The extra scanning gives more time
> for a CPU to go idle enough to be selected which can improve throughput
> but at the cost of wake-up latency,

Aha, sorry for the confusion. Since you and Vincent discussed dropping
patch 3, I just meant that I tested all 5 patches with patch 3 included,
not patch 3 alone.

>
> Hopefully v4 can be tested as well which is now just a single scan.
>

Sure, may I know the baseline of v4?

Thanks,
-Aubrey

2021-01-26 22:43:14

by Mel Gorman

Subject: Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

On Mon, Jan 25, 2021 at 12:29:47PM +0800, Li, Aubrey wrote:
> >>> hackbench -l 2560 -g 1 on 8 cores arm64
> >>> v5.11-rc4 : 1.355 (+/- 7.96)
> >>> + sis improvement : 1.923 (+/- 25%)
> >>> + the patch below : 1.332 (+/- 4.95)
> >>>
> >>> hackbench -l 2560 -g 256 on 8 cores arm64
> >>> v5.11-rc4 : 2.116 (+/- 4.62%)
> >>> + sis improvement : 2.216 (+/- 3.84%)
> >>> + the patch below : 2.113 (+/- 3.01%)
> >>>
>
> 4 benchmarks reported out during weekend, with patch 3 on a x86 4s system
> with 24 cores per socket and 2 HT per core, total 192 CPUs.
>
> It looks like mid-load has notable changes on my side:
> - netperf 50% num of threads in TCP mode has 27.25% improved
> - tbench 50% num of threads has 9.52% regression
>

It's interesting that patch 3 would make any difference on x64 given that
it's SMT2. The scan depth should have been similar. It's somewhat expected
that it will not be a universal win, particularly once the utilisation
is high enough to spill over across sched domains (25%, 50% and 75%
utilisation being interesting points on 4-socket systems). In such cases,
double scanning can still show improvements for workloads that idle rapidly
like tbench and hackbench, even though it's expensive. The extra scanning
gives more time for a CPU to go idle enough to be selected, which can
improve throughput but at the cost of wake-up latency.

Hopefully v4 can be tested as well which is now just a single scan.

--
Mel Gorman
SUSE Labs