2022-07-12 08:26:26

by Abel Wu

Subject: [PATCH 0/5] sched/fair: SIS improvements and cleanups

The first patch makes SIS ignore the new sched_feat SIS_UTIL
when there are idle cores available, so as to make use of as
many cpus as possible.

The other patches are split out from the SIS filter patchset [1],
since they do not depend on the filter itself.

[1] https://lore.kernel.org/lkml/[email protected]/

Abel Wu (5):
sched/fair: ignore SIS_UTIL when has idle core
sched/fair: default to false in test_idle_cores
sched/fair: remove redundant check in select_idle_smt
sched/fair: avoid double search on same cpu
sched/fair: remove useless check in select_idle_core

kernel/sched/fair.c | 45 ++++++++++++++++++++++-----------------------
1 file changed, 22 insertions(+), 23 deletions(-)

--
2.31.1


2022-07-12 08:27:34

by Abel Wu

Subject: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

When SIS_UTIL is enabled, the SIS domain scan will be skipped
if the LLC is overloaded. Since the overloaded status is checked
in the load balancing at LLC level, the interval is llc_size
milliseconds. The duration might be long enough to affect the
overall system throughput if idle cores are out of reach of the
SIS domain scan.

Signed-off-by: Abel Wu <[email protected]>
---
kernel/sched/fair.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a78d2e3b9d49..cd758b3616bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
struct sched_domain *this_sd;
u64 time = 0;

- this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
- if (!this_sd)
- return -1;
-
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

- if (sched_feat(SIS_PROP) && !has_idle_core) {
+ if (has_idle_core)
+ goto scan;
+
+ if (sched_feat(SIS_PROP)) {
u64 avg_cost, avg_idle, span_avg;
unsigned long now = jiffies;

+ this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+ if (!this_sd)
+ return -1;
+
/*
* If we're busy, the assumption that the last idle period
* predicts the future is flawed; age away the remaining
@@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
return -1;
}
}
-
+scan:
for_each_cpu_wrap(cpu, cpus, target + 1) {
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
--
2.31.1

2022-07-12 08:27:39

by Abel Wu

Subject: [PATCH 2/5] sched/fair: default to false in test_idle_cores

It's uncertain whether idle cores exist if the shared sched-
domain is not ready, so returning "no idle cores" usually
makes sense.

__update_idle_core() is an exception: it checks the status of
this core and sets has_idle_cores in the shared sched-domain
if necessary. So the whole logic depends on the existence of
sds, and can bail out early if !sds.

This is somewhat tricky, and as Josh suggested, the situation
should only be transient while the domain isn't ready. So
remove the self-defined default value to make things clearer,
while introducing only negligible overhead in the idle path.

Signed-off-by: Abel Wu <[email protected]>
Reviewed-by: Josh Don <[email protected]>
---
kernel/sched/fair.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd758b3616bd..c72093f282ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1592,11 +1592,11 @@ numa_type numa_classify(unsigned int imbalance_pct,

#ifdef CONFIG_SCHED_SMT
/* Forward declarations of select_idle_sibling helpers */
-static inline bool test_idle_cores(int cpu, bool def);
+static inline bool test_idle_cores(int cpu);
static inline int numa_idle_core(int idle_core, int cpu)
{
if (!static_branch_likely(&sched_smt_present) ||
- idle_core >= 0 || !test_idle_cores(cpu, false))
+ idle_core >= 0 || !test_idle_cores(cpu))
return idle_core;

/*
@@ -6260,7 +6260,7 @@ static inline void set_idle_cores(int cpu, int val)
WRITE_ONCE(sds->has_idle_cores, val);
}

-static inline bool test_idle_cores(int cpu, bool def)
+static inline bool test_idle_cores(int cpu)
{
struct sched_domain_shared *sds;

@@ -6268,7 +6268,7 @@ static inline bool test_idle_cores(int cpu, bool def)
if (sds)
return READ_ONCE(sds->has_idle_cores);

- return def;
+ return false;
}

/*
@@ -6284,7 +6284,7 @@ void __update_idle_core(struct rq *rq)
int cpu;

rcu_read_lock();
- if (test_idle_cores(core, true))
+ if (test_idle_cores(core))
goto unlock;

for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6360,9 +6360,9 @@ static inline void set_idle_cores(int cpu, int val)
{
}

-static inline bool test_idle_cores(int cpu, bool def)
+static inline bool test_idle_cores(int cpu)
{
- return def;
+ return false;
}

static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
@@ -6604,7 +6604,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return target;

if (sched_smt_active()) {
- has_idle_core = test_idle_cores(target, false);
+ has_idle_core = test_idle_cores(target);

if (!has_idle_core && cpus_share_cache(prev, target)) {
i = select_idle_smt(p, sd, prev);
--
2.31.1

2022-07-12 08:43:25

by Abel Wu

Subject: [PATCH 4/5] sched/fair: avoid double search on same cpu

The prev cpu is checked at the beginning of SIS, and it's unlikely
to have become idle again by the second check in select_idle_smt().
So we'd better focus on its SMT siblings.

Signed-off-by: Abel Wu <[email protected]>
Reviewed-by: Josh Don <[email protected]>
---
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d7e8555bcf9..e4cf000604fc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6344,6 +6344,8 @@ static int select_idle_smt(struct task_struct *p, int target)
int cpu;

for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
+ if (cpu == target)
+ continue;
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
return cpu;
}
--
2.31.1

2022-07-12 08:57:37

by Abel Wu

Subject: [PATCH 5/5] sched/fair: remove useless check in select_idle_core

The function only gets called when sds->has_idle_cores is true,
which is only possible when sched_smt_present is enabled. This
change also aligns select_idle_core() with select_idle_smt() in
that the caller does the check if necessary.

Signed-off-by: Abel Wu <[email protected]>
---
kernel/sched/fair.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4cf000604fc..7c47621ccd40 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6310,9 +6310,6 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
bool idle = true;
int cpu;

- if (!static_branch_likely(&sched_smt_present))
- return __select_idle_cpu(core, p);
-
for_each_cpu(cpu, cpu_smt_mask(core)) {
if (!available_idle_cpu(cpu)) {
idle = false;
--
2.31.1

2022-07-12 09:03:26

by Abel Wu

Subject: [PATCH 3/5] sched/fair: remove redundant check in select_idle_smt

If two cpus share the LLC cache, then the two cores they belong
to are also in the same LLC domain.

Signed-off-by: Abel Wu <[email protected]>
Reviewed-by: Josh Don <[email protected]>
---
kernel/sched/fair.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c72093f282ec..0d7e8555bcf9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6339,14 +6339,11 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
/*
* Scan the local SMT mask for idle CPUs.
*/
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_smt(struct task_struct *p, int target)
{
int cpu;

- for_each_cpu(cpu, cpu_smt_mask(target)) {
- if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
- !cpumask_test_cpu(cpu, sched_domain_span(sd)))
- continue;
+ for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
return cpu;
}
@@ -6370,7 +6367,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
return __select_idle_cpu(core, p);
}

-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+static inline int select_idle_smt(struct task_struct *p, int target)
{
return -1;
}
@@ -6607,7 +6604,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
has_idle_core = test_idle_cores(target);

if (!has_idle_core && cpus_share_cache(prev, target)) {
- i = select_idle_smt(p, sd, prev);
+ i = select_idle_smt(p, prev);
if ((unsigned int)i < nr_cpumask_bits)
return i;
}
--
2.31.1

2022-07-13 03:56:10

by Chen Yu

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On Tue, Jul 12, 2022 at 04:20:32PM +0800, Abel Wu wrote:
> When SIS_UTIL is enabled, SIS domain scan will be skipped if
> the LLC is overloaded. Since the overloaded status is checked
> in the load balancing at LLC level, the interval is llc_size
> miliseconds. The duration might be long enough to affect the
> overall system throughput if idle cores are out of reach in
> SIS domain scan.
The idle core scan was skipped in SIS_UTIL because we saw better
results in some benchmarks. But yes, we could let has_idle_core
trigger a scan regardless of the system load, if we have some
data to support it. I'll test this patch on top of the latest
sched/core branch to see if it makes a difference.
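
For reference, the SIS_UTIL gate in select_idle_cpu() currently looks
roughly like the sketch below (paraphrased from the patch introducing
SIS_UTIL, so details may differ from the exact upstream code). The
nr_idle_scan value is recomputed in update_idle_cpu_scan() during
LLC-level load balancing, which is why the overload status can lag
behind the actual load:

	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			/* because !--nr is the condition to stop scan */
			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
			/* an overloaded LLC is unlikely to have idle cpus/cores */
			if (nr == 1)
				return -1;
		}
	}

When nr == 1 the function bails out before scanning anything, so with
the current code an idle core stays unreachable until the next LLC
load balance raises nr_idle_scan again.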

thanks,
Chenyu
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> kernel/sched/fair.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a78d2e3b9d49..cd758b3616bd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> struct sched_domain *this_sd;
> u64 time = 0;
>
> - this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> - if (!this_sd)
> - return -1;
> -
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> - if (sched_feat(SIS_PROP) && !has_idle_core) {
> + if (has_idle_core)
> + goto scan;
> +
> + if (sched_feat(SIS_PROP)) {
> u64 avg_cost, avg_idle, span_avg;
> unsigned long now = jiffies;
>
> + this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> + if (!this_sd)
> + return -1;
> +
> /*
> * If we're busy, the assumption that the last idle period
> * predicts the future is flawed; age away the remaining
> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> return -1;
> }
> }
> -
> +scan:
> for_each_cpu_wrap(cpu, cpus, target + 1) {
> if (has_idle_core) {
> i = select_idle_core(p, cpu, cpus, &idle_cpu);
> --
> 2.31.1
>

2022-07-13 16:46:27

by Abel Wu

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core


On 7/13/22 11:47 AM, Chen Yu Wrote:
> On Tue, Jul 12, 2022 at 04:20:32PM +0800, Abel Wu wrote:
>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>> the LLC is overloaded. Since the overloaded status is checked
>> in the load balancing at LLC level, the interval is llc_size
>> miliseconds. The duration might be long enough to affect the
>> overall system throughput if idle cores are out of reach in
>> SIS domain scan.
> The idle core scan was skipped in SIS_UTIL because we saw better
> improvement in some benchmarks. But yes, we could make has_idle_core
> to scan anyway no matter what the system load is, if we have some
> data to support it. I'll test this patch on top of latest sched/core
> branch to see if this makes a difference.

I did some benchmarking on my machine: netperf showed a ~3%
regression, while hackbench pipe showed considerable improvement.

Maybe some workloads benefit from selecting recently used cpus
rather than idle ones, but I don't think the recent cpu can
out-perform an idle core by much, since the two cpus still share
the LLC cache and an idle core has more capacity than an idle
cpu. Apart from that case, idle cores are usually the better
choice I think. So it might be worth a full scan if idle cores
are available.

Thanks,
Abel

-------8<---

Tests are done on an Intel Xeon(R) Platinum 8260 [email protected] machine
with 2 NUMA nodes, each of which has 24 cores with SMT2 enabled, so 96
CPUs in total.

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.

Based on tip sched/core c02d5546ea34 (v5.19-rc2).

1) hackbench-process-pipes

Amean 1 0.2307 ( 0.00%) 0.2277 ( 1.30%)
Amean 4 0.6277 ( 0.00%) 0.6037 ( 3.82%)
Amean 7 0.7897 ( 0.00%) 0.7710 * 2.36%*
Amean 12 1.2067 ( 0.00%) 1.3087 ( -8.45%)
Amean 21 2.6717 ( 0.00%) 2.5077 ( 6.14%)
Amean 30 5.0893 ( 0.00%) 3.9610 ( 22.17%)
Amean 48 9.2407 ( 0.00%) 7.4030 * 19.89%*
Amean 79 9.5607 ( 0.00%) 9.3960 ( 1.72%)
Amean 110 10.6053 ( 0.00%) 10.6453 ( -0.38%)
Amean 141 12.8210 ( 0.00%) 12.5253 ( 2.31%)
Amean 172 14.9777 ( 0.00%) 14.4317 ( 3.65%)
Amean 203 18.1320 ( 0.00%) 16.9753 * 6.38%*
Amean 234 20.1207 ( 0.00%) 19.0730 * 5.21%*
Amean 265 22.7037 ( 0.00%) 21.7953 ( 4.00%)
Amean 296 25.6753 ( 0.00%) 23.6877 * 7.74%*

2) hackbench-process-sockets

Amean 1 0.4223 ( 0.00%) 0.4300 ( -1.82%)
Amean 4 1.4470 ( 0.00%) 1.4593 * -0.85%*
Amean 7 2.4803 ( 0.00%) 2.4843 ( -0.16%)
Amean 12 4.1457 ( 0.00%) 4.1170 * 0.69%*
Amean 21 7.0223 ( 0.00%) 7.0053 ( 0.24%)
Amean 30 9.9570 ( 0.00%) 9.8683 * 0.89%*
Amean 48 16.0213 ( 0.00%) 15.6657 * 2.22%*
Amean 79 26.8140 ( 0.00%) 26.3657 * 1.67%*
Amean 110 37.3530 ( 0.00%) 36.8800 * 1.27%*
Amean 141 47.6123 ( 0.00%) 47.1627 ( 0.94%)
Amean 172 57.6757 ( 0.00%) 57.1430 * 0.92%*
Amean 203 68.2310 ( 0.00%) 67.6030 ( 0.92%)
Amean 234 78.1990 ( 0.00%) 77.7073 * 0.63%*
Amean 265 88.9657 ( 0.00%) 87.9833 * 1.10%*
Amean 296 99.8617 ( 0.00%) 98.1073 * 1.76%*

3) hackbench-thread-pipes

Amean 1 0.2650 ( 0.00%) 0.2807 ( -5.91%)
Amean 4 0.6650 ( 0.00%) 0.6257 * 5.91%*
Amean 7 0.8283 ( 0.00%) 0.8207 ( 0.93%)
Amean 12 1.3343 ( 0.00%) 1.3050 ( 2.20%)
Amean 21 3.7053 ( 0.00%) 2.7530 * 25.70%*
Amean 30 7.2817 ( 0.00%) 4.2013 * 42.30%*
Amean 48 9.9037 ( 0.00%) 8.8483 * 10.66%*
Amean 79 10.0790 ( 0.00%) 10.1603 ( -0.81%)
Amean 110 11.5837 ( 0.00%) 11.1297 ( 3.92%)
Amean 141 13.4760 ( 0.00%) 12.9903 ( 3.60%)
Amean 172 16.2357 ( 0.00%) 15.5903 ( 3.97%)
Amean 203 18.2873 ( 0.00%) 17.8690 ( 2.29%)
Amean 234 21.2920 ( 0.00%) 20.4680 ( 3.87%)
Amean 265 23.9393 ( 0.00%) 22.3933 * 6.46%*
Amean 296 26.6683 ( 0.00%) 26.1260 ( 2.03%)

4) hackbench-thread-sockets

Amean 1 0.4700 ( 0.00%) 0.4437 ( 5.60%)
Amean 4 1.4837 ( 0.00%) 1.4960 ( -0.83%)
Amean 7 2.5497 ( 0.00%) 2.5240 * 1.01%*
Amean 12 4.2473 ( 0.00%) 4.2137 * 0.79%*
Amean 21 7.2530 ( 0.00%) 7.1800 * 1.01%*
Amean 30 10.1973 ( 0.00%) 10.1483 ( 0.48%)
Amean 48 16.2163 ( 0.00%) 16.0870 * 0.80%*
Amean 79 27.4460 ( 0.00%) 27.0770 * 1.34%*
Amean 110 38.1103 ( 0.00%) 37.5573 * 1.45%*
Amean 141 48.4513 ( 0.00%) 48.0347 ( 0.86%)
Amean 172 59.4410 ( 0.00%) 58.4020 * 1.75%*
Amean 203 70.0873 ( 0.00%) 69.0470 * 1.48%*
Amean 234 80.3570 ( 0.00%) 79.4163 * 1.17%*
Amean 265 91.8550 ( 0.00%) 90.3417 * 1.65%*
Amean 296 102.7420 ( 0.00%) 100.8933 * 1.80%*

5) netperf-udp

Hmean send-64 210.14 ( 0.00%) 202.26 * -3.75%*
Hmean send-128 418.23 ( 0.00%) 404.14 * -3.37%*
Hmean send-256 821.31 ( 0.00%) 789.94 * -3.82%*
Hmean send-1024 3161.96 ( 0.00%) 3025.23 * -4.32%*
Hmean send-2048 5959.67 ( 0.00%) 5725.35 * -3.93%*
Hmean send-3312 9196.99 ( 0.00%) 8830.47 * -3.99%*
Hmean send-4096 11061.30 ( 0.00%) 10675.83 * -3.48%*
Hmean send-8192 17320.16 ( 0.00%) 17601.11 * 1.62%*
Hmean send-16384 27936.12 ( 0.00%) 27859.52 ( -0.27%)
Hmean recv-64 210.14 ( 0.00%) 202.26 * -3.75%*
Hmean recv-128 418.23 ( 0.00%) 404.14 * -3.37%*
Hmean recv-256 821.31 ( 0.00%) 789.94 * -3.82%*
Hmean recv-1024 3161.95 ( 0.00%) 3025.23 * -4.32%*
Hmean recv-2048 5959.44 ( 0.00%) 5725.35 * -3.93%*
Hmean recv-3312 9196.99 ( 0.00%) 8830.44 * -3.99%*
Hmean recv-4096 11061.26 ( 0.00%) 10675.83 * -3.48%*
Hmean recv-8192 17319.62 ( 0.00%) 17601.10 * 1.63%*
Hmean recv-16384 27935.00 ( 0.00%) 27859.40 ( -0.27%)

6) netperf-tcp

Hmean 64 1222.35 ( 0.00%) 1195.62 * -2.19%*
Hmean 128 2372.82 ( 0.00%) 2311.15 * -2.60%*
Hmean 256 4305.82 ( 0.00%) 4244.00 ( -1.44%)
Hmean 1024 13645.83 ( 0.00%) 13299.13 * -2.54%*
Hmean 2048 21119.56 ( 0.00%) 21042.51 ( -0.36%)
Hmean 3312 24835.87 ( 0.00%) 25282.73 * 1.80%*
Hmean 4096 26705.57 ( 0.00%) 26851.09 ( 0.54%)
Hmean 8192 30889.25 ( 0.00%) 31436.74 * 1.77%*
Hmean 16384 35108.55 ( 0.00%) 35172.76 ( 0.18%)

7) tbench4 Throughput

Hmean 1 289.23 ( 0.00%) 289.58 ( 0.12%)
Hmean 2 577.35 ( 0.00%) 576.40 * -0.17%*
Hmean 4 1141.02 ( 0.00%) 1155.58 * 1.28%*
Hmean 8 2258.36 ( 0.00%) 2287.82 * 1.30%*
Hmean 16 4410.02 ( 0.00%) 4510.76 * 2.28%*
Hmean 32 7414.89 ( 0.00%) 7474.74 * 0.81%*
Hmean 64 9011.49 ( 0.00%) 8973.50 * -0.42%*
Hmean 128 19892.18 ( 0.00%) 19913.82 * 0.11%*
Hmean 256 19854.73 ( 0.00%) 20239.21 * 1.94%*
Hmean 384 19808.80 ( 0.00%) 19709.59 * -0.50%*



>
> thanks,
> Chenyu
>>
>> Signed-off-by: Abel Wu <[email protected]>
>> ---
>> kernel/sched/fair.c | 15 +++++++++------
>> 1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a78d2e3b9d49..cd758b3616bd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> struct sched_domain *this_sd;
>> u64 time = 0;
>>
>> - this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> - if (!this_sd)
>> - return -1;
>> -
>> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>
>> - if (sched_feat(SIS_PROP) && !has_idle_core) {
>> + if (has_idle_core)
>> + goto scan;
>> +
>> + if (sched_feat(SIS_PROP)) {
>> u64 avg_cost, avg_idle, span_avg;
>> unsigned long now = jiffies;
>>
>> + this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> + if (!this_sd)
>> + return -1;
>> +
>> /*
>> * If we're busy, the assumption that the last idle period
>> * predicts the future is flawed; age away the remaining
>> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> return -1;
>> }
>> }
>> -
>> +scan:
>> for_each_cpu_wrap(cpu, cpus, target + 1) {
>> if (has_idle_core) {
>> i = select_idle_core(p, cpu, cpus, &idle_cpu);
>> --
>> 2.31.1
>>

2022-07-14 06:43:30

by Yicong Yang

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On 2022/7/12 16:20, Abel Wu wrote:
> When SIS_UTIL is enabled, SIS domain scan will be skipped if
> the LLC is overloaded. Since the overloaded status is checked
> in the load balancing at LLC level, the interval is llc_size
> miliseconds. The duration might be long enough to affect the
> overall system throughput if idle cores are out of reach in
> SIS domain scan.
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> kernel/sched/fair.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a78d2e3b9d49..cd758b3616bd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> struct sched_domain *this_sd;
> u64 time = 0;
>
> - this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> - if (!this_sd)
> - return -1;
> -
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> - if (sched_feat(SIS_PROP) && !has_idle_core) {
> + if (has_idle_core)
> + goto scan;
> +
> + if (sched_feat(SIS_PROP)) {
> u64 avg_cost, avg_idle, span_avg;
> unsigned long now = jiffies;
>
> + this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> + if (!this_sd)
> + return -1;
> +

I don't follow the change here. True that this_sd is used only in SIS_PROP,
but it seems unrelated to your commit. Does the position of this make any
performance difference?

Thanks.

> /*
> * If we're busy, the assumption that the last idle period
> * predicts the future is flawed; age away the remaining
> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> return -1;
> }
> }
> -
> +scan:
> for_each_cpu_wrap(cpu, cpus, target + 1) {
> if (has_idle_core) {
> i = select_idle_core(p, cpu, cpus, &idle_cpu);
>

2022-07-14 07:10:13

by Abel Wu

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core


On 7/14/22 2:19 PM, Yicong Yang Wrote:
> On 2022/7/12 16:20, Abel Wu wrote:
>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>> the LLC is overloaded. Since the overloaded status is checked
>> in the load balancing at LLC level, the interval is llc_size
>> miliseconds. The duration might be long enough to affect the
>> overall system throughput if idle cores are out of reach in
>> SIS domain scan.
>>
>> Signed-off-by: Abel Wu <[email protected]>
>> ---
>> kernel/sched/fair.c | 15 +++++++++------
>> 1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a78d2e3b9d49..cd758b3616bd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> struct sched_domain *this_sd;
>> u64 time = 0;
>>
>> - this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> - if (!this_sd)
>> - return -1;
>> -
>> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>
>> - if (sched_feat(SIS_PROP) && !has_idle_core) {
>> + if (has_idle_core)
>> + goto scan;
>> +
>> + if (sched_feat(SIS_PROP)) {
>> u64 avg_cost, avg_idle, span_avg;
>> unsigned long now = jiffies;
>>
>> + this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> + if (!this_sd)
>> + return -1;
>> +
>
> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
> commit. Does the position of this make any performance difference?

No, this change doesn't make much difference to performance. Are
you suggesting that I should make this a separate patch?

Thanks,
Abel

>
> Thanks.
>
>> /*
>> * If we're busy, the assumption that the last idle period
>> * predicts the future is flawed; age away the remaining
>> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> return -1;
>> }
>> }
>> -
>> +scan:
>> for_each_cpu_wrap(cpu, cpus, target + 1) {
>> if (has_idle_core) {
>> i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>

2022-07-14 07:47:47

by Yicong Yang

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On 2022/7/14 14:58, Abel Wu wrote:
>
> On 7/14/22 2:19 PM, Yicong Yang Wrote:
>> On 2022/7/12 16:20, Abel Wu wrote:
>>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>>> the LLC is overloaded. Since the overloaded status is checked
>>> in the load balancing at LLC level, the interval is llc_size
>>> miliseconds. The duration might be long enough to affect the
>>> overall system throughput if idle cores are out of reach in
>>> SIS domain scan.
>>>
>>> Signed-off-by: Abel Wu <[email protected]>
>>> ---
>>>   kernel/sched/fair.c | 15 +++++++++------
>>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index a78d2e3b9d49..cd758b3616bd 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>       struct sched_domain *this_sd;
>>>       u64 time = 0;
>>>   -    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>> -    if (!this_sd)
>>> -        return -1;
>>> -
>>>       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>   -    if (sched_feat(SIS_PROP) && !has_idle_core) {
>>> +    if (has_idle_core)
>>> +        goto scan;
>>> +
>>> +    if (sched_feat(SIS_PROP)) {
>>>           u64 avg_cost, avg_idle, span_avg;
>>>           unsigned long now = jiffies;
>>>   +        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>> +        if (!this_sd)
>>> +            return -1;
>>> +
>>
>> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
>> commit. Does the position of this make any performance difference?
>
> No, this change doesn't make much difference to performance. Are
> you suggesting that I should make this a separate patch?
>

It just makes me think that the dereference is unnecessary if this_cpu and
target are located in the same LLC, since sd has already been passed in. But
since you noticed no difference it may have little effect. :)

> Thanks,
> Abel
>
>>
>> Thanks.
>>
>>>           /*
>>>            * If we're busy, the assumption that the last idle period
>>>            * predicts the future is flawed; age away the remaining
>>> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>                   return -1;
>>>           }
>>>       }
>>> -
>>> +scan:
>>>       for_each_cpu_wrap(cpu, cpus, target + 1) {
>>>           if (has_idle_core) {
>>>               i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>>
> .

2022-07-14 08:13:09

by Abel Wu

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core


On 7/14/22 3:15 PM, Yicong Yang Wrote:
> On 2022/7/14 14:58, Abel Wu wrote:
>>
>> On 7/14/22 2:19 PM, Yicong Yang Wrote:
>>> On 2022/7/12 16:20, Abel Wu wrote:
>>>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>>>> the LLC is overloaded. Since the overloaded status is checked
>>>> in the load balancing at LLC level, the interval is llc_size
>>>> miliseconds. The duration might be long enough to affect the
>>>> overall system throughput if idle cores are out of reach in
>>>> SIS domain scan.
>>>>
>>>> Signed-off-by: Abel Wu <[email protected]>
>>>> ---
>>>>   kernel/sched/fair.c | 15 +++++++++------
>>>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index a78d2e3b9d49..cd758b3616bd 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>>       struct sched_domain *this_sd;
>>>>       u64 time = 0;
>>>>   -    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>> -    if (!this_sd)
>>>> -        return -1;
>>>> -
>>>>       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>   -    if (sched_feat(SIS_PROP) && !has_idle_core) {
>>>> +    if (has_idle_core)
>>>> +        goto scan;
>>>> +
>>>> +    if (sched_feat(SIS_PROP)) {
>>>>           u64 avg_cost, avg_idle, span_avg;
>>>>           unsigned long now = jiffies;
>>>>   +        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>> +        if (!this_sd)
>>>> +            return -1;
>>>> +
>>>
>>> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
>>> commit. Does the position of this make any performance difference?
>>
>> No, this change doesn't make much difference to performance. Are
>> you suggesting that I should make this a separate patch?
>>
>
> It just makes me think that dereference is unnecessary if this_cpu and target locates in
> the same LLC, since it's already been passed. But since you noticed no difference it may
> have little effect. :)
>

Hmm.. not exactly. The sched-domains are cpu private, and this_cpu can
be in a different LLC than target.
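
The relevant declarations are per-cpu (from kernel/sched/sched.h,
abridged):

	DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
	DECLARE_PER_CPU(int, sd_llc_size);
	DECLARE_PER_CPU(int, sd_llc_id);
	DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);

so *this_cpu_ptr(&sd_llc) resolves to the waker's own LLC domain, which
is not necessarily the one spanning the target cpu.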

2022-07-14 08:19:52

by Yicong Yang

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On 2022/7/14 16:00, Abel Wu wrote:
>
> On 7/14/22 3:15 PM, Yicong Yang Wrote:
>> On 2022/7/14 14:58, Abel Wu wrote:
>>>
>>> On 7/14/22 2:19 PM, Yicong Yang Wrote:
>>>> On 2022/7/12 16:20, Abel Wu wrote:
>>>>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>>>>> the LLC is overloaded. Since the overloaded status is checked
>>>>> in the load balancing at LLC level, the interval is llc_size
>>>>> miliseconds. The duration might be long enough to affect the
>>>>> overall system throughput if idle cores are out of reach in
>>>>> SIS domain scan.
>>>>>
>>>>> Signed-off-by: Abel Wu <[email protected]>
>>>>> ---
>>>>>    kernel/sched/fair.c | 15 +++++++++------
>>>>>    1 file changed, 9 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index a78d2e3b9d49..cd758b3616bd 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>>>        struct sched_domain *this_sd;
>>>>>        u64 time = 0;
>>>>>    -    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>>> -    if (!this_sd)
>>>>> -        return -1;
>>>>> -
>>>>>        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>>    -    if (sched_feat(SIS_PROP) && !has_idle_core) {
>>>>> +    if (has_idle_core)
>>>>> +        goto scan;
>>>>> +
>>>>> +    if (sched_feat(SIS_PROP)) {
>>>>>            u64 avg_cost, avg_idle, span_avg;
>>>>>            unsigned long now = jiffies;
>>>>>    +        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>>> +        if (!this_sd)
>>>>> +            return -1;
>>>>> +
>>>>
>>>> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
>>>> commit. Does the position of this make any performance difference?
>>>
>>> No, this change doesn't make much difference to performance. Are
>>> you suggesting that I should make this a separate patch?
>>>
>>
>> It just makes me think that dereference is unnecessary if this_cpu and target locates in
>> the same LLC, since it's already been passed. But since you noticed no difference it may
>> have little effect. :)
>>
>
> Hmm.. Not exactly. The sched-domains are cpu private, and this_cpu can
> be in another LLC than target.
> .

Yes, you're right. Sorry for getting this mixed up.

2022-07-14 08:37:53

by Yicong Yang

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On 2022/7/14 16:16, Yicong Yang wrote:
> On 2022/7/14 16:00, Abel Wu wrote:
>>
>> On 7/14/22 3:15 PM, Yicong Yang Wrote:
>>> On 2022/7/14 14:58, Abel Wu wrote:
>>>>
>>>> On 7/14/22 2:19 PM, Yicong Yang Wrote:
>>>>> On 2022/7/12 16:20, Abel Wu wrote:
>>>>>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>>>>>> the LLC is overloaded. Since the overloaded status is checked
>>>>>> in the load balancing at LLC level, the interval is llc_size
>>>>>> miliseconds. The duration might be long enough to affect the
>>>>>> overall system throughput if idle cores are out of reach in
>>>>>> SIS domain scan.
>>>>>>
>>>>>> Signed-off-by: Abel Wu <[email protected]>
>>>>>> ---
>>>>>>    kernel/sched/fair.c | 15 +++++++++------
>>>>>>    1 file changed, 9 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index a78d2e3b9d49..cd758b3616bd 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>>>>        struct sched_domain *this_sd;
>>>>>>        u64 time = 0;
>>>>>>    -    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>>>> -    if (!this_sd)
>>>>>> -        return -1;
>>>>>> -
>>>>>>        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>>>    -    if (sched_feat(SIS_PROP) && !has_idle_core) {
>>>>>> +    if (has_idle_core)
>>>>>> +        goto scan;
>>>>>> +
>>>>>> +    if (sched_feat(SIS_PROP)) {
>>>>>>            u64 avg_cost, avg_idle, span_avg;
>>>>>>            unsigned long now = jiffies;
>>>>>>    +        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>>>> +        if (!this_sd)
>>>>>> +            return -1;
>>>>>> +
>>>>>
>>>>> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
>>>>> commit. Does the position of this make any performance difference?
>>>>
>>>> No, this change doesn't make much difference to performance. Are
>>>> you suggesting that I should make this a separate patch?
>>>>
>>>
>>> It just makes me think that dereference is unnecessary if this_cpu and target locates in
>>> the same LLC, since it's already been passed. But since you noticed no difference it may
>>> have little effect. :)
>>>
>>
>> Hmm.. Not exactly. The sched-domains are cpu private

>>, and this_cpu can be in another LLC than target.

This is not the condition I meant. I would have thought it would make sense
only when they're in the same LLC. Anyway, since there's no difference whether
we dereference there or not, it doesn't matter at all.
Thanks for the explanation.

>> .
>
> yes. you're right. sorry for get this messed.
>
> .
>

2022-08-04 10:36:42

by Yu Chen

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On Thu, Jul 14, 2022 at 3:11 PM Abel Wu <[email protected]> wrote:
>
>
> On 7/14/22 2:19 PM, Yicong Yang Wrote:
> > On 2022/7/12 16:20, Abel Wu wrote:
> >> When SIS_UTIL is enabled, SIS domain scan will be skipped if
> >> the LLC is overloaded. Since the overloaded status is checked
> >> in the load balancing at LLC level, the interval is llc_size
> >> miliseconds. The duration might be long enough to affect the
> >> overall system throughput if idle cores are out of reach in
> >> SIS domain scan.
> >>
> >> Signed-off-by: Abel Wu <[email protected]>
> >> ---
> >> kernel/sched/fair.c | 15 +++++++++------
> >> 1 file changed, 9 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index a78d2e3b9d49..cd758b3616bd 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >> struct sched_domain *this_sd;
> >> u64 time = 0;
> >>
> >> - this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> >> - if (!this_sd)
> >> - return -1;
> >> -
> >> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>
> >> - if (sched_feat(SIS_PROP) && !has_idle_core) {
> >> + if (has_idle_core)
> >> + goto scan;
> >> +
> >> + if (sched_feat(SIS_PROP)) {
> >> u64 avg_cost, avg_idle, span_avg;
> >> unsigned long now = jiffies;
> >>
> >> + this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> >> + if (!this_sd)
> >> + return -1;
> >> +
> >
> > I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
> > commit. Does the position of this make any performance difference?
>
> No, this change doesn't make much difference to performance. Are
> you suggesting that I should make this a separate patch?
>
I took a look at this patch again before I start an OLTP test. I think
the position change of the sd_llc dereference might not be closely
connected with the current change, as Yicong mentioned. Besides, after
moving the dereference inside SIS_PROP, we would do cpumask_and() no
matter whether sd_llc is valid or not, which might be an extra cost?

thanks,
Chenyu

> Thanks,
> Abel
>
> >
> > Thanks.
> >
> >> /*
> >> * If we're busy, the assumption that the last idle period
> >> * predicts the future is flawed; age away the remaining
> >> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >> return -1;
> >> }
> >> }
> >> -
> >> +scan:
> >> for_each_cpu_wrap(cpu, cpus, target + 1) {
> >> if (has_idle_core) {
> >> i = select_idle_core(p, cpu, cpus, &idle_cpu);
> >>



--
Thanks,
Chenyu

2022-08-10 14:20:15

by Yu Chen

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On Tue, Jul 12, 2022 at 4:45 PM Abel Wu <[email protected]> wrote:
>
> When SIS_UTIL is enabled, SIS domain scan will be skipped if
> the LLC is overloaded. Since the overloaded status is checked
> in the load balancing at LLC level, the interval is llc_size
> miliseconds. The duration might be long enough to affect the
> overall system throughput if idle cores are out of reach in
> SIS domain scan.
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
>
Tested schbench and netperf on the latest 5.19 vanilla kernel. It seems that
there is a latency improvement in schbench when the load is low, and no
performance difference on netperf.

./report.py -b 5.19.0+ -c 5.19.0-skip-sis-util+ -t schbench

schbench
========
case load baseline(std%) compare%( std%)
normal mthread-1 1.00 ( 0.00) +7.69 ( 0.00)
normal mthread-2 1.00 ( 0.00) +13.24 ( 0.00)
normal mthread-4 1.00 ( 0.00) -5.88 ( 0.00)
normal mthread-8 1.00 ( 0.00) -0.25 ( 0.00)


./report.py -b 5.19.0+ -c 5.19.0-skip-sis-util+ -t netperf
netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR thread-28 1.00 ( 0.62) +0.15 ( 0.55)
TCP_RR thread-56 1.00 ( 0.42) -0.26 ( 0.40)
TCP_RR thread-84 1.00 ( 0.29) +0.39 ( 0.29)
TCP_RR thread-112 1.00 ( 0.22) +0.44 ( 0.23)
TCP_RR thread-140 1.00 ( 0.17) +0.33 ( 0.18)
TCP_RR thread-168 1.00 ( 0.17) +0.19 ( 0.16)
TCP_RR thread-196 1.00 ( 13.65) -0.62 ( 14.83)
TCP_RR thread-224 1.00 ( 9.80) -0.65 ( 9.67)
UDP_RR thread-28 1.00 ( 0.89) +0.92 ( 0.81)
UDP_RR thread-56 1.00 ( 0.78) +0.38 ( 0.73)
UDP_RR thread-84 1.00 ( 14.03) +0.78 ( 16.85)
UDP_RR thread-112 1.00 ( 12.26) -0.42 ( 11.95)
UDP_RR thread-140 1.00 ( 9.86) -0.89 ( 6.93)
UDP_RR thread-168 1.00 ( 11.62) -0.82 ( 8.80)
UDP_RR thread-196 1.00 ( 19.47) +0.42 ( 16.50)
UDP_RR thread-224 1.00 ( 18.68) +0.72 ( 18.50)


Tested-by: Chen Yu <[email protected]>

--
Thanks,
Chenyu

2022-08-15 02:51:39

by Abel Wu

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

Hi Chen, thanks for your testing!

On 8/10/22 9:50 PM, Chen Yu Wrote:
> On Tue, Jul 12, 2022 at 4:45 PM Abel Wu <[email protected]> wrote:
>>
>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>> the LLC is overloaded. Since the overloaded status is checked
>> in the load balancing at LLC level, the interval is llc_size
>> miliseconds. The duration might be long enough to affect the
>> overall system throughput if idle cores are out of reach in
>> SIS domain scan.
>>
>> Signed-off-by: Abel Wu <[email protected]>
>> ---
>>
> Tested schbench and netperf on latest 5.19 vanilla, it seems that there
> is latency performance improvement when the load is low in schbench,
> and no performance difference on netperf.
>
> ./report.py -b 5.19.0+ -c 5.19.0-skip-sis-util+ -t schbench
>
> schbench
> ========
> case load baseline(std%) compare%( std%)
> normal mthread-1 1.00 ( 0.00) +7.69 ( 0.00)
> normal mthread-2 1.00 ( 0.00) +13.24 ( 0.00)
> normal mthread-4 1.00 ( 0.00) -5.88 ( 0.00)
> normal mthread-8 1.00 ( 0.00) -0.25 ( 0.00)
>
>
> ./report.py -b 5.19.0+ -c 5.19.0-skip-sis-util+ -t netperf
> netperf
> =======
> case load baseline(std%) compare%( std%)
> TCP_RR thread-28 1.00 ( 0.62) +0.15 ( 0.55)
> TCP_RR thread-56 1.00 ( 0.42) -0.26 ( 0.40)
> TCP_RR thread-84 1.00 ( 0.29) +0.39 ( 0.29)
> TCP_RR thread-112 1.00 ( 0.22) +0.44 ( 0.23)
> TCP_RR thread-140 1.00 ( 0.17) +0.33 ( 0.18)
> TCP_RR thread-168 1.00 ( 0.17) +0.19 ( 0.16)
> TCP_RR thread-196 1.00 ( 13.65) -0.62 ( 14.83)
> TCP_RR thread-224 1.00 ( 9.80) -0.65 ( 9.67)
> UDP_RR thread-28 1.00 ( 0.89) +0.92 ( 0.81)
> UDP_RR thread-56 1.00 ( 0.78) +0.38 ( 0.73)
> UDP_RR thread-84 1.00 ( 14.03) +0.78 ( 16.85)
> UDP_RR thread-112 1.00 ( 12.26) -0.42 ( 11.95)
> UDP_RR thread-140 1.00 ( 9.86) -0.89 ( 6.93)
> UDP_RR thread-168 1.00 ( 11.62) -0.82 ( 8.80)
> UDP_RR thread-196 1.00 ( 19.47) +0.42 ( 16.50)
> UDP_RR thread-224 1.00 ( 18.68) +0.72 ( 18.50)
>
>
> Tested-by: Chen Yu <[email protected]>
>

2022-08-15 03:24:30

by Abel Wu

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On 8/4/22 5:59 PM, Chen Yu Wrote:
> On Thu, Jul 14, 2022 at 3:11 PM Abel Wu <[email protected]> wrote:
>>
>>
>> On 7/14/22 2:19 PM, Yicong Yang Wrote:
>>> On 2022/7/12 16:20, Abel Wu wrote:
>>>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>>>> the LLC is overloaded. Since the overloaded status is checked
>>>> in the load balancing at LLC level, the interval is llc_size
>>>> miliseconds. The duration might be long enough to affect the
>>>> overall system throughput if idle cores are out of reach in
>>>> SIS domain scan.
>>>>
>>>> Signed-off-by: Abel Wu <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 15 +++++++++------
>>>> 1 file changed, 9 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index a78d2e3b9d49..cd758b3616bd 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>> struct sched_domain *this_sd;
>>>> u64 time = 0;
>>>>
>>>> - this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>> - if (!this_sd)
>>>> - return -1;
>>>> -
>>>> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>
>>>> - if (sched_feat(SIS_PROP) && !has_idle_core) {
>>>> + if (has_idle_core)
>>>> + goto scan;
>>>> +
>>>> + if (sched_feat(SIS_PROP)) {
>>>> u64 avg_cost, avg_idle, span_avg;
>>>> unsigned long now = jiffies;
>>>>
>>>> + this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>> + if (!this_sd)
>>>> + return -1;
>>>> +
>>>
>>> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
>>> commit. Does the position of this make any performance difference?
>>
>> No, this change doesn't make much difference to performance. Are
>> you suggesting that I should make this a separate patch?
>>
> I took a look at this patch again before I start a OLTP test. I
> thought the position change of
> dereference sd_llc might not be closely connected with current change
> as Yicong mentioned.

OK, I will make it a separate patch. But before that I'd prefer to wait
for more comments :)

> Besides, after moving the dereference inside SIS_PROP, we might do
> cpumask_and() no matter whether
> sd_llc is valid or not, which might be of extra cost?
>
I think it might be irrelevant whether the local sd_llc is valid or
not, since all we care about is the target's sd_llc if !SIS_PROP.

Best Regards,
Abel

2022-08-15 13:34:14

by Abel Wu

Subject: Re: [PATCH 0/5] sched/fair: SIS improvements and cleanups

Hi Peter, it would be appreciated if you could offer some advice!

Thanks & Best,
Abel

On 7/12/22 4:20 PM, Abel Wu Wrote:
> The first patch ignores the new sched_feat SIS_UTIL when
> there are idle cores available to make full use of cpus
> as possible.
>
> The other patches come from the SIS filter patchset [1],
> since they are irrelevant to the filter.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Abel Wu (5):
> sched/fair: ignore SIS_UTIL when has idle core
> sched/fair: default to false in test_idle_cores
> sched/fair: remove redundant check in select_idle_smt
> sched/fair: avoid double search on same cpu
> sched/fair: remove useless check in select_idle_core
>
> kernel/sched/fair.c | 45 ++++++++++++++++++++++-----------------------
> 1 file changed, 22 insertions(+), 23 deletions(-)
>

2022-08-29 12:58:47

by Mel Gorman

Subject: Re: [PATCH 4/5] sched/fair: avoid double search on same cpu

On Tue, Jul 12, 2022 at 04:20:35PM +0800, Abel Wu wrote:
> The prev cpu is checked at the beginning of SIS, and it's unlikely
> to be idle before the second check in select_idle_smt(). So we'd
> better focus on its SMT siblings.
>
> Signed-off-by: Abel Wu <[email protected]>
> Reviewed-by: Josh Don <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2022-08-29 12:59:11

by Mel Gorman

Subject: Re: [PATCH 5/5] sched/fair: remove useless check in select_idle_core

On Tue, Jul 12, 2022 at 04:20:36PM +0800, Abel Wu wrote:
> The function only gets called when sds->has_idle_cores is true
> which can be possible only when sched_smt_present is enabled.
> This change also aligns select_idle_core with select_idle_smt
> that the caller do the check if necessary.
>
> Signed-off-by: Abel Wu <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2022-08-29 13:04:21

by Mel Gorman

Subject: Re: [PATCH 3/5] sched/fair: remove redundant check in select_idle_smt

On Tue, Jul 12, 2022 at 04:20:34PM +0800, Abel Wu wrote:
> If two cpus share LLC cache, then the two cores they belong to
> are also in the same LLC domain.
>
> Signed-off-by: Abel Wu <[email protected]>
> Reviewed-by: Josh Don <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2022-08-29 13:11:22

by Mel Gorman

Subject: Re: [PATCH 2/5] sched/fair: default to false in test_idle_cores

On Tue, Jul 12, 2022 at 04:20:33PM +0800, Abel Wu wrote:
> It's uncertain whether idle cores exist or not if shared sched-
> domains are not ready, so returning "no idle cores" usually
> makes sense.
>
> While __update_idle_core() is an exception, it checks status
> of this core and set to shared sched-domain if necessary. So
> the whole logic depends on the existence of sds, and can bail
> out early if !sds.
>
> It's somehow a little tricky, and as Josh suggested that it
> should be transient while the domain isn't ready. So remove
> the self-defined default value to make things more clearer,
> while introduce little overhead in idle path.
>
> Signed-off-by: Abel Wu <[email protected]>
> Reviewed-by: Josh Don <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2022-08-29 13:55:42

by Mel Gorman

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On Tue, Jul 12, 2022 at 04:20:32PM +0800, Abel Wu wrote:
> When SIS_UTIL is enabled, SIS domain scan will be skipped if
> the LLC is overloaded. Since the overloaded status is checked
> in the load balancing at LLC level, the interval is llc_size
> miliseconds. The duration might be long enough to affect the
> overall system throughput if idle cores are out of reach in
> SIS domain scan.
>
> Signed-off-by: Abel Wu <[email protected]>

Split this patch to move the this_sd lookup into the SIS_PROP section.

Otherwise, this is the most controversial patch in the series and the
most likely to cause problems where it wins on some machines and
workloads and loses on others.

The corner case to worry about is a workload that is rapidly idling where
the has_idle_core hint is often wrong. This can happen with hackbench for
example, or at least this was true when I last looked, which is quite some
time ago. If the hint is often wrong, then the full scan cost will be
incurred regardless of SIS_UTIL, and the scan will often fail to find a core.
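
For context, the hint is only set from the idle path when every sibling
of a core has gone idle, and it is only cleared at the end of
select_idle_cpu() after a scan failed to find an idle core, roughly:

	/* __update_idle_core(): all siblings of this core are idle */
	set_idle_cores(core, 1);

	/* end of select_idle_cpu(): the core scan found nothing */
	if (has_idle_core)
		set_idle_cores(target, false);

so the hint can easily be stale on a rapidly idling workload, and a
wrong "true" means paying for a full scan that finds nothing.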

So, first suggestion is to move this patch to the end of the series as
the other patches are relatively harmless. They could even be merged in
isolation as a cleanup.

Second, using the other patches as your baseline, include in the
changelog what you tested that showed a benefit, what type of machine
it was and in particular include the number of cores, nodes and the
span of the LLC. If you measured any regressions, include that as well
and make a call on whether you think the patch wins more than it loses.
The reason to include that information is because the worst corner case
(all CPUs scanned uselessly) costs more the larger the span of LLC is.
If all the testing was done on a 2-core SMT-2 machine, the overhead of the
patch would be negligible but very different if the LLC span is 20 cores.
While the patch is not obviously wrong, it definitely needs better data.
Even if you do not have a large test machine available, it's still helpful
to have it in the changelog because a reviewer like me can decide "this
needs testing on a larger machine".

I did queue up the entire series for testing and while it sometimes
benefitted, there were large counter-examples. tbench4 on Zen3, which has
multiple small L3 caches per node, showed some large regressions (e.g. 41%
loss on 64 clients with a massive increase in variability). tbench4
(slightly modified in mmtests to produce a usable metric) in general showed
issues across multiple x86-64 machines, both AMD and Intel, multiple
generations, with a noticeable increase in system CPU usage when the client
counts reach the stage where the machine is close to saturated. perfpipe
for some weird reason showed a large apparent regression on Broadwell but
the variability was also crazy so it can probably be ignored. hackbench
overall looked ok so it's possible I'm wrong about the idle_cores hint
being often wrong on that workload and I should double check that. It's
possible the hint is wrong some of the time but right often enough to
either benefit from using an idle core or from finding an idle sibling,
which may be preferable to stacking tasks on the same CPU.

The lack of data and the lack of a note on the potential downside are the
main reasons why I'm not acking the patch. tbench4 was a particular concern
in my own tests and it's possible a better patch would be a hybrid approach
where a limited search of two cores (excluding target) is allowed even if
SIS_UTIL indicates overload but has_idle_cores is true.
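
Something along these lines, as an untested sketch (the "overloaded" and
"nr_cores" names below are made up for illustration, not existing code):

	int nr_cores = INT_MAX;

	/*
	 * SIS_UTIL says the LLC is overloaded, but the has_idle_cores
	 * hint is set: allow a token search of two cores instead of
	 * bailing out entirely.
	 */
	if (overloaded) {
		if (!has_idle_core)
			return -1;
		nr_cores = 2;
	}

	for_each_cpu_wrap(cpu, cpus, target + 1) {
		if (has_idle_core) {
			i = select_idle_core(p, cpu, cpus, &idle_cpu);
			if ((unsigned int)i < nr_cpumask_bits)
				return i;
			if (!--nr_cores)
				return -1;
		} else {
			/* existing limited per-cpu scan, unchanged */
		}
	}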

--
Mel Gorman
SUSE Labs

2022-08-29 14:18:41

by Abel Wu

Subject: Re: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

Hi Mel, thanks for reviewing!

On 8/29/22 9:08 PM, Mel Gorman Wrote:
> On Tue, Jul 12, 2022 at 04:20:32PM +0800, Abel Wu wrote:
>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>> the LLC is overloaded. Since the overloaded status is checked
>> in the load balancing at LLC level, the interval is llc_size
>> miliseconds. The duration might be long enough to affect the
>> overall system throughput if idle cores are out of reach in
>> SIS domain scan.
>>
>> Signed-off-by: Abel Wu <[email protected]>
>
> Split this patch to move the this_sd lookup into the SIS_PROP section.

OK

>
> Otherwise, this is the most controversial patch in the series and the
> most likely to cause problems where it wins on some machines and
> workloads and loses on others.
>
> The corner case to worry about is a workload that is rapidly idling and
> the has_idle_core hint is often wrong. This can happen with hackbench for
> example, or at least this was true when I last looked which is quite some
> time ago. If this hint is often wrong, then there will be full scan cost
> incurred regardless of SIS_UTIL that often fails to find a core.

Yes, I can't agree more. And the situation can become worse when LLC
gets bigger as you mentioned below. I will exclude this part from v2.

>
> So, first suggestion is to move this patch to the end of the series as
> the other patches are relatively harmless. They could even be merged in
> isolation as a cleanup.
>
> Second, using the other patches as your baseline, include in the
> changelog what you tested that showed a benefit, what type of machine
> it was and in particular include the number of cores, nodes and the
> span of the LLC. If you measured any regressions, include that as well
> and make a call on whether you think the patch wins more than it loses.
> The reason to include that information is because the worst corner case
> (all CPUs scanned uselessly) costs more the larger the span of LLC is.
> If all the testing was done on a 2-core SMT-2 machine, the overhead of the
> patch would be negligible but very different if the LLC span is 20 cores.
> While the patch is not obviously wrong, it definitely needs better data,
> Even if you do not have a large test machine available, it's still helpful
> to have it in the changelog because a reviewer like me can decide "this
> needs testing on a larger machine".

Thanks for your detailed suggestions. I will attach benchmark results
along with some analysis next time when posting performance related
patches.

>
> I did queue this up the entire series for testing and while it sometimes
> benefitted, there were large counter-examples. tbench4 on Zen3 showed some
> large regressions (e.g. 41% loss on 64 clients with a massive increase in
> variability) which has multiple small L3 caches per node. tbench4 (slightly
> modified in mmtests to produce a usable metric) in general showed issues
> across multiple x86-64 machines both AMD and Intel, multiple generations
> with a noticable increase in system CPU usage when the client counts reach
> the stage where the machine is close to saturated. perfpipe for some
> weird reason showed a large regression apparent regresion on Broadwell
> but the variability was also crazy so probably can be ignored. hackbench
> overall looked ok so it's possible I'm wrong about the idle_cores hint
> being often wrong on that workload and I should double check that. It's
> possible the hint is wrong some of the times but right often enough to
> either benefit from using an idle core or by finding an idle sibling which
> may be preferable to stacking tasks on the same CPU.

I attached my test results in one of my replies [1]: netperf showed a ~3.5%
regression, hackbench improved a lot, and tbench4 was roughly neutral. I
tested several times and the results didn't seem to vary much.

>
> The lack of data and the lack of a note on the potential downside is the
> main reason why I'm not acking patch. tbench4 was a particular concern on
> my own tests and it's possible a better patch would be a hybrid approach
> where a limited search of two cores (excluding target) is allowed even if
> SIS_UTIL indicates overload but has_idle_cores is true.
>

Agreed. And the reason I will exclude this part from v2 is that I plan to
make it part of another feature, the SIS filter [2]. The latest version of
the SIS filter (not posted yet, but soon) will contain all the idle cpus,
so we don't need a full scan when has_idle_core; scanning the filter is
enough. It may still cost too much if there are too many false positive
bits in the filter, though. Does this direction make sense to you?
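
Roughly, the idea is a per-LLC mask of (approximately) idle cpus that is
refreshed around tick/load-balance time, and select_idle_cpu() scans that
mask instead of the full LLC span. A very rough sketch, where the filter
field and helper are placeholders rather than the actual (unposted) code:

	/* placeholder: per-LLC idle filter refreshed at tick granularity */
	struct cpumask *filter = sds_idle_mask(sd_share);

	cpumask_and(cpus, filter, p->cpus_ptr);
	for_each_cpu_wrap(cpu, cpus, target + 1) {
		/* same core/cpu checks as today, but over fewer cpus */
	}

Whether this wins depends on how stale the filter is allowed to get
versus how often it has to be refreshed.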

[1]
https://lore.kernel.org/lkml/[email protected]/
[2]
https://lore.kernel.org/lkml/[email protected]/

Thanks & Best Regards,
Abel

2022-08-29 15:14:51

by Mel Gorman

Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On Mon, Aug 29, 2022 at 10:11:39PM +0800, Abel Wu wrote:
> Hi Mel, thanks for reviewing!
>

No problem, sorry it took so long.

> > So, first suggestion is to move this patch to the end of the series as
> > the other patches are relatively harmless. They could even be merged in
> > isolation as a cleanup.
> >
> > Second, using the other patches as your baseline, include in the
> > changelog what you tested that showed a benefit, what type of machine
> > it was and in particular include the number of cores, nodes and the
> > span of the LLC. If you measured any regressions, include that as well
> > and make a call on whether you think the patch wins more than it loses.
> > The reason to include that information is because the worst corner case
> > (all CPUs scanned uselessly) costs more the larger the span of LLC is.
> > If all the testing was done on a 2-core SMT-2 machine, the overhead of the
> > patch would be negligible but very different if the LLC span is 20 cores.
> > While the patch is not obviously wrong, it definitely needs better data,
> > Even if you do not have a large test machine available, it's still helpful
> > to have it in the changelog because a reviewer like me can decide "this
> > needs testing on a larger machine".
>
> Thanks for your detailed suggestions. I will attach benchmark results
> along with some analysis next time when posting performance related
> patches.
>

Thanks, include this in the changelog. While I had different figures for
hackbench, the figures are still fine. I had similar figures for netperf
(~3-4% regression on some machines but not universal). The tbench figures
are interesting because for you it was mostly neutral, but I did test it
with a CascadeLake machine and had worse results, and that machine is
smaller in terms of core count than yours.

> >
> > I did queue this up the entire series for testing and while it sometimes
> > benefitted, there were large counter-examples. tbench4 on Zen3 showed some
> > large regressions (e.g. 41% loss on 64 clients with a massive increase in
> > variability) which has multiple small L3 caches per node. tbench4 (slightly
> > modified in mmtests to produce a usable metric) in general showed issues
> > across multiple x86-64 machines both AMD and Intel, multiple generations
> > with a noticable increase in system CPU usage when the client counts reach
> > the stage where the machine is close to saturated. perfpipe for some
> > weird reason showed a large regression apparent regresion on Broadwell
> > but the variability was also crazy so probably can be ignored. hackbench
> > overall looked ok so it's possible I'm wrong about the idle_cores hint
> > being often wrong on that workload and I should double check that. It's
> > possible the hint is wrong some of the times but right often enough to
> > either benefit from using an idle core or by finding an idle sibling which
> > may be preferable to stacking tasks on the same CPU.
>
> I attached my test result in one of my replies[1]: netperf showed ~3.5%
> regression, hackbench improved a lot, and tbench4 drew. I tested several
> times and the results didn't seem vary much.
>
> >
> > The lack of data and the lack of a note on the potential downside is the
> > main reason why I'm not acking patch. tbench4 was a particular concern on
> > my own tests and it's possible a better patch would be a hybrid approach
> > where a limited search of two cores (excluding target) is allowed even if
> > SIS_UTIL indicates overload but has_idle_cores is true.
> >
>
> Agreed. And the reason I will exclude this part in v2 is that I plan to
> make it part of another feature, SIS filter [2]. The latest version of
> SIS filter (not posted yet but soon) will contain all the idle cpus so
> we don't need a full scan when has_idle_core. Scanning the filter then
> is enough. While it may still cost too much if too many false positive
> bits in the filter. Does this direction make sense to you?
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2]
> https://lore.kernel.org/lkml/[email protected]/
>

The filter idea is tricky; it will always struggle with "accurate
information" vs "cost of cpumask update", and what you link indicates that
it's updated on the tick boundary. That will be relatively cheap but could
mean that searches near the point of saturation or overload will have
false positives and negatives, which you are already aware of given the
older series. It does not kill the idea, but I would strongly suggest that
you do the simple thing first. That would yield potentially 3-4 patches
at the end of the series:

1. Full scan for cores (this patch effectively)
2. Limited scan of 2 cores when SIS_UTIL cuts off but has_idle_cores is true
(Compare 1 vs 2 in the changelog, not expected to be a universal win but
should have better "worst case" behaviour to be worth merging)
3. Filtered scan tracking idle CPUs as a hint while removing the
"limited scan"
(Compare 2 vs 3)
4. The SIS "de-entropy" patch that tries to cope with false positives
and negatives
(Compare 3 vs 4)

That way if the later patches do not work as expected then they can be
easily dropped without affecting the rest of the series.

--
Mel Gorman
SUSE Labs

2022-09-01 13:18:13

by Abel Wu

[permalink] [raw]
Subject: Re: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On 8/29/22 10:56 PM, Mel Gorman Wrote:
> On Mon, Aug 29, 2022 at 10:11:39PM +0800, Abel Wu wrote:
>> Hi Mel, thanks for reviewing!
>>
>
> No problem, sorry it took so long.
>
>>> So, first suggestion is to move this patch to the end of the series as
>>> the other patches are relatively harmless. They could even be merged in
>>> isolation as a cleanup.
>>>
>>> Second, using the other patches as your baseline, include in the
>>> changelog what you tested that showed a benefit, what type of machine
>>> it was and in particular include the number of cores, nodes and the
>>> span of the LLC. If you measured any regressions, include that as well
>>> and make a call on whether you think the patch wins more than it loses.
>>> The reason to include that information is because the worst corner case
>>> (all CPUs scanned uselessly) costs more the larger the span of LLC is.
>>> If all the testing was done on a 2-core SMT-2 machine, the overhead of the
>>> patch would be negligible but very different if the LLC span is 20 cores.
>>> While the patch is not obviously wrong, it definitely needs better data,
>>> Even if you do not have a large test machine available, it's still helpful
>>> to have it in the changelog because a reviewer like me can decide "this
>>> needs testing on a larger machine".
>>
>> Thanks for your detailed suggestions. I will attach benchmark results
>> along with some analysis next time when posting performance related
>> patches.
>>
>
> Thanks, include this in the changelog. While I had different figures for
> hackbench, the figures are still fine. I had similar figures for netperf
> (~3-4% regression on some machines but not universal). The tbench figures
> are interesting because for you, it was mostly neutral but I did test
> it with a CascadeLake machine and had worse results and that machine is
> smaller in terms of core counts than yours.

Interesting, my test machine model is Intel(R) Xeon(R) Platinum 8260 CPU
@ 2.40GHz, which is based on Cascade Lake. This time I re-tested with
SNC enabled, so each NUMA node now has 12 cores (it was 24), but the tbench
result is still neutral.

I just remembered that the Conclusion part of the SIS filter v2 patchset
mentioned a suspicious 50%~90% improvement in the netperf & tbench tests.
It seems the results can be misleading under certain conditions.

https://lore.kernel.org/lkml/[email protected]/

>
>>>
>>> I did queue this up the entire series for testing and while it sometimes
>>> benefitted, there were large counter-examples. tbench4 on Zen3 showed some
>>> large regressions (e.g. 41% loss on 64 clients with a massive increase in
>>> variability) which has multiple small L3 caches per node. tbench4 (slightly
>>> modified in mmtests to produce a usable metric) in general showed issues
>>> across multiple x86-64 machines both AMD and Intel, multiple generations
>>> with a noticable increase in system CPU usage when the client counts reach
>>> the stage where the machine is close to saturated. perfpipe for some
>>> weird reason showed a large regression apparent regresion on Broadwell
>>> but the variability was also crazy so probably can be ignored. hackbench
>>> overall looked ok so it's possible I'm wrong about the idle_cores hint
>>> being often wrong on that workload and I should double check that. It's
>>> possible the hint is wrong some of the times but right often enough to
>>> either benefit from using an idle core or by finding an idle sibling which
>>> may be preferable to stacking tasks on the same CPU.
>>
>> I attached my test result in one of my replies[1]: netperf showed ~3.5%
>> regression, hackbench improved a lot, and tbench4 drew. I tested several
>> times and the results didn't seem vary much.
>>
>>>
>>> The lack of data and the lack of a note on the potential downside is the
>>> main reason why I'm not acking patch. tbench4 was a particular concern on
>>> my own tests and it's possible a better patch would be a hybrid approach
>>> where a limited search of two cores (excluding target) is allowed even if
>>> SIS_UTIL indicates overload but has_idle_cores is true.
>>>
>>
>> Agreed. And the reason I will exclude this part in v2 is that I plan to
>> make it part of another feature, SIS filter [2]. The latest version of
>> SIS filter (not posted yet but soon) will contain all the idle cpus so
>> we don't need a full scan when has_idle_core. Scanning the filter then
>> is enough. While it may still cost too much if too many false positive
>> bits in the filter. Does this direction make sense to you?
>>
>> [1] https://lore.kernel.org/lkml/[email protected]/
>> [2]
>> https://lore.kernel.org/lkml/[email protected]/
>>
>
> The filter idea is tricky, it will always struggle with "accurate
> information" vs "cost of cpumask update" and what you link indicates that
> it's updated on the tick boundary. That will be relatively cheap but could
> mean that searches near the point of saturation or overload will have
> false positives and negatives which you are already aware of given the

I didn't exclude the CPU_NEWLY_IDLE case, yet the filter still behaved
unexpectedly, as I found recently. The propagation of the idle or
sched-idle cpus can be skipped if rq->avg_idle is small enough (smaller
than sysctl_sched_migration_cost, 500us by default). This can cause
benchmarks with frequent wakeups to lose some throughput.

So the filter in my latest version will be updated in __update_idle_core()
rather than in load_balance(), and reset when a full domain scan fails. I
will include more details in the changelog when posting, and I hope you can
give some advice then.
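
As a rough sketch of that direction (names and details are placeholders;
the updated series is not shown here), the update could hook into the
existing idle-core bookkeeping rather than the load balancer:

	/*
	 * Sketch only: sds_idle_cpus() is a hypothetical helper returning
	 * the per-LLC filter cpumask. Called from __update_idle_core()
	 * (i.e. when a CPU goes idle), so the filter does not depend on
	 * newidle balancing running, which rq->avg_idle can skip.
	 * Caller holds rcu_read_lock(), as __update_idle_core() does.
	 */
	static inline void sis_filter_mark_idle(int cpu)
	{
		struct sched_domain_shared *sds;

		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
		if (sds)
			cpumask_set_cpu(cpu, sds_idle_cpus(sds));
	}

	/* Reset when a full domain scan in select_idle_cpu() comes up empty */
	static inline void sis_filter_reset(struct sched_domain_shared *sds)
	{
		cpumask_clear(sds_idle_cpus(sds));
	}
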

> older series. It does not kill the idea but I would strongly suggest that
> you do the simple thing first. That would yield potentially 3-4 patches
> at the end of the series
>
> 1. Full scan for cores (this patch effectively)
> 2. Limited scan of 2 cores when SIS_UTIL cuts off but has_idle_cores is true
> (Compare 1 vs 2 in the changelog, not expected to be a universal win but
> should have better "worst case" behaviour to be worth merging)
> 3. Filtered scan tracking idle CPUs as a hint while removing the
> "limited scan"
> (Compare 2 vs 3)
> 4. The SIS "de-entrophy" patch that tries to cope with false positives
> and negatives
> (Compare 3 vs 4)
>
> That way if the later patches do not work as expected then they can be
> easily dropped without affecting the rest of the series.
>

Copy. Thanks for your advice!

Best Regards,
Abel

2022-09-02 05:06:29

by Abel Wu

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On 8/29/22 10:56 PM, Mel Gorman Wrote:
> On Mon, Aug 29, 2022 at 10:11:39PM +0800, Abel Wu wrote:
>> Hi Mel, thanks for reviewing!
>>
>
> No problem, sorry it took so long.
>
>>> So, first suggestion is to move this patch to the end of the series as
>>> the other patches are relatively harmless. They could even be merged in
>>> isolation as a cleanup.
>>>
>>> Second, using the other patches as your baseline, include in the
>>> changelog what you tested that showed a benefit, what type of machine
>>> it was and in particular include the number of cores, nodes and the
>>> span of the LLC. If you measured any regressions, include that as well
>>> and make a call on whether you think the patch wins more than it loses.
>>> The reason to include that information is because the worst corner case
>>> (all CPUs scanned uselessly) costs more the larger the span of LLC is.
>>> If all the testing was done on a 2-core SMT-2 machine, the overhead of the
>>> patch would be negligible but very different if the LLC span is 20 cores.
>>> While the patch is not obviously wrong, it definitely needs better data,
>>> Even if you do not have a large test machine available, it's still helpful
>>> to have it in the changelog because a reviewer like me can decide "this
>>> needs testing on a larger machine".
>>
>> Thanks for your detailed suggestions. I will attach benchmark results
>> along with some analysis next time when posting performance related
>> patches.
>>
>
> Thanks, include this in the changelog. While I had different figures for
> hackbench, the figures are still fine. I had similar figures for netperf
> (~3-4% regression on some machines but not universal). The tbench figures
> are interesting because for you, it was mostly neutral but I did test
> it with a CascadeLake machine and had worse results and that machine is
> smaller in terms of core counts than yours.
>
>>>
>>> I did queue this up the entire series for testing and while it sometimes
>>> benefitted, there were large counter-examples. tbench4 on Zen3 showed some
>>> large regressions (e.g. 41% loss on 64 clients with a massive increase in
>>> variability) which has multiple small L3 caches per node. tbench4 (slightly
>>> modified in mmtests to produce a usable metric) in general showed issues
>>> across multiple x86-64 machines both AMD and Intel, multiple generations
>>> with a noticable increase in system CPU usage when the client counts reach
>>> the stage where the machine is close to saturated. perfpipe for some
>>> weird reason showed a large regression apparent regresion on Broadwell
>>> but the variability was also crazy so probably can be ignored. hackbench
>>> overall looked ok so it's possible I'm wrong about the idle_cores hint
>>> being often wrong on that workload and I should double check that. It's
>>> possible the hint is wrong some of the times but right often enough to
>>> either benefit from using an idle core or by finding an idle sibling which
>>> may be preferable to stacking tasks on the same CPU.
>>
>> I attached my test result in one of my replies[1]: netperf showed ~3.5%
>> regression, hackbench improved a lot, and tbench4 drew. I tested several
>> times and the results didn't seem vary much.
>>
>>>
>>> The lack of data and the lack of a note on the potential downside is the
>>> main reason why I'm not acking patch. tbench4 was a particular concern on
>>> my own tests and it's possible a better patch would be a hybrid approach
>>> where a limited search of two cores (excluding target) is allowed even if
>>> SIS_UTIL indicates overload but has_idle_cores is true.
>>>
>>
>> Agreed. And the reason I will exclude this part in v2 is that I plan to
>> make it part of another feature, SIS filter [2]. The latest version of
>> SIS filter (not posted yet but soon) will contain all the idle cpus so
>> we don't need a full scan when has_idle_core. Scanning the filter then
>> is enough. While it may still cost too much if too many false positive
>> bits in the filter. Does this direction make sense to you?
>>
>> [1] https://lore.kernel.org/lkml/[email protected]/
>> [2]
>> https://lore.kernel.org/lkml/[email protected]/
>>
>
> The filter idea is tricky, it will always struggle with "accurate
> information" vs "cost of cpumask update" and what you link indicates that
> it's updated on the tick boundary. That will be relatively cheap but could
> mean that searches near the point of saturation or overload will have
> false positives and negatives which you are already aware of given the
> older series. It does not kill the idea but I would strongly suggest that
> you do the simple thing first. That would yield potentially 3-4 patches
> at the end of the series
>
> 1. Full scan for cores (this patch effectively)
> 2. Limited scan of 2 cores when SIS_UTIL cuts off but has_idle_cores is true
> (Compare 1 vs 2 in the changelog, not expected to be a universal win but
> should have better "worst case" behaviour to be worth merging)

I am afraid I had a hard time with this part, and it would be nice if you
could shed some light on it.

Currently the mainline behavior when has_idle_core:

  nr_idle_scan    0     1     2    ...
  nr              0     max   max  ...

while this patch makes:

  nr_idle_scan    0     1     2    ...
  nr              max   max   max  ...

and if limit scan:

  nr_idle_scan    0     1     2    ...
  nr              2     2     max  ...

Either way, the scan depth in the has_idle_core case is not well aligned
with the load. And honestly I'm not sure whether the depth should be
aligned with the load or not, since a lower depth can easily render
sds->has_idle_core useless. Thoughts?

Thanks,
Abel

> 3. Filtered scan tracking idle CPUs as a hint while removing the
> "limited scan"
> (Compare 2 vs 3)
> 4. The SIS "de-entrophy" patch that tries to cope with false positives
> and negatives
> (Compare 3 vs 4)
>
> That way if the later patches do not work as expected then they can be
> easily dropped without affecting the rest of the series.
>

2022-09-02 11:12:04

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On Fri, Sep 02, 2022 at 12:12:31PM +0800, Abel Wu wrote:
> On 8/29/22 10:56 PM, Mel Gorman Wrote:
> > On Mon, Aug 29, 2022 at 10:11:39PM +0800, Abel Wu wrote:
> > > Hi Mel, thanks for reviewing!
> > >
> >
> > No problem, sorry it took so long.
> >
> > > > So, first suggestion is to move this patch to the end of the series as
> > > > the other patches are relatively harmless. They could even be merged in
> > > > isolation as a cleanup.
> > > >
> > > > Second, using the other patches as your baseline, include in the
> > > > changelog what you tested that showed a benefit, what type of machine
> > > > it was and in particular include the number of cores, nodes and the
> > > > span of the LLC. If you measured any regressions, include that as well
> > > > and make a call on whether you think the patch wins more than it loses.
> > > > The reason to include that information is because the worst corner case
> > > > (all CPUs scanned uselessly) costs more the larger the span of LLC is.
> > > > If all the testing was done on a 2-core SMT-2 machine, the overhead of the
> > > > patch would be negligible but very different if the LLC span is 20 cores.
> > > > While the patch is not obviously wrong, it definitely needs better data,
> > > > Even if you do not have a large test machine available, it's still helpful
> > > > to have it in the changelog because a reviewer like me can decide "this
> > > > needs testing on a larger machine".
> > >
> > > Thanks for your detailed suggestions. I will attach benchmark results
> > > along with some analysis next time when posting performance related
> > > patches.
> > >
> >
> > Thanks, include this in the changelog. While I had different figures for
> > hackbench, the figures are still fine. I had similar figures for netperf
> > (~3-4% regression on some machines but not universal). The tbench figures
> > are interesting because for you, it was mostly neutral but I did test
> > it with a CascadeLake machine and had worse results and that machine is
> > smaller in terms of core counts than yours.
> >
> > > >
> > > > I did queue this up the entire series for testing and while it sometimes
> > > > benefitted, there were large counter-examples. tbench4 on Zen3 showed some
> > > > large regressions (e.g. 41% loss on 64 clients with a massive increase in
> > > > variability) which has multiple small L3 caches per node. tbench4 (slightly
> > > > modified in mmtests to produce a usable metric) in general showed issues
> > > > across multiple x86-64 machines both AMD and Intel, multiple generations
> > > > with a noticable increase in system CPU usage when the client counts reach
> > > > the stage where the machine is close to saturated. perfpipe for some
> > > > weird reason showed a large regression apparent regresion on Broadwell
> > > > but the variability was also crazy so probably can be ignored. hackbench
> > > > overall looked ok so it's possible I'm wrong about the idle_cores hint
> > > > being often wrong on that workload and I should double check that. It's
> > > > possible the hint is wrong some of the times but right often enough to
> > > > either benefit from using an idle core or by finding an idle sibling which
> > > > may be preferable to stacking tasks on the same CPU.
> > >
> > > I attached my test result in one of my replies[1]: netperf showed ~3.5%
> > > regression, hackbench improved a lot, and tbench4 drew. I tested several
> > > times and the results didn't seem vary much.
> > >
> > > >
> > > > The lack of data and the lack of a note on the potential downside is the
> > > > main reason why I'm not acking patch. tbench4 was a particular concern on
> > > > my own tests and it's possible a better patch would be a hybrid approach
> > > > where a limited search of two cores (excluding target) is allowed even if
> > > > SIS_UTIL indicates overload but has_idle_cores is true.
> > > >
> > >
> > > Agreed. And the reason I will exclude this part in v2 is that I plan to
> > > make it part of another feature, SIS filter [2]. The latest version of
> > > SIS filter (not posted yet but soon) will contain all the idle cpus so
> > > we don't need a full scan when has_idle_core. Scanning the filter then
> > > is enough. While it may still cost too much if too many false positive
> > > bits in the filter. Does this direction make sense to you?
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > [2]
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> >
> > The filter idea is tricky, it will always struggle with "accurate
> > information" vs "cost of cpumask update" and what you link indicates that
> > it's updated on the tick boundary. That will be relatively cheap but could
> > mean that searches near the point of saturation or overload will have
> > false positives and negatives which you are already aware of given the
> > older series. It does not kill the idea but I would strongly suggest that
> > you do the simple thing first. That would yield potentially 3-4 patches
> > at the end of the series
> >
> > 1. Full scan for cores (this patch effectively)
> > 2. Limited scan of 2 cores when SIS_UTIL cuts off but has_idle_cores is true
> > (Compare 1 vs 2 in the changelog, not expected to be a universal win but
> > should have better "worst case" behaviour to be worth merging)
>
> I am afraid I had a hard time on this part, and it would be nice if you
> can shed some light on this..
>
> Currently the mainline behavior when has_idle_core:
>
> nr_idle_scan 0 1 2 ...
> nr 0 max max ...
>
> while this patch makes:
>
> nr_idle_scan 0 1 2 ...
> nr max max max ...
>
> and if limit scan:
>
> nr_idle_scan 0 1 2 ...
> nr 2 2 max ...
>
> anyway, the scan depth for has_idle_core case is not well aligned to
> the load. And honestly I'm not sure whether the depth should align to
> the load or not, since lower depth can easily put sds->has_idle_core
> in vain. Thoughts?
>

For the simple case, I was expecting the static depth to *not* scale with
load, because it's unclear what the scaling should be or whether it would
have a benefit. If you investigate scaling the scan depth with load, it
would still make sense to compare it to a static depth. The depth of 2
cores was chosen to partially match the old SIS_PROP behaviour of the
minimum depth to scan.

	if (span_avg > 4*avg_cost)
		nr = div_u64(span_avg, avg_cost);
	else
		nr = 4;

nr is not proportional to cores, although it could be:
https://lore.kernel.org/all/[email protected]/

This is not tested or properly checked for correctness, but for
illustrative purposes something like this should conduct a limited scan
when overloaded. It has the side-effect that the has_idle_cores hint gets
cleared after a partial scan for idle cores, but the hint is probably wrong
anyway.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6089251a4720..59b27a2ef465 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6427,21 +6427,36 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		if (sd_share) {
 			/* because !--nr is the condition to stop scan */
 			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
-			/* overloaded LLC is unlikely to have idle cpu/core */
-			if (nr == 1)
-				return -1;
+
+			/*
+			 * Non-overloaded case: Scan full domain if there is
+			 *	an idle core. Otherwise, scan for an idle
+			 *	CPU based on nr_idle_scan
+			 * Overloaded case: Unlikely to have an idle CPU but
+			 *	conduct a limited scan if there is potentially
+			 *	an idle core.
+			 */
+			if (nr > 1) {
+				if (has_idle_core)
+					nr = sd->span_weight;
+			} else {
+				if (!has_idle_core)
+					return -1;
+				nr = 2;
+			}
 		}
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
+		if (!--nr)
+			break;
+
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
 
 		} else {
-			if (!--nr)
-				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;

2022-09-05 15:51:37

by Abel Wu

[permalink] [raw]
Subject: Re: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On 9/2/22 6:25 PM, Mel Gorman Wrote:
> For the simple case, I was expecting the static depth to *not* match load
> because it's unclear what the scaling should be for load or if it had a
> benefit. If investigating scaling the scan depth to load, it would still
> make sense to compare it to a static depth. The depth of 2 cores was to
> partially match the old SIS_PROP behaviour of the minimum depth to scan.
>
> if (span_avg > 4*avg_cost)
> nr = div_u64(span_avg, avg_cost);
> else
> nr = 4;
>
> nr is not proportional to cores although it could be
> https://lore.kernel.org/all/[email protected]/
>
> This is not tested or properly checked for correctness but for
> illustrative purposes something like this should conduct a limited scan when
> overloaded. It has a side-effect that the has_idle_cores hint gets cleared
> for a partial scan for idle cores but the hint is probably wrong anyway.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6089251a4720..59b27a2ef465 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6427,21 +6427,36 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> if (sd_share) {
> /* because !--nr is the condition to stop scan */
> nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> - /* overloaded LLC is unlikely to have idle cpu/core */
> - if (nr == 1)
> - return -1;
> +
> + /*
> + * Non-overloaded case: Scan full domain if there is
> + * an idle core. Otherwise, scan for an idle
> + * CPU based on nr_idle_scan
> + * Overloaded case: Unlikely to have an idle CPU but
> + * conduct a limited scan if there is potentially
> + * an idle core.
> + */
> + if (nr > 1) {
> + if (has_idle_core)
> + nr = sd->span_weight;
> + } else {
> + if (!has_idle_core)
> + return -1;
> + nr = 2;
> + }
> }
> }
>
> for_each_cpu_wrap(cpu, cpus, target + 1) {
> + if (!--nr)
> + break;
> +
> if (has_idle_core) {
> i = select_idle_core(p, cpu, cpus, &idle_cpu);
> if ((unsigned int)i < nr_cpumask_bits)
> return i;
>
> } else {
> - if (!--nr)
> - return -1;
> idle_cpu = __select_idle_cpu(cpu, p);
> if ((unsigned int)idle_cpu < nr_cpumask_bits)
> break;

I spent the last few days testing this, with 3 variations (assuming
has_idle_core):

a) full or limited (2 cores) scan when !nr_idle_scan
b) whether to clear sds->has_idle_core when a partial scan fails
c) whether to scale the scan depth with load

Some observations:

1) It seems always bad not to clear sds->has_idle_core when a
partial scan fails: a lot gets partially scanned and still no
idle core is found. (The following results are based on clearing
has_idle_core even after partial scans.)

2) An unconditional full scan when has_idle_core is not good for
netperf_{udp,tcp} and tbench4. This is probably because the SIS
success rate of these workloads is already high enough (netperf
~= 100%, tbench4 ~= 50%, compared to hackbench ~= 3.5%), which
negates a lot of the benefit the full scan brings.

3) Scaling the scan depth with load seems good for the hackbench
socket tests, and neutral in the pipe tests. I think this is just
the case you mentioned before: under fast wake-up workloads the
has_idle_core hint becomes less reliable, so a full scan won't
always win.

Best Regards,
Abel

2022-09-06 10:37:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On Mon, Sep 05, 2022 at 10:40:00PM +0800, Abel Wu wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6089251a4720..59b27a2ef465 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6427,21 +6427,36 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > if (sd_share) {
> > /* because !--nr is the condition to stop scan */
> > nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> > - /* overloaded LLC is unlikely to have idle cpu/core */
> > - if (nr == 1)
> > - return -1;
> > +
> > + /*
> > + * Non-overloaded case: Scan full domain if there is
> > + * an idle core. Otherwise, scan for an idle
> > + * CPU based on nr_idle_scan
> > + * Overloaded case: Unlikely to have an idle CPU but
> > + * conduct a limited scan if there is potentially
> > + * an idle core.
> > + */
> > + if (nr > 1) {
> > + if (has_idle_core)
> > + nr = sd->span_weight;
> > + } else {
> > + if (!has_idle_core)
> > + return -1;
> > + nr = 2;
> > + }
> > }
> > }
> > for_each_cpu_wrap(cpu, cpus, target + 1) {
> > + if (!--nr)
> > + break;
> > +
> > if (has_idle_core) {
> > i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > if ((unsigned int)i < nr_cpumask_bits)
> > return i;
> > } else {
> > - if (!--nr)
> > - return -1;
> > idle_cpu = __select_idle_cpu(cpu, p);
> > if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > break;
>
> I spent last few days testing this, with 3 variations (assume
> has_idle_core):
>
> a) full or limited (2cores) scan when !nr_idle_scan
> b) whether clear sds->has_idle_core when partial scan failed
> c) scale scan depth with load or not
>
> some observations:
>
> 1) It seems always bad if not clear sds->has_idle_core when
> partial scan fails. It is due to over partially scanned
> but still can not find an idle core. (Following ones are
> based on clearing has_idle_core even in partial scans.)
>

Ok, that's rational. There will be corner cases where there is no idle
CPU near the target but an idle core far away, but caring about that
corner case would run counter to the purpose of SIS_UTIL.

> 2) Unconditionally full scan when has_idle_core is not good
> for netperf_{udp,tcp} and tbench4. It is probably because
> the SIS success rate of these workloads is already high
> enough (netperf ~= 100%, tbench4 ~= 50%, compared to that
> hackbench ~= 3.5%) which negate a lot of the benefit full
> scan brings.
>

That's also rational. For a single client/server on netperf, it's expected
that the SIS success rate is high and scanning is minimal. As the client
and server are sharing data on localhost and somewhat synchronous, it may
even partially benefit from SMT sharing.

So the basic approach would be "always clear sds->has_idle_core" + "limit
the scan even when has_idle_core is true", right?

If so, stick a changelog on it and resend!
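
For what it's worth, the "always clear" half should need no new code. A
sketch of the current select_idle_cpu() tail (from the mainline source,
so treat details as approximate) shows that leaving the scan loop without
finding an idle core, including an early break when nr runs out as in the
limited-scan patch quoted above, falls through to the clearing:

	/*
	 * Tail of select_idle_cpu(): reached whenever the loop above ends
	 * without returning an idle core, whether the domain was exhausted
	 * or, with a depth limit, nr ran out.
	 */
	if (has_idle_core)
		set_idle_cores(target, false);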

> 3) Scaling scan depth with load seems good for the hackbench
> socket tests, and neutral in pipe tests. And I think this
> is just the case you mentioned before, under fast wake-up
> workloads the has_idle_core will become not that reliable,
> so a full scan won't always win.
>

My recommendation is to leave this out for now but, assuming the rest of
the patches get picked up, consider posting it for the next major kernel
version (i.e. separate the basic and clever approaches by one major kernel
version). By separating them, there is less chance of a false positive
bisection pointing to the wrong patch. Any regression will not be perfectly
reproducible, so the chances of a false positive bisection are relatively high.

--
Mel Gorman
SUSE Labs

2022-09-07 08:12:41

by Abel Wu

[permalink] [raw]
Subject: Re: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On 9/6/22 5:57 PM, Mel Gorman wrote:
> On Mon, Sep 05, 2022 at 10:40:00PM +0800, Abel Wu wrote:
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 6089251a4720..59b27a2ef465 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6427,21 +6427,36 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>> if (sd_share) {
>>> /* because !--nr is the condition to stop scan */
>>> nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
>>> - /* overloaded LLC is unlikely to have idle cpu/core */
>>> - if (nr == 1)
>>> - return -1;
>>> +
>>> + /*
>>> + * Non-overloaded case: Scan full domain if there is
>>> + * an idle core. Otherwise, scan for an idle
>>> + * CPU based on nr_idle_scan
>>> + * Overloaded case: Unlikely to have an idle CPU but
>>> + * conduct a limited scan if there is potentially
>>> + * an idle core.
>>> + */
>>> + if (nr > 1) {
>>> + if (has_idle_core)
>>> + nr = sd->span_weight;
>>> + } else {
>>> + if (!has_idle_core)
>>> + return -1;
>>> + nr = 2;
>>> + }
>>> }
>>> }
>>> for_each_cpu_wrap(cpu, cpus, target + 1) {
>>> + if (!--nr)
>>> + break;
>>> +
>>> if (has_idle_core) {
>>> i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>> if ((unsigned int)i < nr_cpumask_bits)
>>> return i;
>>> } else {
>>> - if (!--nr)
>>> - return -1;
>>> idle_cpu = __select_idle_cpu(cpu, p);
>>> if ((unsigned int)idle_cpu < nr_cpumask_bits)
>>> break;
>>
>> I spent last few days testing this, with 3 variations (assume
>> has_idle_core):
>>
>> a) full or limited (2cores) scan when !nr_idle_scan
>> b) whether clear sds->has_idle_core when partial scan failed
>> c) scale scan depth with load or not
>>
>> some observations:
>>
>> 1) It seems always bad if not clear sds->has_idle_core when
>> partial scan fails. It is due to over partially scanned
>> but still can not find an idle core. (Following ones are
>> based on clearing has_idle_core even in partial scans.)
>>
>
> Ok, that's rational. There will be corner cases where there was no idle
> CPU near the target when there is an idle core far away but it would be
> counter to the purpose of SIS_UTIL to care about that corner case.

Yes, and this corner case (which may actually become the normal case at
times) will continue to exist as long as scanning proceeds in a linear
fashion while the scan depth is limited (SIS_UTIL/SIS_PROP/...), especially
as LLCs keep getting larger nowadays.

>
>> 2) Unconditionally full scan when has_idle_core is not good
>> for netperf_{udp,tcp} and tbench4. It is probably because
>> the SIS success rate of these workloads is already high
>> enough (netperf ~= 100%, tbench4 ~= 50%, compared to that
>> hackbench ~= 3.5%) which negate a lot of the benefit full
>> scan brings.
>>
>
> That's also rational. For a single client/server on netperf, it's expected
> that the SIS success rate is high and scanning is minimal. As the client
> and server are sharing data on localhost and somewhat synchronous, it may
> even partially benefit from SMT sharing.
>
> So basic approach would be "always clear sds->has_idle_core" + "limit
> scan even when has_idle_core is true", right?

Yes, exactly the same as what you suggested in the first place.

>
> If so, stick a changelog on it and resend!

I will include this in the SIS_FILTER patchset.

>
>> 3) Scaling scan depth with load seems good for the hackbench
>> socket tests, and neutral in pipe tests. And I think this
>> is just the case you mentioned before, under fast wake-up
>> workloads the has_idle_core will become not that reliable,
>> so a full scan won't always win.
>>
>
> My recommendation is leave this out for now but assuming the rest of
> the patches get picked up, consider posting it for the next major kernel
> version (i.e. separate the basic and clever approaches by one major kernel
> version). By separating them, there is less chance of a false positive
> bisection pointing to the wrong patch. Any regression will not be perfectly
> reproducible so the changes of a false positive bisection is relatively high.

Makes sense. I will just include the basic part first.

Thanks,
Abel

2022-09-07 08:18:50

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

Hi Mel,
On 2022-09-06 at 10:57:17 +0100, Mel Gorman wrote:
> On Mon, Sep 05, 2022 at 10:40:00PM +0800, Abel Wu wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6089251a4720..59b27a2ef465 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6427,21 +6427,36 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > if (sd_share) {
> > > /* because !--nr is the condition to stop scan */
> > > nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> > > - /* overloaded LLC is unlikely to have idle cpu/core */
> > > - if (nr == 1)
> > > - return -1;
> > > +
> > > + /*
> > > + * Non-overloaded case: Scan full domain if there is
> > > + * an idle core. Otherwise, scan for an idle
> > > + * CPU based on nr_idle_scan
> > > + * Overloaded case: Unlikely to have an idle CPU but
> > > + * conduct a limited scan if there is potentially
> > > + * an idle core.
> > > + */
> > > + if (nr > 1) {
> > > + if (has_idle_core)
> > > + nr = sd->span_weight;
> > > + } else {
> > > + if (!has_idle_core)
> > > + return -1;
> > > + nr = 2;
> > > + }
> > > }
> > > }
> > > for_each_cpu_wrap(cpu, cpus, target + 1) {
> > > + if (!--nr)
> > > + break;
> > > +
> > > if (has_idle_core) {
> > > i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > if ((unsigned int)i < nr_cpumask_bits)
> > > return i;
> > > } else {
> > > - if (!--nr)
> > > - return -1;
> > > idle_cpu = __select_idle_cpu(cpu, p);
> > > if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > > break;
> >
> > I spent last few days testing this, with 3 variations (assume
> > has_idle_core):
> >
> > a) full or limited (2cores) scan when !nr_idle_scan
> > b) whether clear sds->has_idle_core when partial scan failed
> > c) scale scan depth with load or not
> >
> > some observations:
> >
> > 1) It seems always bad if not clear sds->has_idle_core when
> > partial scan fails. It is due to over partially scanned
> > but still can not find an idle core. (Following ones are
> > based on clearing has_idle_core even in partial scans.)
> >
>
> Ok, that's rational. There will be corner cases where there was no idle
> CPU near the target when there is an idle core far away but it would be
> counter to the purpose of SIS_UTIL to care about that corner case.
>
> > 2) Unconditionally full scan when has_idle_core is not good
> > for netperf_{udp,tcp} and tbench4. It is probably because
> > the SIS success rate of these workloads is already high
> > enough (netperf ~= 100%, tbench4 ~= 50%, compared to that
> > hackbench ~= 3.5%) which negate a lot of the benefit full
> > scan brings.
> >
>
> That's also rational. For a single client/server on netperf, it's expected
> that the SIS success rate is high and scanning is minimal. As the client
> and server are sharing data on localhost and somewhat synchronous, it may
> even partially benefit from SMT sharing.
>
Maybe off-topic: since we monitor the success rate and other metrics for
each optimization in the SIS path, is it possible to merge your statistics
patch [1] into upstream so we don't need to rebase it in the future
(although it targets kernel development)?

Link: https://lore.kernel.org/lkml/[email protected]/
[cut]


thanks,
Chenyu

2022-09-07 09:10:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

On Wed, Sep 07, 2022 at 03:27:41PM +0800, Chen Yu wrote:
> > Ok, that's rational. There will be corner cases where there was no idle
> > CPU near the target when there is an idle core far away but it would be
> > counter to the purpose of SIS_UTIL to care about that corner case.
> >
> > > 2) Unconditionally full scan when has_idle_core is not good
> > > for netperf_{udp,tcp} and tbench4. It is probably because
> > > the SIS success rate of these workloads is already high
> > > enough (netperf ~= 100%, tbench4 ~= 50%, compared to that
> > > hackbench ~= 3.5%) which negate a lot of the benefit full
> > > scan brings.
> > >
> >
> > That's also rational. For a single client/server on netperf, it's expected
> > that the SIS success rate is high and scanning is minimal. As the client
> > and server are sharing data on localhost and somewhat synchronous, it may
> > even partially benefit from SMT sharing.
> >
> Maybe off-topic, since we monitor the success rate and also other metrics
> for each optimization in SIS path, is it possible to merge your statistics
> patch [1] into upstream so we don't need to rebase in the future(although
> it is targeting kernel development)?
>

I am doubtful it is a merge candidate. While it's very useful when modifying
SIS and gathering data on whether SIS is behaving as expected, it has little
practical benefit when debugging problems on normal systems. Crude estimates
can be obtained by other methods. Probing when select_idle_sibling and
select_idle_cpu are called reveals the SIS fast and slow paths and the ratio
between them. Tracking the time spent in select_idle_cpu reveals how much
time is spent finding idle cores and cpus.

I would not object to someone trying, but the changelog would benefit from
explaining why it's practically useful. Every time I've used it, it was to
justify another patch being merged or to investigate various SIS corner cases.

--
Mel Gorman
SUSE Labs