2021-04-22 10:25:10

by Srikar Dronamraju

Subject: [PATCH 00/10] sched/fair: wake_affine improvements

Recently we found that some of the benchmark numbers on Power10 were lower
than expected. Some analysis showed that the problem lies in the fact that
the L2-cache on Power10 is at the core level, i.e. only 4 threads share the
L2-cache.

One probable solution to the problem was worked on by Gautham, who posted
a patch that marks the MC domain as the LLC:
http://lore.kernel.org/lkml/[email protected]/t/#u

Here the focus is on seeing if we can improve the current core scheduler's
wakeup mechanism by looking at idle-cores and nr_busy_cpus, which are already
maintained per last level cache (aka LLC) (first 8 patches), plus exploring the
possibility of providing a fallback LLC domain that can be preferred if the
current LLC is busy (last 2 patches).
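
For reference, the per-LLC state that these patches build on lives in struct
sched_domain_shared. With the full series applied it looks roughly like this
(idle_core and fallback_llc_id are added/extended by the series itself; see
patches 2 and 9):

	struct sched_domain_shared {
		atomic_t	ref;
		atomic_t	nr_busy_cpus;	 /* busy CPUs in this LLC */
		int		idle_core;	 /* an idle core in this LLC, or -1 */
		int		fallback_llc_id; /* CPU id in the paired fallback LLC, or -1 */
	};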

Except for the last 2 patches, the rest of the patches should work independently
of the other proposed solution, i.e. if the mc-llc patch is accepted, then the
last two patches may not be needed for Power10. However, they may still be
helpful for other archs/platforms.

In the fallback approach, we look for a one-to-one mapping for each LLC.
However this can easily be modified to look at all LLCs in the current
LLC's parent. Also, fallback is only used for sync wakeups, because that
is where we expect the maximum benefit of moving the task closer to the
waker. For non-sync wakeups, it is expected that a CPU from the previous
LLC may be better off.

Please review and provide your feedback.

Benchmark numbers are from Power10, but I have verified that we don't
regress on a Power9 setup.

# lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 8
Core(s) per socket: 10
Socket(s): 1
NUMA node(s): 1
Model: 1.0 (pvr 0080 0100)
Model name: POWER10 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8K
NUMA node2 CPU(s): 0-79

Hackbench: (latency, lower is better)

v5.12-rc5
instances = 1, min = 24.102529 usecs/op, median = usecs/op, max = 24.102529 usecs/op
instances = 2, min = 24.096112 usecs/op, median = 24.096112 usecs/op, max = 24.178903 usecs/op
instances = 4, min = 24.080541 usecs/op, median = 24.082990 usecs/op, max = 24.166873 usecs/op
instances = 8, min = 24.088969 usecs/op, median = 24.116081 usecs/op, max = 24.199853 usecs/op
instances = 16, min = 24.267228 usecs/op, median = 26.204510 usecs/op, max = 29.218360 usecs/op
instances = 32, min = 30.680071 usecs/op, median = 32.911664 usecs/op, max = 37.380470 usecs/op
instances = 64, min = 43.908331 usecs/op, median = 44.454343 usecs/op, max = 46.210298 usecs/op
instances = 80, min = 44.585754 usecs/op, median = 56.738546 usecs/op, max = 60.625826 usecs/op

v5.12-rc5 + mc-llc
instances = 1, min = 18.676505 usecs/op, median = usecs/op, max = 18.676505 usecs/op
instances = 2, min = 18.488627 usecs/op, median = 18.488627 usecs/op, max = 18.574946 usecs/op
instances = 4, min = 18.428399 usecs/op, median = 18.589051 usecs/op, max = 18.872548 usecs/op
instances = 8, min = 18.597389 usecs/op, median = 18.783815 usecs/op, max = 19.265532 usecs/op
instances = 16, min = 21.922350 usecs/op, median = 22.737792 usecs/op, max = 24.832429 usecs/op
instances = 32, min = 29.770446 usecs/op, median = 31.996687 usecs/op, max = 34.053042 usecs/op
instances = 64, min = 53.067842 usecs/op, median = 53.295139 usecs/op, max = 53.473059 usecs/op
instances = 80, min = 44.423288 usecs/op, median = 44.713767 usecs/op, max = 45.159761 usecs/op

v5.12-rc5 + this patchset
instances = 1, min = 19.368805 usecs/op, median = usecs/op, max = 19.368805 usecs/op
instances = 2, min = 19.423674 usecs/op, median = 19.423674 usecs/op, max = 19.506203 usecs/op
instances = 4, min = 19.454523 usecs/op, median = 19.596947 usecs/op, max = 19.863620 usecs/op
instances = 8, min = 20.005272 usecs/op, median = 20.239924 usecs/op, max = 20.878947 usecs/op
instances = 16, min = 21.856779 usecs/op, median = 24.102147 usecs/op, max = 25.496110 usecs/op
instances = 32, min = 31.460159 usecs/op, median = 32.809621 usecs/op, max = 33.939650 usecs/op
instances = 64, min = 39.506553 usecs/op, median = 43.835221 usecs/op, max = 45.645505 usecs/op
instances = 80, min = 43.805716 usecs/op, median = 44.314757 usecs/op, max = 48.910236 usecs/op

Summary:
mc-llc and this patchset seem to be performing much better than vanilla v5.12-rc5

DayTrader (throughput, higher is better)
                         v5.12-rc5   v5.12-rc5   v5.12-rc5
                                     + mc-llc    + patchset
64CPUs/1JVM/ 60Users      6373.7      7520.5      7232.3
64CPUs/1JVM/ 80Users      6742.1      7940.9      7732.8
64CPUs/1JVM/100Users      6482.2      7730.3      7540
64CPUs/2JVM/ 60Users      6335        8081.6      7914.3
64CPUs/2JVM/ 80Users      6360.8      8259.6      8138.6
64CPUs/2JVM/100Users      6215.6      8046.5      8039.2
64CPUs/4JVM/ 60Users      5385.4      7685.3      7706.1
64CPUs/4JVM/ 80Users      5380.8      7753.3      7721.5
64CPUs/4JVM/100Users      5275.2      7549.2      7608.3

Summary: Across all profiles, this patchset or mc-llc outperform
vanilla v5.12-rc5.
Note: only 64 CPUs were online during this test.

schbench (latency: lower is better)
======== Running schbench -m 3 -r 30 =================
Latency percentiles (usec) runtime 10 (s) (2545 total samples)
v5.12-rc5 | v5.12-rc5 + mc-llc | v5.12-rc5 + patchset

50.0th: 56 (1301 samples) | 50.0th: 49 (1309 samples) | 50.0th: 50 (1310 samples)
75.0th: 76 (623 samples) | 75.0th: 66 (628 samples) | 75.0th: 68 (632 samples)
90.0th: 93 (371 samples) | 90.0th: 78 (371 samples) | 90.0th: 80 (354 samples)
95.0th: 107 (123 samples) | 95.0th: 87 (117 samples) | 95.0th: 86 (126 samples)
*99.0th: 12560 (102 samples)| *99.0th: 100 (97 samples) | *99.0th: 103 (97 samples)
99.5th: 15312 (14 samples) | 99.5th: 104 (12 samples) | 99.5th: 1202 (13 samples)
99.9th: 19936 (9 samples) | 99.9th: 106 (8 samples) | 99.9th: 14992 (10 samples)
min=13, max=20684 | min=15, max=113 | min=15, max=18721

Latency percentiles (usec) runtime 20 (s) (7649 total samples)

50.0th: 51 (3884 samples) | 50.0th: 50 (3935 samples) | 50.0th: 49 (3841 samples)
75.0th: 69 (1859 samples) | 75.0th: 66 (1817 samples) | 75.0th: 67 (1965 samples)
90.0th: 87 (1173 samples) | 90.0th: 80 (1204 samples) | 90.0th: 78 (1134 samples)
95.0th: 97 (368 samples) | 95.0th: 87 (342 samples) | 95.0th: 83 (359 samples)
*99.0th: 8624 (290 samples)| *99.0th: 98 (294 samples) | *99.0th: 93 (296 samples)
99.5th: 11344 (37 samples) | 99.5th: 102 (37 samples) | 99.5th: 98 (34 samples)
99.9th: 18592 (31 samples) | 99.9th: 106 (30 samples) | 99.9th: 7544 (28 samples)
min=13, max=20684 | min=12, max=113 | min=13, max=18721

Latency percentiles (usec) runtime 30 (s) (12785 total samples)

50.0th: 50 (6614 samples) | 50.0th: 49 (6544 samples) | 50.0th: 48 (6527 samples)
75.0th: 67 (3059 samples) | 75.0th: 65 (3100 samples) | 75.0th: 64 (3143 samples)
90.0th: 84 (1894 samples) | 90.0th: 79 (1912 samples) | 90.0th: 76 (1985 samples)
95.0th: 94 (586 samples) | 95.0th: 87 (646 samples) | 95.0th: 81 (585 samples)
*99.0th: 8304 (507 samples)| *99.0th: 101 (496 samples) | *99.0th: 90 (453 samples)
99.5th: 11696 (62 samples) | 99.5th: 104 (45 samples) | 99.5th: 94 (66 samples)
99.9th: 18592 (51 samples) | 99.9th: 110 (51 samples) | 99.9th: 1202 (49 samples)
min=12, max=21421 | min=1, max=126 | min=3, max=18721

Summary:
mc-llc is the best option, but this patchset also helps compared to vanilla v5.12-rc5


mongodb (threads=6) (throughput, higher is better)
Throughput read clean update
latency latency latency
v5.12-rc5 JVM=YCSB_CLIENTS=14 68116.05 ops/sec 1109.82 us 944.19 us 1342.29 us
v5.12-rc5 JVM=YCSB_CLIENTS=21 64802.69 ops/sec 1772.64 us 944.69 us 2099.57 us
v5.12-rc5 JVM=YCSB_CLIENTS=28 61792.78 ops/sec 2490.48 us 930.09 us 2928.03 us
v5.12-rc5 JVM=YCSB_CLIENTS=35 59604.44 ops/sec 3236.86 us 870.28 us 3787.48 us

v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=14 70948.51 ops/sec 1060.21 us 842.02 us 1289.44 us
v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=21 68732.48 ops/sec 1669.91 us 871.57 us 1975.19 us
v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=28 66674.81 ops/sec 2313.79 us 889.59 us 2702.36 us
v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=35 64397.51 ops/sec 3010.66 us 966.28 us 3484.19 us

v5.12-rc5 + patchset JVM=YCSB_CLIENTS=14 67403.29 ops/sec 1121.80 us 797.81 us 1357.28 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=21 63952.79 ops/sec 1792.86 us 779.59 us 2130.54 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=28 62198.83 ops/sec 2469.60 us 780.00 us 2914.48 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=35 60333.81 ops/sec 3192.41 us 822.09 us 3748.24 us

Summary:
mc-llc outperforms; this patchset and upstream give similar performance.


Cc: LKML <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>

Srikar Dronamraju (10):
sched/fair: Update affine statistics when needed
sched/fair: Maintain the identity of idle-core
sched/fair: Update idle-core more often
sched/fair: Prefer idle CPU to cache affinity
sched/fair: Call wake_affine only if necessary
sched/idle: Move busy_cpu accounting to idle callback
sched/fair: Remove ifdefs in wake_affine_idler_llc
sched/fair: Dont iterate if no idle CPUs
sched/topology: Introduce fallback LLC
powerpc/smp: Add fallback flag to powerpc MC domain

arch/powerpc/kernel/smp.c | 7 +-
include/linux/sched/sd_flags.h | 7 +
include/linux/sched/topology.h | 3 +-
kernel/sched/fair.c | 229 +++++++++++++++++++++++++++------
kernel/sched/features.h | 1 +
kernel/sched/idle.c | 33 ++++-
kernel/sched/sched.h | 6 +
kernel/sched/topology.c | 54 +++++++-
8 files changed, 296 insertions(+), 44 deletions(-)

--
2.18.2


2021-04-22 10:25:17

by Srikar Dronamraju

Subject: [PATCH 05/10] sched/fair: Call wake_affine only if necessary

Currently, if a waker is the only running thread of the CPU, and it is waking
a wakee (both of them interacting over a pipe, aka sync wakeups) that was
previously running on a different LLC, then the wakee is pulled and queued
on the CPU that is running the waker.

This is true even if the previous CPU was completely idle. The rationale
is that the waker would soon relinquish the CPU and the wakee would benefit
from the cache access. However, if the previous CPU is idle, then the wakee
thread will incur the latency cost of migration to the waker CPU plus the
latency cost of the waker context switching out.

Before the waker switches out, the load balancer could also kick in and pull
the wakee out, resulting in another migration cost. This will mostly hurt
systems where an LLC has just one core, for example with hackbench. Both
threads of hackbench would then end up running on the same core, resulting
in CPUs running in higher SMT modes, more load balances and non-optimal
performance.

It would be beneficial to run the wakee thread on the same CPU as the waker
only if other CPUs are busy. To achieve this, move the part of the code that
checks if the waker is the only running thread to the early part of
wake_affine_weight().

Also, after this change, wake_affine_idle() will only look at wakeups within
the LLC. For wakeups within an LLC, there should be no difference between
sync and non-sync. For wakeups across LLC boundaries, let's use
wake_affine_idler_llc().

Cc: LKML <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 22 +++++++++++-----------
kernel/sched/features.h | 2 +-
2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 943621367a96..f8c98e544418 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5799,8 +5799,7 @@ static int wake_wide(struct task_struct *p)
* scheduling latency of the CPUs. This seems to work
* for the overloaded case.
*/
-static int
-wake_affine_idle(int this_cpu, int prev_cpu, int sync)
+static int wake_affine_idle(int this_cpu, int prev_cpu)
{
/*
* If this_cpu is idle, it implies the wakeup is from interrupt
@@ -5814,15 +5813,12 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
* a cpufreq perspective, it's better to have higher utilisation
* on one CPU.
*/
- if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
- return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+ if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
+ return prev_cpu;

- if (sync && cpu_rq(this_cpu)->nr_running == 1)
+ if (available_idle_cpu(this_cpu) || sched_idle_cpu(this_cpu))
return this_cpu;

- if (available_idle_cpu(prev_cpu))
- return prev_cpu;
-
return nr_cpumask_bits;
}

@@ -5838,6 +5834,9 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
if (sync) {
unsigned long current_load = task_h_load(current);

+ if (cpu_rq(this_cpu)->nr_running <= 1)
+ return this_cpu;
+
if (current_load > this_eff_load)
return this_cpu;

@@ -5933,12 +5932,13 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int prev_cpu, int sync)
{
+ bool share_caches = cpus_share_cache(prev_cpu, this_cpu);
int target = nr_cpumask_bits;

- if (sched_feat(WA_IDLE))
- target = wake_affine_idle(this_cpu, prev_cpu, sync);
+ if (sched_feat(WA_IDLE) && share_caches)
+ target = wake_affine_idle(this_cpu, prev_cpu);

- if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits)
+ else if (sched_feat(WA_IDLER_LLC) && !share_caches)
target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);

if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index c77349a47e01..c60a6d2b2126 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,7 +83,7 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)

SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
-SCHED_FEAT(WA_IDLER_LLC, false)
+SCHED_FEAT(WA_IDLER_LLC, true)
SCHED_FEAT(WA_BIAS, true)

/*
--
2.18.2

2021-04-22 10:25:28

by Srikar Dronamraju

Subject: [PATCH 06/10] sched/idle: Move busy_cpu accounting to idle callback

Currently we account nr_busy_cpus in the NO_HZ idle functions.
There is no reason why nr_busy_cpus should be updated in NO_HZ_COMMON
configs only. Also, the scheduler can mark a CPU as non-busy as soon as
the idle class task starts to run, and can then mark a CPU as busy as
soon as it is woken up from idle or a new task is placed on its
runqueue.
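
To make the new flow easier to follow, here is a simplified sketch of where the
accounting happens with this patch (the actual hunks are below); is_idle is the
new per-CPU flag added here to avoid double accounting with the NO_HZ path:

	/*
	 * pick_next_task_idle():  CPU is about to run the idle task
	 *                         -> nr_busy_cpus--, is_idle = 1
	 * put_prev_task_idle():   idle task is switched out, CPU becomes busy
	 *                         -> nr_busy_cpus++, is_idle = 0
	 * set_cpu_sd_state_busy() (NO_HZ exit) increments only if is_idle is
	 * still set, so a CPU is never accounted busy twice.
	 */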

Cc: LKML <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 6 ++++--
kernel/sched/idle.c | 29 +++++++++++++++++++++++++++--
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 2 ++
4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8c98e544418..00bcf1d861b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10384,7 +10384,10 @@ static void set_cpu_sd_state_busy(int cpu)
goto unlock;
sd->nohz_idle = 0;

- atomic_inc(&sd->shared->nr_busy_cpus);
+ if (sd && per_cpu(is_idle, cpu)) {
+ atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
+ per_cpu(is_idle, cpu) = 0;
+ }
unlock:
rcu_read_unlock();
}
@@ -10414,7 +10417,6 @@ static void set_cpu_sd_state_idle(int cpu)
goto unlock;
sd->nohz_idle = 1;

- atomic_dec(&sd->shared->nr_busy_cpus);
unlock:
rcu_read_unlock();
}
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index cc828f3efe71..e624a05e48bd 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -425,12 +425,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
-#ifdef CONFIG_SCHED_SMT
+#ifdef CONFIG_SMP
+ struct sched_domain_shared *sds;
int cpu = rq->cpu;

+#ifdef CONFIG_SCHED_SMT
if (static_branch_likely(&sched_smt_present))
set_core_busy(cpu);
#endif
+
+ rcu_read_lock();
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds) {
+ if (per_cpu(is_idle, cpu)) {
+ atomic_inc(&sds->nr_busy_cpus);
+ per_cpu(is_idle, cpu) = 0;
+ }
+ }
+ rcu_read_unlock();
+#endif
}

static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
@@ -442,9 +455,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
struct task_struct *pick_next_task_idle(struct rq *rq)
{
struct task_struct *next = rq->idle;
+#ifdef CONFIG_SMP
+ struct sched_domain_shared *sds;
+ int cpu = rq->cpu;

- set_next_task_idle(rq, next, true);
+ rcu_read_lock();
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds) {
+ atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
+ per_cpu(is_idle, cpu) = 1;
+ }

+ rcu_read_unlock();
+#endif
+
+ set_next_task_idle(rq, next, true);
return next;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5c0bd4b0e73a..baf8d9a4cb26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1483,6 +1483,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
#ifdef CONFIG_SCHED_SMT
DECLARE_PER_CPU(int, smt_id);
#endif
+DECLARE_PER_CPU(int, is_idle);
DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8db40c8a6ad0..00e4669bb241 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
#ifdef CONFIG_SCHED_SMT
DEFINE_PER_CPU(int, smt_id);
#endif
+DEFINE_PER_CPU(int, is_idle);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
#ifdef CONFIG_SCHED_SMT
per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
#endif
+ per_cpu(is_idle, cpu) = 1;
rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);

sd = lowest_flag_domain(cpu, SD_NUMA);
--
2.18.2

2021-04-22 10:25:57

by Srikar Dronamraju

Subject: [PATCH 04/10] sched/fair: Prefer idle CPU to cache affinity

The current order of preference to pick an LLC while waking a wake-affine
task is:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
that is idle.

2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
that is less heavily loaded.

In a situation where both the waker and previous CPUs are busy, but only
one of their LLCs has an idle CPU, the scheduler may end up picking an LLC
with no idle CPUs. To mitigate this, add a method where the scheduler
compares idle CPUs in the waker and previous LLCs and picks the
appropriate one.

The new method looks at idle-core to figure out an idle LLC. If there are
no idle LLCs, it compares the ratio of busy CPUs to the total number of
CPUs in each LLC. This method is only useful for comparing 2 LLCs; if the
previous CPU and the waking CPU are in the same LLC, it is not used. For
now the new method is disabled by default.
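
To make the comparison concrete, here is a small stand-alone illustration (not
part of the patch) of the cross-multiplication used to compare the two
busy-CPU ratios without a division; the numbers are made up:

	/* Illustration only: compare pnr_busy/pllc_size vs tnr_busy/tllc_size. */
	#include <stdio.h>

	static int pick(int tnr_busy, int tllc_size, int pnr_busy, int pllc_size)
	{
		int diff = pnr_busy * tllc_size - tnr_busy * pllc_size;

		if (diff > 0)
			return 0;	/* waker's LLC is less busy -> this_cpu */
		if (diff < 0)
			return 1;	/* previous LLC is less busy -> prev_cpu */
		return -1;		/* equally busy -> fall through to wake_affine_weight() */
	}

	int main(void)
	{
		/* waker LLC: 2 of 4 CPUs busy; previous LLC: 3 of 4 CPUs busy */
		printf("%d\n", pick(2, 4, 3, 4));	/* prints 0: pick the waker's LLC */
		return 0;
	}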

Cc: LKML <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Based on similar posting:
http://lore.kernel.org/lkml/[email protected]/t/#u
Some comments in the next patch
- Make WA_WAKER default (Suggested by Rik) : done in next patch
- Make WA_WAKER check more conservative: (Suggested by Rik / Peter)
- Rename WA_WAKER to WA_IDLER_LLC (Suggested by Vincent)
- s/pllc_size/tllc_size while checking for busy case: (Pointed by Dietmar)
- Add rcu_read_lock and check for validity of shared domains
- Add idle-core support

kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 65 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09c33cca0349..943621367a96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,67 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}

+static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_cpu, int sync)
+{
+#ifdef CONFIG_NO_HZ_COMMON
+ int pnr_busy, pllc_size, tnr_busy, tllc_size;
+#endif
+ struct sched_domain_shared *tsds, *psds;
+ int diff;
+
+ tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+ psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
+ if (!tsds || !psds)
+ return nr_cpumask_bits;
+
+ if (sync) {
+ if (available_idle_cpu(this_cpu) || sched_idle_cpu(this_cpu))
+ return this_cpu;
+ if (tsds->idle_core != -1) {
+ if (cpumask_test_cpu(tsds->idle_core, p->cpus_ptr))
+ return tsds->idle_core;
+ return this_cpu;
+ }
+ }
+
+ if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
+ return prev_cpu;
+ if (psds->idle_core != -1) {
+ if (cpumask_test_cpu(psds->idle_core, p->cpus_ptr))
+ return psds->idle_core;
+ return prev_cpu;
+ }
+
+ if (!sync) {
+ if (available_idle_cpu(this_cpu) || sched_idle_cpu(this_cpu))
+ return this_cpu;
+ if (tsds->idle_core != -1) {
+ if (cpumask_test_cpu(tsds->idle_core, p->cpus_ptr))
+ return tsds->idle_core;
+ return this_cpu;
+ }
+ }
+
+#ifdef CONFIG_NO_HZ_COMMON
+ tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+ pnr_busy = atomic_read(&psds->nr_busy_cpus);
+
+ tllc_size = per_cpu(sd_llc_size, this_cpu);
+ pllc_size = per_cpu(sd_llc_size, prev_cpu);
+
+ if (pnr_busy == pllc_size && tnr_busy == tllc_size)
+ return nr_cpumask_bits;
+
+ diff = pnr_busy * tllc_size - tnr_busy * pllc_size;
+ if (diff > 0)
+ return this_cpu;
+ if (diff < 0)
+ return prev_cpu;
+#endif /* CONFIG_NO_HZ_COMMON */
+
+ return nr_cpumask_bits;
+}
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int prev_cpu, int sync)
{
@@ -5877,6 +5938,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (sched_feat(WA_IDLE))
target = wake_affine_idle(this_cpu, prev_cpu, sync);

+ if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits)
+ target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
+
if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1bc2b158fc51..c77349a47e01 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,6 +83,7 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)

SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
+SCHED_FEAT(WA_IDLER_LLC, false)
SCHED_FEAT(WA_BIAS, true)

/*
--
2.18.2

2021-04-22 10:26:05

by Srikar Dronamraju

Subject: [PATCH 07/10] sched/fair: Remove ifdefs in wake_affine_idler_llc

Now that the idle callbacks are updating nr_busy_cpus, remove the ifdefs in
wake_affine_idler_llc().

Cc: LKML <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 00bcf1d861b5..8f752f77b76f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5870,9 +5870,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,

static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_cpu, int sync)
{
-#ifdef CONFIG_NO_HZ_COMMON
int pnr_busy, pllc_size, tnr_busy, tllc_size;
-#endif
struct sched_domain_shared *tsds, *psds;
int diff;

@@ -5909,7 +5907,6 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
}
}

-#ifdef CONFIG_NO_HZ_COMMON
tnr_busy = atomic_read(&tsds->nr_busy_cpus);
pnr_busy = atomic_read(&psds->nr_busy_cpus);

@@ -5924,7 +5921,6 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
return this_cpu;
if (diff < 0)
return prev_cpu;
-#endif /* CONFIG_NO_HZ_COMMON */

return nr_cpumask_bits;
}
--
2.18.2

2021-04-22 10:26:07

by Srikar Dronamraju

Subject: [PATCH 01/10] sched/fair: Update affine statistics when needed

wake_affine_idle() can return prev_cpu. Even in such a scenario, the
scheduler was going ahead and updating the wake-affine schedstats, i.e.
even if the task was not moved across LLC domains, the schedstats would
still have been accounted.

Hence add a check before updating the schedstats.

Cc: LKML <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..a258a84cfdfd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5884,8 +5884,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (target == nr_cpumask_bits)
return prev_cpu;

- schedstat_inc(sd->ttwu_move_affine);
- schedstat_inc(p->se.statistics.nr_wakeups_affine);
+ if (!cpus_share_cache(prev_cpu, target)) {
+ schedstat_inc(sd->ttwu_move_affine);
+ schedstat_inc(p->se.statistics.nr_wakeups_affine);
+ }
return target;
}

--
2.18.2

2021-04-22 10:26:21

by Srikar Dronamraju

Subject: [PATCH 03/10] sched/fair: Update idle-core more often

Currently when the scheduler does a load balance and pulls a task or
when a CPU picks up a task during wakeup without having to call
select_idle_cpu(), it never checks if the target CPU is part of the
idle-core. This makes idle-core less accurate.

Given that the identity of the idle-core for an LLC is maintained, it is easy
to update the idle-core as soon as a CPU in it picks up a task.

This change updates the idle-core whenever a CPU from the idle-core picks
up a task. However, if there are multiple idle cores in the LLC, and the
waking CPU happens to be part of the designated idle-core, the idle-core
is set to -1 (i.e. there are no idle-cores).

To reduce this case, whenever a CPU updates the idle-core, it will look at
other cores in the LLC for an idle core, if the core it belongs to is not
idle.

Cc: LKML <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/idle.c | 6 ++++++
kernel/sched/sched.h | 2 ++
3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03083eacdaf0..09c33cca0349 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6037,6 +6037,39 @@ static inline int get_idle_core(int cpu, int def)
return def;
}

+static void set_next_idle_core(struct sched_domain *sd, int target)
+{
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ int core, cpu;
+
+ cpumask_andnot(cpus, sched_domain_span(sd), cpu_smt_mask(target));
+ for_each_cpu_wrap(core, cpus, target) {
+ bool idle = true;
+
+ for_each_cpu(cpu, cpu_smt_mask(core)) {
+ if (!available_idle_cpu(cpu)) {
+ idle = false;
+ break;
+ }
+ }
+
+ if (idle) {
+ set_idle_core(core, per_cpu(smt_id, core));
+ return;
+ }
+
+ cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+ }
+}
+
+void set_core_busy(int core)
+{
+ rcu_read_lock();
+ if (get_idle_core(core, -1) == per_cpu(smt_id, core))
+ set_idle_core(core, -1);
+ rcu_read_unlock();
+}
+
/*
* Scans the local SMT mask to see if the entire core is idle, and records this
* information in sd_llc_shared->idle_core.
@@ -6046,11 +6079,13 @@ static inline int get_idle_core(int cpu, int def)
*/
void __update_idle_core(struct rq *rq)
{
+ struct sched_domain *sd;
int core = cpu_of(rq);
int cpu;

rcu_read_lock();
- if (get_idle_core(core, 0) != -1)
+ sd = rcu_dereference(per_cpu(sd_llc, core));
+ if (!sd || get_idle_core(core, 0) != -1)
goto unlock;

for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6058,10 +6093,15 @@ void __update_idle_core(struct rq *rq)
continue;

if (!available_idle_cpu(cpu))
- goto unlock;
+ goto try_next;
}

set_idle_core(core, per_cpu(smt_id, core));
+ goto unlock;
+
+try_next:
+ set_next_idle_core(sd, core);
+
unlock:
rcu_read_unlock();
}
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 7199e6f23789..cc828f3efe71 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -425,6 +425,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
+#ifdef CONFIG_SCHED_SMT
+ int cpu = rq->cpu;
+
+ if (static_branch_likely(&sched_smt_present))
+ set_core_busy(cpu);
+#endif
}

static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 46d40a281724..5c0bd4b0e73a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1102,6 +1102,7 @@ static inline bool is_migration_disabled(struct task_struct *p)

#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);
+extern void set_core_busy(int cpu);

static inline void update_idle_core(struct rq *rq)
{
@@ -1111,6 +1112,7 @@ static inline void update_idle_core(struct rq *rq)

#else
static inline void update_idle_core(struct rq *rq) { }
+static inline void set_core_busy(int cpu) { }
#endif

DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
--
2.18.2

2021-04-22 10:27:03

by Srikar Dronamraju

Subject: [PATCH 09/10] sched/topology: Introduce fallback LLC

On some systems, LLC sizes may be too small. Some of these systems may
also support multiple cache access latency levels, i.e. between the
previous LLC and the waker LLC, there could be other LLCs that have a lower
cache access latency to the waker LLC. If the waker LLC is busy, then the
scheduler could choose to schedule a task on such an LLC.

Here is one approach to identify a static fallback LLC for each LLC on
systems that support multiple cache access latency levels. In this
approach, the fallback LLCs are decided at boot/CPU bring-up time. There
is a one-to-one mapping between an LLC and its fallback LLC. The fallback
LLC will only be used if the wakeup is a sync wakeup and the current LLC is
busier than the fallback LLC. Also, the scheduler will not choose the
fallback LLC if the previous LLC has the same cache access latency as the
fallback LLC.

It is expected that the fallback LLC is part of the parent domain of the
LLC domain. Archs can choose to use the fallback LLC by setting the
SD_FALLBACK_LLC flag.
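
As a usage sketch (the powerpc change in patch 10/10 of this series is the real
example), an arch opts in by returning SD_FALLBACK_LLC from the flags callback
of the topology level that is the parent of its LLC level. The arch_* names
below are placeholders:

	static int arch_mc_flags(void)
	{
		return SD_FALLBACK_LLC;
	}

	static struct sched_domain_topology_level arch_topology[] = {
		{ arch_llc_mask, arch_llc_flags, SD_INIT_NAME(CACHE) },	/* LLC level */
		{ arch_mc_mask, arch_mc_flags, SD_INIT_NAME(MC) },	/* its parent; fallback LLCs are paired within it */
		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
		{ NULL, },
	};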

Cc: LKML <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/sched/sd_flags.h | 7 ++++++
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++---
kernel/sched/topology.c | 45 ++++++++++++++++++++++++++++++++--
4 files changed, 90 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 34b21e971d77..3ca44dd421e4 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -129,6 +129,13 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
*/
SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

+/*
+ * Consider waking task on near-by idle LLC.
+ *
+ * NEEDS_GROUPS: Load balancing flag.
+ */
+SD_FLAG(SD_FALLBACK_LLC, SDF_NEEDS_GROUPS)
+
/*
* Prefer to place tasks in a sibling domain
*
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 285165a35f21..b0446191319a 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,6 +74,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int idle_core;
+ int fallback_llc_id;
};

struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db5dc9875e4c..8ea6d0183fc8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5873,7 +5873,8 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
{
int pnr_busy, pllc_size, tnr_busy, tllc_size;
struct sched_domain_shared *tsds, *psds;
- int diff;
+ bool try_fallback = false;
+ int diff, fcpu = -1;

tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
@@ -5890,6 +5891,43 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
}
}

+ tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+ tllc_size = per_cpu(sd_llc_size, this_cpu);
+
+ if (sync) {
+ struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
+
+ /*
+ * task is a target of *sync* wakeup. However there are no
+ * idle cores in the waking CPU. Ignore fallback LLC if the
+ * previous CPU is part of the LLC's parent domain.
+ */
+ try_fallback = !cpumask_test_cpu(prev_cpu, sched_domain_span(sd->parent));
+ fcpu = tsds->fallback_llc_id;
+ }
+
+ if (try_fallback && fcpu != -1 && cpumask_test_cpu(fcpu, p->cpus_ptr)) {
+ struct sched_domain_shared *fsds;
+ int fnr_busy, fllc_size;
+
+ fsds = rcu_dereference(per_cpu(sd_llc_shared, fcpu));
+ if (fsds && fsds != psds) {
+ if (fsds->idle_core != -1) {
+ if (cpumask_test_cpu(fsds->idle_core, p->cpus_ptr))
+ return fsds->idle_core;
+ return fcpu;
+ }
+
+ fnr_busy = atomic_read(&fsds->nr_busy_cpus);
+ fllc_size = per_cpu(sd_llc_size, fcpu);
+ if (fnr_busy * tllc_size < tnr_busy * fllc_size) {
+ tnr_busy = fnr_busy;
+ tllc_size = fllc_size;
+ this_cpu = fcpu;
+ }
+ }
+ }
+
if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
return prev_cpu;
if (psds->idle_core != -1) {
@@ -5908,10 +5946,7 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
}
}

- tnr_busy = atomic_read(&tsds->nr_busy_cpus);
pnr_busy = atomic_read(&psds->nr_busy_cpus);
-
- tllc_size = per_cpu(sd_llc_size, this_cpu);
pllc_size = per_cpu(sd_llc_size, prev_cpu);

if (pnr_busy == pllc_size && tnr_busy == tllc_size) {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 00e4669bb241..89aa8986c58b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -603,6 +603,8 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)

static void destroy_sched_domain(struct sched_domain *sd)
{
+ struct sched_domain_shared *sds = sd->shared;
+
/*
* A normal sched domain may have multiple group references, an
* overlapping domain, having private groups, only one. Iterate,
@@ -610,8 +612,18 @@ static void destroy_sched_domain(struct sched_domain *sd)
*/
free_sched_groups(sd->groups, 1);

- if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
- kfree(sd->shared);
+ if (sds && atomic_dec_and_test(&sds->ref)) {
+ struct sched_domain_shared *next_sds;
+
+ if (sds->fallback_llc_id != -1) {
+ next_sds = rcu_dereference(per_cpu(sd_llc_shared, sds->fallback_llc_id));
+ if (next_sds && next_sds->fallback_llc_id != -1)
+ next_sds->fallback_llc_id = -1;
+
+ sds->fallback_llc_id = -1;
+ }
+ kfree(sds);
+ }
kfree(sd);
}

@@ -663,9 +675,36 @@ static void update_top_cache_domain(int cpu)

sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
if (sd) {
+ struct sched_domain *sd_parent = sd->parent;
+
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
sds = sd->shared;
+
+ if (sds->fallback_llc_id == -1 && sd_parent &&
+ sd_parent->flags & SD_FALLBACK_LLC) {
+ const struct cpumask *parent_span = sched_domain_span(sd->parent);
+ struct cpumask *span = sched_domains_tmpmask;
+ int fcpu;
+
+ /*
+ * If LLC's parent domain has SD_FALLBACK_LLC flag
+ * set and this LLC's fallback_llc_id is not yet
+ * set, then walk through the LLC parent's domain to
+ * find a fallback_llc.
+ */
+ cpumask_andnot(span, parent_span, sched_domain_span(sd));
+ for_each_cpu_wrap(fcpu, span, cpu) {
+ struct sched_domain_shared *next_sds;
+
+ next_sds = rcu_dereference(per_cpu(sd_llc_shared, fcpu));
+ if (next_sds && next_sds->fallback_llc_id == -1) {
+ sds->fallback_llc_id = fcpu;
+ next_sds->fallback_llc_id = cpu;
+ break;
+ }
+ }
+ }
}

rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
@@ -1370,6 +1409,7 @@ int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
#define TOPOLOGY_SD_FLAGS \
(SD_SHARE_CPUCAPACITY | \
SD_SHARE_PKG_RESOURCES | \
+ SD_FALLBACK_LLC | \
SD_NUMA | \
SD_ASYM_PACKING)

@@ -1475,6 +1515,7 @@ sd_init(struct sched_domain_topology_level *tl,
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
sd->shared->idle_core = -1;
+ sd->shared->fallback_llc_id = -1;
}

sd->private = sdd;
--
2.18.2

2021-04-22 10:27:13

by Srikar Dronamraju

Subject: [PATCH 10/10] powerpc/smp: Add fallback flag to powerpc MC domain

Power10 supports the MC domain. Cores within an MC domain, aka hemisphere,
have a lower cache access latency than cores across the neighbouring MC.
The last level of cache on Power10 is at the core level (4 threads).

Now that the scheduler supports a fallback LLC domain, mark the MC domain
with the SD_FALLBACK_LLC flag. With this, each SMT4 core forms a pair (as
fallback LLC) with another SMT4 core within the MC domain.

Cc: LKML <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/powerpc/kernel/smp.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a1070d..bc6386055cbe 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -950,6 +950,11 @@ static int powerpc_shared_cache_flags(void)
return SD_SHARE_PKG_RESOURCES;
}

+static int powerpc_mc_flags(void)
+{
+ return SD_FALLBACK_LLC;
+}
+
/*
* We can't just pass cpu_l2_cache_mask() directly because
* returns a non-const pointer and the compiler barfs on that.
@@ -986,7 +991,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
#endif
{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
- { cpu_mc_mask, SD_INIT_NAME(MC) },
+ { cpu_mc_mask, powerpc_mc_flags, SD_INIT_NAME(MC) },
{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
{ NULL, },
};
--
2.18.2

2021-04-22 10:27:13

by Srikar Dronamraju

Subject: [PATCH 08/10] sched/fair: Dont iterate if no idle CPUs

Now that nr_busy_cpus for an LLC is updated in the idle callbacks, the
scheduler can detect if all threads of an LLC are busy. In such cases, it
can avoid searching the LLC for an idle CPU that could run the wakee
thread.

Cc: LKML <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f752f77b76f..db5dc9875e4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -715,7 +715,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
#include "pelt.h"
#ifdef CONFIG_SMP

-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu, bool idle);
static unsigned long task_h_load(struct task_struct *p);
static unsigned long capacity_of(int cpu);

@@ -5868,7 +5868,8 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}

-static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_cpu, int sync)
+static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_cpu,
+ int sync, bool *idle)
{
int pnr_busy, pllc_size, tnr_busy, tllc_size;
struct sched_domain_shared *tsds, *psds;
@@ -5913,8 +5914,10 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
tllc_size = per_cpu(sd_llc_size, this_cpu);
pllc_size = per_cpu(sd_llc_size, prev_cpu);

- if (pnr_busy == pllc_size && tnr_busy == tllc_size)
+ if (pnr_busy == pllc_size && tnr_busy == tllc_size) {
+ *idle = false;
return nr_cpumask_bits;
+ }

diff = pnr_busy * tllc_size - tnr_busy * pllc_size;
if (diff > 0)
@@ -5926,7 +5929,7 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int prev_cpu, int sync)
+ int this_cpu, int prev_cpu, int sync, bool *idle)
{
bool share_caches = cpus_share_cache(prev_cpu, this_cpu);
int target = nr_cpumask_bits;
@@ -5935,7 +5938,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
target = wake_affine_idle(this_cpu, prev_cpu);

else if (sched_feat(WA_IDLER_LLC) && !share_caches)
- target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
+ target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync, idle);

if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
@@ -6333,7 +6336,7 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
/*
* Try and locate an idle core/thread in the LLC cache domain.
*/
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target, bool idle)
{
struct sched_domain *sd;
unsigned long task_util;
@@ -6410,6 +6413,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
}
}

+ if (!idle)
+ return target;
+
sd = rcu_dereference(per_cpu(sd_llc, target));
if (!sd)
return target;
@@ -6818,6 +6824,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
int want_affine = 0;
/* SD_flags and WF_flags share the first nibble */
int sd_flag = wake_flags & 0xF;
+ bool idle = true;

if (wake_flags & WF_TTWU) {
record_wakee(p);
@@ -6841,7 +6848,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
if (cpu != prev_cpu)
- new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
+ new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync, &idle);

sd = NULL; /* Prefer wake_affine over balance flags */
break;
@@ -6858,7 +6865,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
- new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+ new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, idle);

if (want_affine)
current->recent_used_cpu = cpu;
--
2.18.2

2021-04-22 15:15:06

by kernel test robot

Subject: Re: [PATCH 09/10] sched/topology: Introduce fallback LLC

Hi Srikar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on linus/master v5.12-rc8]
[cannot apply to tip/sched/core tip/master next-20210422]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Srikar-Dronamraju/sched-fair-wake_affine-improvements/20210422-182725
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: x86_64-randconfig-s022-20210421 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.3-341-g8af24329-dirty
# https://github.com/0day-ci/linux/commit/dcb752591087d90fd4093094e42baf167fdcae7b
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Srikar-Dronamraju/sched-fair-wake_affine-improvements/20210422-182725
git checkout dcb752591087d90fd4093094e42baf167fdcae7b
# save the attached .config to linux build tree
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>


sparse warnings: (new ones prefixed by >>)
kernel/sched/topology.c:106:56: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:106:56: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:106:56: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:125:60: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:125:60: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:125:60: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:148:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:148:20: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:148:20: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:431:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct perf_domain *[assigned] tmp @@ got struct perf_domain [noderef] __rcu *pd @@
kernel/sched/topology.c:431:13: sparse: expected struct perf_domain *[assigned] tmp
kernel/sched/topology.c:431:13: sparse: got struct perf_domain [noderef] __rcu *pd
kernel/sched/topology.c:440:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct perf_domain *[assigned] tmp @@ got struct perf_domain [noderef] __rcu *pd @@
kernel/sched/topology.c:440:13: sparse: expected struct perf_domain *[assigned] tmp
kernel/sched/topology.c:440:13: sparse: got struct perf_domain [noderef] __rcu *pd
kernel/sched/topology.c:461:19: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct perf_domain *[assigned] pd @@ got struct perf_domain [noderef] __rcu *pd @@
kernel/sched/topology.c:461:19: sparse: expected struct perf_domain *[assigned] pd
kernel/sched/topology.c:461:19: sparse: got struct perf_domain [noderef] __rcu *pd
kernel/sched/topology.c:635:49: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:635:49: sparse: expected struct sched_domain *parent
kernel/sched/topology.c:635:49: sparse: got struct sched_domain [noderef] __rcu *parent
>> kernel/sched/topology.c:678:52: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *sd_parent @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:678:52: sparse: expected struct sched_domain *sd_parent
kernel/sched/topology.c:678:52: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:686:81: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:686:81: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:686:81: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:742:50: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:742:50: sparse: expected struct sched_domain *parent
kernel/sched/topology.c:742:50: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:749:55: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *[noderef] __rcu child @@ got struct sched_domain *[assigned] tmp @@
kernel/sched/topology.c:749:55: sparse: expected struct sched_domain [noderef] __rcu *[noderef] __rcu child
kernel/sched/topology.c:749:55: sparse: got struct sched_domain *[assigned] tmp
kernel/sched/topology.c:759:29: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:759:29: sparse: expected struct sched_domain *[assigned] tmp
kernel/sched/topology.c:759:29: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:764:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:764:20: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:764:20: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:770:33: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:770:33: sparse: expected struct sched_domain *[assigned] tmp
kernel/sched/topology.c:770:33: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:805:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *sd @@
kernel/sched/topology.c:805:13: sparse: expected struct sched_domain *[assigned] tmp
kernel/sched/topology.c:805:13: sparse: got struct sched_domain [noderef] __rcu *sd
kernel/sched/topology.c:967:70: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:967:70: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:967:70: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:996:59: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:996:59: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:996:59: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1166:40: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:1166:40: sparse: expected struct sched_domain *child
kernel/sched/topology.c:1166:40: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1468:43: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain [noderef] __rcu *child @@ got struct sched_domain *child @@
kernel/sched/topology.c:1468:43: sparse: expected struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1468:43: sparse: got struct sched_domain *child
kernel/sched/topology.c:1955:31: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *parent @@ got struct sched_domain *sd @@
kernel/sched/topology.c:1955:31: sparse: expected struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:1955:31: sparse: got struct sched_domain *sd
kernel/sched/topology.c:2123:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2123:57: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/topology.c:2123:57: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:2140:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2140:57: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/topology.c:2140:57: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:59:25: sparse: sparse: dereference of noderef expression
kernel/sched/topology.c:64:25: sparse: sparse: dereference of noderef expression
kernel/sched/topology.c: note: in included file:
kernel/sched/sched.h:1459:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/sched.h:1459:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/sched.h:1459:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/sched.h:1472:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/sched.h:1472:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/sched.h:1472:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/sched.h:1459:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/sched.h:1459:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/sched.h:1459:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/sched.h:1472:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/sched.h:1472:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/sched.h:1472:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:1483:19: sparse: sparse: dereference of noderef expression
--
kernel/sched/fair.c:861:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_entity *se @@ got struct sched_entity [noderef] __rcu * @@
kernel/sched/fair.c:861:34: sparse: expected struct sched_entity *se
kernel/sched/fair.c:861:34: sparse: got struct sched_entity [noderef] __rcu *
kernel/sched/fair.c:10750:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/fair.c:10750:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/fair.c:10750:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/fair.c:4907:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/fair.c:4907:22: sparse: struct task_struct [noderef] __rcu *
kernel/sched/fair.c:4907:22: sparse: struct task_struct *
kernel/sched/fair.c:5430:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/fair.c:5430:38: sparse: expected struct task_struct *curr
kernel/sched/fair.c:5430:38: sparse: got struct task_struct [noderef] __rcu *curr
>> kernel/sched/fair.c:5905:80: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/fair.c:5905:80: sparse: expected struct sched_domain *sd
kernel/sched/fair.c:5905:80: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/fair.c:6755:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/fair.c:6755:20: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/fair.c:6755:20: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/fair.c:6878:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/fair.c:6878:9: sparse: expected struct sched_domain *[assigned] tmp
kernel/sched/fair.c:6878:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/fair.c:7076:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/fair.c:7076:38: sparse: expected struct task_struct *curr
kernel/sched/fair.c:7076:38: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/fair.c:7327:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/fair.c:7327:38: sparse: expected struct task_struct *curr
kernel/sched/fair.c:7327:38: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/fair.c:8298:40: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/fair.c:8298:40: sparse: expected struct sched_domain *child
kernel/sched/fair.c:8298:40: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/fair.c:8791:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/fair.c:8791:22: sparse: struct task_struct [noderef] __rcu *
kernel/sched/fair.c:8791:22: sparse: struct task_struct *
kernel/sched/fair.c:10067:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/fair.c:10067:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/fair.c:10067:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/fair.c:9723:44: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *sd_parent @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/fair.c:9723:44: sparse: expected struct sched_domain *sd_parent
kernel/sched/fair.c:9723:44: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/fair.c:10145:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/fair.c:10145:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/fair.c:10145:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/fair.c:4575:31: sparse: sparse: marked inline, but without a definition
kernel/sched/fair.c: note: in included file:
kernel/sched/sched.h:1730:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:1730:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:1730:25: sparse: struct task_struct *
kernel/sched/sched.h:1885:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:1885:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:1885:9: sparse: struct task_struct *
kernel/sched/sched.h:1730:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:1730:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:1730:25: sparse: struct task_struct *
kernel/sched/sched.h:1730:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:1730:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:1730:25: sparse: struct task_struct *

vim +678 kernel/sched/topology.c

  668	
  669	static void update_top_cache_domain(int cpu)
  670	{
  671		struct sched_domain_shared *sds = NULL;
  672		struct sched_domain *sd;
  673		int id = cpu;
  674		int size = 1;
  675	
  676		sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
  677		if (sd) {
> 678			struct sched_domain *sd_parent = sd->parent;
  679	
  680			id = cpumask_first(sched_domain_span(sd));
  681			size = cpumask_weight(sched_domain_span(sd));
  682			sds = sd->shared;
  683	
  684			if (sds->fallback_llc_id == -1 && sd_parent &&
  685			    sd_parent->flags & SD_FALLBACK_LLC) {
  686				const struct cpumask *parent_span = sched_domain_span(sd->parent);
  687				struct cpumask *span = sched_domains_tmpmask;
  688				int fcpu;
  689	
  690				/*
  691				 * If LLC's parent domain has SD_FALLBACK_LLC flag
  692				 * set and this LLC's fallback_llc_id is not yet
  693				 * set, then walk through the LLC parent's domain to
  694				 * find a fallback_llc.
  695				 */
  696				cpumask_andnot(span, parent_span, sched_domain_span(sd));
  697				for_each_cpu_wrap(fcpu, span, cpu) {
  698					struct sched_domain_shared *next_sds;
  699	
  700					next_sds = rcu_dereference(per_cpu(sd_llc_shared, fcpu));
  701					if (next_sds && next_sds->fallback_llc_id == -1) {
  702						sds->fallback_llc_id = fcpu;
  703						next_sds->fallback_llc_id = cpu;
  704						break;
  705					}
  706				}
  707			}
  708		}
  709	
  710		rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
  711		per_cpu(sd_llc_size, cpu) = size;
  712		per_cpu(sd_llc_id, cpu) = id;
  713	#ifdef CONFIG_SCHED_SMT
  714		per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
  715	#endif
  716		per_cpu(is_idle, cpu) = 1;
  717		rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
  718	
  719		sd = lowest_flag_domain(cpu, SD_NUMA);
  720		rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
  721	
  722		sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
  723		rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
  724	
  725		sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
  726		rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
  727	}
  728	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]



2021-04-23 08:28:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements

On Thu, Apr 22, 2021 at 03:53:16PM +0530, Srikar Dronamraju wrote:
> Recently we found that some of the benchmark numbers on Power10 were lesser
> than expected. Some analysis showed that the problem lies in the fact that
> L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
>

I didn't get the chance to review this properly although I am suspicious
of tracking idle_core and updating that more frequently. It becomes a very
hot cache line that bounces. I did experiment with tracking an idle core
but the data either went stale too quickly or the updates incurred more
overhead than a reduced search saved.

The series also oopses a *lot* and didn't get through a run of basic
workloads on x86 on any of three machines. An example oops is

[ 137.770968] BUG: unable to handle page fault for address: 000000000001a5c8
[ 137.777836] #PF: supervisor read access in kernel mode
[ 137.782965] #PF: error_code(0x0000) - not-present page
[ 137.788097] PGD 8000004098a42067 P4D 8000004098a42067 PUD 4092e36067 PMD 40883ac067 PTE 0
[ 137.796261] Oops: 0000 [#1] SMP PTI
[ 137.799747] CPU: 0 PID: 14913 Comm: GC Slave Tainted: G E 5.12.0-rc8-llcfallback-v1r1 #1
[ 137.809123] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
[ 137.816765] RIP: 0010:cpus_share_cache+0x22/0x30
[ 137.821396] Code: fc ff 0f 0b eb 80 66 90 0f 1f 44 00 00 48 63 ff 48 63 f6 48 c7 c0 c8 a5 01 00 48 8b 0c fd 00 59 9d 9a 48 8b 14 f5 00 59 9d 9a <8b> 14 02 39 14 01 0f 94 c0 c3 0f 1f 40 00 0f 1f 44 00 00 41 57 41
[ 137.840133] RSP: 0018:ffff9e0aa26a7c60 EFLAGS: 00010086
[ 137.845360] RAX: 000000000001a5c8 RBX: ffff8bfbc65cf280 RCX: ffff8c3abf440000
[ 137.852483] RDX: 0000000000000000 RSI: ffffffffffffffff RDI: 0000000000000015
[ 137.859605] RBP: ffff8bbc478b9fe0 R08: 0000000000000000 R09: 000000000001a5b8
[ 137.866730] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8bbc47c88a00
[ 137.873855] R13: 0000000000000015 R14: 0000000000000001 R15: ffff8bfde0199fc0
[ 137.880978] FS: 00007fa33c173700(0000) GS:ffff8bfabf400000(0000) knlGS:0000000000000000
[ 137.889055] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 137.894794] CR2: 000000000001a5c8 CR3: 0000004096b22006 CR4: 00000000003706f0
[ 137.901918] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 137.909041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 137.916164] Call Trace:
[ 137.918614] select_task_rq_fair+0xb88/0x1410
[ 137.922974] try_to_wake_up+0x114/0x4b0
[ 137.926813] ? should_numa_migrate_memory+0x54/0x170
[ 137.931775] wake_up_q+0x64/0xa0
[ 137.934999] futex_wake+0x13e/0x160
[ 137.938493] do_futex+0xce/0xa80
[ 137.941724] ? asm_sysvec_call_function+0x12/0x20
[ 137.946422] ? __cgroup_account_cputime+0x24/0x30
[ 137.951126] ? update_curr+0xfa/0x1a0
[ 137.954784] ? pick_next_entity+0x70/0x110
[ 137.958876] ? pick_next_task_fair+0xd2/0x3b0
[ 137.963233] __x64_sys_futex+0x13b/0x1f0
[ 137.967155] do_syscall_64+0x33/0x80
[ 137.970737] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 137.975786] RIP: 0033:0x7fc9de7e1cda
[ 137.979365] Code: 00 00 b8 ca 00 00 00 0f 05 5a 5e c3 0f 1f 40 00 56 52 c7 07 00 00 00 00 81 f6 81 00 00 00 ba 01 00 00 00 b8 ca 00 00 00 0f 05 <5a> 5e c3 0f 1f 00 41 54 41 55 49 89 fc 49 89 f5 48 83 ec 18 48 89
[ 137.998101] RSP: 002b:00007fa33c170af0 EFLAGS: 00000206 ORIG_RAX: 00000000000000ca
[ 138.005657] RAX: ffffffffffffffda RBX: 00007fc9d80050a8 RCX: 00007fc9de7e1cda
[ 138.012783] RDX: 0000000000000001 RSI: 0000000000000081 RDI: 00007fc9d8005128
[ 138.019907] RBP: 00007fc9d85b9298 R08: 0000000000000000 R09: 0000000000000001
[ 138.027031] R10: 0000000000000000 R11: 0000000000000206 R12: 00007fc9d8005128
[ 138.034155] R13: 00007fc9d85b9268 R14: 00007fc9d8005150 R15: 00007fc9d85b8df8
[ 138.041279] Modules linked in: binfmt_misc(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) dmi_sysfs(E) msr(E) intel_rapl_msr(E) intel_rapl_common(E) sb_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) iTCO_wdt(E) kvm_intel(E) iTCO_vendor_support(E) ipmi_ssif(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ixgbe(E) ghash_clmulni_intel(E) xfrm_algo(E) aesni_intel(E) mdio_devres(E) crypto_simd(E) cryptd(E) joydev(E) igb(E) libphy(E) ioatdma(E) mei_me(E) i2c_i801(E) mdio(E) pcspkr(E) acpi_ipmi(E) i2c_smbus(E) lpc_ich(E) mei(E) dca(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) button(E) btrfs(E) blake2b_generic(E) libcrc32c(E) xor(E) raid6_pq(E) hid_generic(E) usbhid(E) ast(E) i2c_algo_bit(E) drm_vram_helper(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) cec(E) xhci_pci(E) drm_ttm_helper(E) ehci_pci(E) ttm(E) xhci_hcd(E) ehci_hcd(E) crc32c_intel(E) drm(E) usbcore(E) mpt3sas(E)
[ 138.041333] raid_class(E) scsi_transport_sas(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E)
[ 138.140016] CR2: 000000000001a5c8

--
Mel Gorman
SUSE Labs

2021-04-23 10:36:30

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements

* Mel Gorman <[email protected]> [2021-04-23 09:25:32]:

> On Thu, Apr 22, 2021 at 03:53:16PM +0530, Srikar Dronamraju wrote:
> > Recently we found that some of the benchmark numbers on Power10 were lesser
> > than expected. Some analysis showed that the problem lies in the fact that
> > L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> >
>
> I didn't get the chance to review this properly although I am suspicious
> of tracking idle_core and updating that more frequently. It becomes a very
> hot cache line that bounces. I did experiment with tracking an idle core
> but the data either went stale too quickly or the updates incurred more
> overhead than a reduced search saved.

Thanks a lot for trying this.

> The series also oopses a *lot* and didn't get through a run of basic
> workloads on x86 on any of three machines. An example oops is
>

Can you pass me your failing config? I have somehow not been seeing this
either on x86 or on Powerpc on multiple systems.
Also if possible cat /proc/schedstat and cat
/proc/sys/kernel/sched_domain/cpu0/domain*/name

> [ 137.770968] BUG: unable to handle page fault for address: 000000000001a5c8
> [ 137.777836] #PF: supervisor read access in kernel mode
> [ 137.782965] #PF: error_code(0x0000) - not-present page
> [ 137.788097] PGD 8000004098a42067 P4D 8000004098a42067 PUD 4092e36067 PMD 40883ac067 PTE 0
> [ 137.796261] Oops: 0000 [#1] SMP PTI
> [ 137.799747] CPU: 0 PID: 14913 Comm: GC Slave Tainted: G E 5.12.0-rc8-llcfallback-v1r1 #1
> [ 137.809123] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> [ 137.816765] RIP: 0010:cpus_share_cache+0x22/0x30
> [ 137.821396] Code: fc ff 0f 0b eb 80 66 90 0f 1f 44 00 00 48 63 ff 48 63 f6 48 c7 c0 c8 a5 01 00 48 8b 0c fd 00 59 9d 9a 48 8b 14 f5 00 59 9d 9a <8b> 14 02 39 14 01 0f 94 c0 c3 0f 1f 40 00 0f 1f 44 00 00 41 57 41

The IP points to cpus_share_cache(), which takes two ints. RAX is
000000000001a5c8 and the panic says "unable to handle page fault for
address: 000000000001a5c8", so the fault must be in
"per_cpu(sd_llc_id, xx_cpu)".
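
For reference, cpus_share_cache() in this area of the kernel is roughly
the following (paraphrased from kernel/sched/core.c; a sketch, not an
exact quote):

	bool cpus_share_cache(int this_cpu, int that_cpu)
	{
		/*
		 * If a stale idle_core of -1 reaches here as a CPU number,
		 * per_cpu() computes a bogus per-CPU address and the load
		 * faults, matching the page fault in the oops above.
		 */
		return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
	}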

> [ 137.840133] RSP: 0018:ffff9e0aa26a7c60 EFLAGS: 00010086
> [ 137.845360] RAX: 000000000001a5c8 RBX: ffff8bfbc65cf280 RCX: ffff8c3abf440000
> [ 137.852483] RDX: 0000000000000000 RSI: ffffffffffffffff RDI: 0000000000000015
> [ 137.859605] RBP: ffff8bbc478b9fe0 R08: 0000000000000000 R09: 000000000001a5b8
> [ 137.866730] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8bbc47c88a00
> [ 137.873855] R13: 0000000000000015 R14: 0000000000000001 R15: ffff8bfde0199fc0
> [ 137.880978] FS: 00007fa33c173700(0000) GS:ffff8bfabf400000(0000) knlGS:0000000000000000
> [ 137.889055] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 137.894794] CR2: 000000000001a5c8 CR3: 0000004096b22006 CR4: 00000000003706f0
> [ 137.901918] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 137.909041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

--
Thanks and Regards
Srikar Dronamraju

2021-04-23 12:41:25

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements

On Fri, Apr 23, 2021 at 04:01:29PM +0530, Srikar Dronamraju wrote:
> > The series also oopses a *lot* and didn't get through a run of basic
> > workloads on x86 on any of three machines. An example oops is
> >
>
> > Can you pass me your failing config? I have somehow not been seeing this
> > either on x86 or on Powerpc on multiple systems.

The machines have since moved on to testing something else (Rik's patch
for newidle) but the attached config should be close enough.

> Also if possible cat /proc/schedstat and cat
> /proc/sys/kernel/sched_domain/cpu0/domain*/name
>

For the vanilla kernel

SMT
MC
NUMA

> > [ 137.770968] BUG: unable to handle page fault for address: 000000000001a5c8
> > [ 137.777836] #PF: supervisor read access in kernel mode
> > [ 137.782965] #PF: error_code(0x0000) - not-present page
> > [ 137.788097] PGD 8000004098a42067 P4D 8000004098a42067 PUD 4092e36067 PMD 40883ac067 PTE 0
> > [ 137.796261] Oops: 0000 [#1] SMP PTI
> > [ 137.799747] CPU: 0 PID: 14913 Comm: GC Slave Tainted: G E 5.12.0-rc8-llcfallback-v1r1 #1
> > [ 137.809123] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> > [ 137.816765] RIP: 0010:cpus_share_cache+0x22/0x30
> > [ 137.821396] Code: fc ff 0f 0b eb 80 66 90 0f 1f 44 00 00 48 63 ff 48 63 f6 48 c7 c0 c8 a5 01 00 48 8b 0c fd 00 59 9d 9a 48 8b 14 f5 00 59 9d 9a <8b> 14 02 39 14 01 0f 94 c0 c3 0f 1f 40 00 0f 1f 44 00 00 41 57 41
>
> The IP points to cpus_share_cache(), which takes two ints. RAX is
> 000000000001a5c8 and the panic says "unable to handle page fault for
> address: 000000000001a5c8", so the fault must be in
> "per_cpu(sd_llc_id, xx_cpu)".
>

More than likely, I didn't look closely because the intent was to schedule
tests to get some data and do the review later when I had time. tbench
partially completed but oopsed for high thread counts. Another load failed
completely and I didn't test beyond that but tbench for high thread counts
should be reproducible.

--
Mel Gorman
SUSE Labs



2021-04-26 10:32:39

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements

* Mel Gorman <[email protected]> [2021-04-23 13:38:55]:

Hi Mel,

> On Fri, Apr 23, 2021 at 04:01:29PM +0530, Srikar Dronamraju wrote:
> > > The series also oopses a *lot* and didn't get through a run of basic
> > > workloads on x86 on any of three machines. An example oops is
> > >
> >
> > Can you pass me your failing config? I have somehow not been seeing this
> > either on x86 or on Powerpc on multiple systems.
>
> The machines have since moved on to testing something else (Rik's patch
> for newidle) but the attached config should be close enough.
>
> > Also if possible cat /proc/schedstat and cat
> > /proc/sys/kernel/sched_domain/cpu0/domain*/name
> >
>
> For the vanilla kernel
>
> SMT
> MC
> NUMA

I was able to reproduce the problem and analyze why it would panic in
cpus_share_cache.

In my patch(es), we have code snippets like this.

	if (tsds->idle_core != -1) {
		if (cpumask_test_cpu(tsds->idle_core, p->cpus_ptr))
			return tsds->idle_core;
		return this_cpu;
	}

Here, when we tested idle_core with cpumask_test_cpu(), tsds->idle_core
may not have been -1; however, by the time we return it, tsds->idle_core
could have become -1.

cpus_share_cache() then tries to look up sd_llc_id for CPU -1 and crashes.

It's easier to reproduce this on a machine with more cores in an LLC
than, say, a Power10/Power9. Hence we are hitting this more often on x86.

One way could be to save the idle_core to a local variable, but that
negates the whole purpose since we may end up choosing a busy CPU. I
will find a way to fix this problem.

--
Thanks and Regards
Srikar Dronamraju

2021-04-26 10:41:48

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements

* Mel Gorman <[email protected]> [2021-04-23 09:25:32]:

> On Thu, Apr 22, 2021 at 03:53:16PM +0530, Srikar Dronamraju wrote:
> > Recently we found that some of the benchmark numbers on Power10 were lesser
> > than expected. Some analysis showed that the problem lies in the fact that
> > L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> >
>
> I didn't get the chance to review this properly although I am suspicious
> of tracking idle_core and updating that more frequently. It becomes a very
> hot cache line that bounces. I did experiment with tracking an idle core
> but the data either went stale too quickly or the updates incurred more
> overhead than a reduced search saved.
>

This change does increase the number of times we read the idle-core. There
are also more places where we try to update the idle-core. However, I feel
the number of times we actually update the idle-core will now be much
lower than before, because we are mostly doing a conditional update, i.e.
we are updating the idle-core only if the waking-up CPU happens to be part
of our core.
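
Roughly, the conditional update looks like this (an illustrative sketch
only, not the exact code from the series; 'cpu' is the CPU doing the
wakeup):

	int idle_core = READ_ONCE(sds->idle_core);

	/*
	 * Only clear the cached idle-core when the CPU that is waking up
	 * actually belongs to that core; unrelated wakeups never touch
	 * (and never dirty) the shared cache line.
	 */
	if (idle_core != -1 && cpumask_test_cpu(cpu, cpu_smt_mask(idle_core)))
		WRITE_ONCE(sds->idle_core, -1);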

Also, if the system is mostly lightly loaded, we check for
available_idle_cpu, so we may not look for an idle-core. If the system is
running a CPU-intensive task, then the idle-core will most likely be -1.
It's only in the cases where the system utilization keeps swinging between
lightly loaded and heavily loaded that we would end up checking and setting
the idle-core.

Do let me know your thoughts.

--
Thanks and Regards
Srikar Dronamraju

2021-04-26 11:36:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements

On Mon, Apr 26, 2021 at 04:00:32PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <[email protected]> [2021-04-23 13:38:55]:
>
> Hi Mel,
>
> > On Fri, Apr 23, 2021 at 04:01:29PM +0530, Srikar Dronamraju wrote:
> > > > The series also oopses a *lot* and didn't get through a run of basic
> > > > workloads on x86 on any of three machines. An example oops is
> > > >
> > >
> > > Can you pass me your failing config? I have somehow not been seeing this
> > > either on x86 or on Powerpc on multiple systems.
> >
> > The machines have since moved on to testing something else (Rik's patch
> > for newidle) but the attached config should be close enough.
> >
> > > Also if possible cat /proc/schedstat and cat
> > > /proc/sys/kernel/sched_domain/cpu0/domain*/name
> > >
> >
> > For the vanilla kernel
> >
> > SMT
> > MC
> > NUMA
>
> I was able to reproduce the problem and analyze why it would panic in
> cpus_share_cache.
>
> In my patch(es), we have code snippets like this.
>
> 	if (tsds->idle_core != -1) {
> 		if (cpumask_test_cpu(tsds->idle_core, p->cpus_ptr))
> 			return tsds->idle_core;
> 		return this_cpu;
> 	}
>
> Here, when we tested idle_core with cpumask_test_cpu(), tsds->idle_core
> may not have been -1; however, by the time we return it, tsds->idle_core
> could have become -1.
>
> cpus_share_cache() then tries to look up sd_llc_id for CPU -1 and crashes.
>
> It's easier to reproduce this on a machine with more cores in an LLC
> than, say, a Power10/Power9. Hence we are hitting this more often on x86.
>
> One way could be to save the idle_core to a local variable, but that
> negates the whole purpose since we may end up choosing a busy CPU. I
> will find a way to fix this problem.
>

As there is no locking that protects the variable, it's inherently
race-prone. A READ_ONCE to a local variable may be your only choice.
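
Roughly (an untested sketch, using the names from your snippet above):

	int idle_core = READ_ONCE(tsds->idle_core);

	/*
	 * Snapshot the shared field once; another CPU can set it to -1 at
	 * any time, so every subsequent check must use the local copy.
	 */
	if (idle_core != -1) {
		if (cpumask_test_cpu(idle_core, p->cpus_ptr))
			return idle_core;
		return this_cpu;
	}

The snapshot can still go stale and point at a core that is no longer
idle, but at least it can never be -1 by the time it is used.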

--
Mel Gorman
SUSE Labs

2021-04-26 11:42:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements

On Mon, Apr 26, 2021 at 04:09:40PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <[email protected]> [2021-04-23 09:25:32]:
>
> > On Thu, Apr 22, 2021 at 03:53:16PM +0530, Srikar Dronamraju wrote:
> > > Recently we found that some of the benchmark numbers on Power10 were lesser
> > > than expected. Some analysis showed that the problem lies in the fact that
> > > L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> > >
> >
> > I didn't get the chance to review this properly although I am suspicious
> > of tracking idle_core and updating that more frequently. It becomes a very
> > hot cache line that bounces. I did experiment with tracking an idle core
> > but the data either went stale too quickly or the updates incurred more
> > overhead than a reduced search saved.
> >
>
> This change does increase the number of times we read the idle-core. There
> are also more places where we try to update the idle-core. However, I feel
> the number of times we actually update the idle-core will now be much
> lower than before, because we are mostly doing a conditional update, i.e.
> we are updating the idle-core only if the waking-up CPU happens to be part
> of our core.
>

Increased cache misses may be detectable from perf.

> Also, if the system is mostly lightly loaded, we check for
> available_idle_cpu, so we may not look for an idle-core. If the system is
> running a CPU-intensive task, then the idle-core will most likely be -1.
> It's only in the cases where the system utilization keeps swinging between
> lightly loaded and heavily loaded that we would end up checking and setting
> the idle-core.
>

But this is a "how long is a piece of string" question because the benefit
of tracking an idle core depends on the interarrival time of wakeups,
the domain utilisation and the length of time tasks are running. When
I was looking at the area, I tracked the SIS efficiency to see how much
each change was helping. The patch no longer applies but the stats are
understood by mmtests if you wanted to forward port it. It's possible
you would do something similar but specific to idle_core -- e.g. track
how often it's updated, how often it's read, how often a CPU is returned
and how often it's still an idle core and use those stats to calculate
hit/miss ratios.
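
Something along these lines would do (the idlecore_* rq fields are
invented here purely for illustration, following the same schedstat_inc()
pattern as the old patch below):

	/* read side: select_idle_sibling() consulting the cached value */
	schedstat_inc(this_rq()->idlecore_read);
	idle_core = READ_ONCE(sds->idle_core);
	if (idle_core != -1)
		schedstat_inc(this_rq()->idlecore_returned);

	/* write side: wherever the cached value is set or cleared */
	schedstat_inc(this_rq()->idlecore_update);
	WRITE_ONCE(sds->idle_core, core);

Comparing idlecore_returned against how often the returned core turned
out to still be idle would give you the hit/miss ratio.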

However, I would caution against conflating the "fallback search domain"
with the patches tracking idle core because they should be independent
of each other.

Old patch that no longer applies that was the basis for some SIS work
over a year ago is below

---8<---
From c791354b92a5723b0da14d050f942f61f0c12857 Mon Sep 17 00:00:00 2001
From: Mel Gorman <[email protected]>
Date: Fri, 14 Feb 2020 19:11:16 +0000
Subject: [PATCH] sched/fair: Track efficiency of select_idle_sibling

select_idle_sibling is an important path that finds a nearby idle CPU on
wakeup. As it is examining other CPUs state, it can be expensive in terms
of cache usage. This patch tracks the search efficiency if schedstats
are enabled. In general, this is only useful for kernel developers but
schedstats are typically disabled by default so it is convenient for
development and mostly free otherwise.

The series can be done without this patch but the stats were used to
generate a number of useful metrics in mmtest to analyse what was
going on.

SIS Search: Number of calls to select_idle_sibling

SIS Domain Search: Number of times the domain was searched because the
fast path failed.

SIS Scanned: Generally the number of runqueues scanned but the fast
path counts as 1 regardless of the values for target, prev
and recent.

SIS Domain Scanned: Number of runqueues scanned during a search of the
LLC domain.

SIS Failures: Number of SIS calls that failed to find an idle CPU

SIS Search Efficiency: A ratio expressed as a percentage of runqueues
scanned versus idle CPUs found. A 100% efficiency indicates that
the target, prev or recent CPU of a task was idle at wakeup. The
lower the efficiency, the more runqueues were scanned before an
idle CPU was found.

SIS Domain Search Efficiency: Similar, except only for the slower SIS
path.

SIS Fast Success Rate: Percentage of SIS that used target, prev or
recent CPUs.

SIS Success rate: Percentage of scans that found an idle CPU.

Signed-off-by: Mel Gorman <[email protected]>
---
 kernel/sched/debug.c |  4 ++++
 kernel/sched/fair.c  | 14 ++++++++++++++
 kernel/sched/sched.h |  6 ++++++
 kernel/sched/stats.c |  8 +++++---
 4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8331bc04aea2..7af6e8a12f40 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -661,6 +661,10 @@ do { \
 		P(sched_goidle);
 		P(ttwu_count);
 		P(ttwu_local);
+		P(sis_search);
+		P(sis_domain_search);
+		P(sis_scanned);
+		P(sis_failed);
 	}
 #undef P

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23a6fe298720..10cdd28e910e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5975,6 +5975,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 		bool idle = true;

 		for_each_cpu(cpu, cpu_smt_mask(core)) {
+			schedstat_inc(this_rq()->sis_scanned);
 			if (!available_idle_cpu(cpu)) {
 				idle = false;
 				break;
@@ -6005,6 +6006,7 @@ static int select_idle_smt(struct task_struct *p, int target)
 		return -1;

 	for_each_cpu(cpu, cpu_smt_mask(target)) {
+		schedstat_inc(this_rq()->sis_scanned);
 		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 			continue;
 		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
@@ -6070,6 +6072,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

 	for_each_cpu_wrap(cpu, cpus, target) {
+		schedstat_inc(this_rq()->sis_scanned);
 		if (!--nr)
 			return -1;
 		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
@@ -6126,6 +6129,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;

+	schedstat_inc(this_rq()->sis_search);
+
+	/*
+	 * Checking if prev, target and recent is treated as one scan. A
+	 * perfect hit on one of those is considered 100% efficiency.
+	 * Further scanning impairs efficiency.
+	 */
+	schedstat_inc(this_rq()->sis_scanned);
+
 	/*
 	 * For asymmetric CPU capacity systems, our domain of interest is
 	 * sd_asym_cpucapacity rather than sd_llc.
@@ -6191,6 +6203,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;

+	schedstat_inc(this_rq()->sis_domain_search);
 	i = select_idle_core(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
@@ -6203,6 +6216,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;

+	schedstat_inc(this_rq()->sis_failed);
 	return target;
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a0caf394dd4..c1a5c27cfd72 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1006,6 +1006,12 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+
+	/* select_idle_sibling stats */
+	unsigned int		sis_search;
+	unsigned int		sis_domain_search;
+	unsigned int		sis_scanned;
+	unsigned int		sis_failed;
 #endif

 #ifdef CONFIG_SMP
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 750fb3c67eed..390bfcc3842c 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -10,7 +10,7 @@
  * Bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 15
+#define SCHEDSTAT_VERSION 16

 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -30,12 +30,14 @@ static int show_schedstat(struct seq_file *seq, void *v)

 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->sis_search, rq->sis_domain_search,
+		    rq->sis_scanned, rq->sis_failed);

 		seq_printf(seq, "\n");


--
Mel Gorman
SUSE Labs

2021-04-27 14:56:28

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements

Hi Srikar,

On Thu, 22 Apr 2021 at 12:23, Srikar Dronamraju
<[email protected]> wrote:
>
> Recently we found that some of the benchmark numbers on Power10 were lesser
> than expected. Some analysis showed that the problem lies in the fact that
> L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
>
> One probable solution to the problem was worked by Gautham where he posted
> http://lore.kernel.org/lkml/[email protected]/t/#u
> a patch that marks MC domain as LLC.
>
> Here the focus is on seeing if we can improve the current core scheduler's
> wakeup mechanism by looking at idle-cores and nr_busy_cpus that is already
> maintained per Last level cache(aka LLC) (first 8 patches) + explore the
> possibility to provide a fallback LLC domain, that can be preferred if the
> current LLC is busy (last 2 patches).
>
> Except the last 2 patches, the rest patches should work independently of the
> other proposed solution. i.e if the mc-llc patch is accepted, then the last
> two patches may not be needed for Power10. However this may be helpful for
> other archs/platforms.
>
> In the fallback approach, we look for a one-to-one mapping for each LLC.
> However this can be easily modified to look for all LLC's in the current
> LLC's parent. Also fallback is only used for sync wakeups. This is because
> that is where we expect the maximum benefit of moving the task closer to the
> task. For non-sync wakeups, its expected that CPU from previous LLC may be
> better off.
>
> Request you to please review and provide your feedback.
>
> Benchmarking numbers are from Power 10 but I have verified that we don't
> regress on Power 9 setup.
>
> # lscpu
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 80
> On-line CPU(s) list: 0-79
> Thread(s) per core: 8
> Core(s) per socket: 10
> Socket(s): 1
> NUMA node(s): 1
> Model: 1.0 (pvr 0080 0100)
> Model name: POWER10 (architected), altivec supported
> Hypervisor vendor: pHyp
> Virtualization type: para
> L1d cache: 64K
> L1i cache: 32K
> L2 cache: 256K
> L3 cache: 8K
> NUMA node2 CPU(s): 0-79
>
> Hackbench: (latency, lower is better)
>
> v5.12-rc5
> instances = 1, min = 24.102529 usecs/op, median = usecs/op, max = 24.102529 usecs/op
> instances = 2, min = 24.096112 usecs/op, median = 24.096112 usecs/op, max = 24.178903 usecs/op
> instances = 4, min = 24.080541 usecs/op, median = 24.082990 usecs/op, max = 24.166873 usecs/op
> instances = 8, min = 24.088969 usecs/op, median = 24.116081 usecs/op, max = 24.199853 usecs/op
> instances = 16, min = 24.267228 usecs/op, median = 26.204510 usecs/op, max = 29.218360 usecs/op
> instances = 32, min = 30.680071 usecs/op, median = 32.911664 usecs/op, max = 37.380470 usecs/op
> instances = 64, min = 43.908331 usecs/op, median = 44.454343 usecs/op, max = 46.210298 usecs/op
> instances = 80, min = 44.585754 usecs/op, median = 56.738546 usecs/op, max = 60.625826 usecs/op
>
> v5.12-rc5 + mc-llc
> instances = 1, min = 18.676505 usecs/op, median = usecs/op, max = 18.676505 usecs/op
> instances = 2, min = 18.488627 usecs/op, median = 18.488627 usecs/op, max = 18.574946 usecs/op
> instances = 4, min = 18.428399 usecs/op, median = 18.589051 usecs/op, max = 18.872548 usecs/op
> instances = 8, min = 18.597389 usecs/op, median = 18.783815 usecs/op, max = 19.265532 usecs/op
> instances = 16, min = 21.922350 usecs/op, median = 22.737792 usecs/op, max = 24.832429 usecs/op
> instances = 32, min = 29.770446 usecs/op, median = 31.996687 usecs/op, max = 34.053042 usecs/op
> instances = 64, min = 53.067842 usecs/op, median = 53.295139 usecs/op, max = 53.473059 usecs/op
> instances = 80, min = 44.423288 usecs/op, median = 44.713767 usecs/op, max = 45.159761 usecs/op
>
> v5.12-rc5 + this patchset
> instances = 1, min = 19.368805 usecs/op, median = usecs/op, max = 19.368805 usecs/op
> instances = 2, min = 19.423674 usecs/op, median = 19.423674 usecs/op, max = 19.506203 usecs/op
> instances = 4, min = 19.454523 usecs/op, median = 19.596947 usecs/op, max = 19.863620 usecs/op
> instances = 8, min = 20.005272 usecs/op, median = 20.239924 usecs/op, max = 20.878947 usecs/op
> instances = 16, min = 21.856779 usecs/op, median = 24.102147 usecs/op, max = 25.496110 usecs/op
> instances = 32, min = 31.460159 usecs/op, median = 32.809621 usecs/op, max = 33.939650 usecs/op
> instances = 64, min = 39.506553 usecs/op, median = 43.835221 usecs/op, max = 45.645505 usecs/op
> instances = 80, min = 43.805716 usecs/op, median = 44.314757 usecs/op, max = 48.910236 usecs/op
>
> Summary:
> mc-llc and this patchset seem to be performing much better than vanilla v5.12-rc5
>
> DayTrader (throughput, higher is better)
> v5.12-rc5 v5.12-rc5 v5.12-rc5
> + mc-llc + patchset
> 64CPUs/1JVM/ 60Users 6373.7 7520.5 7232.3
> 64CPUs/1JVM/ 80Users 6742.1 7940.9 7732.8
> 64CPUs/1JVM/100Users 6482.2 7730.3 7540
> 64CPUs/2JVM/ 60Users 6335 8081.6 7914.3
> 64CPUs/2JVM/ 80Users 6360.8 8259.6 8138.6
> 64CPUs/2JVM/100Users 6215.6 8046.5 8039.2
> 64CPUs/4JVM/ 60Users 5385.4 7685.3 7706.1
> 64CPUs/4JVM/ 80Users 5380.8 7753.3 7721.5
> 64CPUs/4JVM/100Users 5275.2 7549.2 7608.3
>
> Summary: Across all profiles, this patchset or mc-llc outperform
> vanilla v5.12-rc5
> Note: Only 64 cores were online during this test.
>
> schbench (latency: lesser is better)
> ======== Running schbench -m 3 -r 30 =================
> Latency percentiles (usec) runtime 10 (s) (2545 total samples)
> v5.12-rc5 | v5.12-rc5 + mc-llc | v5.12-rc5 + patchset
>
> 50.0th: 56 (1301 samples) | 50.0th: 49 (1309 samples) | 50.0th: 50 (1310 samples)
> 75.0th: 76 (623 samples) | 75.0th: 66 (628 samples) | 75.0th: 68 (632 samples)
> 90.0th: 93 (371 samples) | 90.0th: 78 (371 samples) | 90.0th: 80 (354 samples)
> 95.0th: 107 (123 samples) | 95.0th: 87 (117 samples) | 95.0th: 86 (126 samples)
> *99.0th: 12560 (102 samples) *99.0th: 100 (97 samples) | *99.0th: 103 (97 samples)
> 99.5th: 15312 (14 samples) | 99.5th: 104 (12 samples) | 99.5th: 1202 (13 samples)
> 99.9th: 19936 (9 samples) | 99.9th: 106 (8 samples) | 99.9th: 14992 (10 samples)
> min=13, max=20684 | min=15, max=113 | min=15, max=18721
>
> Latency percentiles (usec) runtime 20 (s) (7649 total samples)
>
> 50.0th: 51 (3884 samples) | 50.0th: 50 (3935 samples) | 50.0th: 49 (3841 samples)
> 75.0th: 69 (1859 samples) | 75.0th: 66 (1817 samples) | 75.0th: 67 (1965 samples)
> 90.0th: 87 (1173 samples) | 90.0th: 80 (1204 samples) | 90.0th: 78 (1134 samples)
> 95.0th: 97 (368 samples) | 95.0th: 87 (342 samples) | 95.0th: 83 (359 samples)
> *99.0th: 8624 (290 samples)| *99.0th: 98 (294 samples) | *99.0th: 93 (296 samples)
> 99.5th: 11344 (37 samples) | 99.5th: 102 (37 samples) | 99.5th: 98 (34 samples)
> 99.9th: 18592 (31 samples) | 99.9th: 106 (30 samples) | 99.9th: 7544 (28 samples)
> min=13, max=20684 | min=12, max=113 | min=13, max=18721
>
> Latency percentiles (usec) runtime 30 (s) (12785 total samples)
>
> 50.0th: 50 (6614 samples) | 50.0th: 49 (6544 samples) | 50.0th: 48 (6527 samples)
> 75.0th: 67 (3059 samples) | 75.0th: 65 (3100 samples) | 75.0th: 64 (3143 samples)
> 90.0th: 84 (1894 samples) | 90.0th: 79 (1912 samples) | 90.0th: 76 (1985 samples)
> 95.0th: 94 (586 samples) | 95.0th: 87 (646 samples) | 95.0th: 81 (585 samples)
> *99.0th: 8304 (507 samples)| *99.0th: 101 (496 samples) | *99.0th: 90 (453 samples)
> 99.5th: 11696 (62 samples) | 99.5th: 104 (45 samples) | 99.5th: 94 (66 samples)
> 99.9th: 18592 (51 samples) | 99.9th: 110 (51 samples) | 99.9th: 1202 (49 samples)
> min=12, max=21421 | min=1, max=126 | min=3, max=18721
>
> Summary:
> mc-llc is the best option, but this patchset also helps compared to vanilla v5.12-rc5
>
>
> mongodb (threads=6) (throughput, higher is better)
> Throughput read clean update
> latency latency latency
> v5.12-rc5 JVM=YCSB_CLIENTS=14 68116.05 ops/sec 1109.82 us 944.19 us 1342.29 us
> v5.12-rc5 JVM=YCSB_CLIENTS=21 64802.69 ops/sec 1772.64 us 944.69 us 2099.57 us
> v5.12-rc5 JVM=YCSB_CLIENTS=28 61792.78 ops/sec 2490.48 us 930.09 us 2928.03 us
> v5.12-rc5 JVM=YCSB_CLIENTS=35 59604.44 ops/sec 3236.86 us 870.28 us 3787.48 us
>
> v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=14 70948.51 ops/sec 1060.21 us 842.02 us 1289.44 us
> v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=21 68732.48 ops/sec 1669.91 us 871.57 us 1975.19 us
> v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=28 66674.81 ops/sec 2313.79 us 889.59 us 2702.36 us
> v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=35 64397.51 ops/sec 3010.66 us 966.28 us 3484.19 us
>
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=14 67403.29 ops/sec 1121.80 us 797.81 us 1357.28 us
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=21 63952.79 ops/sec 1792.86 us 779.59 us 2130.54 us
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=28 62198.83 ops/sec 2469.60 us 780.00 us 2914.48 us
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=35 60333.81 ops/sec 3192.41 us 822.09 us 3748.24 us
>
> Summary:
> mc-llc outperforms, this patchset and upstream almost give similar performance.

So the mc-llc patch seems to be the best approach IMHO. Although the
hemispheres don't share cache, they share enough resources that
cache-snooping is as efficient as sharing cache.

>
>
> Cc: LKML <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Gautham R Shenoy <[email protected]>
> Cc: Parth Shah <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Rik van Riel <[email protected]>
>
> Srikar Dronamraju (10):
> sched/fair: Update affine statistics when needed
> sched/fair: Maintain the identity of idle-core
> sched/fair: Update idle-core more often
> sched/fair: Prefer idle CPU to cache affinity
> sched/fair: Call wake_affine only if necessary
> sched/idle: Move busy_cpu accounting to idle callback
> sched/fair: Remove ifdefs in waker_affine_idler_llc
> sched/fair: Dont iterate if no idle CPUs
> sched/topology: Introduce fallback LLC
> powerpc/smp: Add fallback flag to powerpc MC domain
>
> arch/powerpc/kernel/smp.c | 7 +-
> include/linux/sched/sd_flags.h | 7 +
> include/linux/sched/topology.h | 3 +-
> kernel/sched/fair.c | 229 +++++++++++++++++++++++++++------
> kernel/sched/features.h | 1 +
> kernel/sched/idle.c | 33 ++++-
> kernel/sched/sched.h | 6 +
> kernel/sched/topology.c | 54 +++++++-
> 8 files changed, 296 insertions(+), 44 deletions(-)
>
> --
> 2.18.2
>

2021-04-28 14:51:06

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements

* Mel Gorman <[email protected]> [2021-04-26 12:41:12]:

> On Mon, Apr 26, 2021 at 04:09:40PM +0530, Srikar Dronamraju wrote:
> > * Mel Gorman <[email protected]> [2021-04-23 09:25:32]:

<snip>

> >
> > This change does increase the number of times we read the idle-core. There
> > are also more places where we try to update the idle-core. However, I feel
> > the number of times we actually update the idle-core will now be much
> > lower than before, because we are mostly doing a conditional update, i.e.
> > we are updating the idle-core only if the waking-up CPU happens to be part
> > of our core.
> >
>
> Increased cache misses may be detectable from perf.

I will get some cache-miss numbers pre and post the patchset.
Since we may be altering the selection of CPUs, especially on systems that
have small LLCs, I expect the cache misses could differ between pre and
post patchset.

>
> > Also, if the system is mostly lightly loaded, we check for
> > available_idle_cpu, so we may not look for an idle-core. If the system is
> > running a CPU-intensive task, then the idle-core will most likely be -1.
> > It's only in the cases where the system utilization keeps swinging between
> > lightly loaded and heavily loaded that we would end up checking and setting
> > the idle-core.
> >
>
> But this is a "how long is a piece of string" question because the benefit
> of tracking an idle core depends on the interarrival time of wakeups,
> the domain utilisation and the length of time tasks are running. When
> I was looking at the area, I tracked the SIS efficiency to see how much
> each change was helping. The patch no longer applies but the stats are
> understood by mmtests if you wanted to forward port it. It's possible
> you would do something similar but specific to idle_core -- e.g. track
> how often it's updated, how often it's read, how often a CPU is returned
> and how often it's still an idle core and use those stats to calculate
> hit/miss ratios.
>
> However, I would caution against conflating the "fallback search domain"
> with the patches tracking idle core because they should be independent
> of each other.
>
> Old patch that no longer applies that was the basis for some SIS work
> over a year ago is below
>

Thanks, Mel, for sharing this. I will build a prototype patch similar to
this and see what data it comes up with.

> ---8<---
> From c791354b92a5723b0da14d050f942f61f0c12857 Mon Sep 17 00:00:00 2001
> From: Mel Gorman <[email protected]>
> Date: Fri, 14 Feb 2020 19:11:16 +0000
> Subject: [PATCH] sched/fair: Track efficiency of select_idle_sibling
>
<snip>

--
Thanks and Regards
Srikar Dronamraju

2021-04-28 14:51:13

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements

* Vincent Guittot <[email protected]> [2021-04-27 16:52:30]:

> Hi Srikar,

Hi Vincent,

> On Thu, 22 Apr 2021 at 12:23, Srikar Dronamraju
> <[email protected]> wrote:
> >
> > Recently we found that some of the benchmark numbers on Power10 were lesser
> > than expected. Some analysis showed that the problem lies in the fact that
> > L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> >
> >
> > Summary:
> > mc-llc outperforms, this patchset and upstream almost give similar performance.
>
> So the mc-llc patch seems to be the best approach IMHO. Although the
> hemispheres don't share cache, they share enough resources that
> cache-snooping is as efficient as sharing cache.
>

Yes, mc-llc only helps specific systems like Power10, but it shows better
numbers than my posted patchset.

However, in this patchset we are looking at areas in the wakeup path (i.e.
idler LLCs) that we could optimize, which can help other archs too. Plus,
the fallback mechanism is generic enough that we could use it for other
systems too.

I know that there are valid concerns raised by Mel, and I am working to
resolve them. Some of them are:
- How hot is idle-core
- Crashes when running tbench (I was able to reproduce with kernbench on x86)

Also, I am adding some more changes with which we are getting performance
similar to mc-llc.

> > Cc: LKML <[email protected]>
> > Cc: Michael Ellerman <[email protected]>
> > Cc: Gautham R Shenoy <[email protected]>
> > Cc: Parth Shah <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> >
> > Srikar Dronamraju (10):
> > sched/fair: Update affine statistics when needed
> > sched/fair: Maintain the identity of idle-core
> > sched/fair: Update idle-core more often
> > sched/fair: Prefer idle CPU to cache affinity
> > sched/fair: Call wake_affine only if necessary
> > sched/idle: Move busy_cpu accounting to idle callback
> > sched/fair: Remove ifdefs in waker_affine_idler_llc
> > sched/fair: Dont iterate if no idle CPUs
> > sched/topology: Introduce fallback LLC
> > powerpc/smp: Add fallback flag to powerpc MC domain
> >
> > arch/powerpc/kernel/smp.c | 7 +-
> > include/linux/sched/sd_flags.h | 7 +
> > include/linux/sched/topology.h | 3 +-
> > kernel/sched/fair.c | 229 +++++++++++++++++++++++++++------
> > kernel/sched/features.h | 1 +
> > kernel/sched/idle.c | 33 ++++-
> > kernel/sched/sched.h | 6 +
> > kernel/sched/topology.c | 54 +++++++-
> > 8 files changed, 296 insertions(+), 44 deletions(-)
> >
> > --
> > 2.18.2
> >

--
Thanks and Regards
Srikar Dronamraju