2021-05-06 16:47:53

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v2 0/8] sched/fair: wake_affine improvements

Changelog v1->v2:
v1 Link: http://lore.kernel.org/lkml/[email protected]/t/#u
- Fallback LLC domain has been split out as a subsequent patchset.
(suggested by Mel)
- Fix a panic due to two wakeups racing for the same idle-core
(Reported by Mel)
- Differentiate between an LLC that surely has no idle-cores (-2) and an
LLC that may or may not have idle-cores (-1).
- Rebased to v5.12

Recently we found that some of the benchmark numbers on Power10 were lower
than expected. Some analysis showed that the problem lies in the fact that
the L2-cache on Power10 is at the core level, i.e. only 4 threads share the
L2-cache.

One probable solution to the problem was worked on by Gautham, who posted
a patch that marks the MC domain as the LLC:
http://lore.kernel.org/lkml/[email protected]/t/#u

Here the focus is on improving the current core scheduler's wakeup
mechanism by looking at the idle-cores and nr_busy_cpus that are already
maintained per last-level cache (aka LLC).

Hence this approach can work well with the mc-llc change too, and it can
help other architectures as well.

Request you to please review and provide your feedback.

Benchmark numbers are from Power10, but I have verified that we don't
regress on a Power9 setup.

# lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 8
Core(s) per socket: 10
Socket(s): 1
NUMA node(s): 1
Model: 1.0 (pvr 0080 0100)
Model name: POWER10 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8K
NUMA node2 CPU(s): 0-79

Hackbench: (latency, lower is better)

v5.12-rc5
instances = 1, min = 24.102529 usecs/op, median = usecs/op, max = 24.102529 usecs/op
instances = 2, min = 24.096112 usecs/op, median = 24.096112 usecs/op, max = 24.178903 usecs/op
instances = 4, min = 24.080541 usecs/op, median = 24.082990 usecs/op, max = 24.166873 usecs/op
instances = 8, min = 24.088969 usecs/op, median = 24.116081 usecs/op, max = 24.199853 usecs/op
instances = 16, min = 24.267228 usecs/op, median = 26.204510 usecs/op, max = 29.218360 usecs/op
instances = 32, min = 30.680071 usecs/op, median = 32.911664 usecs/op, max = 37.380470 usecs/op
instances = 64, min = 43.908331 usecs/op, median = 44.454343 usecs/op, max = 46.210298 usecs/op
instances = 80, min = 44.585754 usecs/op, median = 56.738546 usecs/op, max = 60.625826 usecs/op

v5.12-rc5 + mc-llc+
instances = 1, min = 18.676505 usecs/op, median = usecs/op, max = 18.676505 usecs/op
instances = 2, min = 18.488627 usecs/op, median = 18.488627 usecs/op, max = 18.574946 usecs/op
instances = 4, min = 18.428399 usecs/op, median = 18.589051 usecs/op, max = 18.872548 usecs/op
instances = 8, min = 18.597389 usecs/op, median = 18.783815 usecs/op, max = 19.265532 usecs/op
instances = 16, min = 21.922350 usecs/op, median = 22.737792 usecs/op, max = 24.832429 usecs/op
instances = 32, min = 29.770446 usecs/op, median = 31.996687 usecs/op, max = 34.053042 usecs/op
instances = 64, min = 53.067842 usecs/op, median = 53.295139 usecs/op, max = 53.473059 usecs/op
instances = 80, min = 44.423288 usecs/op, median = 44.713767 usecs/op, max = 45.159761 usecs/op

v5.12-rc5 + this patchset
instances = 1, min = 19.240824 usecs/op, median = usecs/op, max = 19.240824 usecs/op
instances = 2, min = 19.143470 usecs/op, median = 19.143470 usecs/op, max = 19.249875 usecs/op
instances = 4, min = 19.399812 usecs/op, median = 19.487433 usecs/op, max = 19.501298 usecs/op
instances = 8, min = 19.024297 usecs/op, median = 19.908682 usecs/op, max = 20.741605 usecs/op
instances = 16, min = 22.209444 usecs/op, median = 23.971275 usecs/op, max = 25.145198 usecs/op
instances = 32, min = 31.220392 usecs/op, median = 32.689189 usecs/op, max = 34.081588 usecs/op
instances = 64, min = 39.012110 usecs/op, median = 44.062042 usecs/op, max = 45.370525 usecs/op
instances = 80, min = 43.884358 usecs/op, median = 44.326417 usecs/op, max = 48.031303 usecs/op

Summary:
mc-llc and this patchset seem to be performing much better than vanilla v5.12-rc5

DayTrader (throughput, higher is better)
                      v5.12-rc5   v5.12-rc5   v5.12-rc5
                                   + mc-llc  + patchset
64CPUs/1JVM/ 60Users     6373.7      7520.5      7375.6
64CPUs/1JVM/ 80Users     6742.1      7940.9      7832.9
64CPUs/1JVM/100Users     6482.2      7730.3      7538.4
64CPUs/2JVM/ 60Users     6335        8081.6      8000.2
64CPUs/2JVM/ 80Users     6360.8      8259.6      8315.4
64CPUs/2JVM/100Users     6215.6      8046.5      8049.4
64CPUs/4JVM/ 60Users     5385.4      7685.3      8013.5
64CPUs/4JVM/ 80Users     5380.8      7753.3      7868
64CPUs/4JVM/100Users     5275.2      7549.2      7620

Summary: Across all profiles, this patchset or mc-llc outperform
vanilla v5.12-rc5.
Note: Only 64 cores were online during this test.

schbench (latency, lower is better)
======== Running schbench -m 3 -r 30 =================
Latency percentiles (usec) runtime 10 (s) (2545 total samples)
v5.12-rc5 | v5.12-rc5 + mc-llc | v5.12-rc5 + patchset

50.0th: 56 (1301 samples) | 50.0th: 49 (1309 samples) | 50.0th: 53 (1285 samples)
75.0th: 76 (623 samples) | 75.0th: 66 (628 samples) | 75.0th: 72 (635 samples)
90.0th: 93 (371 samples) | 90.0th: 78 (371 samples) | 90.0th: 88 (388 samples)
95.0th: 107 (123 samples) | 95.0th: 87 (117 samples) | 95.0th: 94 (118 samples)
*99.0th: 12560 (102 samples)| *99.0th: 100 (97 samples) | *99.0th: 108 (108 samples)
99.5th: 15312 (14 samples) | 99.5th: 104 (12 samples) | 99.5th: 108 (0 samples)
99.9th: 19936 (9 samples) | 99.9th: 106 (8 samples) | 99.9th: 110 (8 samples)
min=13, max=20684 | min=15, max=113 | min=15, max=1433

Latency percentiles (usec) runtime 20 (s) (7649 total samples)

50.0th: 51 (3884 samples) | 50.0th: 50 (3935 samples) | 50.0th: 51 (3843 samples)
75.0th: 69 (1859 samples) | 75.0th: 66 (1817 samples) | 75.0th: 69 (1962 samples)
90.0th: 87 (1173 samples) | 90.0th: 80 (1204 samples) | 90.0th: 84 (1103 samples)
95.0th: 97 (368 samples) | 95.0th: 87 (342 samples) | 95.0th: 93 (386 samples)
*99.0th: 8624 (290 samples)| *99.0th: 98 (294 samples) | *99.0th: 107 (297 samples)
99.5th: 11344 (37 samples) | 99.5th: 102 (37 samples) | 99.5th: 110 (39 samples)
99.9th: 18592 (31 samples) | 99.9th: 106 (30 samples) | 99.9th: 1714 (27 samples)
min=13, max=20684 | min=12, max=113 | min=15, max=4456

Latency percentiles (usec) runtime 30 (s) (12785 total samples)

50.0th: 50 (6614 samples) | 50.0th: 49 (6544 samples) | 50.0th: 50 (6443 samples)
75.0th: 67 (3059 samples) | 75.0th: 65 (3100 samples) | 75.0th: 67 (3263 samples)
90.0th: 84 (1894 samples) | 90.0th: 79 (1912 samples) | 90.0th: 82 (1890 samples)
95.0th: 94 (586 samples) | 95.0th: 87 (646 samples) | 95.0th: 92 (652 samples)
*99.0th: 8304 (507 samples)| *99.0th: 101 (496 samples) | *99.0th: 107 (464 samples)
99.5th: 11696 (62 samples) | 99.5th: 104 (45 samples) | 99.5th: 110 (61 samples)
99.9th: 18592 (51 samples) | 99.9th: 110 (51 samples) | 99.9th: 1434 (47 samples)
min=12, max=21421 | min=1, max=126 | min=15, max=4456

Summary:
mc-llc is the best option, but this patchset also helps compared to vanilla v5.12-rc5

mongodb (threads=6) (throughput, higher is better)
                                      Throughput      read latency  clean latency  update latency
v5.12-rc5 JVM=YCSB_CLIENTS=14 68116.05 ops/sec 1109.82 us 944.19 us 1342.29 us
v5.12-rc5 JVM=YCSB_CLIENTS=21 64802.69 ops/sec 1772.64 us 944.69 us 2099.57 us
v5.12-rc5 JVM=YCSB_CLIENTS=28 61792.78 ops/sec 2490.48 us 930.09 us 2928.03 us
v5.12-rc5 JVM=YCSB_CLIENTS=35 59604.44 ops/sec 3236.86 us 870.28 us 3787.48 us

v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=14 70948.51 ops/sec 1060.21 us 842.02 us 1289.44 us
v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=21 68732.48 ops/sec 1669.91 us 871.57 us 1975.19 us
v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=28 66674.81 ops/sec 2313.79 us 889.59 us 2702.36 us
v5.12-rc5 + mc-llc JVM=YCSB_CLIENTS=35 64397.51 ops/sec 3010.66 us 966.28 us 3484.19 us

v5.12-rc5 + patchset JVM=YCSB_CLIENTS=14 67604.51 ops/sec 1117.91 us 947.07 us 1353.41 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=21 63979.39 ops/sec 1793.63 us 869.72 us 2130.22 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=28 62032.34 ops/sec 2475.89 us 869.06 us 2922.01 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=35 60152.96 ops/sec 3203.84 us 972.00 us 3756.52 us

Summary:
mc-llc outperforms; this patchset and upstream give almost similar performance.

Cc: LKML <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>

Srikar Dronamraju (8):
sched/fair: Update affine statistics when needed
sched/fair: Maintain the identity of idle-core
sched/fair: Update idle-core more often
sched/fair: Prefer idle CPU to cache affinity
sched/fair: Use affine_idler_llc for wakeups across LLC
sched/idle: Move busy_cpu accounting to idle callback
sched/fair: Remove ifdefs in waker_affine_idler_llc
sched/fair: Dont iterate if no idle CPUs

include/linux/sched/topology.h | 2 +-
kernel/sched/fair.c | 204 ++++++++++++++++++++++++++-------
kernel/sched/features.h | 1 +
kernel/sched/idle.c | 33 +++++-
kernel/sched/sched.h | 6 +
kernel/sched/topology.c | 9 ++
6 files changed, 214 insertions(+), 41 deletions(-)

--
2.18.2


2021-05-06 16:47:54

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v2 4/8] sched/fair: Prefer idle CPU to cache affinity

Current order of preference to pick a LLC while waking a wake-affine
task:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
that is idle.

2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
that is less heavily loaded.

In the case where both the waker CPU and the previous CPU are busy, but
only one of their LLCs has an idle CPU, the scheduler may end up picking
the LLC with no idle CPUs. To mitigate this, add a method where the
scheduler compares the idle CPUs in the waker and previous LLCs and
picks the appropriate one.

The new method looks at the idle-core to figure out an idle LLC. If
neither LLC has an idle core, it compares the ratio of busy CPUs to the
total number of CPUs in each LLC. This method is only useful for
comparing two different LLCs; if the previous CPU and the waking CPU are
in the same LLC, it does not apply. For now the new method is disabled
by default.

The sync flag decides which CPU/LLC to try first. If sync is set, choose
the current LLC, else choose the previous LLC.
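
To make the busy-CPU comparison concrete, here is a minimal user-space
sketch (not the kernel code; the function name and the example numbers
are made up for illustration). It mirrors the cross-multiplication used
below so that the ratios pnr_busy/pllc_size and tnr_busy/tllc_size can
be compared without divisions:

#include <stdio.h>

/* Returns 0 to pick the preferred LLC, 1 to pick the other ("try") LLC,
 * and -1 when there is no preference (equally loaded or both full). */
static int pick_idler_llc(int pnr_busy, int pllc_size,
			  int tnr_busy, int tllc_size)
{
	int diff;

	/* Both LLCs completely busy: no point preferring either. */
	if (pnr_busy == pllc_size && tnr_busy == tllc_size)
		return -1;

	/* diff > 0 means the "try" LLC is proportionally busier. */
	diff = tnr_busy * pllc_size - pnr_busy * tllc_size;
	if (diff > 0)
		return 0;
	if (diff < 0)
		return 1;
	return -1;
}

int main(void)
{
	/* preferred LLC: 3 of 4 CPUs busy; other LLC: 4 of 8 CPUs busy */
	printf("%d\n", pick_idler_llc(3, 4, 4, 8));	/* prints 1 */
	return 0;
}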

Cc: LKML <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Changelog: v1->v2:
- Swap the CPUs if the wakeup is not sync, so that a single ordering of
the code suffices for both sync and non-sync wakeups.

- Mel reported a crash. Apparently two threads can race to find an
idle-core. I now cache the idle-core. Also use compare-exchange, so
that no two waking tasks contend on the same CPU (a small sketch of
this pattern follows the changelog).

Also Based on similar posting:
http://lore.kernel.org/lkml/[email protected]/t/#u
- Make WA_WAKER default (Suggested by Rik) : done in next patch
- Make WA_WAKER check more conservative: (Suggested by Rik / Peter)
- Rename WA_WAKER to WA_IDLER_LLC (Suggested by Vincent)
- s/pllc_size/tllc_size while checking for the busy case (Pointed out by Dietmar)
- Add rcu_read_lock and check for validity of shared domains
- Add idle-core support
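
A user-space sketch of the compare-exchange pattern mentioned above
(C11 atomics; not the kernel code, which uses cmpxchg() on
sds->idle_core in test_reset_idle_core() below):

#include <stdatomic.h>
#include <stdbool.h>

/* Only one of two racing wakers succeeds in claiming the cached
 * idle-core; the loser sees false and falls back to other checks. */
static bool claim_idle_core(atomic_int *idle_core, int expected)
{
	return atomic_compare_exchange_strong(idle_core, &expected, -1);
}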

kernel/sched/fair.c | 66 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 67 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50da2363317d..72bf1996903d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,59 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}

+static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val);
+
+static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu, int sync)
+{
+#ifdef CONFIG_NO_HZ_COMMON
+ int tnr_busy, tllc_size, pnr_busy, pllc_size;
+#endif
+ struct sched_domain_shared *pref_sds, *try_sds;
+ int diff, idle_core;
+
+ if (!sync)
+ swap(pref_cpu, try_cpu);
+
+ pref_sds = rcu_dereference(per_cpu(sd_llc_shared, pref_cpu));
+ try_sds = rcu_dereference(per_cpu(sd_llc_shared, try_cpu));
+ if (!pref_sds || !try_sds)
+ return nr_cpumask_bits;
+
+ if (available_idle_cpu(pref_cpu) || sched_idle_cpu(pref_cpu))
+ return pref_cpu;
+
+ idle_core = READ_ONCE(pref_sds->idle_core);
+ if (idle_core > -1 && cpumask_test_cpu(idle_core, p->cpus_ptr) &&
+ test_reset_idle_core(pref_sds, idle_core))
+ return idle_core;
+
+ if (available_idle_cpu(try_cpu) || sched_idle_cpu(try_cpu))
+ return try_cpu;
+
+ idle_core = READ_ONCE(try_sds->idle_core);
+ if (idle_core > -1 && cpumask_test_cpu(idle_core, p->cpus_ptr) &&
+ test_reset_idle_core(try_sds, idle_core))
+ return idle_core;
+
+#ifdef CONFIG_NO_HZ_COMMON
+ pnr_busy = atomic_read(&pref_sds->nr_busy_cpus);
+ tnr_busy = atomic_read(&try_sds->nr_busy_cpus);
+ pllc_size = per_cpu(sd_llc_size, pref_cpu);
+ tllc_size = per_cpu(sd_llc_size, try_cpu);
+
+ if (tnr_busy == tllc_size && pnr_busy == pllc_size)
+ return nr_cpumask_bits;
+
+ diff = tnr_busy * pllc_size - pnr_busy * tllc_size;
+ if (diff > 0)
+ return pref_cpu;
+ if (diff < 0)
+ return try_cpu;
+#endif /* CONFIG_NO_HZ_COMMON */
+
+ return nr_cpumask_bits;
+}
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int prev_cpu, int sync)
{
@@ -5877,6 +5930,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (sched_feat(WA_IDLE))
target = wake_affine_idle(this_cpu, prev_cpu, sync);

+ if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits)
+ target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
+
if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

@@ -6044,6 +6100,11 @@ static inline int get_idle_core(int cpu, int def)
return def;
}

+static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val)
+{
+ return cmpxchg(&sds->idle_core, val, -1) == val;
+}
+
static void set_next_idle_core(struct sched_domain *sd, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
@@ -6161,6 +6222,11 @@ static inline bool get_idle_core(int cpu, int def)
return def;
}

+static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val)
+{
+ return false;
+}
+
static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
return __select_idle_cpu(core);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1bc2b158fc51..c77349a47e01 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,6 +83,7 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)

SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
+SCHED_FEAT(WA_IDLER_LLC, false)
SCHED_FEAT(WA_BIAS, true)

/*
--
2.18.2

2021-05-06 16:49:52

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v2 7/8] sched/fair: Remove ifdefs in waker_affine_idler_llc

Now that the idle callbacks update nr_busy_cpus irrespective of
CONFIG_NO_HZ_COMMON, remove the ifdefs in wake_affine_idler_llc().

Cc: LKML <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4d3b0928fe98..c70f0889258f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5872,9 +5872,7 @@ static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val

static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu, int sync)
{
-#ifdef CONFIG_NO_HZ_COMMON
int tnr_busy, tllc_size, pnr_busy, pllc_size;
-#endif
struct sched_domain_shared *pref_sds, *try_sds;
int diff, idle_core;

@@ -5902,7 +5900,6 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
test_reset_idle_core(try_sds, idle_core))
return idle_core;

-#ifdef CONFIG_NO_HZ_COMMON
pnr_busy = atomic_read(&pref_sds->nr_busy_cpus);
tnr_busy = atomic_read(&try_sds->nr_busy_cpus);
pllc_size = per_cpu(sd_llc_size, pref_cpu);
@@ -5916,7 +5913,6 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
return pref_cpu;
if (diff < 0)
return try_cpu;
-#endif /* CONFIG_NO_HZ_COMMON */

return nr_cpumask_bits;
}
--
2.18.2

2021-05-06 16:49:53

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v2 8/8] sched/fair: Dont iterate if no idle CPUs

Now that nr_busy_cpus for an LLC is updated in the idle callbacks, the
scheduler can detect whether all CPUs of an LLC are busy. In such cases,
it can avoid searching the LLC for an idle CPU on which to run the wakee
thread.
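
As a minimal sketch of that bail-out (assumed name, not the kernel
code): when the per-LLC busy count says every CPU is busy, the idle-CPU
scan in select_idle_sibling() can be skipped and the previously chosen
target used directly.

#include <stdbool.h>

/* True if the LLC still has at least one CPU not accounted as busy,
 * i.e. a scan for an idle CPU can possibly succeed. */
static bool llc_worth_scanning(int nr_busy_cpus, int llc_size)
{
	return nr_busy_cpus < llc_size;
}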

Cc: LKML <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c70f0889258f..83104d3bd0f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -715,7 +715,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
#include "pelt.h"
#ifdef CONFIG_SMP

-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu, bool idle);
static unsigned long task_h_load(struct task_struct *p);
static unsigned long capacity_of(int cpu);

@@ -5870,7 +5870,8 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,

static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val);

-static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu, int sync)
+static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu,
+ int sync, bool *idle)
{
int tnr_busy, tllc_size, pnr_busy, pllc_size;
struct sched_domain_shared *pref_sds, *try_sds;
@@ -5905,8 +5906,10 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
pllc_size = per_cpu(sd_llc_size, pref_cpu);
tllc_size = per_cpu(sd_llc_size, try_cpu);

- if (tnr_busy == tllc_size && pnr_busy == pllc_size)
+ if (tnr_busy == tllc_size && pnr_busy == pllc_size) {
+ *idle = false;
return nr_cpumask_bits;
+ }

diff = tnr_busy * pllc_size - pnr_busy * tllc_size;
if (diff > 0)
@@ -5918,7 +5921,7 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int prev_cpu, int sync)
+ int this_cpu, int prev_cpu, int sync, bool *idle)
{
bool share_caches = cpus_share_cache(prev_cpu, this_cpu);
int target = nr_cpumask_bits;
@@ -5927,7 +5930,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
target = wake_affine_idle(this_cpu, prev_cpu);

else if (sched_feat(WA_IDLER_LLC) && !share_caches)
- target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
+ target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync, idle);

if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
@@ -6343,7 +6346,7 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
/*
* Try and locate an idle core/thread in the LLC cache domain.
*/
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target, bool idle)
{
struct sched_domain *sd;
unsigned long task_util;
@@ -6420,6 +6423,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
}
}

+ if (!idle)
+ return target;
+
sd = rcu_dereference(per_cpu(sd_llc, target));
if (!sd)
return target;
@@ -6828,6 +6834,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
int want_affine = 0;
/* SD_flags and WF_flags share the first nibble */
int sd_flag = wake_flags & 0xF;
+ bool idle = true;

if (wake_flags & WF_TTWU) {
record_wakee(p);
@@ -6851,7 +6858,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
if (cpu != prev_cpu)
- new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
+ new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync, &idle);

sd = NULL; /* Prefer wake_affine over balance flags */
break;
@@ -6868,7 +6875,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
- new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+ new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, idle);

if (want_affine)
current->recent_used_cpu = cpu;
--
2.18.2

2021-05-06 16:51:52

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

Currently we account nr_busy_cpus in the nohz idle functions. There is
no reason why nr_busy_cpus should be updated only in NO_HZ_COMMON
configs. Also, the scheduler can mark a CPU as non-busy as soon as an
idle class task starts to run, and can then mark a CPU as busy as soon
as it's woken up from idle or a new task is placed on its runqueue.
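
A minimal stand-alone sketch of that accounting (not the kernel code;
the types and names are simplified for illustration). The per-CPU
is_idle flag guards against double increments/decrements of the per-LLC
busy counter:

#include <stdbool.h>

struct llc_shared { int nr_busy_cpus; };

struct cpu_state {
	struct llc_shared *llc;
	bool is_idle;
};

/* Called when the CPU is about to run the idle task. */
static void mark_cpu_idle(struct cpu_state *cs)
{
	if (cs->llc->nr_busy_cpus > 0)
		cs->llc->nr_busy_cpus--;
	cs->is_idle = true;
}

/* Called when the idle task is switched out, i.e. the CPU picks up work. */
static void mark_cpu_busy(struct cpu_state *cs)
{
	if (cs->is_idle) {
		cs->llc->nr_busy_cpus++;
		cs->is_idle = false;
	}
}

In the patch below the two counterparts live in pick_next_task_idle()
and put_prev_task_idle(), and the counter is the atomic nr_busy_cpus in
sched_domain_shared.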

Cc: LKML <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 6 ++++--
kernel/sched/idle.c | 29 +++++++++++++++++++++++++++--
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 2 ++
4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c30587631a24..4d3b0928fe98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10394,7 +10394,10 @@ static void set_cpu_sd_state_busy(int cpu)
goto unlock;
sd->nohz_idle = 0;

- atomic_inc(&sd->shared->nr_busy_cpus);
+ if (sd && per_cpu(is_idle, cpu)) {
+ atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
+ per_cpu(is_idle, cpu) = 0;
+ }
unlock:
rcu_read_unlock();
}
@@ -10424,7 +10427,6 @@ static void set_cpu_sd_state_idle(int cpu)
goto unlock;
sd->nohz_idle = 1;

- atomic_dec(&sd->shared->nr_busy_cpus);
unlock:
rcu_read_unlock();
}
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index cc828f3efe71..e624a05e48bd 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -425,12 +425,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
-#ifdef CONFIG_SCHED_SMT
+#ifdef CONFIG_SMP
+ struct sched_domain_shared *sds;
int cpu = rq->cpu;

+#ifdef CONFIG_SCHED_SMT
if (static_branch_likely(&sched_smt_present))
set_core_busy(cpu);
#endif
+
+ rcu_read_lock();
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds) {
+ if (per_cpu(is_idle, cpu)) {
+ atomic_inc(&sds->nr_busy_cpus);
+ per_cpu(is_idle, cpu) = 0;
+ }
+ }
+ rcu_read_unlock();
+#endif
}

static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
@@ -442,9 +455,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
struct task_struct *pick_next_task_idle(struct rq *rq)
{
struct task_struct *next = rq->idle;
+#ifdef CONFIG_SMP
+ struct sched_domain_shared *sds;
+ int cpu = rq->cpu;

- set_next_task_idle(rq, next, true);
+ rcu_read_lock();
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds) {
+ atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
+ per_cpu(is_idle, cpu) = 1;
+ }

+ rcu_read_unlock();
+#endif
+
+ set_next_task_idle(rq, next, true);
return next;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5c0bd4b0e73a..baf8d9a4cb26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1483,6 +1483,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
#ifdef CONFIG_SCHED_SMT
DECLARE_PER_CPU(int, smt_id);
#endif
+DECLARE_PER_CPU(int, is_idle);
DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8db40c8a6ad0..00e4669bb241 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
#ifdef CONFIG_SCHED_SMT
DEFINE_PER_CPU(int, smt_id);
#endif
+DEFINE_PER_CPU(int, is_idle);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
#ifdef CONFIG_SCHED_SMT
per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
#endif
+ per_cpu(is_idle, cpu) = 1;
rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);

sd = lowest_flag_domain(cpu, SD_NUMA);
--
2.18.2

2021-05-06 16:54:59

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 0/8] sched/fair: wake_affine improvements

* Srikar Dronamraju <[email protected]> [2021-05-06 22:15:35]:

Mel had asked for additional profile data that could be collected for
the idle-core updates.

# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping: 2
CPU MHz: 1200.000
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5200.05
Virtualization: VT-x
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 6 MiB
L3 cache: 60 MiB
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47

# cat test.sh
#! /bin/bash
tbench_srv &

#20 iterations of tbench on a 48 CPU, 2 node system
for i in $(seq 1 20); do
#enable schedstat just before tbench
echo 1 | sudo tee /proc/sys/kernel/sched_schedstats
tbench -t 60 48 127.0.0.1
#disable schedstat just after tbench
echo 0 | sudo tee /proc/sys/kernel/sched_schedstats
IDLE_COUNT=$(awk '/nr_idlecore_write/{a+=$NF}END{print a}' /proc/sched_debug)
NR_IDLE_COUNT=$(($IDLE_COUNT-$NR_IDLE_COUNT))
SELECT_COUNT=$(awk '/nr_idlecore_select/{a+=$NF}END{print a}' /proc/sched_debug)
NR_SELECT_COUNT=$(($SELECT_COUNT-$NR_SELECT_COUNT))
# select means we selected an idle core when the preferred CPU was
# busy. write means an unconditional update of the idle-core.
# seqno total_updates=select+write select write
echo $i $(($NR_IDLE_COUNT+$NR_SELECT_COUNT)) $NR_SELECT_COUNT $NR_IDLE_COUNT
done
#

a5e13c6df0e4 aka v5.12-rc5
seqno nr_update_idle_core nr_select_idle_core nr_write_idle_core
1 11515 0 11515
2 13439 0 13439
3 42339 0 42339
4 13642 0 13642
5 44770 0 44770
6 32402 0 32402
7 65638 0 65638
8 106601 0 106601
9 99819 0 99819
10 106754 0 106754
11 107899 0 107899
12 112432 0 112432
13 125329 0 125329
14 127363 0 127363
15 133821 0 133821
16 127495 0 127495
17 133957 0 133957
18 185021 0 185021
19 137139 0 137139
20 221413 0 221413

N Min Max Median Avg Stddev
20 11515 221413 107899 97439.4 57524.696

Average of 1353 updates per second.


635bb392f382 aka v5.12-rc5 + patches
seqno nr_update_idle_core nr_select_idle_core nr_write_idle_core
1 2112856 218 2112638
2 1727892 84 1727808
3 3662807 280 3662527
4 3623563 220 3623343
5 4972825 308 4972517
6 3625828 258 3625570
7 6703820 407 6703413
8 5565289 390 5564899
9 8376039 528 8375511
10 6643273 405 6642868
11 10041803 605 10041198
12 8322148 537 8321611
13 11941494 729 11940765
14 10125633 704 10124929
15 12810965 797 12810168
16 12269912 857 12269055
17 14798232 912 14797320
18 14378202 980 14377222
19 15705273 935 15704338
20 16407305 1122 16406183

N Min Max Median Avg Stddev
20 1727892 16407305 8376039 8690757.9 4718172.5

Average of 120704 updates per second, which is around 89x the rate
without the patch.

--
Thanks and Regards
Srikar Dronamraju

---->8----------------------------------------------------8<--------------

From e361fdd4234ff718247f0ee20f4f836ccbbc1df8 Mon Sep 17 00:00:00 2001
From: Srikar Dronamraju <[email protected]>
Date: Thu, 6 May 2021 19:33:51 +0530
Subject: sched: Show idlecore update stats

Not for inclusion; Just for demonstration.

nr_idlecore_write: the idle-core was set unconditionally.
nr_idlecore_select: a wakeup found an idle core when the current CPU was
busy.

nr_idlecore updates = nr_idlecore_write + nr_idlecore_select

Not-Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/debug.c | 2 ++
kernel/sched/fair.c | 12 +++++++++---
kernel/sched/sched.h | 2 ++
3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 486f403a778b..b50e0c5acf46 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -715,6 +715,8 @@ do { \
P(ttwu_count);
P(ttwu_local);
}
+ P(nr_idlecore_write);
+ P(nr_idlecore_select);
#undef P

spin_lock_irqsave(&sched_debug_lock, flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83104d3bd0f9..4ef0b7d959d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5890,16 +5890,20 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp

idle_core = READ_ONCE(pref_sds->idle_core);
if (idle_core > -1 && cpumask_test_cpu(idle_core, p->cpus_ptr) &&
- test_reset_idle_core(pref_sds, idle_core))
+ test_reset_idle_core(pref_sds, idle_core)) {
+ schedstat_inc(cpu_rq(idle_core)->nr_idlecore_select);
return idle_core;
+ }

if (available_idle_cpu(try_cpu) || sched_idle_cpu(try_cpu))
return try_cpu;

idle_core = READ_ONCE(try_sds->idle_core);
if (idle_core > -1 && cpumask_test_cpu(idle_core, p->cpus_ptr) &&
- test_reset_idle_core(try_sds, idle_core))
+ test_reset_idle_core(try_sds, idle_core)) {
+ schedstat_inc(cpu_rq(idle_core)->nr_idlecore_select);
return idle_core;
+ }

pnr_busy = atomic_read(&pref_sds->nr_busy_cpus);
tnr_busy = atomic_read(&try_sds->nr_busy_cpus);
@@ -6082,8 +6086,10 @@ static inline void set_idle_core(int cpu, int val)
struct sched_domain_shared *sds;

sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
+ if (sds) {
WRITE_ONCE(sds->idle_core, val);
+ schedstat_inc(cpu_rq(cpu)->nr_idlecore_write);
+ }
}

static inline int get_idle_core(int cpu, int def)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index baf8d9a4cb26..06d81a9e0a48 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1060,6 +1060,8 @@ struct rq {
#ifdef CONFIG_SMP
unsigned int nr_pinned;
#endif
+ unsigned int nr_idlecore_write;
+ unsigned int nr_idlecore_select;
unsigned int push_busy;
struct cpu_stop_work push_work;
};
--
2.25.1

2021-05-06 23:29:10

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v2 5/8] sched/fair: Use affine_idler_llc for wakeups across LLC

Currently, if the waker is the only running thread of its CPU and is
waking a wakee (both of them interacting over a pipe, i.e. sync wakeups)
that was previously running on a different LLC, then the wakee is pulled
and queued on the CPU that is running the waker.

This is true even if the previous CPU was completely idle. The rationale
is that the waker would soon relinquish the CPU and the wakee would
benefit from the cache access. However, if the previous CPU is idle, the
wakee thread will incur the latency cost of migrating to the waker CPU
plus the latency cost of the waker context switching.

Before the waker switches out, the load balancer could also kick in and
pull the wakee out, resulting in another migration cost. This mostly
hurts systems where the LLC has just one core, for example with
hackbench: both hackbench threads would then end up running on the same
core, resulting in CPUs running in higher SMT modes, more load balances
and non-optimal performance.

It would be beneficial to run the wakee thread on the same CPU as the
waker only if the other CPUs are busy. To achieve this, move the part of
the code that checks if the waker is the only running thread to the
early part of wake_affine_weight().

Also, post this change, wake_affine_idle() will only look at wakeups
within the LLC. For wakeups within an LLC, there should be no difference
between sync and non-sync wakeups. For wakeups across LLC boundaries,
use wake_affine_idler_llc().

Cc: LKML <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Changelog v1->v2:
- Update the patch subject line.

kernel/sched/fair.c | 22 +++++++++++-----------
kernel/sched/features.h | 2 +-
2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 72bf1996903d..c30587631a24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5799,8 +5799,7 @@ static int wake_wide(struct task_struct *p)
* scheduling latency of the CPUs. This seems to work
* for the overloaded case.
*/
-static int
-wake_affine_idle(int this_cpu, int prev_cpu, int sync)
+static int wake_affine_idle(int this_cpu, int prev_cpu)
{
/*
* If this_cpu is idle, it implies the wakeup is from interrupt
@@ -5814,15 +5813,12 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
* a cpufreq perspective, it's better to have higher utilisation
* on one CPU.
*/
- if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
- return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+ if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
+ return prev_cpu;

- if (sync && cpu_rq(this_cpu)->nr_running == 1)
+ if (available_idle_cpu(this_cpu) || sched_idle_cpu(this_cpu))
return this_cpu;

- if (available_idle_cpu(prev_cpu))
- return prev_cpu;
-
return nr_cpumask_bits;
}

@@ -5838,6 +5834,9 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
if (sync) {
unsigned long current_load = task_h_load(current);

+ if (cpu_rq(this_cpu)->nr_running <= 1)
+ return this_cpu;
+
if (current_load > this_eff_load)
return this_cpu;

@@ -5925,12 +5924,13 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int prev_cpu, int sync)
{
+ bool share_caches = cpus_share_cache(prev_cpu, this_cpu);
int target = nr_cpumask_bits;

- if (sched_feat(WA_IDLE))
- target = wake_affine_idle(this_cpu, prev_cpu, sync);
+ if (sched_feat(WA_IDLE) && share_caches)
+ target = wake_affine_idle(this_cpu, prev_cpu);

- if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits)
+ else if (sched_feat(WA_IDLER_LLC) && !share_caches)
target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);

if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index c77349a47e01..c60a6d2b2126 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,7 +83,7 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)

SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
-SCHED_FEAT(WA_IDLER_LLC, false)
+SCHED_FEAT(WA_IDLER_LLC, true)
SCHED_FEAT(WA_BIAS, true)

/*
--
2.18.2

2021-05-06 23:29:36

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v2 3/8] sched/fair: Update idle-core more often

Currently, when the scheduler does a load balance and pulls a task, or
when a CPU picks up a task during wakeup without having to call
select_idle_cpu(), it never checks if the target CPU is part of the
idle-core. This makes the idle-core information less accurate.

Given that the identity of the idle-core for an LLC is maintained, it's
easy to update the idle-core as soon as a CPU picks up a task.

This change updates the idle-core whenever a CPU from the idle-core
picks up a task. However, if there are multiple idle-cores in the LLC
and the waking CPU happens to be part of the designated idle-core, the
idle-core is set to -1. In cases where the scheduler is sure that there
are no more idle-cores, the idle-core is set to -2.

To reduce the -1 case, whenever a CPU updates the idle-core, it looks at
the other cores in the LLC for an idle core if the core it belongs to is
not idle.
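
A hedged sketch (assumed helper name) of how the three idle-core values
above are meant to be consumed; in the patch below the real consumer is
select_idle_cpu() via "bool smt = (idle_core != -2)":

#include <stdbool.h>

/*
 *   idle_core >= 0 : a CPU of a known idle core in this LLC
 *   idle_core == -1: the designated idle core just became busy, but
 *                    other idle cores may still exist, so scan cores
 *   idle_core == -2: the LLC surely has no idle cores, skip the scan
 */
static bool worth_scanning_for_idle_core(int idle_core)
{
	return idle_core != -2;
}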

Cc: LKML <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 54 +++++++++++++++++++++++++++++++++++++++++---
kernel/sched/idle.c | 6 +++++
kernel/sched/sched.h | 2 ++
3 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c9d1a210820..50da2363317d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6015,6 +6015,13 @@ static inline int __select_idle_cpu(int cpu)
DEFINE_STATIC_KEY_FALSE(sched_smt_present);
EXPORT_SYMBOL_GPL(sched_smt_present);

+/*
+ * Value of -2 indicates there are no idle-cores in LLC.
+ * Value of -1 indicates an idle-core turned to busy recently.
+ * However there could be other idle-cores in the system.
+ * Anyother value indicates core to which the CPU(value)
+ * belongs is idle.
+ */
static inline void set_idle_core(int cpu, int val)
{
struct sched_domain_shared *sds;
@@ -6037,6 +6044,40 @@ static inline int get_idle_core(int cpu, int def)
return def;
}

+static void set_next_idle_core(struct sched_domain *sd, int target)
+{
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ int core, cpu;
+
+ cpumask_andnot(cpus, sched_domain_span(sd), cpu_smt_mask(target));
+ for_each_cpu_wrap(core, cpus, target) {
+ bool idle = true;
+
+ for_each_cpu(cpu, cpu_smt_mask(core)) {
+ if (!available_idle_cpu(cpu)) {
+ idle = false;
+ break;
+ }
+ }
+
+ if (idle) {
+ set_idle_core(core, per_cpu(smt_id, core));
+ return;
+ }
+
+ cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+ }
+ set_idle_core(target, -2);
+}
+
+void set_core_busy(int core)
+{
+ rcu_read_lock();
+ if (get_idle_core(core, -1) == per_cpu(smt_id, core))
+ set_idle_core(core, -1);
+ rcu_read_unlock();
+}
+
/*
* Scans the local SMT mask to see if the entire core is idle, and records this
* information in sd_llc_shared->idle_core.
@@ -6046,11 +6087,13 @@ static inline int get_idle_core(int cpu, int def)
*/
void __update_idle_core(struct rq *rq)
{
+ struct sched_domain *sd;
int core = cpu_of(rq);
int cpu;

rcu_read_lock();
- if (get_idle_core(core, 0) >= 0)
+ sd = rcu_dereference(per_cpu(sd_llc, core));
+ if (!sd || get_idle_core(core, 0) >= 0)
goto unlock;

for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6058,10 +6101,15 @@ void __update_idle_core(struct rq *rq)
continue;

if (!available_idle_cpu(cpu))
- goto unlock;
+ goto try_next;
}

set_idle_core(core, per_cpu(smt_id, core));
+ goto unlock;
+
+try_next:
+ set_next_idle_core(sd, core);
+
unlock:
rcu_read_unlock();
}
@@ -6130,7 +6178,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
int idle_core = get_idle_core(target, -1);
- bool smt = (idle_core != -1);
+ bool smt = (idle_core != -2);
int this = smp_processor_id();
struct sched_domain *this_sd;
u64 time;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 7199e6f23789..cc828f3efe71 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -425,6 +425,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
+#ifdef CONFIG_SCHED_SMT
+ int cpu = rq->cpu;
+
+ if (static_branch_likely(&sched_smt_present))
+ set_core_busy(cpu);
+#endif
}

static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 46d40a281724..5c0bd4b0e73a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1102,6 +1102,7 @@ static inline bool is_migration_disabled(struct task_struct *p)

#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);
+extern void set_core_busy(int cpu);

static inline void update_idle_core(struct rq *rq)
{
@@ -1111,6 +1112,7 @@ static inline void update_idle_core(struct rq *rq)

#else
static inline void update_idle_core(struct rq *rq) { }
+static inline void set_core_busy(int cpu) { }
#endif

DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
--
2.18.2

2021-05-11 11:53:08

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

On 06/05/21 22:15, Srikar Dronamraju wrote:
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8db40c8a6ad0..00e4669bb241 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
> #ifdef CONFIG_SCHED_SMT
> DEFINE_PER_CPU(int, smt_id);
> #endif
> +DEFINE_PER_CPU(int, is_idle);

This + patch 8 immediately reminds me of Aubrey's patch:

http://lore.kernel.org/r/[email protected]

last I looked it seemed OK, even the test bot seems happy. Aubrey, did you
have any more work to do on that one (other than rebasing)?

> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> @@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
> #ifdef CONFIG_SCHED_SMT
> per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> #endif
> + per_cpu(is_idle, cpu) = 1;
> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
> sd = lowest_flag_domain(cpu, SD_NUMA);
> --
> 2.18.2

2021-05-11 16:57:01

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

* Valentin Schneider <[email protected]> [2021-05-11 12:51:41]:

> On 06/05/21 22:15, Srikar Dronamraju wrote:
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 8db40c8a6ad0..00e4669bb241 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
> > #ifdef CONFIG_SCHED_SMT
> > DEFINE_PER_CPU(int, smt_id);
> > #endif
> > +DEFINE_PER_CPU(int, is_idle);
>
> This + patch 8 immediately reminds me of Aubrey's patch:
>
> http://lore.kernel.org/r/[email protected]
>
> last I looked it seemed OK, even the test bot seems happy. Aubrey, did you
> have any more work to do on that one (other than rebasing)?
>

The above patch is also aimed at similar problems.

However I feel this patchset has a few differences.
- We use nr_busy_cpus in this patchset in wake_affine_idler_llc() to
differentiate between the 2 LLCs and choose the LLC that is more lightly
loaded.

- Except for the per-CPU is_idle, it gets things done with the existing
infrastructure. (And we could move is_idle into the rq struct too.)

- Mel had reservations on the per-LLC idle-core; I would think a per-LLC
idle mask would mean more updates and more dirtying. Every time a CPU
goes idle or comes out of idle, the mask would be dirtied. Though the
number of times this new per-LLC CPU mask is read is probably going to
be less than the idle-core reads with this patch series.

- In the said implementation, the idleness of a CPU is evaluated at
every check, which may not be necessary if it is handled in the idle
callbacks.

> > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > @@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
> > #ifdef CONFIG_SCHED_SMT
> > per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> > #endif
> > + per_cpu(is_idle, cpu) = 1;
> > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> > sd = lowest_flag_domain(cpu, SD_NUMA);
> > --
> > 2.18.2

--
Thanks and Regards
Srikar Dronamraju

2021-05-12 00:33:46

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

Hi Valentin,

On 5/11/21 7:51 PM, Valentin Schneider wrote:
> On 06/05/21 22:15, Srikar Dronamraju wrote:
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 8db40c8a6ad0..00e4669bb241 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
>> #ifdef CONFIG_SCHED_SMT
>> DEFINE_PER_CPU(int, smt_id);
>> #endif
>> +DEFINE_PER_CPU(int, is_idle);
>
> This + patch 8 immediately reminds me of Aubrey's patch:
>
> http://lore.kernel.org/r/[email protected]
>
> last I looked it seemed OK, even the test bot seems happy. Aubrey, did you
> have any more work to do on that one (other than rebasing)?
>

Thanks for mentioning this patch. In terms of the patch itself, I don't
have any more work to do, other than rebasing it on Rik's
bring-back-select_idle_smt() patch.

But I have some other ideas based on this patch that need verification.
For example, if we have an idle_cpu mask when we scan for an idle CPU,
we should be able to remove the SIS_PROP feature altogether. That would
remove the scan cost computation and should reduce task wakeup latency.
But I need a while to understand some benchmark regressions.

Thanks,
-Aubrey

2021-05-12 08:11:47

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
> Currently we account nr_busy_cpus in no_hz idle functions.
> There is no reason why nr_busy_cpus should updated be in NO_HZ_COMMON
> configs only. Also scheduler can mark a CPU as non-busy as soon as an
> idle class task starts to run. Scheduler can then mark a CPU as busy
> as soon as its woken up from idle or a new task is placed on it's
> runqueue.

IIRC, we discussed this before: if a SCHED_IDLE task is placed on the
CPU's runqueue, this CPU should still be taken as a wakeup target.

Also, for those frequent context-switching tasks with very short idle
periods, it's expensive for the scheduler to mark idle/busy every time;
that's why my patch only marks idle every time and marks busy
rate-limited in the scheduler tick.

Thanks,
-Aubrey


2021-05-13 07:34:35

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

* Aubrey Li <[email protected]> [2021-05-12 16:08:24]:

> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
> > Currently we account nr_busy_cpus in no_hz idle functions.
> > There is no reason why nr_busy_cpus should updated be in NO_HZ_COMMON
> > configs only. Also scheduler can mark a CPU as non-busy as soon as an
> > idle class task starts to run. Scheduler can then mark a CPU as busy
> > as soon as its woken up from idle or a new task is placed on it's
> > runqueue.
>
> IIRC, we discussed this before, if a SCHED_IDLE task is placed on the
> CPU's runqueue, this CPU should be still taken as a wakeup target.
>

Yes, this CPU is still a wakeup target; it's only when this CPU is busy
that we look at other CPUs.

> Also, for those frequent context-switching tasks with very short idle,
> it's expensive for scheduler to mark idle/busy every time, that's why
> my patch only marks idle every time and marks busy ratelimited in
> scheduler tick.
>

I have tried a few tasks with very short idle times, and updating
nr_busy every time doesn't seem to have an impact. In fact, it seems to
help in picking the idler LLC more often.

> Thanks,
> -Aubrey
>
> >
> > Cc: LKML <[email protected]>
> > Cc: Gautham R Shenoy <[email protected]>
> > Cc: Parth Shah <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Signed-off-by: Srikar Dronamraju <[email protected]>
> > ---
> > kernel/sched/fair.c | 6 ++++--
> > kernel/sched/idle.c | 29 +++++++++++++++++++++++++++--
> > kernel/sched/sched.h | 1 +
> > kernel/sched/topology.c | 2 ++
> > 4 files changed, 34 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c30587631a24..4d3b0928fe98 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10394,7 +10394,10 @@ static void set_cpu_sd_state_busy(int cpu)
> > goto unlock;
> > sd->nohz_idle = 0;
> >
> > - atomic_inc(&sd->shared->nr_busy_cpus);
> > + if (sd && per_cpu(is_idle, cpu)) {
> > + atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
> > + per_cpu(is_idle, cpu) = 0;
> > + }
> > unlock:
> > rcu_read_unlock();
> > }
> > @@ -10424,7 +10427,6 @@ static void set_cpu_sd_state_idle(int cpu)
> > goto unlock;
> > sd->nohz_idle = 1;
> >
> > - atomic_dec(&sd->shared->nr_busy_cpus);
> > unlock:
> > rcu_read_unlock();
> > }
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index cc828f3efe71..e624a05e48bd 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -425,12 +425,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
> >
> > static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> > {
> > -#ifdef CONFIG_SCHED_SMT
> > +#ifdef CONFIG_SMP
> > + struct sched_domain_shared *sds;
> > int cpu = rq->cpu;
> >
> > +#ifdef CONFIG_SCHED_SMT
> > if (static_branch_likely(&sched_smt_present))
> > set_core_busy(cpu);
> > #endif
> > +
> > + rcu_read_lock();
> > + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > + if (sds) {
> > + if (per_cpu(is_idle, cpu)) {
> > + atomic_inc(&sds->nr_busy_cpus);
> > + per_cpu(is_idle, cpu) = 0;
> > + }
> > + }
> > + rcu_read_unlock();
> > +#endif
> > }
> >
> > static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> > @@ -442,9 +455,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
> > struct task_struct *pick_next_task_idle(struct rq *rq)
> > {
> > struct task_struct *next = rq->idle;
> > +#ifdef CONFIG_SMP
> > + struct sched_domain_shared *sds;
> > + int cpu = rq->cpu;
> >
> > - set_next_task_idle(rq, next, true);
> > + rcu_read_lock();
> > + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > + if (sds) {
> > + atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
> > + per_cpu(is_idle, cpu) = 1;
> > + }
> >
> > + rcu_read_unlock();
> > +#endif
> > +
> > + set_next_task_idle(rq, next, true);
> > return next;
> > }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 5c0bd4b0e73a..baf8d9a4cb26 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1483,6 +1483,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
> > #ifdef CONFIG_SCHED_SMT
> > DECLARE_PER_CPU(int, smt_id);
> > #endif
> > +DECLARE_PER_CPU(int, is_idle);
> > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 8db40c8a6ad0..00e4669bb241 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
> > #ifdef CONFIG_SCHED_SMT
> > DEFINE_PER_CPU(int, smt_id);
> > #endif
> > +DEFINE_PER_CPU(int, is_idle);
> > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > @@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
> > #ifdef CONFIG_SCHED_SMT
> > per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> > #endif
> > + per_cpu(is_idle, cpu) = 1;
> > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> > sd = lowest_flag_domain(cpu, SD_NUMA);
> >
>
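
For reference, here is a self-contained userspace sketch of the accounting
pattern the hunks above implement. It is illustrative only: LLC_SIZE,
cpu_is_idle[] and add_unless() are stand-ins for per_cpu(sd_llc_size, cpu),
per_cpu(is_idle, cpu) and the kernel's atomic_add_unless(), and the two
mark_* helpers only roughly mirror set_cpu_sd_state_busy()/put_prev_task_idle()
and pick_next_task_idle().

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define LLC_SIZE 8			/* stand-in for per_cpu(sd_llc_size, cpu) */

static atomic_int nr_busy_cpus;		/* stand-in for sds->nr_busy_cpus */
static bool cpu_is_idle[LLC_SIZE];	/* stand-in for per_cpu(is_idle, cpu) */

/* add @delta to *v unless *v already equals @unless (atomic_add_unless-like) */
static bool add_unless(atomic_int *v, int delta, int unless)
{
	int old = atomic_load(v);

	do {
		if (old == unless)
			return false;
	} while (!atomic_compare_exchange_weak(v, &old, old + delta));
	return true;
}

/* a CPU that marked itself idle becomes busy: account it exactly once */
static void mark_cpu_busy(int cpu)
{
	if (cpu_is_idle[cpu]) {
		add_unless(&nr_busy_cpus, 1, LLC_SIZE);	/* never exceed LLC size */
		cpu_is_idle[cpu] = false;
	}
}

/* a CPU starts running the idle task: drop it from the busy count */
static void mark_cpu_idle(int cpu)
{
	add_unless(&nr_busy_cpus, -1, 0);		/* never drop below zero */
	cpu_is_idle[cpu] = true;
}

int main(void)
{
	atomic_store(&nr_busy_cpus, LLC_SIZE);		/* all CPUs start busy */

	mark_cpu_idle(3);				/* 8 -> 7 */
	mark_cpu_busy(3);				/* 7 -> 8 */
	mark_cpu_busy(3);				/* no-op: already accounted */

	printf("nr_busy_cpus = %d\n", atomic_load(&nr_busy_cpus));
	return 0;
}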

--
Thanks and Regards
Srikar Dronamraju

2021-05-14 13:09:03

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
> * Aubrey Li <[email protected]> [2021-05-12 16:08:24]:
>
>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
>>> Currently we account nr_busy_cpus in the nohz idle functions.
>>> There is no reason why nr_busy_cpus should be updated in NO_HZ_COMMON
>>> configs only. Also, the scheduler can mark a CPU as non-busy as soon as
>>> an idle-class task starts to run. The scheduler can then mark a CPU as
>>> busy as soon as it is woken up from idle or a new task is placed on its
>>> runqueue.
>>
>> IIRC, we discussed this before: if a SCHED_IDLE task is placed on the
>> CPU's runqueue, this CPU should still be taken as a wakeup target.
>>
>
> Yes, this CPU is still a wakeup target; it's only when this CPU is busy that
> we look at other CPUs.
>
>> Also, for those frequent context-switching tasks with very short idle,
>> it's expensive for scheduler to mark idle/busy every time, that's why
>> my patch only marks idle every time and marks busy ratelimited in
>> scheduler tick.
>>
>
> I have tried few tasks with very short idle times and updating nr_busy
> everytime, doesnt seem to be impacting. Infact, it seems to help in picking
> the idler-llc more often.
>

How many CPUs in your LLC?

This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
domain.

It looks like for netperf, both the TCP and UDP cases show a notable change
under 2x overcommit; it may not be interesting, though.


hackbench(48 tasks per group)
=========
case load baseline(std%) compare%( std%)
process-pipe group-1 1.00 ( 6.74) -4.61 ( 8.97)
process-pipe group-2 1.00 ( 36.84) +11.53 ( 26.35)
process-pipe group-3 1.00 ( 24.97) +12.21 ( 19.05)
process-pipe group-4 1.00 ( 18.27) -2.62 ( 17.60)
process-pipe group-8 1.00 ( 4.33) -2.22 ( 3.08)
process-sockets group-1 1.00 ( 7.88) -20.26 ( 15.97)
process-sockets group-2 1.00 ( 5.38) -19.41 ( 9.25)
process-sockets group-3 1.00 ( 4.22) -5.70 ( 3.00)
process-sockets group-4 1.00 ( 1.44) -1.80 ( 0.79)
process-sockets group-8 1.00 ( 0.44) -2.86 ( 0.06)
threads-pipe group-1 1.00 ( 5.43) -3.69 ( 3.59)
threads-pipe group-2 1.00 ( 18.00) -2.69 ( 16.79)
threads-pipe group-3 1.00 ( 21.72) -9.01 ( 21.34)
threads-pipe group-4 1.00 ( 21.58) -6.43 ( 16.26)
threads-pipe group-8 1.00 ( 3.05) -0.15 ( 2.31)
threads-sockets group-1 1.00 ( 14.51) -5.35 ( 13.85)
threads-sockets group-2 1.00 ( 3.97) -24.15 ( 4.40)
threads-sockets group-3 1.00 ( 4.97) -9.05 ( 2.46)
threads-sockets group-4 1.00 ( 1.98) -3.44 ( 0.49)
threads-sockets group-8 1.00 ( 0.37) -2.13 ( 0.20)

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR thread-48 1.00 ( 3.84) -2.20 ( 3.83)
TCP_RR thread-96 1.00 ( 5.22) -4.97 ( 3.90)
TCP_RR thread-144 1.00 ( 7.97) -0.75 ( 4.39)
TCP_RR thread-192 1.00 ( 3.03) -0.67 ( 4.40)
TCP_RR thread-384 1.00 ( 22.27) -14.15 ( 36.28)
UDP_RR thread-48 1.00 ( 2.08) -0.39 ( 2.29)
UDP_RR thread-96 1.00 ( 2.48) -4.26 ( 16.06)
UDP_RR thread-144 1.00 ( 49.50) -3.28 ( 34.86)
UDP_RR thread-192 1.00 ( 6.39) +8.07 ( 88.15)
UDP_RR thread-384 1.00 ( 31.54) -12.76 ( 35.98)

2021-05-17 17:34:18

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

* Aubrey Li <[email protected]> [2021-05-14 12:11:50]:

> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
> > * Aubrey Li <[email protected]> [2021-05-12 16:08:24]:
> >> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:

<snip>

> >> Also, for those frequent context-switching tasks with very short idle,
> >> it's expensive for scheduler to mark idle/busy every time, that's why
> >> my patch only marks idle every time and marks busy ratelimited in
> >> scheduler tick.
> >>
> >
> > I have tried few tasks with very short idle times and updating nr_busy
> > everytime, doesnt seem to be impacting. Infact, it seems to help in picking
> > the idler-llc more often.
> >
>
> How many CPUs in your LLC?

I have tried with X86, 48 CPUs, 2 nodes, each having 24 CPUs in LLC
+
POWER10, Multiple CPUs with 4 CPUs in LLC
+
POWER9, Multiple CPUs with 8 CPUs in LLC

>
> This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
> domain.
>

Okay,

> It looks like for netperf both TCP and UDP cases have the notable change
> under 2 x overcommit, it may be not interesting though.
>
>

I believe the extra load on this 24-core LLC could be because we may end up
trying to set the idle-core even when there is no idle core available.

If possible, can you please give v3 a try with the call to
set_next_idle_core commented out?


--
Thanks and Regards
Srikar Dronamraju

2021-05-17 18:26:03

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

* Aubrey Li <[email protected]> [2021-05-17 20:48:46]:

> On 5/17/21 6:40 PM, Srikar Dronamraju wrote:
> > * Aubrey Li <[email protected]> [2021-05-14 12:11:50]:
> >
> >> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
> >>> * Aubrey Li <[email protected]> [2021-05-12 16:08:24]:
> >>>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
> >
> > <snip>
> >
> >>>> Also, for those frequent context-switching tasks with very short idle,
> >>>> it's expensive for scheduler to mark idle/busy every time, that's why
> >>>> my patch only marks idle every time and marks busy ratelimited in
> >>>> scheduler tick.
> >>>>
> >>>
> >>> I have tried few tasks with very short idle times and updating nr_busy
> >>> everytime, doesnt seem to be impacting. Infact, it seems to help in picking
> >>> the idler-llc more often.
> >>>
> >>
> >> How many CPUs in your LLC?
> >
> > I have tried with X86, 48 CPUs, 2 nodes, each having 24 CPUs in LLC
> > +
> > POWER10, Multiple CPUs with 4 CPUs in LLC
> > +
> > POWER9, Multiple CPUs with 8 CPUs in LLC
> >
> >>
> >> This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
> >> domain.
> >>
> >
> > Okay,
> >
> >> It looks like for netperf both TCP and UDP cases have the notable change
> >> under 2 x overcommit, it may be not interesting though.
> >>
> >>
> >
> > I believe the extra load on this 24 core LLC could be because we may end up
> > trying to set the idle-core, even when there is no idle core available.
> >
> > If possible, can you please give a try with v3 with the call to
> > set_next_idle_core commented out?
> >
> >
>
> v3 seems not be applicable on tip/sched/core 915a2bc3c6b7?

I had applied it on top of 2ea46c6fc9452ac100ad907b051d797225847e33,
which was tag sched-core-2021-04-28.

The only conflict you get on today's tip is Gautham's one-line patch.
Gautham's patch replaced 'this' with 'target'.

The 2nd patch does away with that line.

--
Thanks and Regards
Srikar Dronamraju

2021-05-17 19:23:55

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

On 5/17/21 6:40 PM, Srikar Dronamraju wrote:
> * Aubrey Li <[email protected]> [2021-05-14 12:11:50]:
>
>> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
>>> * Aubrey Li <[email protected]> [2021-05-12 16:08:24]:
>>>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
>
> <snip>
>
>>>> Also, for those frequent context-switching tasks with very short idle,
>>>> it's expensive for scheduler to mark idle/busy every time, that's why
>>>> my patch only marks idle every time and marks busy ratelimited in
>>>> scheduler tick.
>>>>
>>>
>>> I have tried few tasks with very short idle times and updating nr_busy
>>> everytime, doesnt seem to be impacting. Infact, it seems to help in picking
>>> the idler-llc more often.
>>>
>>
>> How many CPUs in your LLC?
>
> I have tried with X86, 48 CPUs, 2 nodes, each having 24 CPUs in LLC
> +
> POWER10, Multiple CPUs with 4 CPUs in LLC
> +
> POWER9, Multiple CPUs with 8 CPUs in LLC
>
>>
>> This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
>> domain.
>>
>
> Okay,
>
>> It looks like for netperf both TCP and UDP cases have the notable change
>> under 2 x overcommit, it may be not interesting though.
>>
>>
>
> I believe the extra load on this 24 core LLC could be because we may end up
> trying to set the idle-core, even when there is no idle core available.
>
> If possible, can you please give a try with v3 with the call to
> set_next_idle_core commented out?
>
>

v3 doesn't seem to apply on tip/sched/core 915a2bc3c6b7?

2021-05-19 09:44:33

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

On 5/17/21 8:57 PM, Srikar Dronamraju wrote:
> * Aubrey Li <[email protected]> [2021-05-17 20:48:46]:
>
>> On 5/17/21 6:40 PM, Srikar Dronamraju wrote:
>>> * Aubrey Li <[email protected]> [2021-05-14 12:11:50]:
>>>
>>>> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
>>>>> * Aubrey Li <[email protected]> [2021-05-12 16:08:24]:
>>>>>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
>>>
>>> <snip>
>>>
>>>>>> Also, for those frequent context-switching tasks with very short idle,
>>>>>> it's expensive for scheduler to mark idle/busy every time, that's why
>>>>>> my patch only marks idle every time and marks busy ratelimited in
>>>>>> scheduler tick.
>>>>>>
>>>>>
>>>>> I have tried few tasks with very short idle times and updating nr_busy
>>>>> everytime, doesnt seem to be impacting. Infact, it seems to help in picking
>>>>> the idler-llc more often.
>>>>>
>>>>
>>>> How many CPUs in your LLC?
>>>
>>> I have tried with X86, 48 CPUs, 2 nodes, each having 24 CPUs in LLC
>>> +
>>> POWER10, Multiple CPUs with 4 CPUs in LLC
>>> +
>>> POWER9, Multiple CPUs with 8 CPUs in LLC
>>>
>>>>
>>>> This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
>>>> domain.
>>>>
>>>
>>> Okay,
>>>
>>>> It looks like for netperf both TCP and UDP cases have the notable change
>>>> under 2 x overcommit, it may be not interesting though.
>>>>
>>>>
>>>
>>> I believe the extra load on this 24 core LLC could be because we may end up
>>> trying to set the idle-core, even when there is no idle core available.
>>>
>>> If possible, can you please give a try with v3 with the call to
>>> set_next_idle_core commented out?
>>>
>>>
>>
>> v3 seems not be applicable on tip/sched/core 915a2bc3c6b7?
>
> I had applied on top of 2ea46c6fc9452ac100ad907b051d797225847e33
> which was tag: sched-core-2021-04-28
>
> The only conflict you get on today's tip is Gautham's one line patch.
> Gautham's patch replaced 'this' with 'target'.
>
> The 2nd patch does away with that line
>

These are the results with v3. It looks like hackbench gets better, and netperf
still has some notable changes under the 2x overcommit cases.


hackbench (48 tasks per group)
=========
case load baseline(std%) compare%( std%)
process-pipe group-1 1.00 ( 4.51) +1.36 ( 4.26)
process-pipe group-2 1.00 ( 18.73) -9.66 ( 31.15)
process-pipe group-3 1.00 ( 23.67) +8.52 ( 21.13)
process-pipe group-4 1.00 ( 14.65) +17.12 ( 25.23)
process-pipe group-8 1.00 ( 3.11) +16.41 ( 5.94)
process-sockets group-1 1.00 ( 8.83) +1.53 ( 11.93)
process-sockets group-2 1.00 ( 5.32) -15.43 ( 7.17)
process-sockets group-3 1.00 ( 4.79) -4.14 ( 1.90)
process-sockets group-4 1.00 ( 2.39) +4.37 ( 1.31)
process-sockets group-8 1.00 ( 0.38) +4.41 ( 0.05)
threads-pipe group-1 1.00 ( 3.06) -1.57 ( 3.71)
threads-pipe group-2 1.00 ( 17.41) -2.16 ( 15.29)
threads-pipe group-3 1.00 ( 17.94) +19.86 ( 13.24)
threads-pipe group-4 1.00 ( 15.38) +3.71 ( 11.97)
threads-pipe group-8 1.00 ( 2.72) +13.40 ( 8.43)
threads-sockets group-1 1.00 ( 8.51) -2.73 ( 17.48)
threads-sockets group-2 1.00 ( 5.44) -12.04 ( 5.91)
threads-sockets group-3 1.00 ( 4.38) -5.00 ( 1.48)
threads-sockets group-4 1.00 ( 1.08) +4.46 ( 1.15)
threads-sockets group-8 1.00 ( 0.61) +5.12 ( 0.20)

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR thread-48 1.00 ( 3.79) +4.69 ( 4.03)
TCP_RR thread-96 1.00 ( 4.98) -6.74 ( 3.59)
TCP_RR thread-144 1.00 ( 6.04) -2.36 ( 3.57)
TCP_RR thread-192 1.00 ( 4.97) -0.44 ( 4.89)
TCP_RR thread-384 1.00 ( 19.87) -19.12 ( 28.99)
UDP_RR thread-48 1.00 ( 12.54) -2.73 ( 1.59)
UDP_RR thread-96 1.00 ( 6.51) -6.66 ( 10.42)
UDP_RR thread-144 1.00 ( 45.41) -3.81 ( 31.37)
UDP_RR thread-192 1.00 ( 32.06) +3.07 ( 71.89)
UDP_RR thread-384 1.00 ( 29.57) -21.52 ( 35.50)

2021-05-19 11:50:09

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

* Aubrey Li <[email protected]> [2021-05-18 08:59:00]:

> On 5/17/21 8:57 PM, Srikar Dronamraju wrote:
> > * Aubrey Li <[email protected]> [2021-05-17 20:48:46]:
> >
> >> On 5/17/21 6:40 PM, Srikar Dronamraju wrote:
> >>> * Aubrey Li <[email protected]> [2021-05-14 12:11:50]:
> >>>
> >>>> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
> >>>>> * Aubrey Li <[email protected]> [2021-05-12 16:08:24]:
> >>>>>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
> >>>
> >>> <snip>
> >>>
> >>>>>> Also, for those frequent context-switching tasks with very short idle,
> >>>>>> it's expensive for scheduler to mark idle/busy every time, that's why
> >>>>>> my patch only marks idle every time and marks busy ratelimited in
> >>>>>> scheduler tick.
> >>>>>>
> >>>>>
> >>>>> I have tried few tasks with very short idle times and updating nr_busy
> >>>>> everytime, doesnt seem to be impacting. Infact, it seems to help in picking
> >>>>> the idler-llc more often.
> >>>>>
> >>>>
> >>>> How many CPUs in your LLC?
> >>>
> >>> I have tried with X86, 48 CPUs, 2 nodes, each having 24 CPUs in LLC
> >>> +
> >>> POWER10, Multiple CPUs with 4 CPUs in LLC
> >>> +
> >>> POWER9, Multiple CPUs with 8 CPUs in LLC
> >>>
> >>>>
> >>>> This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
> >>>> domain.
> >>>>
> >>>
> >>> Okay,
> >>>
> >>>> It looks like for netperf both TCP and UDP cases have the notable change
> >>>> under 2 x overcommit, it may be not interesting though.
> >>>>
> >>>>
> >>>
> >>> I believe the extra load on this 24 core LLC could be because we may end up
> >>> trying to set the idle-core, even when there is no idle core available.
> >>>
> >>> If possible, can you please give a try with v3 with the call to
> >>> set_next_idle_core commented out?
> >>>
> >>>
> >>
> >> v3 seems not be applicable on tip/sched/core 915a2bc3c6b7?
> >
> > I had applied on top of 2ea46c6fc9452ac100ad907b051d797225847e33
> > which was tag: sched-core-2021-04-28
> >
> > The only conflict you get on today's tip is Gautham's one line patch.
> > Gautham's patch replaced 'this' with 'target'.
> >
> > The 2nd patch does away with that line
> >
>
> This is v3. It looks like hackbench gets better. And netperf still has
> some notable changes under 2 x overcommit cases.
>

Thanks, Aubrey, for the results. The netperf (2X) case does seem to regress;
I was actually expecting the results to get better with overcommit.
Can you confirm whether this was just v3, or v3 + set_next_idle_core
disabled?

>
> hackbench (48 tasks per group)
> =========
> case load baseline(std%) compare%( std%)
> process-pipe group-1 1.00 ( 4.51) +1.36 ( 4.26)
> process-pipe group-2 1.00 ( 18.73) -9.66 ( 31.15)
> process-pipe group-3 1.00 ( 23.67) +8.52 ( 21.13)
> process-pipe group-4 1.00 ( 14.65) +17.12 ( 25.23)
> process-pipe group-8 1.00 ( 3.11) +16.41 ( 5.94)
> process-sockets group-1 1.00 ( 8.83) +1.53 ( 11.93)
> process-sockets group-2 1.00 ( 5.32) -15.43 ( 7.17)
> process-sockets group-3 1.00 ( 4.79) -4.14 ( 1.90)
> process-sockets group-4 1.00 ( 2.39) +4.37 ( 1.31)
> process-sockets group-8 1.00 ( 0.38) +4.41 ( 0.05)
> threads-pipe group-1 1.00 ( 3.06) -1.57 ( 3.71)
> threads-pipe group-2 1.00 ( 17.41) -2.16 ( 15.29)
> threads-pipe group-3 1.00 ( 17.94) +19.86 ( 13.24)
> threads-pipe group-4 1.00 ( 15.38) +3.71 ( 11.97)
> threads-pipe group-8 1.00 ( 2.72) +13.40 ( 8.43)
> threads-sockets group-1 1.00 ( 8.51) -2.73 ( 17.48)
> threads-sockets group-2 1.00 ( 5.44) -12.04 ( 5.91)
> threads-sockets group-3 1.00 ( 4.38) -5.00 ( 1.48)
> threads-sockets group-4 1.00 ( 1.08) +4.46 ( 1.15)
> threads-sockets group-8 1.00 ( 0.61) +5.12 ( 0.20)
>
> netperf
> =======
> case load baseline(std%) compare%( std%)
> TCP_RR thread-48 1.00 ( 3.79) +4.69 ( 4.03)
> TCP_RR thread-96 1.00 ( 4.98) -6.74 ( 3.59)
> TCP_RR thread-144 1.00 ( 6.04) -2.36 ( 3.57)
> TCP_RR thread-192 1.00 ( 4.97) -0.44 ( 4.89)
> TCP_RR thread-384 1.00 ( 19.87) -19.12 ( 28.99)
> UDP_RR thread-48 1.00 ( 12.54) -2.73 ( 1.59)
> UDP_RR thread-96 1.00 ( 6.51) -6.66 ( 10.42)
> UDP_RR thread-144 1.00 ( 45.41) -3.81 ( 31.37)
> UDP_RR thread-192 1.00 ( 32.06) +3.07 ( 71.89)
> UDP_RR thread-384 1.00 ( 29.57) -21.52 ( 35.50)

--
Thanks and Regards
Srikar Dronamraju

2021-05-19 13:31:40

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

On 5/18/21 12:00 PM, Srikar Dronamraju wrote:
> * Aubrey Li <[email protected]> [2021-05-18 08:59:00]:
>
>> On 5/17/21 8:57 PM, Srikar Dronamraju wrote:
>>> * Aubrey Li <[email protected]> [2021-05-17 20:48:46]:
>>>
>>>> On 5/17/21 6:40 PM, Srikar Dronamraju wrote:
>>>>> * Aubrey Li <[email protected]> [2021-05-14 12:11:50]:
>>>>>
>>>>>> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
>>>>>>> * Aubrey Li <[email protected]> [2021-05-12 16:08:24]:
>>>>>>>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
>>>>>
>>>>> <snip>
>>>>>
>>>>>>>> Also, for those frequent context-switching tasks with very short idle,
>>>>>>>> it's expensive for scheduler to mark idle/busy every time, that's why
>>>>>>>> my patch only marks idle every time and marks busy ratelimited in
>>>>>>>> scheduler tick.
>>>>>>>>
>>>>>>>
>>>>>>> I have tried few tasks with very short idle times and updating nr_busy
>>>>>>> everytime, doesnt seem to be impacting. Infact, it seems to help in picking
>>>>>>> the idler-llc more often.
>>>>>>>
>>>>>>
>>>>>> How many CPUs in your LLC?
>>>>>
>>>>> I have tried with X86, 48 CPUs, 2 nodes, each having 24 CPUs in LLC
>>>>> +
>>>>> POWER10, Multiple CPUs with 4 CPUs in LLC
>>>>> +
>>>>> POWER9, Multiple CPUs with 8 CPUs in LLC
>>>>>
>>>>>>
>>>>>> This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
>>>>>> domain.
>>>>>>
>>>>>
>>>>> Okay,
>>>>>
>>>>>> It looks like for netperf both TCP and UDP cases have the notable change
>>>>>> under 2 x overcommit, it may be not interesting though.
>>>>>>
>>>>>>
>>>>>
>>>>> I believe the extra load on this 24 core LLC could be because we may end up
>>>>> trying to set the idle-core, even when there is no idle core available.
>>>>>
>>>>> If possible, can you please give a try with v3 with the call to
>>>>> set_next_idle_core commented out?
>>>>>
>>>>>
>>>>
>>>> v3 seems not be applicable on tip/sched/core 915a2bc3c6b7?
>>>
>>> I had applied on top of 2ea46c6fc9452ac100ad907b051d797225847e33
>>> which was tag: sched-core-2021-04-28
>>>
>>> The only conflict you get on today's tip is Gautham's one line patch.
>>> Gautham's patch replaced 'this' with 'target'.
>>>
>>> The 2nd patch does away with that line
>>>
>>
>> This is v3. It looks like hackbench gets better. And netperf still has
>> some notable changes under 2 x overcommit cases.
>>
>
> Thanks Aubrey for the results. netperf (2X) case does seem to regress.
> I was actually expecting the results to get better with overcommit.
> Can you confirm if this was just v3 or with v3 + set_next_idle_core
> disabled?

Do you actually mean set_idle_cores (not set_next_idle_core)? Gautham's patch
changed "this" to "target" in set_idle_cores, and I removed that change to apply
v3-2-8-sched-fair-Maintain-the-identity-of-idle-core.patch on tip/sched/core
commit 915a2bc3c6b7.

2021-05-19 15:56:15

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

* Aubrey Li <[email protected]> [2021-05-18 14:05:56]:

> On 5/18/21 12:00 PM, Srikar Dronamraju wrote:
> > * Aubrey Li <[email protected]> [2021-05-18 08:59:00]:
> >
> >> On 5/17/21 8:57 PM, Srikar Dronamraju wrote:
> >>> * Aubrey Li <[email protected]> [2021-05-17 20:48:46]:
> >>>
> >>>> On 5/17/21 6:40 PM, Srikar Dronamraju wrote:
> >>>>> * Aubrey Li <[email protected]> [2021-05-14 12:11:50]:
> >>>>>
> >>>>>> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
> >>>>>>> * Aubrey Li <[email protected]> [2021-05-12 16:08:24]:
> >>>>>>>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
> >>>>>

<snip>
> >>
> >> This is v3. It looks like hackbench gets better. And netperf still has
> >> some notable changes under 2 x overcommit cases.
> >>
> >
> > Thanks Aubrey for the results. netperf (2X) case does seem to regress.
> > I was actually expecting the results to get better with overcommit.
> > Can you confirm if this was just v3 or with v3 + set_next_idle_core
> > disabled?
>
> Do you mean set_idle_cores(not set_next_idle_core) actually? Gautham's patch
> changed "this" to "target" in set_idle_cores, and I removed it to apply
> v3-2-8-sched-fair-Maintain-the-identity-of-idle-core.patch for tip/sched/core
> commit-id 915a2bc3c6b7.

That's correct.

In the 3rd patch, I had introduced set_next_idle_core, which is supposed to
set the idle_cores for the LLC. What I suspected is that this one is causing
issues in your 48-CPU LLC.

I expect set_next_idle_core to be spending a lot of time in your scenario.
I was planning for something like the below on top of my patch. With this,
we don't look for an idle-core if we already know that we won't find one.
But in the meanwhile, I had asked if you could try dropping the call to
set_next_idle_core.

--
Thanks and Regards
Srikar Dronamraju

------------>8-----------------8<--------------------------

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bee8e5225d99..2e2113262647 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6207,6 +6207,9 @@ static void set_next_idle_core(int target)
if (!sd)
return;

+ if (atomic_read(&sd->shared->nr_busy_cpus) * 2 >= per_cpu(sd_llc_size, target))
+ goto out;
+
cpumask_andnot(cpus, sched_domain_span(sd), cpu_smt_mask(target));
for_each_cpu_wrap(core, cpus, target) {
bool idle = true;
@@ -6225,6 +6228,7 @@ static void set_next_idle_core(int target)

cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
}
+out:
set_idle_core(target, -2);
}



2021-05-19 19:25:53

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

On 5/18/21 3:18 PM, Srikar Dronamraju wrote:

>>>> This is v3. It looks like hackbench gets better. And netperf still has
>>>> some notable changes under 2 x overcommit cases.
>>>>
>>>
>>> Thanks Aubrey for the results. netperf (2X) case does seem to regress.
>>> I was actually expecting the results to get better with overcommit.
>>> Can you confirm if this was just v3 or with v3 + set_next_idle_core
>>> disabled?
>>
>> Do you mean set_idle_cores(not set_next_idle_core) actually? Gautham's patch
>> changed "this" to "target" in set_idle_cores, and I removed it to apply
>> v3-2-8-sched-fair-Maintain-the-identity-of-idle-core.patch for tip/sched/core
>> commit-id 915a2bc3c6b7.
>
> Thats correct,
>
> In the 3rd patch, I had introduced set_next_idle_core
> which is suppose to set idle_cores in the LLC.
> What I suspected was is this one is causing issues in your 48 CPU LLC.
>
> I am expecting set_next_idle_core to be spending much time in your scenario.
> I was planning for something like the below on top of my patch.
> With this we dont look for an idle-core if we already know that we dont find one.
> But in the mean while I had asked if you could have dropped the call to
> set_next_idle_core.
>

+ if (atomic_read(&sd->shared->nr_busy_cpus) * 2 >= per_cpu(sd_llc_size, target))
+ goto out;

Does this have side effects if the waker and wakee are coalesced on a portion
of the cores? Also, is the 2 an SMT2 assumption?

I did a quick test of this; it looks like the regression in the netperf 2x
cases is indeed improved, but the two hackbench mid-load cases get worse.

process-sockets group-2 1.00 ( 5.32) -18.40 ( 7.32)
threads-sockets group-2 1.00 ( 5.44) -20.44 ( 4.60)

Thanks,
-Aubrey

2021-05-19 20:19:27

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback

* Aubrey Li <[email protected]> [2021-05-19 17:43:55]:

> On 5/18/21 3:18 PM, Srikar Dronamraju wrote:
>
> >>>> This is v3. It looks like hackbench gets better. And netperf still has
> >>>> some notable changes under 2 x overcommit cases.
> >>>>
> >>>
> >>> Thanks Aubrey for the results. netperf (2X) case does seem to regress.
> >>> I was actually expecting the results to get better with overcommit.
> >>> Can you confirm if this was just v3 or with v3 + set_next_idle_core
> >>> disabled?
> >>
> >> Do you mean set_idle_cores(not set_next_idle_core) actually? Gautham's patch
> >> changed "this" to "target" in set_idle_cores, and I removed it to apply
> >> v3-2-8-sched-fair-Maintain-the-identity-of-idle-core.patch for tip/sched/core
> >> commit-id 915a2bc3c6b7.
> >
> > Thats correct,
> >
> > In the 3rd patch, I had introduced set_next_idle_core
> > which is suppose to set idle_cores in the LLC.
> > What I suspected was is this one is causing issues in your 48 CPU LLC.
> >
> > I am expecting set_next_idle_core to be spending much time in your scenario.
> > I was planning for something like the below on top of my patch.
> > With this we dont look for an idle-core if we already know that we dont find one.
> > But in the mean while I had asked if you could have dropped the call to
> > set_next_idle_core.
> >
>
> + if (atomic_read(&sd->shared->nr_busy_cpus) * 2 >= per_cpu(sd_llc_size, target))
> + goto out;
>
> Does this has side effect if waker and wakee are coalesced on a portion of cores?
> Also, is 2 a SMT2 assumption?

The above line was just a hack to see if things change by not spending time
searching for an idle-core. Since you were running on Intel, I had
hard-coded it to 2.
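
For what it's worth, here is a hedged sketch of how that 2 could be derived
from the SMT width instead of being hard-coded. The helper name is made up;
nr_busy, llc_size and smt_width stand in for
atomic_read(&sd->shared->nr_busy_cpus), per_cpu(sd_llc_size, target) and
cpumask_weight(cpu_smt_mask(target)), and the check stays a heuristic, since
busy CPUs could still be packed onto fewer cores:

/* Illustrative only, not the posted code: skip the idle-core search when the
 * busy-CPU count is at least the number of cores in the LLC. With SMT2 this
 * reduces to the "* 2" check in the hack above. */
static inline int worth_searching_for_idle_core(int nr_busy, int llc_size,
						int smt_width)
{
	int nr_cores = llc_size / smt_width;

	return nr_busy < nr_cores;
}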

>
> I did a quick testing on this, it looks like the regression of netperf 2x cases are
> improved indeed, but hackbench two mid-load cases get worse.
>

In the mid-loaded case, there was a chance of an idle-core being around. Since
we don't set it to an idle-core or to -2, i.e. the idle-core is set to -1, it
may or may not get selected.
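
As a reminder, a minimal sketch of the idle-core encoding being discussed
here; the helper name is illustrative and not from the patch:

/*
 * Idle-core value as used in this discussion:
 *   >= 0 : a known idle core in the LLC
 *     -1 : unknown -- the LLC may or may not have an idle core
 *     -2 : the LLC surely has no idle core
 */
static inline int idle_core_search_allowed(int idle_core)
{
	return idle_core != -2;	/* -1 and valid core ids both allow a search */
}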

> process-sockets group-2 1.00 ( 5.32) -18.40 ( 7.32)
> threads-sockets group-2 1.00 ( 5.44) -20.44 ( 4.60)
>
> Thanks,
> -Aubrey

--
Thanks and Regards
Srikar Dronamraju