2018-06-29 06:26:27

by Subhra Mazumdar

[permalink] [raw]
Subject: [RESEND RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path

Current select_idle_sibling first tries to find a fully idle core using
select_idle_core which can potentially search all cores and if it fails it
finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially
search all cpus in the llc domain. This doesn't scale for large llc domains
and will only get worse with more cores in future.

This patch solves the scalability problem by:
- Setting an upper and lower limit of idle cpu search in select_idle_cpu
to keep search time low and constant
- Adding a new sched feature SIS_CORE to disable select_idle_core

Additionally it also introduces a new per-cpu variable next_cpu to track
the limit of search so that every time search starts from where it ended.
This rotating search window over cpus in LLC domain ensures that idle
cpus are eventually found in case of high load.

Following are the performance numbers with various benchmarks with SIS_CORE
true (idle core search enabled).

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups baseline %stdev patch %stdev
1 0.5816 8.94 0.5903 (-1.5%) 11.28
2 0.6428 10.64 0.5843 (9.1%) 4.93
4 1.0152 1.99 0.9965 (1.84%) 1.83
8 1.8128 1.4 1.7921 (1.14%) 1.76
16 3.1666 0.8 3.1345 (1.01%) 0.81
32 5.6084 0.83 5.5677 (0.73%) 0.8

Sysbench MySQL on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads baseline %stdev patch %stdev
8 2095.45 1.82 2102.6 (0.34%) 2.11
16 4218.45 0.06 4221.35 (0.07%) 0.38
32 7531.36 0.49 7607.18 (1.01%) 0.25
48 10206.42 0.21 10324.26 (1.15%) 0.13
64 12053.73 0.1 12158.3 (0.87%) 0.24
128 14810.33 0.04 14840.4 (0.2%) 0.38

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users baseline %stdev patch %stdev
20 1 0.9 1.0068 (0.68%) 0.27
40 1 0.8 1.0103 (1.03%) 1.24
60 1 0.34 1.0178 (1.78%) 0.49
80 1 0.53 1.0092 (0.92%) 1.5
100 1 0.79 1.0090 (0.9%) 0.88
120 1 0.06 1.0048 (0.48%) 0.72
140 1 0.22 1.0116 (1.16%) 0.05
160 1 0.57 1.0264 (2.64%) 0.67
180 1 0.81 1.0194 (1.94%) 0.91
200 1 0.44 1.028 (2.8%) 3.09
220 1 1.74 1.0229 (2.29%) 0.21

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline %stdev patch %stdev
8 45.36 0.43 46.28 (2.01%) 0.29
16 87.81 0.82 89.67 (2.12%) 0.38
32 151.19 0.02 153.5 (1.53%) 0.41
48 190.2 0.21 194.79 (2.41%) 0.07
64 190.42 0.35 202.9 (6.55%) 1.66
128 323.86 0.28 343.56 (6.08%) 1.34

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline patch
1 629.8 603.83 (-4.12%)
2 1159.65 1155.75 (-0.34%)
4 2121.61 2093.99 (-1.3%)
8 2620.52 2641.51 (0.8%)
16 2879.31 2897.6 (0.64%)
32 2791.24 2936.47 (5.2%)
64 1853.07 1894.74 (2.25%)
128 1484.95 1494.29 (0.63%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline patch
1 256.41 255.8 (-0.24%)
2 509.89 504.52 (-1.05%)
4 999.44 1003.74 (0.43%)
8 1982.7 1976.42 (-0.32%)
16 3891.51 3916.04 (0.63%)
32 6819.24 6845.06 (0.38%)
64 8542.95 8568.28 (0.3%)
128 15277.6 15754.6 (3.12%)

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile baseline %stdev patch %stdev
50 94 2.82 92 (2.13%) 2.17
75 124 2.13 122 (1.61%) 1.42
90 152 1.74 151 (0.66%) 0.66
95 171 2.11 170 (0.58%) 0
99 512.67 104.96 208.33 (59.36%) 1.2
99.5 2296 82.55 3674.66 (-60.05%) 22.19
99.9 12517.33 2.38 12784 (-2.13%) 0.66

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):
groups baseline %stdev patch %stdev
1 1.3085 6.65 1.2213 (6.66%) 10.32
2 1.4559 8.55 1.5048 (-3.36%) 4.72
4 2.6271 1.74 2.5532 (2.81%) 2.02
8 4.7089 3.01 4.5118 (4.19%) 2.74
16 8.7406 2.25 8.6801 (0.69%) 4.78
32 17.7835 1.01 16.759 (5.76%) 1.38
64 36.1901 0.65 34.6652 (4.21%) 1.24
128 72.6585 0.51 70.9762 (2.32%) 0.9

Following are the performance numbers with various benchmarks with SIS_CORE
false (idle core search disabled). This improves throughput of certain
workloads but increases latency of other workloads.

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups baseline %stdev patch %stdev
1 0.5816 8.94 0.5835 (-0.33%) 8.21
2 0.6428 10.64 0.5752 (10.52%) 4.05
4 1.0152 1.99 0.9946 (2.03%) 2.56
8 1.8128 1.4 1.7619 (2.81%) 1.88
16 3.1666 0.8 3.1275 (1.23%) 0.42
32 5.6084 0.83 5.5856 (0.41%) 0.89

Sysbench MySQL on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads baseline %stdev patch %stdev
8 2095.45 1.82 2084.72 (-0.51%) 1.65
16 4218.45 0.06 4179.69 (-0.92%) 0.18
32 7531.36 0.49 7623.18 (1.22%) 0.39
48 10206.42 0.21 10159.16 (-0.46%) 0.21
64 12053.73 0.1 12087.21 (0.28%) 0.19
128 14810.33 0.04 14894.08 (0.57%) 0.08

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users baseline %stdev patch %stdev
20 1 0.9 1.0056 (0.56%) 0.34
40 1 0.8 1.0173 (1.73%) 0.13
60 1 0.34 0.9995 (-0.05%) 0.85
80 1 0.53 1.0175 (1.75%) 1.56
100 1 0.79 1.0151 (1.51%) 1.31
120 1 0.06 1.0244 (2.44%) 0.5
140 1 0.22 1.034 (3.4%) 0.66
160 1 0.57 1.0362 (3.62%) 0.07
180 1 0.81 1.041 (4.1%) 0.8
200 1 0.44 1.0233 (2.33%) 1.4
220 1 1.74 1.0125 (1.25%) 1.41

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline %stdev patch %stdev
8 45.36 0.43 46.94 (3.48%) 0.2
16 87.81 0.82 91.75 (4.49%) 0.43
32 151.19 0.02 167.74 (10.95%) 1.29
48 190.2 0.21 200.57 (5.45%) 0.89
64 190.42 0.35 226.74 (19.07%) 1.79
128 323.86 0.28 348.12 (7.49%) 0.77

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline patch
1 629.8 600.19 (-4.7%)
2 1159.65 1162.07 (0.21%)
4 2121.61 2112.27 (-0.44%)
8 2620.52 2645.55 (0.96%)
16 2879.31 2828.87 (-1.75%)
32 2791.24 2760.97 (-1.08%)
64 1853.07 1747.66 (-5.69%)
128 1484.95 1459.81 (-1.69%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline patch
1 256.41 258.11 (0.67%)
2 509.89 509.13 (-0.15%)
4 999.44 1016.58 (1.72%)
8 1982.7 2006.53 (1.2%)
16 3891.51 3964.43 (1.87%)
32 6819.24 7376.92 (8.18%)
64 8542.95 9660.45 (13.08%)
128 15277.6 15438.4 (1.05%)

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile baseline %stdev patch %stdev
50 94 2.82 94.67 (-0.71%) 2.2
75 124 2.13 124.67 (-0.54%) 1.67
90 152 1.74 154.33 (-1.54%) 0.75
95 171 2.11 176.67 (-3.31%) 0.86
99 512.67 104.96 4130.33 (-705.65%) 79.41
99.5 2296 82.55 10066.67 (-338.44%) 26.15
99.9 12517.33 2.38 12869.33 (-2.81%) 0.8

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):
groups baseline %stdev patch %stdev
1 1.3085 6.65 1.2514 (4.36%) 11.1
2 1.4559 8.55 1.5433 (-6%) 3.05
4 2.6271 1.74 2.5626 (2.5%) 2.69
8 4.7089 3.01 4.5316 (3.77%) 2.95
16 8.7406 2.25 8.6585 (0.94%) 2.91
32 17.7835 1.01 17.175 (3.42%) 1.38
64 36.1901 0.65 35.5294 (1.83%) 1.02
128 72.6585 0.51 71.8821 (1.07%) 1.05

Following are the schbench performance numbers with SIS_CORE false and
SIS_PROP false. This recovers the latency increase by having SIS_CORE
false.

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile baseline %stdev patch %stdev
50 94 2.82 93.33 (0.71%) 1.24
75 124 2.13 122.67 (1.08%) 1.7
90 152 1.74 149.33 (1.75%) 2.35
95 171 2.11 167 (2.34%) 2.74
99 512.67 104.96 206 (59.82%) 8.86
99.5 2296 82.55 3121.67 (-35.96%) 97.37
99.9 12517.33 2.38 12592 (-0.6%) 1.67

Changes since v1
- Compute the upper and lower limit based on number of cpus in a core
- Split up the search limit and search window rotation into separate
patches
- Add new sched feature to have option of disabling idle core search

subhra mazumdar (5):
sched: limit cpu search in select_idle_cpu
sched: introduce per-cpu var next_cpu to track search limit
sched: rotate the cpu search window for better spread
sched: add sched feature to disable idle core search
sched: SIS_CORE to disable idle core search

kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 31 +++++++++++++++++++++++--------
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
4 files changed, 27 insertions(+), 8 deletions(-)

--
2.9.3



2018-06-29 06:25:14

by Subhra Mazumdar

[permalink] [raw]
Subject: [PATCH 1/5] sched: limit cpu search in select_idle_cpu

Put upper and lower limit on cpu search of select_idle_cpu. The lower limit
is amount of cpus in a core while upper limit is twice that. This ensures
for any architecture we will usually search beyond a core. The upper limit
also helps in keeping the search cost low and constant.

Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/fair.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05..7243146 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
- int cpu, nr = INT_MAX;
+ int cpu, limit, floor, nr = INT_MAX;

this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6390,10 +6390,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t

if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
- if (span_avg > 4*avg_cost)
+ floor = cpumask_weight(topology_sibling_cpumask(target));
+ if (floor < 2)
+ floor = 2;
+ limit = 2*floor;
+ if (span_avg > floor*avg_cost) {
nr = div_u64(span_avg, avg_cost);
- else
- nr = 4;
+ if (nr > limit)
+ nr = limit;
+ } else {
+ nr = floor;
+ }
}

time = local_clock();
--
2.9.3


2018-06-29 06:25:27

by Subhra Mazumdar

[permalink] [raw]
Subject: [PATCH 5/5] sched: SIS_CORE to disable idle core search

Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, removing it improves
throughput. Also there are workloads where disabling it can hurt latency,
so need to have an option.

Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 574cb14..33fbc47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6464,9 +6464,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;

- i = select_idle_core(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
+ if (sched_feat(SIS_CORE)) {
+ i = select_idle_core(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }

i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
--
2.9.3


2018-06-29 06:25:34

by Subhra Mazumdar

[permalink] [raw]
Subject: [PATCH 3/5] sched: rotate the cpu search window for better spread

Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/fair.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7243146..574cb14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
- int cpu, limit, floor, nr = INT_MAX;
+ int cpu, limit, floor, target_tmp, nr = INT_MAX;

this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6403,9 +6403,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
}
}

+ if (per_cpu(next_cpu, target) != -1)
+ target_tmp = per_cpu(next_cpu, target);
+ else
+ target_tmp = target;
+
time = local_clock();

- for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+ for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+ per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
--
2.9.3


2018-06-29 06:26:19

by Subhra Mazumdar

[permalink] [raw]
Subject: [PATCH 4/5] sched: add sched feature to disable idle core search

Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/features.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae848..de15733 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_AVG_CPU, false)
SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.9.3


2018-06-29 06:26:55

by Subhra Mazumdar

[permalink] [raw]
Subject: [PATCH 2/5] sched: introduce per-cpu var next_cpu to track search limit

Introduce a per-cpu variable to track the limit upto which idle cpu search
was done in select_idle_cpu(). This will help to start the search next time
from there. This is necessary for rotating the search window over entire
LLC domain.

Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/core.c | 2 ++
kernel/sched/sched.h | 1 +
2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8d59b25..b3e4ec1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -20,6 +20,7 @@
#include <trace/events/sched.h>

DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);

#if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
/*
@@ -5996,6 +5997,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
struct rq *rq;

+ per_cpu(next_cpu, i) = -1;
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4..eb12b50 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -912,6 +912,7 @@ static inline void update_idle_core(struct rq *rq) { }
#endif

DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);

#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
#define this_rq() this_cpu_ptr(&runqueues)
--
2.9.3