Currently select_idle_sibling() first tries to find a fully idle core using
select_idle_core(), which can potentially search all cores; if that fails, it
picks any idle cpu via select_idle_cpu(), which in turn can potentially search
all cpus in the LLC domain. This doesn't scale for large LLC domains and will
only get worse as the number of cores grows.
This patch series addresses the scalability problem by:
-Removing select_idle_core(), which can potentially scan the full LLC domain
even when there is only one idle core, which doesn't scale
-Lowering the lower limit of the nr variable in select_idle_cpu() and also
setting an upper limit to restrict the search time
Additionally, it introduces a new per-cpu variable, next_cpu, to track where
the previous search ended, so that each search starts from where the last one
left off. This rotating search window over the cpus in the LLC domain ensures
that idle cpus are eventually found even under high load.
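A minimal sketch of the rotating window, abridged from patch 3 below (the
clamping of nr and the scan cost accounting are left out here):

	/* resume where the last scan issued from this cpu stopped */
	if (per_cpu(next_cpu, target) != -1)
		start = per_cpu(next_cpu, target);
	else
		start = target;

	for_each_cpu_wrap(cpu, sched_domain_span(sd), start) {
		per_cpu(next_cpu, target) = cpu;	/* remember the limit */
		if (!--nr)				/* nr is kept between 2 and 4 */
			return -1;
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (available_idle_cpu(cpu))
			return cpu;
	}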
Following are the performance numbers with various benchmarks.
Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups baseline %stdev patch %stdev
1 0.5742 21.13 0.5334 (7.10%) 5.2
2 0.5776 7.87 0.5393 (6.63%) 6.39
4 0.9578 1.12 0.9537 (0.43%) 1.08
8 1.7018 1.35 1.682 (1.16%) 1.33
16 2.9955 1.36 2.9849 (0.35%) 0.96
32 5.4354 0.59 5.3308 (1.92%) 0.60
Sysbench MySQL on 1 socket, 6 core and 12 threads Intel x86 machine
(higher is better):
threads baseline patch
2 49.53 49.83 (0.61%)
4 89.07 90 (1.05%)
8 149 154 (3.31%)
16 240 246 (2.56%)
32 357 351 (-1.69%)
64 428 428 (-0.03%)
128 473 469 (-0.92%)
Sysbench PostgreSQL on 1 socket, 6 core and 12 threads Intel x86 machine
(higher is better):
threads baseline patch
2 68.35 70.07 (2.51%)
4 93.53 92.54 (-1.05%)
8 125 127 (1.16%)
16 145 146 (0.92%)
32 158 156 (-1.24%)
64 160 160 (0.47%)
Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users baseline %stdev patch %stdev
20 1 1.35 1.0075 (0.75%) 0.71
40 1 0.42 0.9971 (-0.29%) 0.26
60 1 1.54 0.9955 (-0.45%) 0.83
80 1 0.58 1.0059 (0.59%) 0.59
100 1 0.77 1.0201 (2.01%) 0.39
120 1 0.35 1.0145 (1.45%) 1.41
140 1 0.19 1.0325 (3.25%) 0.77
160 1 0.09 1.0277 (2.77%) 0.57
180 1 0.99 1.0249 (2.49%) 0.79
200 1 1.03 1.0133 (1.33%) 0.77
220 1 1.69 1.0317 (3.17%) 1.41
Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline %stdev patch %stdev
8 49.47 0.35 50.96 (3.02%) 0.12
16 95.28 0.77 99.01 (3.92%) 0.14
32 156.77 1.17 180.64 (15.23%) 1.05
48 193.24 0.22 214.7 (11.1%) 1
64 216.21 9.33 252.81 (16.93%) 1.68
128 379.62 10.29 397.47 (4.75%) 0.41
Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline patch
1 627.62 629.14 (0.24%)
2 1153.45 1179.9 (2.29%)
4 2060.29 2051.62 (-0.42%)
8 2724.41 2609.4 (-4.22%)
16 2987.56 2891.54 (-3.21%)
32 2375.82 2345.29 (-1.29%)
64 1963.31 1903.61 (-3.04%)
128 1546.01 1513.17 (-2.12%)
Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline patch
1 279.33 285.154 (2.08%)
2 545.961 572.538 (4.87%)
4 1081.06 1126.51 (4.2%)
8 2158.47 2234.78 (3.53%)
16 4223.78 4358.11 (3.18%)
32 7117.08 8022.19 (12.72%)
64 8947.28 10719.7 (19.81%)
128 15976.7 17531.2 (9.73%)
Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 256 (higher is better):
clients baseline %stdev patch %stdev
1 2699 4.86 2697 (-0.1%) 3.74
10 18832 0 18830 (0%) 0.01
100 18830 0.05 18827 (0%) 0.08
Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 1K (higher is better):
clients baseline %stdev patch %stdev
1 9414 0.02 9414 (0%) 0.01
10 18832 0 18832 (0%) 0
100 18830 0.05 18829 (0%) 0.04
Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 4K (higher is better):
clients baseline %stdev patch %stdev
1 9414 0.01 9414 (0%) 0
10 18832 0 18832 (0%) 0
100 18829 0.04 18833 (0%) 0
Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 64K (higher is better):
clients baseline %stdev patch %stdev
1 9415 0.01 9415 (0%) 0
10 18832 0 18832 (0%) 0
100 18830 0.04 18833 (0%) 0
Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 1M (higher is better):
clients baseline %stdev patch %stdev
1 9415 0.01 9415 (0%) 0.01
10 18832 0 18832 (0%) 0
100 18830 0.04 18819 (-0.1%) 0.13
JBB on 2 socket, 28 core and 56 threads Intel x86 machine
(higher is better):
baseline %stdev patch %stdev
jops 60049 0.65 60191 (0.2%) 0.99
critical jops 29689 0.76 29044 (-2.2%) 1.46
Schbench on 2 socket, 24 core and 48 threads Intel x86 machine with 24
tasks (lower is better):
percentile baseline %stdev patch %stdev
50 5007 0.16 5003 (0.1%) 0.12
75 10000 0 10000 (0%) 0
90 16992 0 16998 (0%) 0.12
95 21984 0 22043 (-0.3%) 0.83
99 34229 1.2 34069 (0.5%) 0.87
99.5 39147 1.1 38741 (1%) 1.1
99.9 49568 1.59 49579 (0%) 1.78
Ebizzy on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads baseline %stdev patch %stdev
1 26477 2.66 26646 (0.6%) 2.81
2 52303 1.72 52987 (1.3%) 1.59
4 100854 2.48 101824 (1%) 2.42
8 188059 6.91 189149 (0.6%) 1.75
16 328055 3.42 333963 (1.8%) 2.03
32 504419 2.23 492650 (-2.3%) 1.76
88 534999 5.35 569326 (6.4%) 3.07
156 541703 2.42 544463 (0.5%) 2.17
NAS: The whole suite of NAS benchmarks was run on a 2 socket, 36 core and 72
threads Intel x86 machine, with no statistically significant regressions and
improvements in some cases. I am not listing the results because there are too
many data points.
subhra mazumdar (3):
sched: remove select_idle_core() for scalability
sched: introduce per-cpu var next_cpu to track search limit
sched: limit cpu search and rotate search window for scalability
include/linux/sched/topology.h | 1 -
kernel/sched/core.c | 2 +
kernel/sched/fair.c | 116 +++++------------------------------------
kernel/sched/idle.c | 1 -
kernel/sched/sched.h | 11 +---
5 files changed, 17 insertions(+), 114 deletions(-)
--
2.9.3
select_idle_core() can potentially search all cpus to find a fully idle
core even if there is only one such core. Removing it is necessary to achieve
scalability in the fast path.
Signed-off-by: subhra mazumdar <[email protected]>
---
include/linux/sched/topology.h | 1 -
kernel/sched/fair.c | 97 ------------------------------------------
kernel/sched/idle.c | 1 -
kernel/sched/sched.h | 10 -----
4 files changed, 109 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 2634774..ac7944d 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,7 +71,6 @@ struct sched_group;
struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
- int has_idle_cores;
};
struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e..d1d4769 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6239,94 +6239,6 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
#ifdef CONFIG_SCHED_SMT
-static inline void set_idle_cores(int cpu, int val)
-{
- struct sched_domain_shared *sds;
-
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- WRITE_ONCE(sds->has_idle_cores, val);
-}
-
-static inline bool test_idle_cores(int cpu, bool def)
-{
- struct sched_domain_shared *sds;
-
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- return READ_ONCE(sds->has_idle_cores);
-
- return def;
-}
-
-/*
- * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
- *
- * Since SMT siblings share all cache levels, inspecting this limited remote
- * state should be fairly cheap.
- */
-void __update_idle_core(struct rq *rq)
-{
- int core = cpu_of(rq);
- int cpu;
-
- rcu_read_lock();
- if (test_idle_cores(core, true))
- goto unlock;
-
- for_each_cpu(cpu, cpu_smt_mask(core)) {
- if (cpu == core)
- continue;
-
- if (!idle_cpu(cpu))
- goto unlock;
- }
-
- set_idle_cores(core, 1);
-unlock:
- rcu_read_unlock();
-}
-
-/*
- * Scan the entire LLC domain for idle cores; this dynamically switches off if
- * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
- */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
-{
- struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int core, cpu;
-
- if (!static_branch_likely(&sched_smt_present))
- return -1;
-
- if (!test_idle_cores(target, false))
- return -1;
-
- cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
-
- for_each_cpu_wrap(core, cpus, target) {
- bool idle = true;
-
- for_each_cpu(cpu, cpu_smt_mask(core)) {
- cpumask_clear_cpu(cpu, cpus);
- if (!idle_cpu(cpu))
- idle = false;
- }
-
- if (idle)
- return core;
- }
-
- /*
- * Failed to find an idle core; stop looking for one.
- */
- set_idle_cores(target, 0);
-
- return -1;
-}
-
/*
* Scan the local SMT mask for idle CPUs.
*/
@@ -6349,11 +6261,6 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
#else /* CONFIG_SCHED_SMT */
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
-{
- return -1;
-}
-
static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
{
return -1;
@@ -6451,10 +6358,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;
- i = select_idle_core(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
-
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1a3e9bd..7ca8e18 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -392,7 +392,6 @@ static struct task_struct *
pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
put_prev_task(rq, prev);
- update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
return rq->idle;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c2..3f1874c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -899,16 +899,6 @@ static inline int cpu_of(struct rq *rq)
extern struct static_key_false sched_smt_present;
-extern void __update_idle_core(struct rq *rq);
-
-static inline void update_idle_core(struct rq *rq)
-{
- if (static_branch_unlikely(&sched_smt_present))
- __update_idle_core(rq);
-}
-
-#else
-static inline void update_idle_core(struct rq *rq) { }
#endif
DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
--
2.9.3
Introduce a per-cpu variable to track the limit up to which the idle cpu
search was done in select_idle_cpu(). This lets the next search start from
there. This is necessary for rotating the search window over the entire
LLC domain.
Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/core.c | 2 ++
kernel/sched/sched.h | 1 +
2 files changed, 3 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aae..cd5c08d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -17,6 +17,7 @@
#include <trace/events/sched.h>
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
#if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
/*
@@ -6018,6 +6019,7 @@ void __init sched_init(void)
struct rq *rq;
rq = cpu_rq(i);
+ per_cpu(next_cpu, i) = -1;
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
rq->calc_load_active = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3f1874c..a2db041 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -902,6 +902,7 @@ extern struct static_key_false sched_smt_present;
#endif
DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
#define this_rq() this_cpu_ptr(&runqueues)
--
2.9.3
Lower the lower limit of idle cpu search in select_idle_cpu() and also put
an upper limit. This helps in scalability of the search by restricting the
search window. Also rotating the search window with help of next_cpu
ensures any idle cpu is eventually found in case of high load.
Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/fair.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1d4769..62d585b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6279,7 +6279,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
- int cpu, nr = INT_MAX;
+ int cpu, target_tmp, nr = INT_MAX;
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6297,15 +6297,24 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
- if (span_avg > 4*avg_cost)
+ if (span_avg > 2*avg_cost) {
nr = div_u64(span_avg, avg_cost);
- else
- nr = 4;
+ if (nr > 4)
+ nr = 4;
+ } else {
+ nr = 2;
+ }
}
+ if (per_cpu(next_cpu, target) != -1)
+ target_tmp = per_cpu(next_cpu, target);
+ else
+ target_tmp = target;
+
time = local_clock();
- for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+ for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+ per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
--
2.9.3
On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
> + if (per_cpu(next_cpu, target) != -1)
> + target_tmp = per_cpu(next_cpu, target);
> + else
> + target_tmp = target;
> +
This one; what's the point here?
On Mon, Apr 23, 2018 at 05:41:15PM -0700, subhra mazumdar wrote:
> @@ -17,6 +17,7 @@
> #include <trace/events/sched.h>
>
> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
>
> #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
> /*
> @@ -6018,6 +6019,7 @@ void __init sched_init(void)
> struct rq *rq;
>
> rq = cpu_rq(i);
> + per_cpu(next_cpu, i) = -1;
If you leave it uninitialized it'll be 0, and we can avoid that extra
branch in the next patch, no?
On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
> Lower the lower limit of idle cpu search in select_idle_cpu() and also put
> an upper limit. This helps in scalability of the search by restricting the
> search window. Also rotating the search window with help of next_cpu
> ensures any idle cpu is eventually found in case of high load.
So this patch does 2 (possibly 3) things, that's not good.
On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
> Lower the lower limit of idle cpu search in select_idle_cpu() and also put
> an upper limit. This helps in scalability of the search by restricting the
> search window.
> @@ -6297,15 +6297,24 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> if (sched_feat(SIS_PROP)) {
> u64 span_avg = sd->span_weight * avg_idle;
> - if (span_avg > 4*avg_cost)
> + if (span_avg > 2*avg_cost) {
> nr = div_u64(span_avg, avg_cost);
> - else
> - nr = 4;
> + if (nr > 4)
> + nr = 4;
> + } else {
> + nr = 2;
> + }
> }
Why do you need to put a max on? Why isn't the proportional thing
working as is? (is the average no good because of big variance or what)
Again, why do you need to lower the min; what's wrong with 4?
The reason I picked 4 is that many laptops have 4 CPUs and desktops
really want to avoid queueing if at all possible.
On Mon, Apr 23, 2018 at 05:41:14PM -0700, subhra mazumdar wrote:
> select_idle_core() can potentially search all cpus to find a fully idle
> core even if there is only one such core. Removing it is necessary to achieve
> scalability in the fast path.
So this removes the whole core awareness from the wakeup path; this
needs far more justification.
In general running on pure cores is much faster than running on threads.
If you plot performance numbers there's almost always a fairly
significant drop in slope at the moment when we run out of cores and
start using threads.
Also, depending on cpu enumeration, your next patch might not even leave
the core scanning for idle CPUs.
Now, typically on Intel systems, we first enumerate cores and then
siblings, but I've seen Intel systems that don't do this and enumerate
all threads together. Also other architectures are known to iterate full
cores together, both s390 and Power for example do this.
So by only doing a linear scan on CPU number you will actually fill
cores instead of equally spreading across cores. Worse still, by
limiting the scan to _4_ you only barely even get onto a next core for
SMT4 hardware, never mind SMT8.
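As a toy user-space illustration (hypothetical 2-core SMT4 part, made-up
enumerations): a linear scan of 4 cpu ids starting at cpu 0 reaches both cores
when siblings are spread out in the id space, but never leaves core 0 when
whole cores are enumerated together.

#include <stdio.h>

/* hypothetical 8-cpu box: 2 cores, 4 SMT threads per core */

/* siblings spread out in the id space: cpu i sits on core i % 2 */
static int core_spread(int cpu)  { return cpu % 2; }

/* whole cores enumerated together: cpus 0-3 are core 0, cpus 4-7 are core 1 */
static int core_grouped(int cpu) { return cpu / 4; }

int main(void)
{
	for (int cpu = 0; cpu < 4; cpu++)	/* scan limited to 4 cpu ids */
		printf("cpu %d: spread -> core %d, grouped -> core %d\n",
		       cpu, core_spread(cpu), core_grouped(cpu));
	return 0;
}

With the grouped enumeration the 4-wide scan above never gets past core 0.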
So while I'm not averse to limiting the empty core search, I do feel it
is important to have. Overloading cores when you don't have to is not
good.
On 04/24/2018 05:46 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 05:41:14PM -0700, subhra mazumdar wrote:
>> select_idle_core() can potentially search all cpus to find a fully idle
>> core even if there is only one such core. Removing it is necessary to achieve
>> scalability in the fast path.
> So this removes the whole core awareness from the wakeup path; this
> needs far more justification.
>
> In general running on pure cores is much faster than running on threads.
> If you plot performance numbers there's almost always a fairly
> significant drop in slope at the moment when we run out of cores and
> start using threads.
The only justification I have is that almost all of the benchmarks I ran
improved, most importantly our internal Oracle DB tests, which we care about
a lot. So what you said makes sense in theory but is not borne out by real
world results. This indicates that threads of these benchmarks care more
about running immediately on any idle cpu rather than spending time to find
fully idle core to run on.
> Also, depending on cpu enumeration, your next patch might not even leave
> the core scanning for idle CPUs.
>
> Now, typically on Intel systems, we first enumerate cores and then
> siblings, but I've seen Intel systems that don't do this and enumerate
> all threads together. Also other architectures are known to iterate full
> cores together, both s390 and Power for example do this.
>
> So by only doing a linear scan on CPU number you will actually fill
> cores instead of equally spreading across cores. Worse still, by
> limiting the scan to _4_ you only barely even get onto a next core for
> SMT4 hardware, never mind SMT8.
Again this doesn't matter for the benchmarks I ran. Most are happy to make
the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the fact that
the scan window is rotated over all cpus, so idle cpus will be found soon.
There is also stealing by idle cpus. Also, this was an RFT, so I request that
this be tested on other architectures like SMT4/SMT8.
>
> So while I'm not averse to limiting the empty core search, I do feel it
> is important to have. Overloading cores when you don't have to is not
> good.
Can we have a config or a way for enabling/disabling select_idle_core?
On 04/24/2018 05:47 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 05:41:15PM -0700, subhra mazumdar wrote:
>> @@ -17,6 +17,7 @@
>> #include <trace/events/sched.h>
>>
>> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>> +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
>>
>> #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
>> /*
>> @@ -6018,6 +6019,7 @@ void __init sched_init(void)
>> struct rq *rq;
>>
>> rq = cpu_rq(i);
>> + per_cpu(next_cpu, i) = -1;
> If you leave it uninitialized it'll be 0, and we can avoid that extra
> branch in the next patch, no?
0 can be a valid cpu id and I wanted to distinguish the first time. The branch
predictor will be fully trained, so it will not have any cost.
On 04/24/2018 05:48 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
>> + if (per_cpu(next_cpu, target) != -1)
>> + target_tmp = per_cpu(next_cpu, target);
>> + else
>> + target_tmp = target;
>> +
> This one; what's the point here?
I want to start the search from target the first time and from next_cpu from
then on. If this doesn't make any difference in performance I can change it,
but that will require re-running all the tests.
On 04/24/2018 05:48 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
>> Lower the lower limit of idle cpu search in select_idle_cpu() and also put
>> an upper limit. This helps in scalability of the search by restricting the
>> search window. Also rotating the search window with help of next_cpu
>> ensures any idle cpu is eventually found in case of high load.
> So this patch does 2 (possibly 3) things, that's not good.
During testing I did try restricting only the search window first. That alone
wasn't enough to give the full benefit; rotating the search window was
essential to get the best results. I will break this up in the next version.
On 04/24/2018 05:53 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
>> Lower the lower limit of idle cpu search in select_idle_cpu() and also put
>> an upper limit. This helps in scalability of the search by restricting the
>> search window.
>> @@ -6297,15 +6297,24 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>
>> if (sched_feat(SIS_PROP)) {
>> u64 span_avg = sd->span_weight * avg_idle;
>> - if (span_avg > 4*avg_cost)
>> + if (span_avg > 2*avg_cost) {
>> nr = div_u64(span_avg, avg_cost);
>> - else
>> - nr = 4;
>> + if (nr > 4)
>> + nr = 4;
>> + } else {
>> + nr = 2;
>> + }
>> }
> Why do you need to put a max on? Why isn't the proportional thing
> working as is? (is the average no good because of big variance or what)
Firstly the choosing of 512 seems arbitrary. Secondly the logic here is
that the enqueuing cpu should search up to time it can get work itself.
Why is that the optimal amount to search?
>
> Again, why do you need to lower the min; what's wrong with 4?
>
> The reason I picked 4 is that many laptops have 4 CPUs and desktops
> really want to avoid queueing if at all possible.
To find the optimum upper and lower limit I varied them over many
combinations. 4 and 2 gave the best results across most benchmarks.
On Tue, Apr 24, 2018 at 05:10:34PM -0700, Subhra Mazumdar wrote:
> On 04/24/2018 05:53 AM, Peter Zijlstra wrote:
> > Why do you need to put a max on? Why isn't the proportional thing
> > working as is? (is the average no good because of big variance or what)
> Firstly the choosing of 512 seems arbitrary.
It is; it is a crud attempt to deal with big variance. The comment says
as much.
> Secondly the logic here is that the enqueuing cpu should search up to
> time it can get work itself. Why is that the optimal amount to
> search?
1/512-th of the time in fact, per the above random number, but yes.
Because searching for longer than we're expecting to be idle for is
clearly bad, at that point we're inhibiting doing useful work.
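To make that concrete with made-up numbers (nothing measured, just the
existing SIS_PROP computation rendered in user space):

#include <stdio.h>

/* toy rendering of the current SIS_PROP clamp; inputs are hypothetical */
static unsigned long long sis_prop_nr(unsigned long long avg_idle_ns,
				      unsigned long long avg_cost_ns,
				      unsigned int span_weight)
{
	unsigned long long span_avg = span_weight * (avg_idle_ns / 512);

	if (span_avg > 4 * avg_cost_ns)
		return span_avg / avg_cost_ns;	/* proportional nr of cpus */
	return 4;				/* current lower bound */
}

int main(void)
{
	/* 88-cpu LLC, ~1us of scan cost per cpu */
	printf("%llu\n", sis_prop_nr(500000, 1000, 88));	/* 85: scan nearly everything */
	printf("%llu\n", sis_prop_nr(20000, 1000, 88));		/* 4: fall back to the floor */
	return 0;
}

With plenty of expected idle time the proportional formula already covers most
of the LLC; the floor of 4 only kicks in when avg_idle is small.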
But while thinking about all this, I think I've spotted a few more
issues, aside from the variance:
Firstly, while avg_idle estimates the average duration for _when_ we go
idle, it doesn't give a good measure when we do not in fact go idle. So
when we get wakeups while fully busy, avg_idle is a poor measure.
Secondly, the number of wakeups performed is also important. If we have
a lot of wakeups, we need to look at aggregate wakeup time over a
period. Not just single wakeup time.
And thirdly, we're sharing the idle duration with newidle balance.
And I think the 512 is a result of me not having recognised these
additional issues when looking at the traces, I saw variance and left it
there.
This leaves me thinking we need a better estimator for wakeups. Because
if there really is significant idle time, not looking for idle CPUs to
run on is bad. Placing that upper limit, especially such a low one, is
just an indication of failure.
I'll see if I can come up with something.
On Tue, Apr 24, 2018 at 02:45:50PM -0700, Subhra Mazumdar wrote:
> So what you said makes sense in theory but is not borne out by real
> world results. This indicates that threads of these benchmarks care more
> about running immediately on any idle cpu rather than spending time to find
> fully idle core to run on.
But you only ran on Intel, which enumerates siblings far apart in the
cpuid space. Which is not something we should rely on.
> > So by only doing a linear scan on CPU number you will actually fill
> > cores instead of equally spreading across cores. Worse still, by
> > limiting the scan to _4_ you only barely even get onto a next core for
> > SMT4 hardware, never mind SMT8.
> Again this doesn't matter for the benchmarks I ran. Most are happy to make
> the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the fact that
> the scan window is rotated over all cpus, so idle cpus will be found soon.
You've not been reading well. The Intel machine you tested this on most
likely doesn't suffer that problem because of the way it happens to
iterate SMT threads.
How does Sparc iterate its SMT siblings in cpuid space?
Also, your benchmarks chose an unfortunate nr of threads vs topology.
The 2^n thing chosen never hits the 100% core case (6,22 resp.).
> > So while I'm not averse to limiting the empty core search, I do feel it
> > is important to have. Overloading cores when you don't have to is not
> > good.
> Can we have a config or a way for enabling/disabling select_idle_core?
I like Rohit's suggestion of folding select_idle_core and
select_idle_cpu much better, then it stays SMT aware.
Something like the completely untested patch below.
---
include/linux/sched/topology.h | 1 -
kernel/sched/fair.c | 148 +++++++++++++----------------------------
kernel/sched/idle.c | 1 -
kernel/sched/sched.h | 10 ---
4 files changed, 47 insertions(+), 113 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..ac7944dd8bc6 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,7 +71,6 @@ struct sched_group;
struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
- int has_idle_cores;
};
struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e7ab9b..95fed8dcea7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6239,124 +6239,72 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
#ifdef CONFIG_SCHED_SMT
-static inline void set_idle_cores(int cpu, int val)
-{
- struct sched_domain_shared *sds;
-
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- WRITE_ONCE(sds->has_idle_cores, val);
-}
-
-static inline bool test_idle_cores(int cpu, bool def)
-{
- struct sched_domain_shared *sds;
-
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- return READ_ONCE(sds->has_idle_cores);
-
- return def;
-}
-
-/*
- * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
- *
- * Since SMT siblings share all cache levels, inspecting this limited remote
- * state should be fairly cheap.
- */
-void __update_idle_core(struct rq *rq)
-{
- int core = cpu_of(rq);
- int cpu;
-
- rcu_read_lock();
- if (test_idle_cores(core, true))
- goto unlock;
-
- for_each_cpu(cpu, cpu_smt_mask(core)) {
- if (cpu == core)
- continue;
-
- if (!idle_cpu(cpu))
- goto unlock;
- }
-
- set_idle_cores(core, 1);
-unlock:
- rcu_read_unlock();
-}
-
/*
- * Scan the entire LLC domain for idle cores; this dynamically switches off if
- * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
+ * Scan the LLC domain for idlest cores; this is dynamically regulated by
+ * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
+ * average idle time for this rq (as found in rq->avg_idle).
*/
static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ int best_busy = UINT_MAX, best_cpu = -1;
+ struct sched_domain *this_sd;
+ u64 avg_cost, avg_idle;
+ int nr = INT_MAX;
+ u64 time, cost;
int core, cpu;
+ s64 delta;
- if (!static_branch_likely(&sched_smt_present))
+ this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+ if (!this_sd)
return -1;
- if (!test_idle_cores(target, false))
- return -1;
+ avg_idle = this_rq()->avg_idle / 512;
+ avg_cost = this_sd->avg_scan_cost + 1;
+
+ if (sched_feat(SIS_PROP)) {
+ u64 span_avg = sd->span_weight * avg_idle;
+ if (span_avg > 2*avg_cost)
+ nr = div_u64(span_avg, avg_cost);
+ else
+ nr = 2;
+ }
+
+ time = local_clock();
cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
for_each_cpu_wrap(core, cpus, target) {
- bool idle = true;
+ int first_idle = -1;
+ int busy = 0;
for_each_cpu(cpu, cpu_smt_mask(core)) {
cpumask_clear_cpu(cpu, cpus);
if (!idle_cpu(cpu))
- idle = false;
+ busy++;
+ else if (first_idle < 0)
+ first_idle = cpu;
}
- if (idle)
+ if (!busy)
return core;
- }
-
- /*
- * Failed to find an idle core; stop looking for one.
- */
- set_idle_cores(target, 0);
-
- return -1;
-}
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
- int cpu;
-
- if (!static_branch_likely(&sched_smt_present))
- return -1;
+ if (busy < best_busy) {
+ best_busy = busy;
+ best_cpu = cpu;
+ }
- for_each_cpu(cpu, cpu_smt_mask(target)) {
- if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
- continue;
- if (idle_cpu(cpu))
- return cpu;
+ if (!--nr)
+ break;
}
- return -1;
-}
-
-#else /* CONFIG_SCHED_SMT */
-
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
-{
- return -1;
-}
+ time = local_clock() - time;
+ cost = this_sd->avg_scan_cost;
+ // XXX we should normalize @time on @nr
+ delta = (s64)(time - cost) / 8;
+ this_sd->avg_scan_cost += delta;
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
- return -1;
+ return best_cpu;
}
#endif /* CONFIG_SCHED_SMT */
@@ -6451,15 +6399,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;
- i = select_idle_core(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
-
- i = select_idle_cpu(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
+#ifdef CONFIG_SCHED_SMT
+ if (static_branch_likely(&sched_smt_present))
+ i = select_idle_core(p, sd, target);
+ else
+#endif
+ i = select_idle_cpu(p, sd, target);
- i = select_idle_smt(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1a3e9bddd17b..7ca8e18b0018 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -392,7 +392,6 @@ static struct task_struct *
pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
put_prev_task(rq, prev);
- update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
return rq->idle;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c222ca2..3f1874c345b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -899,16 +899,6 @@ static inline int cpu_of(struct rq *rq)
extern struct static_key_false sched_smt_present;
-extern void __update_idle_core(struct rq *rq);
-
-static inline void update_idle_core(struct rq *rq)
-{
- if (static_branch_unlikely(&sched_smt_present))
- __update_idle_core(rq);
-}
-
-#else
-static inline void update_idle_core(struct rq *rq) { }
#endif
DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
On Wed, Apr 25, 2018 at 05:36:00PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 05:10:34PM -0700, Subhra Mazumdar wrote:
> > On 04/24/2018 05:53 AM, Peter Zijlstra wrote:
>
> > > Why do you need to put a max on? Why isn't the proportional thing
> > > working as is? (is the average no good because of big variance or what)
>
> > Firstly the choosing of 512 seems arbitrary.
>
> It is; it is a crud attempt to deal with big variance. The comment says
> as much.
>
> > Secondly the logic here is that the enqueuing cpu should search up to
> > time it can get work itself. Why is that the optimal amount to
> > search?
>
> 1/512-th of the time in fact, per the above random number, but yes.
> Because searching for longer than we're expecting to be idle for is
> clearly bad, at that point we're inhibiting doing useful work.
>
> But while thinking about all this, I think I've spotted a few more
> issues, aside from the variance:
>
> Firstly, while avg_idle estimates the average duration for _when_ we go
> idle, it doesn't give a good measure when we do not in fact go idle. So
> when we get wakeups while fully busy, avg_idle is a poor measure.
>
> Secondly, the number of wakeups performed is also important. If we have
> a lot of wakeups, we need to look at aggregate wakeup time over a
> period. Not just single wakeup time.
>
> And thirdly, we're sharing the idle duration with newidle balance.
>
> And I think the 512 is a result of me not having recognised these
> additional issues when looking at the traces, I saw variance and left it
> there.
>
>
> This leaves me thinking we need a better estimator for wakeups. Because
> if there really is significant idle time, not looking for idle CPUs to
> run on is bad. Placing that upper limit, especially such a low one, is
> just an indication of failure.
>
>
> I'll see if I can come up with something.
Something like so _could_ work. Again, completely untested. We give idle
time to wake_avg, and we subtract the select_idle_sibling 'runtime' from
wake_avg, such that when there are lots of wakeups we don't use more time
than there was reported idle time. And we age wake_avg, such that if
there hasn't been any idle time for a number of ticks (we've been real busy)
we also stop scanning wide.
But it could eat your granny and set your cat on fire :-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aaeebfcc..bc910e5776cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1671,6 +1671,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
if (rq->avg_idle > max)
rq->avg_idle = max;
+ rq->wake_stamp = jiffies;
+ rq->wake_avg = rq->avg_idle / 2;
+
rq->idle_stamp = 0;
}
#endif
@@ -6072,6 +6075,8 @@ void __init sched_init(void)
rq->online = 0;
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
+ rq->wake_stamp = jiffies;
+ rq->wake_avg = rq->avg_idle;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e7ab9b..fee31dbe15ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6369,7 +6369,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
{
struct sched_domain *this_sd;
+ unsigned long now = jiffies;
u64 avg_cost, avg_idle;
+ struct rq *this_rq;
u64 time, cost;
s64 delta;
int cpu, nr = INT_MAX;
@@ -6378,11 +6380,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (!this_sd)
return -1;
- /*
- * Due to large variance we need a large fuzz factor; hackbench in
- * particularly is sensitive here.
- */
- avg_idle = this_rq()->avg_idle / 512;
+ this_rq = this_rq();
+ if (sched_feat(SIS_NEW)) {
+ /* age the remaining idle time */
+ if (unlikely(this_rq->wake_stamp < now)) {
+ while (this_rq->wake_stamp < now && this_rq->wake_avg)
+ this_rq->wake_avg >>= 1;
+ }
+ avg_idle = this_rq->wake_avg;
+ } else {
+ avg_idle = this_rq->avg_idle / 512;
+ }
+
avg_cost = this_sd->avg_scan_cost + 1;
if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
@@ -6412,6 +6421,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
delta = (s64)(time - cost) / 8;
this_sd->avg_scan_cost += delta;
+ /* you can only spend the time once */
+ if (this_rq->wake_avg > time)
+ this_rq->wake_avg -= time;
+ else
+ this_rq->wake_avg = 0;
+
return cpu;
}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae8488039c..f5f86a59aac4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_AVG_CPU, false)
SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_NEW, false)
/*
* Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c222ca2..c4d6ddf907b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -831,6 +831,9 @@ struct rq {
u64 idle_stamp;
u64 avg_idle;
+ unsigned long wake_stamp;
+ u64 wake_avg;
+
/* This is used to determine avg_idle's max value */
u64 max_idle_balance_cost;
#endif
On 04/25/2018 10:49 AM, Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 02:45:50PM -0700, Subhra Mazumdar wrote:
>> So what you said makes sense in theory but is not borne out by real
>> world results. This indicates that threads of these benchmarks care more
>> about running immediately on any idle cpu rather than spending time to find
>> fully idle core to run on.
> But you only ran on Intel, which enumerates siblings far apart in the
> cpuid space. Which is not something we should rely on.
>
>>> So by only doing a linear scan on CPU number you will actually fill
>>> cores instead of equally spreading across cores. Worse still, by
>>> limiting the scan to _4_ you only barely even get onto a next core for
>>> SMT4 hardware, never mind SMT8.
>> Again this doesn't matter for the benchmarks I ran. Most are happy to make
>> the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the fact that
>> the scan window is rotated over all cpus, so idle cpus will be found soon.
> You've not been reading well. The Intel machine you tested this on most
> likely doesn't suffer that problem because of the way it happens to
> iterate SMT threads.
>
> How does Sparc iterate its SMT siblings in cpuid space?
SPARC enumerates the siblings of a core sequentially, although whether the
non-sequential enumeration on x86 is the reason for the improvements still
needs to be confirmed through tests. I don't have a SPARC test system handy
right now.
>
> Also, your benchmarks chose an unfortunate nr of threads vs topology.
> The 2^n thing chosen never hits the 100% core case (6,22 resp.).
>
>>> So while I'm not averse to limiting the empty core search, I do feel it
>>> is important to have. Overloading cores when you don't have to is not
>>> good.
>> Can we have a config or a way for enabling/disabling select_idle_core?
> I like Rohit's suggestion of folding select_idle_core and
> select_idle_cpu much better, then it stays SMT aware.
>
> Something like the completely untested patch below.
I tried both of the patches you suggested, the first with the merging of
select_idle_core and select_idle_cpu, the second with the new way of
calculating avg_idle, and finally both combined. I ran the following
benchmarks for each. The merge-only patch seems to be giving similar
improvements to my original patch for the Uperf and Oracle DB tests, but it
regresses for hackbench. If we can fix this I am OK with it. I can do a run
of the other benchmarks after that.
I also noticed a possible bug later in the merge code. Shouldn't it be:
if (busy < best_busy) {
best_busy = busy;
best_cpu = first_idle;
}
Unfortunately I noticed it after all runs.
merge:
Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups baseline %stdev patch %stdev
1 0.5742 21.13 0.5099 (11.2%) 2.24
2 0.5776 7.87 0.5385 (6.77%) 3.38
4 0.9578 1.12 1.0626 (-10.94%) 1.35
8 1.7018 1.35 1.8615 (-9.38%) 0.73
16 2.9955 1.36 3.2424 (-8.24%) 0.66
32 5.4354 0.59 5.749 (-5.77%) 0.55
Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline %stdev patch %stdev
8 49.47 0.35 49.98 (1.03%) 1.36
16 95.28 0.77 97.46 (2.29%) 0.11
32 156.77 1.17 167.03 (6.54%) 1.98
48 193.24 0.22 230.96 (19.52%) 2.44
64 216.21 9.33 299.55 (38.54%) 4
128 379.62 10.29 357.87 (-5.73%) 0.85
Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users baseline %stdev patch %stdev
20 1 1.35 0.9919 (-0.81%) 0.14
40 1 0.42 0.9959 (-0.41%) 0.72
60 1 1.54 0.9872 (-1.28%) 1.27
80 1 0.58 0.9925 (-0.75%) 0.5
100 1 0.77 1.0145 (1.45%) 1.29
120 1 0.35 1.0136 (1.36%) 1.15
140 1 0.19 1.0404 (4.04%) 0.91
160 1 0.09 1.0317 (3.17%) 1.41
180 1 0.99 1.0322 (3.22%) 0.51
200 1 1.03 1.0245 (2.45%) 0.95
220 1 1.69 1.0296 (2.96%) 2.83
new avg_idle:
Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups baseline %stdev patch %stdev
1 0.5742 21.13 0.5241 (8.73%) 8.26
2 0.5776 7.87 0.5436 (5.89%) 8.53
4 0.9578 1.12 0.989 (-3.26%) 1.9
8 1.7018 1.35 1.7568 (-3.23%) 1.22
16 2.9955 1.36 3.1119 (-3.89%) 0.92
32 5.4354 0.59 5.5889 (-2.82%) 0.64
Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline %stdev patch %stdev
8 49.47 0.35 48.11 (-2.75%) 0.29
16 95.28 0.77 93.67 (-1.68%) 0.68
32 156.77 1.17 158.28 (0.96%) 0.29
48 193.24 0.22 190.04 (-1.66%) 0.34
64 216.21 9.33 189.45 (-12.38%) 2.05
128 379.62 10.29 326.59 (-13.97%) 13.07
Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users baseline %stdev patch %stdev
20 1 1.35 1.0026 (0.26%) 0.25
40 1 0.42 0.9857 (-1.43%) 1.47
60 1 1.54 0.9903 (-0.97%) 0.99
80 1 0.58 0.9968 (-0.32%) 1.19
100 1 0.77 0.9933 (-0.67%) 0.53
120 1 0.35 0.9919 (-0.81%) 0.9
140 1 0.19 0.9915 (-0.85%) 0.36
160 1 0.09 0.9811 (-1.89%) 1.21
180 1 0.99 1.0002 (0.02%) 0.87
200 1 1.03 1.0037 (0.37%) 2.5
220 1 1.69 0.998 (-0.2%) 0.8
merge + new avg_idle:
Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups baseline %stdev patch %stdev
1 0.5742 21.13 0.6522 (-13.58%) 12.53
2 0.5776 7.87 0.7593 (-31.46%) 2.7
4 0.9578 1.12 1.0952 (-14.35%) 1.08
8 1.7018 1.35 1.8722 (-10.01%) 0.68
16 2.9955 1.36 3.2987 (-10.12%) 0.58
32 5.4354 0.59 5.7751 (-6.25%) 0.46
Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline %stdev patch %stdev
8 49.47 0.35 51.29 (3.69%) 0.86
16 95.28 0.77 98.95 (3.85%) 0.41
32 156.77 1.17 165.76 (5.74%) 0.26
48 193.24 0.22 234.25 (21.22%) 0.63
64 216.21 9.33 306.87 (41.93%) 2.11
128 379.62 10.29 355.93 (-6.24%) 8.28
Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users baseline %stdev patch %stdev
20 1 1.35 1.0085 (0.85%) 0.72
40 1 0.42 1.0017 (0.17%) 0.3
60 1 1.54 0.9974 (-0.26%) 1.18
80 1 0.58 1.0115 (1.15%) 0.93
100 1 0.77 0.9959 (-0.41%) 1.21
120 1 0.35 1.0034 (0.34%) 0.72
140 1 0.19 1.0123 (1.23%) 0.93
160 1 0.09 1.0057 (0.57%) 0.65
180 1 0.99 1.0195 (1.95%) 0.99
200 1 1.03 1.0474 (4.74%) 0.55
220 1 1.69 1.0392 (3.92%) 0.36
On Mon, Apr 30, 2018 at 04:38:42PM -0700, Subhra Mazumdar wrote:
> I also noticed a possible bug later in the merge code. Shouldn't it be:
>
> if (busy < best_busy) {
>         best_busy = busy;
>         best_cpu = first_idle;
> }
Uhh, quite. I did say it was completely untested, but yes.. /me dons the
brown paper bag.
On 05/01/2018 11:03 AM, Peter Zijlstra wrote:
> On Mon, Apr 30, 2018 at 04:38:42PM -0700, Subhra Mazumdar wrote:
>> I also noticed a possible bug later in the merge code. Shouldn't it be:
>>
>> if (busy < best_busy) {
>> best_busy = busy;
>> best_cpu = first_idle;
>> }
> Uhh, quite. I did say it was completely untested, but yes.. /me dons the
> brown paper bag.
I re-ran the test after fixing that bug but still get similar regressions
for hackbench, while similar improvements on Uperf. I didn't re-run the
Oracle DB tests but my guess is it will show similar improvement.
merge:
Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups baseline %stdev patch %stdev
1 0.5742 21.13 0.5131 (10.64%) 4.11
2 0.5776 7.87 0.5387 (6.73%) 2.39
4 0.9578 1.12 1.0549 (-10.14%) 0.85
8 1.7018 1.35 1.8516 (-8.8%) 1.56
16 2.9955 1.36 3.2466 (-8.38%) 0.42
32 5.4354 0.59 5.7738 (-6.23%) 0.38
Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline %stdev patch %stdev
8 49.47 0.35 51.1 (3.29%) 0.13
16 95.28 0.77 98.45 (3.33%) 0.61
32 156.77 1.17 170.97 (9.06%) 5.62
48 193.24 0.22 245.89 (27.25%) 7.26
64 216.21 9.33 316.43 (46.35%) 0.37
128 379.62 10.29 337.85 (-11%) 3.68
I tried using the next_cpu technique with the merge but it didn't help. I am
open to suggestions.
merge + next_cpu:
Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups baseline %stdev patch %stdev
1 0.5742 21.13 0.5107 (11.06%) 6.35
2 0.5776 7.87 0.5917 (-2.44%) 11.16
4 0.9578 1.12 1.0761 (-12.35%) 1.1
8 1.7018 1.35 1.8748 (-10.17%) 0.8
16 2.9955 1.36 3.2419 (-8.23%) 0.43
32 5.4354 0.59 5.6958 (-4.79%) 0.58
Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline %stdev patch %stdev
8 49.47 0.35 51.65 (4.41%) 0.26
16 95.28 0.77 99.8 (4.75%) 1.1
32 156.77 1.17 168.37 (7.4%) 0.6
48 193.24 0.22 228.8 (18.4%) 1.75
64 216.21 9.33 287.11 (32.79%) 10.82
128 379.62 10.29 346.22 (-8.8%) 4.7
Finally, there was an earlier suggestion by Peter to transpose the cpu offset
in select_task_rq_fair, which I had tried before but which also regressed on
hackbench. Just wanted to mention that so we have closure on it.
transpose cpu offset in select_task_rq_fair:
Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups baseline %stdev patch %stdev
1 0.5742 21.13 0.5251 (8.55%) 2.57
2 0.5776 7.87 0.5471 (5.28%) 11
4 0.9578 1.12 1.0148 (-5.95%) 1.97
8 1.7018 1.35 1.798 (-5.65%) 0.97
16 2.9955 1.36 3.088 (-3.09%) 2.7
32 5.4354 0.59 5.2815 (2.8%) 1.26
On 05/02/2018 02:58 PM, Subhra Mazumdar wrote:
>
>
> On 05/01/2018 11:03 AM, Peter Zijlstra wrote:
>> On Mon, Apr 30, 2018 at 04:38:42PM -0700, Subhra Mazumdar wrote:
>>> I also noticed a possible bug later in the merge code. Shouldn't it be:
>>>
>>> if (busy < best_busy) {
>>> best_busy = busy;
>>> best_cpu = first_idle;
>>> }
>> Uhh, quite. I did say it was completely untested, but yes.. /me dons the
>> brown paper bag.
> I re-ran the test after fixing that bug but still get similar regressions
> for hackbench, while similar improvements on Uperf. I didn't re-run the
> Oracle DB tests but my guess is it will show similar improvement.
I tried a few other combinations, including setting nr=2 exactly with the
folding of select_idle_cpu and select_idle_core, but I still get regressions
with hackbench. I also tried adding select_idle_smt (just for the sake of it,
since my patch retained it) but still see regressions with hackbench. In
all these tests Uperf and the Oracle DB tests gave similar improvements to my
original patch. This seems to indicate that sequential cpu ids hopping across
cores (as on x86) is important for hackbench. In that case, can we
consciously hop cores on all archs while searching a limited nr of cpus? We
can take the difference between the cpu id of the target cpu and the first
cpu of its smt core and apply that offset to the first cpu id of each smt
core to get the cpu we want to check. But we need an O(1) way of zeroing out
all the cpus of an smt core from the parent mask. This will work with either
kind of enumeration, contiguous or interleaved. Thoughts?
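Something like the following rough, untested sketch is what I have in mind.
It assumes the sched_domain_cores() mask from your experimental patch above
and that every core has its siblings at the same offsets from its first cpu
id; walking the core mask also avoids having to clear sibling bits from a
parent mask:

static int select_idle_cpu_hop(struct task_struct *p, struct sched_domain *sd,
			       int target, int nr)
{
	int offset = target - cpumask_first(cpu_smt_mask(target));
	int core, cpu;

	/* walk cores, not cpus, so the limited scan hops cores on any enumeration */
	for_each_cpu_wrap(core, sched_domain_cores(sd), target) {
		if (!--nr)
			return -1;

		/* probe the sibling at the same offset target has in its own core */
		cpu = cpumask_first(cpu_smt_mask(core)) + offset;
		if (cpu >= nr_cpu_ids || !cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (available_idle_cpu(cpu))
			return cpu;
	}

	return -1;
}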
On Wed, May 02, 2018 at 02:58:42PM -0700, Subhra Mazumdar wrote:
> I re-ran the test after fixing that bug but still get similar regressions
> for hackbench
> Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
> (lower is better):
> groups  baseline        %stdev  patch %stdev
> 1       0.5742          21.13   0.5131 (10.64%) 4.11
> 2       0.5776          7.87    0.5387 (6.73%) 2.39
> 4       0.9578          1.12    1.0549 (-10.14%) 0.85
> 8       1.7018          1.35    1.8516 (-8.8%) 1.56
> 16      2.9955          1.36    3.2466 (-8.38%) 0.42
> 32      5.4354          0.59    5.7738 (-6.23%) 0.38
On my IVB-EP (2 socket, 10 core/socket, 2 threads/core):
bench:
perf stat --null --repeat 10 -- perf bench sched messaging -g $i -t -l 10000 2>&1 | grep "seconds time elapsed"
config + results:
ORIG (SIS_PROP, shift=9)
1: 0.557325175 seconds time elapsed ( +- 0.83% )
2: 0.620646551 seconds time elapsed ( +- 1.46% )
5: 2.313514786 seconds time elapsed ( +- 2.11% )
10: 3.796233615 seconds time elapsed ( +- 1.57% )
20: 6.319403172 seconds time elapsed ( +- 1.61% )
40: 9.313219134 seconds time elapsed ( +- 1.03% )
PROP+AGE+ONCE shift=0
1: 0.559497993 seconds time elapsed ( +- 0.55% )
2: 0.631549599 seconds time elapsed ( +- 1.73% )
5: 2.195464815 seconds time elapsed ( +- 1.77% )
10: 3.703455811 seconds time elapsed ( +- 1.30% )
20: 6.440869566 seconds time elapsed ( +- 1.23% )
40: 9.537849253 seconds time elapsed ( +- 2.00% )
FOLD+AGE+ONCE+PONIES shift=0
1: 0.558893325 seconds time elapsed ( +- 0.98% )
2: 0.617426276 seconds time elapsed ( +- 1.07% )
5: 2.342727231 seconds time elapsed ( +- 1.34% )
10: 3.850449091 seconds time elapsed ( +- 1.07% )
20: 6.622412262 seconds time elapsed ( +- 0.85% )
40: 9.487138039 seconds time elapsed ( +- 2.88% )
FOLD+AGE+ONCE+PONIES+PONIES2 shift=0
10: 3.695294317 seconds time elapsed ( +- 1.21% )
Which seems to not hurt anymore.. can you confirm?
Also, I didn't run anything other than hackbench on it so far.
(sorry, the code is a right mess, it's what I ended up with after a day
of poking with no cleanups)
---
include/linux/sched/topology.h | 7 ++
kernel/sched/core.c | 8 ++
kernel/sched/fair.c | 201 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 9 ++
kernel/sched/sched.h | 3 +
kernel/sched/topology.c | 13 ++-
6 files changed, 228 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..1a53a805547e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -72,6 +72,8 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+
+ unsigned long core_mask[0];
};
struct sched_domain {
@@ -162,6 +164,11 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
return to_cpumask(sd->span);
}
+static inline struct cpumask *sched_domain_cores(struct sched_domain *sd)
+{
+ return to_cpumask(sd->shared->core_mask);
+}
+
extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
struct sched_domain_attr *dattr_new);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8d59b259af4a..2e2ee0df9e4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1674,6 +1674,12 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
if (rq->avg_idle > max)
rq->avg_idle = max;
+ rq->wake_stamp = jiffies;
+ rq->wake_avg = rq->avg_idle / 2;
+
+ if (sched_feat(SIS_TRACE))
+ trace_printk("delta: %Lu %Lu %Lu\n", delta, rq->avg_idle, rq->wake_avg);
+
rq->idle_stamp = 0;
}
#endif
@@ -6051,6 +6057,8 @@ void __init sched_init(void)
rq->online = 0;
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
+ rq->wake_stamp = jiffies;
+ rq->wake_avg = rq->avg_idle;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05aab7f..8fe1c2404092 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6361,6 +6361,16 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
#endif /* CONFIG_SCHED_SMT */
+static DEFINE_PER_CPU(int, sis_rotor);
+
+unsigned long sched_sis_shift = 9;
+unsigned long sched_sis_min = 2;
+unsigned long sched_sis_max = INT_MAX;
+
+module_param(sched_sis_shift, ulong, 0644);
+module_param(sched_sis_min, ulong, 0644);
+module_param(sched_sis_max, ulong, 0644);
+
/*
* Scan the LLC domain for idle CPUs; this is dynamically regulated by
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6372,17 +6382,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
- int cpu, nr = INT_MAX;
+ int cpu, best_cpu = -1, loops = 0, nr = sched_sis_max;
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
return -1;
- /*
- * Due to large variance we need a large fuzz factor; hackbench in
- * particularly is sensitive here.
- */
- avg_idle = this_rq()->avg_idle / 512;
+ if (sched_feat(SIS_AGE)) {
+ /* age the remaining idle time */
+ unsigned long now = jiffies;
+ struct rq *this_rq = this_rq();
+
+ if (unlikely(this_rq->wake_stamp < now)) {
+ while (this_rq->wake_stamp < now && this_rq->wake_avg) {
+ this_rq->wake_stamp++;
+ this_rq->wake_avg >>= 1;
+ }
+ }
+
+ avg_idle = this_rq->wake_avg;
+ } else {
+ avg_idle = this_rq()->avg_idle;
+ }
+
+ avg_idle >>= sched_sis_shift;
avg_cost = this_sd->avg_scan_cost + 1;
if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
@@ -6390,29 +6413,170 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
- if (span_avg > 4*avg_cost)
+ if (span_avg > 2*sched_sis_min*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
- nr = 4;
+ nr = 2*sched_sis_min;
}
time = local_clock();
+ if (sched_feat(SIS_ROTOR))
+ target = this_cpu_read(sis_rotor);
+
for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
- if (!--nr)
- return -1;
+ if (loops++ >= nr)
+ break;
+ this_cpu_write(sis_rotor, cpu);
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
- if (available_idle_cpu(cpu))
+ if (available_idle_cpu(cpu)) {
+ best_cpu = cpu;
+ break;
+ }
+ }
+
+ time = local_clock() - time;
+ time = div_u64(time, loops);
+ cost = this_sd->avg_scan_cost;
+ delta = (s64)(time - cost) / 8;
+ this_sd->avg_scan_cost += delta;
+
+ if (sched_feat(SIS_TRACE))
+ trace_printk("cpu: wake_avg: %Lu avg_idle: %Lu avg_idle: %Lu avg_cost: %Lu nr: %d loops: %d best_cpu: %d time: %Lu\n",
+ this_rq()->wake_avg, this_rq()->avg_idle,
+ avg_idle, avg_cost, nr, loops, cpu,
+ time);
+
+ if (sched_feat(SIS_ONCE)) {
+ struct rq *this_rq = this_rq();
+
+ if (this_rq->wake_avg > time)
+ this_rq->wake_avg -= time;
+ else
+ this_rq->wake_avg = 0;
+ }
+
+ return best_cpu;
+}
+
+
+/*
+ * Scan the LLC domain for idlest cores; this is dynamically regulated by
+ * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
+ * average idle time for this rq (as found in rq->avg_idle).
+ */
+static int select_idle_core_fold(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ int best_busy = INT_MAX, best_cpu = -1;
+ struct sched_domain *this_sd;
+ u64 avg_cost, avg_idle;
+ int nr = sched_sis_max, loops = 0;
+ u64 time, cost;
+ int core, cpu;
+ s64 delta;
+ bool has_idle_cores = true;
+
+ this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+ if (!this_sd)
+ return -1;
+
+ if (sched_feat(SIS_ROTOR))
+ target = this_cpu_read(sis_rotor);
+
+ if (sched_feat(SIS_PONIES))
+ has_idle_cores = test_idle_cores(target, false);
+
+ if (sched_feat(SIS_AGE)) {
+ /* age the remaining idle time */
+ unsigned long now = jiffies;
+ struct rq *this_rq = this_rq();
+
+ if (unlikely(this_rq->wake_stamp < now)) {
+ while (this_rq->wake_stamp < now && this_rq->wake_avg) {
+ this_rq->wake_stamp++;
+ this_rq->wake_avg >>= 1;
+ }
+ }
+
+ avg_idle = this_rq->wake_avg;
+ } else {
+ avg_idle = this_rq()->avg_idle;
+ }
+
+ avg_idle >>= sched_sis_shift;
+ avg_cost = this_sd->avg_scan_cost + 1;
+
+ if (sched_feat(SIS_PROP)) {
+ u64 span_avg = sd->span_weight * avg_idle;
+ if (span_avg > sched_sis_min*avg_cost)
+ nr = div_u64(span_avg, avg_cost);
+ else
+ nr = sched_sis_min;
+ }
+
+ time = local_clock();
+
+ for_each_cpu_wrap(core, sched_domain_cores(sd), target) {
+ int first_idle = -1;
+ int busy = 0;
+
+ if (loops++ >= nr)
break;
+
+ this_cpu_write(sis_rotor, core);
+
+ for (cpu = core; cpu < nr_cpumask_bits; cpu = cpumask_next(cpu, cpu_smt_mask(core))) {
+ if (!idle_cpu(cpu))
+ busy++;
+ else if (first_idle < 0 && cpumask_test_cpu(cpu, &p->cpus_allowed)) {
+ if (!has_idle_cores) {
+ best_cpu = cpu;
+ goto out;
+ }
+ first_idle = cpu;
+ }
+ }
+
+ if (first_idle < 0)
+ continue;
+
+ if (!busy) {
+ best_cpu = first_idle;
+ goto out;
+ }
+
+ if (busy < best_busy) {
+ best_busy = busy;
+ best_cpu = first_idle;
+ }
}
+ set_idle_cores(target, 0);
+
+out:
time = local_clock() - time;
+ time = div_u64(time, loops);
cost = this_sd->avg_scan_cost;
delta = (s64)(time - cost) / 8;
this_sd->avg_scan_cost += delta;
- return cpu;
+ if (sched_feat(SIS_TRACE))
+ trace_printk("fold: wake_avg: %Lu avg_idle: %Lu avg_idle: %Lu avg_cost: %Lu nr: %d loops: %d best_cpu: %d time: %Lu\n",
+ this_rq()->wake_avg, this_rq()->avg_idle,
+ avg_idle, avg_cost, nr, loops, best_cpu,
+ time);
+
+ if (sched_feat(SIS_ONCE)) {
+ struct rq *this_rq = this_rq();
+
+ if (this_rq->wake_avg > time)
+ this_rq->wake_avg -= time;
+ else
+ this_rq->wake_avg = 0;
+ }
+
+ return best_cpu;
}
/*
@@ -6451,7 +6615,17 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;
+ if (sched_feat(SIS_FOLD)) {
+ if (sched_feat(SIS_PONIES2) && !test_idle_cores(target, false))
+ i = select_idle_cpu(p, sd, target);
+ else
+ i = select_idle_core_fold(p, sd, target);
+ goto out;
+ }
+
i = select_idle_core(p, sd, target);
+ if (sched_feat(SIS_TRACE))
+ trace_printk("select_idle_core: %d\n", i);
if ((unsigned)i < nr_cpumask_bits)
return i;
@@ -6460,6 +6634,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return i;
i = select_idle_smt(p, sd, target);
+ if (sched_feat(SIS_TRACE))
+ trace_printk("select_idle_smt: %d\n", i);
+out:
if ((unsigned)i < nr_cpumask_bits)
return i;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae8488039c..bb572f949e5f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -58,6 +58,15 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(SIS_AVG_CPU, false)
SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_FOLD, true)
+SCHED_FEAT(SIS_AGE, true)
+SCHED_FEAT(SIS_ONCE, true)
+SCHED_FEAT(SIS_ROTOR, false)
+SCHED_FEAT(SIS_PONIES, false)
+SCHED_FEAT(SIS_PONIES2, true)
+
+SCHED_FEAT(SIS_TRACE, false)
+
/*
* Issue a WARN when we do multiple update_rq_clock() calls
* in a single rq->lock section. Default disabled because the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4d9ac7..240c35a4c6e0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -831,6 +831,9 @@ struct rq {
u64 idle_stamp;
u64 avg_idle;
+ unsigned long wake_stamp;
+ u64 wake_avg;
+
/* This is used to determine avg_idle's max value */
u64 max_idle_balance_cost;
#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 61a1125c1ae4..a47a6fc9796e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1189,9 +1189,20 @@ sd_init(struct sched_domain_topology_level *tl,
* instance.
*/
if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+ int core, smt;
+
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+
+ for_each_cpu(core, sched_domain_span(sd)) {
+ for_each_cpu(smt, cpu_smt_mask(core)) {
+ if (cpumask_test_cpu(smt, sched_domain_span(sd))) {
+ __cpumask_set_cpu(smt, sched_domain_cores(sd));
+ break;
+ }
+ }
+ }
}
sd->private = sdd;
@@ -1537,7 +1548,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sd, j) = sd;
- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
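
A note on the knobs introduced above, since the configurations quoted below
refer to them: the new SCHED_FEAT bits (SIS_FOLD, SIS_AGE, SIS_ONCE, SIS_ROTOR,
SIS_PONIES, SIS_PONIES2, SIS_TRACE) are toggled at runtime through the usual
/sys/kernel/debug/sched_features interface (CONFIG_SCHED_DEBUG), and "shift"
refers to the sched_sis_shift module parameter. The wake_avg bookkeeping behind
SIS_AGE/SIS_ONCE reduces to two small operations: halve the remaining idle-time
budget once per elapsed jiffy, then charge the (per-cpu averaged) scan time
against it after each search. A minimal standalone model of just that
bookkeeping follows; struct rq_model and the jiffy argument are stand-ins for
the kernel's struct rq and jiffies, not real kernel interfaces.

/*
 * Minimal standalone model of the SIS_AGE / SIS_ONCE bookkeeping in the
 * patch above; the types and values are stand-ins, not kernel code.
 */
#include <stdint.h>
#include <stdio.h>

struct rq_model {
	unsigned long wake_stamp;	/* jiffies value at the last aging */
	uint64_t wake_avg;		/* remaining idle-time budget (ns) */
};

/* SIS_AGE: halve the remaining budget once for every jiffy that passed */
static uint64_t aged_wake_avg(struct rq_model *rq, unsigned long now)
{
	while (rq->wake_stamp < now && rq->wake_avg) {
		rq->wake_stamp++;
		rq->wake_avg >>= 1;
	}
	return rq->wake_avg;
}

/* SIS_ONCE: charge the (per-cpu averaged) scan time against the budget */
static void charge_scan(struct rq_model *rq, uint64_t scan_ns)
{
	rq->wake_avg = rq->wake_avg > scan_ns ? rq->wake_avg - scan_ns : 0;
}

int main(void)
{
	struct rq_model rq = { .wake_stamp = 100, .wake_avg = 1 << 20 };

	/* three jiffies of inactivity decay the budget by a factor of 8 */
	printf("budget after aging: %llu\n",
	       (unsigned long long)aged_wake_avg(&rq, 103));

	/* each wakeup's scan eats into what is left until the next jiffy */
	charge_scan(&rq, 50000);
	printf("budget after scan:  %llu\n",
	       (unsigned long long)rq.wake_avg);
	return 0;
}

With SIS_PROP on top, this budget (shifted down by sched_sis_shift) is what
bounds how many cpus the next wakeup is allowed to scan.
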
On 05/29/2018 02:36 PM, Peter Zijlstra wrote:
> On Wed, May 02, 2018 at 02:58:42PM -0700, Subhra Mazumdar wrote:
>> I re-ran the test after fixing that bug but still get similar regressions
>> for hackbench
>> Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
>> (lower is better):
>> groups baseline %stdev patch %stdev
>> 1 0.5742 21.13 0.5131 (10.64%) 4.11
>> 2 0.5776 7.87 0.5387 (6.73%) 2.39
>> 4 0.9578 1.12 1.0549 (-10.14%) 0.85
>> 8 1.7018 1.35 1.8516 (-8.8%) 1.56
>> 16 2.9955 1.36 3.2466 (-8.38%) 0.42
>> 32 5.4354 0.59 5.7738 (-6.23%) 0.38
> On my IVB-EP (2 socket, 10 core/socket, 2 threads/core):
>
> bench:
>
> perf stat --null --repeat 10 -- perf bench sched messaging -g $i -t -l 10000 2>&1 | grep "seconds time elapsed"
>
> config + results:
>
> ORIG (SIS_PROP, shift=9)
>
> 1: 0.557325175 seconds time elapsed ( +- 0.83% )
> 2: 0.620646551 seconds time elapsed ( +- 1.46% )
> 5: 2.313514786 seconds time elapsed ( +- 2.11% )
> 10: 3.796233615 seconds time elapsed ( +- 1.57% )
> 20: 6.319403172 seconds time elapsed ( +- 1.61% )
> 40: 9.313219134 seconds time elapsed ( +- 1.03% )
>
> PROP+AGE+ONCE shift=0
>
> 1: 0.559497993 seconds time elapsed ( +- 0.55% )
> 2: 0.631549599 seconds time elapsed ( +- 1.73% )
> 5: 2.195464815 seconds time elapsed ( +- 1.77% )
> 10: 3.703455811 seconds time elapsed ( +- 1.30% )
> 20: 6.440869566 seconds time elapsed ( +- 1.23% )
> 40: 9.537849253 seconds time elapsed ( +- 2.00% )
>
> FOLD+AGE+ONCE+PONIES shift=0
>
> 1: 0.558893325 seconds time elapsed ( +- 0.98% )
> 2: 0.617426276 seconds time elapsed ( +- 1.07% )
> 5: 2.342727231 seconds time elapsed ( +- 1.34% )
> 10: 3.850449091 seconds time elapsed ( +- 1.07% )
> 20: 6.622412262 seconds time elapsed ( +- 0.85% )
> 40: 9.487138039 seconds time elapsed ( +- 2.88% )
>
> FOLD+AGE+ONCE+PONIES+PONIES2 shift=0
>
> 10: 3.695294317 seconds time elapsed ( +- 1.21% )
>
>
> Which seems to not hurt anymore.. can you confirm?
>
> Also, I didn't run anything other than hackbench on it so far.
>
> (sorry, the code is a right mess, it's what I ended up with after a day
> of poking with no cleanups)
>
I tested with FOLD+AGE+ONCE+PONIES+PONIES2 shift=0 vs baseline but see some
regressions for hackbench and uperf:
hackbench        baseline  stdev%   patch    stdev%   %gain
1(40 tasks) 0.5816 8.94 0.5607 2.89 3.593535
2(80 tasks) 0.6428 10.64 0.5984 3.38 6.907280
4(160 tasks) 1.0152 1.99 1.0036 2.03 1.142631
8(320 tasks) 1.8128 1.40 1.7931 0.97 1.086716
16(640 tasks) 3.1666 0.80 3.2332 0.48 -2.103207
32(1280 tasks) 5.6084 0.83 5.8489 0.56 -4.288210
Uperf            baseline  stdev%   patch    stdev%   %gain
8 threads 45.36 0.43 45.16 0.49 -0.433536
16 threads 87.81 0.82 88.6 0.38 0.899669
32 threads 151.18 0.01 149.98 0.04 -0.795925
48 threads 190.19 0.21 184.77 0.23 -2.849681
64 threads 190.42 0.35 183.78 0.08 -3.485217
128 threads 323.85 0.27 266.32 0.68 -17.766089
sysbench         baseline  stdev%   patch    stdev%   %gain
8 threads 2095.44 1.82 2102.63 0.29 0.343006
16 threads 4218.44 0.06 4179.82 0.49 -0.915413
32 threads 7531.36 0.48 7744.72 0.13 2.832912
48 threads 10206.42 0.20 10144.65 0.19 -0.605163
64 threads 12053.72 0.09 11784.38 0.32 -2.234547
128 threads 14810.33 0.04 14741.78 0.16 -0.462867
I have a patch which is much smaller and seems to work well so far for both
x86 and SPARC across the benchmarks I have run. It bounds the idle cpu search
to between one and two cores' worth of cpus, and adds a new sched feature to
control whether the idle core search is done at all. It can be on by default,
but for workloads (like Oracle DB on x86) we can turn it off. I plan to send
that after some more testing.
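
A rough standalone sketch of that idea (the clamp to one/two cores' worth of
cpus, the example numbers and the do_idle_core_search flag standing in for the
proposed sched feature are illustrative only, not the actual patch):

/*
 * Derive the scan budget the way SIS_PROP does, then clamp it to between
 * one and two cores' worth of cpus. Everything here is a userspace model;
 * the names and constants are assumptions for illustration.
 */
#include <stdint.h>
#include <stdio.h>

static int scan_budget(uint64_t avg_idle, uint64_t avg_cost,
		       int span_weight, int smt_weight)
{
	uint64_t span_avg = (uint64_t)span_weight * avg_idle;
	int nr = (int)(span_avg / (avg_cost + 1));

	if (nr < smt_weight)		/* at least one core's worth */
		nr = smt_weight;
	if (nr > 2 * smt_weight)	/* at most two cores' worth */
		nr = 2 * smt_weight;
	return nr;
}

int main(void)
{
	/* stand-in for the proposed feature gating the idle core search */
	int do_idle_core_search = 1;

	/* one 44-cpu LLC (22 cores, SMT2): nearly saturated vs mostly idle */
	printf("saturated LLC: scan %d cpus\n",
	       scan_budget(50, 4000, 44, 2));
	printf("idle LLC:      scan %d cpus\n",
	       scan_budget(500000, 4000, 44, 2));
	printf("idle core search: %s\n",
	       do_idle_core_search ? "on" : "off");
	return 0;
}

The point of the clamp is that a wakeup never scans less than one core or more
than two cores' worth of cpus regardless of how avg_idle and avg_cost move.
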
On Wed, May 30, 2018 at 03:08:21PM -0700, Subhra Mazumdar wrote:
> I tested with FOLD+AGE+ONCE+PONIES+PONIES2 shift=0 vs baseline but see some
> regressions for hackbench and uperf:
I'm not seeing a hackbench regression myself, but I let it run a whole
lot of stuff overnight and I do indeed see some pain points, including
sysbench-mysql and some schbench results.
> I have a patch which is much smaller and seems to work well so far for both
> x86 and SPARC across the benchmarks I have run. It bounds the idle cpu search
> to between one and two cores' worth of cpus, and adds a new sched feature to
> control whether the idle core search is done at all. It can be on by default,
> but for workloads (like Oracle DB on x86) we can turn it off. I plan to send
> that after some more testing.
Sure thing..