The current select_idle_sibling first tries to find a fully idle core using
select_idle_core, which can potentially search all cores; if that fails, it
looks for any idle cpu using select_idle_cpu, which can potentially search
all cpus in the LLC domain. This doesn't scale for large LLC domains and
will only get worse as core counts grow.
This patch addresses that scalability problem with a randomized approach
called SMT balance. It maintains a per-scheduling-group count of busy CPUs
(henceforth referred to as SMT utilization), accounted at context switch
time. The SMT utilization is maintained only for levels below the LLC
domain, i.e. for cores. During context switch, each cpu of a core atomically
increments or decrements that core's SMT utilization variable depending on
whether the cpu is going busy or idle. Since the atomic variable is per
core, it is updated only by the hyperthreads of that core, with minimal
contention.

In the wakeup fast path the scheduler compares the target cpu group of
select_idle_sibling with a random group in the LLC domain w.r.t. SMT
utilization and chooses the core with more SMT capacity (idle cpus) left.
The SMT capacity is computed simply by subtracting the SMT utilization from
the group weight, so the comparison is O(1). Finally, an idle cpu search is
done only within that core, starting from a random cpu index. The random
number generation needs to be fast and uses a per-cpu pseudo random number
generator.
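As a rough standalone illustration of that comparison (a simplified
userspace model with made-up values, not the kernel code; the actual logic
is in smt_should_migrate() and smt_balance() in the patch below):

/*
 * Userspace model of the O(1) SMT-capacity comparison; should_migrate()
 * and the rand() rotor stand-in are illustrative names only.
 */
#include <stdio.h>
#include <stdlib.h>

struct core {
	int weight;	/* number of SMT siblings in the group */
	int busy;	/* SMT utilization: busy siblings, updated at context switch */
};

/* true if the randomly picked core has more SMT capacity (idle cpus) left */
static int should_migrate(const struct core *target, const struct core *other)
{
	return (other->weight - other->busy) > (target->weight - target->busy);
}

int main(void)
{
	struct core cores[4] = { {2, 2}, {2, 1}, {2, 0}, {2, 2} };
	int target = 0;
	int rnd = rand() % 4;	/* stand-in for the per-cpu rotor */

	if (rnd != target && should_migrate(&cores[target], &cores[rnd]))
		target = rnd;
	printf("wake on core %d\n", target);
	return 0;
}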
Following are the numbers for various benchmarks on a 2-socket x86 system
with 22 cores per socket and 2 hyperthreads per core:
hackbench process (lower is better):
groups  baseline-rc6(avg)  %stdev  patch(avg)         %stdev
1       0.4797             15.75   0.4324 (+9.86%)    2.23
2       0.4877             9.99    0.4535 (+7.01%)    3.36
4       0.8603             1.09    0.8376 (+2.64%)    0.95
8       1.496              0.60    1.4516 (+2.97%)    1.38
16      2.6642             0.37    2.5857 (+2.95%)    0.68
32      4.6715             0.40    4.5158 (+3.33%)    0.67
uperf pingpong throughput with loopback interface and message size = 8k:
threads  baseline-rc6(avg)  %stdev  patch(avg)          %stdev
8        49.47              0.35    51.16  (+3.42%)     0.53
16       95.28              0.77    101.02 (+6.03%)     0.43
32       156.77             1.17    181.52 (+15.79%)    0.96
48       193.24             0.22    212.90 (+10.17%)    0.45
64       216.21             9.33    264.14 (+22.17%)    0.69
128      379.62             10.29   416.36 (+9.68%)     1.04
Oracle DB TPC-C throughput normalized to baseline:
users  baseline-rc6 norm(avg)  %stdev  patch norm(avg)    %stdev
20     1                       0.94    1.0071 (+0.71%)    1.03
40     1                       0.82    1.0126 (+1.26%)    0.65
60     1                       1.10    0.9928 (-0.72%)    0.67
80     1                       0.63    1.003  (+0.30%)    0.64
100    1                       0.82    0.9957 (-0.43%)    0.15
120    1                       0.46    1.0034 (+0.34%)    1.74
140    1                       1.44    1.0247 (+2.47%)    0.15
160    1                       0.85    1.0445 (+4.45%)    0.81
180    1                       0.19    1.0382 (+3.82%)    0.57
200    1                       1.40    1.0295 (+2.95%)    0.94
220    1                       1.02    1.0242 (+2.42%)    0.85
Following is the cost (in us) of select_idle_sibling() with hackbench 16
groups:
function               baseline-rc6  %stdev  patch            %stdev
select_idle_sibling()  0.556         1.72    0.263 (-52.70%)  0.78
Signed-off-by: subhra mazumdar <[email protected]>
---
include/linux/sched/topology.h | 2 +
kernel/sched/core.c | 43 +++++++
kernel/sched/fair.c | 247 ++++++++++++++++++++---------------------
kernel/sched/idle_task.c | 3 +-
kernel/sched/sched.h | 28 ++---
kernel/sched/topology.c | 35 +++++-
6 files changed, 206 insertions(+), 152 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cf257c2..e63e4fb 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -147,6 +147,8 @@ struct sched_domain {
struct sched_domain_shared *shared;
unsigned int span_weight;
+ struct sched_group **sg_array;
+ int sg_num;
/*
* Span of all CPUs in this domain.
*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7bf32a..58f8684 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2752,6 +2752,48 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
put_user(task_pid_vnr(current), current->set_child_tid);
}
+#ifdef CONFIG_SCHED_SMT
+
+/*
+ * From sd_llc downward update the SMT utilization.
+ * Skip the lowest level 0.
+ */
+void smt_util(struct rq *rq, int prev_busy, int next_busy)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg_cpu;
+ int this_cpu = rq->cpu;
+ int util;
+
+ if (rq->curr_util == UTIL_UNINITIALIZED)
+ prev_busy = 0;
+
+ util = next_busy - prev_busy;
+ rcu_read_lock();
+ sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
+ if (util) {
+ for_each_lower_domain(sd) {
+ if (sd->level == 0)
+ break;
+ sg_cpu = sd->groups;
+
+ /* atomically add/subtract the util */
+ if (util > 0)
+ atomic_inc((atomic_t *)&sg_cpu->utilization);
+ else
+ atomic_dec((atomic_t *)&sg_cpu->utilization);
+ }
+ }
+
+ if (sd)
+ rq->curr_util = next_busy;
+ rcu_read_unlock();
+}
+
+#else
+void smt_util(struct rq *rq, int prev_busy, int next_busy) {}
+#endif
+
/*
* context_switch - switch to the new MM and the new thread's register state.
*/
@@ -5943,6 +5985,7 @@ void __init sched_init(void)
rq->idle_stamp = 0;
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+ rq->curr_util = UTIL_UNINITIALIZED;
INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2fe3aa8..9b0fc2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6009,128 +6009,126 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
#ifdef CONFIG_SCHED_SMT
-static inline void set_idle_cores(int cpu, int val)
+/*
+ * Find an idle cpu in the core starting search from random index.
+ */
+static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
{
- struct sched_domain_shared *sds;
+ int i, rand_index, rand_cpu;
+ int this_cpu = smp_processor_id();
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- WRITE_ONCE(sds->has_idle_cores, val);
-}
+ rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
+ rand_cpu = sg->cp_array[rand_index];
-static inline bool test_idle_cores(int cpu, bool def)
-{
- struct sched_domain_shared *sds;
-
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- return READ_ONCE(sds->has_idle_cores);
+ for_each_cpu_wrap(i, sched_group_span(sg), rand_cpu) {
+ if (!cpumask_test_cpu(i, &p->cpus_allowed))
+ continue;
+ if (idle_cpu(i))
+ return i;
+ }
- return def;
+ return -1;
}
/*
- * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
- *
- * Since SMT siblings share all cache levels, inspecting this limited remote
- * state should be fairly cheap.
+ * Compare the SMT utilization of the two groups and select the one which
+ * has more capacity left.
*/
-void __update_idle_core(struct rq *rq)
+static int smt_should_migrate(struct sched_group *here,
+ struct sched_group *there, int self_util)
{
- int core = cpu_of(rq);
- int cpu;
-
- rcu_read_lock();
- if (test_idle_cores(core, true))
- goto unlock;
-
- for_each_cpu(cpu, cpu_smt_mask(core)) {
- if (cpu == core)
- continue;
+ int this_cpu = smp_processor_id();
+ int here_util = here->utilization, there_util = there->utilization;
- if (!idle_cpu(cpu))
- goto unlock;
+ /* Discount self utilization if it belongs to here or there */
+ if (self_util > 0) {
+ if (cpumask_test_cpu(this_cpu, sched_group_span(here)))
+ here_util -= self_util;
+ else if (cpumask_test_cpu(this_cpu, sched_group_span(there)))
+ there_util -= self_util;
}
- set_idle_cores(core, 1);
-unlock:
- rcu_read_unlock();
+ /* Return true if other group has more capacity left */
+ return (there->group_weight - there_util >
+ here->group_weight - here_util);
}
/*
- * Scan the entire LLC domain for idle cores; this dynamically switches off if
- * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
+ * SMT balancing works by comparing the target cpu group with a random group
+ * in llc domain. It calls smt_should_migrate to decide which group has more
+ * capacity left. The balancing starts top down from sd_llc to SMT core level.
+ * Finally idle cpu search is only done in the core.
*/
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static int smt_balance(struct task_struct *p, struct sched_domain *sd, int cpu)
{
- struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int core, cpu;
-
- if (!static_branch_likely(&sched_smt_present))
- return -1;
+ struct sched_group *sg, *src_sg, *start_sg, *tgt_sg;
+ struct cpumask *span;
+ int rand_idx, weight;
+ int cpu_orig = cpu;
+ int rand_cpu;
+ int this_cpu = smp_processor_id();
+ struct rq *this_rq = cpu_rq(this_cpu);
+ struct rq *rq = cpu_rq(cpu);
+ int self_util = 0;
+ int balanced = 0;
- if (!test_idle_cores(target, false))
- return -1;
+ if (p == this_rq->curr)
+ self_util = rq->curr_util;
- cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
+ for_each_lower_domain(sd) {
+ if (sd->level == 0)
+ break;
- for_each_cpu_wrap(core, cpus, target) {
- bool idle = true;
+ /* Pick a random group that has cpus where the thread can run */
+ src_sg = sd->groups;
+ tgt_sg = NULL;
+ rand_idx = CPU_PSEUDO_RANDOM(this_cpu) % sd->sg_num;
+ start_sg = sd->sg_array[rand_idx];
+ sg = start_sg;
+ do {
+ span = sched_group_span(sg);
+ if (sg != src_sg &&
+ cpumask_intersects(span, &p->cpus_allowed)) {
+ tgt_sg = sg;
+ break;
+ }
+ sg = sg->next;
+ } while (sg != start_sg);
- for_each_cpu(cpu, cpu_smt_mask(core)) {
- cpumask_clear_cpu(cpu, cpus);
- if (!idle_cpu(cpu))
- idle = false;
+ /*
+ * Compare the target group with random group and select the
+ * one which has more SMT capacity left. Choose a random CPU in
+ * the group to spread the load, then find the group's parent sd
+ * so the group's sd is used on the next main loop iteration.
+ */
+ if (tgt_sg && smt_should_migrate(src_sg, tgt_sg, self_util)) {
+ weight = tgt_sg->group_weight;
+ rand_idx = CPU_PSEUDO_RANDOM(this_cpu) % weight;
+ rand_cpu = tgt_sg->cp_array[rand_idx];
+ for_each_cpu_wrap(cpu, span, rand_cpu) {
+ if (cpumask_test_cpu(cpu, &p->cpus_allowed))
+ break;
+ }
+ for_each_domain(cpu, sd) {
+ if (weight < sd->span_weight)
+ break;
+ }
+ balanced = 1;
}
-
- if (idle)
- return core;
}
- /*
- * Failed to find an idle core; stop looking for one.
- */
- set_idle_cores(target, 0);
-
- return -1;
-}
-
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
- int cpu;
-
- if (!static_branch_likely(&sched_smt_present))
- return -1;
-
- for_each_cpu(cpu, cpu_smt_mask(target)) {
- if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
- continue;
- if (idle_cpu(cpu))
+ /* sd is now lowest level SMT core */
+ if (sd->parent && (balanced || !idle_cpu(cpu_orig))) {
+ cpu = select_idle_smt(p, sd->parent->groups);
+ if (cpu >= 0)
return cpu;
}
- return -1;
+ return cpu_orig;
}
#else /* CONFIG_SCHED_SMT */
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
-{
- return -1;
-}
-
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
- return -1;
-}
-
-#endif /* CONFIG_SCHED_SMT */
-
/*
* Scan the LLC domain for idle CPUs; this is dynamically regulated by
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6146,7 +6144,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
- return -1;
+ return target;
/*
* Due to large variance we need a large fuzz factor; hackbench in
@@ -6156,7 +6154,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
avg_cost = this_sd->avg_scan_cost + 1;
if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
- return -1;
+ return target;
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
@@ -6170,7 +6168,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
if (!--nr)
- return -1;
+ return target;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
if (idle_cpu(cpu))
@@ -6185,41 +6183,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return cpu;
}
-/*
- * Try and locate an idle core/thread in the LLC cache domain.
- */
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
-{
- struct sched_domain *sd;
- int i;
-
- if (idle_cpu(target))
- return target;
-
- /*
- * If the previous cpu is cache affine and idle, don't be stupid.
- */
- if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
- return prev;
-
- sd = rcu_dereference(per_cpu(sd_llc, target));
- if (!sd)
- return target;
-
- i = select_idle_core(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
-
- i = select_idle_cpu(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
-
- i = select_idle_smt(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
-
- return target;
-}
+#endif /* CONFIG_SCHED_SMT */
/*
* cpu_util returns the amount of capacity of a CPU that is used by CFS
@@ -6302,6 +6266,31 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
return min_cap * 1024 < task_util(p) * capacity_margin;
}
+static int select_idle_sibling(struct task_struct *p, int prev, int target)
+{
+ struct sched_domain *sd;
+
+ if (idle_cpu(target))
+ return target;
+
+ /*
+ * If the previous cpu is cache affine and idle, don't be stupid.
+ */
+ if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+ return prev;
+
+ sd = rcu_dereference(per_cpu(sd_llc, target));
+ if (!sd)
+ return target;
+
+#ifdef CONFIG_SCHED_SMT
+ return smt_balance(p, sd, target);
+#else
+
+ return select_idle_cpu(p, sd, target);
+#endif
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664..22c038a 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,8 +28,8 @@ static struct task_struct *
pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
put_prev_task(rq, prev);
- update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
+ smt_util(rq, 1, 0);
return rq->idle;
}
@@ -49,6 +49,7 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
rq_last_tick_reset(rq);
+ smt_util(rq, 0, 1);
}
static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a2..bd14722 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -48,6 +48,11 @@
struct rq;
struct cpuidle_state;
+#define CPU_PSEUDO_RANDOM(cpu) (cpu_rq(cpu)->rotor++)
+
+/* uninitialized state of the rq */
+#define UTIL_UNINITIALIZED -1
+
/* task_struct::on_rq states: */
#define TASK_ON_RQ_QUEUED 1
#define TASK_ON_RQ_MIGRATING 2
@@ -752,6 +757,8 @@ struct rq {
/* cpu of this runqueue: */
int cpu;
int online;
+ unsigned int rotor;
+ int curr_util;
struct list_head cfs_tasks;
@@ -823,25 +830,8 @@ static inline int cpu_of(struct rq *rq)
#endif
}
-
-#ifdef CONFIG_SCHED_SMT
-
-extern struct static_key_false sched_smt_present;
-
-extern void __update_idle_core(struct rq *rq);
-
-static inline void update_idle_core(struct rq *rq)
-{
- if (static_branch_unlikely(&sched_smt_present))
- __update_idle_core(rq);
-}
-
-#else
-static inline void update_idle_core(struct rq *rq) { }
-#endif
-
DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-
+void smt_util(struct rq *rq, int prev_busy, int next_busy);
#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
#define this_rq() this_cpu_ptr(&runqueues)
#define task_rq(p) cpu_rq(task_cpu(p))
@@ -1095,6 +1085,8 @@ struct sched_group {
unsigned int group_weight;
struct sched_group_capacity *sgc;
int asym_prefer_cpu; /* cpu of highest priority in group */
+ int utilization;
+ int *cp_array;
/*
* The CPUs this group covers.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed..fc3e974 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -341,8 +341,10 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)
if (free_sgc && atomic_dec_and_test(&sg->sgc->ref))
kfree(sg->sgc);
- if (atomic_dec_and_test(&sg->ref))
+ if (atomic_dec_and_test(&sg->ref)) {
+ kfree(sg->cp_array);
kfree(sg);
+ }
sg = tmp;
} while (sg != first);
}
@@ -358,6 +360,9 @@ static void destroy_sched_domain(struct sched_domain *sd)
if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
kfree(sd->shared);
+
+ kfree(sd->sg_array);
+
kfree(sd);
}
@@ -903,17 +908,29 @@ build_sched_groups(struct sched_domain *sd, int cpu)
* group having more cpu_capacity will pickup more load compared to the
* group having less cpu_capacity.
*/
-static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
+static void init_sched_groups_capacity(int cpu_orig, struct sched_domain *sd)
{
struct sched_group *sg = sd->groups;
+ int sg_num = 0, i, cp;
WARN_ON(!sg);
do {
int cpu, max_cpu = -1;
+ sg_num++;
sg->group_weight = cpumask_weight(sched_group_span(sg));
+ /* Build the array of cpus for each group */
+ if (!sg->cp_array) {
+ sg->cp_array = kzalloc_node(sg->group_weight *
+ sizeof(int), GFP_KERNEL,
+ cpu_to_node(cpu_orig));
+ i = 0;
+ for_each_cpu(cp, sched_group_span(sg))
+ sg->cp_array[i++] = cp;
+ }
+
if (!(sd->flags & SD_ASYM_PACKING))
goto next;
@@ -929,10 +946,20 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
sg = sg->next;
} while (sg != sd->groups);
- if (cpu != group_balance_cpu(sg))
+ /* Build the array of groups for each domain */
+ sd->sg_array = kzalloc_node(sg_num * sizeof(void *), GFP_KERNEL,
+ cpu_to_node(cpu_orig));
+ sd->sg_num = sg_num;
+ i = 0;
+ do {
+ sd->sg_array[i++] = sg;
+ sg = sg->next;
+ } while (sg != sd->groups);
+
+ if (cpu_orig != group_balance_cpu(sg))
return;
- update_group_capacity(sd, cpu);
+ update_group_capacity(sd, cpu_orig);
}
/*
--
2.9.3
On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
> +#ifdef CONFIG_SCHED_SMT
> +
> +/*
> + * From sd_llc downward update the SMT utilization.
Please don't use utilization for this, we already use that word for
something else.
> + * Skip the lowest level 0.
> + */
> +void smt_util(struct rq *rq, int prev_busy, int next_busy)
> +{
> + struct sched_domain *sd;
> + struct sched_group *sg_cpu;
> + int this_cpu = rq->cpu;
> + int util;
> +
> + if (rq->curr_util == UTIL_UNINITIALIZED)
> + prev_busy = 0;
> +
> + util = next_busy - prev_busy;
> + rcu_read_lock();
> + sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
> + if (util) {
> + for_each_lower_domain(sd) {
> + if (sd->level == 0)
> + break;
afaict you really only need this for the core, and here you're assuming
everything below the LLC is cores. Would it not be much clearer if you
introduce sd_core.
As is, for_each_lower_domain includes the starting domain, sd->group
then is the first core group for this cpu. But then you continue to the
smt domain (on Intel, on other architectures there could be a cluster
domain in between) and then you bail using that sd->level == 0 hack
because otherwise things would go *bang*.
Really rather messy this.
I think you want to go allocate sched_domain_shared for the MC level and
use that, much like sd_llc_shared.
> + sg_cpu = sd->groups;
> +
> + /* atomically add/subtract the util */
> + if (util > 0)
> + atomic_inc((atomic_t *)&sg_cpu->utilization);
> + else
> + atomic_dec((atomic_t *)&sg_cpu->utilization);
*sigh*, wth do you need that cast? You didn't get the memo that spurious
casts are where bugs happen and are terrible style at the best of times?
> + }
> + }
> +
> + if (sd)
> + rq->curr_util = next_busy;
> + rcu_read_unlock();
> +}
> + smt_util(rq, 1, 0);
> + smt_util(rq, 0, 1);
That all seems like an overly complex way to write inc/dec. You turned
what should be a boolean state space (2^1) into something like 2^64.
Also, I think if you count idle instead of busy, you'll do away with the
need for that whole uninitialized thing.
> +#define CPU_PSEUDO_RANDOM(cpu) (cpu_rq(cpu)->rotor++)
That's a bit of a stretch calling that pseudo random..
> +/*
> + * Find an idle cpu in the core starting search from random index.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
> {
> + int i, rand_index, rand_cpu;
> + int this_cpu = smp_processor_id();
>
> + rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
> + rand_cpu = sg->cp_array[rand_index];
Right, so yuck.. I know why you need that, but that extra array and
dereference is the reason I never went there.
How much difference does it really make vs the 'normal' wrapping search
from last CPU ?
This really should be a separate patch with separate performance numbers
on.
> + for_each_cpu_wrap(i, sched_group_span(sg), rand_cpu) {
> + if (!cpumask_test_cpu(i, &p->cpus_allowed))
> + continue;
> + if (idle_cpu(i))
> + return i;
> + }
>
> + return -1;
> }
>
> /*
> + * Compare the SMT utilization of the two groups and select the one which
> + * has more capacity left.
> */
> +static int smt_should_migrate(struct sched_group *here,
> + struct sched_group *there, int self_util)
> {
> + int this_cpu = smp_processor_id();
> + int here_util = here->utilization, there_util = there->utilization;
>
> + /* Discount self utilization if it belongs to here or there */
> + if (self_util > 0) {
> + if (cpumask_test_cpu(this_cpu, sched_group_span(here)))
> + here_util -= self_util;
> + else if (cpumask_test_cpu(this_cpu, sched_group_span(there)))
> + there_util -= self_util;
> }
>
> + /* Return true if other group has more capacity left */
> + return (there->group_weight - there_util >
> + here->group_weight - here_util);
> }
OK, so this effectively picks the least-busiest/idlest SMT sibling of
two.
> /*
> + * SMT balancing works by comparing the target cpu group with a random group
> + * in llc domain. It calls smt_should_migrate to decide which group has more
> + * capacity left. The balancing starts top down from sd_llc to SMT core level.
> + * Finally idle cpu search is only done in the core.
> */
> +static int smt_balance(struct task_struct *p, struct sched_domain *sd, int cpu)
> {
> + struct sched_group *sg, *src_sg, *start_sg, *tgt_sg;
> + struct cpumask *span;
> + int rand_idx, weight;
> + int cpu_orig = cpu;
> + int rand_cpu;
> + int this_cpu = smp_processor_id();
> + struct rq *this_rq = cpu_rq(this_cpu);
> + struct rq *rq = cpu_rq(cpu);
> + int self_util = 0;
> + int balanced = 0;
>
> + if (p == this_rq->curr)
> + self_util = rq->curr_util;
>
> + for_each_lower_domain(sd) {
Again, I don't think we want to do that. You want to iterate all cores,
and this is a rather cumbersome way to go about doing that.
Esp. if there's a domain in between. Imagine an ARM big.LITTLE with
shared L3 growing SMT or something along those lines.
> + if (sd->level == 0)
> + break;
>
> + /* Pick a random group that has cpus where the thread can run */
> + src_sg = sd->groups;
> + tgt_sg = NULL;
> + rand_idx = CPU_PSEUDO_RANDOM(this_cpu) % sd->sg_num;
> + start_sg = sd->sg_array[rand_idx];
> + sg = start_sg;
> + do {
> + span = sched_group_span(sg);
> + if (sg != src_sg &&
> + cpumask_intersects(span, &p->cpus_allowed)) {
> + tgt_sg = sg;
> + break;
> + }
> + sg = sg->next;
> + } while (sg != start_sg);
OK, so this picks a 'random' group that has CPUs where our task is
allowed to run (per the above this group could be a cluster, not a
core).
> + /*
> + * Compare the target group with random group and select the
> + * one which has more SMT capacity left. Choose a random CPU in
> + * the group to spread the load, then find the group's parent sd
> + * so the group's sd is used on the next main loop iteration.
> + */
> + if (tgt_sg && smt_should_migrate(src_sg, tgt_sg, self_util)) {
> + weight = tgt_sg->group_weight;
> + rand_idx = CPU_PSEUDO_RANDOM(this_cpu) % weight;
> + rand_cpu = tgt_sg->cp_array[rand_idx];
> + for_each_cpu_wrap(cpu, span, rand_cpu) {
> + if (cpumask_test_cpu(cpu, &p->cpus_allowed))
> + break;
> + }
> + for_each_domain(cpu, sd) {
> + if (weight < sd->span_weight)
> + break;
> + }
> + balanced = 1;
> }
I really wonder how much that fake random stuff yields you vs the rest
of this.
In any case, this will not do for facebook I'm afraid, as is they
already disable SIS_AVG_CPU and SIS_PROP (iirc) and always take the hit
of doing a full LLC search.
Yes that is expensive, but they really want to keep that tail latency
down.
Also, only ever considering _one_ other core might affect other
workloads. If you look at those two features above, we should already
reduce the amount of searching we do when there's not a lot of idle
time.
> }
>
> + /* sd is now lowest level SMT core */
> + if (sd->parent && (balanced || !idle_cpu(cpu_orig))) {
> + cpu = select_idle_smt(p, sd->parent->groups);
> + if (cpu >= 0)
> return cpu;
> }
>
> + return cpu_orig;
> }
>
> @@ -6302,6 +6266,31 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> return min_cap * 1024 < task_util(p) * capacity_margin;
> }
>
> +static int select_idle_sibling(struct task_struct *p, int prev, int target)
> +{
> + struct sched_domain *sd;
> +
> + if (idle_cpu(target))
> + return target;
> +
> + /*
> + * If the previous cpu is cache affine and idle, don't be stupid.
> + */
> + if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
> + return prev;
> +
> + sd = rcu_dereference(per_cpu(sd_llc, target));
> + if (!sd)
> + return target;
> +
> +#ifdef CONFIG_SCHED_SMT
> + return smt_balance(p, sd, target);
> +#else
> +
> + return select_idle_cpu(p, sd, target);
> +#endif
And that is just wrong....
> +}
So while there are things in there that _might_ work, I really don't
want to see this one giant horrible patch again.
On Thu, Feb 01, 2018 at 01:33:35PM +0100, Peter Zijlstra wrote:
> I think you want to go allocate sched_domain_shared for the MC level and
> use that, much like sd_llc_shared.
Also, you'd want to try and get performance numbers for something like
Power8, which has SMT8 and fairly expensive atomic ops.
On 2/1/2018 7:33 AM, Peter Zijlstra wrote:
> On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
>> +#ifdef CONFIG_SCHED_SMT
>> +
>> +/*
>> + * From sd_llc downward update the SMT utilization.
>
> Please don't use utilization for this, we already use that word for
> something else.
>
>> + * Skip the lowest level 0.
>> + */
>> +void smt_util(struct rq *rq, int prev_busy, int next_busy)
>> +{
>> + struct sched_domain *sd;
>> + struct sched_group *sg_cpu;
>> + int this_cpu = rq->cpu;
>> + int util;
>> +
>> + if (rq->curr_util == UTIL_UNINITIALIZED)
>> + prev_busy = 0;
>> +
>> + util = next_busy - prev_busy;
>> + rcu_read_lock();
>> + sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
>> + if (util) {
>> + for_each_lower_domain(sd) {
>> + if (sd->level == 0)
>> + break;
>
> afaict you really only need this for the core, and here you're assuming
> everything below the LLC is cores. Would it not be much clearer if you
> introduce sd_core.
>
> As is, for_each_lower_domain includes the starting domain, sd->group
> then is the first core group for this cpu. But then you continue to the
> smt domain (on Intel, on other architectures there could be a cluster
> domain in between) and then you bail using that sd->level == 0 hack
> because otherwise things would go *bang*.
Hi Peter,
The code here and in smt_balance intentionally visits each level between
the llc and smt, including core-cluster on architectures that define it.
smt_balance thus has the chance to randomly pick a better cluster,
and then within that cluster randomly pick a better core. It makes sense,
as resources are shared within a cluster, and choosing a less loaded cluster
should give better performance. As you suggest in a few other places,
it would be nice to see performance results for this case. We have
SPARC processors with core clusters.
> Really rather messy this.
>
> I think you want to go allocate sched_domain_shared for the MC level and
> use that, much like sd_llc_shared.
>
>> + sg_cpu = sd->groups;
>> +
>> + /* atomically add/subtract the util */
>> + if (util > 0)
>> + atomic_inc((atomic_t *)&sg_cpu->utilization);
>> + else
>> + atomic_dec((atomic_t *)&sg_cpu->utilization);
>
> *sigh*, wth do you need that cast? You didn't get the memo that spurious
> casts are where bugs happen and are terrible style at the best of times?
>
>> + }
>> + }
>> +
>> + if (sd)
>> + rq->curr_util = next_busy;
>> + rcu_read_unlock();
>> +}
>
>> + smt_util(rq, 1, 0);
>
>> + smt_util(rq, 0, 1);
>
> That all seems like an overly complex way to write inc/dec. You turned
> what should be a boolean state space (2^1) into something like 2^64.
>
> Also, I think if you count idle instead of busy, you'll do away with the
> need for that whole uninitialized thing.
>
>
>> +#define CPU_PSEUDO_RANDOM(cpu) (cpu_rq(cpu)->rotor++)
>
> That's a bit of a stretch calling that pseudo random..
>
>> +/*
>> + * Find an idle cpu in the core starting search from random index.
>> + */
>> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>> {
>> + int i, rand_index, rand_cpu;
>> + int this_cpu = smp_processor_id();
>>
>> + rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
>> + rand_cpu = sg->cp_array[rand_index];
>
> Right, so yuck.. I know why you need that, but that extra array and
> dereference is the reason I never went there.
>
> How much difference does it really make vs the 'normal' wrapping search
> from last CPU ?
>
> This really should be a separate patch with separate performance numbers
> on.
For the benefit of other readers, if we always search and choose starting from
the first CPU in a core, then later searches will often need to traverse the first
N busy CPU's to find the first idle CPU. Choosing a random starting point avoids
such bias. It is probably a win for processors with 4 to 8 CPUs per core, and
a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
to see performance data for this as a separate patch to decide. We have SPARC
systems with 8 CPUs per core.
>> + for_each_cpu_wrap(i, sched_group_span(sg), rand_cpu) {
>> + if (!cpumask_test_cpu(i, &p->cpus_allowed))
>> + continue;
>> + if (idle_cpu(i))
>> + return i;
>> + }
>>
>> + return -1;
>> }
>>
>> /*
>> + * Compare the SMT utilization of the two groups and select the one which
>> + * has more capacity left.
>> */
>> +static int smt_should_migrate(struct sched_group *here,
>> + struct sched_group *there, int self_util)
>> {
>> + int this_cpu = smp_processor_id();
>> + int here_util = here->utilization, there_util = there->utilization;
>>
>> + /* Discount self utilization if it belongs to here or there */
>> + if (self_util > 0) {
>> + if (cpumask_test_cpu(this_cpu, sched_group_span(here)))
>> + here_util -= self_util;
>> + else if (cpumask_test_cpu(this_cpu, sched_group_span(there)))
>> + there_util -= self_util;
>> }
>>
>> + /* Return true if other group has more capacity left */
>> + return (there->group_weight - there_util >
>> + here->group_weight - here_util);
>> }
>
> OK, so this effectively picks the least-busiest/idlest SMT sibling of
> two.
>
>> /*
>> + * SMT balancing works by comparing the target cpu group with a random group
>> + * in llc domain. It calls smt_should_migrate to decide which group has more
>> + * capacity left. The balancing starts top down from sd_llc to SMT core level.
>> + * Finally idle cpu search is only done in the core.
>> */
>> +static int smt_balance(struct task_struct *p, struct sched_domain *sd, int cpu)
>> {
>> + struct sched_group *sg, *src_sg, *start_sg, *tgt_sg;
>> + struct cpumask *span;
>> + int rand_idx, weight;
>> + int cpu_orig = cpu;
>> + int rand_cpu;
>> + int this_cpu = smp_processor_id();
>> + struct rq *this_rq = cpu_rq(this_cpu);
>> + struct rq *rq = cpu_rq(cpu);
>> + int self_util = 0;
>> + int balanced = 0;
>>
>> + if (p == this_rq->curr)
>> + self_util = rq->curr_util;
>>
>> + for_each_lower_domain(sd) {
>
> Again, I don't think we want to do that. You want to iterate all cores,
> and this is a rather cumbersome way to go about doing that.
>
> Esp. if there's a domain in between. Imagine an ARM big.LITTLE with
> shared L3 growing SMT or something along those lines.
>
>> + if (sd->level == 0)
>> + break;
>>
>> + /* Pick a random group that has cpus where the thread can run */
>> + src_sg = sd->groups;
>> + tgt_sg = NULL;
>> + rand_idx = CPU_PSEUDO_RANDOM(this_cpu) % sd->sg_num;
>> + start_sg = sd->sg_array[rand_idx];
>
>> + sg = start_sg;
>> + do {
>> + span = sched_group_span(sg);
>> + if (sg != src_sg &&
>> + cpumask_intersects(span, &p->cpus_allowed)) {
>> + tgt_sg = sg;
>> + break;
>> + }
>> + sg = sg->next;
>> + } while (sg != start_sg);
>
> OK, so this picks a 'random' group that has CPUs where our task is
> allowed to run (per the above this group could be a cluster, not a
> core).
>
>> + /*
>> + * Compare the target group with random group and select the
>> + * one which has more SMT capacity left. Choose a random CPU in
>> + * the group to spread the load, then find the group's parent sd
>> + * so the group's sd is used on the next main loop iteration.
>> + */
>> + if (tgt_sg && smt_should_migrate(src_sg, tgt_sg, self_util)) {
>> + weight = tgt_sg->group_weight;
>> + rand_idx = CPU_PSEUDO_RANDOM(this_cpu) % weight;
>> + rand_cpu = tgt_sg->cp_array[rand_idx];
>> + for_each_cpu_wrap(cpu, span, rand_cpu) {
>> + if (cpumask_test_cpu(cpu, &p->cpus_allowed))
>> + break;
>> + }
>> + for_each_domain(cpu, sd) {
>> + if (weight < sd->span_weight)
>> + break;
>> + }
>> + balanced = 1;
>> }
>
> I really wonder how much that fake random stuff yields you vs the rest
> of this.
>
> In any case, this will not do for facebook I'm afraid, as is they
> already disable SIS_AVG_CPU and SIS_PROP (iirc) and always take the hit
> of doing a full LLC search.
>
> Yes that is expensive, but they really want to keep that tail latency
> down.
They might be happier with this new patch. In the tests we ran, it improves
CPU utilization and also reduces searching cost. However, I agree we should
keep the option to search all CPUs when SIS_PROP and SIS_AVG_CPU are disabled.
> Also, only ever considering _one_ other core might affect other
> workloads. If you look at those two features above, we should already
> reduce the amount of searching we do when there's not a lot of idle
> time.
It might be interesting to add a tunable for the number of random choices to
make, and clamp it at the max nr computed from avg_cost in select_idle_cpu.
Or, choose a random starting point and then search for nr sequential
candidates; possibly limited by a tunable.
The current (pre-patch) search is biased. select_idle_cpu will not find an
idle CPU hiding behind the first nr candidates.
>> }
>>
>> + /* sd is now lowest level SMT core */
>> + if (sd->parent && (balanced || !idle_cpu(cpu_orig))) {
>> + cpu = select_idle_smt(p, sd->parent->groups);
>> + if (cpu >= 0)
>> return cpu;
>> }
>>
>> + return cpu_orig;
>> }
>>
>
>> @@ -6302,6 +6266,31 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>> return min_cap * 1024 < task_util(p) * capacity_margin;
>> }
>>
>> +static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> +{
>> + struct sched_domain *sd;
>> +
>> + if (idle_cpu(target))
>> + return target;
>> +
>> + /*
>> + * If the previous cpu is cache affine and idle, don't be stupid.
>> + */
>> + if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
>> + return prev;
>> +
>> + sd = rcu_dereference(per_cpu(sd_llc, target));
>> + if (!sd)
>> + return target;
>> +
>> +#ifdef CONFIG_SCHED_SMT
>> + return smt_balance(p, sd, target);
>> +#else
>> +
>> + return select_idle_cpu(p, sd, target);
>> +#endif
>
> And that is just wrong....
>
>> +}
>
> So while there are things in there that _might_ work, I really don't
> want to see this one giant horrible patch again.
Subhra, I suggest roughly this refactoring into multiple patches:
* add sg_array and cp_array
* add curr_util (renamed as peter asks) and the code that updates it.
* select_idle_smt as peter suggests
* add random balancing
- Steve
On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> It might be interesting to add a tunable for the number of random choices to
> make, and clamp it at the max nr computed from avg_cost in select_idle_cpu.
This needs a fairly complicated PRNG for it would need to visit each
possible CPU once before looping. A LFSR does that, but requires 2^n-1
elements and we have topology masks that don't match that.. The trivial
example is something with 6 cores.
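(For illustration only, not from the patch: a minimal 3-bit Galois LFSR in
userspace shows the property in question. It steps through all 2^3-1 = 7
nonzero states exactly once before repeating, which is why an LFSR only
maps cleanly onto masks of size 2^n-1; the tap mask below is just one
maximal-length example.)

#include <stdio.h>

static unsigned lfsr3(unsigned x)
{
	unsigned lsb = x & 1;

	x >>= 1;
	if (lsb)
		x ^= 0x6;	/* tap mask for a maximal-length 3-bit LFSR */
	return x;
}

int main(void)
{
	unsigned x = 1;

	do {
		printf("%u ", x);	/* prints 1 6 3 7 5 4 2: each nonzero state once */
		x = lfsr3(x);
	} while (x != 1);
	printf("\n");
	return 0;
}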
> Or, choose a random starting point and then search for nr sequential
> candidates; possibly limited by a tunable.
And this is basically what we already do. Except with the task-cpu
instead of a per-cpu rotor.
On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> >> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
> >> {
> >> + int i, rand_index, rand_cpu;
> >> + int this_cpu = smp_processor_id();
> >>
> >> + rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
> >> + rand_cpu = sg->cp_array[rand_index];
> >
> > Right, so yuck.. I know why you need that, but that extra array and
> > dereference is the reason I never went there.
> >
> > How much difference does it really make vs the 'normal' wrapping search
> > from last CPU ?
> >
> > This really should be a separate patch with separate performance numbers
> > on.
>
> For the benefit of other readers, if we always search and choose starting from
> the first CPU in a core, then later searches will often need to traverse the first
> N busy CPU's to find the first idle CPU. Choosing a random starting point avoids
> such bias. It is probably a win for processors with 4 to 8 CPUs per core, and
> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
> to see performance data for this as a separate patch to decide. We have SPARC
> systems with 8 CPUs per core.
Which is why the current code already doesn't start from the first cpu
in the mask. We start at whatever CPU the task ran last on, which is
effectively 'random' if the system is busy.
So how is a per-cpu rotor better than that?
On 2/2/2018 12:17 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>>>> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>>>> {
>>>> + int i, rand_index, rand_cpu;
>>>> + int this_cpu = smp_processor_id();
>>>>
>>>> + rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
>>>> + rand_cpu = sg->cp_array[rand_index];
>>>
>>> Right, so yuck.. I know why you need that, but that extra array and
>>> dereference is the reason I never went there.
>>>
>>> How much difference does it really make vs the 'normal' wrapping search
>>> from last CPU ?
>>>
>>> This really should be a separate patch with separate performance numbers
>>> on.
>>
>> For the benefit of other readers, if we always search and choose starting from
>> the first CPU in a core, then later searches will often need to traverse the first
>> N busy CPU's to find the first idle CPU. Choosing a random starting point avoids
>> such bias. It is probably a win for processors with 4 to 8 CPUs per core, and
>> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
>> to see performance data for this as a separate patch to decide. We have SPARC
>> systems with 8 CPUs per core.
>
> Which is why the current code already doesn't start from the first cpu
> in the mask. We start at whatever CPU the task ran last on, which is
> effectively 'random' if the system is busy.
>
> So how is a per-cpu rotor better than that?
The current code is:
for_each_cpu(cpu, cpu_smt_mask(target)) {
For an 8-cpu/core processor, 8 values of target map to the same cpu_smt_mask.
8 different tasks will traverse the mask in the same order.
- Steve
On 2/2/18 9:17 AM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>>>> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>>>> {
>>>> + int i, rand_index, rand_cpu;
>>>> + int this_cpu = smp_processor_id();
>>>>
>>>> + rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
>>>> + rand_cpu = sg->cp_array[rand_index];
>>> Right, so yuck.. I know why you need that, but that extra array and
>>> dereference is the reason I never went there.
>>>
>>> How much difference does it really make vs the 'normal' wrapping search
>>> from last CPU ?
>>>
>>> This really should be a separate patch with separate performance numbers
>>> on.
>> For the benefit of other readers, if we always search and choose starting from
>> the first CPU in a core, then later searches will often need to traverse the first
>> N busy CPU's to find the first idle CPU. Choosing a random starting point avoids
>> such bias. It is probably a win for processors with 4 to 8 CPUs per core, and
>> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
>> to see performance data for this as a separate patch to decide. We have SPARC
>> systems with 8 CPUs per core.
> Which is why the current code already doesn't start from the first cpu
> in the mask. We start at whatever CPU the task ran last on, which is
> effectively 'random' if the system is busy.
>
> So how is a per-cpu rotor better than that?
In the scheme of SMT balance, if the idle cpu search is done _not_ in
the last run core, then we need a random cpu to start from. If the idle
cpu search is done in the last run core we can start the search from
last run cpu. Since we need the random index for the first case I just
did it for both.
Thanks,
Subhra
On 2/2/2018 12:21 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>> It might be interesting to add a tunable for the number of random choices to
>> make, and clamp it at the max nr computed from avg_cost in select_idle_cpu.
>
> This needs a fairly complicated PRNG for it would need to visit each
> possible CPU once before looping. A LFSR does that, but requires 2^n-1
> elements and we have topology masks that don't match that.. The trivial
> example is something with 6 cores.
Or keep it simple and accept the possibility of choosing the same candidate
more than once.
>> Or, choose a random starting point and then search for nr sequential
>> candidates; possibly limited by a tunable.
>
> And this is basically what we already do. Except with the task-cpu
> instead of a per-cpu rotor.
Righto. Disregard this suggestion.
- Steve
On 2/2/2018 12:39 PM, Steven Sistare wrote:
> On 2/2/2018 12:21 PM, Peter Zijlstra wrote:
>> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>>> It might be interesting to add a tunable for the number of random choices to
>>> make, and clamp it at the max nr computed from avg_cost in select_idle_cpu.
>>
>> This needs a fairly complicated PRNG for it would need to visit each
>> possible CPU once before looping. A LFSR does that, but requires 2^n-1
>> elements and we have topology masks that don't match that.. The trivial
>> example is something with 6 cores.
>
> Or keep it simple and accept the possibility of choosing the same candidate
> more than once.
>
>>> Or, choose a random starting point and then search for nr sequential
>>> candidates; possibly limited by a tunable.
>>
>> And this is basically what we already do. Except with the task-cpu
>> instead of a per-cpu rotor.
>
> Righto. Disregard this suggestion.
Actually, I take back my take back. I suspect the primary benefit
of random selection is that it breaks up resonance states where
CPUs that are busy tend to stay busy, and CPUs that are idle tend
to stay idle, which is reinforced by starting the search at target =
last cpu ran.
Or, a quantitative argument: if sampling a single random CPU
gives better results (and the data says it does), then sampling
a random cpu and searching nr from it should give better results,
since it has nr - 1 more chances to find an idle CPU.
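As a back-of-envelope version of that (under the simplifying assumption
that each sampled CPU is idle independently with probability p): a single
random sample succeeds with probability p, while a random start followed by
a search of nr candidates succeeds with probability 1 - (1 - p)^nr, e.g.
roughly 0.68 vs 0.25 for p = 0.25 and nr = 4.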
- Steve
On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> On 2/1/2018 7:33 AM, Peter Zijlstra wrote:
> > On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
> >> + rcu_read_lock();
> >> + sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
> >> + if (util) {
> >> + for_each_lower_domain(sd) {
> >> + if (sd->level == 0)
> >> + break;
> >
> > afaict you really only need this for the core, and here you're assuming
> > everything below the LLC is cores. Would it not be much clearer if you
> > introduce sd_core.
> >
> > As is, for_each_lower_domain includes the starting domain, sd->group
> > then is the first core group for this cpu. But then you continue to the
> > smt domain (on Intel, on other architectures there could be a cluster
> > domain in between) and then you bail using that sd->level == 0 hack
> > because otherwise things would go *bang*.
>
> Hi Peter,
>
> The code here and in smt_balance intentionally visits each level between
> the llc and smt, including core-cluster on architectures that define it.
> smt_balance thus has the chance to randomly pick a better cluster,
> and then within that cluster randomly pick a better core. It makes sense,
> as resources are shared within a cluster, and choosing a less loaded cluster
> should give better performance. As you suggest in a few other places,
> it would be nice to see performance results for this case. We have
> SPARC processors with core clusters.
>
But then you get that atomic crud to contend on the cluster level, which
is even worse than it contending on the core level.
On Fri, Feb 02, 2018 at 12:36:47PM -0500, Steven Sistare wrote:
> On 2/2/2018 12:17 PM, Peter Zijlstra wrote:
> > On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> >>>> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
> >>>> {
> >>>> + int i, rand_index, rand_cpu;
> >>>> + int this_cpu = smp_processor_id();
> >>>>
> >>>> + rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
> >>>> + rand_cpu = sg->cp_array[rand_index];
> >>>
> >>> Right, so yuck.. I know why you need that, but that extra array and
> >>> dereference is the reason I never went there.
> >>>
> >>> How much difference does it really make vs the 'normal' wrapping search
> >>> from last CPU ?
> >>>
> >>> This really should be a separate patch with separate performance numbers
> >>> on.
> >>
> >> For the benefit of other readers, if we always search and choose starting from
> >> the first CPU in a core, then later searches will often need to traverse the first
> >> N busy CPU's to find the first idle CPU. Choosing a random starting point avoids
> >> such bias. It is probably a win for processors with 4 to 8 CPUs per core, and
> >> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
> >> to see performance data for this as a separate patch to decide. We have SPARC
> >> systems with 8 CPUs per core.
> >
> > Which is why the current code already doesn't start from the first cpu
> > in the mask. We start at whatever CPU the task ran last on, which is
> > effectively 'random' if the system is busy.
> >
> > So how is a per-cpu rotor better than that?
>
> The current code is:
> for_each_cpu(cpu, cpu_smt_mask(target)) {
>
> For an 8-cpu/core processor, 8 values of target map to the same cpu_smt_mask.
> 8 different tasks will traverse the mask in the same order.
Ooh, the SMT loop.. yes that can be improved. But look at the other
ones, they do:
for_each_cpu_wrap(cpu, sched_domain_span(), target)
so we look for an idle cpu in the LLC domain, and start iteration at
@target, which will (on average) be different for different CPUs, and
thus hopefully find different idle CPUs.
You could simply change the SMT loop to something like:
for_each_cpu_wrap(cpu, cpu_smt_mask(target), target)
and see what that does.
On Fri, Feb 02, 2018 at 01:34:58PM -0500, Steven Sistare wrote:
> Actually, I take back my take back. I suspect the primary benefit
> of random selection is that it breaks up resonance states where
> CPUs that are busy tend to stay busy, and CPUs that are idle tend
> to stay idle, which is reinforced by starting the search at target =
> last cpu ran.
Which, according to these here patches:
https://lkml.kernel.org/r/[email protected]
is a good thing, because of power management.
On 2/2/2018 2:58 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 12:36:47PM -0500, Steven Sistare wrote:
>> On 2/2/2018 12:17 PM, Peter Zijlstra wrote:
>>> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>>>>>> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>>>>>> {
>>>>>> + int i, rand_index, rand_cpu;
>>>>>> + int this_cpu = smp_processor_id();
>>>>>>
>>>>>> + rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
>>>>>> + rand_cpu = sg->cp_array[rand_index];
>>>>>
>>>>> Right, so yuck.. I know why you need that, but that extra array and
>>>>> dereference is the reason I never went there.
>>>>>
>>>>> How much difference does it really make vs the 'normal' wrapping search
>>>>> from last CPU ?
>>>>>
>>>>> This really should be a separate patch with separate performance numbers
>>>>> on.
>>>>
>>>> For the benefit of other readers, if we always search and choose starting from
>>>> the first CPU in a core, then later searches will often need to traverse the first
>>>> N busy CPU's to find the first idle CPU. Choosing a random starting point avoids
>>>> such bias. It is probably a win for processors with 4 to 8 CPUs per core, and
>>>> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
>>>> to see performance data for this as a separate patch to decide. We have SPARC
>>>> systems with 8 CPUs per core.
>>>
>>> Which is why the current code already doesn't start from the first cpu
>>> in the mask. We start at whatever CPU the task ran last on, which is
>>> effectively 'random' if the system is busy.
>>>
>>> So how is a per-cpu rotor better than that?
>>
>> The current code is:
>> for_each_cpu(cpu, cpu_smt_mask(target)) {
>>
>> For an 8-cpu/core processor, 8 values of target map to the same cpu_smt_mask.
>> 8 different tasks will traverse the mask in the same order.
>
> Ooh, the SMT loop.. yes that can be improved. But look at the other
> ones, they do:
>
> for_each_cpu_wrap(cpu, sched_domain_span(), target)
>
> so we look for an idle cpu in the LLC domain, and start iteration at
> @target, which will (on average) be different for different CPUs, and
> thus hopefully find different idle CPUs.
>
> You could simply change the SMT loop to something like:
>
> for_each_cpu_wrap(cpu, cpu_smt_mask(target), target)
>
> and see what that does.
Good idea - Steve
On 2/2/2018 2:59 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>> On 2/1/2018 7:33 AM, Peter Zijlstra wrote:
>>> On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
>>>> + rcu_read_lock();
>>>> + sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
>>>> + if (util) {
>>>> + for_each_lower_domain(sd) {
>>>> + if (sd->level == 0)
>>>> + break;
>>>
>>> afaict you really only need this for the core, and here you're assuming
>>> everything below the LLC is cores. Would it not be much clearer if you
>>> introduce sd_core.
>>>
>>> As is, for_each_lower_domain includes the starting domain, sd->group
>>> then is the first core group for this cpu. But then you continue to the
>>> smt domain (on Intel, on other architectures there could be a cluster
>>> domain in between) and then you bail using that sd->level == 0 hack
>>> because otherwise things would go *bang*.
>>
>> Hi Peter,
>>
>> The code here and in smt_balance intentionally visits each level between
>> the llc and smt, including core-cluster on architectures that define it.
>> smt_balance thus has the chance to randomly pick a better cluster,
>> and then within that cluster randomly pick a better core. It makes sense,
>> as resources are shared within a cluster, and choosing a less loaded cluster
>> should give better performance. As you suggest in a few other places,
>> it would be nice to see performance results for this case. We have
>> SPARC processors with core clusters.
>>
>
> But then you get that atomic crud to contend on the cluster level, which
> is even worse than it contending on the core level.
True, but it can still be a net win if we make better scheduling decisions.
A saving grace is that the atomic counter is only updated if the cpu
makes a transition from idle to busy or vice versa.
We need data for this type of system, showing improvements for normal
workloads, and showing little downside for a high context switch rate
torture test.
- Steve
On 2/2/2018 3:04 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 01:34:58PM -0500, Steven Sistare wrote:
>> Actually, I take back my take back. I suspect the primary benefit
>> of random selection is that it breaks up resonance states where
>> CPUs that are busy tend to stay busy, and CPUs that are idle tend
>> to stay idle, which is reinforced by starting the search at target =
>> last cpu ran.
>
> Which, according to these here patches:
>
> https://lkml.kernel.org/r/[email protected]
>
> is a good thing, because of power management.
Yes, but it's a bad thing if ready to run tasks pile on busy CPUs and idle
CPUs go unused. Stating the obvious, when the search for idle fails, the
thread goes on a busy CPU. The existing logic that checks and uses the
initial target if it is idle reduces unnecessary spreading and is power
friendly (indeed, added in the patch you reference). Subhra's patch does
not change that.
- Steve
On Fri, 2018-02-02 at 13:34 -0500, Steven Sistare wrote:
> On 2/2/2018 12:39 PM, Steven Sistare wrote:
> > On 2/2/2018 12:21 PM, Peter Zijlstra wrote:
> >> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> >>> It might be interesting to add a tunable for the number of random choices to
> >>> make, and clamp it at the max nr computed from avg_cost in select_idle_cpu.
> >>
> >> This needs a fairly complicated PRNG for it would need to visit each
> >> possible CPU once before looping. A LFSR does that, but requires 2^n-1
> >> elements and we have topology masks that don't match that.. The trivial
> >> example is something with 6 cores.
> >
> > Or keep it simple and accept the possibility of choosing the same candidate
> > more than once.
> >
> >>> Or, choose a random starting point and then search for nr sequential
> >>> candidates; possibly limited by a tunable.
> >>
> >> And this is basically what we already do. Except with the task-cpu
> >> instead of a per-cpu rotor.
> >
> > Righto. Disregard this suggestion.
>
> Actually, I take back my take back. I suspect the primary benefit
> of random selection is that it breaks up resonance states where
> CPUs that are busy tend to stay busy, and CPUs that are idle tend
> to stay idle, which is reinforced by starting the search at target =
> last cpu ran.
I suspect the primary benefit is reduction of bouncing. The absolutely
maddening thing about SIS is that some stuff out there (like FB's load)
doesn't give a rats ass about anything other than absolute minimum
sched latency while other stuff notices cache going missing. Joy.
-Mike
On Fri, Feb 02, 2018 at 09:37:02AM -0800, Subhra Mazumdar wrote:
> In the scheme of SMT balance, if the idle cpu search is done _not_ in the
> last run core, then we need a random cpu to start from. If the idle cpu
> search is done in the last run core we can start the search from last run
> cpu. Since we need the random index for the first case I just did it for
> both.
That shouldn't be too hard to fix. I think we can simply transpose the
CPU number. That is, something like:
cpu' = core'_id + (cpu - core_id)
should work for most sane cases. We don't give any guarantees this will
in fact work, but (almost) all actual CPU enumeration schemes I've seen
this should work for.
And if it doesn't work, we're not worse off than we are now.
I just couldn't readily find a place where we need to do this for cores
with the current code. But I think we have one place between LLCs where
it can be done:
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b6535987500..eb8b8d0a026c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6109,7 +6109,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
if (!static_branch_likely(&sched_smt_present))
return -1;
- for_each_cpu(cpu, cpu_smt_mask(target)) {
+ for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
if (idle_cpu(cpu))
@@ -6357,8 +6357,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (cpu == prev_cpu)
goto pick_cpu;
- if (wake_affine(affine_sd, p, prev_cpu, sync))
- new_cpu = cpu;
+ if (wake_affine(affine_sd, p, prev_cpu, sync)) {
+ /*
+ * Transpose prev_cpu's offset into this cpu's
+ * LLC domain to retain the 'random' search offset
+ * for for_each_cpu_wrap().
+ */
+ new_cpu = per_cpu(sd_llc_id, cpu) +
+ (prev_cpu - per_cpu(sd_llc_id, prev_cpu));
+ if (unlikely(!cpus_share_cache(new_cpu, cpu)))
+ new_cpu = cpu;
+ }
}
if (sd && !(sd_flag & SD_BALANCE_FORK)) {
On Fri, Feb 02, 2018 at 04:06:32PM -0500, Steven Sistare wrote:
> On 2/2/2018 2:59 PM, Peter Zijlstra wrote:
> > But then you get that atomic crud to contend on the cluster level, which
> > is even worse than it contending on the core level.
>
> True, but it can still be a net win if we make better scheduling decisions.
> A saving grace is that the atomic counter is only updated if the cpu
> makes a transition from idle to busy or vice versa.
Which can still be a very high rate for some workloads. I always forget
which, but there are plenty of workloads that have very frequent, very
short idle times. Mike, do you remember what comes apart when we take
out the sysctl_sched_migration_cost test in idle_balance()?
> We need data for this type of system, showing improvements for normal
> workloads, and showing little downside for a high context switch rate
> torture test.
So core-wide atomics should, on architectures that can do atomics in L1,
be relatively fast. Once you leave L1, atomic contention goes up a fair
bit. And then there's architectures that simply don't do atomics in L1
(like Power).
Testing on my SKL desktop, atomics contending between SMT siblings is
basically free (weirdly enough my test says it's cheaper), atomics
contending on the L3 is 10x as expensive, and this is with only 4 cores.
If I test it on my 10 core IVB, I'm up to 20x, and I can't imagine that
getting any better with bigger core count (my IVB does not show SMT
contention as lower, but not particularly more expensive either).
So while I see the point of tracking these numbers (for SMT>2), I don't
think it's worth doing outside of the core, and then we still need some
powerpc (or any other architecture with abysmal atomics) tested.
So what we can do is make this counting crud conditional on SMT>2 and
possibly part of the topology flags such that an architecture can
opt-out.
Then select_idle_core can be augmented to remember the least-loaded core
it encounters in its traversal, and go with that.
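Something like the following sketch of that augmentation, loosely modelled
on the existing select_idle_core() scan (cpus being the usual
select_idle_mask copy); the busy counting and the fallback are only
illustrative, not from a posted patch:

        int best_core = -1, least_busy = INT_MAX;
        int core, cpu;

        for_each_cpu_wrap(core, cpus, target) {
                bool idle = true;
                int busy = 0;

                for_each_cpu(cpu, cpu_smt_mask(core)) {
                        cpumask_clear_cpu(cpu, cpus);
                        if (!idle_cpu(cpu)) {
                                idle = false;
                                busy++;
                        }
                }

                if (idle)
                        return core;            /* fully idle core, done */

                if (busy < least_busy) {        /* otherwise remember best */
                        least_busy = busy;
                        best_core = core;
                }
        }

        set_idle_cores(target, 0);
        return best_core;                       /* may still be -1 */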
On Mon, 2018-02-05 at 13:48 +0100, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 04:06:32PM -0500, Steven Sistare wrote:
> > On 2/2/2018 2:59 PM, Peter Zijlstra wrote:
>
> > > But then you get that atomic crud to contend on the cluster level, which
> > > is even worse than it contending on the core level.
> >
> > True, but it can still be a net win if we make better scheduling decisions.
> > A saving grace is that the atomic counter is only updated if the cpu
> > makes a transition from idle to busy or vice versa.
>
> Which can still be a very high rate for some workloads. I always forget
> which, but there are plenty of workloads that have very frequent, very
> short idle times. Mike, do you remember what comes apart when we take
> out the sysctl_sched_migration_cost test in idle_balance()?
Used to be anything scheduling cross-core heftily suffered, ie pretty
much any localhost communication heavy load. I just tried disabling it
in 4.13 though (pre pti cliff), tried tbench, and it made zip squat
difference. I presume that's due to the meanwhile added
this_rq->rd->overload and/or curr_cost checks. I don't recall the
original cost details beyond it having been "a sh*tload".
-Mike
On Mon, Feb 05, 2018 at 01:48:54PM +0100, Peter Zijlstra wrote:
> So while I see the point of tracking these numbers (for SMT>2), I don't
> think it's worth doing outside of the core, and then we still need some
> powerpc (or any other architecture with abysmal atomics) tested.
FWIW Power has another 'fun' feature, their cores have asymmetric SMT.
Their cores have a static power level, based on _which_ SMT sibling is
running, not how many. A single SMT2 runs (much) slower than a single
SMT0.
So that random selection stuff really doesn't work well for them. Now
'sadly' x86 can also have ASYM_PACKING set on its SMT domain, so I'm
going to have to figure out what to do about all that.
On 02/05/2018 04:19 AM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 09:37:02AM -0800, Subhra Mazumdar wrote:
>> In the scheme of SMT balance, if the idle cpu search is done _not_ in the
>> last run core, then we need a random cpu to start from. If the idle cpu
>> search is done in the last run core we can start the search from last run
>> cpu. Since we need the random index for the first case I just did it for
>> both.
> That shouldn't be too hard to fix. I think we can simply transpose the
> CPU number. That is, something like:
>
> cpu' = core'_id + (cpu - core_id)
>
> should work for most sane cases. We don't give any guarantees this will
> in fact work, but (almost) all actual CPU enumeration schemes I've seen
> this should work for.
>
> And if it doesn't work, we're not worse off than we are now.
>
> I just couldn't readily find a place where we need to do this for cores
> with the current code. But I think we have one place between LLCs where
> it can be done:
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7b6535987500..eb8b8d0a026c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6109,7 +6109,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> if (!static_branch_likely(&sched_smt_present))
> return -1;
>
> - for_each_cpu(cpu, cpu_smt_mask(target)) {
> + for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> continue;
> if (idle_cpu(cpu))
> @@ -6357,8 +6357,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> if (cpu == prev_cpu)
> goto pick_cpu;
>
> - if (wake_affine(affine_sd, p, prev_cpu, sync))
> - new_cpu = cpu;
> + if (wake_affine(affine_sd, p, prev_cpu, sync)) {
> + /*
> + * Transpose prev_cpu's offset into this cpu's
> + * LLC domain to retain the 'random' search offset
> + * for for_each_cpu_wrap().
> + */
> + new_cpu = per_cpu(sd_llc_id, cpu) +
> + (prev_cpu - per_cpu(sd_llc_id, prev_cpu));
> + if (unlikely(!cpus_share_cache(new_cpu, cpu)))
> + new_cpu = cpu;
> + }
> }
>
> if (sd && !(sd_flag & SD_BALANCE_FORK)) {
The pseudo random is also used for choosing a random core to compare
with, how will transposing achieve that?
Thanks,
Subhra
On 02/05/2018 09:03 AM, Peter Zijlstra wrote:
> On Mon, Feb 05, 2018 at 01:48:54PM +0100, Peter Zijlstra wrote:
>> So while I see the point of tracking these numbers (for SMT>2), I don't
>> think it's worth doing outside of the core, and then we still need some
>> powerpc (or any other architecture with abysmal atomics) tested.
> FWIW Power has another 'fun' feature, their cores have asymmetric SMT.
>
> Their cores have a static power level, based on _which_ SMT sibling is
> running, not how many. A single SMT2 runs (much) slower than a single
> SMT0.
>
> So that random selection stuff really doesn't work well for them. Now
> 'sadly' x86 can also have ASYM_PACKING set on its SMT domain, so I'm
> going to have to figure out what to do about all that.
Even the existing code doesn't handle that. The SMT balancing compares
the remaining SMT capacity, so it should work OK even with asymmetric cores.
Thanks,
Subhra
On Mon, Feb 05, 2018 at 02:09:11PM -0800, Subhra Mazumdar wrote:
> The pseudo random is also used for choosing a random core to compare with,
> how will transposing achieve that?
Not entirely sure what your point is. Current code doesn't compare to
just _one_ other core, and I don't think we'd ever want to do that.
So currently select_idle_core() will, if there is an idle core, iterate
the whole thing trying to find it. If it fails, it clears the
'have_idle_core' state.
select_idle_cpu, which we'll fall back to, will limit the scanning based
on the average idle time.
The crucial point however, is that concurrent wakeups will not, on
average, do the same iteration because of the target offset.
On 02/06/2018 01:12 AM, Peter Zijlstra wrote:
> On Mon, Feb 05, 2018 at 02:09:11PM -0800, Subhra Mazumdar wrote:
>> The pseudo random is also used for choosing a random core to compare with,
>> how will transposing achieve that?
> Not entirely sure what your point is. Current code doesn't compare to
> just _one_ other core, and I don't think we'd ever want to do that.
>
> So currently select_idle_core() will, if there is an idle core, iterate
> the whole thing trying to find it. If it fails, it clears the
> 'have_idle_core' state.
>
> select_idle_cpu, which we'll fall back to, will limit the scanning based
> on the average idle time.
>
>
> The crucial point however, is that concurrent wakeups will not, on
> average, do the same iteration because of the target offset.
I meant the SMT balance patch. That does comparison with only one other
random core and takes the decision in O(1). Any potential scan of all cores
or cpus is O(n) and doesn't scale and will only get worse in future. That
applies to both select_idle_core() and select_idle_cpu().
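Concretely, the O(1) step is just a spare-capacity comparison along these
lines (a sketch only; the helper names below are made up and not from the
actual patch, which keeps the per-core busy count in an atomic):

        /*
         * Pick whichever core has more idle SMT capacity left.
         * core_weight() and core_smt_util() are hypothetical accessors
         * for the group weight and the per-core busy-cpu count.
         */
        static int smt_balance_pick(int target_core, int random_core)
        {
                int cap_t = core_weight(target_core) -
                            atomic_read(core_smt_util(target_core));
                int cap_r = core_weight(random_core) -
                            atomic_read(core_smt_util(random_core));

                return cap_r > cap_t ? random_core : target_core;
        }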
Is there any reason this randomized approach is not acceptable even if
benchmarks show improvement? Are there other benchmarks I should try?
Also your suggestion to keep the SMT utilization but still do a
traversal of cores
in select_idle_core() while remembering the least loaded core will still
have
the problem of potentially traversing all cores. I can compare this with
a core
level only SMT balancing, is that useful to decide? I will also test on
SPARC
machines with higher degree of SMT.
You had also mentioned doing it only for SMT>2; I'm not sure I understand why,
as even for SMT=2 (intel) benchmarks show improvement. This clearly shows
the scalability problem.
Thanks,
Subhra
On Tue, Feb 06, 2018 at 04:30:03PM -0800, Subhra Mazumdar wrote:
> I meant the SMT balance patch. That does comparison with only one other
> random core and takes the decision in O(1). Any potential scan of all cores
> or cpus is O(n) and doesn't scale and will only get worse in future. That
> applies to both select_idle_core() and select_idle_cpu().
We only do the full scan if we think to know there is indeed an idle
core to be had, and if there are idle cores the machine isn't terribly
busy.
If there are no idle cores, we do not in fact scan everything. We limit
the amount of scanning based on the average idle time with a minimum of
4.
(Except when you switch off things like SIS_PROP, then you scan
unconditionally and reduce tail latency at the cost of throughput --
like said, some people want this).
O(1) sounds nice, but it has horrible worst case latencies.
And like I said, I don't see how a rotor is particularly more random
than the previous cpu the task you happen to be waking ran on.
> Is there any reason this randomized approach is not acceptable even if
> benchmarks show improvement? Are there other benchmarks I should try?
Looking at just one other core has a fairly high chance of not finding
idle. I really cannot remember all the benchmarks, but Mike did
something a little less random but still O(1) a few years back and that
crashed and burned.
> Also your suggestion to keep the SMT utilization but still do a traversal of
> cores
> in select_idle_core() while remembering the least loaded core will still
> have
> the problem of potentially traversing all cores. I can compare this with a
> core
> level only SMT balancing, is that useful to decide? I will also test on
> SPARC
> machines with higher degree of SMT.
Please learn to use your email client, that's near unreadable for
whitespace damage, time is really too precious to try and untangle crap
like that.
> You had also mentioned doing it only for SMT>2; I'm not sure I understand why,
> as even for SMT=2 (intel) benchmarks show improvement. This clearly shows
> the scalability problem.
For SMT2 you don't need the occupation counter with atomic crud, a
simple !atomic core-idle works just fine. And your 'patch' had soo many
moving parts it was too hard to say which aspect changed what.
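To make that concrete, for SMT2 the "!atomic core-idle" test can be as
simple as something like the following (an illustrative sketch, not code
from any posted patch):

        /* true if every other sibling of @cpu's core is idle */
        static inline bool core_has_idle_sibling(int cpu)
        {
                int sibling;

                for_each_cpu(sibling, cpu_smt_mask(cpu)) {
                        if (sibling == cpu)
                                continue;
                        if (!idle_cpu(sibling))
                                return false;
                }
                return true;
        }

With SMT2 the loop body effectively runs once, so there is no counter to
keep consistent and hence no atomics.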
hackbench really isn't _that_ interesting a benchmark and it's fairly
easy to make it go faster, but doing so typically wrecks something else.
There's a whole array of netperf benchmarks which you have to run at
different utilization levels, there's the sysbench oltp stuff, which you
can run on MariaDB and PostgreSQL, again at different utilization levels
(both databases behave quite differently). There's ebizzy, tbench,
dbench and others.
There's also NUMA stuff, like NAS benchmarks and specJBB nonsense.
There's the facebook schbench, which is a bit finicky to set up.
And I'm sure I'm forgetting half, look at the patches and regression
reports for more clues, that's what Google is for. You could have
figured all that out yourself, much of these are listed in the
changelogs and if you'd bothered to find lkml regression reports you
could find more.
On 02/07/2018 12:42 AM, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 04:30:03PM -0800, Subhra Mazumdar wrote:
>
>> I meant the SMT balance patch. That does comparison with only one other
>> random core and takes the decision in O(1). Any potential scan of all cores
>> or cpus is O(n) and doesn't scale and will only get worse in future. That
>> applies to both select_idle_core() and select_idle_cpu().
> We only do the full scan if we think to know there is indeed an idle
> core to be had, and if there are idle cores the machine isn't terribly
> busy.
>
> If there are no idle cores, we do not in fact scan everything. We limit
> the amount of scanning based on the average idle time with a minimum of
> 4.
This logic may not be working as well as you think. I had sent the cost
of select_idle_sibling() with and without my patch and there was a huge
difference:
Following is the cost (in us) of select_idle_sibling() with hackbench 16
groups:
function baseline-rc6 %stdev patch %stdev
select_idle_sibling() 0.556 1.72 0.263 (-52.70%) 0.78
>
> (Except when you switch off things like SIS_PROP, then you scan
> unconditionally and reduce tail latency at the cost of throughput --
> like said, some people want this).
>
> O(1) sounds nice, but it has horrible worst case latencies.
>
> And like I said, I don't see how a rotor is particularly more random
> than the previous cpu the task you happen to be waking ran on.
The rotor randomness is not the main point here. I think the benchmark
improvements come from the fact that select_idle_sibling() cost has reduced
a lot. Reducing it while still maintaining a good spread of threads can be
achieved by this SMT balance scheme, which in turn requires a fast, decent
random number generator; the rotor is just an easy way to get one.
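For illustration, such a per-cpu rotor can be as small as a xorshift step
(this is only a sketch with made-up names, not the exact generator used in
the patch):

        static DEFINE_PER_CPU(u32, sis_rotor);

        static inline u32 sis_random(void)
        {
                u32 x = __this_cpu_read(sis_rotor);

                if (!x)                 /* xorshift must not start from 0 */
                        x = raw_smp_processor_id() + 1;
                x ^= x << 13;
                x ^= x >> 17;
                x ^= x << 5;
                __this_cpu_write(sis_rotor, x);
                return x;
        }

A random cpu or core index is then just sis_random() modulo the group
weight.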
>> Is there any reason this randomized approach is not acceptable even if
>> benchmarks show improvement? Are there other benchmarks I should try?
> Looking at just one other core has a fairly high chance of not finding
> idle. I really cannot remember all the benchmarks, but Mike did
> something a little less random but still O(1) a few years back and that
> crashed and burned.
>
>> Also your suggestion to keep the SMT utilization but still do a traversal of
>> cores
>> in select_idle_core() while remembering the least loaded core will still
>> have
>> the problem of potentially traversing all cores. I can compare this with a
>> core
>> level only SMT balancing, is that useful to decide? I will also test on
>> SPARC
>> machines with higher degree of SMT.
> Please learn to use your email client, that's near unreadable for
> whitespace damage, time is really too precious to try and untangle crap
> like that.
Sorry about that
>
>> You had also mentioned doing it only for SMT>2; I'm not sure I understand why,
>> as even for SMT=2 (intel) benchmarks show improvement. This clearly shows
>> the scalability problem.
> For SMT2 you don't need the occupation counter with atomic crud, a
> simple !atomic core-idle works just fine. And your 'patch' had soo many
> moving parts it was too hard to say which aspect changed what.
>
> hackbench really isn't _that_ interesting a benchmark and it's fairly
> easy to make it go faster, but doing so typically wrecks something else.
I also had uperf and Oracle DB TPC-C numbers in the patch, which showed
improvements.
>
> There's a whole array of netperf benchmarks which you have to run at
> different utilization levels, there's the sysbench oltp stuff, which you
> can run on MariaDB and PostgreSQL, again at different utilization levels
> (both databases behave quite differently). There's ebizzy, tbench,
> dbench and others.
>
> There's also NUMA stuff, like NAS benchmarks and specJBB nonsense.
>
> There's the facebook schbench, which is a bit finicky to set up.
>
> And I'm sure I'm forgetting half, look at the patches and regression
> reports for more clues, that's what Google is for. You could have
> figured all that out yourself, much of these are listed in the
> changelogs and if you'd bothered to find lkml regression reports you
> could find more.
Ok, will find out and run them.
Thanks,
Subhra