2022-05-09 04:44:03

by Abel Wu

Subject: [PATCH v3] sched/fair: filter out overloaded cpus in SIS

Try to improve the search efficiency of SIS by filtering out the
overloaded cpus; as a result, the more overloaded the system is,
the fewer cpus will be searched.

The overloaded cpus are tracked through the LLC shared domain. To
regulate accesses to the shared data, the update happens mainly
at the tick. But in order to make it more accurate, we also take
task migrations into consideration during load balancing, which
can be quite frequent due to short-running workloads triggering
newly-idle balancing. Since an overloaded runqueue requires at
least 2 non-idle runnable tasks, we can have more faith in the
"frequent newly-idle" case.

Benchmark
=========

Tests are done on an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
machine with 2 NUMA nodes, each of which has 24 cores with SMT2
enabled, so 96 CPUs in total.

All of the benchmarks are done inside a normal cpu cgroup in a
clean environment with cpu turbo disabled.

Based on tip sched/core 089c02ae2771 (v5.18-rc1).

Results
=======

hackbench-process-pipes
vanilla filter
Amean 1 0.2537 ( 0.00%) 0.2330 ( 8.15%)
Amean 4 0.7090 ( 0.00%) 0.7440 * -4.94%*
Amean 7 0.9153 ( 0.00%) 0.9040 ( 1.24%)
Amean 12 1.1473 ( 0.00%) 1.0857 * 5.37%*
Amean 21 2.7210 ( 0.00%) 2.2320 * 17.97%*
Amean 30 4.8263 ( 0.00%) 3.6170 * 25.06%*
Amean 48 7.4107 ( 0.00%) 6.1130 * 17.51%*
Amean 79 9.2120 ( 0.00%) 8.2350 * 10.61%*
Amean 110 10.1647 ( 0.00%) 8.8043 * 13.38%*
Amean 141 11.5713 ( 0.00%) 10.5867 * 8.51%*
Amean 172 13.7963 ( 0.00%) 12.8120 * 7.13%*
Amean 203 15.9283 ( 0.00%) 14.8703 * 6.64%*
Amean 234 17.8737 ( 0.00%) 17.1053 * 4.30%*
Amean 265 19.8443 ( 0.00%) 18.7870 * 5.33%*
Amean 296 22.4147 ( 0.00%) 21.3943 * 4.55%*

There is a regression in the 4-groups test, in which case lots of
busy cpus can be found in the system. The busy cpus are not recorded
in the overloaded cpu mask, so we trade overhead for nothing in SIS.
This is the worst case for this feature.

hackbench-process-sockets
vanilla filter
Amean 1 0.5343 ( 0.00%) 0.5270 ( 1.37%)
Amean 4 1.4500 ( 0.00%) 1.4273 * 1.56%*
Amean 7 2.4813 ( 0.00%) 2.4383 * 1.73%*
Amean 12 4.1357 ( 0.00%) 4.0827 * 1.28%*
Amean 21 6.9707 ( 0.00%) 6.9290 ( 0.60%)
Amean 30 9.8373 ( 0.00%) 9.6730 * 1.67%*
Amean 48 15.6233 ( 0.00%) 15.3213 * 1.93%*
Amean 79 26.2763 ( 0.00%) 25.3293 * 3.60%*
Amean 110 36.6170 ( 0.00%) 35.8920 * 1.98%*
Amean 141 47.0720 ( 0.00%) 45.8773 * 2.54%*
Amean 172 57.0580 ( 0.00%) 56.1627 * 1.57%*
Amean 203 67.2040 ( 0.00%) 66.4323 * 1.15%*
Amean 234 77.8897 ( 0.00%) 76.6320 * 1.61%*
Amean 265 88.0437 ( 0.00%) 87.1400 ( 1.03%)
Amean 296 98.2387 ( 0.00%) 96.8633 * 1.40%*

hackbench-thread-pipes
vanilla filter
Amean 1 0.2693 ( 0.00%) 0.2800 * -3.96%*
Amean 4 0.7843 ( 0.00%) 0.7680 ( 2.08%)
Amean 7 0.9287 ( 0.00%) 0.9217 ( 0.75%)
Amean 12 1.4443 ( 0.00%) 1.3680 * 5.29%*
Amean 21 3.5150 ( 0.00%) 3.1107 * 11.50%*
Amean 30 6.3997 ( 0.00%) 5.2160 * 18.50%*
Amean 48 8.4183 ( 0.00%) 7.8477 * 6.78%*
Amean 79 10.0713 ( 0.00%) 9.2240 * 8.41%*
Amean 110 10.9940 ( 0.00%) 10.1280 * 7.88%*
Amean 141 13.6347 ( 0.00%) 11.9387 * 12.44%*
Amean 172 15.0523 ( 0.00%) 14.4117 ( 4.26%)
Amean 203 18.0710 ( 0.00%) 17.3533 ( 3.97%)
Amean 234 19.7413 ( 0.00%) 19.8453 ( -0.53%)
Amean 265 23.1820 ( 0.00%) 22.8223 ( 1.55%)
Amean 296 25.3820 ( 0.00%) 24.2397 ( 4.50%)

hackbench-thread-sockets
vanilla filter
Amean 1 0.5893 ( 0.00%) 0.5750 * 2.43%*
Amean 4 1.4853 ( 0.00%) 1.4727 ( 0.85%)
Amean 7 2.5353 ( 0.00%) 2.5047 * 1.21%*
Amean 12 4.3003 ( 0.00%) 4.1910 * 2.54%*
Amean 21 7.1930 ( 0.00%) 7.1533 ( 0.55%)
Amean 30 10.0983 ( 0.00%) 9.9690 * 1.28%*
Amean 48 15.9853 ( 0.00%) 15.6963 * 1.81%*
Amean 79 26.7537 ( 0.00%) 25.9497 * 3.01%*
Amean 110 37.3850 ( 0.00%) 36.6793 * 1.89%*
Amean 141 47.7730 ( 0.00%) 47.0967 * 1.42%*
Amean 172 58.4280 ( 0.00%) 57.5513 * 1.50%*
Amean 203 69.3093 ( 0.00%) 67.7680 * 2.22%*
Amean 234 80.0190 ( 0.00%) 78.2633 * 2.19%*
Amean 265 90.7237 ( 0.00%) 89.1027 * 1.79%*
Amean 296 101.1153 ( 0.00%) 99.2693 * 1.83%*

schbench
vanilla filter
Lat 50.0th-qrtle-1 5.00 ( 0.00%) 5.00 ( 0.00%)
Lat 75.0th-qrtle-1 5.00 ( 0.00%) 5.00 ( 0.00%)
Lat 90.0th-qrtle-1 5.00 ( 0.00%) 5.00 ( 0.00%)
Lat 95.0th-qrtle-1 5.00 ( 0.00%) 5.00 ( 0.00%)
Lat 99.0th-qrtle-1 6.00 ( 0.00%) 6.00 ( 0.00%)
Lat 99.5th-qrtle-1 7.00 ( 0.00%) 6.00 ( 14.29%)
Lat 99.9th-qrtle-1 7.00 ( 0.00%) 7.00 ( 0.00%)
Lat 50.0th-qrtle-2 6.00 ( 0.00%) 6.00 ( 0.00%)
Lat 75.0th-qrtle-2 7.00 ( 0.00%) 7.00 ( 0.00%)
Lat 90.0th-qrtle-2 7.00 ( 0.00%) 7.00 ( 0.00%)
Lat 95.0th-qrtle-2 7.00 ( 0.00%) 7.00 ( 0.00%)
Lat 99.0th-qrtle-2 9.00 ( 0.00%) 8.00 ( 11.11%)
Lat 99.5th-qrtle-2 9.00 ( 0.00%) 9.00 ( 0.00%)
Lat 99.9th-qrtle-2 12.00 ( 0.00%) 11.00 ( 8.33%)
Lat 50.0th-qrtle-4 8.00 ( 0.00%) 8.00 ( 0.00%)
Lat 75.0th-qrtle-4 10.00 ( 0.00%) 10.00 ( 0.00%)
Lat 90.0th-qrtle-4 10.00 ( 0.00%) 11.00 ( -10.00%)
Lat 95.0th-qrtle-4 11.00 ( 0.00%) 11.00 ( 0.00%)
Lat 99.0th-qrtle-4 12.00 ( 0.00%) 13.00 ( -8.33%)
Lat 99.5th-qrtle-4 16.00 ( 0.00%) 14.00 ( 12.50%)
Lat 99.9th-qrtle-4 17.00 ( 0.00%) 15.00 ( 11.76%)
Lat 50.0th-qrtle-8 13.00 ( 0.00%) 13.00 ( 0.00%)
Lat 75.0th-qrtle-8 16.00 ( 0.00%) 16.00 ( 0.00%)
Lat 90.0th-qrtle-8 18.00 ( 0.00%) 18.00 ( 0.00%)
Lat 95.0th-qrtle-8 19.00 ( 0.00%) 18.00 ( 5.26%)
Lat 99.0th-qrtle-8 24.00 ( 0.00%) 21.00 ( 12.50%)
Lat 99.5th-qrtle-8 28.00 ( 0.00%) 26.00 ( 7.14%)
Lat 99.9th-qrtle-8 33.00 ( 0.00%) 32.00 ( 3.03%)
Lat 50.0th-qrtle-16 20.00 ( 0.00%) 20.00 ( 0.00%)
Lat 75.0th-qrtle-16 28.00 ( 0.00%) 28.00 ( 0.00%)
Lat 90.0th-qrtle-16 32.00 ( 0.00%) 32.00 ( 0.00%)
Lat 95.0th-qrtle-16 34.00 ( 0.00%) 34.00 ( 0.00%)
Lat 99.0th-qrtle-16 40.00 ( 0.00%) 40.00 ( 0.00%)
Lat 99.5th-qrtle-16 44.00 ( 0.00%) 44.00 ( 0.00%)
Lat 99.9th-qrtle-16 53.00 ( 0.00%) 67.00 ( -26.42%)
Lat 50.0th-qrtle-32 39.00 ( 0.00%) 36.00 ( 7.69%)
Lat 75.0th-qrtle-32 57.00 ( 0.00%) 52.00 ( 8.77%)
Lat 90.0th-qrtle-32 69.00 ( 0.00%) 61.00 ( 11.59%)
Lat 95.0th-qrtle-32 76.00 ( 0.00%) 64.00 ( 15.79%)
Lat 99.0th-qrtle-32 88.00 ( 0.00%) 74.00 ( 15.91%)
Lat 99.5th-qrtle-32 91.00 ( 0.00%) 80.00 ( 12.09%)
Lat 99.9th-qrtle-32 115.00 ( 0.00%) 107.00 ( 6.96%)
Lat 50.0th-qrtle-47 63.00 ( 0.00%) 55.00 ( 12.70%)
Lat 75.0th-qrtle-47 93.00 ( 0.00%) 80.00 ( 13.98%)
Lat 90.0th-qrtle-47 116.00 ( 0.00%) 97.00 ( 16.38%)
Lat 95.0th-qrtle-47 129.00 ( 0.00%) 106.00 ( 17.83%)
Lat 99.0th-qrtle-47 148.00 ( 0.00%) 123.00 ( 16.89%)
Lat 99.5th-qrtle-47 157.00 ( 0.00%) 132.00 ( 15.92%)
Lat 99.9th-qrtle-47 387.00 ( 0.00%) 164.00 ( 57.62%)

netperf-udp
vanilla filter
Hmean send-64 183.09 ( 0.00%) 182.28 ( -0.44%)
Hmean send-128 364.68 ( 0.00%) 363.12 ( -0.43%)
Hmean send-256 715.38 ( 0.00%) 716.57 ( 0.17%)
Hmean send-1024 2764.76 ( 0.00%) 2779.17 ( 0.52%)
Hmean send-2048 5282.93 ( 0.00%) 5220.41 * -1.18%*
Hmean send-3312 8282.26 ( 0.00%) 8121.78 * -1.94%*
Hmean send-4096 10108.12 ( 0.00%) 10042.98 ( -0.64%)
Hmean send-8192 16868.49 ( 0.00%) 16826.99 ( -0.25%)
Hmean send-16384 26230.44 ( 0.00%) 26271.85 ( 0.16%)
Hmean recv-64 183.09 ( 0.00%) 182.28 ( -0.44%)
Hmean recv-128 364.68 ( 0.00%) 363.12 ( -0.43%)
Hmean recv-256 715.38 ( 0.00%) 716.57 ( 0.17%)
Hmean recv-1024 2764.76 ( 0.00%) 2779.17 ( 0.52%)
Hmean recv-2048 5282.93 ( 0.00%) 5220.39 * -1.18%*
Hmean recv-3312 8282.26 ( 0.00%) 8121.78 * -1.94%*
Hmean recv-4096 10108.12 ( 0.00%) 10042.97 ( -0.64%)
Hmean recv-8192 16868.47 ( 0.00%) 16826.93 ( -0.25%)
Hmean recv-16384 26230.44 ( 0.00%) 26271.75 ( 0.16%)

The overhead this feature adds to the scheduler can be unfriendly
to fast context-switching workloads like netperf/tbench, but the
test results seem fine.

netperf-tcp
vanilla filter
Hmean 64 863.35 ( 0.00%) 1176.11 * 36.23%*
Hmean 128 1674.32 ( 0.00%) 2223.37 * 32.79%*
Hmean 256 3151.03 ( 0.00%) 4109.64 * 30.42%*
Hmean 1024 10281.94 ( 0.00%) 12799.28 * 24.48%*
Hmean 2048 16906.05 ( 0.00%) 20129.91 * 19.07%*
Hmean 3312 21246.21 ( 0.00%) 24747.24 * 16.48%*
Hmean 4096 23690.57 ( 0.00%) 26596.35 * 12.27%*
Hmean 8192 28758.29 ( 0.00%) 30423.10 * 5.79%*
Hmean 16384 33071.06 ( 0.00%) 34262.39 * 3.60%*

The suspicious improvement (and the regression at 128 clients in
tbench4 below) needs further digging.

tbench4 Throughput
vanilla filter
Hmean 1 293.71 ( 0.00%) 298.89 * 1.76%*
Hmean 2 583.25 ( 0.00%) 596.00 * 2.19%*
Hmean 4 1162.40 ( 0.00%) 1176.73 * 1.23%*
Hmean 8 2309.28 ( 0.00%) 2332.89 * 1.02%*
Hmean 16 4517.23 ( 0.00%) 4587.60 * 1.56%*
Hmean 32 7458.54 ( 0.00%) 7550.19 * 1.23%*
Hmean 64 9041.62 ( 0.00%) 9192.69 * 1.67%*
Hmean 128 19983.62 ( 0.00%) 12228.91 * -38.81%*
Hmean 256 20054.12 ( 0.00%) 20997.33 * 4.70%*
Hmean 384 19137.11 ( 0.00%) 20331.14 * 6.24%*

v3:
- removed sched-idle balance feature and focus on SIS
- take non-CFS tasks into consideration
- several fixes/improvements suggested by Josh Don

v2:
- several optimizations on sched-idle balancing
- ignore asym topos in can_migrate_task
- add more benchmarks including SIS efficiency
- re-organize patch as suggested by Mel

v1: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/lkml/[email protected]/

Signed-off-by: Abel Wu <[email protected]>
---
include/linux/sched/topology.h | 12 ++++++++++
kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++
kernel/sched/fair.c | 43 +++++++++++++++++++++++++++-------
kernel/sched/idle.c | 1 +
kernel/sched/sched.h | 4 ++++
kernel/sched/topology.c | 4 +++-
6 files changed, 92 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..95c7ad1e05b5 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,8 +81,20 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+
+ /*
+ * Tracking of the overloaded cpus can be heavy, so start
+ * a new cacheline to avoid false sharing.
+ */
+ atomic_t nr_overloaded_cpus ____cacheline_aligned;
+ unsigned long overloaded_cpus[]; /* Must be last */
};

+static inline struct cpumask *sdo_mask(struct sched_domain_shared *sds)
+{
+ return to_cpumask(sds->overloaded_cpus);
+}
+
struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent; /* top domain must be null terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51efaabac3e4..a29801c8b363 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5320,6 +5320,42 @@ __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
#endif /* CONFIG_SCHED_DEBUG */

+#ifdef CONFIG_SMP
+static inline bool rq_overloaded(struct rq *rq)
+{
+ return rq->nr_running - rq->cfs.idle_h_nr_running > 1;
+}
+
+void update_overloaded_rq(struct rq *rq)
+{
+ struct sched_domain_shared *sds;
+ bool overloaded = rq_overloaded(rq);
+ int cpu = cpu_of(rq);
+
+ lockdep_assert_rq_held(rq);
+
+ if (rq->overloaded == overloaded)
+ return;
+
+ rcu_read_lock();
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (unlikely(!sds))
+ goto unlock;
+
+ if (overloaded) {
+ cpumask_set_cpu(cpu, sdo_mask(sds));
+ atomic_inc(&sds->nr_overloaded_cpus);
+ } else {
+ cpumask_clear_cpu(cpu, sdo_mask(sds));
+ atomic_dec(&sds->nr_overloaded_cpus);
+ }
+
+ rq->overloaded = overloaded;
+unlock:
+ rcu_read_unlock();
+}
+#endif
+
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
@@ -5346,6 +5382,7 @@ void scheduler_tick(void)
resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq);
sched_core_tick(rq);
+ update_overloaded_rq(rq);

rq_unlock(rq, &rf);

@@ -9578,6 +9615,7 @@ void __init sched_init(void)
rq->wake_stamp = jiffies;
rq->wake_avg_idle = rq->avg_idle;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+ rq->overloaded = 0;

INIT_LIST_HEAD(&rq->cfs_tasks);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4bd299d67ab..79b4ff24faee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int i, cpu, idle_cpu = -1, nr = INT_MAX;
+ struct sched_domain_shared *sds = sd->shared;
+ int nr, nro, weight = sd->span_weight;
+ int i, cpu, idle_cpu = -1;
struct rq *this_rq = this_rq();
int this = smp_processor_id();
struct sched_domain *this_sd;
@@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
if (!this_sd)
return -1;

+ nro = atomic_read(&sds->nr_overloaded_cpus);
+ if (nro == weight)
+ goto out;
+
+ nr = min_t(int, weight, p->nr_cpus_allowed);
+
+ /*
+ * It's unlikely to find an idle cpu if the system is under
+ * heavy pressure, so skip searching to save a few cycles
+ * and relieve cache traffic.
+ */
+ if (weight - nro < (nr >> 4) && !has_idle_core)
+ return -1;
+
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ if (nro > 1)
+ cpumask_andnot(cpus, cpus, sdo_mask(sds));

if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
@@ -6354,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
avg_idle = this_rq->wake_avg_idle;
avg_cost = this_sd->avg_scan_cost + 1;

- span_avg = sd->span_weight * avg_idle;
+ span_avg = weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
@@ -6378,9 +6396,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
}
}

- if (has_idle_core)
- set_idle_cores(target, false);
-
if (sched_feat(SIS_PROP) && !has_idle_core) {
time = cpu_clock(this) - time;

@@ -6392,6 +6407,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool

update_avg(&this_sd->avg_scan_cost, time);
}
+out:
+ if (has_idle_core)
+ WRITE_ONCE(sds->has_idle_cores, 0);

return idle_cpu;
}
@@ -7904,6 +7922,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
continue;

detach_task(p, env);
+ update_overloaded_rq(env->src_rq);

/*
* Right now, this is only the second place where
@@ -8047,6 +8066,9 @@ static int detach_tasks(struct lb_env *env)
list_move(&p->se.group_node, tasks);
}

+ if (detached)
+ update_overloaded_rq(env->src_rq);
+
/*
* Right now, this is one of only two places we collect this stat
* so we can safely collect detach_one_task() stats here rather
@@ -8080,6 +8102,7 @@ static void attach_one_task(struct rq *rq, struct task_struct *p)
rq_lock(rq, &rf);
update_rq_clock(rq);
attach_task(rq, p);
+ update_overloaded_rq(rq);
rq_unlock(rq, &rf);
}

@@ -8090,20 +8113,22 @@ static void attach_one_task(struct rq *rq, struct task_struct *p)
static void attach_tasks(struct lb_env *env)
{
struct list_head *tasks = &env->tasks;
+ struct rq *rq = env->dst_rq;
struct task_struct *p;
struct rq_flags rf;

- rq_lock(env->dst_rq, &rf);
- update_rq_clock(env->dst_rq);
+ rq_lock(rq, &rf);
+ update_rq_clock(rq);

while (!list_empty(tasks)) {
p = list_first_entry(tasks, struct task_struct, se.group_node);
list_del_init(&p->se.group_node);

- attach_task(env->dst_rq, p);
+ attach_task(rq, p);
}

- rq_unlock(env->dst_rq, &rf);
+ update_overloaded_rq(rq);
+ rq_unlock(rq, &rf);
}

#ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ecb0d7052877..7b65c9046a75 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -433,6 +433,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
{
update_idle_core(rq);
+ update_overloaded_rq(rq);
schedstat_inc(rq->sched_goidle);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8dccb34eb190..d2b6e65cc336 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -997,6 +997,7 @@ struct rq {

unsigned char nohz_idle_balance;
unsigned char idle_balance;
+ unsigned char overloaded;

unsigned long misfit_task_load;

@@ -1830,8 +1831,11 @@ extern int sched_update_scaling(void);

extern void flush_smp_call_function_from_idle(void);

+extern void update_overloaded_rq(struct rq *rq);
+
#else /* !CONFIG_SMP: */
static inline void flush_smp_call_function_from_idle(void) { }
+static inline void update_overloaded_rq(struct rq *rq) { }
#endif

#include "stats.h"
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 810750e62118..6d5291875275 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1620,6 +1620,8 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+ atomic_set(&sd->shared->nr_overloaded_cpus, 0);
+ cpumask_clear(sdo_mask(sd->shared));
}

sd->private = sdd;
@@ -2085,7 +2087,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)

*per_cpu_ptr(sdd->sd, j) = sd;

- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
--
2.31.1



2022-05-10 03:26:24

by Josh Don

Subject: Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS

Hi Abel,

Overall this looks good, just a couple of comments.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bd299d67ab..79b4ff24faee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> - int i, cpu, idle_cpu = -1, nr = INT_MAX;
> + struct sched_domain_shared *sds = sd->shared;
> + int nr, nro, weight = sd->span_weight;
> + int i, cpu, idle_cpu = -1;
> struct rq *this_rq = this_rq();
> int this = smp_processor_id();
> struct sched_domain *this_sd;
> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> if (!this_sd)
> return -1;
>
> + nro = atomic_read(&sds->nr_overloaded_cpus);
> + if (nro == weight)
> + goto out;

This assumes that the sd we're operating on here is the LLC domain
(true for current use). Perhaps to catch future bugs from changing
this assumption, we could WARN_ON_ONCE(nro > weight).
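
I.e., something like this (just a sketch; it assumes sd is the LLC
domain and reuses the nro/weight locals and the out label from your
patch):

	nro = atomic_read(&sds->nr_overloaded_cpus);
	WARN_ON_ONCE(nro > weight);
	if (nro == weight)
		goto out;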

> +
> + nr = min_t(int, weight, p->nr_cpus_allowed);
> +
> + /*
> + * It's unlikely to find an idle cpu if the system is under
> + * heavy pressure, so skip searching to save a few cycles
> + * and relieve cache traffic.
> + */
> + if (weight - nro < (nr >> 4) && !has_idle_core)
> + return -1;

nit: nr / 16 is easier to read and the compiler will do the shifting for you.

Was "<" intentional vs "<="? With "<=" you'll be able to skip the search
in the case where both sides evaluate to 0 (which can happen frequently
if we have no idle cpus and a task with a small affinity mask).

This will also get a bit confused in the case where the task has many
cpus allowed, but almost all of them on a different LLC than the one
we're considering here. Apart from caching the per-LLC
nr_cpus_allowed, we could instead use cpumask_weight(cpus) below (and
only do this in the !has_idle_core case to reduce calls to
cpumask_weight()).
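
E.g., roughly (a sketch of the idea only, reusing the names from the
patch and the nr / 16 threshold discussed above):

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	if (!has_idle_core) {
		nr = min_t(int, nr, cpumask_weight(cpus));
		if (weight - nro <= nr / 16)
			return -1;
	}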

> +
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + if (nro > 1)
> + cpumask_andnot(cpus, cpus, sdo_mask(sds));

Just
if (nro)
?

> @@ -6392,6 +6407,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>
> update_avg(&this_sd->avg_scan_cost, time);
> }
> +out:
> + if (has_idle_core)
> + WRITE_ONCE(sds->has_idle_cores, 0);

nit: use set_idle_cores() instead (or, if you really want to avoid the
extra sds dereference, add a __set_idle_cores(sds, val) helper you can
call directly).
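
A minimal sketch of that refactor, assuming the current
set_idle_cores() in fair.c (which boils down to a WRITE_ONCE of
sds->has_idle_cores):

	static inline void __set_idle_cores(struct sched_domain_shared *sds,
					    int val)
	{
		WRITE_ONCE(sds->has_idle_cores, val);
	}

	static inline void set_idle_cores(int cpu, int val)
	{
		struct sched_domain_shared *sds;

		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
		if (sds)
			__set_idle_cores(sds, val);
	}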

> @@ -7904,6 +7922,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
> continue;
>
> detach_task(p, env);
> + update_overloaded_rq(env->src_rq);
>
> /*
> * Right now, this is only the second place where
> @@ -8047,6 +8066,9 @@ static int detach_tasks(struct lb_env *env)
> list_move(&p->se.group_node, tasks);
> }
>
> + if (detached)
> + update_overloaded_rq(env->src_rq);
> +

Thinking about this more, I don't see an issue with moving the
update_overloaded_rq() calls to enqueue/dequeue_task, rather than here
in the attach/detach_task paths. Overloaded state only changes when we
pass the boundary of 2 runnable non-idle tasks, so thrashing of the
overloaded mask is a lot less worrisome than if it were updated on the
boundary of 1 runnable task. The attach/detach_task paths run as part
of load balancing, which can be on a millisecond time scale.
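
Roughly (a sketch only; the exact hook points would need care around
throttling and the fair/idle class boundaries):

	/* at the tail of enqueue_task_fair() / dequeue_task_fair() */
	update_overloaded_rq(rq);

Since update_overloaded_rq() already returns early when the overloaded
state is unchanged, the cost on the common path should be small.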

Best,
Josh

2022-05-10 13:25:11

by Abel Wu

Subject: Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS

Hi Josh,

On 5/10/22 9:14 AM, Josh Don Wrote:
> Hi Abel,
>
> Overall this looks good, just a couple of comments.
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d4bd299d67ab..79b4ff24faee 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>> static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>> {
>> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>> - int i, cpu, idle_cpu = -1, nr = INT_MAX;
>> + struct sched_domain_shared *sds = sd->shared;
>> + int nr, nro, weight = sd->span_weight;
>> + int i, cpu, idle_cpu = -1;
>> struct rq *this_rq = this_rq();
>> int this = smp_processor_id();
>> struct sched_domain *this_sd;
>> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> if (!this_sd)
>> return -1;
>>
>> + nro = atomic_read(&sds->nr_overloaded_cpus);
>> + if (nro == weight)
>> + goto out;
>
> This assumes that the sd we're operating on here is the LLC domain
> (true for current use). Perhaps to catch future bugs from changing
> this assumption, we could WARN_ON_ONCE(nro > weight).

The @sds comes from sd->shared, so I don't think the condition will
break once we operate on other-level domains. But a quick check on
sds != NULL may be needed then, since domains can have no sds attached.
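
Something like (sketch):

	struct sched_domain_shared *sds = sd->shared;
	int nro = 0;

	if (sds)
		nro = atomic_read(&sds->nr_overloaded_cpus);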

>
>> +
>> + nr = min_t(int, weight, p->nr_cpus_allowed);
>> +
>> + /*
>> + * It's unlikely to find an idle cpu if the system is under
>> + * heavy pressure, so skip searching to save a few cycles
>> + * and relieve cache traffic.
>> + */
>> + if (weight - nro < (nr >> 4) && !has_idle_core)
>> + return -1;
>
> nit: nr / 16 is easier to read and the compiler will do the shifting for you.

Agreed.

>
> Was < intentional vs <= ? With <= you'll be able to skip the search in
> the case where both sides evaluate to 0 (can happen frequently if we
> have no idle cpus, and a task with a small affinity mask).

It's intentional; the idea is to unconditionally pass when there are
fewer than 16 cpus to search, in which case scalability is not an
issue. But I mistakenly assumed (weight - nro) couldn't be 0 here, so
it's not appropriate to use "<".
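
So the corrected bailout would look like (a sketch, also folding in
the nr / 16 style you suggested):

	if (weight - nro <= nr / 16 && !has_idle_core)
		return -1;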

BTW, I think Chen Yu's proposal[1] on search depth limitation is a
better idea and more reasonable. And he is doing some benchmark on
the mixture of our work.

[1]
https://lore.kernel.org/lkml/[email protected]/

>
> This will also get a bit confused in the case where the task has many
> cpus allowed, but almost all of them on a different LLC than the one
> we're considering here. Apart from caching the per-LLC
> nr_cpus_allowed, we could instead use cpumask_weight(cpus) below (and
> only do this in the !has_idle_core case to reduce calls to
> cpumask_weight()).

Yes, the task might have many cpus allowed on another LLC; the idea is
to use @nr as a worst-case boundary. And with Chen's work, I think we
can get rid of nr_cpus_allowed.

>
>> +
>> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> + if (nro > 1)
>> + cpumask_andnot(cpus, cpus, sdo_mask(sds));
>
> Just
> if (nro)
> ?

I think it's just not worth touching sdo_mask(sds), which causes heavy
cache traffic, if it only contains one cpu.

>
>> @@ -6392,6 +6407,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>
>> update_avg(&this_sd->avg_scan_cost, time);
>> }
>> +out:
>> + if (has_idle_core)
>> + WRITE_ONCE(sds->has_idle_cores, 0);
>
> nit: use set_idle_cores() instead (or, if you really want to avoid the
> extra sds dereference, add a __set_idle_cores(sds, val) helper you can
> call directly.

OK, will do.

>
>> @@ -7904,6 +7922,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
>> continue;
>>
>> detach_task(p, env);
>> + update_overloaded_rq(env->src_rq);
>>
>> /*
>> * Right now, this is only the second place where
>> @@ -8047,6 +8066,9 @@ static int detach_tasks(struct lb_env *env)
>> list_move(&p->se.group_node, tasks);
>> }
>>
>> + if (detached)
>> + update_overloaded_rq(env->src_rq);
>> +
>
> Thinking about this more, I don't see an issue with moving the
> update_overloaded_rq() calls to enqueue/dequeue_task, rather than here
> in the attach/detach_task paths. Overloaded state only changes when we
> pass the boundary of 2 runnable non-idle tasks, so thrashing of the
> overloaded mask is a lot less worrisome than if it were updated on the
> boundary of 1 runnable task. The attach/detach_task paths run as part
> of load balancing, which can be on a millisecond time scale.

It's really hard to say which one is better; I think it's more
workload-specific. It's common on our cloud servers that a long-running
workload co-exists with a short-running one, which could flip the
status frequently.

Thanks & BR,
Abel

2022-05-20 11:53:18

by Tim Chen

Subject: Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS

On Thu, 2022-05-05 at 20:23 +0800, Abel Wu wrote:
>
> [..snip..]
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..95c7ad1e05b5 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,8 +81,20 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> +
> + /*
> + * Tracking of the overloaded cpus can be heavy, so start
> + * a new cacheline to avoid false sharing.
> + */
> + atomic_t nr_overloaded_cpus ____cacheline_aligned;

Abel,

This is nice work. I have one comment.

The update and reading of nr_overloaded_cpus will incur cache-bouncing
cost. As far as I can tell, this counter is used to determine if we
should bail out from the search for an idle CPU when the system is
heavily loaded, and I hope we can avoid using an atomic counter in
these heavily used scheduler paths. The logic to filter overloaded
CPUs only needs the overloaded_cpus[] mask, not the nr_overloaded_cpus
counter.

So I recommend that you break out the logic of using the
nr_overloaded_cpus atomic counter to detect a heavily loaded LLC
into a second patch, so that it can be evaluated on its own merit.

That functionality overlaps with Chen Yu's patch to limit search depending
on load, so it will be easier to compare the two approaches if it is separated.

Otherwise, the logic in the patch to use overloaded_cpus[]
mask to filter out the overloaded cpus looks fine and complements
Chen Yu's patch.

Thanks.

Tim

> + unsigned long overloaded_cpus[]; /* Must be last */
> };
>
>


2022-05-21 18:54:32

by K Prateek Nayak

Subject: Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS

Hello Abel,

We tested this patch on our systems.

tl;dr

- We observed some regressions in schbench with 128 workers in
  NPS4 mode.
- tbench shows regression for 32 workers in NPS2 mode and 64 workers
  in NPS2 and NPS4 mode.
- Great improvements in schbench for low worker counts.
- Overall, the performance seems to be comparable to the tip.

Below are the detailed numbers for each benchmark.

On 5/5/2022 5:53 PM, Abel Wu wrote:
> Try to improve the search efficiency of SIS by filtering out the
> overloaded cpus; as a result, the more overloaded the system is,
> the fewer cpus will be searched.
>
> [..snip..]

Following are the results from testing on a dual socket Zen3 system
(2 x 64C/128T) in different NPS modes.

Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 sockets.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 sockets.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255

Kernel versions:
- tip: 5.18-rc1 tip sched/core
- Filter Overloaded: 5.18-rc1 tip sched/core + this patch

When we began testing, we recorded the tip at:

commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"

Following are the results from the benchmark:

Note: Results marked with * are data points of concern. A rerun
for the data point has been provided on both the tip and the
patched kernel to check for any run-to-run variation.

~~~~~~~~~
hackbench
~~~~~~~~~

NPS1

Test: tip Filter Overloaded
1-groups: 4.64 (0.00 pct) 4.74 (-2.15 pct)
2-groups: 5.38 (0.00 pct) 5.55 (-3.15 pct)
4-groups: 6.15 (0.00 pct) 6.20 (-0.81 pct)
8-groups: 7.42 (0.00 pct) 7.47 (-0.67 pct)
16-groups: 10.70 (0.00 pct) 10.60 (0.93 pct)

NPS2

Test: tip Filter Overloaded
1-groups: 4.70 (0.00 pct) 4.68 (0.42 pct)
2-groups: 5.45 (0.00 pct) 5.46 (-0.18 pct)
4-groups: 6.13 (0.00 pct) 6.11 (0.32 pct)
8-groups: 7.30 (0.00 pct) 7.23 (0.95 pct)
16-groups: 10.30 (0.00 pct) 10.38 (-0.77 pct)

NPS4

Test: tip Filter Overloaded
1-groups: 4.60 (0.00 pct) 4.66 (-1.30 pct)
2-groups: 5.41 (0.00 pct) 5.53 (-2.21 pct)
4-groups: 6.12 (0.00 pct) 6.16 (-0.65 pct)
8-groups: 7.22 (0.00 pct) 7.28 (-0.83 pct)
16-groups: 10.24 (0.00 pct) 10.26 (-0.19 pct)

~~~~~~~~
schbench
~~~~~~~~

NPS1

#workers: tip Filter Overloaded
1: 29.00 (0.00 pct) 29.00 (0.00 pct)
2: 28.00 (0.00 pct) 27.00 (3.57 pct)
4: 31.50 (0.00 pct) 33.00 (-4.76 pct)
8: 42.00 (0.00 pct) 42.50 (-1.19 pct)
16: 56.50 (0.00 pct) 56.00 (0.88 pct)
32: 94.50 (0.00 pct) 96.50 (-2.11 pct)
64: 176.00 (0.00 pct) 178.50 (-1.42 pct)
128: 404.00 (0.00 pct) 418.00 (-3.46 pct)
256: 869.00 (0.00 pct) 900.00 (-3.56 pct)
512: 58432.00 (0.00 pct) 56256.00 (3.72 pct)

NPS2

#workers: tip Filter Overloaded
1: 26.50 (0.00 pct) 14.00 (47.16 pct)
2: 26.50 (0.00 pct) 14.50 (45.28 pct)
4: 34.50 (0.00 pct) 18.00 (47.82 pct)
8: 45.00 (0.00 pct) 30.50 (32.22 pct)
16: 56.50 (0.00 pct) 57.00 (-0.88 pct)
32: 95.50 (0.00 pct) 94.00 (1.57 pct)
64: 179.00 (0.00 pct) 176.00 (1.67 pct)
128: 369.00 (0.00 pct) 411.00 (-11.38 pct) *
128: 400.60 (0.00 pct) 412.90 (-3.07 pct) [Verification Run]
256: 898.00 (0.00 pct) 850.00 (5.34 pct)
512: 56256.00 (0.00 pct) 59456.00 (-5.68 pct)

NPS4

#workers: tip Filter Overloaded
1: 25.00 (0.00 pct) 24.50 (2.00 pct)
2: 28.00 (0.00 pct) 24.00 (14.28 pct)
4: 29.50 (0.00 pct) 28.50 (3.38 pct)
8: 41.00 (0.00 pct) 36.50 (10.97 pct)
16: 65.50 (0.00 pct) 59.00 (9.92 pct)
32: 93.00 (0.00 pct) 95.50 (-2.68 pct)
64: 170.50 (0.00 pct) 182.00 (-6.74 pct) *
64: 175.00 (0.00 pct) 176.00 (-0.57 pct) [Verification Run]
128: 377.00 (0.00 pct) 409.50 (-8.62 pct) *
128: 372.00 (0.00 pct) 401.00 (-7.79 pct) [Verification Run]
256: 867.00 (0.00 pct) 940.00 (-8.41 pct) *
256: 925.00 (0.00 pct) 892.00 (+3.45 pct) [Verification Run]
512: 58048.00 (0.00 pct) 59456.00 (-2.42 pct)

~~~~~~
tbench
~~~~~~

NPS1

Clients: tip Filter Overloaded
1 443.31 (0.00 pct) 466.32 (5.19 pct)
2 877.32 (0.00 pct) 891.87 (1.65 pct)
4 1665.11 (0.00 pct) 1727.98 (3.77 pct)
8 3016.68 (0.00 pct) 3125.12 (3.59 pct)
16 5374.30 (0.00 pct) 5414.02 (0.73 pct)
32 8763.86 (0.00 pct) 8599.72 (-1.87 pct)
64 15786.93 (0.00 pct) 14095.47 (-10.71 pct) *
64 15441.33 (0.00 pct) 15148.00 (-1.89 pct) [Verification Run]
128 26826.08 (0.00 pct) 27837.07 (3.76 pct)
256 24207.35 (0.00 pct) 23769.48 (-1.80 pct)
512 51740.58 (0.00 pct) 53369.28 (3.14 pct)
1024 51177.82 (0.00 pct) 51928.06 (1.46 pct)

NPS2

Clients: tip Filter Overloaded
1 449.49 (0.00 pct) 464.65 (3.37 pct)
2 867.28 (0.00 pct) 898.85 (3.64 pct)
4 1643.60 (0.00 pct) 1691.53 (2.91 pct)
8 3047.35 (0.00 pct) 3010.65 (-1.20 pct)
16 5340.77 (0.00 pct) 5242.42 (-1.84 pct)
32 10536.85 (0.00 pct) 8978.74 (-14.78 pct) *
32 10417.46 (0.00 pct) 8375.55 (-19.60 pct) [Verification Run]
64 16543.23 (0.00 pct) 15357.35 (-7.16 pct) *
64 17101.56 (0.00 pct) 15465.73 (-9.56 pct) [Verification Run]
128 26400.40 (0.00 pct) 26637.87 (0.89 pct)
256 23436.75 (0.00 pct) 24324.08 (3.78 pct)
512 50902.60 (0.00 pct) 49159.14 (-3.42 pct)
1024 50216.10 (0.00 pct) 50218.74 (0.00 pct)

NPS4

Clients: tip Filter Overloaded
1 443.82 (0.00 pct) 458.65 (3.34 pct)
2 849.14 (0.00 pct) 883.79 (4.08 pct)
4 1603.26 (0.00 pct) 1658.89 (3.46 pct)
8 2972.37 (0.00 pct) 3087.76 (3.88 pct)
16 5277.13 (0.00 pct) 5472.11 (3.69 pct)
32 9744.73 (0.00 pct) 9666.67 (-0.80 pct)
64 15854.80 (0.00 pct) 13778.51 (-13.09 pct) *
64 15732.56 (0.00 pct) 14397.83 (-8.48 pct) [Verification Run]
128 26116.97 (0.00 pct) 25966.86 (-0.57 pct)
256 22403.25 (0.00 pct) 22634.04 (1.03 pct)
512 48317.20 (0.00 pct) 47123.73 (-2.47 pct)
1024 50445.41 (0.00 pct) 48934.56 (-2.99 pct)

Note: tbench results for 256 workers are known to have
a great amount of run-to-run variation on the test
machine. Any regression seen for that data point can
be safely ignored.

~~~~~~
Stream
~~~~~~

- 10 runs

NPS1

Test: tip Filter Overloaded
Copy: 189113.11 (0.00 pct) 184006.43 (-2.70 pct)
Scale: 201190.61 (0.00 pct) 197663.80 (-1.75 pct)
Add: 232654.21 (0.00 pct) 223226.88 (-4.05 pct)
Triad: 226583.57 (0.00 pct) 218920.27 (-3.38 pct)

NPS2

Test: tip Filter Overloaded
Copy: 155347.14 (0.00 pct) 150710.93 (-2.98 pct)
Scale: 191701.53 (0.00 pct) 181436.61 (-5.35 pct)
Add: 210013.97 (0.00 pct) 200397.89 (-4.57 pct)
Triad: 207602.00 (0.00 pct) 198247.25 (-4.50 pct)

NPS4

Test: tip Filter Overloaded
Copy: 136421.15 (0.00 pct) 137608.05 (0.87 pct)
Scale: 191217.59 (0.00 pct) 189674.77 (-0.80 pct)
Add: 189229.52 (0.00 pct) 188682.22 (-0.28 pct)
Triad: 188052.99 (0.00 pct) 188946.75 (0.47 pct)

- 100 runs

NPS1

Test: tip Filter Overloaded
Copy: 244693.32 (0.00 pct) 235089.48 (-3.92 pct)
Scale: 221874.99 (0.00 pct) 217716.94 (-1.87 pct)
Add: 268363.89 (0.00 pct) 266529.22 (-0.68 pct)
Triad: 260945.24 (0.00 pct) 252780.93 (-3.12 pct)

NPS2

Test: tip Filter Overloaded
Copy: 211262.00 (0.00 pct) 240922.15 (14.03 pct)
Scale: 222493.34 (0.00 pct) 220122.09 (-1.06 pct)
Add: 280277.17 (0.00 pct) 278002.19 (-0.81 pct)
Triad: 265860.49 (0.00 pct) 264231.43 (-0.61 pct)

NPS4

Test: tip Filter Overloaded
Copy: 250171.40 (0.00 pct) 243512.01 (-2.66 pct)
Scale: 222293.56 (0.00 pct) 224911.55 (1.17 pct)
Add: 279222.16 (0.00 pct) 280700.91 (0.52 pct)
Triad: 262013.92 (0.00 pct) 265729.44 (1.41 pct)

~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~

NPS1

sched-tip: 303718.33 (var: 1.31)
NUMA Bal: 309396.00 (var: 1.24) (+1.83%)

NPS2

sched-tip: 304536.33 (var: 2.46)
NUMA Bal: 305209.00 (var: 0.80) (+0.22%)

NPS4

sched-tip: 301192.33 (var: 1.81)
NUMA Bal: 304248.00 (var: 2.05) (+1.00%)

~~~~~
Notes
~~~~~

- schbench shows regression for 128 workers in NPS4 mode.
  The rerun shows stable results for both the tip and the
  patched kernel.
- tbench shows regression for 64 workers in NPS2 and NPS4
  mode. In NPS2 mode, the tip shows some run-to-run variance;
  however, the median of 10 runs reported is lower for the
  patched kernel.
- tbench shows regression for 32 workers in NPS2 mode. The
  tip seems to report stable values here, but there is a
  slight run-to-run variation observed in the patched kernel.

- Overall, the performance is comparable to that of the tip.
- The schbench improvements at low worker counts have the load
  balancer coming into the picture. We still have to do deeper
  analysis to see if and how this patch is helping.

I haven't run the mmtests as a part of this testing. I've made
a note of the configs you ran the numbers for, and I'll try to
get numbers for the same.

>
> v3:
> - removed sched-idle balance feature and focus on SIS
> - take non-CFS tasks into consideration
> - several fixes/improvement suggested by Josh Don
>
> v2:
> - several optimizations on sched-idle balancing
> - ignore asym topos in can_migrate_task
> - add more benchmarks including SIS efficiency
> - re-organize patch as suggested by Mel
>
> v1: https://lore.kernel.org/lkml/[email protected]/
> v2: https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> include/linux/sched/topology.h | 12 ++++++++++
> kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 43 +++++++++++++++++++++++++++-------
> kernel/sched/idle.c | 1 +
> kernel/sched/sched.h | 4 ++++
> kernel/sched/topology.c | 4 +++-
> 6 files changed, 92 insertions(+), 10 deletions(-)
>
> [..snip..]

Let me know if there is some additional data you would like
me to gather on the test machine.
--
Thanks and Regards,
Prateek

2022-05-23 06:19:39

by Abel Wu

Subject: Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS

Hi Prateek, thanks very much for your test!

On 5/20/22 2:48 PM, K Prateek Nayak Wrote:
> Hello Abel,
>
> We tested this patch on our systems.
>
> tl;dr
>
> - We observed some regressions in schbench with 128 workers in
>   NPS4 mode.
> - tbench shows regression for 32 workers in NPS2 mode and 64 workers
>   in NPS2 and NPS4 mode.
> - Great improvements in schbench for low worker counts.
> - Overall, the performance seems to be comparable to the tip.
>
> Following are the results from testing on a dual socket Zen3 system
> (2 x 64C/128T) in different NPS modes.
>
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 224-239
> Node 7: 112-127, 240-255

I remember you replied in Chen's patch that the number of CPUs in
the LLC domain is always 16 irrespective of the NPS mode. This
reminds me of something Josh was previously concerned about. The
piece of code below in SIS may hurt performance badly:

	if (weight - nro < (nr >> 4) && !has_idle_core)
		return -1;

where nr is no more than the LLC weight, which is 16 in your test.
With weight == 16, (nr >> 4) is at most 1, and the nro == weight
case has already bailed out earlier, so this condition never holds.
As a result, the following cpumask_andnot on the LLC-shared
sds->overloaded_cpus[] will almost always be reached, causing
heavier cache traffic that negates a lot of the benefit this
feature brings.

The early bailing out needs to be carefully re-designed, and luckily
Chen is working on that. Besides, I also have some major improvements
to this SIS filter under testing, and I will post v4 in one or two
weeks.

Thanks again for your test & best regards,
Abel

> [..snip..]

2022-05-23 07:48:18

by Abel Wu

Subject: Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS


On 5/20/22 6:16 AM, Tim Chen Wrote:
> On Thu, 2022-05-05 at 20:23 +0800, Abel Wu wrote:
>>
>> [..snip..]
>>
>> @@ -81,8 +81,20 @@ struct sched_domain_shared {
>> atomic_t ref;
>> atomic_t nr_busy_cpus;
>> int has_idle_cores;
>> +
>> + /*
>> + * Tracking of the overloaded cpus can be heavy, so start
>> + * a new cacheline to avoid false sharing.
>> + */
>> + atomic_t nr_overloaded_cpus ____cacheline_aligned;
>
> Abel,
>
> This is nice work. I have one comment.
>
> The update and reading of nr_overloaded_cpus will incur cache-bouncing
> cost. As far as I can tell, this counter is used to determine if we
> should bail out from the search for an idle CPU when the system is
> heavily loaded, and I hope we can avoid using an atomic counter in
> these heavily used scheduler paths. The logic to filter overloaded
> CPUs only needs the overloaded_cpus[] mask, not the nr_overloaded_cpus
> counter.
>
> So I recommend that you break out the logic of using the
> nr_overloaded_cpus atomic counter to detect a heavily loaded LLC
> into a second patch, so that it can be evaluated on its own merit.
>
> That functionality overlaps with Chen Yu's patch to limit search depending
> on load, so it will be easier to compare the two approaches if it is separated.
>
> Otherwise, the logic in the patch to use overloaded_cpus[]
> mask to filter out the overloaded cpus looks fine and complements
> Chen Yu's patch.

OK, will do. And ideally the nr_overloaded_cpus atomic can be replaced
by Chen's patch.

Thanks & BR
Abel