Subject: [RFC PATCH v5 0/4] scheduler: expose the topology of clusters and add cluster scheduler

The ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 CPUs. All clusters share the L3 cache data, while each cluster has
a local L3 tag. In addition, the CPUs within a cluster share an internal system
bus. This means cache access is much more affine inside one cluster than across
clusters.

+-----------------------------------+      +-----------+
|  +------+  +------+               |      |           |
|  | CPU0 |  | CPU1 |  +--------+   |      |           |
|  +------+  +------+  |        |   |      |           |
|                      |   L3   +---+------+           |
|  +------+  +------+  |  tag   |   |      |           |
|  | CPU2 |  | CPU3 |  |        |   |      |           |
|  +------+  +------+  +--------+   |      |           |
|            cluster 0              |      |           |
+-----------------------------------+      |           |
+-----------------------------------+      |    L3     |
|  +------+  +------+               |      |   data    |
|  | CPU4 |  | CPU5 |  +--------+   |      |           |
|  +------+  +------+  |        |   |      |           |
|                      |   L3   +---+------+           |
|  +------+  +------+  |  tag   |   |      |           |
|  | CPU6 |  | CPU7 |  |        |   |      |           |
|  +------+  +------+  +--------+   |      |           |
|            cluster 1              |      |           |
+-----------------------------------+      |           |
                 ...                       |           |
+-----------------------------------+      |           |
|  +------+  +------+               |      |           |
|  | CPU20|  | CPU21|  +--------+   |      |           |
|  +------+  +------+  |        |   |      |           |
|                      |   L3   +---+------+           |
|  +------+  +------+  |  tag   |   |      |           |
|  | CPU22|  | CPU23|  |        |   |      |           |
|  +------+  +------+  +--------+   |      |           |
|            cluster 5              |      |           |
+-----------------------------------+      +-----------+


There is a similar need for clustering on x86. Some x86 cores share an L2 cache,
which is similar to the cluster in Kunpeng 920 (e.g. on Jacobsville there are 6
clusters of 4 Atom cores, each cluster sharing a separate L2, and all 24 cores
sharing the L3).
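
Once patch 1 exposes the cluster topology, cluster membership can also be
inspected from userspace. Below is a minimal sketch, assuming the cluster_id
and cluster_cpus_list sysfs attributes this series adds under each CPU's
topology directory (see the cputopology.rst change; adjust the names if they
differ):

#!/usr/bin/python
# Print each CPU's cluster id and cluster siblings, assuming the
# topology attributes added by patch 1 of this series.
import glob

for topo in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology")):
    cpu = topo.split("/")[-2]
    try:
        with open(topo + "/cluster_id") as f:
            cid = f.read().strip()
        with open(topo + "/cluster_cpus_list") as f:
            siblings = f.read().strip()
    except IOError:
        continue  # attributes absent: kernel without this series
    print("%s: cluster_id=%s cluster_cpus_list=%s" % (cpu, cid, siblings))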

Having a sched_domain for clusters will bring two improvements:
1. spreading unrelated tasks among clusters, which decreases resource contention
and improves throughput.
Without a cluster sched_domain, unrelated tasks might be placed randomly:
+-------------------+ +-----------------+
| +----+ +----+ | | |
| |task| |task| | | |
| |1 | |2 | | | |
| +----+ +----+ | | |
| | | |
| cluster1 | | cluster2 |
+-------------------+ +-----------------+

but with a cluster sched_domain, they are likely to be spread by load balancing:
+-------------------+ +-----------------+
| +----+ | | +----+ |
| |task| | | |task| |
| |1 | | | |2 | |
| +----+ | | +----+ |
| | | |
| cluster1 | | cluster2 |
+-------------------+ +-----------------+

2. gathering related tasks within a cluster, which improves cache affinity for
tasks talking with each other.
Without a cluster sched_domain, related tasks might be placed randomly. Suppose
tasks 1-8 are related as below:
Task1 wakes up task4
Task2 wakes up task5
Task3 wakes up task6
Task7 wakes up task8
With the tuning of select_idle_cpu() to scan the local cluster first, those tasks
get a chance to be gathered like:
+---------------------------+ +----------------------+
| +----+ +-----+ | | +----+ +-----+ |
| |task| |task | | | |task| |task | |
| |1 | | 4 | | | |2 | |5 | |
| +----+ +-----+ | | +----+ +-----+ |
| | | |
| cluster1 | | cluster2 |
| | | |
| | | |
| +-----+ +------+ | | +-----+ +------+ |
| |task | | task | | | |task | |task | |
| |3 | | 6 | | | |7 | |8 | |
| +-----+ +------+ | | +-----+ +------+ |
+---------------------------+ +----------------------+
Otherwise, the result might be:
+---------------------------+ +----------------------+
| +----+ +-----+ | | +----+ +-----+ |
| |task| |task | | | |task| |task | |
| |1 | | 2 | | | |5 | |6 | |
| +----+ +-----+ | | +----+ +-----+ |
| | | |
| cluster1 | | cluster2 |
| | | |
| | | |
| +-----+ +------+ | | +-----+ +------+ |
| |task | | task | | | |task | |task | |
| |3 | | 4 | | | |7 | |8 | |
| +-----+ +------+ | | +-----+ +------+ |
+---------------------------+ +----------------------+
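
Either effect can be spot-checked at runtime. Below is a rough sketch that
starts two unrelated busy loops and samples which cluster each one lands on;
it assumes 4 CPUs per cluster as on Kunpeng 920, and reads the last-run CPU
from /proc/<pid>/stat:

#!/usr/bin/python
# Start two unrelated CPU hogs, then report the cluster each one is
# running on (assumes 4 cpus per cluster, as on Kunpeng 920).
import subprocess, time

hogs = [subprocess.Popen(["python", "-c", "while True: pass"])
        for _ in range(2)]
time.sleep(2)  # give load balancing a moment to settle
for h in hogs:
    with open("/proc/%d/stat" % h.pid) as f:
        cpu = int(f.read().split()[38])  # field 39: cpu last run on
    print("pid %d: cpu%d, cluster %d" % (h.pid, cpu, cpu // 4))
for h in hogs:
    h.kill()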

-v5:
* split "add scheduler level for clusters" into two patches to evaluate the
impact of spreading and gathering separately;
* added a tracepoint for select_idle_cpu() for debugging, and a bcc script
in the commit log;
* set cluster_id = -1 in reset_cpu_topology();
* rebased to tip/sched/core

-v4:
* rebased to tip/sched/core with the latest unified code of select_idle_cpu
* added Tim's patch for x86 Jacobsville
* also added benchmark data of spreading unrelated tasks
* avoided the iteration of sched_domain by moving to a static_key (addressing
Vincent's comment)
* used acpi_cpu_id for acpi_find_processor_node() (addressing Masa's comment)

Barry Song (2):
scheduler: add scheduler level for clusters
scheduler: scan idle cpu in cluster before scanning the whole llc

Jonathan Cameron (1):
topology: Represent clusters of CPUs within a die

Tim Chen (1):
scheduler: Add cluster scheduler level for x86

Documentation/admin-guide/cputopology.rst | 26 +++++++++++--
arch/arm64/Kconfig | 7 ++++
arch/arm64/kernel/topology.c | 2 +
arch/x86/Kconfig | 8 ++++
arch/x86/include/asm/smp.h | 7 ++++
arch/x86/include/asm/topology.h | 1 +
arch/x86/kernel/cpu/cacheinfo.c | 1 +
arch/x86/kernel/cpu/common.c | 3 ++
arch/x86/kernel/smpboot.c | 43 ++++++++++++++++++++-
drivers/acpi/pptt.c | 63 +++++++++++++++++++++++++++++++
drivers/base/arch_topology.c | 15 ++++++++
drivers/base/topology.c | 10 +++++
include/linux/acpi.h | 5 +++
include/linux/arch_topology.h | 5 +++
include/linux/sched/cluster.h | 19 ++++++++++
include/linux/sched/topology.h | 7 ++++
include/linux/topology.h | 13 +++++++
include/trace/events/sched.h | 22 +++++++++++
kernel/sched/core.c | 20 ++++++++++
kernel/sched/fair.c | 36 +++++++++++++++++-
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 5 +++
22 files changed, 313 insertions(+), 6 deletions(-)
create mode 100644 include/linux/sched/cluster.h

--
1.8.3.1


Subject: [RFC PATCH v5 2/4] scheduler: add scheduler level for clusters

The ARM64 chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 CPUs. All clusters share the L3 cache data, but each cluster
has a local L3 tag. In addition, the CPUs within a cluster share an
internal system bus. This means the cache coherence overhead inside one
cluster is much less than the overhead across clusters.

This patch adds the sched_domain for clusters. On Kunpeng 920, without
this patch, domain0 of cpu0 is MC, spanning cpu0~cpu23; with this patch,
MC becomes domain1 and a new domain0 "CLS" spans cpu0~cpu3.
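
The hierarchy can be confirmed with CONFIG_SCHED_DEBUG enabled. A quick
sketch, assuming the domain information is still exported under
/proc/sys/kernel/sched_domain/ on this tree (later trees move it to
/sys/kernel/debug/sched/domains/):

#!/usr/bin/python
# Dump the sched_domain names of cpu0; with this patch applied,
# domain0 should be CLS and domain1 MC.
import glob, os

for d in sorted(glob.glob("/proc/sys/kernel/sched_domain/cpu0/domain*")):
    with open(os.path.join(d, "name")) as f:
        print("%s: %s" % (os.path.basename(d), f.read().strip()))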

This helps spread unrelated tasks among clusters, which decreases
contention and improves throughput. For example, the stream benchmark
improves by 20%+ when parallelism is 6, and by around 5% when parallelism
is 12:

(1) -P <parallelism> 6
$ numactl -N 0 /usr/lib/lmbench/bin/stream -P 6 -M 1024M -N 5

w/o patch:
STREAM copy latency: 2.46 nanoseconds
STREAM copy bandwidth: 39096.28 MB/sec
STREAM scale latency: 2.46 nanoseconds
STREAM scale bandwidth: 38970.26 MB/sec
STREAM add latency: 4.45 nanoseconds
STREAM add bandwidth: 32332.04 MB/sec
STREAM triad latency: 4.07 nanoseconds
STREAM triad bandwidth: 35387.69 MB/sec

w/ patch:
STREAM copy latency: 2.02 nanoseconds
STREAM copy bandwidth: 47604.47 MB/sec +21.7%
STREAM scale latency: 2.04 nanoseconds
STREAM scale bandwidth: 47066.84 MB/sec +20.8%
STREAM add latency: 3.35 nanoseconds
STREAM add bandwidth: 42942.15 MB/sec +32.8%
STREAM triad latency: 3.16 nanoseconds
STREAM triad bandwidth: 45619.18 MB/sec +28.9%

On the other hand, without the patch the stream results vary significantly
from run to run, e.g.:
a.
STREAM copy latency: 2.16 nanoseconds
STREAM copy bandwidth: 44448.45 MB/sec
STREAM scale latency: 2.17 nanoseconds
STREAM scale bandwidth: 44320.77 MB/sec
STREAM add latency: 3.77 nanoseconds
STREAM add bandwidth: 38230.54 MB/sec
STREAM triad latency: 3.88 nanoseconds
STREAM triad bandwidth: 37072.10 MB/sec

b.
STREAM copy latency: 2.16 nanoseconds
STREAM copy bandwidth: 44403.22 MB/sec
STREAM scale latency: 2.39 nanoseconds
STREAM scale bandwidth: 40173.69 MB/sec
STREAM add latency: 3.77 nanoseconds
STREAM add bandwidth: 38232.56 MB/sec
STREAM triad latency: 3.38 nanoseconds
STREAM triad bandwidth: 42592.04 MB/sec

This is clearly because the 6 threads are placed randomly. Sometimes they
are packed within clusters, sometimes they are spread widely.

(2) -P <parallelism> 12
$ numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5

w/o patch:
STREAM copy latency: 3.37 nanoseconds
STREAM copy bandwidth: 57008.80 MB/sec
STREAM scale latency: 3.38 nanoseconds
STREAM scale bandwidth: 56848.47 MB/sec
STREAM add latency: 5.50 nanoseconds
STREAM add bandwidth: 52398.62 MB/sec
STREAM triad latency: 5.09 nanoseconds
STREAM triad bandwidth: 56591.60 MB/sec

w/ patch:
STREAM copy latency: 3.24 nanoseconds
STREAM copy bandwidth: 59338.60 MB/sec +4.1%
STREAM scale latency: 3.25 nanoseconds
STREAM scale bandwidth: 58993.23 MB/sec +3.7%
STREAM add latency: 5.19 nanoseconds
STREAM add bandwidth: 55517.45 MB/sec +5.9%
STREAM triad latency: 4.86 nanoseconds
STREAM triad bandwidth: 59245.34 MB/sec +4.7%

To evaluate the performance impact on related tasks talking with each
other, we run the hackbench command below with the -g parameter varying
from 2 to 14; for each g, we run the command 10 times and take the
average time:
$ numactl -N 0 hackbench -p -T -l 20000 -g $1

hackbench reports the time needed to complete a certain number of message
transmissions between a certain number of tasks, for example:
$ numactl -N 0 hackbench -p -T -l 20000 -g 10
Running in threaded mode with 10 groups using 40 file descriptors each
(== 400 tasks)
Each sender will pass 20000 messages of 100 bytes
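
The 10 runs per g can be scripted; a minimal sketch, assuming hackbench
and numactl are in PATH:

#!/usr/bin/python
# Run hackbench 10 times for each group count and print the average
# of the reported "Time:" values, matching the methodology above.
from __future__ import print_function
import re, subprocess

for g in range(2, 16, 2):
    times = []
    for _ in range(10):
        out = subprocess.check_output(
            ["numactl", "-N", "0", "hackbench",
             "-p", "-T", "-l", "20000", "-g", str(g)])
        times.append(float(re.search(r"Time:\s*([\d.]+)",
                                     out.decode()).group(1)))
    print("g=%2d avg=%.4f" % (g, sum(times) / len(times)))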

Below is the result of hackbench w/ and w/o the patch:
g= 2 4 6 8 10 12 14
w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ : 1.8396 3.8250 5.4780 7.3442 9.0172 10.5950 11.9113

Obviously this patch doesn't impact hackbench too much.

Signed-off-by: Barry Song <[email protected]>
---
arch/arm64/Kconfig | 7 +++++++
include/linux/sched/cluster.h | 19 +++++++++++++++++++
include/linux/sched/topology.h | 7 +++++++
include/linux/topology.h | 7 +++++++
kernel/sched/core.c | 20 ++++++++++++++++++++
kernel/sched/fair.c | 4 ++++
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 5 +++++
8 files changed, 70 insertions(+)
create mode 100644 include/linux/sched/cluster.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1f212b4..9432a30 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -977,6 +977,13 @@ config SCHED_MC
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.

+config SCHED_CLUSTER
+ bool "Cluster scheduler support"
+ help
+ Cluster scheduler support improves the CPU scheduler's decision
+ making when dealing with machines that have clusters (CPUs sharing an
+ internal system bus or LLC cache tag). If unsure say N here.
+
config SCHED_SMT
bool "SMT scheduler support"
help
diff --git a/include/linux/sched/cluster.h b/include/linux/sched/cluster.h
new file mode 100644
index 0000000..ea6c475
--- /dev/null
+++ b/include/linux/sched/cluster.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SCHED_CLUSTER_H
+#define _LINUX_SCHED_CLUSTER_H
+
+#include <linux/static_key.h>
+
+#ifdef CONFIG_SCHED_CLUSTER
+extern struct static_key_false sched_cluster_present;
+
+static __always_inline bool sched_cluster_active(void)
+{
+ return static_branch_likely(&sched_cluster_present);
+}
+#else
+static inline bool sched_cluster_active(void) { return false; }
+
+#endif
+
+#endif
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778..2f9166f 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
}
#endif

+#ifdef CONFIG_SCHED_CLUSTER
+static inline int cpu_cluster_flags(void)
+{
+ return SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
#ifdef CONFIG_SCHED_MC
static inline int cpu_core_flags(void)
{
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 80d27d7..0b3704a 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -212,6 +212,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
}
#endif

+#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask)
+static inline const struct cpumask *cpu_cluster_mask(int cpu)
+{
+ return topology_cluster_cpumask(cpu);
+}
+#endif
+
static inline const struct cpumask *cpu_cpu_mask(int cpu)
{
return cpumask_of_node(cpu_to_node(cpu));
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 28c4df6..19e2536 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7840,6 +7840,17 @@ int sched_cpu_activate(unsigned int cpu)
if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
static_branch_inc_cpuslocked(&sched_smt_present);
#endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+ /*
+ * When going up, increment the number of cpus with cluster
+ * present.
+ */
+ if (cpumask_weight(cpu_cluster_mask(cpu)) > cpumask_weight(cpu_smt_mask(cpu)) &&
+ cpumask_weight(cpu_cluster_mask(cpu)) < cpumask_weight(cpu_coregroup_mask(cpu)))
+ static_branch_inc_cpuslocked(&sched_cluster_present);
+#endif
+
set_cpu_active(cpu, true);

if (sched_smp_initialized) {
@@ -7916,6 +7927,15 @@ int sched_cpu_deactivate(unsigned int cpu)
static_branch_dec_cpuslocked(&sched_smt_present);
#endif

+#ifdef CONFIG_SCHED_CLUSTER
+ /*
+ * When going down, decrement the number of cpus with cluster present.
+ */
+ if (cpumask_weight(cpu_cluster_mask(cpu)) > cpumask_weight(cpu_smt_mask(cpu)) &&
+ cpumask_weight(cpu_cluster_mask(cpu)) < cpumask_weight(cpu_coregroup_mask(cpu)))
+ static_branch_dec_cpuslocked(&sched_cluster_present);
+#endif
+
if (!sched_smp_initialized)
return 0;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e2ab1e..c92ad9f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6021,6 +6021,10 @@ static inline int __select_idle_cpu(int cpu)
return -1;
}

+#ifdef CONFIG_SCHED_CLUSTER
+DEFINE_STATIC_KEY_FALSE(sched_cluster_present);
+#endif
+
#ifdef CONFIG_SCHED_SMT
DEFINE_STATIC_KEY_FALSE(sched_smt_present);
EXPORT_SYMBOL_GPL(sched_smt_present);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2e09a6..73f7406 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@

#include <linux/sched/autogroup.h>
#include <linux/sched/clock.h>
+#include <linux/sched/cluster.h>
#include <linux/sched/coredump.h>
#include <linux/sched/cpufreq.h>
#include <linux/sched/cputime.h>
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 12f8058..ae1fa00 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1511,6 +1511,11 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+ { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+
#ifdef CONFIG_SCHED_MC
{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
--
1.8.3.1

Subject: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before scanning the whole llc

On Kunpeng 920, CPUs within one cluster can communicate with each other
much faster than CPUs across different clusters. A simple hackbench run
can prove that.
hackbench running on 4 CPUs within a single cluster versus 4 CPUs spread
across different clusters shows a large contrast:
(1) within a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 4.285

(2) across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 5.524

This inspires us to change the wake_affine path to scan the cluster
before scanning the whole LLC, to try to gather related tasks in one
cluster, which is what this patch does.

To evaluate the performance impact on related tasks talking with each
other, we run the hackbench command below with the -g parameter varying
from 2 to 14; for each g, we run the command 10 times and take the
average time:
$ numactl -N 0 hackbench -p -T -l 20000 -g $1

hackbench reports the time needed to complete a certain number of message
transmissions between a certain number of tasks, for example:
$ numactl -N 0 hackbench -p -T -l 20000 -g 10
Running in threaded mode with 10 groups using 40 file descriptors each
(== 400 tasks)
Each sender will pass 20000 messages of 100 bytes

Below is the result of hackbench w/ and w/o the cluster patch:
g= 2 4 6 8 10 12 14
w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ : 1.7881 3.7371 5.3301 6.9747 8.6909 9.9235 11.2608

Some recent commits have already improved hackbench, so the change in the
wake_affine path brings a smaller improvement on hackbench than what we
got in RFC v4.
It is also much trickier to leverage wake_affine than to leverage the
spreading of tasks in the previous patch, as load balancing might pull
apart tasks that have been packed into a cluster, so alternative
suggestions are welcome.

In order to figure out how many times a CPU is picked from within the
cluster and how many times from outside it, this patch adds a tracepoint
for debugging, along with a userspace bcc script to print a histogram of
the select_idle_cpu() results:
#!/usr/bin/python
#
# selectidlecpu.py select idle cpu histogram.
#
# A Ctrl-C will print the gathered histogram then exit.
#
# 18-March-2021 Barry Song Created this.

from __future__ import print_function
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""

BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
{
    u32 e;
    if (args->idle / 4 == args->target / 4)
        e = 0; /* idle cpu from same cluster */
    else if (args->idle != -1)
        e = 1; /* idle cpu from different clusters */
    else
        e = 2; /* no idle cpu */

    dist.increment(e);
    return 0;
}
""")

# header
print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C
try:
    sleep(99999999)
except KeyboardInterrupt:
    print()

# output

print("\nlinear histogram")
print("~~~~~~~~~~~~~~~~")
b["dist"].print_linear_hist("idle")

Even at g=14, when the system is quite busy, we can see there are still
cases where an idle CPU is picked from the local cluster:
linear histogram
~~~~~~~~~~~~~~~~
idle : count distribution
0 : 15234281 |*********** |
1 : 18494 | |
2 : 53066152 |****************************************|

0: local cluster
1: out of the cluster
2: select_idle_cpu() returns -1

Signed-off-by: Barry Song <[email protected]>
---
include/trace/events/sched.h | 22 ++++++++++++++++++++++
kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++++-
2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index cbe3e15..86608cf 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -136,6 +136,28 @@
);

/*
+ * Tracepoint for select_idle_cpu:
+ */
+TRACE_EVENT(sched_select_idle_cpu,
+
+ TP_PROTO(int target, int idle),
+
+ TP_ARGS(target, idle),
+
+ TP_STRUCT__entry(
+ __field( int, target )
+ __field( int, idle )
+ ),
+
+ TP_fast_assign(
+ __entry->target = target;
+ __entry->idle = idle;
+ ),
+
+ TP_printk("target=%d idle=%d", __entry->target, __entry->idle)
+);
+
+/*
* Tracepoint for waking up a task:
*/
DECLARE_EVENT_CLASS(sched_wakeup_template,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c92ad9f2..3892d42 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6150,7 +6150,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (!this_sd)
return -1;

- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ if (!sched_cluster_active())
+ cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+#ifdef CONFIG_SCHED_CLUSTER
+ if (sched_cluster_active())
+ cpumask_and(cpus, cpu_cluster_mask(target), p->cpus_ptr);
+#endif

if (sched_feat(SIS_PROP) && !smt) {
u64 avg_cost, avg_idle, span_avg;
@@ -6171,6 +6176,29 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
time = cpu_clock(this);
}

+#ifdef CONFIG_SCHED_CLUSTER
+ if (sched_cluster_active()) {
+ for_each_cpu_wrap(cpu, cpus, target) {
+ if (smt) {
+ i = select_idle_core(p, cpu, cpus, &idle_cpu);
+ if ((unsigned int)i < nr_cpumask_bits)
+ return i;
+
+ } else {
+ if (!--nr)
+ return -1;
+ idle_cpu = __select_idle_cpu(cpu);
+ if ((unsigned int)idle_cpu < nr_cpumask_bits) {
+ goto done;
+ }
+ }
+ }
+
+ cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ cpumask_andnot(cpus, cpus, cpu_cluster_mask(target));
+ }
+#endif
+
for_each_cpu_wrap(cpu, cpus, target) {
if (smt) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6186,6 +6214,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
}
}

+done:
if (smt)
set_idle_cores(this, false);

@@ -6324,6 +6353,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return target;

i = select_idle_cpu(p, sd, target);
+ trace_sched_select_idle_cpu(target, i);
if ((unsigned)i < nr_cpumask_bits)
return i;

--
1.8.3.1

Subject: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86

From: Tim Chen <[email protected]>

There are x86 CPU architectures (e.g. Jacobsville) where the L2 cache
is shared among a cluster of cores instead of being exclusive
to a single core.

To prevent oversubscription of L2 cache, load should be
balanced between such L2 clusters, especially for tasks with
no shared data.

Also, with a cluster scheduling policy where tasks are woken up
in the same L2 cluster, we will benefit from keeping tasks
related to each other, and likely sharing data, in the same L2
cluster.

Add CPU masks of CPUs sharing the L2 cache so we can build such
L2 cluster scheduler domain.

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
arch/x86/Kconfig | 8 ++++++++
arch/x86/include/asm/smp.h | 7 +++++++
arch/x86/include/asm/topology.h | 1 +
arch/x86/kernel/cpu/cacheinfo.c | 1 +
arch/x86/kernel/cpu/common.c | 3 +++
arch/x86/kernel/smpboot.c | 43 ++++++++++++++++++++++++++++++++++++++++-
6 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879..d597de2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1002,6 +1002,14 @@ config NR_CPUS
This is purely to save memory: each supported CPU adds about 8KB
to the kernel image.

+config SCHED_CLUSTER
+ bool "Cluster scheduler support"
+ default n
+ help
+ Cluster scheduler support improves the CPU scheduler's decision
+ making when dealing with machines that have clusters of CPUs
+ sharing L2 cache. If unsure say N here.
+
config SCHED_SMT
def_bool y if SMP

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index c0538f8..9cbc4ae 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -16,7 +16,9 @@
DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
/* cpus sharing the last level cache: */
DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
+DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);

static inline struct cpumask *cpu_llc_shared_mask(int cpu)
@@ -24,6 +26,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
return per_cpu(cpu_llc_shared_map, cpu);
}

+static inline struct cpumask *cpu_l2c_shared_mask(int cpu)
+{
+ return per_cpu(cpu_l2c_shared_map, cpu);
+}
+
DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
DECLARE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid);
DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9239399..2a11ccc 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -103,6 +103,7 @@ static inline void setup_node_to_cpumask_map(void) { }
#include <asm-generic/topology.h>

extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_clustergroup_mask(int cpu);

#define topology_logical_package_id(cpu) (cpu_data(cpu).logical_proc_id)
#define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id)
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index 3ca9be4..0d03a71 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -846,6 +846,7 @@ void init_intel_cacheinfo(struct cpuinfo_x86 *c)
l2 = new_l2;
#ifdef CONFIG_SMP
per_cpu(cpu_llc_id, cpu) = l2_id;
+ per_cpu(cpu_l2c_id, cpu) = l2_id;
#endif
}

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ab640ab..0ba282d 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -78,6 +78,9 @@
/* Last level cache ID of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;

+/* L2 cache ID of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id) = BAD_APICID;
+
/* correctly size the local cpu masks */
void __init setup_cpu_local_masks(void)
{
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 02813a7..c85ffa8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,8 @@

DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);

+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
+
/* Per CPU bogomips and other parameters */
DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
EXPORT_PER_CPU_SYMBOL(cpu_info);
@@ -501,6 +503,21 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
return topology_sane(c, o, "llc");
}

+static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+ int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+ /* Do not match if we do not have a valid APICID for cpu: */
+ if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
+ return false;
+
+ /* Do not match if L2 cache id does not match: */
+ if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))
+ return false;
+
+ return topology_sane(c, o, "l2c");
+}
+
/*
* Unlike the other levels, we do not enforce keeping a
* multicore group inside a NUMA node. If this happens, we will
@@ -522,7 +539,7 @@ static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
}


-#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC)
+#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_CLUSTER) || defined(CONFIG_SCHED_MC)
static inline int x86_sched_itmt_flags(void)
{
return sysctl_sched_itmt_enabled ? SD_ASYM_PACKING : 0;
@@ -540,12 +557,21 @@ static int x86_smt_flags(void)
return cpu_smt_flags() | x86_sched_itmt_flags();
}
#endif
+#ifdef CONFIG_SCHED_CLUSTER
+static int x86_cluster_flags(void)
+{
+ return cpu_cluster_flags() | x86_sched_itmt_flags();
+}
+#endif
#endif

static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
#endif
+#ifdef CONFIG_SCHED_CLUSTER
+ { cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
#ifdef CONFIG_SCHED_MC
{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
#endif
@@ -556,6 +582,9 @@ static int x86_smt_flags(void)
#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
#endif
+#ifdef CONFIG_SCHED_CLUSTER
+ { cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
#ifdef CONFIG_SCHED_MC
{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
#endif
@@ -583,6 +612,7 @@ void set_cpu_sibling_map(int cpu)
if (!has_mp) {
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+ cpumask_set_cpu(cpu, cpu_l2c_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
cpumask_set_cpu(cpu, topology_die_cpumask(cpu));
c->booted_cores = 1;
@@ -598,6 +628,8 @@ void set_cpu_sibling_map(int cpu)
if ((i == cpu) || (has_mp && match_llc(c, o)))
link_mask(cpu_llc_shared_mask, cpu, i);

+ if ((i == cpu) || (has_mp && match_l2c(c, o)))
+ link_mask(cpu_l2c_shared_mask, cpu, i);
}

/*
@@ -649,6 +681,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
return cpu_llc_shared_mask(cpu);
}

+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+ return cpu_l2c_shared_mask(cpu);
+}
+
static void impress_friends(void)
{
int cpu;
@@ -1332,6 +1369,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
zalloc_cpumask_var(&per_cpu(cpu_die_map, i), GFP_KERNEL);
zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+ zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
}

/*
@@ -1556,7 +1594,10 @@ static void remove_siblinginfo(int cpu)
cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+ for_each_cpu(sibling, cpu_l2c_shared_mask(cpu))
+ cpumask_clear_cpu(cpu, cpu_l2c_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
+ cpumask_clear(cpu_l2c_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
cpumask_clear(topology_core_cpumask(cpu));
cpumask_clear(topology_die_cpumask(cpu));
--
1.8.3.1

Subject: RE: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before scanning the whole llc



> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Friday, March 19, 2021 5:16 PM
> To: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; Jonathan Cameron <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> xuwei (O) <[email protected]>; Zengtao (B) <[email protected]>;
> [email protected]; yangyicong <[email protected]>; Liguozhu (Kenneth)
> <[email protected]>; [email protected]; [email protected]; Song Bao Hua
> (Barry Song) <[email protected]>
> Subject: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before scanning
> the whole llc
>
> On Kunpeng 920, cpus within one cluster can communicate with each other
> much faster than cpus across different clusters. A simple hackbench
> can prove that.
> hackbench running on 4 cpus in single one cluster and 4 cpus in
> different clusters shows a large contrast:
> (1) within a cluster:
> root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
> Running in threaded mode with 1 groups using 40 file descriptors each
> (== 40 tasks)
> Each sender will pass 20000 messages of 100 bytes
> Time: 4.285
>
> (2) across clusters:
> root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
> Running in threaded mode with 1 groups using 40 file descriptors each
> (== 40 tasks)
> Each sender will pass 20000 messages of 100 bytes
> Time: 5.524
>
> This inspires us to change the wake_affine path to scan cluster before
> scanning the whole LLC to try to gather related tasks in one cluster,
> which is done by this patch.
>
> To evaluate the performance impact to related tasks talking with each
> other, we run the below hackbench with different -g parameter from 2
> to 14, for each different g, we run the command 10 times and get the
> average time:
> $ numactl -N 0 hackbench -p -T -l 20000 -g $1
>
> hackbench will report the time which is needed to complete a certain number
> of messages transmissions between a certain number of tasks, for example:
> $ numactl -N 0 hackbench -p -T -l 20000 -g 10
> Running in threaded mode with 10 groups using 40 file descriptors each
> (== 400 tasks)
> Each sender will pass 20000 messages of 100 bytes
>
> The below is the result of hackbench w/ and w/o cluster patch:
> g= 2 4 6 8 10 12 14
> w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
> w/ : 1.7881 3.7371 5.3301 6.9747 8.6909 9.9235 11.2608
>
> Obviously some recent commits have improved the hackbench. So the change
> in wake_affine path brings less increase on hackbench compared to what
> we got in RFC v4.
> And obviously it is much more tricky to leverage wake_affine compared to
> leveraging the scatter of tasks in the previous patch as load balance
> might pull tasks which have been compact in a cluster so alternative
> suggestions welcome.
>
> In order to figure out how many times cpu is picked from the cluster and
> how many times cpu is picked out of the cluster, a tracepoint for debug
> purpose is added in this patch. And an userspace bcc script to print the
> histogram of the result of select_idle_cpu():
> #!/usr/bin/python
> #
> # selectidlecpu.py select idle cpu histogram.
> #
> # A Ctrl-C will print the gathered histogram then exit.
> #
> # 18-March-2021 Barry Song Created this.
>
> from __future__ import print_function
> from bcc import BPF
> from time import sleep
>
> # load BPF program
> b = BPF(text="""
>
> BPF_HISTOGRAM(dist);
>
> TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
> {
> u32 e;
> if (args->idle / 4 == args->target/4)
> e = 0; /* idle cpu from same cluster */

Oops here: as -1/4 = 1/4 = 2/4 = 3/4 = 0 in integer division,
a part of the -1 (no idle cpu) results is counted here (local cluster) incorrectly.

> else if (args->idle != -1)
> e = 1; /* idle cpu from different clusters */
> else
> e = 2; /* no idle cpu */
>
> dist.increment(e);
> return 0;
> }
> """)

Fixed it to:

TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
{
    u32 e;
    if (args->idle == -1)
        e = 2; /* no idle cpu */
    else if (args->idle / 4 == args->target / 4)
        e = 0; /* idle cpu from same cluster */
    else
        e = 1; /* idle cpu from different clusters */

    dist.increment(e);
    return 0;
}

>
> # header
> print("Tracing... Hit Ctrl-C to end.")
>
> # trace until Ctrl-C
> try:
>     sleep(99999999)
> except KeyboardInterrupt:
>     print()
>
> # output
>
> print("\nlinear histogram")
> print("~~~~~~~~~~~~~~~~")
> b["dist"].print_linear_hist("idle")
>
> Even while g=14 and the system is quite busy, we can see there are some
> chances idle cpu is picked from local cluster:
> linear histogram
> ~~~~~~~~~~~~~~
> idle : count distribution
> 0 : 15234281 |*********** |
> 1 : 18494 | |
> 2 : 53066152 |****************************************|
>
> 0: local cluster
> 1: out of the cluster
> 2: select_idle_cpu() returns -1

After fixing the script, the new histogram is like:
linear histogram
~~~~~~~~~~~~~~~~
idle : count distribution
0 : 2765930 |* |
1 : 68934 | |
2 : 77667475 |****************************************|

We get:
Local cluster: 3.4358%
Out of cluster: 0.0856%
-1 (no idle cpu before nr reaches 0): 96.4785%

>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/trace/events/sched.h | 22 ++++++++++++++++++++++
> kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++++-
> 2 files changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index cbe3e15..86608cf 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -136,6 +136,28 @@
> );
>
> /*
> + * Tracepoint for select_idle_cpu:
> + */
> +TRACE_EVENT(sched_select_idle_cpu,
> +
> + TP_PROTO(int target, int idle),
> +
> + TP_ARGS(target, idle),
> +
> + TP_STRUCT__entry(
> + __field( int, target )
> + __field( int, idle )
> + ),
> +
> + TP_fast_assign(
> + __entry->target = target;
> + __entry->idle = idle;
> + ),
> +
> + TP_printk("target=%d idle=%d", __entry->target, __entry->idle)
> +);
> +
> +/*
> * Tracepoint for waking up a task:
> */
> DECLARE_EVENT_CLASS(sched_wakeup_template,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c92ad9f2..3892d42 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6150,7 +6150,12 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, int t
> if (!this_sd)
> return -1;
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + if (!sched_cluster_active())
> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +#ifdef CONFIG_SCHED_CLUSTER
> + if (sched_cluster_active())
> + cpumask_and(cpus, cpu_cluster_mask(target), p->cpus_ptr);
> +#endif
>
> if (sched_feat(SIS_PROP) && !smt) {
> u64 avg_cost, avg_idle, span_avg;
> @@ -6171,6 +6176,29 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, int t
> time = cpu_clock(this);
> }
>
> +#ifdef CONFIG_SCHED_CLUSTER
> + if (sched_cluster_active()) {
> + for_each_cpu_wrap(cpu, cpus, target) {
> + if (smt) {
> + i = select_idle_core(p, cpu, cpus, &idle_cpu);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> +
> + } else {
> + if (!--nr)
> + return -1;
> + idle_cpu = __select_idle_cpu(cpu);
> + if ((unsigned int)idle_cpu < nr_cpumask_bits) {
> + goto done;
> + }
> + }

BTW, if I return -1 here directly and don't fall back to the LLC scan, I
get an even better benchmark result:

g= 2 4 6 8 10 12 14
w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ : 1.7881 3.7371 5.3301 6.9747 8.6909 9.9235 11.2608
return -1: 1.8324 3.6140 5.1029 6.5016 8.1867 9.7559 10.7716

So it seems the wake-up path change can bring a real and significant
impact much more readily than the previous 2/4 patch, in which we only
spread tasks.

> + }
> +
> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + cpumask_andnot(cpus, cpus, cpu_cluster_mask(target));
> + }
> +#endif
> +
> for_each_cpu_wrap(cpu, cpus, target) {
> if (smt) {
> i = select_idle_core(p, cpu, cpus, &idle_cpu);
> @@ -6186,6 +6214,7 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, int t
> }
> }
>
> +done:
> if (smt)
> set_idle_cores(this, false);
>
> @@ -6324,6 +6353,7 @@ static int select_idle_sibling(struct task_struct *p,
> int prev, int target)
> return target;
>
> i = select_idle_cpu(p, sd, target);
> + trace_sched_select_idle_cpu(target, i);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> --
> 1.8.3.1

Thanks
Barry

Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86



On 3/18/21 9:16 PM, Barry Song wrote:
> From: Tim Chen <[email protected]>
>
> There are x86 CPU architectures (e.g. Jacobsville) where the L2 cache
> is shared among a cluster of cores instead of being exclusive
> to one single core.
>
> To prevent oversubscription of L2 cache, load should be
> balanced between such L2 clusters, especially for tasks with
> no shared data.
>
> Also with cluster scheduling policy where tasks are woken up
> in the same L2 cluster, we will benefit from keeping tasks
> related to each other and likely sharing data in the same L2
> cluster.
>
> Add CPU masks of CPUs sharing the L2 cache so we can build such
> L2 cluster scheduler domain.
>
> Signed-off-by: Tim Chen <[email protected]>
> Signed-off-by: Barry Song <[email protected]>


Barry,

Can you also add this chunk to the patch.
Thanks.

Tim


diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 2a11ccc14fb1..800fa48c9fcd 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -115,6 +115,7 @@ extern unsigned int __max_die_per_package;

#ifdef CONFIG_SMP
#define topology_die_cpumask(cpu) (per_cpu(cpu_die_map, cpu))
+#define topology_cluster_cpumask(cpu) (cpu_clustergroup_mask(cpu))
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))

Subject: RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86



> -----Original Message-----
> From: Tim Chen [mailto:[email protected]]
> Sent: Wednesday, March 24, 2021 11:51 AM
> To: Song Bao Hua (Barry Song) <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; Jonathan Cameron <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> xuwei (O) <[email protected]>; Zengtao (B) <[email protected]>;
> [email protected]; yangyicong <[email protected]>; Liguozhu (Kenneth)
> <[email protected]>; [email protected]; [email protected]
> Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86
>
>
>
> On 3/18/21 9:16 PM, Barry Song wrote:
> > From: Tim Chen <[email protected]>
> >
> > There are x86 CPU architectures (e.g. Jacobsville) where the L2 cache
> > is shared among a cluster of cores instead of being exclusive
> > to one single core.
> >
> > To prevent oversubscription of L2 cache, load should be
> > balanced between such L2 clusters, especially for tasks with
> > no shared data.
> >
> > Also with cluster scheduling policy where tasks are woken up
> > in the same L2 cluster, we will benefit from keeping tasks
> > related to each other and likely sharing data in the same L2
> > cluster.
> >
> > Add CPU masks of CPUs sharing the L2 cache so we can build such
> > L2 cluster scheduler domain.
> >
> > Signed-off-by: Tim Chen <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
>
>
> Barry,
>
> Can you also add this chunk to the patch.
> Thanks.

Sure, Tim, Thanks. I'll put that into patch 4/4 in v6.

>
> Tim
>
>
> diff --git a/arch/x86/include/asm/topology.h
> b/arch/x86/include/asm/topology.h
> index 2a11ccc14fb1..800fa48c9fcd 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -115,6 +115,7 @@ extern unsigned int __max_die_per_package;
>
> #ifdef CONFIG_SMP
> #define topology_die_cpumask(cpu) (per_cpu(cpu_die_map, cpu))
> +#define topology_cluster_cpumask(cpu) (cpu_clustergroup_mask(cpu))
> #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
> #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
>

Thanks
Barry


Subject: RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86



> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Wednesday, March 24, 2021 12:15 PM
> To: 'Tim Chen' <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; Jonathan Cameron <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> xuwei (O) <[email protected]>; Zengtao (B) <[email protected]>;
> [email protected]; yangyicong <[email protected]>; Liguozhu (Kenneth)
> <[email protected]>; [email protected]; [email protected]
> Subject: RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86
>
>
>
> > -----Original Message-----
> > From: Tim Chen [mailto:[email protected]]
> > Sent: Wednesday, March 24, 2021 11:51 AM
> > To: Song Bao Hua (Barry Song) <[email protected]>;
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]
> > Cc: [email protected]; [email protected];
> > [email protected]; Jonathan Cameron <[email protected]>;
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > xuwei (O) <[email protected]>; Zengtao (B) <[email protected]>;
> > [email protected]; yangyicong <[email protected]>; Liguozhu
> (Kenneth)
> > <[email protected]>; [email protected]; [email protected]
> > Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for
> x86
> >
> >
> >
> > On 3/18/21 9:16 PM, Barry Song wrote:
> > > From: Tim Chen <[email protected]>
> > >
> > > There are x86 CPU architectures (e.g. Jacobsville) where the L2 cache
> > > is shared among a cluster of cores instead of being exclusive
> > > to one single core.
> > >
> > > To prevent oversubscription of L2 cache, load should be
> > > balanced between such L2 clusters, especially for tasks with
> > > no shared data.
> > >
> > > Also with cluster scheduling policy where tasks are woken up
> > > in the same L2 cluster, we will benefit from keeping tasks
> > > related to each other and likely sharing data in the same L2
> > > cluster.
> > >
> > > Add CPU masks of CPUs sharing the L2 cache so we can build such
> > > L2 cluster scheduler domain.
> > >
> > > Signed-off-by: Tim Chen <[email protected]>
> > > Signed-off-by: Barry Song <[email protected]>
> >
> >
> > Barry,
> >
> > Can you also add this chunk to the patch.
> > Thanks.
>
> Sure, Tim, Thanks. I'll put that into patch 4/4 in v6.

Hi Tim,
You might want to take a look at this qemu patchset:
https://lore.kernel.org/qemu-devel/[email protected]/T/#t

Someone is trying to leverage this cluster topology to improve KVM
virtual machine performance.

>
> >
> > Tim
> >
> >
> > diff --git a/arch/x86/include/asm/topology.h
> > b/arch/x86/include/asm/topology.h
> > index 2a11ccc14fb1..800fa48c9fcd 100644
> > --- a/arch/x86/include/asm/topology.h
> > +++ b/arch/x86/include/asm/topology.h
> > @@ -115,6 +115,7 @@ extern unsigned int __max_die_per_package;
> >
> > #ifdef CONFIG_SMP
> > #define topology_die_cpumask(cpu) (per_cpu(cpu_die_map, cpu))
> > +#define topology_cluster_cpumask(cpu) (cpu_clustergroup_mask(cpu))
> > #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
> > #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
> >
>

Thanks
Barry

Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86



On 3/23/21 4:21 PM, Song Bao Hua (Barry Song) wrote:

>>
>> On 3/18/21 9:16 PM, Barry Song wrote:
>>> From: Tim Chen <[email protected]>
>>>
>>> There are x86 CPU architectures (e.g. Jacobsville) where the L2 cache
>>> is shared among a cluster of cores instead of being exclusive
>>> to one single core.
>>>
>>> To prevent oversubscription of L2 cache, load should be
>>> balanced between such L2 clusters, especially for tasks with
>>> no shared data.
>>>
>>> Also with cluster scheduling policy where tasks are woken up
>>> in the same L2 cluster, we will benefit from keeping tasks
>>> related to each other and likely sharing data in the same L2
>>> cluster.
>>>
>>> Add CPU masks of CPUs sharing the L2 cache so we can build such
>>> L2 cluster scheduler domain.
>>>
>>> Signed-off-by: Tim Chen <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>
>>
>> Barry,
>>
>> Can you also add this chunk to the patch.
>> Thanks.
>
> Sure, Tim, Thanks. I'll put that into patch 4/4 in v6.
>

Barry,

This chunk will also need to be added to return cluster id for x86.
Please add it in your next rev.

Thanks.

Tim

---

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 800fa48c9fcd..2548d824f103 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -109,6 +109,7 @@ extern const struct cpumask *cpu_clustergroup_mask(int cpu);
#define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id)
#define topology_logical_die_id(cpu) (cpu_data(cpu).logical_die_id)
#define topology_die_id(cpu) (cpu_data(cpu).cpu_die_id)
+#define topology_cluster_id(cpu) (per_cpu(cpu_l2c_id, cpu))
#define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id)

extern unsigned int __max_die_per_package;

Subject: RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86



> -----Original Message-----
> From: Tim Chen [mailto:[email protected]]
> Sent: Wednesday, April 21, 2021 6:32 AM
> To: Song Bao Hua (Barry Song) <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; Jonathan Cameron <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> xuwei (O) <[email protected]>; Zengtao (B) <[email protected]>;
> [email protected]; yangyicong <[email protected]>; Liguozhu (Kenneth)
> <[email protected]>; [email protected]; [email protected]
> Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86
>
>
>
> On 3/23/21 4:21 PM, Song Bao Hua (Barry Song) wrote:
>
> >>
> >> On 3/18/21 9:16 PM, Barry Song wrote:
> >>> From: Tim Chen <[email protected]>
> >>>
> >>> There are x86 CPU architectures (e.g. Jacobsville) where the L2 cache
> >>> is shared among a cluster of cores instead of being exclusive
> >>> to one single core.
> >>>
> >>> To prevent oversubscription of L2 cache, load should be
> >>> balanced between such L2 clusters, especially for tasks with
> >>> no shared data.
> >>>
> >>> Also with cluster scheduling policy where tasks are woken up
> >>> in the same L2 cluster, we will benefit from keeping tasks
> >>> related to each other and likely sharing data in the same L2
> >>> cluster.
> >>>
> >>> Add CPU masks of CPUs sharing the L2 cache so we can build such
> >>> L2 cluster scheduler domain.
> >>>
> >>> Signed-off-by: Tim Chen <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>
> >>
> >> Barry,
> >>
> >> Can you also add this chunk to the patch.
> >> Thanks.
> >
> > Sure, Tim, Thanks. I'll put that into patch 4/4 in v6.
> >
>
> Barry,
>
> This chunk will also need to be added to return cluster id for x86.
> Please add it in your next rev.

Yes. Thanks. I'll put this in either RFC v7 or Patch v1.

For the spreading path, things are much easier, though the packing path
is quite tricky. But it seems RFC v6 is quite close to what we want to
achieve: packing related tasks by scanning the cluster for tasks within
the same NUMA node:
https://lore.kernel.org/lkml/[email protected]/

If related tasks are already in the same LLC (NUMA node), scanning
clusters will gather them further. If they are running in different NUMA
nodes, the original LLC scan will move them to the same node; after that,
the cluster scan may put them closer to each other.

It seems to be the kind of two-level packing Dietmar has suggested.

So perhaps we won't have an RFC v7; I will probably send patch v1
afterwards.

>
> Thanks.
>
> Tim
>
> ---
>
> diff --git a/arch/x86/include/asm/topology.h
> b/arch/x86/include/asm/topology.h
> index 800fa48c9fcd..2548d824f103 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -109,6 +109,7 @@ extern const struct cpumask *cpu_clustergroup_mask(int cpu);
> #define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id)
> #define topology_logical_die_id(cpu) (cpu_data(cpu).logical_die_id)
> #define topology_die_id(cpu) (cpu_data(cpu).cpu_die_id)
> +#define topology_cluster_id(cpu) (per_cpu(cpu_l2c_id, cpu))
> #define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id)
>
> extern unsigned int __max_die_per_package;

Thanks
Barry