2017-09-07 07:20:23

by Suthikulpanit, Suravee

Subject: [PATCH v3] sched/topology: Introduce NUMA identity node sched domain

On AMD Family17h-based (EPYC) systems, a logical NUMA node can contain
up to 8 cores (16 threads) with the following topology.

    ----------------------------
C0  | T0 T1 |    ||    | T0 T1 | C4
    --------|    ||    |--------
C1  | T0 T1 | L3 || L3 | T0 T1 | C5
    --------|    ||    |--------
C2  | T0 T1 | #0 || #1 | T0 T1 | C6
    --------|    ||    |--------
C3  | T0 T1 |    ||    | T0 T1 | C7
    ----------------------------

Here, there are 2 last-level (L3) caches per logical NUMA node.
A socket can contain up to 4 NUMA nodes, and a system can support
up to 2 sockets. With a full system configuration, the current scheduler
creates 4 sched domains:

domain0 SMT (span a core)
domain1 MC (span a last-level-cache)
domain2 NUMA (span a socket: 4 nodes)
domain3 NUMA (span a system: 8 nodes)

Note that there is no domain to represent CPUs spanning a logical
NUMA node. With this hierarchy of sched domains, the scheduler does
not balance properly in the following cases:

Case1:
When running 8 tasks, a properly balanced system should
schedule a task per logical NUMA node. This is not the case for
the current scheduler.

Case2:
In some cases, threads are scheduled on the same CPU, while other
CPUs are idle. This results in run-to-run inconsistency. For example:

taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
--cpu-max-prime=100000 run

Total execution time ranges from 25.1s to 33.5s depending on thread
placement, where 25.1s is when all 8 threads are balanced properly
on 8 CPUs.

Introduce a NUMA identity node sched domain, which is based on how
the SRAT/SLIT tables define a logical NUMA node. This results in the following
hierarchy of sched domains on the same system described above.

domain0 SMT (span a core)
domain1 MC (span a last-level-cache)
domain2 NODE (span a logical NUMA node)
domain3 NUMA (span a socket: 4 nodes)
domain4 NUMA (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.

Note that if the cpumasks of the last-level-cache and NODE domains
are the same (e.g. on AMD family10h/15h servers), the NODE domain
will be excluded. Therefore, this change will not affect those systems.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
kernel/sched/topology.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79895ae..98a8bbc 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1335,6 +1335,10 @@ void sched_init_numa(void)
 	if (!sched_domains_numa_distance)
 		return;

+	/* Includes NUMA identity node at level 0. */
+	sched_domains_numa_distance[level++] = curr_distance;
+	sched_domains_numa_levels = level;
+
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
@@ -1382,8 +1386,7 @@ void sched_init_numa(void)
 		return;

 	/*
-	 * 'level' contains the number of unique distances, excluding the
-	 * identity distance node_distance(i,i).
+	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
 	 * numbers.
@@ -1445,9 +1448,26 @@ void sched_init_numa(void)
 		tl[i] = sched_domain_topology[i];

 	/*
+	 * Do not setup NUMA node level if it has the same cpumask
+	 * as sched domain at previous level. This is the case for
+	 * system with:
+	 *   LLC == NODE : LLC (MC) sched domain span a NUMA node.
+	 *   DIE == NODE : DIE sched domain span a NUMA node.
+	 *
+	 * Assume all NUMA nodes are identical, so only check node 0.
+	 */
+	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
+		tl[i++] = (struct sched_domain_topology_level){
+			.mask = sd_numa_mask,
+			.numa_level = 0,
+			SD_INIT_NAME(NODE)
+		};
+	}
+
+	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
-	for (j = 0; j < level; i++, j++) {
+	for (j = 1; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
 			.sd_flags = cpu_numa_flags,
--
2.7.4
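
The first hunk seeds sched_domains_numa_distance[] with the identity
distance before the deduplicating selection sort runs, so level 0 now
describes a single logical NUMA node. A minimal userspace sketch of that
idea follows. It is not the kernel code: unique_distances() and the flat
row-major table are hypothetical names chosen for illustration, and the
loop scans every node pair on each pass rather than reproducing the
kernel's exact iteration.

```c
#include <assert.h>
#include <limits.h>

/*
 * Collect the unique values of an nr x nr distance table in ascending
 * order, with the identity distance dist[0][0] stored at index 0.
 * Returns the number of levels written to out[].
 *
 * Assumptions (illustration only): 'dist' is row-major, and the
 * identity distance is the smallest value in the table, so anything
 * smaller than dist[0][0] would be ignored.
 */
static int unique_distances(const int *dist, int nr, int *out, int max_out)
{
	int level = 0;
	int curr = dist[0];	/* stands in for node_distance(0, 0) */

	assert(nr > 0 && max_out > 0);
	out[level++] = curr;	/* identity distance becomes level 0 */

	while (level < max_out) {
		int next = INT_MAX;

		/* Find the smallest distance strictly above 'curr'. */
		for (int i = 0; i < nr * nr; i++)
			if (dist[i] > curr && dist[i] < next)
				next = dist[i];

		if (next == INT_MAX)
			break;	/* no larger distance left */

		out[level++] = next;
		curr = next;
	}

	return level;
}
```

With a hypothetical 4-node table that uses 10 for the local node, 16 for
a node on the same socket and 32 for a node on the other socket, the
function returns three levels (10, 16, 32): level 0 would back the NODE
domain, and levels 1 and 2 the two NUMA domains.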


2017-09-14 16:12:54

by Suthikulpanit, Suravee

Subject: Re: [PATCH v3] sched/topology: Introduce NUMA identity node sched domain

Hi,

Are there any other concerns with this patch?

Thanks,
Suravee

On 9/7/17 00:20, Suravee Suthikulpanit wrote:
> On AMD Family17h-based (EPYC) system, a logical NUMA node can contain
> upto 8 cores (16 threads) with the following topology.
>
>     ----------------------------
> C0  | T0 T1 |    ||    | T0 T1 | C4
>     --------|    ||    |--------
> C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>     --------|    ||    |--------
> C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>     --------|    ||    |--------
> C3  | T0 T1 |    ||    | T0 T1 | C7
>     ----------------------------
>
> Here, there are 2 last-level (L3) caches per logical NUMA node.
> A socket can contain upto 4 NUMA nodes, and a system can support
> upto 2 sockets. With full system configuration, current scheduler
> creates 4 sched domains:
>
> domain0 SMT (span a core)
> domain1 MC (span a last-level-cache)
> domain2 NUMA (span a socket: 4 nodes)
> domain3 NUMA (span a system: 8 nodes)
>
> Note that there is no domain to represent cpus spaning a logical
> NUMA node. With this hierarchy of sched domains, the scheduler does
> not balance properly in the following cases:
>
> Case1:
> When running 8 tasks, a properly balanced system should
> schedule a task per logical NUMA node. This is not the case for
> the current scheduler.
>
> Case2:
> In some cases, threads are scheduled on the same cpu, while other
> cpus are idle. This results in run-to-run inconsistency. For example:
>
> taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
> --cpu-max-prime=100000 run
>
> Total execution time ranges from 25.1s to 33.5s depending on threads
> placement, where 25.1s is when all 8 threads are balanced properly
> on 8 cpus.
>
> Introducing NUMA identity node sched domain, which is based on how
> SRAT/SLIT table define a logical NUMA node. This results in the following
> hierarchy of sched domains on the same system described above.
>
> domain0 SMT (span a core)
> domain1 MC (span a last-level-cache)
> domain2 NODE (span a logical NUMA node)
> domain3 NUMA (span a socket: 4 nodes)
> domain4 NUMA (span a system: 8 nodes)
>
> This fixes the improper load balancing cases mentioned above.
>
> Note that in case cpumask of the last-level-cache and NODE domains
> are the same (e.g. on AMD family10h/15h servers), the NODE domain
> will be excluded. Therefore, this change will not affect those systems.
>
> Signed-off-by: Suravee Suthikulpanit <[email protected]>
> ---
> kernel/sched/topology.c | 26 +++++++++++++++++++++++---
> 1 file changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895ae..98a8bbc 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
>  	if (!sched_domains_numa_distance)
>  		return;
>
> +	/* Includes NUMA identity node at level 0. */
> +	sched_domains_numa_distance[level++] = curr_distance;
> +	sched_domains_numa_levels = level;
> +
>  	/*
>  	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>  	 * unique distances in the node_distance() table.
> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
>  		return;
>
>  	/*
> -	 * 'level' contains the number of unique distances, excluding the
> -	 * identity distance node_distance(i,i).
> +	 * 'level' contains the number of unique distances
>  	 *
>  	 * The sched_domains_numa_distance[] array includes the actual distance
>  	 * numbers.
> @@ -1445,9 +1448,26 @@ void sched_init_numa(void)
>  		tl[i] = sched_domain_topology[i];
>
>  	/*
> +	 * Do not setup NUMA node level if it has the same cpumask
> +	 * as sched domain at previous level. This is the case for
> +	 * system with:
> +	 *   LLC == NODE : LLC (MC) sched domain span a NUMA node.
> +	 *   DIE == NODE : DIE sched domain span a NUMA node.
> +	 *
> +	 * Assume all NUMA nodes are identical, so only check node 0.
> +	 */
> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
> +		tl[i++] = (struct sched_domain_topology_level){
> +			.mask = sd_numa_mask,
> +			.numa_level = 0,
> +			SD_INIT_NAME(NODE)
> +		};
> +	}
> +
> +	/*
>  	 * .. and append 'j' levels of NUMA goodness.
>  	 */
> -	for (j = 0; j < level; i++, j++) {
> +	for (j = 1; j < level; i++, j++) {
>  		tl[i] = (struct sched_domain_topology_level){
>  			.mask = sd_numa_mask,
>  			.sd_flags = cpu_numa_flags,
>

2017-09-27 11:11:41

by Borislav Petkov

Subject: Re: [PATCH v3] sched/topology: Introduce NUMA identity node sched domain

On Thu, Sep 07, 2017 at 02:20:05AM -0500, Suravee Suthikulpanit wrote:
> On AMD Family17h-based (EPYC) system, a logical NUMA node can contain

Let's simply spell it F17h like we did for the older families.

> upto 8 cores (16 threads) with the following topology.
>
>     ----------------------------
> C0  | T0 T1 |    ||    | T0 T1 | C4
>     --------|    ||    |--------
> C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>     --------|    ||    |--------
> C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>     --------|    ||    |--------
> C3  | T0 T1 |    ||    | T0 T1 | C7
>     ----------------------------
>
> Here, there are 2 last-level (L3) caches per logical NUMA node.
> A socket can contain upto 4 NUMA nodes, and a system can support
> upto 2 sockets. With full system configuration, current scheduler
> creates 4 sched domains:
>
> domain0 SMT (span a core)
> domain1 MC (span a last-level-cache)
> domain2 NUMA (span a socket: 4 nodes)
> domain3 NUMA (span a system: 8 nodes)
>
> Note that there is no domain to represent cpus spaning a logical

s/cpus/CPUs/

s/spaning/spanning/

Please introduce a spellchecker into your patch creation workflow.

> NUMA node. With this hierarchy of sched domains, the scheduler does
> not balance properly in the following cases:
>
> Case1:
> When running 8 tasks, a properly balanced system should
> schedule a task per logical NUMA node. This is not the case for
> the current scheduler.

I'd like to have a sentence or two here saying what the problem is,
i.e., how do the 8 tasks get placed...

>
> Case2:
> In some cases, threads are scheduled on the same cpu, while other

s/cpu/CPU/

> cpus are idle.

... like this sentence, for example, explaining what happens without
that patch.

> This results in run-to-run inconsistency. For example:
>
> taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
> --cpu-max-prime=100000 run
>
> Total execution time ranges from 25.1s to 33.5s depending on threads
> placement, where 25.1s is when all 8 threads are balanced properly
> on 8 cpus.

s/cpus/CPUs/

Please check the whole patch (comments, etc).

> Introducing NUMA identity node sched domain, which is based on how

"Introduce... " no ing form but plain procedural, do this, do that.

> SRAT/SLIT table define a logical NUMA node. This results in the following
> hierarchy of sched domains on the same system described above.
>
> domain0 SMT (span a core)
> domain1 MC (span a last-level-cache)
> domain2 NODE (span a logical NUMA node)
> domain3 NUMA (span a socket: 4 nodes)
> domain4 NUMA (span a system: 8 nodes)
>
> This fixes the improper load balancing cases mentioned above.
>
> Note that in case cpumask of the last-level-cache and NODE domains
> are the same (e.g. on AMD family10h/15h servers), the NODE domain

As above.

> will be excluded. Therefore, this change will not affect those systems.

Right, and this is running on *all* machines, not only AMD or x86. Why
doesn't it affect others? The degenerate code maybe?

...

> @@ -1445,9 +1448,26 @@ void sched_init_numa(void)
>  		tl[i] = sched_domain_topology[i];
>
>  	/*
> +	 * Do not setup NUMA node level if it has the same cpumask
> +	 * as sched domain at previous level. This is the case for
> +	 * system with:
> +	 *   LLC == NODE : LLC (MC) sched domain span a NUMA node.
> +	 *   DIE == NODE : DIE sched domain span a NUMA node.
> +	 *
> +	 * Assume all NUMA nodes are identical, so only check node 0.
> +	 */
> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
> +		tl[i++] = (struct sched_domain_topology_level){
> +			.mask = sd_numa_mask,
> +			.numa_level = 0,
> +			SD_INIT_NAME(NODE)
> +		};
> +	}

Right, I think the issue wrt the degenerate code is not fully discussed
yet judging by:

https://lkml.kernel.org/r/[email protected]

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
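
The quoted hunk adds the NODE level only when its cpumask for node 0
differs from the last base topology level, and then appends the NUMA
levels starting at distance index 1. The following userspace sketch
models just that decision, to show why a system with LLC == NODE keeps
its existing hierarchy. It is an illustration under stated assumptions,
not kernel code: struct level, build_levels() and the 64-bit masks are
hypothetical stand-ins for struct sched_domain_topology_level, the body
of sched_init_numa() and struct cpumask. It does not model the
scheduler's separate domain-degeneration logic raised in the review.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEVELS 8

/* Hypothetical stand-in for struct sched_domain_topology_level. */
struct level {
	const char *name;
	uint64_t cpu0_mask;	/* CPUs this level spans, as seen from CPU 0 */
};

/*
 * Copy the base levels, add NODE only if its mask differs from the
 * last base level, then append one NUMA level per remaining distance
 * (indices 1..nr_distances-1; index 0 is the identity distance).
 * Returns the total number of levels written to tl[].
 */
static int build_levels(const struct level *base, int nr_base,
			uint64_t node0_mask, int nr_distances,
			struct level *tl)
{
	int i, j;

	assert(nr_base > 0);	/* tl[i - 1] below needs a base level */
	assert(nr_base + nr_distances <= MAX_LEVELS);

	for (i = 0; i < nr_base; i++)
		tl[i] = base[i];

	/* Mirrors the cpumask_equal() check; only node 0 is examined. */
	if (node0_mask != tl[i - 1].cpu0_mask)
		tl[i++] = (struct level){ .name = "NODE", .cpu0_mask = node0_mask };

	/* NUMA masks are not modelled here, so they are left as 0. */
	for (j = 1; j < nr_distances; i++, j++)
		tl[i] = (struct level){ .name = "NUMA", .cpu0_mask = 0 };

	return i;
}
```

With base levels SMT (mask 0x3) and MC (mask 0xff), a node mask of
0xffff and three unique distances, the sketch produces five levels:
SMT, MC, NODE, NUMA, NUMA. If the MC mask already equals the node mask
(0xffff), the NODE level is skipped and four levels remain, matching
the pre-patch hierarchy.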