2017-08-11 07:41:58

by Suthikulpanit, Suravee

Subject: [PATCH v2] sched/topology: Introduce NUMA identity node sched domain

On an AMD Family17h-based (EPYC) system, a NUMA node can contain
up to 8 cores (16 threads) with the following topology.

    ----------------------------
C0  | T0 T1 |    ||    | T0 T1 | C4
    --------|    ||    |--------
C1  | T0 T1 | L3 || L3 | T0 T1 | C5
    --------|    ||    |--------
C2  | T0 T1 | #0 || #1 | T0 T1 | C6
    --------|    ||    |--------
C3  | T0 T1 |    ||    | T0 T1 | C7
    ----------------------------

Here, there are 2 last-level (L3) caches per NUMA node. A socket can
contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
With a full system configuration, the current scheduler creates 4 sched
domains:

domain0 SMT (span a core)
domain1 MC (span a last-level-cache)
domain2 NUMA (span a socket: 4 nodes)
domain3 NUMA (span a system: 8 nodes)

Note that there is no domain to represent the cpus spanning a NUMA node.
With this hierarchy of sched domains, the scheduler does not balance
properly in the following cases:

Case1:
When running 8 tasks, a properly balanced system should
schedule a task per NUMA node. This is not the case for
the current scheduler.

Case2:
Sometimes, threads are scheduled on the same cpu, while other
cpus are idle. This results in run-to-run inconsistency. For example:

taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
--cpu-max-prime=100000 run

Total execution time ranges from 25.1s to 33.5s depending on thread
placement, where 25.1s is when all 8 threads are balanced properly
across 8 cpus.
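
Thread placement can be observed while the benchmark runs: field 39 of
/proc/<pid>/task/<tid>/stat reports the CPU each thread last ran on. A
rough user-space sketch (not part of this patch, purely illustrative):

    /* placement.c: print the CPU each thread of <pid> last ran on. */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
            char path[256], buf[1024];
            struct dirent *de;
            DIR *dir;

            if (argc < 2)
                    return 1;

            snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);
            dir = opendir(path);
            if (!dir)
                    return 1;

            while ((de = readdir(dir)) != NULL) {
                    char *tok;
                    int field = 1;
                    FILE *f;

                    if (de->d_name[0] == '.')
                            continue;

                    snprintf(path, sizeof(path), "/proc/%s/task/%s/stat",
                             argv[1], de->d_name);
                    f = fopen(path, "r");
                    if (!f)
                            continue;

                    /*
                     * Field 39 is the CPU the thread last ran on.
                     * Crude parsing: assumes the thread name (field 2)
                     * contains no spaces, which holds for sysbench.
                     */
                    if (fgets(buf, sizeof(buf), f)) {
                            for (tok = strtok(buf, " "); tok && field < 39;
                                 tok = strtok(NULL, " "))
                                    field++;
                            if (tok)
                                    printf("tid %s on cpu %s\n",
                                           de->d_name, tok);
                    }
                    fclose(f);
            }
            closedir(dir);
            return 0;
    }

In the slow runs, several sysbench threads report the same CPU while
other CPUs in the 0-7 range stay idle.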

Introduce a NUMA identity node sched domain, which is based on how the
SRAT/SLIT tables define a NUMA node. This results in the following
hierarchy of sched domains on the same system described above.

domain0 SMT (span a core)
domain1 MC (span a last-level-cache)
domain2 NODE (span a NUMA node)
domain3 NUMA (span a socket: 4 nodes)
domain4 NUMA (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.
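
For reference, the resulting hierarchy can be inspected from user space
when CONFIG_SCHED_DEBUG is set; on kernels of this vintage each level is
exported under /proc/sys/kernel/sched_domain/. A minimal sketch, assuming
that sysctl layout (illustrative only, not part of the patch):

    /* domains.c: print the sched domain hierarchy seen by CPU 0. */
    #include <stdio.h>

    int main(void)
    {
            char path[128], name[64];
            int level;

            for (level = 0; ; level++) {
                    FILE *f;

                    snprintf(path, sizeof(path),
                             "/proc/sys/kernel/sched_domain/cpu0/domain%d/name",
                             level);
                    f = fopen(path, "r");
                    if (!f)
                            break;  /* no more levels */
                    if (fgets(name, sizeof(name), f))
                            printf("domain%d: %s", level, name);
                    fclose(f);
            }
            return 0;
    }

With this patch applied on the system above, the expected output is the
five levels listed: SMT, MC, NODE, NUMA, NUMA.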

Cc: [email protected]
Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
Changes from V1 (https://lkml.org/lkml/2017/8/10/540)
* Update commit message to include performance number.
* Change from NUMA_IDEN to NODE.
* Fix code styling and update comments.

kernel/sched/topology.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79895ae..2dd5b11 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1335,6 +1335,10 @@ void sched_init_numa(void)
         if (!sched_domains_numa_distance)
                 return;
 
+        /* Includes NUMA identity node at level 0. */
+        sched_domains_numa_distance[level++] = curr_distance;
+        sched_domains_numa_levels = level;
+
         /*
          * O(nr_nodes^2) deduplicating selection sort -- in order to find the
          * unique distances in the node_distance() table.
@@ -1382,8 +1386,7 @@ void sched_init_numa(void)
                 return;
 
         /*
-         * 'level' contains the number of unique distances, excluding the
-         * identity distance node_distance(i,i).
+         * 'level' contains the number of unique distances
          *
          * The sched_domains_numa_distance[] array includes the actual distance
          * numbers.
@@ -1445,9 +1448,26 @@ void sched_init_numa(void)
                 tl[i] = sched_domain_topology[i];
 
         /*
+         * Do not setup NUMA node level if it has the same cpumask
+         * as sched domain at previous level:
+         * This is the case for system with:
+         *  - LLC == NODE : LLC (MC) sched domain span a NUMA node.
+         *  - DIE == NODE : DIE sched domain span a NUMA node.
+         *
+         * Assume all NUMA nodes are identical, so only check node 0.
+         */
+        if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
+                tl[i++] = (struct sched_domain_topology_level){
+                        .mask = sd_numa_mask,
+                        .numa_level = 0,
+                        SD_INIT_NAME(NODE)
+                };
+        }
+
+        /*
          * .. and append 'j' levels of NUMA goodness.
          */
-        for (j = 0; j < level; i++, j++) {
+        for (j = 1; j < level; i++, j++) {
                 tl[i] = (struct sched_domain_topology_level){
                         .mask = sd_numa_mask,
                         .sd_flags = cpu_numa_flags,
--
2.7.4


2017-08-14 17:43:18

by Borislav Petkov

Subject: Re: [PATCH v2] sched/topology: Introduce NUMA identity node sched domain

Dropping stable@ from CC.

When using git send-email, make sure to exclude stable@ from the list of
recipients as this is not how we send a patch to stable.

On Fri, Aug 11, 2017 at 02:41:15AM -0500, Suravee Suthikulpanit wrote:
> On AMD Family17h-based (EPYC) system, a NUMA node can contain
> upto 8 cores (16 threads) with the following topology.
>
>     ----------------------------
> C0  | T0 T1 |    ||    | T0 T1 | C4
>     --------|    ||    |--------
> C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>     --------|    ||    |--------
> C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>     --------|    ||    |--------
> C3  | T0 T1 |    ||    | T0 T1 | C7
>     ----------------------------
>
> Here, there are 2 last-level (L3) caches per NUMA node. A socket can
> contain upto 4 NUMA nodes, and a system can support upto 2 sockets.
> With full system configuration, current scheduler creates 4 sched
> domains:
>
> domain0 SMT (span a core)
> domain1 MC (span a last-level-cache)
> domain2 NUMA (span a socket: 4 nodes)
> domain3 NUMA (span a system: 8 nodes)
>
> Note that there is no domain to represent cpus spaning a NUMA node.

...

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895ae..2dd5b11 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
>          if (!sched_domains_numa_distance)
>                  return;
>
> +        /* Includes NUMA identity node at level 0. */
> +        sched_domains_numa_distance[level++] = curr_distance;
> +        sched_domains_numa_levels = level;
> +
>          /*
>           * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>           * unique distances in the node_distance() table.
> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
>                  return;
>
>          /*
> -         * 'level' contains the number of unique distances, excluding the
> -         * identity distance node_distance(i,i).

I'm still unclear as to why we were excluding this identity distance
until now and how that change would affect existing systems.

Also, you do use the term "NUMA" pretty loosely in the text - please
take care to explain precisely what kind of node you mean: physical,
logical, ... Don't be afraid to be too verbose.

Thanks.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2017-08-24 01:13:45

by Suthikulpanit, Suravee

Subject: Re: [PATCH v2] sched/topology: Introduce NUMA identity node sched domain

Boris,

Sorry for the late reply. I missed this email earlier.

On 8/15/17 00:43, Borislav Petkov wrote:
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 79895ae..2dd5b11 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
>>          if (!sched_domains_numa_distance)
>>                  return;
>>
>> +        /* Includes NUMA identity node at level 0. */
>> +        sched_domains_numa_distance[level++] = curr_distance;
>> +        sched_domains_numa_levels = level;
>> +
>>          /*
>>           * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>>           * unique distances in the node_distance() table.
>> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
>>                  return;
>>
>>          /*
>> -         * 'level' contains the number of unique distances, excluding the
>> -         * identity distance node_distance(i,i).
> I'm still unclear as to why were we excluding this identity distance
> until now and how would that change affect existing systems.

One reason I can think of is that, on older systems (at least for AMD), the L3
sched domain level has the same cpumask as the node level. This is not the case
for Zen.
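
In other words, on those older parts the MC (LLC) mask and the node mask are
the same set of cpus, so an extra NODE level would just duplicate MC. A
hypothetical helper (not in the patch, kernel context assumed) showing the
kind of comparison the new cpumask_equal() check guards against:

    #include <linux/cpumask.h>
    #include <linux/topology.h>

    /*
     * True when the last-level-cache mask already covers the CPU's whole
     * NUMA node, i.e. the pre-Zen AMD case where a NODE sched domain
     * level would be redundant with the MC level.
     */
    static bool llc_spans_node(int cpu)
    {
            return cpumask_equal(cpu_coregroup_mask(cpu),
                                 cpumask_of_node(cpu_to_node(cpu)));
    }

The patch does an analogous comparison against tl[i-1].mask(0), so the NODE
level is only added when its span differs from that of the previous (MC or
DIE) level.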

> Also, you do use the term "NUMA" pretty loosely in the text - please
> take care to explain precisely what kind of node you mean: physical,
> logical, ... Don't be afraid to be too verbose.
>
> Thanks.

Mostly, I am referring to the logical NUMA node (as described by the SRAT/SLIT
tables), which can change based on how the BIOS configures the system. I will
update this in V3.

Thanks,
Suravee

2017-08-24 08:37:46

by Borislav Petkov

Subject: Re: [PATCH v2] sched/topology: Introduce NUMA identity node sched domain

On Thu, Aug 24, 2017 at 08:13:23AM +0700, Suravee Suthikulpanit wrote:
> One reason that I can think of is that, on older systems (at least for AMD),
> L3 sched domain level has the same cpumask as the node level. This is not
> the case for Zen.

So please write that in the commit message. You're changing code which
runs on older systems so the impact (or the absence of it) has to be
clear.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--