LinuxLists.cc - [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

2017-08-10 15:21:25

Subject: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

On AMD Family 17h-based (EPYC) systems, a NUMA node can contain
up to 8 cores (16 threads) with the following topology:

         ----------------------------
     C0  | T0 T1 |    ||    | T0 T1 | C4
         --------|    ||    |--------
     C1  | T0 T1 | L3 || L3 | T0 T1 | C5
         --------|    ||    |--------
     C2  | T0 T1 | #0 || #1 | T0 T1 | C6
         --------|    ||    |--------
     C3  | T0 T1 |    ||    | T0 T1 | C7
         ----------------------------

Here, there are 2 last-level (L3) caches per NUMA node. A socket can
contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
With a full system configuration, the current scheduler creates 4 sched
domains:

domain0 SMT (span a core)
domain1 MC (span a last-level-cache)
domain2 NUMA (span a socket: 4 nodes)
domain3 NUMA (span a system: 8 nodes)
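
For illustration only (this sketch is not part of the patch): the two NUMA
domains above correspond to the unique inter-node distances, excluding the
identity distance node_distance(i,i). The distance values 10/16/32, the node
numbering, and the helper count_numa_levels() are assumptions made for this
example; the kernel's own code in sched_init_numa() differs in detail.

```c
#include <assert.h>

#define NR_NODES 8

/*
 * Hypothetical SLIT-style table for 2 sockets x 4 nodes:
 * 10 = same node, 16 = same socket, 32 = other socket.
 */
static int node_distance(int a, int b)
{
	if (a == b)
		return 10;
	return (a / 4 == b / 4) ? 16 : 32;
}

/*
 * Collect, in ascending order, the unique distances that are larger
 * than the identity distance. The return value is the number of
 * distances found, i.e. the number of NUMA levels in this model.
 */
static int count_numa_levels(int *distances, int max_levels)
{
	int curr = node_distance(0, 0);
	int level = 0;

	assert(distances && max_levels > 0);

	for (;;) {
		int next = curr;

		/* Find the smallest distance strictly above 'curr'. */
		for (int i = 0; i < NR_NODES; i++) {
			for (int j = 0; j < NR_NODES; j++) {
				int d = node_distance(i, j);

				if (d > curr && (next == curr || d < next))
					next = d;
			}
		}

		if (next == curr || level == max_levels)
			break;

		distances[level++] = next;
		curr = next;
	}

	return level;
}
```

With the assumed table, count_numa_levels() reports two levels (distances 16
and 32), matching domain2 and domain3 in the list above. The identity distance
10 never becomes a level, which is why no domain spans a single NUMA node.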

Note that there is no domain to represent cpus spanning a NUMA node.
With this hierarchy of sched domains, the scheduler does not balance
properly in the following cases:

Case1:
When running 8 tasks, a properly balanced system should
schedule a task per NUMA node. This is not the case for
the current scheduler.

Case2:
When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
a properly balanced system should schedule 8 threads on 8 cpus
(e.g. T0 of C0-C7). However, the current scheduler schedules
some threads on the same cpu while other cpus are idle.

Introduce a NUMA identity node sched domain, which is based on how
the SRAT/SLIT tables define a NUMA node. This results in the following
hierarchy of sched domains on the same system described above.

domain0 SMT (span a core)
domain1 MC (span a last-level-cache)
domain2 NUMA_IDEN (span a NUMA node)
domain3 NUMA (span a socket: 4 nodes)
domain4 NUMA (span a system: 8 nodes)
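
A minimal sketch of the five-level hierarchy listed above, for illustration
only. It assumes a hypothetical linear cpu numbering (SMT siblings adjacent,
then cores, L3s, nodes, sockets), which is not how Linux necessarily numbers
cpus on real hardware; the names same_domain() and lowest_common_level() are
made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

enum level { SMT, MC, NODE, SOCKET, SYSTEM, NR_LEVELS };

/*
 * cpus per domain at each level: 2 threads per core, 4 cores per L3,
 * 2 L3s per node, 4 nodes per socket, 2 sockets per system.
 */
static const int span_size[NR_LEVELS] = { 2, 8, 16, 64, 128 };

/* True if both cpus fall into the same domain at level 'l'. */
static bool same_domain(int cpu_a, int cpu_b, enum level l)
{
	assert(l < NR_LEVELS);
	assert(cpu_a >= 0 && cpu_a < span_size[SYSTEM]);
	assert(cpu_b >= 0 && cpu_b < span_size[SYSTEM]);

	return cpu_a / span_size[l] == cpu_b / span_size[l];
}

/* Lowest level whose domain contains both cpus. */
static enum level lowest_common_level(int cpu_a, int cpu_b)
{
	for (int l = SMT; l < NR_LEVELS; l++) {
		if (same_domain(cpu_a, cpu_b, (enum level)l))
			return (enum level)l;
	}

	return SYSTEM;
}
```

Under this numbering, cpu 0 and cpu 8 sit on different L3s of the same node,
so they first share a domain at the NODE level. Without that level, the first
domain containing both would be the socket-wide NUMA domain.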

This fixes the improper load balancing cases mentioned above.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
kernel/sched/topology.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79895ae..c57df98 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1335,6 +1335,10 @@ void sched_init_numa(void)
 	if (!sched_domains_numa_distance)
 		return;

+	/* Includes NUMA identity node at level 0. */
+	sched_domains_numa_distance[level++] = curr_distance;
+	sched_domains_numa_levels = level;
+
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
@@ -1382,8 +1386,7 @@ void sched_init_numa(void)
 		return;

 	/*
-	 * 'level' contains the number of unique distances, excluding the
-	 * identity distance node_distance(i,i).
+	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
 	 * numbers.
@@ -1445,9 +1448,24 @@ void sched_init_numa(void)
 		tl[i] = sched_domain_topology[i];

 	/*
+	 * Ignore the NUMA identity level if it has the same cpumask
+	 * as previous level. This is the case for:
+	 *   - System with last-level-cache (MC) sched domain span a NUMA node.
+	 *   - System with DIE sched domain span a NUMA node.
+	 *
+	 * Assume all NUMA nodes are identical, so only check node 0.
+	 */
+	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
+		tl[i++] = (struct sched_domain_topology_level){
+			.mask = sd_numa_mask,
+			.numa_level = 0,
+			SD_INIT_NAME(NUMA_IDEN)
+		};
+
+	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
-	for (j = 0; j < level; i++, j++) {
+	for (j = 1; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
 			.sd_flags = cpu_numa_flags,
--
2.7.4
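
The last hunk can be modelled outside the kernel roughly as follows. This is
a simplified sketch, not the kernel code: it assumes a toy system of at most
64 cpus so that a uint64_t can stand in for a cpumask, and struct level and
build_levels() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* One topology level: a name plus the span of cpu 0 (one bit per cpu). */
struct level {
	const char *name;
	uint64_t span;
};

/*
 * Append the NUMA levels to the 'nr_base' levels already in 'tl'.
 * numa_span[0] is the identity level (the cpus of one node); it is
 * skipped when the previous level already covers exactly those cpus.
 * Returns the total number of levels now in 'tl'.
 */
static int build_levels(struct level *tl, int nr_base,
			const uint64_t *numa_span, int nr_numa_levels)
{
	int i = nr_base;

	assert(tl && numa_span);
	assert(nr_base > 0 && nr_numa_levels > 0);

	/* Mirrors the !cpumask_equal() check in the patch. */
	if (tl[i - 1].span != numa_span[0])
		tl[i++] = (struct level){ "NODE", numa_span[0] };

	/* Level 0 was handled above, so start at 1 as the patch does. */
	for (int j = 1; j < nr_numa_levels; j++)
		tl[i++] = (struct level){ "NUMA", numa_span[j] };

	return i;
}
```

Called with base levels SMT (span 0x3) and MC (span 0xFF) and NUMA spans
{ 0xFFFF, UINT64_MAX }, the model yields four levels with NODE inserted in
between. If MC already spans 0xFFFF (LLC == NODE), the NODE level is skipped
and only three levels remain.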

2017-08-10 16:41:58

by Peter Zijlstra

Subject: Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

On Thu, Aug 10, 2017 at 10:20:52AM -0500, Suravee Suthikulpanit wrote:
> On AMD Family 17h-based (EPYC) systems, a NUMA node can contain
> up to 8 cores (16 threads) with the following topology:
>
>          ----------------------------
>      C0  | T0 T1 |    ||    | T0 T1 | C4
>          --------|    ||    |--------
>      C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>          --------|    ||    |--------
>      C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>          --------|    ||    |--------
>      C3  | T0 T1 |    ||    | T0 T1 | C7
>          ----------------------------
>
> Here, there are 2 last-level (L3) caches per NUMA node. A socket can
> contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
> With a full system configuration, the current scheduler creates 4 sched
> domains:
>
> domain0 SMT (span a core)
> domain1 MC (span a last-level-cache)

Right, so traditionally we'd have the DIE level do that, but because
x86_has_numa_in_package we don't generate that, right?

> domain2 NUMA (span a socket: 4 nodes)
> domain3 NUMA (span a system: 8 nodes)
>
> Note that there is no domain to represent cpus spanning a NUMA node.
> With this hierarchy of sched domains, the scheduler does not balance
> properly in the following cases:
>
> Case1:
> When running 8 tasks, a properly balanced system should
> schedule a task per NUMA node. This is not the case for
> the current scheduler.
>
> Case2:
> When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
> a properly balanced system should schedule 8 threads on 8 cpus
> (e.g. T0 of C0-C7). However, current scheduler would schedule
> some threads on the same cpu, while others are idle.

Sure.. could you amend with a few actual performance numbers?

> Introduce a NUMA identity node sched domain, which is based on how
> the SRAT/SLIT tables define a NUMA node. This results in the following
> hierarchy of sched domains on the same system described above.
>
> domain0 SMT (span a core)
> domain1 MC (span a last-level-cache)
> domain2 NUMA_IDEN (span a NUMA node)

Hate that name though..

> domain3 NUMA (span a socket: 4 nodes)
> domain4 NUMA (span a system: 8 nodes)
>
> This fixes the improper load balancing cases mentioned above.
>
> Signed-off-by: Suravee Suthikulpanit <[email protected]>
> ---
> kernel/sched/topology.c | 24 +++++++++++++++++++++---
> 1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895ae..c57df98 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
> if (!sched_domains_numa_distance)
> return;
>
> + /* Includes NUMA identity node at level 0. */
> + sched_domains_numa_distance[level++] = curr_distance;
> + sched_domains_numa_levels = level;
> +
> /*
> * O(nr_nodes^2) deduplicating selection sort -- in order to find the
> * unique distances in the node_distance() table.
> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
> return;
>
> /*
> - * 'level' contains the number of unique distances, excluding the
> - * identity distance node_distance(i,i).
> + * 'level' contains the number of unique distances
> *
> * The sched_domains_numa_distance[] array includes the actual distance
> * numbers.
> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
> tl[i] = sched_domain_topology[i];
>
> /*
> + * Ignore the NUMA identity level if it has the same cpumask
> + * as previous level. This is the case for:
> + * - System with last-level-cache (MC) sched domain span a NUMA node.
> + * - System with DIE sched domain span a NUMA node.
> + *
> + * Assume all NUMA nodes are identical, so only check node 0.
> + */
> + if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
> + tl[i++] = (struct sched_domain_topology_level){
> + .mask = sd_numa_mask,
> + .numa_level = 0,
> + SD_INIT_NAME(NUMA_IDEN)

Shall we make that:

SD_INIT_NAME(NODE),

instead?

> + };

This misses a set of '{}'. While C doesn't require it, our coding style
warrants blocks around any multi-line statement.

So what you've forgotten to mention is that for those systems where the
LLC == NODE this now superfluous level gets removed by the degenerate
code. Have you verified that does the right thing?

> +
> + /*
> * .. and append 'j' levels of NUMA goodness.
> */
> - for (j = 0; j < level; i++, j++) {
> + for (j = 1; j < level; i++, j++) {
> tl[i] = (struct sched_domain_topology_level){
> .mask = sd_numa_mask,
> .sd_flags = cpu_numa_flags,
> --
> 2.7.4
>

2017-08-11 04:57:21

by Suthikulpanit, Suravee

Subject: Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

On 8/10/17 23:41, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 10:20:52AM -0500, Suravee Suthikulpanit wrote:
>> On AMD Family 17h-based (EPYC) systems, a NUMA node can contain
>> up to 8 cores (16 threads) with the following topology:
>>
>>          ----------------------------
>>      C0  | T0 T1 |    ||    | T0 T1 | C4
>>          --------|    ||    |--------
>>      C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>>          --------|    ||    |--------
>>      C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>>          --------|    ||    |--------
>>      C3  | T0 T1 |    ||    | T0 T1 | C7
>>          ----------------------------
>>
>> Here, there are 2 last-level (L3) caches per NUMA node. A socket can
>> contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
>> With a full system configuration, the current scheduler creates 4 sched
>> domains:
>>
>> domain0 SMT (span a core)
>> domain1 MC (span a last-level-cache)
>
> Right, so traditionally we'd have the DIE level do that, but because
> x86_has_numa_in_package we don't generate that, right?

That's correct.

>
>> domain2 NUMA (span a socket: 4 nodes)
>> domain3 NUMA (span a system: 8 nodes)
>>
>> Note that there is no domain to represent cpus spanning a NUMA node.
>> With this hierarchy of sched domains, the scheduler does not balance
>> properly in the following cases:
>>
>> Case1:
>> When running 8 tasks, a properly balanced system should
>> schedule a task per NUMA node. This is not the case for
>> the current scheduler.
>>
>> Case2:
>> When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
>> a properly balanced system should schedule 8 threads on 8 cpus
>> (e.g. T0 of C0-C7). However, current scheduler would schedule
>> some threads on the same cpu, while others are idle.
>
> Sure.. could you amend with a few actual performance numbers?

Sure.

>> [...]
>> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>> tl[i] = sched_domain_topology[i];
>>
>> /*
>> + * Ignore the NUMA identity level if it has the same cpumask
>> + * as previous level. This is the case for:
>> + * - System with last-level-cache (MC) sched domain span a NUMA node.
>> + * - System with DIE sched domain span a NUMA node.
>> + *
>> + * Assume all NUMA nodes are identical, so only check node 0.
>> + */
>> + if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
>> + tl[i++] = (struct sched_domain_topology_level){
>> + .mask = sd_numa_mask,
>> + .numa_level = 0,
>> + SD_INIT_NAME(NUMA_IDEN)
>
> Shall we make that:
>
> SD_INIT_NAME(NODE),
>
> instead?

Sounds good.

>> + };
>
> This misses a set of '{}'. While C doesn't require it, our coding style
> warrants blocks around any multi-line statement.
>
> So what you've forgotten to mention is that for those systems where the
> LLC == NODE this now superfluous level gets removed by the degenerate
> code. Have you verified that does the right thing?

Let me check with that one and get back.

Thanks,
Suravee

2017-08-11 05:58:36

by Suthikulpanit, Suravee

Subject: Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

On 8/11/17 11:57, Suravee Suthikulpanit wrote:
>
>>> [...]
>>> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>>> tl[i] = sched_domain_topology[i];
>>>
>>> /*
>>> + * Ignore the NUMA identity level if it has the same cpumask
>>> + * as previous level. This is the case for:
>>> + * - System with last-level-cache (MC) sched domain span a NUMA node.
>>> + * - System with DIE sched domain span a NUMA node.
>>> + *
>>> + * Assume all NUMA nodes are identical, so only check node 0.
>>> + */
>>> + if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
>>> + tl[i++] = (struct sched_domain_topology_level){
>>> + .mask = sd_numa_mask,
>>> + .numa_level = 0,
>>> + SD_INIT_NAME(NODE)
>>> + };
>>
>> So what you've forgotten to mention is that for those systems where the
>> LLC == NODE this now superfluous level gets removed by the degenerate
>> code. Have you verified that does the right thing?
>
> Let me check with that one and get back.

Actually, it is not removed by the degenerate code. That is what this logic is
for. It checks for LLC == NODE or DIE == NODE before setting up the NODE sched
level. I can update the comment. This has also been tested on a system with
LLC == NODE.

Thanks,
Suravee

2017-08-11 09:15:34

by Peter Zijlstra

Subject: Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

On Fri, Aug 11, 2017 at 12:58:22PM +0700, Suravee Suthikulpanit wrote:
>
>
> On 8/11/17 11:57, Suravee Suthikulpanit wrote:
> >
> > > > [...]
> > > > @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
> > > > tl[i] = sched_domain_topology[i];
> > > >
> > > > /*
> > > > + * Ignore the NUMA identity level if it has the same cpumask
> > > > + * as previous level. This is the case for:
> > > > + * - System with last-level-cache (MC) sched domain span a NUMA node.
> > > > + * - System with DIE sched domain span a NUMA node.
> > > > + *
> > > > + * Assume all NUMA nodes are identical, so only check node 0.
> > > > + */
> > > > + if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
> > > > + tl[i++] = (struct sched_domain_topology_level){
> > > > + .mask = sd_numa_mask,
> > > > + .numa_level = 0,
> > > > + SD_INIT_NAME(NODE)
> > > > + };
> > >
> > > So what you've forgotten to mention is that for those systems where the
> > > LLC == NODE this now superfluous level gets removed by the degenerate
> > > code. Have you verified that does the right thing?
> >
> > Let me check with that one and get back.
>
> Actually, it is not removed by the degenerate code. That is what this logic
> is for. It checks for LLC == NODE or DIE == NODE before setting up the NODE
> sched level. I can update the comment. This has also been tested on system
> w/ LLC == NODE.

Why does the degenerate code fail to remove things?

2017-08-14 07:45:14

by Suthikulpanit, Suravee

Subject: Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

On 8/11/17 16:15, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 12:58:22PM +0700, Suravee Suthikulpanit wrote:
>>
>>
>> On 8/11/17 11:57, Suravee Suthikulpanit wrote:
>>>
>>>>> [...]
>>>>> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>>>>> tl[i] = sched_domain_topology[i];
>>>>>
>>>>> /*
>>>>> + * Ignore the NUMA identity level if it has the same cpumask
>>>>> + * as previous level. This is the case for:
>>>>> + * - System with last-level-cache (MC) sched domain span a NUMA node.
>>>>> + * - System with DIE sched domain span a NUMA node.
>>>>> + *
>>>>> + * Assume all NUMA nodes are identical, so only check node 0.
>>>>> + */
>>>>> + if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
>>>>> + tl[i++] = (struct sched_domain_topology_level){
>>>>> + .mask = sd_numa_mask,
>>>>> + .numa_level = 0,
>>>>> + SD_INIT_NAME(NODE)
>>>>> + };
>>>>
>>>> So what you've forgotten to mention is that for those systems where the
>>>> LLC == NODE this now superfluous level gets removed by the degenerate
>>>> code. Have you verified that does the right thing?
>>>
>>> Let me check with that one and get back.
>>
>> Actually, it is not removed by the degenerate code. That is what this logic
>> is for. It checks for LLC == NODE or DIE == NODE before setting up the NODE
>> sched level. I can update the comment. This has also been tested on system
>> w/ LLC == NODE.
>
> Why does the degenerate code fail to remove things?
>

Sorry for the confusion. Actually, the degenerate code does remove the
duplicate NODE sched-domain.

The logic above takes a different approach. Instead of depending on the
degenerate code in cpu_attach_domain() at a later time, it excludes the
NODE sched-domain during sched_init_numa(). The difference is that, without the
!cpumask_equal() check, the MC sched-domain would end up with the
SD_PREFER_SIBLING flag set by the degenerate code, since the flag gets
transferred down from the NODE to the MC sched-domain. Would this be the
preferred behavior for the MC sched-domain?

Regards,
Suravee

2017-08-24 01:15:16

by Suthikulpanit, Suravee

Subject: Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

Hi Peter,

On 8/14/17 14:44, Suravee Suthikulpanit wrote:
>
>
> On 8/11/17 16:15, Peter Zijlstra wrote:
>> On Fri, Aug 11, 2017 at 12:58:22PM +0700, Suravee Suthikulpanit wrote:
>>>
>>>
>>> On 8/11/17 11:57, Suravee Suthikulpanit wrote:
>>>>
>>>>>> [...]
>>>>>> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>>>>>> tl[i] = sched_domain_topology[i];
>>>>>>
>>>>>> /*
>>>>>> + * Ignore the NUMA identity level if it has the same cpumask
>>>>>> + * as previous level. This is the case for:
>>>>>> + * - System with last-level-cache (MC) sched domain span a NUMA node.
>>>>>> + * - System with DIE sched domain span a NUMA node.
>>>>>> + *
>>>>>> + * Assume all NUMA nodes are identical, so only check node 0.
>>>>>> + */
>>>>>> + if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
>>>>>> + tl[i++] = (struct sched_domain_topology_level){
>>>>>> + .mask = sd_numa_mask,
>>>>>> + .numa_level = 0,
>>>>>> + SD_INIT_NAME(NODE)
>>>>>> + };
>>>>>
>>>>> So what you've forgotten to mention is that for those systems where the
>>>>> LLC == NODE this now superfluous level gets removed by the degenerate
>>>>> code. Have you verified that does the right thing?
>>>>
>>>> Let me check with that one and get back.
>>>
>>> Actually, it is not removed by the degenerate code. That is what this logic
>>> is for. It checks for LLC == NODE or DIE == NODE before setting up the NODE
>>> sched level. I can update the comment. This has also been tested on system
>>> w/ LLC == NODE.
>>
>> Why does the degenerate code fail to remove things?
>>
>
> Sorry for confusion. Actually, the degenerate code does remove the duplicate
> NODE sched-domain.
>
> The logic above is taking a different approach. Instead of depending on the
> degenerate code during cpu_attach_domain() at a later time, it would exclude the
> NODE sched-domain during sched_init_numa(). The difference is, without
> !cpumask_equal(), now the MC sched-domain would have the SD_PREFER_SIBLING flag
> set by the degenerate code since the flag got transferred down from the NODE to
> MC sched-domain. Would this be the preferred behavior for MC sched-domain?
>
> Regards,
> Suravee

Any feedback on this part?

Thanks,
Suravee