2019-06-05 16:01:43

by Matt Fleming

Subject: [PATCH] sched/topology: Improve load balancing on AMD EPYC

SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in

commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

where 2 hops is 32.
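
(The table above is as printed by numactl --hardware; the same data can be
read straight out of sysfs. A quick sketch for reference -- it assumes the
usual /sys/devices/system/node layout and contiguous node IDs, and is not
part of this patch:)

/* nodedist.c - dump the kernel's node distance table (sketch only).
 * Assumes contiguous node IDs under /sys/devices/system/node/.
 * Build: gcc -O2 nodedist.c -o nodedist
 */
#include <stdio.h>

int main(void)
{
	char path[64], row[256];
	int node;

	for (node = 0; ; node++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);
		f = fopen(path, "r");
		if (!f)
			break;			/* no more nodes */
		if (fgets(row, sizeof(row), f))
			printf("%2d: %s", node, row);	/* row already ends in '\n' */
		fclose(f);
	}
	return node ? 0 : 1;
}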

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

$ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.
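
(spinner above is just a local test program, nothing in-tree; a minimal
stand-in that spawns N busy-loop threads would look something like the
sketch below:)

/* spinner.c - minimal stand-in for the test program above (sketch only).
 * Build: gcc -O2 -pthread spinner.c -o spinner
 * Usage: numactl -C 0-7,32-39 ./spinner 16
 */
#include <pthread.h>
#include <stdlib.h>

static void *spin(void *arg)
{
	volatile unsigned long x = 0;

	(void)arg;
	for (;;)			/* burn CPU until killed */
		x++;
	return NULL;
}

int main(int argc, char **argv)
{
	int i, nr = argc > 1 ? atoi(argv[1]) : 1;
	pthread_t *t = calloc(nr, sizeof(*t));

	if (!t)
		return 1;
	for (i = 0; i < nr; i++)
		pthread_create(&t[i], NULL, spin, NULL);
	for (i = 0; i < nr; i++)	/* never returns; kill with ^C */
		pthread_join(t[i], NULL);
	return 0;
}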

Update the code in sd_init() to account for modern node distances, while
maintaining backward-compatible behaviour by respecting RECLAIM_DISTANCE
for distances of more than 2 hops.

Signed-off-by: Matt Fleming <[email protected]>
Cc: "Suthikulpanit, Suravee" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Lendacky, Thomas" <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
kernel/sched/topology.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f53f89df837d..0eea395f7c6b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1410,7 +1410,18 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
-		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+
+		/*
+		 * Strip the following flags for sched domains with a NUMA
+		 * distance greater than the historical 2-hops value
+		 * (RECLAIM_DISTANCE) and where tl->numa_level confirms it
+		 * really is more than 2 hops.
+		 *
+		 * Respecting RECLAIM_DISTANCE means we maintain
+		 * backwards-compatible behaviour.
+		 */
+		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE &&
+		    tl->numa_level > 3) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
 				       SD_WAKE_AFFINE);
--
2.13.7


2019-06-05 18:02:41

by Peter Zijlstra

Subject: Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

On Wed, Jun 05, 2019 at 04:59:22PM +0100, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea being that it's expensive to balance
> across domains that far apart.
>
> However, as is rather unfortunately explained in
>
> commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
>
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
>
> Current AMD EPYC machines have the following NUMA node distances:
>
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  16  32  32  32  32
>   1:  16  10  16  16  32  32  32  32
>   2:  16  16  10  16  32  32  32  32
>   3:  16  16  16  10  32  32  32  32
>   4:  32  32  32  32  10  16  16  16
>   5:  32  32  32  32  16  10  16  16
>   6:  32  32  32  32  16  16  10  16
>   7:  32  32  32  32  16  16  16  10
>
> where 2 hops is 32.
>
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
>

> Update the code in sd_init() to account for modern node distances, while
> maintaining backward-compatible behaviour by respecting RECLAIM_DISTANCE
> for distances of more than 2 hops.

And then we had two magic values :/

Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
surely, if we want to load-balance aggressively over 30, then so too
should we do node_reclaim(), I'm thinking.

2019-06-10 21:27:50

by Matt Fleming

Subject: Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

On Wed, 05 Jun, at 08:00:35PM, Peter Zijlstra wrote:
>
> And then we had two magic values :/
>
> Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
> surely, if we want to load-balance aggressively over 30, then so too
> should we do node_reclaim(), I'm thinking.

Yeah we can fix it just for EPYC, Mel suggested that approach originally.

Suravee, Tom, what's the best way to detect these EPYC machines that need to
override RECLAIM_DISTANCE?

2019-06-11 17:24:46

by Tom Lendacky

Subject: Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

On 6/10/19 4:26 PM, Matt Fleming wrote:
> On Wed, 05 Jun, at 08:00:35PM, Peter Zijlstra wrote:
>>
>> And then we had two magic values :/
>>
>> Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
>> surely, if we want to load-balance aggressively over 30, then so too
>> should we do node_reclaim(), I'm thinking.
>
> Yeah we can fix it just for EPYC, Mel suggested that approach originally.
>
> Suravee, Tom, what's the best way to detect these EPYC machines that need to
> override RECLAIM_DISTANCE?

You should be able to do it based on the family. There's an init_amd_zn()
function in arch/x86/kernel/cpu/amd.c. You can add something there or,
since init_amd_zn() sets X86_FEATURE_ZEN, you could check for that if you
prefer putting it in a different location.

Thanks,
Tom


2019-06-18 10:43:59

by Matt Fleming

Subject: Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

On Tue, 11 Jun, at 05:22:21PM, Lendacky, Thomas wrote:
> On 6/10/19 4:26 PM, Matt Fleming wrote:
> > On Wed, 05 Jun, at 08:00:35PM, Peter Zijlstra wrote:
> >>
> >> And then we had two magic values :/
> >>
> >> Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
> >> surely, if we want to load-balance aggressively over 30, then so too
> >> should we do node_reclaim(), I'm thinking.
> >
> > Yeah we can fix it just for EPYC, Mel suggested that approach originally.
> >
> > Suravee, Tom, what's the best way to detect these EPYC machines that need to
> > override RECLAIM_DISTANCE?
>
> You should be able to do it based on the family. There's an init_amd_zn()
> function in arch/x86/kernel/cpu/amd.c. You can add something there or,
> since init_amd_zn() sets X86_FEATURE_ZEN, you could check for that if you
> prefer putting it in a different location.

This works for me under all my tests. Thoughts?

--->8---

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 80a405c2048a..4db4e9e7654b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
 #include <linux/random.h>
+#include <linux/topology.h>
 #include <asm/processor.h>
 #include <asm/apic.h>
 #include <asm/cacheinfo.h>
@@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);
 
+	node_reclaim_distance = 32;
+
 	/* Fix erratum 1076: CPB feature bit not being set in CPUID. */
 	if (!cpu_has(c, X86_FEATURE_CPB))
 		set_cpu_cap(c, X86_FEATURE_CPB);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e1ee4b..74b484354ac9 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,9 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 30
 #endif
+
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f53f89df837d..20f0f8f6792b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int sched_domains_curr_level;
 int sched_max_numa_distance;
 static int *sched_domains_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
+int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
 #endif
 
 /*
@@ -1410,7 +1411,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
-		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
 				       SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a335f7c1fac4..67f5f09d70ed 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		if (!khugepaged_node_load[i])
 			continue;
-		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
 	}
 	return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8abe0af..8ccaaf3a47f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3450,7 +3450,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
-				RECLAIM_DISTANCE;
+				node_reclaim_distance;
 }
 #else	/* CONFIG_NUMA */
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)

2019-06-18 12:34:46

by Peter Zijlstra

Subject: Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
> This works for me under all my tests. Thoughts?
>
> --->8---
>
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 80a405c2048a..4db4e9e7654b 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -8,6 +8,7 @@
> #include <linux/sched.h>
> #include <linux/sched/clock.h>
> #include <linux/random.h>
> +#include <linux/topology.h>
> #include <asm/processor.h>
> #include <asm/apic.h>
> #include <asm/cacheinfo.h>
> @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
> {
> set_cpu_cap(c, X86_FEATURE_ZEN);
>

I'm thinking this deserves a comment. Traditionally the SLIT table held
relative memory latency. So where the identity is 10, 16 would indicate
1.6 times local latency and 32 would be 3.2 times local.

Now, even very early on BIOS monkeys went about their business and put
in random values in an attempt to 'tune' the system based on how
$random-os behaved, which is all sorts of fu^Wwrong.

Now, I suppose my question is; is that 32 Zen puts in an actual relative
memory latency metric, or a random value we somehow have to deal with.
And can we pretty please describe the whole sordid story behind this
'tunable' somewhere?

> + node_reclaim_distance = 32;
> +
> /* Fix erratum 1076: CPB feature bit not being set in CPUID. */
> if (!cpu_has(c, X86_FEATURE_CPB))
> set_cpu_cap(c, X86_FEATURE_CPB);

2019-06-19 21:35:40

by Matt Fleming

Subject: Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

On Tue, 18 Jun, at 02:33:18PM, Peter Zijlstra wrote:
> On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
> > This works for me under all my tests. Thoughts?
> >
> > --->8---
> >
> > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > index 80a405c2048a..4db4e9e7654b 100644
> > --- a/arch/x86/kernel/cpu/amd.c
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -8,6 +8,7 @@
> > #include <linux/sched.h>
> > #include <linux/sched/clock.h>
> > #include <linux/random.h>
> > +#include <linux/topology.h>
> > #include <asm/processor.h>
> > #include <asm/apic.h>
> > #include <asm/cacheinfo.h>
> > @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
> > {
> > set_cpu_cap(c, X86_FEATURE_ZEN);
> >
>
> I'm thinking this deserves a comment. Traditionally the SLIT table held
> relative memory latency. So where the identity is 10, 16 would indicate
> 1.6 times local latency and 32 would be 3.2 times local.
>
> Now, even very early on BIOS monkeys went about their business and put
> in random values in an attempt to 'tune' the system based on how
> $random-os behaved, which is all sorts of fu^Wwrong.
>
> Now, I suppose my question is; is that 32 Zen puts in an actual relative
> memory latency metric, or a random value we somehow have to deal with.
> And can we pretty please describe the whole sordid story behind this
> 'tunable' somewhere?

This is one for the AMD folks. I don't know if the memory latency
really is 3.2 times or not, only that that's the value in all the Zen
machines I have access to. Even this 2-socket one:

node distances:
node   0   1
  0:  10  32
  1:  32  10
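
For what it's worth, a crude userspace pointer-chase along the lines of the
sketch below (libnuma, link with -lnuma, buffer sized well past the L3 --
definitely not a rigorous benchmark, and untested as written) is the kind of
thing I'd use to sanity-check whether remote really is ~3.2x local:

/* latratio.c - crude local vs. remote pointer-chase (hypothetical sketch).
 * Build: gcc -O2 latratio.c -o latratio -lnuma
 * Pins itself to node 0, then chases a random cycle through a buffer
 * allocated first on node 0 and then on the highest-numbered node.
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SZ	(256UL << 20)			/* keep it well past the L3 */
#define NPTRS	(BUF_SZ / sizeof(void *))
#define STEPS	(10UL * 1000 * 1000)

static double chase_ns(int mem_node)
{
	void **buf = numa_alloc_onnode(BUF_SZ, mem_node);
	size_t *idx = malloc(NPTRS * sizeof(*idx));
	struct timespec t0, t1;
	size_t i, j;

	if (!buf || !idx) {
		perror("alloc");
		exit(1);
	}

	/* Random permutation of slot indices (crude Fisher-Yates). */
	for (i = 0; i < NPTRS; i++)
		idx[i] = i;
	for (i = NPTRS - 1; i > 0; i--) {
		size_t tmp;

		j = (size_t)rand() % (i + 1);
		tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
	}

	/* Link the slots into one big cycle so the loads are effectively
	 * random and mostly miss cache. First touch also faults the pages
	 * in on mem_node. */
	for (i = 0; i < NPTRS; i++)
		buf[idx[i]] = &buf[idx[(i + 1) % NPTRS]];

	void **p = &buf[idx[0]];
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < STEPS; i++)
		p = (void **)*p;	/* serialised, data-dependent loads */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (!p)				/* use p so the loop can't be dropped */
		abort();

	free(idx);
	numa_free(buf, BUF_SZ);
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / STEPS;
}

int main(void)
{
	if (numa_available() < 0 || numa_run_on_node(0) < 0) {
		fprintf(stderr, "need a NUMA box and libnuma\n");
		return 1;
	}
	printf("local  (node 0): %6.1f ns/load\n", chase_ns(0));
	printf("remote (node %d): %6.1f ns/load\n",
	       numa_max_node(), chase_ns(numa_max_node()));
	return 0;
}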

Tom, Suravee?

2019-06-24 14:34:34

by Mel Gorman

Subject: Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

On Wed, Jun 19, 2019 at 10:34:37PM +0100, Matt Fleming wrote:
> On Tue, 18 Jun, at 02:33:18PM, Peter Zijlstra wrote:
> > On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
> > > This works for me under all my tests. Thoughts?
> > >
> > > --->8---
> > >
> > > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > > index 80a405c2048a..4db4e9e7654b 100644
> > > --- a/arch/x86/kernel/cpu/amd.c
> > > +++ b/arch/x86/kernel/cpu/amd.c
> > > @@ -8,6 +8,7 @@
> > > #include <linux/sched.h>
> > > #include <linux/sched/clock.h>
> > > #include <linux/random.h>
> > > +#include <linux/topology.h>
> > > #include <asm/processor.h>
> > > #include <asm/apic.h>
> > > #include <asm/cacheinfo.h>
> > > @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
> > > {
> > > set_cpu_cap(c, X86_FEATURE_ZEN);
> > >
> >
> > I'm thinking this deserves a comment. Traditionally the SLIT table held
> > relative memory latency. So where the identity is 10, 16 would indicate
> > 1.6 times local latency and 32 would be 3.2 times local.
> >
> > Now, even very early on BIOS monkeys went about their business and put
> > in random values in an attempt to 'tune' the system based on how
> > $random-os behaved, which is all sorts of fu^Wwrong.
> >
> > Now, I suppose my question is; is that 32 Zen puts in an actual relative
> > memory latency metric, or a random value we somehow have to deal with.
> > And can we pretty please describe the whole sordid story behind this
> > 'tunable' somewhere?
>
> This is one for the AMD folks. I don't know if the memory latency
> really is 3.2 times or not, only that that's the value in all the Zen
> machines I have access to. Even this 2-socket one:
>
> node distances:
> node   0   1
>   0:  10  32
>   1:  32  10
>
> Tom, Suravee?

Do not consider this an authoritative response but, based on what I know
of the physical topology, it is not unreasonable to use 32 in the SLIT
table. There is a small latency penalty when accessing another die on the
same socket (details are generation specific). It's not quite a local
access, but it's not as costly as a traditional remote access either
(hence 16 being the value for another die, hinting that it's not quite
local but not quite remote either). 32 corresponds to accessing a die on
a remote socket, based on the expected performance and latency of the
interconnect.

To the best of my knowledge, the magic numbers are reflective of the real
topology and not just a gamification of the numbers for a random OS. If
anything, the fact that there is a load balancing issue on Linux would
indicate that they were not picking random numbers for Linux at least :P

--
Mel Gorman
SUSE Labs

2019-06-26 21:19:28

by Suthikulpanit, Suravee

Subject: Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

On 6/24/19 9:24 AM, Mel Gorman wrote:
> On Wed, Jun 19, 2019 at 10:34:37PM +0100, Matt Fleming wrote:
>> On Tue, 18 Jun, at 02:33:18PM, Peter Zijlstra wrote:
>>> On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
>>>> This works for me under all my tests. Thoughts?
>>>>
>>>> --->8---
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
>>>> index 80a405c2048a..4db4e9e7654b 100644
>>>> --- a/arch/x86/kernel/cpu/amd.c
>>>> +++ b/arch/x86/kernel/cpu/amd.c
>>>> @@ -8,6 +8,7 @@
>>>> #include <linux/sched.h>
>>>> #include <linux/sched/clock.h>
>>>> #include <linux/random.h>
>>>> +#include <linux/topology.h>
>>>> #include <asm/processor.h>
>>>> #include <asm/apic.h>
>>>> #include <asm/cacheinfo.h>
>>>> @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
>>>> {
>>>> set_cpu_cap(c, X86_FEATURE_ZEN);
>>>>
>>>
>>> I'm thinking this deserves a comment. Traditionally the SLIT table held
>>> relative memory latency. So where the identity is 10, 16 would indicate
>>> 1.6 times local latency and 32 would be 3.2 times local.
>>>
>>> Now, even very early on BIOS monkeys went about their business and put
>>> in random values in an attempt to 'tune' the system based on how
>>> $random-os behaved, which is all sorts of fu^Wwrong.
>>>
>>> Now, I suppose my question is; is that 32 Zen puts in an actual relative
>>> memory latency metric, or a random value we somehow have to deal with.
>>> And can we pretty please describe the whole sordid story behind this
>>> 'tunable' somewhere?
>>
>> This is one for the AMD folks. I don't know if the memory latency
>> really is 3.2 times or not, only that that's the value in all the Zen
>> machines I have access to. Even this 2-socket one:
>>
>> node distances:
>> node   0   1
>>   0:  10  32
>>   1:  32  10
>>
>> Tom, Suravee?
>
> Do not consider this an authoritative response but, based on what I know
> of the physical topology, it is not unreasonable to use 32 in the SLIT
> table. There is a small latency penalty when accessing another die on the
> same socket (details are generation specific). It's not quite a local
> access, but it's not as costly as a traditional remote access either
> (hence 16 being the value for another die, hinting that it's not quite
> local but not quite remote either). 32 corresponds to accessing a die on
> a remote socket, based on the expected performance and latency of the
> interconnect.
>
> To the best of my knowledge, the magic numbers are reflective of the real
> topology and not just a gamification of the numbers for a random OS. If
> anything, the fact that there is a load balancing issue on Linux would
> indicate that they were not picking random numbers for Linux at least :P
>

We use 16 to designate 1-hop latency (for a different node within the same socket).
For across-socket access, since the latency is greater, we set the distance to 32
(twice the 1-hop latency); we were not aware of RECLAIM_DISTANCE at the time.

At this point, it might not be possible to change the SLIT values on
existing platforms out in the field. So, introducing the AMD family17h
quirk as Matt suggested would be a more feasible approach.

Going forward, we will make sure that this does not exceed the standard
RECLAIM_DISTANCE (30).

Thanks,
Suravee

2019-06-28 15:15:43

by Matt Fleming

Subject: Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

On Wed, 26 Jun, at 09:18:01PM, Suthikulpanit, Suravee wrote:
>
> We use 16 to designate 1-hop latency (for a different node within the same socket).
> For across-socket access, since the latency is greater, we set the distance to 32
> (twice the 1-hop latency); we were not aware of RECLAIM_DISTANCE at the time.

I guess the question is: Is the memory latency of a remote node 1 hop
away 1.6x the local node latency?

2019-07-22 17:26:37

by Suthikulpanit, Suravee

[permalink] [raw]
Subject: Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

Matt,

On 6/28/2019 10:15 AM, Matt Fleming wrote:
> On Wed, 26 Jun, at 09:18:01PM, Suthikulpanit, Suravee wrote:
>>
>> We use 16 to designate 1-hop latency (for a different node within the same socket).
>> For across-socket access, since the latency is greater, we set the distance to 32
>> (twice the 1-hop latency); we were not aware of RECLAIM_DISTANCE at the time.
>
> I guess the question is: Is the memory latency of a remote node 1 hop
> away 1.6x the local node latency?
>

Yes, a remote node 1 hop away is 1.6x the local node latency for AMD EPYC.

Suravee