2019-07-23 20:09:23

by Matt Fleming

Subject: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC

SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea is that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in

commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

$ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.
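
The spinner program is nothing special -- just N busy-looping threads.
A minimal stand-in (a sketch, not the exact tool used above) is:

/* spinner.c: run argv[1] busy-looping threads.
 * Build with: gcc -O2 -pthread spinner.c -o spinner
 */
#include <pthread.h>
#include <stdlib.h>

static void *spin(void *arg)
{
	/* Spin forever so every thread is always runnable. */
	for (;;)
		;
	return NULL;
}

int main(int argc, char **argv)
{
	int i, nr = argc > 1 ? atoi(argv[1]) : 1;
	pthread_t *tid = calloc(nr, sizeof(*tid));

	if (!tid)
		return 1;

	for (i = 0; i < nr; i++)
		pthread_create(&tid[i], NULL, spin, NULL);

	for (i = 0; i < nr; i++)
		pthread_join(tid[i], NULL);

	return 0;
}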

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming <[email protected]>
Cc: "Suthikulpanit, Suravee" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Lendacky, Thomas" <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
 arch/x86/kernel/cpu/amd.c |  3 +++
 include/linux/topology.h  | 14 ++++++++++++++
 kernel/sched/topology.c   |  3 ++-
 mm/khugepaged.c           |  2 +-
 mm/page_alloc.c           |  2 +-
 5 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e50428b68..d94bf83d5ee6 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
 #include <linux/random.h>
+#include <linux/topology.h>
 #include <asm/processor.h>
 #include <asm/apic.h>
 #include <asm/cacheinfo.h>
@@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);
 
+	node_reclaim_distance = 32;
+
 	/*
 	 * Fix erratum 1076: CPB feature bit not being set in CPUID.
 	 * Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c08036..579522ec446c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 30
 #endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f751ce0b783e..f684fde00536 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int			sched_domains_curr_level;
 int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
 static struct cpumask		***sched_domains_numa_masks;
+int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
 #endif
 
 /*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
-		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
 				       SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b23215..ccede2425c3f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		if (!khugepaged_node_load[i])
 			continue;
-		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
 	}
 	return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..0d54cd2c43a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3522,7 +3522,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
-				RECLAIM_DISTANCE;
+				node_reclaim_distance;
 }
 #else /* CONFIG_NUMA */
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
2.13.7


2019-07-23 21:40:29

by Mel Gorman

Subject: Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC

On Tue, Jul 23, 2019 at 11:48:30AM +0100, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea is that it's expensive to balance
> across domains that far apart.
>
> However, as is rather unfortunately explained in
>
> commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
>
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
>
> Current AMD EPYC machines have the following NUMA node distances:
>
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  16  32  32  32  32
>   1:  16  10  16  16  32  32  32  32
>   2:  16  16  10  16  32  32  32  32
>   3:  16  16  16  10  32  32  32  32
>   4:  32  32  32  32  10  16  16  16
>   5:  32  32  32  32  16  10  16  16
>   6:  32  32  32  32  16  16  10  16
>   7:  32  32  32  32  16  16  16  10
>
> where 2 hops is 32.
>
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
>
> For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
> (CPUs 32-39) like so,
>
> $ numactl -C 0-7,32-39 ./spinner 16
>
> causes all threads to fork and remain on node 0 until the active
> balancer kicks in after a few seconds and forcibly moves some threads
> to node 4.
>
> Override node_reclaim_distance for AMD Zen.
>
> Signed-off-by: Matt Fleming <[email protected]>
> Cc: "Suthikulpanit, Suravee" <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: "Lendacky, Thomas" <[email protected]>
> Cc: Borislav Petkov <[email protected]>

Acked-by: Mel Gorman <[email protected]>

The only caveat I can think of is that a future generation of Zen might
use a different magic number than 32 as their remote distance. If or
when this happens, it'll need additional smarts but lacking a crystal
ball, we can cross that bridge when we come to it.

--
Mel Gorman
SUSE Labs

2019-07-23 22:57:45

by Peter Zijlstra

Subject: Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC

On Tue, Jul 23, 2019 at 12:42:48PM +0100, Mel Gorman wrote:
> On Tue, Jul 23, 2019 at 11:48:30AM +0100, Matt Fleming wrote:
> > SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> > for any sched domains with a NUMA distance greater than 2 hops
> > (RECLAIM_DISTANCE). The idea is that it's expensive to balance
> > across domains that far apart.
> >
> > However, as is rather unfortunately explained in
> >
> > commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
> >
> > the value for RECLAIM_DISTANCE is based on node distance tables from
> > 2011-era hardware.
> >
> > Current AMD EPYC machines have the following NUMA node distances:
> >
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >   0:  10  16  16  16  32  32  32  32
> >   1:  16  10  16  16  32  32  32  32
> >   2:  16  16  10  16  32  32  32  32
> >   3:  16  16  16  10  32  32  32  32
> >   4:  32  32  32  32  10  16  16  16
> >   5:  32  32  32  32  16  10  16  16
> >   6:  32  32  32  32  16  16  10  16
> >   7:  32  32  32  32  16  16  16  10
> >
> > where 2 hops is 32.
> >
> > The result is that the scheduler fails to load balance properly across
> > NUMA nodes on different sockets -- 2 hops apart.
> >
> > For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
> > (CPUs 32-39) like so,
> >
> > $ numactl -C 0-7,32-39 ./spinner 16
> >
> > causes all threads to fork and remain on node 0 until the active
> > balancer kicks in after a few seconds and forcibly moves some threads
> > to node 4.
> >
> > Override node_reclaim_distance for AMD Zen.
> >
> > Signed-off-by: Matt Fleming <[email protected]>
> > Cc: "Suthikulpanit, Suravee" <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: "Lendacky, Thomas" <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
>
> Acked-by: Mel Gorman <[email protected]>
>
> The only caveat I can think of is that a future generation of Zen might
> use a different magic number than 32 as their remote distance. If or
> when this happens, it'll need additional smarts but lacking a crystal
> ball, we can cross that bridge when we come to it.

I just suggested to Matt on IRC we could do something along these lines,
but we can do that later.

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1611,6 +1611,12 @@ void sched_init_numa(void)
 	}
 
 	/*
+	 * Set the reclaim distance at 2 hops instead of at a fixed distance value.
+	 */
+	if (level >= 2)
+		node_reclaim_distance = sched_domains_numa_distance[2];
+
+	/*
 	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
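
For the EPYC table above that indexing should work out as intended,
assuming sched_domains_numa_distance[] includes the identity distance
at index 0: the unique distances are {10, 16, 32}, so
sched_domains_numa_distance[2] == 32 -- the same value the patch
hardcodes for Zen.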

2019-07-24 00:04:47

by Peter Zijlstra

Subject: Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC

On Tue, Jul 23, 2019 at 02:03:21PM +0100, Mel Gorman wrote:
> On Tue, Jul 23, 2019 at 02:00:30PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 23, 2019 at 12:42:48PM +0100, Mel Gorman wrote:
> > > On Tue, Jul 23, 2019 at 11:48:30AM +0100, Matt Fleming wrote:
> > > > Signed-off-by: Matt Fleming <[email protected]>
> > > > Cc: "Suthikulpanit, Suravee" <[email protected]>
> > > > Cc: Mel Gorman <[email protected]>
> > > > Cc: "Lendacky, Thomas" <[email protected]>
> > > > Cc: Borislav Petkov <[email protected]>
> > >
> > > Acked-by: Mel Gorman <[email protected]>
> > >
> > > The only caveat I can think of is that a future generation of Zen might
> > > use a different magic number than 32 as their remote distance. If or
> > > when this happens, it'll need additional smarts but lacking a crystal
> > > ball, we can cross that bridge when we come to it.
> >
> > I just suggested to Matt on IRC we could do something along these lines,
> > but we can do that later.
> >
>
> That would seem fair but I do think it's something that could be done
> later (maybe 1 release away?) to avoid a false bisection to this patch by
> accident.

Quite agreed; changing reclaim_distance like that will affect a lot of
hardware, while the current patch limits the impact to just AMD-Zen
based bits.

> I don't *think* there are any machines out there with a 1-hop
> distance of 14 but if there is, your patch would make a difference to
> MM behaviour. In the same context, it might make sense to rename the
> value to something reflective of the fact that "reclaim distance" affects
> scheduler placement. No good name springs to mind at the moment.

Yeah, naming sucks. Let us paint this bicycle shed blue!

2019-07-24 02:15:47

by Mel Gorman

Subject: Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC

On Tue, Jul 23, 2019 at 02:00:30PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 23, 2019 at 12:42:48PM +0100, Mel Gorman wrote:
> > On Tue, Jul 23, 2019 at 11:48:30AM +0100, Matt Fleming wrote:
> > > Signed-off-by: Matt Fleming <[email protected]>
> > > Cc: "Suthikulpanit, Suravee" <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > > Cc: "Lendacky, Thomas" <[email protected]>
> > > Cc: Borislav Petkov <[email protected]>
> >
> > Acked-by: Mel Gorman <[email protected]>
> >
> > The only caveat I can think of is that a future generation of Zen might
> > use a different magic number than 32 as their remote distance. If or
> > when this happens, it'll need additional smarts but lacking a crystal
> > ball, we can cross that bridge when we come to it.
>
> I just suggested to Matt on IRC we could do something along these lines,
> but we can do that later.
>

That would seem fair but I do think it's something that could be done
later (maybe 1 release away?) to avoid a false bisection to this patch by
accident. I don't *think* there are any machines out there with a 1-hop
distance of 14 but if there is, your patch would make a difference to
MM behaviour. In the same context, it might make sense to rename the
value to something reflective of the fact that "reclaim distance" affects
scheduler placement. No good name springs to mind at the moment.

--
Mel Gorman
SUSE Labs

2019-07-25 18:09:29

by Suthikulpanit, Suravee

Subject: Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC

Matt,

On 7/23/2019 5:48 AM, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea is that it's expensive to balance
> across domains that far apart.
>
> However, as is rather unfortunately explained in
>
> commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
>
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
>
> Current AMD EPYC machines have the following NUMA node distances:
>
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  16  32  32  32  32
>   1:  16  10  16  16  32  32  32  32
>   2:  16  16  10  16  32  32  32  32
>   3:  16  16  16  10  32  32  32  32
>   4:  32  32  32  32  10  16  16  16
>   5:  32  32  32  32  16  10  16  16
>   6:  32  32  32  32  16  16  10  16
>   7:  32  32  32  32  16  16  16  10
>
> where 2 hops is 32.
>
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
>
> For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
> (CPUs 32-39) like so,
>
> $ numactl -C 0-7,32-39 ./spinner 16
>
> causes all threads to fork and remain on node 0 until the active
> balancer kicks in after a few seconds and forcibly moves some threads
> to node 4.

I am testing this patch on Linux 5.2, and I actually do not
notice a difference pre- vs post-patch.

Besides the case above, I have also run experiments with
different numbers of threads across two sockets:

(Note: I only focus on thread0 of each core.)

sXnY = Socket X Node Y

* s0n0 + s0n1 + s1n0 + s1n1
numactl -C 0-15,32-47 ./spinner 32

* s0n2 + s0n3 + s1n2 + s1n3
numactl -C 16-31,48-63 ./spinner 32

* s0 + s1
numactl -C 0-63 ./spinner 64

My observations are:

* I still notice improper load balancing on some of the tasks initially
for a few seconds before they are load-balanced correctly.

* It takes longer to load balance with a larger number of tasks.

I wonder if you have tried with a different kernel base?

Regards,
Suravee

2019-07-29 16:00:43

by Matt Fleming

Subject: Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC

On Thu, 25 Jul, at 04:37:06PM, Suthikulpanit, Suravee wrote:
>
> I am testing this patch on Linux 5.2, and I actually do not
> notice a difference pre- vs post-patch.
>
> Besides the case above, I have also run experiments with
> different numbers of threads across two sockets:
>
> (Note: I only focus on thread0 of each core.)
>
> sXnY = Socket X Node Y
>
> * s0n0 + s0n1 + s1n0 + s1n1
> numactl -C 0-15,32-47 ./spinner 32
>
> * s0n2 + s0n3 + s1n2 + s1n3
> numactl -C 16-31,48-63 ./spinner 32
>
> * s0 + s1
> numactl -C 0-63 ./spinner 64
>
> My observations are:
>
> * I still notice improper load balancing on some of the tasks initially
> for a few seconds before they are load-balanced correctly.
>
> * It takes longer to load balance with a larger number of tasks.
>
> I wonder if you have tried with a different kernel base?

It was tested with one of the 5.2 -rc kernels.

I'll take another look at this behaviour, but for the benefit of LKML
readers, here's the summary I gave before. It's specific to using
cgroups to partition tasks:

It turns out there's a secondary issue to do with how run queue load
averages are compared between sched groups.

Load averages for a sched_group (a group within a domain) are
effectively "scaled" by the number of CPUs in that group. This has a
direct influence on how quickly load ramps up in a group.

What's happening on my system when running with $(numactl -C
0-7,32-39) is that the top NUMA sched_domain (domain4) scales the
load by 64 CPUs -- even though the workload can't use all 64 due to
scheduler affinity.

So because the load balancer thinks there's plenty of room left to run
tasks, it doesn't balance very well across sockets even with the
SD_BALANCE_FORK flag.
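
To put rough numbers on it (assuming SCHED_CAPACITY_SCALE == 1024 and
full-capacity CPUs): a 64-CPU group has a group capacity of roughly
64 * 1024 = 65536, so 16 spinning threads contributing about 16 * 1024
of load scale down to (16 * 1024 * 1024) / 65536 = 256 -- the group
looks nearly idle. Capped at 8 CPUs' worth of capacity (8192), the
same load scales to 2048, which is enough for find_idlest_group() to
spill tasks onto the other socket.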

This super quick and ugly patch, which caps the number of CPUs at 8, gets both
sockets used by fork() on my system.

---->8----

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..9444c34d038c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5791,6 +5791,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
 	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
 				(sd->imbalance_pct-100) / 100;
+	unsigned long capacity;
 
 	if (sd_flag & SD_BALANCE_WAKE)
 		load_idx = sd->wake_idx;
@@ -5835,10 +5836,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	}
 
 	/* Adjust by relative CPU capacity of the group */
+	capacity = group->sgc->capacity;
+
+	if (capacity > (SCHED_CAPACITY_SCALE * 8))
+		capacity = SCHED_CAPACITY_SCALE * 8;
+
 	avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
-			group->sgc->capacity;
+			capacity;
 	runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
-			group->sgc->capacity;
+			capacity;
 
 	if (local_group) {
 		this_runnable_load = runnable_load;

----8<----

There's still an issue with the active load balancer kicking in after a few
seconds, but I suspect that is related to the use of group capacity elsewhere
in the load balancer code (like update_sg_lb_stats()).


--
Matt Fleming
SUSE Performance Team