2019-08-08 19:54:00

by Matt Fleming

Subject: [PATCH v4 0/2] sched: Improve load balancing on AMD EPYC

This is another version of the AMD EPYC load balancing patch. The
difference with this one is that now it fixes the following ia64 build
error, reported by 0day:

mm/page_alloc.o: In function `get_page_from_freelist':
page_alloc.c:(.text+0x7850): undefined reference to `node_reclaim_distance'
page_alloc.c:(.text+0x7931): undefined reference to `node_reclaim_distance'

Matt Fleming (2):
ia64: Make NUMA select SMP
sched/topology: Improve load balancing on AMD EPYC

arch/ia64/Kconfig | 1 +
arch/x86/kernel/cpu/amd.c | 5 +++++
include/linux/topology.h | 14 ++++++++++++++
kernel/sched/topology.c | 3 ++-
mm/khugepaged.c | 2 +-
mm/page_alloc.c | 2 +-
6 files changed, 24 insertions(+), 3 deletions(-)

--
2.13.7


2019-08-08 19:54:04

by Matt Fleming

Subject: [PATCH 1/2] ia64: Make NUMA select SMP

While it does make sense to allow CONFIG_NUMA and !CONFIG_SMP in
theory, it doesn't make much sense in practice.

Follow other architectures and make CONFIG_NUMA select CONFIG_SMP.

The motivation for this patch is to allow a new NUMA variable to be
initialised in kernel/sched/topology.c.

Signed-off-by: Matt Fleming <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
arch/ia64/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 7468d8e50467..997baba02b70 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -389,6 +389,7 @@ config NUMA
 	depends on !IA64_HP_SIM && !FLATMEM
 	default y if IA64_SGI_SN2
 	select ACPI_NUMA if ACPI
+	select SMP
 	help
 	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
 	  Access). This option is for configuring high-end multiprocessor
--
2.13.7

2019-08-08 19:56:18

by Matt Fleming

Subject: [PATCH v4 2/2] sched/topology: Improve load balancing on AMD EPYC

SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in

commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

$ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: [email protected]
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
---
arch/x86/kernel/cpu/amd.c | 5 +++++
include/linux/topology.h | 14 ++++++++++++++
kernel/sched/topology.c | 3 ++-
mm/khugepaged.c | 2 +-
mm/page_alloc.c | 2 +-
5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e50428b68..ceeb8afc7cf3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/random.h>
+#include <linux/topology.h>
#include <asm/processor.h>
#include <asm/apic.h>
#include <asm/cacheinfo.h>
@@ -824,6 +825,10 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);

+#ifdef CONFIG_NUMA
+	node_reclaim_distance = 32;
+#endif
+
 	/*
 	 * Fix erratum 1076: CPB feature bit not being set in CPUID.
 	 * Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c08036..579522ec446c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
*/
#define RECLAIM_DISTANCE 30
#endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
#ifndef PENALTY_FOR_NODE_WITH_CPUS
#define PENALTY_FOR_NODE_WITH_CPUS (1)
#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8f83e8e3ea9a..b5667a273bf6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int sched_domains_curr_level;
int sched_max_numa_distance;
static int *sched_domains_numa_distance;
static struct cpumask ***sched_domains_numa_masks;
+int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
#endif

/*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,

 		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
-		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
 				       SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b23215..ccede2425c3f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		if (!khugepaged_node_load[i])
 			continue;
-		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
 	}
 	return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..0d54cd2c43a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3522,7 +3522,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
-				RECLAIM_DISTANCE;
+				node_reclaim_distance;
 }
 #else /* CONFIG_NUMA */
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
--
2.13.7
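
To make the threshold effect described in the commit message concrete, here is a minimal standalone C sketch (userspace, not kernel code) of the comparison that sd_init() performs. The flag values, the helper name numa_domain_flags() and the table-backed node_distance() are assumptions made for illustration; only the distance table, the default of 30 and the Zen override of 32 are taken from the patch.

```c
#include <assert.h>

/* Illustrative flag bits only; the kernel's real SD_* values differ. */
#define SD_BALANCE_EXEC 0x1u
#define SD_BALANCE_FORK 0x2u
#define SD_WAKE_AFFINE  0x4u

#define RECLAIM_DISTANCE  30	/* generic default in include/linux/topology.h */
#define ZEN_NODE_DISTANCE 32	/* value the patch assigns in init_amd_zn() */
#define NR_NODES 8

/* Node distance table quoted in the commit message. */
static const int epyc_distance[NR_NODES][NR_NODES] = {
	{ 10, 16, 16, 16, 32, 32, 32, 32 },
	{ 16, 10, 16, 16, 32, 32, 32, 32 },
	{ 16, 16, 10, 16, 32, 32, 32, 32 },
	{ 16, 16, 16, 10, 32, 32, 32, 32 },
	{ 32, 32, 32, 32, 10, 16, 16, 16 },
	{ 32, 32, 32, 32, 16, 10, 16, 16 },
	{ 32, 32, 32, 32, 16, 16, 10, 16 },
	{ 32, 32, 32, 32, 16, 16, 16, 10 },
};

/* Hypothetical userspace stand-in for the kernel's node_distance(). */
int node_distance(int a, int b)
{
	assert(a >= 0 && a < NR_NODES && b >= 0 && b < NR_NODES);
	return epyc_distance[a][b];
}

/*
 * Same comparison as the hunk in sd_init(): a domain whose distance
 * exceeds the threshold loses the fork/exec balancing and wake-affine
 * flags. Starting from "all three set" is a simplification.
 */
unsigned int numa_domain_flags(int distance, int reclaim_distance)
{
	unsigned int flags = SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;

	if (distance > reclaim_distance)
		flags &= ~(SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE);
	return flags;
}
```

With the default threshold, numa_domain_flags(node_distance(0, 4), RECLAIM_DISTANCE) clears all three flags because 32 > 30. With ZEN_NODE_DISTANCE the comparison becomes 32 > 32, which is false, so the cross-socket domain keeps them.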

2019-09-03 08:32:55

by tip-bot2 for Jacob Pan

Subject: [tip: sched/core] sched/topology: Improve load balancing on AMD EPYC systems

The following commit has been merged into the sched/core branch of tip:

Commit-ID: a55c7454a8c887b226a01d7eed088ccb5374d81e
Gitweb: https://git.kernel.org/tip/a55c7454a8c887b226a01d7eed088ccb5374d81e
Author: Matt Fleming <[email protected]>
AuthorDate: Thu, 08 Aug 2019 20:53:01 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 03 Sep 2019 09:17:37 +02:00

sched/topology: Improve load balancing on AMD EPYC systems

SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in:

commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

$ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 5 +++++
include/linux/topology.h | 14 ++++++++++++++
kernel/sched/topology.c | 3 ++-
mm/khugepaged.c | 2 +-
mm/page_alloc.c | 2 +-
5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e504..ceeb8af 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/random.h>
+#include <linux/topology.h>
#include <asm/processor.h>
#include <asm/apic.h>
#include <asm/cacheinfo.h>
@@ -824,6 +825,10 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);

+#ifdef CONFIG_NUMA
+	node_reclaim_distance = 32;
+#endif
+
 	/*
 	 * Fix erratum 1076: CPB feature bit not being set in CPUID.
 	 * Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c..579522e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
*/
#define RECLAIM_DISTANCE 30
#endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
#ifndef PENALTY_FOR_NODE_WITH_CPUS
#define PENALTY_FOR_NODE_WITH_CPUS (1)
#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8f83e8e..b5667a2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int sched_domains_curr_level;
int sched_max_numa_distance;
static int *sched_domains_numa_distance;
static struct cpumask ***sched_domains_numa_masks;
+int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
#endif

/*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,

 		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
-		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
 				       SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b..ccede24 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		if (!khugepaged_node_load[i])
 			continue;
-		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
 	}
 	return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de..0d54cd2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3522,7 +3522,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
-				RECLAIM_DISTANCE;
+				node_reclaim_distance;
 }
 #else /* CONFIG_NUMA */
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)

2019-09-03 08:33:58

by tip-bot2 for Jacob Pan

Subject: [tip: sched/core] arch, ia64: Make NUMA select SMP

The following commit has been merged into the sched/core branch of tip:

Commit-ID: a2cbfd46559e809c8165773b7fe8afa058b35414
Gitweb: https://git.kernel.org/tip/a2cbfd46559e809c8165773b7fe8afa058b35414
Author: Matt Fleming <[email protected]>
AuthorDate: Thu, 08 Aug 2019 20:53:00 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 03 Sep 2019 09:17:36 +02:00

arch, ia64: Make NUMA select SMP

While it does make sense to allow CONFIG_NUMA and !CONFIG_SMP in
theory, it doesn't make much sense in practice.

Follow other architectures and make CONFIG_NUMA select CONFIG_SMP.

The motivation for this patch is to allow a new NUMA variable to be
initialised in kernel/sched/topology.c.

Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/ia64/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 7468d8e..997baba 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -389,6 +389,7 @@ config NUMA
 	depends on !IA64_HP_SIM && !FLATMEM
 	default y if IA64_SGI_SN2
 	select ACPI_NUMA if ACPI
+	select SMP
 	help
 	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
 	  Access). This option is for configuring high-end multiprocessor

2019-10-07 15:29:04

by Guenter Roeck

Subject: Re: [PATCH v4 2/2] sched/topology: Improve load balancing on AMD EPYC

Hi,

On Thu, Aug 08, 2019 at 08:53:01PM +0100, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea being that it's expensive to balance
> across domains that far apart.
>
> However, as is rather unfortunately explained in
>
> commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
>
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
>
> Current AMD EPYC machines have the following NUMA node distances:
>
> node distances:
> node 0 1 2 3 4 5 6 7
> 0: 10 16 16 16 32 32 32 32
> 1: 16 10 16 16 32 32 32 32
> 2: 16 16 10 16 32 32 32 32
> 3: 16 16 16 10 32 32 32 32
> 4: 32 32 32 32 10 16 16 16
> 5: 32 32 32 32 16 10 16 16
> 6: 32 32 32 32 16 16 10 16
> 7: 32 32 32 32 16 16 16 10
>
> where 2 hops is 32.
>
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
>
> For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
> (CPUs 32-39) like so,
>
> $ numactl -C 0-7,32-39 ./spinner 16
>
> causes all threads to fork and remain on node 0 until the active
> balancer kicks in after a few seconds and forcibly moves some threads
> to node 4.
>
> Override node_reclaim_distance for AMD Zen.
>
> Signed-off-by: Matt Fleming <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: [email protected]
> Cc: Borislav Petkov <[email protected]>
> Cc: [email protected]

This patch causes build errors on systems where NUMA does not depend on SMP,
for example MIPS and PPC. For example, building mips:ip27_defconfig with SMP
disabled results in

mips-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x5018): undefined reference to `node_reclaim_distance'
mips-linux-ld: page_alloc.c:(.text+0x5020): undefined reference to `node_reclaim_distance'
mips-linux-ld: page_alloc.c:(.text+0x5028): undefined reference to `node_reclaim_distance'
mips-linux-ld: page_alloc.c:(.text+0x5040): undefined reference to `node_reclaim_distance'
Makefile:1074: recipe for target 'vmlinux' failed
make: *** [vmlinux] Error 1

I have seen a similar problem with one of my PPC test builds.

powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x18): undefined reference to `node_reclaim_distance'

Guenter

2019-10-09 12:05:48

by Matt Fleming

Subject: Re: [PATCH v4 2/2] sched/topology: Improve load balancing on AMD EPYC

On Mon, 07 Oct, at 08:28:16AM, Guenter Roeck wrote:
>
> This patch causes build errors on systems where NUMA does not depend on SMP,
> for example MIPS and PPC. For example, building mips:ip27_defconfig with SMP
> disabled results in
>
> mips-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
> page_alloc.c:(.text+0x5018): undefined reference to `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5020): undefined reference to `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5028): undefined reference to `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5040): undefined reference to `node_reclaim_distance'
> Makefile:1074: recipe for target 'vmlinux' failed
> make: *** [vmlinux] Error 1
>
> I have seen a similar problem with one of my PPC test builds.
>
> powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x18): undefined reference to `node_reclaim_distance'

Thanks for this Guenter.

So, the way I've fixed this same issue for ia64 was to make NUMA
depend on SMP. Does that seem like a suitable solution for both PPC
and MIPS?

2019-10-09 12:41:52

by Guenter Roeck

Subject: Re: [PATCH v4 2/2] sched/topology: Improve load balancing on AMD EPYC

On 10/9/19 5:04 AM, Matt Fleming wrote:
> On Mon, 07 Oct, at 08:28:16AM, Guenter Roeck wrote:
>>
>> This patch causes build errors on systems where NUMA does not depend on SMP,
>> for example MIPS and PPC. For example, building mips:ip27_defconfig with SMP
>> disabled results in
>>
>> mips-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
>> page_alloc.c:(.text+0x5018): undefined reference to `node_reclaim_distance'
>> mips-linux-ld: page_alloc.c:(.text+0x5020): undefined reference to `node_reclaim_distance'
>> mips-linux-ld: page_alloc.c:(.text+0x5028): undefined reference to `node_reclaim_distance'
>> mips-linux-ld: page_alloc.c:(.text+0x5040): undefined reference to `node_reclaim_distance'
>> Makefile:1074: recipe for target 'vmlinux' failed
>> make: *** [vmlinux] Error 1
>>
>> I have seen a similar problem with one of my PPC test builds.
>>
>> powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x18): undefined reference to `node_reclaim_distance'
>
> Thanks for this Guenter.
>
> So, the way I've fixed this same issue for ia64 was to make NUMA
> depend on SMP. Does that seem like a suitable solution for both PPC
> and MIPS?
>

You would still have to cover all other architectures where SMP and NUMA are independent
of each other. Fortunately, it looks like this is only sh4.

sh4-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x3ce0): undefined reference to `node_reclaim_distance'
Makefile:1074: recipe for target 'vmlinux' failed
make: *** [vmlinux] Error 1

arm64 and s390 happen to work because they mandate SMP support, even though NUMA
is nominally independent.

Wondering - why not declare node_reclaim_distance outside SMP dependency ?

Thanks,
Guenter
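
One possible reading of the closing suggestion, sketched as a single standalone C file rather than a kernel patch: the link errors above suggest that the definition added to kernel/sched/topology.c is not built when SMP is disabled, so the idea would be to define the variable under CONFIG_NUMA alone, in code that is compiled whether or not CONFIG_SMP is set. The hand-defined config macros, the empty SMP-only section and the helper distance_allows_reclaim() (a hypothetical stand-in for zone_allows_reclaim()) are assumptions for illustration; where the definition would actually live in the tree is left open.

```c
#include <stdbool.h>

/* Hypothetical configuration for this sketch: NUMA on, SMP off. */
#define CONFIG_NUMA 1
/* CONFIG_SMP is deliberately left undefined. */

#define RECLAIM_DISTANCE 30	/* generic default in include/linux/topology.h */

#ifdef CONFIG_SMP
/*
 * Stand-in for SMP-only scheduler topology code. In this configuration
 * the section compiles to nothing, so a definition placed here would be
 * missing at link time.
 */
#endif

#ifdef CONFIG_NUMA
/* Defined for every NUMA build, independent of CONFIG_SMP. */
int node_reclaim_distance = RECLAIM_DISTANCE;

/* Hypothetical stand-in for the check in zone_allows_reclaim(). */
bool distance_allows_reclaim(int distance)
{
	return distance <= node_reclaim_distance;
}
#endif
```

Because the user of the variable and its definition now sit behind the same CONFIG_NUMA guard, a NUMA && !SMP build has nothing left unresolved, and no per-architecture Kconfig change is needed.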