This is another version of the AMD EPYC load balancing patch. The
difference with this one is that now it fixes the following ia64 build
error, reported by 0day:
mm/page_alloc.o: In function `get_page_from_freelist':
page_alloc.c:(.text+0x7850): undefined reference to `node_reclaim_distance'
page_alloc.c:(.text+0x7931): undefined reference to `node_reclaim_distance'
Matt Fleming (2):
ia64: Make NUMA select SMP
sched/topology: Improve load balancing on AMD EPYC
arch/ia64/Kconfig | 1 +
arch/x86/kernel/cpu/amd.c | 5 +++++
include/linux/topology.h | 14 ++++++++++++++
kernel/sched/topology.c | 3 ++-
mm/khugepaged.c | 2 +-
mm/page_alloc.c | 2 +-
6 files changed, 24 insertions(+), 3 deletions(-)
--
2.13.7
While it does make sense to allow CONFIG_NUMA and !CONFIG_SMP in
theory, it doesn't make much sense in practice.
Follow other architectures and make CONFIG_NUMA select CONFIG_SMP.
The motivation for this patch is to allow a new NUMA variable to be
initialised in kernel/sched/topology.c.
Signed-off-by: Matt Fleming <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
arch/ia64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 7468d8e50467..997baba02b70 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -389,6 +389,7 @@ config NUMA
depends on !IA64_HP_SIM && !FLATMEM
default y if IA64_SGI_SN2
select ACPI_NUMA if ACPI
+ select SMP
help
Say Y to compile the kernel to support NUMA (Non-Uniform Memory
Access). This option is for configuring high-end multiprocessor
--
2.13.7
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.
However, as is rather unfortunately explained in
commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.
Current AMD EPYC machines have the following NUMA node distances:
node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10
where 2 hops is 32.
The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.
For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,
$ numactl -C 0-7,32-39 ./spinner 16
causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.
Override node_reclaim_distance for AMD Zen.
Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: [email protected]
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
---
arch/x86/kernel/cpu/amd.c | 5 +++++
include/linux/topology.h | 14 ++++++++++++++
kernel/sched/topology.c | 3 ++-
mm/khugepaged.c | 2 +-
mm/page_alloc.c | 2 +-
5 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e50428b68..ceeb8afc7cf3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/random.h>
+#include <linux/topology.h>
#include <asm/processor.h>
#include <asm/apic.h>
#include <asm/cacheinfo.h>
@@ -824,6 +825,10 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
{
set_cpu_cap(c, X86_FEATURE_ZEN);
+#ifdef CONFIG_NUMA
+ node_reclaim_distance = 32;
+#endif
+
/*
* Fix erratum 1076: CPB feature bit not being set in CPUID.
* Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c08036..579522ec446c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
*/
#define RECLAIM_DISTANCE 30
#endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
#ifndef PENALTY_FOR_NODE_WITH_CPUS
#define PENALTY_FOR_NODE_WITH_CPUS (1)
#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8f83e8e3ea9a..b5667a273bf6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int sched_domains_curr_level;
int sched_max_numa_distance;
static int *sched_domains_numa_distance;
static struct cpumask ***sched_domains_numa_masks;
+int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
#endif
/*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
- if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+ if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
sd->flags &= ~(SD_BALANCE_EXEC |
SD_BALANCE_FORK |
SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b23215..ccede2425c3f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
for (i = 0; i < MAX_NUMNODES; i++) {
if (!khugepaged_node_load[i])
continue;
- if (node_distance(nid, i) > RECLAIM_DISTANCE)
+ if (node_distance(nid, i) > node_reclaim_distance)
return true;
}
return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..0d54cd2c43a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3522,7 +3522,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
- RECLAIM_DISTANCE;
+ node_reclaim_distance;
}
#else /* CONFIG_NUMA */
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
--
2.13.7
The following commit has been merged into the sched/core branch of tip:
Commit-ID: a55c7454a8c887b226a01d7eed088ccb5374d81e
Gitweb: https://git.kernel.org/tip/a55c7454a8c887b226a01d7eed088ccb5374d81e
Author: Matt Fleming <[email protected]>
AuthorDate: Thu, 08 Aug 2019 20:53:01 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 03 Sep 2019 09:17:37 +02:00
sched/topology: Improve load balancing on AMD EPYC systems
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.
However, as is rather unfortunately explained in:
commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.
Current AMD EPYC machines have the following NUMA node distances:
node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10
where 2 hops is 32.
The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.
For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,
$ numactl -C 0-7,32-39 ./spinner 16
causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.
Override node_reclaim_distance for AMD Zen.
Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 5 +++++
include/linux/topology.h | 14 ++++++++++++++
kernel/sched/topology.c | 3 ++-
mm/khugepaged.c | 2 +-
mm/page_alloc.c | 2 +-
5 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e504..ceeb8af 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/random.h>
+#include <linux/topology.h>
#include <asm/processor.h>
#include <asm/apic.h>
#include <asm/cacheinfo.h>
@@ -824,6 +825,10 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
{
set_cpu_cap(c, X86_FEATURE_ZEN);
+#ifdef CONFIG_NUMA
+ node_reclaim_distance = 32;
+#endif
+
/*
* Fix erratum 1076: CPB feature bit not being set in CPUID.
* Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c..579522e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
*/
#define RECLAIM_DISTANCE 30
#endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
#ifndef PENALTY_FOR_NODE_WITH_CPUS
#define PENALTY_FOR_NODE_WITH_CPUS (1)
#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8f83e8e..b5667a2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int sched_domains_curr_level;
int sched_max_numa_distance;
static int *sched_domains_numa_distance;
static struct cpumask ***sched_domains_numa_masks;
+int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
#endif
/*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
- if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+ if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
sd->flags &= ~(SD_BALANCE_EXEC |
SD_BALANCE_FORK |
SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b..ccede24 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
for (i = 0; i < MAX_NUMNODES; i++) {
if (!khugepaged_node_load[i])
continue;
- if (node_distance(nid, i) > RECLAIM_DISTANCE)
+ if (node_distance(nid, i) > node_reclaim_distance)
return true;
}
return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de..0d54cd2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3522,7 +3522,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
- RECLAIM_DISTANCE;
+ node_reclaim_distance;
}
#else /* CONFIG_NUMA */
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
The following commit has been merged into the sched/core branch of tip:
Commit-ID: a2cbfd46559e809c8165773b7fe8afa058b35414
Gitweb: https://git.kernel.org/tip/a2cbfd46559e809c8165773b7fe8afa058b35414
Author: Matt Fleming <[email protected]>
AuthorDate: Thu, 08 Aug 2019 20:53:00 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 03 Sep 2019 09:17:36 +02:00
arch, ia64: Make NUMA select SMP
While it does make sense to allow CONFIG_NUMA and !CONFIG_SMP in
theory, it doesn't make much sense in practice.
Follow other architectures and make CONFIG_NUMA select CONFIG_SMP.
The motivation for this patch is to allow a new NUMA variable to be
initialised in kernel/sched/topology.c.
Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/ia64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 7468d8e..997baba 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -389,6 +389,7 @@ config NUMA
depends on !IA64_HP_SIM && !FLATMEM
default y if IA64_SGI_SN2
select ACPI_NUMA if ACPI
+ select SMP
help
Say Y to compile the kernel to support NUMA (Non-Uniform Memory
Access). This option is for configuring high-end multiprocessor
Hi,
On Thu, Aug 08, 2019 at 08:53:01PM +0100, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea being that it's expensive to balance
> across domains that far apart.
>
> However, as is rather unfortunately explained in
>
> commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
>
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
>
> Current AMD EPYC machines have the following NUMA node distances:
>
> node distances:
> node 0 1 2 3 4 5 6 7
> 0: 10 16 16 16 32 32 32 32
> 1: 16 10 16 16 32 32 32 32
> 2: 16 16 10 16 32 32 32 32
> 3: 16 16 16 10 32 32 32 32
> 4: 32 32 32 32 10 16 16 16
> 5: 32 32 32 32 16 10 16 16
> 6: 32 32 32 32 16 16 10 16
> 7: 32 32 32 32 16 16 16 10
>
> where 2 hops is 32.
>
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
>
> For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
> (CPUs 32-39) like so,
>
> $ numactl -C 0-7,32-39 ./spinner 16
>
> causes all threads to fork and remain on node 0 until the active
> balancer kicks in after a few seconds and forcibly moves some threads
> to node 4.
>
> Override node_reclaim_distance for AMD Zen.
>
> Signed-off-by: Matt Fleming <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: [email protected]
> Cc: Borislav Petkov <[email protected]>
> Cc: [email protected]
This patch causes build errors on systems where NUMA does not depend on SMP,
for example MIPS and PPC. For example, building mips:ip27_defconfig with SMP
disabled results in
mips-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x5018): undefined reference to `node_reclaim_distance'
mips-linux-ld: page_alloc.c:(.text+0x5020): undefined reference to `node_reclaim_distance'
mips-linux-ld: page_alloc.c:(.text+0x5028): undefined reference to `node_reclaim_distance'
mips-linux-ld: page_alloc.c:(.text+0x5040): undefined reference to `node_reclaim_distance'
Makefile:1074: recipe for target 'vmlinux' failed
make: *** [vmlinux] Error 1
I have seen a similar problem with one of my PPC test builds.
powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x18): undefined reference to `node_reclaim_distance'
Guenter
On Mon, 07 Oct, at 08:28:16AM, Guenter Roeck wrote:
>
> This patch causes build errors on systems where NUMA does not depend on SMP,
> for example MIPS and PPC. For example, building mips:ip27_defconfig with SMP
> disabled results in
>
> mips-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
> page_alloc.c:(.text+0x5018): undefined reference to `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5020): undefined reference to `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5028): undefined reference to `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5040): undefined reference to `node_reclaim_distance'
> Makefile:1074: recipe for target 'vmlinux' failed
> make: *** [vmlinux] Error 1
>
> I have seen a similar problem with one of my PPC test builds.
>
> powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x18): undefined reference to `node_reclaim_distance'
Thanks for this Guenter.
So, the way I've fixed this same issue for ia64 was to make NUMA
depend on SMP. Does that seem like a suitable solution for both PPC
and MIPS?
On 10/9/19 5:04 AM, Matt Fleming wrote:
> On Mon, 07 Oct, at 08:28:16AM, Guenter Roeck wrote:
>>
>> This patch causes build errors on systems where NUMA does not depend on SMP,
>> for example MIPS and PPC. For example, building mips:ip27_defconfig with SMP
>> disabled results in
>>
>> mips-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
>> page_alloc.c:(.text+0x5018): undefined reference to `node_reclaim_distance'
>> mips-linux-ld: page_alloc.c:(.text+0x5020): undefined reference to `node_reclaim_distance'
>> mips-linux-ld: page_alloc.c:(.text+0x5028): undefined reference to `node_reclaim_distance'
>> mips-linux-ld: page_alloc.c:(.text+0x5040): undefined reference to `node_reclaim_distance'
>> Makefile:1074: recipe for target 'vmlinux' failed
>> make: *** [vmlinux] Error 1
>>
>> I have seen a similar problem with one of my PPC test builds.
>>
>> powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x18): undefined reference to `node_reclaim_distance'
>
> Thanks for this Guenter.
>
> So, the way I've fixed this same issue for ia64 was to make NUMA
> depend on SMP. Does that seem like a suitable solution for both PPC
> and MIPS?
>
You would still have to cover all other architectures where SMP and NUMA are independent
of each other. Fortunately, it looks like this is only sh4.
sh4-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x3ce0): undefined reference to `node_reclaim_distance'
Makefile:1074: recipe for target 'vmlinux' failed
make: *** [vmlinux] Error 1
arm64 and s390 happen to work because they mandate SMP support, even though NUMA
is nominally independent.
Wondering - why not declare node_reclaim_distance outside SMP dependency ?
Thanks,
Guenter