2023-10-12 02:49:23

by Rongwei Wang

Subject: [PATCH RFC 0/5] support NUMA emulation for arm64

A brief introduction
====================

NUMA emulation can fake multiple nodes on top of a
single-node system, e.g.

one node system:

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31788 MB
node 0 free: 31446 MB
node distances:
node 0
0: 10

after adding numa=fake=2 (fakes 2 nodes on each original node):

[root@localhost ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 15806 MB
node 0 free: 15451 MB
node 1 cpus: 0 1 2 3 4 5 6 7
node 1 size: 16029 MB
node 1 free: 15989 MB
node distances:
node 0 1
0: 10 10
1: 10 10

As shown above, a new node has been faked. For CPUs, the behavior of
the x86 NUMA emulation is kept: every fake node sees all CPUs. It may
be better to give each node its own subset of cores (e.g. 4 per node);
that is not settled yet and is left as a follow-up if needed.

Why do this
===========

There are two main motivations:
(1) On x86 hosts, NUMA emulation can be used to fake a multi-node
environment for testing or verifying performance work, but on
arm64 the only way to do this is to modify the ACPI tables,
which is more or less troublesome.
(2) It reduces contention on some per-node locks. One example we
found: will-it-scale/tlb_flush1_processes -t 96 -s 10 shows an
obvious hotspot on lruvec->lock when run on a single-node
system, and performance improves greatly when run on a system
with two or more nodes. The data is shown below (higher is
better):

---------------------------------------------------------------------
threads/process |    1   |    12   |    24   |    48   |    96
---------------------------------------------------------------------
one node        | 141122 | 1105372 | 1112615 |  797084 |  724516
---------------------------------------------------------------------
numa=fake=2     | 141168 | 1444848 | 2159070 | 1570412 | 1423968
---------------------------------------------------------------------
                | At concurrency 12 there is no lruvec->lock hotspot.
hotspot         | At 24, the one-node case shows a 24% hotspot on
                | lruvec->lock, while the two-node case does not.
---------------------------------------------------------------------

As for the risks (e.g. NUMA balancing...), they still need to be
discussed here.

Lastly, this is just a draft; I can improve it further if the
approach is acceptable.

Thanks!

Rongwei Wang (5):
mm/numa: move numa emulation APIs into generic files
mm: percpu: fix variable type of cpu
arch_numa: remove __init in early_cpu_to_node()
mm/numa: support CONFIG_NUMA_EMU for arm64
mm/numa: migrate leftover numa emulation into mm/numa.c

arch/x86/Kconfig | 8 -
arch/x86/include/asm/numa.h | 3 -
arch/x86/mm/Makefile | 1 -
arch/x86/mm/numa.c | 216 +-------------
arch/x86/mm/numa_internal.h | 14 +-
drivers/base/arch_numa.c | 7 +-
include/asm-generic/numa.h | 33 +++
include/linux/percpu.h | 2 +-
mm/Kconfig | 8 +
mm/Makefile | 1 +
arch/x86/mm/numa_emulation.c => mm/numa.c | 333 +++++++++++++++++++++-
11 files changed, 373 insertions(+), 253 deletions(-)
rename arch/x86/mm/numa_emulation.c => mm/numa.c (63%)

--
2.32.0.3.gf3a3e56d6


2023-10-12 02:49:28

by Rongwei Wang

Subject: [PATCH RFC 2/5] mm: percpu: fix variable type of cpu

Almost all callers declare 'cpu' as 'unsigned int',
but early_cpu_to_node() does not. Correct its
parameter type in this patch.

Signed-off-by: Rongwei Wang <[email protected]>
---
drivers/base/arch_numa.c | 2 +-
include/linux/percpu.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index eaa31e567d1e..db0bb8b8fd67 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -144,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid)
unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
EXPORT_SYMBOL(__per_cpu_offset);

-static int __init early_cpu_to_node(int cpu)
+static int __init early_cpu_to_node(unsigned int cpu)
{
return cpu_to_node_map[cpu];
}
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 68fac2e7cbe6..4aee8400af54 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -100,7 +100,7 @@ extern const char * const pcpu_fc_names[PCPU_FC_NR];

extern enum pcpu_fc pcpu_chosen_fc;

-typedef int (pcpu_fc_cpu_to_node_fn_t)(int cpu);
+typedef int (pcpu_fc_cpu_to_node_fn_t)(unsigned int cpu);
typedef int (pcpu_fc_cpu_distance_fn_t)(unsigned int from, unsigned int to);

extern struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
--
2.32.0.3.gf3a3e56d6

2023-10-12 02:49:32

by Rongwei Wang

Subject: [PATCH RFC 1/5] mm/numa: move numa emulation APIs into generic files

In order to support NUMA emulation on other
architectures, some functions used by numa_meminfo
should be moved out of the x86 arch code. mm/numa.c
is created to hold these APIs.

CONFIG_NUMA_EMU will be handled later.

Signed-off-by: Rongwei Wang <[email protected]>
---
arch/x86/include/asm/numa.h | 3 -
arch/x86/mm/numa.c | 216 +-------------------------
arch/x86/mm/numa_internal.h | 14 +-
include/asm-generic/numa.h | 18 +++
mm/Makefile | 1 +
mm/numa.c | 298 ++++++++++++++++++++++++++++++++++++
6 files changed, 323 insertions(+), 227 deletions(-)
create mode 100644 mm/numa.c

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..8d79be8095d5 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -9,9 +9,6 @@
#include <asm/apicdef.h>

#ifdef CONFIG_NUMA
-
-#define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
-
/*
* Too small node sizes may confuse the VM badly. Usually they
* result from BIOS bugs. So dont recognize nodes as standalone
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2aadb2019b4f..969b11fff03f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -25,8 +25,8 @@ nodemask_t numa_nodes_parsed __initdata;
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
EXPORT_SYMBOL(node_data);

-static struct numa_meminfo numa_meminfo __initdata_or_meminfo;
-static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
+extern struct numa_meminfo numa_meminfo;
+extern struct numa_meminfo numa_reserved_meminfo;

static int numa_distance_cnt;
static u8 *numa_distance;
@@ -148,34 +148,6 @@ static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
return 0;
}

-/**
- * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo
- * @idx: Index of memblk to remove
- * @mi: numa_meminfo to remove memblk from
- *
- * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and
- * decrementing @mi->nr_blks.
- */
-void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
-{
- mi->nr_blks--;
- memmove(&mi->blk[idx], &mi->blk[idx + 1],
- (mi->nr_blks - idx) * sizeof(mi->blk[0]));
-}
-
-/**
- * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another
- * @dst: numa_meminfo to append block to
- * @idx: Index of memblk to remove
- * @src: numa_meminfo to remove memblk from
- */
-static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx,
- struct numa_meminfo *src)
-{
- dst->blk[dst->nr_blks++] = src->blk[idx];
- numa_remove_memblk_from(idx, src);
-}
-
/**
* numa_add_memblk - Add one numa_memblk to numa_meminfo
* @nid: NUMA node ID of the new memblk
@@ -225,124 +197,6 @@ static void __init alloc_node_data(int nid)
node_set_online(nid);
}

-/**
- * numa_cleanup_meminfo - Cleanup a numa_meminfo
- * @mi: numa_meminfo to clean up
- *
- * Sanitize @mi by merging and removing unnecessary memblks. Also check for
- * conflicts and clear unused memblks.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
-{
- const u64 low = 0;
- const u64 high = PFN_PHYS(max_pfn);
- int i, j, k;
-
- /* first, trim all entries */
- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *bi = &mi->blk[i];
-
- /* move / save reserved memory ranges */
- if (!memblock_overlaps_region(&memblock.memory,
- bi->start, bi->end - bi->start)) {
- numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi);
- continue;
- }
-
- /* make sure all non-reserved blocks are inside the limits */
- bi->start = max(bi->start, low);
-
- /* preserve info for non-RAM areas above 'max_pfn': */
- if (bi->end > high) {
- numa_add_memblk_to(bi->nid, high, bi->end,
- &numa_reserved_meminfo);
- bi->end = high;
- }
-
- /* and there's no empty block */
- if (bi->start >= bi->end)
- numa_remove_memblk_from(i--, mi);
- }
-
- /* merge neighboring / overlapping entries */
- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *bi = &mi->blk[i];
-
- for (j = i + 1; j < mi->nr_blks; j++) {
- struct numa_memblk *bj = &mi->blk[j];
- u64 start, end;
-
- /*
- * See whether there are overlapping blocks. Whine
- * about but allow overlaps of the same nid. They
- * will be merged below.
- */
- if (bi->end > bj->start && bi->start < bj->end) {
- if (bi->nid != bj->nid) {
- pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n",
- bi->nid, bi->start, bi->end - 1,
- bj->nid, bj->start, bj->end - 1);
- return -EINVAL;
- }
- pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n",
- bi->nid, bi->start, bi->end - 1,
- bj->start, bj->end - 1);
- }
-
- /*
- * Join together blocks on the same node, holes
- * between which don't overlap with memory on other
- * nodes.
- */
- if (bi->nid != bj->nid)
- continue;
- start = min(bi->start, bj->start);
- end = max(bi->end, bj->end);
- for (k = 0; k < mi->nr_blks; k++) {
- struct numa_memblk *bk = &mi->blk[k];
-
- if (bi->nid == bk->nid)
- continue;
- if (start < bk->end && end > bk->start)
- break;
- }
- if (k < mi->nr_blks)
- continue;
- printk(KERN_INFO "NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n",
- bi->nid, bi->start, bi->end - 1, bj->start,
- bj->end - 1, start, end - 1);
- bi->start = start;
- bi->end = end;
- numa_remove_memblk_from(j--, mi);
- }
- }
-
- /* clear unused ones */
- for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
- mi->blk[i].start = mi->blk[i].end = 0;
- mi->blk[i].nid = NUMA_NO_NODE;
- }
-
- return 0;
-}
-
-/*
- * Set nodes, which have memory in @mi, in *@nodemask.
- */
-static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
- const struct numa_meminfo *mi)
-{
- int i;
-
- for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
- if (mi->blk[i].start != mi->blk[i].end &&
- mi->blk[i].nid != NUMA_NO_NODE)
- node_set(mi->blk[i].nid, *nodemask);
-}
-
/**
* numa_reset_distance - Reset NUMA distance table
*
@@ -478,72 +332,6 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
return true;
}

-/*
- * Mark all currently memblock-reserved physical memory (which covers the
- * kernel's own memory ranges) as hot-unswappable.
- */
-static void __init numa_clear_kernel_node_hotplug(void)
-{
- nodemask_t reserved_nodemask = NODE_MASK_NONE;
- struct memblock_region *mb_region;
- int i;
-
- /*
- * We have to do some preprocessing of memblock regions, to
- * make them suitable for reservation.
- *
- * At this time, all memory regions reserved by memblock are
- * used by the kernel, but those regions are not split up
- * along node boundaries yet, and don't necessarily have their
- * node ID set yet either.
- *
- * So iterate over all memory known to the x86 architecture,
- * and use those ranges to set the nid in memblock.reserved.
- * This will split up the memblock regions along node
- * boundaries and will set the node IDs as well.
- */
- for (i = 0; i < numa_meminfo.nr_blks; i++) {
- struct numa_memblk *mb = numa_meminfo.blk + i;
- int ret;
-
- ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid);
- WARN_ON_ONCE(ret);
- }
-
- /*
- * Now go over all reserved memblock regions, to construct a
- * node mask of all kernel reserved memory areas.
- *
- * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
- * numa_meminfo might not include all memblock.reserved
- * memory ranges, because quirks such as trim_snb_memory()
- * reserve specific pages for Sandy Bridge graphics. ]
- */
- for_each_reserved_mem_region(mb_region) {
- int nid = memblock_get_region_node(mb_region);
-
- if (nid != MAX_NUMNODES)
- node_set(nid, reserved_nodemask);
- }
-
- /*
- * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
- * belonging to the reserved node mask.
- *
- * Note that this will include memory regions that reside
- * on nodes that contain kernel memory - entire nodes
- * become hot-unpluggable:
- */
- for (i = 0; i < numa_meminfo.nr_blks; i++) {
- struct numa_memblk *mb = numa_meminfo.blk + i;
-
- if (!node_isset(mb->nid, reserved_nodemask))
- continue;
-
- memblock_clear_hotplug(mb->start, mb->end - mb->start);
- }
-}
-
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
int i, nid;
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index 86860f279662..b6053beb81b1 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -16,19 +16,13 @@ struct numa_meminfo {
struct numa_memblk blk[NR_NODE_MEMBLKS];
};

-void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
-int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
+extern int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
void __init numa_reset_distance(void);

void __init x86_numa_init(void);

-#ifdef CONFIG_NUMA_EMU
-void __init numa_emulation(struct numa_meminfo *numa_meminfo,
- int numa_dist_cnt);
-#else
-static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
- int numa_dist_cnt)
-{ }
-#endif
+extern void __init numa_emulation(struct numa_meminfo *numa_meminfo,
+ int numa_dist_cnt);
+

#endif /* __X86_MM_NUMA_INTERNAL_H */
diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h
index 1a3ad6d29833..929d7c582a73 100644
--- a/include/asm-generic/numa.h
+++ b/include/asm-generic/numa.h
@@ -39,6 +39,24 @@ void numa_store_cpu_info(unsigned int cpu);
void numa_add_cpu(unsigned int cpu);
void numa_remove_cpu(unsigned int cpu);

+struct numa_memblk {
+ u64 start;
+ u64 end;
+ int nid;
+};
+
+struct numa_meminfo {
+ int nr_blks;
+ struct numa_memblk blk[NR_NODE_MEMBLKS];
+};
+
+extern struct numa_meminfo numa_meminfo;
+
+int __init numa_register_memblks(struct numa_meminfo *mi);
+int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
+void __init numa_emulation(struct numa_meminfo *numa_meminfo,
+ int numa_dist_cnt);
+
#else /* CONFIG_NUMA */

static inline void numa_store_cpu_info(unsigned int cpu) { }
diff --git a/mm/Makefile b/mm/Makefile
index ec65984e2ade..6fc1bd7c9f5b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_NUMA) += numa.o
diff --git a/mm/numa.c b/mm/numa.c
new file mode 100644
index 000000000000..88277e8404f0
--- /dev/null
+++ b/mm/numa.c
@@ -0,0 +1,298 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/init.h>
+#include <linux/memblock.h>
+#include <linux/mmzone.h>
+#include <linux/ctype.h>
+#include <linux/nodemask.h>
+#include <linux/sched.h>
+#include <linux/topology.h>
+
+#include <asm/dma.h>
+
+struct numa_meminfo numa_meminfo __initdata_or_meminfo;
+struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
+
+/*
+ * Set nodes, which have memory in @mi, in *@nodemask.
+ */
+void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
+ const struct numa_meminfo *mi)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
+ if (mi->blk[i].start != mi->blk[i].end &&
+ mi->blk[i].nid != NUMA_NO_NODE)
+ node_set(mi->blk[i].nid, *nodemask);
+}
+
+/**
+ * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo
+ * @idx: Index of memblk to remove
+ * @mi: numa_meminfo to remove memblk from
+ *
+ * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and
+ * decrementing @mi->nr_blks.
+ */
+static void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
+{
+ mi->nr_blks--;
+ memmove(&mi->blk[idx], &mi->blk[idx + 1],
+ (mi->nr_blks - idx) * sizeof(mi->blk[0]));
+}
+
+/**
+ * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another
+ * @dst: numa_meminfo to append block to
+ * @idx: Index of memblk to remove
+ * @src: numa_meminfo to remove memblk from
+ */
+static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx,
+ struct numa_meminfo *src)
+{
+ dst->blk[dst->nr_blks++] = src->blk[idx];
+ numa_remove_memblk_from(idx, src);
+}
+
+int __init numa_add_memblk_to(int nid, u64 start, u64 end,
+ struct numa_meminfo *mi)
+{
+ /* ignore zero length blks */
+ if (start == end)
+ return 0;
+
+ /* whine about and ignore invalid blks */
+ if (start > end || nid < 0 || nid >= MAX_NUMNODES) {
+ pr_warn("Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n",
+ nid, start, end - 1);
+ return 0;
+ }
+
+ if (mi->nr_blks >= NR_NODE_MEMBLKS) {
+ pr_err("too many memblk ranges\n");
+ return -EINVAL;
+ }
+
+ mi->blk[mi->nr_blks].start = start;
+ mi->blk[mi->nr_blks].end = end;
+ mi->blk[mi->nr_blks].nid = nid;
+ mi->nr_blks++;
+ return 0;
+}
+
+/**
+ * numa_cleanup_meminfo - Cleanup a numa_meminfo
+ * @mi: numa_meminfo to clean up
+ *
+ * Sanitize @mi by merging and removing unnecessary memblks. Also check for
+ * conflicts and clear unused memblks.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
+{
+ const u64 low = 0;
+ const u64 high = PFN_PHYS(max_pfn);
+ int i, j, k;
+
+ /* first, trim all entries */
+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *bi = &mi->blk[i];
+
+ /* move / save reserved memory ranges */
+ if (!memblock_overlaps_region(&memblock.memory,
+ bi->start, bi->end - bi->start)) {
+ numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi);
+ continue;
+ }
+
+ /* make sure all non-reserved blocks are inside the limits */
+ bi->start = max(bi->start, low);
+
+ /* preserve info for non-RAM areas above 'max_pfn': */
+ if (bi->end > high) {
+ numa_add_memblk_to(bi->nid, high, bi->end,
+ &numa_reserved_meminfo);
+ bi->end = high;
+ }
+
+ /* and there's no empty block */
+ if (bi->start >= bi->end)
+ numa_remove_memblk_from(i--, mi);
+ }
+
+ /* merge neighboring / overlapping entries */
+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *bi = &mi->blk[i];
+
+ for (j = i + 1; j < mi->nr_blks; j++) {
+ struct numa_memblk *bj = &mi->blk[j];
+ u64 start, end;
+
+ /*
+ * See whether there are overlapping blocks. Whine
+ * about but allow overlaps of the same nid. They
+ * will be merged below.
+ */
+ if (bi->end > bj->start && bi->start < bj->end) {
+ if (bi->nid != bj->nid) {
+ pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n",
+ bi->nid, bi->start, bi->end - 1,
+ bj->nid, bj->start, bj->end - 1);
+ return -EINVAL;
+ }
+ pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n",
+ bi->nid, bi->start, bi->end - 1,
+ bj->start, bj->end - 1);
+ }
+
+ /*
+ * Join together blocks on the same node, holes
+ * between which don't overlap with memory on other
+ * nodes.
+ */
+ if (bi->nid != bj->nid)
+ continue;
+ start = min(bi->start, bj->start);
+ end = max(bi->end, bj->end);
+ for (k = 0; k < mi->nr_blks; k++) {
+ struct numa_memblk *bk = &mi->blk[k];
+
+ if (bi->nid == bk->nid)
+ continue;
+ if (start < bk->end && end > bk->start)
+ break;
+ }
+ if (k < mi->nr_blks)
+ continue;
+ printk(KERN_INFO "NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n",
+ bi->nid, bi->start, bi->end - 1, bj->start,
+ bj->end - 1, start, end - 1);
+ bi->start = start;
+ bi->end = end;
+ numa_remove_memblk_from(j--, mi);
+ }
+ }
+
+ /* clear unused ones */
+ for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
+ mi->blk[i].start = mi->blk[i].end = 0;
+ mi->blk[i].nid = NUMA_NO_NODE;
+ }
+
+ return 0;
+}
+
+/*
+ * Mark all currently memblock-reserved physical memory (which covers the
+ * kernel's own memory ranges) as hot-unswappable.
+ */
+static void __init numa_clear_kernel_node_hotplug(void)
+{
+ nodemask_t reserved_nodemask = NODE_MASK_NONE;
+ struct memblock_region *mb_region;
+ int i;
+
+ /*
+ * We have to do some preprocessing of memblock regions, to
+ * make them suitable for reservation.
+ *
+ * At this time, all memory regions reserved by memblock are
+ * used by the kernel, but those regions are not split up
+ * along node boundaries yet, and don't necessarily have their
+ * node ID set yet either.
+ *
+ * So iterate over all memory known to the x86 architecture,
+ * and use those ranges to set the nid in memblock.reserved.
+ * This will split up the memblock regions along node
+ * boundaries and will set the node IDs as well.
+ */
+ for (i = 0; i < numa_meminfo.nr_blks; i++) {
+ struct numa_memblk *mb = numa_meminfo.blk + i;
+ int ret;
+
+ ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid);
+ WARN_ON_ONCE(ret);
+ }
+
+ /*
+ * Now go over all reserved memblock regions, to construct a
+ * node mask of all kernel reserved memory areas.
+ *
+ * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
+ * numa_meminfo might not include all memblock.reserved
+ * memory ranges, because quirks such as trim_snb_memory()
+ * reserve specific pages for Sandy Bridge graphics. ]
+ */
+ for_each_reserved_mem_region(mb_region) {
+ int nid = memblock_get_region_node(mb_region);
+
+ if (nid != MAX_NUMNODES)
+ node_set(nid, reserved_nodemask);
+ }
+
+ /*
+ * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
+ * belonging to the reserved node mask.
+ *
+ * Note that this will include memory regions that reside
+ * on nodes that contain kernel memory - entire nodes
+ * become hot-unpluggable:
+ */
+ for (i = 0; i < numa_meminfo.nr_blks; i++) {
+ struct numa_memblk *mb = numa_meminfo.blk + i;
+
+ if (!node_isset(mb->nid, reserved_nodemask))
+ continue;
+
+ memblock_clear_hotplug(mb->start, mb->end - mb->start);
+ }
+}
+
+int __weak __init numa_register_memblks(struct numa_meminfo *mi)
+{
+ int i;
+
+ /* Account for nodes with cpus and no memory */
+ node_possible_map = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&node_possible_map, mi);
+ if (WARN_ON(nodes_empty(node_possible_map)))
+ return -EINVAL;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start,
+ &memblock.memory, mb->nid);
+ }
+
+ /*
+ * At very early time, the kernel have to use some memory such as
+ * loading the kernel image. We cannot prevent this anyway. So any
+ * node the kernel resides in should be un-hotpluggable.
+ *
+ * And when we come here, alloc node data won't fail.
+ */
+ numa_clear_kernel_node_hotplug();
+
+ /*
+ * If sections array is gonna be used for pfn -> nid mapping, check
+ * whether its granularity is fine enough.
+ */
+ if (IS_ENABLED(NODE_NOT_IN_PAGE_FLAGS)) {
+ unsigned long pfn_align = node_map_pfn_alignment();
+
+ if (pfn_align && pfn_align < PAGES_PER_SECTION) {
+ pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
+ PFN_PHYS(pfn_align) >> 20,
+ PFN_PHYS(PAGES_PER_SECTION) >> 20);
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
--
2.32.0.3.gf3a3e56d6

2023-10-12 02:50:11

by Rongwei Wang

Subject: [PATCH RFC 4/5] mm/numa: support CONFIG_NUMA_EMU for arm64

CONFIG_NUMA_EMU is moved from arch/x86/Kconfig to
mm/Kconfig, so that both x86 and arm64 can enable it.
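Presumably the new mm/Kconfig entry mirrors the x86 one removed
below; a sketch based on that removed text (the actual help text in
the patch may differ slightly):

```
config NUMA_EMU
	bool "NUMA emulation"
	depends on NUMA
	help
	  Enable NUMA emulation. A flat machine will be split
	  into virtual nodes when booted with "numa=fake=N", where N is the
	  number of nodes. This is only useful for debugging.
```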

Signed-off-by: Rongwei Wang <[email protected]>
---
arch/x86/Kconfig | 8 -
arch/x86/mm/Makefile | 1 -
arch/x86/mm/numa_emulation.c | 585 -----------------------------------
drivers/base/arch_numa.c | 3 +
include/asm-generic/numa.h | 12 +
mm/Kconfig | 8 +
mm/numa.c | 12 +
7 files changed, 35 insertions(+), 594 deletions(-)
delete mode 100644 arch/x86/mm/numa_emulation.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..13438bfe2ec1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1568,14 +1568,6 @@ config X86_64_ACPI_NUMA
help
Enable ACPI SRAT based node topology detection.

-config NUMA_EMU
- bool "NUMA emulation"
- depends on NUMA
- help
- Enable NUMA emulation. A flat machine will be split
- into virtual nodes when booted with "numa=fake=N", where N is the
- number of nodes. This is only useful for debugging.
-
config NODES_SHIFT
int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP
range 1 10
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index c80febc44cd2..1581f17e5de4 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -56,7 +56,6 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o
obj-$(CONFIG_NUMA) += numa.o numa_$(BITS).o
obj-$(CONFIG_AMD_NUMA) += amdtopology.o
obj-$(CONFIG_ACPI_NUMA) += srat.o
-obj-$(CONFIG_NUMA_EMU) += numa_emulation.o

obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
deleted file mode 100644
index 9a9305367fdd..000000000000
--- a/arch/x86/mm/numa_emulation.c
+++ /dev/null
@@ -1,585 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * NUMA emulation
- */
-#include <linux/kernel.h>
-#include <linux/errno.h>
-#include <linux/topology.h>
-#include <linux/memblock.h>
-#include <asm/dma.h>
-
-#include "numa_internal.h"
-
-static int emu_nid_to_phys[MAX_NUMNODES];
-static char *emu_cmdline __initdata;
-
-int __init numa_emu_cmdline(char *str)
-{
- emu_cmdline = str;
- return 0;
-}
-
-static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi)
-{
- int i;
-
- for (i = 0; i < mi->nr_blks; i++)
- if (mi->blk[i].nid == nid)
- return i;
- return -ENOENT;
-}
-
-static u64 __init mem_hole_size(u64 start, u64 end)
-{
- unsigned long start_pfn = PFN_UP(start);
- unsigned long end_pfn = PFN_DOWN(end);
-
- if (start_pfn < end_pfn)
- return PFN_PHYS(absent_pages_in_range(start_pfn, end_pfn));
- return 0;
-}
-
-/*
- * Sets up nid to range from @start to @end. The return value is -errno if
- * something went wrong, 0 otherwise.
- */
-static int __init emu_setup_memblk(struct numa_meminfo *ei,
- struct numa_meminfo *pi,
- int nid, int phys_blk, u64 size)
-{
- struct numa_memblk *eb = &ei->blk[ei->nr_blks];
- struct numa_memblk *pb = &pi->blk[phys_blk];
-
- if (ei->nr_blks >= NR_NODE_MEMBLKS) {
- pr_err("NUMA: Too many emulated memblks, failing emulation\n");
- return -EINVAL;
- }
-
- ei->nr_blks++;
- eb->start = pb->start;
- eb->end = pb->start + size;
- eb->nid = nid;
-
- if (emu_nid_to_phys[nid] == NUMA_NO_NODE)
- emu_nid_to_phys[nid] = pb->nid;
-
- pb->start += size;
- if (pb->start >= pb->end) {
- WARN_ON_ONCE(pb->start > pb->end);
- numa_remove_memblk_from(phys_blk, pi);
- }
-
- printk(KERN_INFO "Faking node %d at [mem %#018Lx-%#018Lx] (%LuMB)\n",
- nid, eb->start, eb->end - 1, (eb->end - eb->start) >> 20);
- return 0;
-}
-
-/*
- * Sets up nr_nodes fake nodes interleaved over physical nodes ranging from addr
- * to max_addr.
- *
- * Returns zero on success or negative on error.
- */
-static int __init split_nodes_interleave(struct numa_meminfo *ei,
- struct numa_meminfo *pi,
- u64 addr, u64 max_addr, int nr_nodes)
-{
- nodemask_t physnode_mask = numa_nodes_parsed;
- u64 size;
- int big;
- int nid = 0;
- int i, ret;
-
- if (nr_nodes <= 0)
- return -1;
- if (nr_nodes > MAX_NUMNODES) {
- pr_info("numa=fake=%d too large, reducing to %d\n",
- nr_nodes, MAX_NUMNODES);
- nr_nodes = MAX_NUMNODES;
- }
-
- /*
- * Calculate target node size. x86_32 freaks on __udivdi3() so do
- * the division in ulong number of pages and convert back.
- */
- size = max_addr - addr - mem_hole_size(addr, max_addr);
- size = PFN_PHYS((unsigned long)(size >> PAGE_SHIFT) / nr_nodes);
-
- /*
- * Calculate the number of big nodes that can be allocated as a result
- * of consolidating the remainder.
- */
- big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * nr_nodes) /
- FAKE_NODE_MIN_SIZE;
-
- size &= FAKE_NODE_MIN_HASH_MASK;
- if (!size) {
- pr_err("Not enough memory for each node. "
- "NUMA emulation disabled.\n");
- return -1;
- }
-
- /*
- * Continue to fill physical nodes with fake nodes until there is no
- * memory left on any of them.
- */
- while (!nodes_empty(physnode_mask)) {
- for_each_node_mask(i, physnode_mask) {
- u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
- u64 start, limit, end;
- int phys_blk;
-
- phys_blk = emu_find_memblk_by_nid(i, pi);
- if (phys_blk < 0) {
- node_clear(i, physnode_mask);
- continue;
- }
- start = pi->blk[phys_blk].start;
- limit = pi->blk[phys_blk].end;
- end = start + size;
-
- if (nid < big)
- end += FAKE_NODE_MIN_SIZE;
-
- /*
- * Continue to add memory to this fake node if its
- * non-reserved memory is less than the per-node size.
- */
- while (end - start - mem_hole_size(start, end) < size) {
- end += FAKE_NODE_MIN_SIZE;
- if (end > limit) {
- end = limit;
- break;
- }
- }
-
- /*
- * If there won't be at least FAKE_NODE_MIN_SIZE of
- * non-reserved memory in ZONE_DMA32 for the next node,
- * this one must extend to the boundary.
- */
- if (end < dma32_end && dma32_end - end -
- mem_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
- end = dma32_end;
-
- /*
- * If there won't be enough non-reserved memory for the
- * next node, this one must extend to the end of the
- * physical node.
- */
- if (limit - end - mem_hole_size(end, limit) < size)
- end = limit;
-
- ret = emu_setup_memblk(ei, pi, nid++ % nr_nodes,
- phys_blk,
- min(end, limit) - start);
- if (ret < 0)
- return ret;
- }
- }
- return 0;
-}
-
-/*
- * Returns the end address of a node so that there is at least `size' amount of
- * non-reserved memory or `max_addr' is reached.
- */
-static u64 __init find_end_of_node(u64 start, u64 max_addr, u64 size)
-{
- u64 end = start + size;
-
- while (end - start - mem_hole_size(start, end) < size) {
- end += FAKE_NODE_MIN_SIZE;
- if (end > max_addr) {
- end = max_addr;
- break;
- }
- }
- return end;
-}
-
-static u64 uniform_size(u64 max_addr, u64 base, u64 hole, int nr_nodes)
-{
- unsigned long max_pfn = PHYS_PFN(max_addr);
- unsigned long base_pfn = PHYS_PFN(base);
- unsigned long hole_pfns = PHYS_PFN(hole);
-
- return PFN_PHYS((max_pfn - base_pfn - hole_pfns) / nr_nodes);
-}
-
-/*
- * Sets up fake nodes of `size' interleaved over physical nodes ranging from
- * `addr' to `max_addr'.
- *
- * Returns zero on success or negative on error.
- */
-static int __init split_nodes_size_interleave_uniform(struct numa_meminfo *ei,
- struct numa_meminfo *pi,
- u64 addr, u64 max_addr, u64 size,
- int nr_nodes, struct numa_memblk *pblk,
- int nid)
-{
- nodemask_t physnode_mask = numa_nodes_parsed;
- int i, ret, uniform = 0;
- u64 min_size;
-
- if ((!size && !nr_nodes) || (nr_nodes && !pblk))
- return -1;
-
- /*
- * In the 'uniform' case split the passed in physical node by
- * nr_nodes, in the non-uniform case, ignore the passed in
- * physical block and try to create nodes of at least size
- * @size.
- *
- * In the uniform case, split the nodes strictly by physical
- * capacity, i.e. ignore holes. In the non-uniform case account
- * for holes and treat @size as a minimum floor.
- */
- if (!nr_nodes)
- nr_nodes = MAX_NUMNODES;
- else {
- nodes_clear(physnode_mask);
- node_set(pblk->nid, physnode_mask);
- uniform = 1;
- }
-
- if (uniform) {
- min_size = uniform_size(max_addr, addr, 0, nr_nodes);
- size = min_size;
- } else {
- /*
- * The limit on emulated nodes is MAX_NUMNODES, so the
- * size per node is increased accordingly if the
- * requested size is too small. This creates a uniform
- * distribution of node sizes across the entire machine
- * (but not necessarily over physical nodes).
- */
- min_size = uniform_size(max_addr, addr,
- mem_hole_size(addr, max_addr), nr_nodes);
- }
- min_size = ALIGN(max(min_size, FAKE_NODE_MIN_SIZE), FAKE_NODE_MIN_SIZE);
- if (size < min_size) {
- pr_err("Fake node size %LuMB too small, increasing to %LuMB\n",
- size >> 20, min_size >> 20);
- size = min_size;
- }
- size = ALIGN_DOWN(size, FAKE_NODE_MIN_SIZE);
-
- /*
- * Fill physical nodes with fake nodes of size until there is no memory
- * left on any of them.
- */
- while (!nodes_empty(physnode_mask)) {
- for_each_node_mask(i, physnode_mask) {
- u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
- u64 start, limit, end;
- int phys_blk;
-
- phys_blk = emu_find_memblk_by_nid(i, pi);
- if (phys_blk < 0) {
- node_clear(i, physnode_mask);
- continue;
- }
-
- start = pi->blk[phys_blk].start;
- limit = pi->blk[phys_blk].end;
-
- if (uniform)
- end = start + size;
- else
- end = find_end_of_node(start, limit, size);
- /*
- * If there won't be at least FAKE_NODE_MIN_SIZE of
- * non-reserved memory in ZONE_DMA32 for the next node,
- * this one must extend to the boundary.
- */
- if (end < dma32_end && dma32_end - end -
- mem_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
- end = dma32_end;
-
- /*
- * If there won't be enough non-reserved memory for the
- * next node, this one must extend to the end of the
- * physical node.
- */
- if ((limit - end - mem_hole_size(end, limit) < size)
- && !uniform)
- end = limit;
-
- ret = emu_setup_memblk(ei, pi, nid++ % MAX_NUMNODES,
- phys_blk,
- min(end, limit) - start);
- if (ret < 0)
- return ret;
- }
- }
- return nid;
-}
-
-static int __init split_nodes_size_interleave(struct numa_meminfo *ei,
- struct numa_meminfo *pi,
- u64 addr, u64 max_addr, u64 size)
-{
- return split_nodes_size_interleave_uniform(ei, pi, addr, max_addr, size,
- 0, NULL, 0);
-}
-
-static int __init setup_emu2phys_nid(int *dfl_phys_nid)
-{
- int i, max_emu_nid = 0;
-
- *dfl_phys_nid = NUMA_NO_NODE;
- for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) {
- if (emu_nid_to_phys[i] != NUMA_NO_NODE) {
- max_emu_nid = i;
- if (*dfl_phys_nid == NUMA_NO_NODE)
- *dfl_phys_nid = emu_nid_to_phys[i];
- }
- }
-
- return max_emu_nid;
-}
-
-/**
- * numa_emulation - Emulate NUMA nodes
- * @numa_meminfo: NUMA configuration to massage
- * @numa_dist_cnt: The size of the physical NUMA distance table
- *
- * Emulate NUMA nodes according to the numa=fake kernel parameter.
- * @numa_meminfo contains the physical memory configuration and is modified
- * to reflect the emulated configuration on success. @numa_dist_cnt is
- * used to determine the size of the physical distance table.
- *
- * On success, the following modifications are made.
- *
- * - @numa_meminfo is updated to reflect the emulated nodes.
- *
- * - __apicid_to_node[] is updated such that APIC IDs are mapped to the
- * emulated nodes.
- *
- * - NUMA distance table is rebuilt to represent distances between emulated
- * nodes. The distances are determined considering how emulated nodes
- * are mapped to physical nodes and match the actual distances.
- *
- * - emu_nid_to_phys[] reflects how emulated nodes are mapped to physical
- * nodes. This is used by numa_add_cpu() and numa_remove_cpu().
- *
- * If emulation is not enabled or fails, emu_nid_to_phys[] is filled with
- * identity mapping and no other modification is made.
- */
-void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
-{
- static struct numa_meminfo ei __initdata;
- static struct numa_meminfo pi __initdata;
- const u64 max_addr = PFN_PHYS(max_pfn);
- u8 *phys_dist = NULL;
- size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
- int max_emu_nid, dfl_phys_nid;
- int i, j, ret;
-
- if (!emu_cmdline)
- goto no_emu;
-
- memset(&ei, 0, sizeof(ei));
- pi = *numa_meminfo;
-
- for (i = 0; i < MAX_NUMNODES; i++)
- emu_nid_to_phys[i] = NUMA_NO_NODE;
-
- /*
- * If the numa=fake command-line contains a 'M' or 'G', it represents
- * the fixed node size. Otherwise, if it is just a single number N,
- * split the system RAM into N fake nodes.
- */
- if (strchr(emu_cmdline, 'U')) {
- nodemask_t physnode_mask = numa_nodes_parsed;
- unsigned long n;
- int nid = 0;
-
- n = simple_strtoul(emu_cmdline, &emu_cmdline, 0);
- ret = -1;
- for_each_node_mask(i, physnode_mask) {
- /*
- * The reason we pass in blk[0] is due to
- * numa_remove_memblk_from() called by
- * emu_setup_memblk() will delete entry 0
- * and then move everything else up in the pi.blk
- * array. Therefore we should always be looking
- * at blk[0].
- */
- ret = split_nodes_size_interleave_uniform(&ei, &pi,
- pi.blk[0].start, pi.blk[0].end, 0,
- n, &pi.blk[0], nid);
- if (ret < 0)
- break;
- if (ret < n) {
- pr_info("%s: phys: %d only got %d of %ld nodes, failing\n",
- __func__, i, ret, n);
- ret = -1;
- break;
- }
- nid = ret;
- }
- } else if (strchr(emu_cmdline, 'M') || strchr(emu_cmdline, 'G')) {
- u64 size;
-
- size = memparse(emu_cmdline, &emu_cmdline);
- ret = split_nodes_size_interleave(&ei, &pi, 0, max_addr, size);
- } else {
- unsigned long n;
-
- n = simple_strtoul(emu_cmdline, &emu_cmdline, 0);
- ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n);
- }
- if (*emu_cmdline == ':')
- emu_cmdline++;
-
- if (ret < 0)
- goto no_emu;
-
- if (numa_cleanup_meminfo(&ei) < 0) {
- pr_warn("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
- goto no_emu;
- }
-
- /* copy the physical distance table */
- if (numa_dist_cnt) {
- u64 phys;
-
- phys = memblock_phys_alloc_range(phys_size, PAGE_SIZE, 0,
- PFN_PHYS(max_pfn_mapped));
- if (!phys) {
- pr_warn("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
- goto no_emu;
- }
- phys_dist = __va(phys);
-
- for (i = 0; i < numa_dist_cnt; i++)
- for (j = 0; j < numa_dist_cnt; j++)
- phys_dist[i * numa_dist_cnt + j] =
- node_distance(i, j);
- }
-
- /*
- * Determine the max emulated nid and the default phys nid to use
- * for unmapped nodes.
- */
- max_emu_nid = setup_emu2phys_nid(&dfl_phys_nid);
-
- /* commit */
- *numa_meminfo = ei;
-
- /* Make sure numa_nodes_parsed only contains emulated nodes */
- nodes_clear(numa_nodes_parsed);
- for (i = 0; i < ARRAY_SIZE(ei.blk); i++)
- if (ei.blk[i].start != ei.blk[i].end &&
- ei.blk[i].nid != NUMA_NO_NODE)
- node_set(ei.blk[i].nid, numa_nodes_parsed);
-
- /*
- * Transform __apicid_to_node table to use emulated nids by
- * reverse-mapping phys_nid. The maps should always exist but fall
- * back to zero just in case.
- */
- for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
- if (__apicid_to_node[i] == NUMA_NO_NODE)
- continue;
- for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
- if (__apicid_to_node[i] == emu_nid_to_phys[j])
- break;
- __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0;
- }
-
- /* make sure all emulated nodes are mapped to a physical node */
- for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
- if (emu_nid_to_phys[i] == NUMA_NO_NODE)
- emu_nid_to_phys[i] = dfl_phys_nid;
-
- /* transform distance table */
- numa_reset_distance();
- for (i = 0; i < max_emu_nid + 1; i++) {
- for (j = 0; j < max_emu_nid + 1; j++) {
- int physi = emu_nid_to_phys[i];
- int physj = emu_nid_to_phys[j];
- int dist;
-
- if (get_option(&emu_cmdline, &dist) == 2)
- ;
- else if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
- dist = physi == physj ?
- LOCAL_DISTANCE : REMOTE_DISTANCE;
- else
- dist = phys_dist[physi * numa_dist_cnt + physj];
-
- numa_set_distance(i, j, dist);
- }
- }
-
- /* free the copied physical distance table */
- memblock_free(phys_dist, phys_size);
- return;
-
-no_emu:
- /* No emulation. Build identity emu_nid_to_phys[] for numa_add_cpu() */
- for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
- emu_nid_to_phys[i] = i;
-}
-
-#ifndef CONFIG_DEBUG_PER_CPU_MAPS
-void numa_add_cpu(int cpu)
-{
- int physnid, nid;
-
- nid = early_cpu_to_node(cpu);
- BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));
-
- physnid = emu_nid_to_phys[nid];
-
- /*
- * Map the cpu to each emulated node that is allocated on the physical
- * node of the cpu's apic id.
- */
- for_each_online_node(nid)
- if (emu_nid_to_phys[nid] == physnid)
- cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
-}
-
-void numa_remove_cpu(int cpu)
-{
- int i;
-
- for_each_online_node(i)
- cpumask_clear_cpu(cpu, node_to_cpumask_map[i]);
-}
-#else /* !CONFIG_DEBUG_PER_CPU_MAPS */
-static void numa_set_cpumask(int cpu, bool enable)
-{
- int nid, physnid;
-
- nid = early_cpu_to_node(cpu);
- if (nid == NUMA_NO_NODE) {
- /* early_cpu_to_node() already emits a warning and trace */
- return;
- }
-
- physnid = emu_nid_to_phys[nid];
-
- for_each_online_node(nid) {
- if (emu_nid_to_phys[nid] != physnid)
- continue;
-
- debug_cpumask_set_cpu(cpu, nid, enable);
- }
-}
-
-void numa_add_cpu(int cpu)
-{
- numa_set_cpumask(cpu, true);
-}
-
-void numa_remove_cpu(int cpu)
-{
- numa_set_cpumask(cpu, false);
-}
-#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 5df0ad5cb09d..67bdbcd0caf9 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -13,6 +13,7 @@
#include <linux/module.h>
#include <linux/of.h>

+#include <asm-generic/numa.h>
#include <asm/sections.h>

struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
@@ -30,6 +31,8 @@ static __init int numa_parse_early_param(char *opt)
return -EINVAL;
if (str_has_prefix(opt, "off"))
numa_off = true;
+ if (!strncmp(opt, "fake=", 5))
+ return numa_emu_cmdline(opt + 5);

return 0;
}
diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h
index 929d7c582a73..4658155a070a 100644
--- a/include/asm-generic/numa.h
+++ b/include/asm-generic/numa.h
@@ -50,12 +50,24 @@ struct numa_meminfo {
struct numa_memblk blk[NR_NODE_MEMBLKS];
};

+#ifdef CONFIG_NUMA_EMU
+#define FAKE_NODE_MIN_SIZE ((u64)32 << 20)
+#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL))
+
extern struct numa_meminfo numa_meminfo;
+extern char *emu_cmdline __initdata;

+int numa_emu_cmdline(char *str);
int __init numa_register_memblks(struct numa_meminfo *mi);
int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
void __init numa_emulation(struct numa_meminfo *numa_meminfo,
int numa_dist_cnt);
+#else
+static inline int numa_emu_cmdline(char *str)
+{
+ return -EINVAL;
+}
+#endif

#else /* CONFIG_NUMA */

diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..22bead675ee6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -549,6 +549,14 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
config ARCH_ENABLE_MEMORY_HOTREMOVE
bool

+config NUMA_EMU
+ bool "NUMA emulation (EXPERIMENTAL)"
+ depends on NUMA && (X86 || ARM64)
+ help
+ Enable NUMA emulation. A flat machine will be split
+ into virtual nodes when booted with "numa=fake=N", where N is the
+ number of nodes. This is only useful for debugging.
+
# eventually, we can have this option just 'select SPARSEMEM'
menuconfig MEMORY_HOTPLUG
bool "Memory hotplug"
diff --git a/mm/numa.c b/mm/numa.c
index 88277e8404f0..3cc01f06a2a6 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -16,6 +16,10 @@
struct numa_meminfo numa_meminfo __initdata_or_meminfo;
struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;

+#ifdef CONFIG_NUMA_EMU
+char *emu_cmdline __initdata;
+#endif
+
/*
* Set nodes, which have memory in @mi, in *@nodemask.
*/
@@ -296,3 +300,11 @@ int __weak __init numa_register_memblks(struct numa_meminfo *mi)

return 0;
}
+
+#ifdef CONFIG_NUMA_EMU
+int __init numa_emu_cmdline(char *str)
+{
+ emu_cmdline = str;
+ return 0;
+}
+#endif
--
2.32.0.3.gf3a3e56d6

2023-10-12 02:50:48

by Rongwei Wang

Subject: [PATCH RFC 3/5] arch_numa: remove __init in early_cpu_to_node()

Most architectures do not mark
early_cpu_to_node() as '__init'. It is
safe to drop the attribute here, in
preparation for the later NUMA emulation
patches.

Signed-off-by: Rongwei Wang <[email protected]>
---
drivers/base/arch_numa.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index db0bb8b8fd67..5df0ad5cb09d 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -144,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid)
unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
EXPORT_SYMBOL(__per_cpu_offset);

-static int __init early_cpu_to_node(unsigned int cpu)
+int early_cpu_to_node(unsigned int cpu)
{
return cpu_to_node_map[cpu];
}
--
2.32.0.3.gf3a3e56d6

2023-10-12 02:50:55

by Rongwei Wang

Subject: [PATCH RFC 5/5] mm/numa: migrate leftover numa emulation into mm/numa.c

Move the remaining code of the original
x86/mm/numa_emulation.c into mm/numa.c,
so that it can later be enabled for arm64.

Signed-off-by: Rongwei Wang <[email protected]>
---
drivers/base/arch_numa.c | 2 +
include/asm-generic/numa.h | 3 +
mm/numa.c | 586 ++++++++++++++++++++++++++++++++++++-
3 files changed, 587 insertions(+), 4 deletions(-)

diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 67bdbcd0caf9..c6f5ceadb9e1 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -64,6 +64,7 @@ EXPORT_SYMBOL(cpumask_of_node);

#endif

+#ifndef CONFIG_NUMA_EMU
static void numa_update_cpu(unsigned int cpu, bool remove)
{
int nid = cpu_to_node(cpu);
@@ -92,6 +93,7 @@ void numa_clear_node(unsigned int cpu)
numa_remove_cpu(cpu);
set_cpu_numa_node(cpu, NUMA_NO_NODE);
}
+#endif

/*
* Allocate node_to_cpumask_map based on number of available nodes
diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h
index 4658155a070a..9969ec7f59a4 100644
--- a/include/asm-generic/numa.h
+++ b/include/asm-generic/numa.h
@@ -55,6 +55,7 @@ struct numa_meminfo {
#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL))

extern struct numa_meminfo numa_meminfo;
+extern int emu_nid_to_phys[MAX_NUMNODES];
extern char *emu_cmdline __initdata;

int numa_emu_cmdline(char *str);
@@ -62,6 +63,8 @@ int __init numa_register_memblks(struct numa_meminfo *mi);
int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
void __init numa_emulation(struct numa_meminfo *numa_meminfo,
int numa_dist_cnt);
+int __init numa_add_memblk_to(int nid, u64 start, u64 end,
+ struct numa_meminfo *mi);
#else
static inline int numa_emu_cmdline(char *str)
{
diff --git a/mm/numa.c b/mm/numa.c
index 3cc01f06a2a6..a6e9652498c9 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -1,4 +1,5 @@
// SPDX-License-Identifier: GPL-2.0-only
+/* Most of this file comes from x86/numa_emulation.c */
#include <linux/acpi.h>
#include <linux/kernel.h>
#include <linux/mm.h>
@@ -16,10 +17,6 @@
struct numa_meminfo numa_meminfo __initdata_or_meminfo;
struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;

-#ifdef CONFIG_NUMA_EMU
-char *emu_cmdline __initdata;
-#endif
-
/*
* Set nodes, which have memory in @mi, in *@nodemask.
*/
@@ -302,9 +299,590 @@ int __weak __init numa_register_memblks(struct numa_meminfo *mi)
}

#ifdef CONFIG_NUMA_EMU
+int emu_nid_to_phys[MAX_NUMNODES];
+char *emu_cmdline __initdata;
+
int __init numa_emu_cmdline(char *str)
{
emu_cmdline = str;
return 0;
}
+
+static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi)
+{
+ int i;
+
+ for (i = 0; i < mi->nr_blks; i++)
+ if (mi->blk[i].nid == nid)
+ return i;
+ return -ENOENT;
+}
+
+static u64 __init mem_hole_size(u64 start, u64 end)
+{
+ unsigned long start_pfn = PFN_UP(start);
+ unsigned long end_pfn = PFN_DOWN(end);
+
+ if (start_pfn < end_pfn)
+ return PFN_PHYS(absent_pages_in_range(start_pfn, end_pfn));
+ return 0;
+}
+
+/*
+ * Sets up nid to range from @start to @end. The return value is -errno if
+ * something went wrong, 0 otherwise.
+ */
+static int __init emu_setup_memblk(struct numa_meminfo *ei,
+ struct numa_meminfo *pi,
+ int nid, int phys_blk, u64 size)
+{
+ struct numa_memblk *eb = &ei->blk[ei->nr_blks];
+ struct numa_memblk *pb = &pi->blk[phys_blk];
+
+ if (ei->nr_blks >= NR_NODE_MEMBLKS) {
+ pr_err("NUMA: Too many emulated memblks, failing emulation\n");
+ return -EINVAL;
+ }
+
+ ei->nr_blks++;
+ eb->start = pb->start;
+ eb->end = pb->start + size;
+ eb->nid = nid;
+
+ if (emu_nid_to_phys[nid] == NUMA_NO_NODE)
+ emu_nid_to_phys[nid] = pb->nid;
+
+ pb->start += size;
+ if (pb->start >= pb->end) {
+ WARN_ON_ONCE(pb->start > pb->end);
+ numa_remove_memblk_from(phys_blk, pi);
+ }
+
+ printk(KERN_INFO "Faking node %d at [mem %#018Lx-%#018Lx] (%LuMB)\n",
+ nid, eb->start, eb->end - 1, (eb->end - eb->start) >> 20);
+ return 0;
+}
+
+/*
+ * Sets up nr_nodes fake nodes interleaved over physical nodes ranging from addr
+ * to max_addr.
+ *
+ * Returns zero on success or negative on error.
+ */
+static int __init split_nodes_interleave(struct numa_meminfo *ei,
+ struct numa_meminfo *pi,
+ u64 addr, u64 max_addr, int nr_nodes)
+{
+ nodemask_t physnode_mask = numa_nodes_parsed;
+ u64 size;
+ int big;
+ int nid = 0;
+ int i, ret;
+
+ if (nr_nodes <= 0)
+ return -1;
+ if (nr_nodes > MAX_NUMNODES) {
+ pr_info("numa=fake=%d too large, reducing to %d\n",
+ nr_nodes, MAX_NUMNODES);
+ nr_nodes = MAX_NUMNODES;
+ }
+
+ /*
+ * Calculate target node size. x86_32 freaks on __udivdi3() so do
+ * the division in ulong number of pages and convert back.
+ */
+ size = max_addr - addr - mem_hole_size(addr, max_addr);
+ size = PFN_PHYS((unsigned long)(size >> PAGE_SHIFT) / nr_nodes);
+
+ /*
+ * Calculate the number of big nodes that can be allocated as a result
+ * of consolidating the remainder.
+ */
+ big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * nr_nodes) /
+ FAKE_NODE_MIN_SIZE;
+
+ size &= FAKE_NODE_MIN_HASH_MASK;
+ if (!size) {
+ pr_err("Not enough memory for each node. "
+ "NUMA emulation disabled.\n");
+ return -1;
+ }
+
+ /*
+ * Continue to fill physical nodes with fake nodes until there is no
+ * memory left on any of them.
+ */
+ while (!nodes_empty(physnode_mask)) {
+ for_each_node_mask(i, physnode_mask) {
+#ifdef CONFIG_X86
+ u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
+#endif
+ u64 start, limit, end;
+ int phys_blk;
+
+ phys_blk = emu_find_memblk_by_nid(i, pi);
+ if (phys_blk < 0) {
+ node_clear(i, physnode_mask);
+ continue;
+ }
+ start = pi->blk[phys_blk].start;
+ limit = pi->blk[phys_blk].end;
+ end = start + size;
+
+ if (nid < big)
+ end += FAKE_NODE_MIN_SIZE;
+
+ /*
+ * Continue to add memory to this fake node if its
+ * non-reserved memory is less than the per-node size.
+ */
+ while (end - start - mem_hole_size(start, end) < size) {
+ end += FAKE_NODE_MIN_SIZE;
+ if (end > limit) {
+ end = limit;
+ break;
+ }
+ }
+
+#ifdef CONFIG_X86
+ /*
+ * If there won't be at least FAKE_NODE_MIN_SIZE of
+ * non-reserved memory in ZONE_DMA32 for the next node,
+ * this one must extend to the boundary.
+ */
+ if (end < dma32_end && dma32_end - end -
+ mem_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
+ end = dma32_end;
+#endif
+
+ /*
+ * If there won't be enough non-reserved memory for the
+ * next node, this one must extend to the end of the
+ * physical node.
+ */
+ if (limit - end - mem_hole_size(end, limit) < size)
+ end = limit;
+
+ ret = emu_setup_memblk(ei, pi, nid++ % nr_nodes,
+ phys_blk,
+ min(end, limit) - start);
+ if (ret < 0)
+ return ret;
+ }
+ }
+ return 0;
+}
+
+/*
+ * Returns the end address of a node so that there is at least `size' amount of
+ * non-reserved memory or `max_addr' is reached.
+ */
+static u64 __init find_end_of_node(u64 start, u64 max_addr, u64 size)
+{
+ u64 end = start + size;
+
+ while (end - start - mem_hole_size(start, end) < size) {
+ end += FAKE_NODE_MIN_SIZE;
+ if (end > max_addr) {
+ end = max_addr;
+ break;
+ }
+ }
+ return end;
+}
+
+static u64 uniform_size(u64 max_addr, u64 base, u64 hole, int nr_nodes)
+{
+ unsigned long max_pfn = PHYS_PFN(max_addr);
+ unsigned long base_pfn = PHYS_PFN(base);
+ unsigned long hole_pfns = PHYS_PFN(hole);
+
+ return PFN_PHYS((max_pfn - base_pfn - hole_pfns) / nr_nodes);
+}
+
+/*
+ * Sets up fake nodes of `size' interleaved over physical nodes ranging from
+ * `addr' to `max_addr'.
+ *
+ * Returns zero on success or negative on error.
+ */
+static int __init split_nodes_size_interleave_uniform(struct numa_meminfo *ei,
+ struct numa_meminfo *pi,
+ u64 addr, u64 max_addr, u64 size,
+ int nr_nodes, struct numa_memblk *pblk,
+ int nid)
+{
+ nodemask_t physnode_mask = numa_nodes_parsed;
+ int i, ret, uniform = 0;
+ u64 min_size;
+
+ if ((!size && !nr_nodes) || (nr_nodes && !pblk))
+ return -1;
+
+ /*
+ * In the 'uniform' case split the passed in physical node by
+ * nr_nodes, in the non-uniform case, ignore the passed in
+ * physical block and try to create nodes of at least size
+ * @size.
+ *
+ * In the uniform case, split the nodes strictly by physical
+ * capacity, i.e. ignore holes. In the non-uniform case account
+ * for holes and treat @size as a minimum floor.
+ */
+ if (!nr_nodes)
+ nr_nodes = MAX_NUMNODES;
+ else {
+ nodes_clear(physnode_mask);
+ node_set(pblk->nid, physnode_mask);
+ uniform = 1;
+ }
+
+ if (uniform) {
+ min_size = uniform_size(max_addr, addr, 0, nr_nodes);
+ size = min_size;
+ } else {
+ /*
+ * The limit on emulated nodes is MAX_NUMNODES, so the
+ * size per node is increased accordingly if the
+ * requested size is too small. This creates a uniform
+ * distribution of node sizes across the entire machine
+ * (but not necessarily over physical nodes).
+ */
+ min_size = uniform_size(max_addr, addr,
+ mem_hole_size(addr, max_addr), nr_nodes);
+ }
+ min_size = ALIGN(max(min_size, FAKE_NODE_MIN_SIZE), FAKE_NODE_MIN_SIZE);
+ if (size < min_size) {
+ pr_err("Fake node size %LuMB too small, increasing to %LuMB\n",
+ size >> 20, min_size >> 20);
+ size = min_size;
+ }
+ size = ALIGN_DOWN(size, FAKE_NODE_MIN_SIZE);
+
+ /*
+ * Fill physical nodes with fake nodes of size until there is no memory
+ * left on any of them.
+ */
+ while (!nodes_empty(physnode_mask)) {
+ for_each_node_mask(i, physnode_mask) {
+#ifdef CONFIG_X86
+ u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
+#endif
+ u64 start, limit, end;
+ int phys_blk;
+
+ phys_blk = emu_find_memblk_by_nid(i, pi);
+ if (phys_blk < 0) {
+ node_clear(i, physnode_mask);
+ continue;
+ }
+
+ start = pi->blk[phys_blk].start;
+ limit = pi->blk[phys_blk].end;
+
+ if (uniform)
+ end = start + size;
+ else
+ end = find_end_of_node(start, limit, size);
+
+#ifdef CONFIG_X86
+ /*
+ * If there won't be at least FAKE_NODE_MIN_SIZE of
+ * non-reserved memory in ZONE_DMA32 for the next node,
+ * this one must extend to the boundary.
+ */
+ if (end < dma32_end && dma32_end - end -
+ mem_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE)
+ end = dma32_end;
+#endif
+
+ /*
+ * If there won't be enough non-reserved memory for the
+ * next node, this one must extend to the end of the
+ * physical node.
+ */
+ if ((limit - end - mem_hole_size(end, limit) < size)
+ && !uniform)
+ end = limit;
+
+ ret = emu_setup_memblk(ei, pi, nid++ % MAX_NUMNODES,
+ phys_blk,
+ min(end, limit) - start);
+ if (ret < 0)
+ return ret;
+ }
+ }
+ return nid;
+}
+
+static int __init split_nodes_size_interleave(struct numa_meminfo *ei,
+ struct numa_meminfo *pi,
+ u64 addr, u64 max_addr, u64 size)
+{
+ return split_nodes_size_interleave_uniform(ei, pi, addr, max_addr, size,
+ 0, NULL, 0);
+}
+
+static int __init setup_emu2phys_nid(int *dfl_phys_nid)
+{
+ int i, max_emu_nid = 0;
+
+ *dfl_phys_nid = NUMA_NO_NODE;
+ for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) {
+ if (emu_nid_to_phys[i] != NUMA_NO_NODE) {
+ max_emu_nid = i;
+ if (*dfl_phys_nid == NUMA_NO_NODE)
+ *dfl_phys_nid = emu_nid_to_phys[i];
+ }
+ }
+
+ return max_emu_nid;
+}
+
+/**
+ * numa_emulation - Emulate NUMA nodes
+ * @numa_meminfo: NUMA configuration to massage
+ * @numa_dist_cnt: The size of the physical NUMA distance table
+ *
+ * Emulate NUMA nodes according to the numa=fake kernel parameter.
+ * @numa_meminfo contains the physical memory configuration and is modified
+ * to reflect the emulated configuration on success. @numa_dist_cnt is
+ * used to determine the size of the physical distance table.
+ *
+ * On success, the following modifications are made.
+ *
+ * - @numa_meminfo is updated to reflect the emulated nodes.
+ *
+ * - __apicid_to_node[] is updated such that APIC IDs are mapped to the
+ * emulated nodes.
+ *
+ * - NUMA distance table is rebuilt to represent distances between emulated
+ * nodes. The distances are determined considering how emulated nodes
+ * are mapped to physical nodes and match the actual distances.
+ *
+ * - emu_nid_to_phys[] reflects how emulated nodes are mapped to physical
+ * nodes. This is used by numa_add_cpu() and numa_remove_cpu().
+ *
+ * If emulation is not enabled or fails, emu_nid_to_phys[] is filled with
+ * identity mapping and no other modification is made.
+ */
+void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
+{
+ static struct numa_meminfo ei __initdata;
+ static struct numa_meminfo pi __initdata;
+ const u64 max_addr = PFN_PHYS(max_pfn);
+ u8 *phys_dist = NULL;
+ size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
+ int max_emu_nid, dfl_phys_nid;
+ int i, j, ret;
+
+ if (!emu_cmdline)
+ goto no_emu;
+
+ memset(&ei, 0, sizeof(ei));
+ pi = *numa_meminfo;
+
+ for (i = 0; i < MAX_NUMNODES; i++)
+ emu_nid_to_phys[i] = NUMA_NO_NODE;
+
+ /*
+ * If the numa=fake command-line contains a 'M' or 'G', it represents
+ * the fixed node size. Otherwise, if it is just a single number N,
+ * split the system RAM into N fake nodes.
+ */
+ if (strchr(emu_cmdline, 'U')) {
+ nodemask_t physnode_mask = numa_nodes_parsed;
+ unsigned long n;
+ int nid = 0;
+
+ n = simple_strtoul(emu_cmdline, &emu_cmdline, 0);
+ ret = -1;
+ for_each_node_mask(i, physnode_mask) {
+ /*
+ * The reason we pass in blk[0] is due to
+ * numa_remove_memblk_from() called by
+ * emu_setup_memblk() will delete entry 0
+ * and then move everything else up in the pi.blk
+ * array. Therefore we should always be looking
+ * at blk[0].
+ */
+ ret = split_nodes_size_interleave_uniform(&ei, &pi,
+ pi.blk[0].start, pi.blk[0].end, 0,
+ n, &pi.blk[0], nid);
+ if (ret < 0)
+ break;
+ if (ret < n) {
+ pr_info("%s: phys: %d only got %d of %ld nodes, failing\n",
+ __func__, i, ret, n);
+ ret = -1;
+ break;
+ }
+ nid = ret;
+ }
+ } else if (strchr(emu_cmdline, 'M') || strchr(emu_cmdline, 'G')) {
+ u64 size;
+
+ size = memparse(emu_cmdline, &emu_cmdline);
+ ret = split_nodes_size_interleave(&ei, &pi, 0, max_addr, size);
+ } else {
+ unsigned long n;
+
+ n = simple_strtoul(emu_cmdline, &emu_cmdline, 0);
+ ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n);
+ }
+ if (*emu_cmdline == ':')
+ emu_cmdline++;
+
+ if (ret < 0)
+ goto no_emu;
+
+ if (numa_cleanup_meminfo(&ei) < 0) {
+ pr_warn("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
+ goto no_emu;
+ }
+
+ /* copy the physical distance table */
+ if (numa_dist_cnt) {
+ u64 phys;
+
+ phys = memblock_phys_alloc_range(phys_size, PAGE_SIZE, 0,
+ MEMBLOCK_ALLOC_ACCESSIBLE);
+ if (!phys) {
+ pr_warn("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
+ goto no_emu;
+ }
+ phys_dist = __va(phys);
+
+ for (i = 0; i < numa_dist_cnt; i++)
+ for (j = 0; j < numa_dist_cnt; j++)
+ phys_dist[i * numa_dist_cnt + j] =
+ node_distance(i, j);
+ }
+
+ /*
+ * Determine the max emulated nid and the default phys nid to use
+ * for unmapped nodes.
+ */
+ max_emu_nid = setup_emu2phys_nid(&dfl_phys_nid);
+
+ /* commit */
+ *numa_meminfo = ei;
+
+ /* Make sure numa_nodes_parsed only contains emulated nodes */
+ nodes_clear(numa_nodes_parsed);
+ for (i = 0; i < ARRAY_SIZE(ei.blk); i++)
+ if (ei.blk[i].start != ei.blk[i].end &&
+ ei.blk[i].nid != NUMA_NO_NODE)
+ node_set(ei.blk[i].nid, numa_nodes_parsed);
+
+#ifdef CONFIG_X86
+ /*
+ * Transform __apicid_to_node table to use emulated nids by
+ * reverse-mapping phys_nid. The maps should always exist but fall
+ * back to zero just in case.
+ */
+ for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
+ if (__apicid_to_node[i] == NUMA_NO_NODE)
+ continue;
+ for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
+ if (__apicid_to_node[i] == emu_nid_to_phys[j])
+ break;
+ __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0;
+ }
+#endif
+
+ /* make sure all emulated nodes are mapped to a physical node */
+ for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
+ if (emu_nid_to_phys[i] == NUMA_NO_NODE)
+ emu_nid_to_phys[i] = dfl_phys_nid;
+
+ /* transform distance table */
+ numa_free_distance();
+ for (i = 0; i < max_emu_nid + 1; i++) {
+ for (j = 0; j < max_emu_nid + 1; j++) {
+ int physi = emu_nid_to_phys[i];
+ int physj = emu_nid_to_phys[j];
+ int dist;
+
+ if (get_option(&emu_cmdline, &dist) == 2)
+ ;
+ else if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
+ dist = physi == physj ?
+ LOCAL_DISTANCE : REMOTE_DISTANCE;
+ else
+ dist = phys_dist[physi * numa_dist_cnt + physj];
+
+ numa_set_distance(i, j, dist);
+ }
+ }
+
+ /* free the copied physical distance table */
+ memblock_free(phys_dist, phys_size);
+ return;
+
+no_emu:
+ /* No emulation. Build identity emu_nid_to_phys[] for numa_add_cpu() */
+ for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
+ emu_nid_to_phys[i] = i;
+}
+
+#ifndef CONFIG_DEBUG_PER_CPU_MAPS
+extern int early_cpu_to_node(unsigned int cpu);
+
+void numa_add_cpu(unsigned int cpu)
+{
+ int physnid, nid;
+
+ nid = early_cpu_to_node(cpu);
+ BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));
+
+ physnid = emu_nid_to_phys[nid];
+
+ /*
+ * Map the cpu to each emulated node that is allocated on the physical
+ * node of the cpu's apic id.
+ */
+ for_each_online_node(nid)
+ if (emu_nid_to_phys[nid] == physnid)
+ cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
+}
+
+void numa_remove_cpu(unsigned int cpu)
+{
+ int i;
+
+ for_each_online_node(i)
+ cpumask_clear_cpu(cpu, node_to_cpumask_map[i]);
+}
+#else /* !CONFIG_DEBUG_PER_CPU_MAPS */
+static void numa_set_cpumask(int cpu, bool enable)
+{
+ int nid, physnid;
+
+ nid = early_cpu_to_node(cpu);
+ if (nid == NUMA_NO_NODE) {
+ /* early_cpu_to_node() already emits a warning and trace */
+ return;
+ }
+
+ physnid = emu_nid_to_phys[nid];
+
+ for_each_online_node(nid) {
+ if (emu_nid_to_phys[nid] != physnid)
+ continue;
+
+ debug_cpumask_set_cpu(cpu, nid, enable);
+ }
+}
+
+void numa_add_cpu(unsigned int cpu)
+{
+ numa_set_cpumask(cpu, true);
+}
+
+void numa_remove_cpu(unsigned int cpu)
+{
+ numa_set_cpumask(cpu, false);
+}
+#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
#endif
--
2.32.0.3.gf3a3e56d6

2023-10-12 06:06:42

by Ingo Molnar

Subject: Re: [PATCH RFC 1/5] mm/numa: move numa emulation APIs into generic files


* Rongwei Wang <[email protected]> wrote:

> In order to support NUMA EMU for other
> arch, some functions that used by numa_meminfo
> should be moved out x86 arch. mm/numa.c created
> to place above API.
>
> CONFIG_NUMA_EMU will be handled later.
>
> Signed-off-by: Rongwei Wang <[email protected]>
> ---
> arch/x86/include/asm/numa.h | 3 -
> arch/x86/mm/numa.c | 216 +-------------------------
> arch/x86/mm/numa_internal.h | 14 +-
> include/asm-generic/numa.h | 18 +++
> mm/Makefile | 1 +
> mm/numa.c | 298 ++++++++++++++++++++++++++++++++++++
> 6 files changed, 323 insertions(+), 227 deletions(-)
> create mode 100644 mm/numa.c

No objections to moving the x86 NUMA emulation code to mm/numa.c, as long
as it stays similarly functional:

Acked-by: Ingo Molnar <[email protected]>

Thanks,

Ingo

2023-10-12 12:38:14

by Pierre Gondois

Subject: Re: [PATCH RFC 0/5] support NUMA emulation for arm64

Hello Rongwei,

On 10/12/23 04:48, Rongwei Wang wrote:
> A brief introduction
> ====================
>
> The NUMA emulation can fake more node base on a single
> node system, e.g.
>
> one node system:
>
> [root@localhost ~]# numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 31788 MB
> node 0 free: 31446 MB
> node distances:
> node 0
> 0: 10
>
> add numa=fake=2 (fake 2 node on each origin node):
>
> [root@localhost ~]# numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 15806 MB
> node 0 free: 15451 MB
> node 1 cpus: 0 1 2 3 4 5 6 7
> node 1 size: 16029 MB
> node 1 free: 15989 MB
> node distances:
> node 0 1
> 0: 10 10
> 1: 10 10
>
> As above shown, a new node has been faked. As cpus, the realization
> of x86 NUMA emulation is kept. Maybe each node should has 4 cores is
> better (not sure, next to do if so).
>
> Why do this
> ===========
>
> It seems has following reasons:
> (1) In x86 host, apply NUMA emulation can fake more nodes environment
> to test or verify some performance stuff, but arm64 only has
> one method that modify ACPI table to do this. It's troublesome
> more or less.
> (2) Reduce competition for some locks. Here an example we found:
> will-it-scale/tlb_flush1_processes -t 96 -s 10, it shows obvious
> hotspot on lruvec->lock when test in single environment. What's
> more, The performance improved greatly if test in two more nodes
> system. The data shows below (more is better):
>
> ---------------------------------------------------------------------
> threads/process | 1 | 12 | 24 | 48 | 96
> ---------------------------------------------------------------------
> one node | 14 1122 | 110 5372 | 111 2615 | 79 7084 | 72 4516
> ---------------------------------------------------------------------
> numa=fake=2 | 14 1168 | 144 4848 | 215 9070 | 157 0412 | 142 3968
> ---------------------------------------------------------------------
> | For concurrency 12, no lruvec->lock hotspot. For 24,
> hotspot | one node has 24% hotspot on lruvec->lock, but
> | two nodes env hasn't.
> ---------------------------------------------------------------------
>
> As for risks (e.g. numa balance...), they need to be discussed here.
>
> Lastly, this just is a draft, I can improve next if it's acceptable.

I'm not engaging on the utility/relevance of the patch-set, but I tried
them on an arm64 system with the 'numa=fake=2' parameter and could not
see 2 nodes being created under:
/sys/devices/system/node/
Indeed it seems that even though numa_emulation() is moved to a generic
mm/numa.c file, the function is only called from:
arch/x86/mm/numa.c:numa_init()
(or maybe I'm misinterpreting the intent of the patches).

Also I had the following errors when building (still for arm64):
mm/numa.c:862:8: error: implicit declaration of function 'early_cpu_to_node' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
nid = early_cpu_to_node(cpu);
^
mm/numa.c:862:8: note: did you mean 'early_map_cpu_to_node'?
./include/asm-generic/numa.h:37:13: note: 'early_map_cpu_to_node' declared here
void __init early_map_cpu_to_node(unsigned int cpu, int nid);
^
mm/numa.c:874:3: error: implicit declaration of function 'debug_cpumask_set_cpu' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
debug_cpumask_set_cpu(cpu, nid, enable);
^
mm/numa.c:874:3: note: did you mean '__cpumask_set_cpu'?
./include/linux/cpumask.h:474:29: note: '__cpumask_set_cpu' declared here
static __always_inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
^
2 errors generated.

Regards,
Pierre

>
> Thanks!
>
> Rongwei Wang (5):
> mm/numa: move numa emulation APIs into generic files
> mm: percpu: fix variable type of cpu
> arch_numa: remove __init in early_cpu_to_node()
> mm/numa: support CONFIG_NUMA_EMU for arm64
> mm/numa: migrate leftover numa emulation into mm/numa.c
>
> arch/x86/Kconfig | 8 -
> arch/x86/include/asm/numa.h | 3 -
> arch/x86/mm/Makefile | 1 -
> arch/x86/mm/numa.c | 216 +-------------
> arch/x86/mm/numa_internal.h | 14 +-
> drivers/base/arch_numa.c | 7 +-
> include/asm-generic/numa.h | 33 +++
> include/linux/percpu.h | 2 +-
> mm/Kconfig | 8 +
> mm/Makefile | 1 +
> arch/x86/mm/numa_emulation.c => mm/numa.c | 333 +++++++++++++++++++++-
> 11 files changed, 373 insertions(+), 253 deletions(-)
> rename arch/x86/mm/numa_emulation.c => mm/numa.c (63%)
>

2023-10-12 13:31:28

by Rongwei Wang

Subject: Re: [PATCH RFC 0/5] support NUMA emulation for arm64


On 2023/10/12 20:37, Pierre Gondois wrote:
> Hello Rongwei,
>
> On 10/12/23 04:48, Rongwei Wang wrote:
>> A brief introduction
>> ====================
>>
>> The NUMA emulation can fake more node base on a single
>> node system, e.g.
>>
>> one node system:
>>
>> [root@localhost ~]# numactl -H
>> available: 1 nodes (0)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 31788 MB
>> node 0 free: 31446 MB
>> node distances:
>> node   0
>>    0:  10
>>
>> add numa=fake=2 (fake 2 node on each origin node):
>>
>> [root@localhost ~]# numactl -H
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 15806 MB
>> node 0 free: 15451 MB
>> node 1 cpus: 0 1 2 3 4 5 6 7
>> node 1 size: 16029 MB
>> node 1 free: 15989 MB
>> node distances:
>> node   0   1
>>    0:  10  10
>>    1:  10  10
>>
>> As above shown, a new node has been faked. As cpus, the realization
>> of x86 NUMA emulation is kept. Maybe each node should has 4 cores is
>> better (not sure, next to do if so).
>>
>> Why do this
>> ===========
>>
>> It seems has following reasons:
>>    (1) In x86 host, apply NUMA emulation can fake more nodes environment
>>        to test or verify some performance stuff, but arm64 only has
>>        one method that modify ACPI table to do this. It's troublesome
>>        more or less.
>>    (2) Reduce competition for some locks. Here an example we found:
>>        will-it-scale/tlb_flush1_processes -t 96 -s 10, it shows obvious
>>        hotspot on lruvec->lock when test in single environment. What's
>>        more, The performance improved greatly if test in two more nodes
>>        system. The data shows below (more is better):
>>
>> ---------------------------------------------------------------------
>>        threads/process |   1     |     12   |     24   | 48     |   96
>> ---------------------------------------------------------------------
>>        one node        | 14 1122 | 110 5372 | 111 2615 | 79 7084  |
>> 72 4516
>> ---------------------------------------------------------------------
>>        numa=fake=2     | 14 1168 | 144 4848 | 215 9070 | 157 0412 |
>> 142 3968
>> ---------------------------------------------------------------------
>>                        | For concurrency 12, no lruvec->lock hotspot.
>> For 24,
>>        hotspot         | one node has 24% hotspot on lruvec->lock, but
>>                        | two nodes env hasn't.
>> ---------------------------------------------------------------------
>>
>> As for risks (e.g. numa balance...), they need to be discussed here.
>>
>> Lastly, this just is a draft, I can improve next if it's acceptable.
>
> I'm not engaging on the utility/relevance of the patch-set, but I tried
> them on an arm64 system with the 'numa=fake=2' parameter and could not

Sorry, my fault.

I should have mentioned this in the brief introduction: acpi=on numa=fake=2.

The default path of arm64 NUMA initialization is numa_init() ->
dummy_numa_init() if ACPI is turned off (this path has not been taken
into account yet in this patchset; I will address it next).

What's more, if you test this patchset in qemu-kvm, you should add the
parameters below to your script.

-object memory-backend-ram,id=mem0,size=32G \
-numa node,memdev=mem0,cpus=0-7,nodeid=0 \

(The above parameters just make sure the SRAT table has a NUMA
configuration, avoiding the numa_init() -> dummy_numa_init() path.)
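For reference, a fuller qemu invocation along these lines might look as
follows. This is only an illustrative sketch: the machine, kernel, and
console options are placeholders, and the essential part is the
-object/-numa pair describing one SRAT node plus "acpi=on numa=fake=2"
on the guest kernel command line:

```
qemu-system-aarch64 \
    -machine virt,gic-version=3 -cpu host -enable-kvm \
    -smp 8 -m 32G \
    -object memory-backend-ram,id=mem0,size=32G \
    -numa node,memdev=mem0,cpus=0-7,nodeid=0 \
    -kernel Image \
    -append "console=ttyAMA0 root=/dev/vda rw acpi=on numa=fake=2"
```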

> see 2 nodes being created under:
>   /sys/devices/system/node/
> Indeed it seems that even though numa_emulation() is moved to a generic
> mm/numa.c file, the function is only called from:
>   arch/x86/mm/numa.c:numa_init()
> (or maybe I'm misinterpreting the intent of the patches).

Here, drivers/base/arch_numa.c:numa_init() calls numa_emulation() (I
guess it works if you add acpi=on :-)).


>
> Also I had the following errors when building (still for arm64):
> mm/numa.c:862:8: error: implicit declaration of function
> 'early_cpu_to_node' is invalid in C99
> [-Werror,-Wimplicit-function-declaration]
>         nid = early_cpu_to_node(cpu);

It seems CONFIG_DEBUG_PER_CPU_MAPS is enabled in your environment? You
can disable CONFIG_DEBUG_PER_CPU_MAPS and test again.

I have not tested with CONFIG_DEBUG_PER_CPU_MAPS enabled. Your report
is very helpful; I will fix this in the next version.

If you have any questions, please let me know.

Regards,

-wrw

> ^
> mm/numa.c:862:8: note: did you mean 'early_map_cpu_to_node'?
> ./include/asm-generic/numa.h:37:13: note: 'early_map_cpu_to_node'
> declared here
> void __init early_map_cpu_to_node(unsigned int cpu, int nid);
>             ^
> mm/numa.c:874:3: error: implicit declaration of function
> 'debug_cpumask_set_cpu' is invalid in C99
> [-Werror,-Wimplicit-function-declaration]
>                 debug_cpumask_set_cpu(cpu, nid, enable);
>                 ^
> mm/numa.c:874:3: note: did you mean '__cpumask_set_cpu'?
> ./include/linux/cpumask.h:474:29: note: '__cpumask_set_cpu' declared here
> static __always_inline void __cpumask_set_cpu(unsigned int cpu, struct
> cpumask *dstp)
>                             ^
> 2 errors generated.
>
> Regards,
> Pierre
>
>>
>> Thanks!
>>
>> Rongwei Wang (5):
>>    mm/numa: move numa emulation APIs into generic files
>>    mm: percpu: fix variable type of cpu
>>    arch_numa: remove __init in early_cpu_to_node()
>>    mm/numa: support CONFIG_NUMA_EMU for arm64
>>    mm/numa: migrate leftover numa emulation into mm/numa.c
>>
>>   arch/x86/Kconfig                          |   8 -
>>   arch/x86/include/asm/numa.h               |   3 -
>>   arch/x86/mm/Makefile                      |   1 -
>>   arch/x86/mm/numa.c                        | 216 +-------------
>>   arch/x86/mm/numa_internal.h               |  14 +-
>>   drivers/base/arch_numa.c                  |   7 +-
>>   include/asm-generic/numa.h                |  33 +++
>>   include/linux/percpu.h                    |   2 +-
>>   mm/Kconfig                                |   8 +
>>   mm/Makefile                               |   1 +
>>   arch/x86/mm/numa_emulation.c => mm/numa.c | 333 +++++++++++++++++++++-
>>   11 files changed, 373 insertions(+), 253 deletions(-)
>>   rename arch/x86/mm/numa_emulation.c => mm/numa.c (63%)
>>

2023-10-23 13:04:23

by Pierre Gondois

Subject: Re: [PATCH RFC 0/5] support NUMA emulation for arm64

Hello Rongwei,

On 10/12/23 15:30, Rongwei Wang wrote:
>
> On 2023/10/12 20:37, Pierre Gondois wrote:
>> Hello Rongwei,
>>
>> On 10/12/23 04:48, Rongwei Wang wrote:
>>> A brief introduction
>>> ====================
>>>
>>> The NUMA emulation can fake more node base on a single
>>> node system, e.g.
>>>
>>> one node system:
>>>
>>> [root@localhost ~]# numactl -H
>>> available: 1 nodes (0)
>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>> node 0 size: 31788 MB
>>> node 0 free: 31446 MB
>>> node distances:
>>> node   0
>>>    0:  10
>>>
>>> add numa=fake=2 (fake 2 node on each origin node):
>>>
>>> [root@localhost ~]# numactl -H
>>> available: 2 nodes (0-1)
>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>> node 0 size: 15806 MB
>>> node 0 free: 15451 MB
>>> node 1 cpus: 0 1 2 3 4 5 6 7
>>> node 1 size: 16029 MB
>>> node 1 free: 15989 MB
>>> node distances:
>>> node   0   1
>>>    0:  10  10
>>>    1:  10  10
>>>
>>> As above shown, a new node has been faked. As cpus, the realization
>>> of x86 NUMA emulation is kept. Maybe each node should has 4 cores is
>>> better (not sure, next to do if so).
>>>
>>> Why do this
>>> ===========
>>>
>>> It seems has following reasons:
>>>    (1) In x86 host, apply NUMA emulation can fake more nodes environment
>>>        to test or verify some performance stuff, but arm64 only has
>>>        one method that modify ACPI table to do this. It's troublesome
>>>        more or less.
>>>    (2) Reduce competition for some locks. Here an example we found:
>>>        will-it-scale/tlb_flush1_processes -t 96 -s 10, it shows obvious
>>>        hotspot on lruvec->lock when test in single environment. What's
>>>        more, The performance improved greatly if test in two more nodes
>>>        system. The data shows below (more is better):
>>>
>>> ---------------------------------------------------------------------
>>>        threads/process |   1     |     12   |     24   | 48     |   96
>>> ---------------------------------------------------------------------
>>>        one node        | 14 1122 | 110 5372 | 111 2615 | 79 7084  |
>>> 72 4516
>>> ---------------------------------------------------------------------
>>>        numa=fake=2     | 14 1168 | 144 4848 | 215 9070 | 157 0412 |
>>> 142 3968
>>> ---------------------------------------------------------------------
>>>                        | For concurrency 12, no lruvec->lock hotspot.
>>> For 24,
>>>        hotspot         | one node has 24% hotspot on lruvec->lock, but
>>>                        | two nodes env hasn't.
>>> ---------------------------------------------------------------------
>>>
>>> As for risks (e.g. numa balance...), they need to be discussed here.
>>>
>>> Lastly, this just is a draft, I can improve next if it's acceptable.
>>
>> I'm not engaging on the utility/relevance of the patch-set, but I tried
>> them on an arm64 system with the 'numa=fake=2' parameter and could not
>
> Sorry, my fault.
>
> I should mention this in previous brief introduction: acpi=on numa=fake=2.
>
> The default patch of arm64 numa initialize is numa_init() ->
> dummy_numa_init() if turn off acpi (this path has not been taken into
> account yet in this patch, next will to do).
>
> What's more, if you test these patchset in qemu-kvm, you should add
> below parameters in the script.
>
> -object memory-backend-ram,id=mem0,size=32G \
> -numa node,memdev=mem0,cpus=0-7,nodeid=0 \
>
> (Above parameters just make sure SRAT table has NUMA configure, avoiding
> path of numa_init() -> dummy_numa_init())
>
>> see 2 nodes being created under:
>>   /sys/devices/system/node/
>> Indeed it seems that even though numa_emulation() is moved to a generic
>> mm/numa.c file, the function is only called from:
>>   arch/x86/mm/numa.c:numa_init()
>> (or maybe I'm misinterpreting the intent of the patches).
>
> Here drivers/base/arch_numa.c:numa_init() has called numa_emulation() (I
> guess it works if you add acpi=on :-)).

I don't see numa_emulation() being called from drivers/base/arch_numa.c:numa_init()

I have:
$ git grep numa_emulation
arch/x86/mm/numa.c: numa_emulation(&numa_meminfo, numa_distance_cnt);
arch/x86/mm/numa_internal.h:extern void __init numa_emulation(struct numa_meminfo *numa_meminfo,
include/asm-generic/numa.h:void __init numa_emulation(struct numa_meminfo *numa_meminfo,
mm/numa.c:/* Most of this file comes from x86/numa_emulation.c */
mm/numa.c: * numa_emulation - Emulate NUMA nodes
mm/numa.c:void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
so from this, an arm64-based platform should not be able to call numa_emulation().

Is it possible to add a call to dump_stack() in numa_emulation() to see the call stack?

The branch I'm using is based on v6.6-rc5 and has the 5 patches applied:
2af398a87cc7 mm/numa: migrate leftover numa emulation into mm/numa.c
c8e314fb23be mm/numa: support CONFIG_NUMA_EMU for arm64
335b7219d40e arch_numa: remove __init in early_cpu_to_node()
d9358adf1cdc mm: percpu: fix variable type of cpu
1ffbe40a00f5 mm/numa: move numa emulation APIs into generic files
94f6f0550c62 (tag: v6.6-rc5) Linux 6.6-rc5

Regards,
Pierre

>
>
>>
>> Also I had the following errors when building (still for arm64):
>> mm/numa.c:862:8: error: implicit declaration of function
>> 'early_cpu_to_node' is invalid in C99
>> [-Werror,-Wimplicit-function-declaration]
>>         nid = early_cpu_to_node(cpu);
>
> It seems CONFIG_DEBUG_PER_CPU_MAPS enabled in your environment? You can
> disable CONFIG_DEBUG_PER_CPU_MAPS and test it again.
>
> I have not test it with CONFIG_DEBUG_PER_CPU_MAPS enabled. It's very
> helpful, I will fix it next time.
>
> If you have any questions, please let me know.
>
> Regards,
>
> -wrw
>
>> ^
>> mm/numa.c:862:8: note: did you mean 'early_map_cpu_to_node'?
>> ./include/asm-generic/numa.h:37:13: note: 'early_map_cpu_to_node'
>> declared here
>> void __init early_map_cpu_to_node(unsigned int cpu, int nid);
>>             ^
>> mm/numa.c:874:3: error: implicit declaration of function
>> 'debug_cpumask_set_cpu' is invalid in C99
>> [-Werror,-Wimplicit-function-declaration]
>>                 debug_cpumask_set_cpu(cpu, nid, enable);
>>                 ^
>> mm/numa.c:874:3: note: did you mean '__cpumask_set_cpu'?
>> ./include/linux/cpumask.h:474:29: note: '__cpumask_set_cpu' declared here
>> static __always_inline void __cpumask_set_cpu(unsigned int cpu, struct
>> cpumask *dstp)
>>                             ^
>> 2 errors generated.
>>
>> Regards,
>> Pierre
>>
>>>
>>> Thanks!
>>>
>>> Rongwei Wang (5):
>>>    mm/numa: move numa emulation APIs into generic files
>>>    mm: percpu: fix variable type of cpu
>>>    arch_numa: remove __init in early_cpu_to_node()
>>>    mm/numa: support CONFIG_NUMA_EMU for arm64
>>>    mm/numa: migrate leftover numa emulation into mm/numa.c
>>>
>>>   arch/x86/Kconfig                          |   8 -
>>>   arch/x86/include/asm/numa.h               |   3 -
>>>   arch/x86/mm/Makefile                      |   1 -
>>>   arch/x86/mm/numa.c                        | 216 +-------------
>>>   arch/x86/mm/numa_internal.h               |  14 +-
>>>   drivers/base/arch_numa.c                  |   7 +-
>>>   include/asm-generic/numa.h                |  33 +++
>>>   include/linux/percpu.h                    |   2 +-
>>>   mm/Kconfig                                |   8 +
>>>   mm/Makefile                               |   1 +
>>>   arch/x86/mm/numa_emulation.c => mm/numa.c | 333 +++++++++++++++++++++-
>>>   11 files changed, 373 insertions(+), 253 deletions(-)
>>>   rename arch/x86/mm/numa_emulation.c => mm/numa.c (63%)
>>>

2024-02-20 11:36:58

by Rongwei Wang

Subject: [PATCH v1 0/2] support NUMA emulation for genertic arch

A brief introduction
====================

NUMA emulation can fake more nodes based on a single-node
system, e.g.

one node system:

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31788 MB
node 0 free: 31446 MB
node distances:
node 0
0: 10

add numa=fake=2 (fake 2 node on each origin node):

[root@localhost ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 15806 MB
node 0 free: 15451 MB
node 1 cpus: 0 1 2 3 4 5 6 7
node 1 size: 16029 MB
node 1 free: 15989 MB
node distances:
node 0 1
0: 10 10
1: 10 10

As shown above, a new node has been faked. For CPUs, the behavior of
the x86 NUMA emulation is kept. Maybe it would be better for each node
to have 4 cores (not sure; something to do next if so).

Why do this
===========

There are the following reasons:
(1) On an x86 host, NUMA emulation can fake a multi-node environment
to test or verify performance work, but arm64 has only one way to
do this: modifying the ACPI table. That is more or less troublesome.
(2) Reduce contention on some locks. Here is an example we found:
will-it-scale/tlb_flush1_processes -t 96 -s 10 shows an obvious
hotspot on lruvec->lock when run in a single-node environment.
What's more, the performance improves greatly when run on a
two-node system. The data is shown below (higher is better):

---------------------------------------------------------------------
threads/process | 1 | 12 | 24 | 48 | 96
---------------------------------------------------------------------
one node | 14 1122 | 110 5372 | 111 2615 | 79 7084 | 72 4516
---------------------------------------------------------------------
numa=fake=2 | 14 1168 | 144 4848 | 215 9070 | 157 0412 | 142 3968
---------------------------------------------------------------------
| For concurrency 12, no lruvec->lock hotspot. For 24,
hotspot | one node has 24% hotspot on lruvec->lock, but
| two nodes env hasn't.
---------------------------------------------------------------------

As for risks (e.g. NUMA balancing...), they need to be discussed here.

Lastly, it seems not a good choice to implement x86 and the other
genertic archs separately, but it can indeed avoid some
architecture-related API adjustments and alleviate future maintenance.
For the previous RFC, see [1].

Any advice is welcome. Thanks!

Change log
==========

RFC v1 -> v1
* add new CONFIG_NUMA_FAKE for genertic archs.
* keep x86 implementation; realize numa emulation in drivers/base/ for
genertic arch, e.g., arm64.

[1] RFC v1: https://patchwork.kernel.org/project/linux-arm-kernel/cover/[email protected]/

Rongwei Wang (2):
arch_numa: remove __init for early_cpu_to_node
numa: introduce numa emulation for genertic arch

drivers/base/Kconfig | 9 +
drivers/base/Makefile | 1 +
drivers/base/arch_numa.c | 32 +-
drivers/base/numa_emulation.c | 909 ++++++++++++++++++++++++++++++++++
drivers/base/numa_emulation.h | 41 ++
include/asm-generic/numa.h | 2 +-
6 files changed, 992 insertions(+), 2 deletions(-)
create mode 100644 drivers/base/numa_emulation.c
create mode 100644 drivers/base/numa_emulation.h

--
2.32.0.3.gf3a3e56d6


2024-02-20 11:37:09

by Rongwei Wang

Subject: [PATCH v1 1/2] arch_numa: remove __init for early_cpu_to_node

Next, in order to support other architectures, early_cpu_to_node()
will be called from a generic function. We have to remove '__init'
to avoid section mismatch warnings during compilation.

Signed-off-by: Rongwei Wang <[email protected]>
Signed-off-by: Teng Ma <[email protected]>
---
drivers/base/arch_numa.c | 2 +-
include/asm-generic/numa.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 5b59d133b6af..90519d981471 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -144,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid)
unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
EXPORT_SYMBOL(__per_cpu_offset);

-int __init early_cpu_to_node(int cpu)
+int early_cpu_to_node(int cpu)
{
return cpu_to_node_map[cpu];
}
diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h
index c32e0cf23c90..16073111bffc 100644
--- a/include/asm-generic/numa.h
+++ b/include/asm-generic/numa.h
@@ -35,7 +35,7 @@ int __init numa_add_memblk(int nodeid, u64 start, u64 end);
void __init numa_set_distance(int from, int to, int distance);
void __init numa_free_distance(void);
void __init early_map_cpu_to_node(unsigned int cpu, int nid);
-int __init early_cpu_to_node(int cpu);
+int early_cpu_to_node(int cpu);
void numa_store_cpu_info(unsigned int cpu);
void numa_add_cpu(unsigned int cpu);
void numa_remove_cpu(unsigned int cpu);
--
2.32.0.3.gf3a3e56d6


2024-02-21 06:12:57

by Mike Rapoport

Subject: Re: [PATCH v1 0/2] support NUMA emulation for genertic arch

On Tue, Feb 20, 2024 at 07:36:00PM +0800, Rongwei Wang wrote:
> A brief introduction
> ====================
>
> The NUMA emulation can fake more node base on a single
> node system, e.g.

..

> Lastly, it seems not a good choice to realize x86 and other genertic
> archs separately. But it can indeed avoid some architecture related
> APIs adjustments and alleviate future maintenance.

Why is it a good choice? Copying 1k lines from x86 to a new place and
having to maintain two copies does not sound like a good choice to me.

> The previous RFC link see [1].
>
> Any advice are welcome, Thanks!
>
> Change log
> ==========
>
> RFC v1 -> v1
> * add new CONFIG_NUMA_FAKE for genertic archs.
> * keep x86 implementation, realize numa emulation in driver/base/ for
> genertic arch, e.g, arm64.
>
> [1] RFC v1: https://patchwork.kernel.org/project/linux-arm-kernel/cover/[email protected]/
>
> Rongwei Wang (2):
> arch_numa: remove __init for early_cpu_to_node
> numa: introduce numa emulation for genertic arch
>
> drivers/base/Kconfig | 9 +
> drivers/base/Makefile | 1 +
> drivers/base/arch_numa.c | 32 +-
> drivers/base/numa_emulation.c | 909 ++++++++++++++++++++++++++++++++++
> drivers/base/numa_emulation.h | 41 ++
> include/asm-generic/numa.h | 2 +-
> 6 files changed, 992 insertions(+), 2 deletions(-)
> create mode 100644 drivers/base/numa_emulation.c
> create mode 100644 drivers/base/numa_emulation.h
>
> --
> 2.32.0.3.gf3a3e56d6
>
>

--
Sincerely yours,
Mike.

2024-02-21 15:52:25

by Pierre Gondois

Subject: Re: [PATCH v1 0/2] support NUMA emulation for genertic arch



On 2/21/24 07:12, Mike Rapoport wrote:
> On Tue, Feb 20, 2024 at 07:36:00PM +0800, Rongwei Wang wrote:
>> A brief introduction
>> ====================
>>
>> The NUMA emulation can fake more node base on a single
>> node system, e.g.
>
> ...
>
>> Lastly, it seems not a good choice to realize x86 and other genertic
>> archs separately. But it can indeed avoid some architecture related
>> APIs adjustments and alleviate future maintenance.
>
> Why is it a good choice? Copying 1k lines from x86 to a new place and
> having to maintain two copies does not sound like a good choice to me.

I agree it would be better to avoid duplication and extract the common
code from the original x86 implementation. The RFC seemed to go more
in this direction.
Also NITs:
- genertic -> generic
- there is a 'ifdef CONFIG_X86' in drivers/base/numa_emulation.c,
but the file should not be used by x86 as the arch doesn't set
CONFIG_GENERIC_ARCH_NUMA

Regards,
Pierre

>
>> The previous RFC link see [1].
>>
>> Any advice are welcome, Thanks!
>>
>> Change log
>> ==========
>>
>> RFC v1 -> v1
>> * add new CONFIG_NUMA_FAKE for genertic archs.
>> * keep x86 implementation, realize numa emulation in driver/base/ for
>> genertic arch, e.g, arm64.
>>
>> [1] RFC v1: https://patchwork.kernel.org/project/linux-arm-kernel/cover/[email protected]/
>>
>> Rongwei Wang (2):
>> arch_numa: remove __init for early_cpu_to_node
>> numa: introduce numa emulation for genertic arch
>>
>> drivers/base/Kconfig | 9 +
>> drivers/base/Makefile | 1 +
>> drivers/base/arch_numa.c | 32 +-
>> drivers/base/numa_emulation.c | 909 ++++++++++++++++++++++++++++++++++
>> drivers/base/numa_emulation.h | 41 ++
>> include/asm-generic/numa.h | 2 +-
>> 6 files changed, 992 insertions(+), 2 deletions(-)
>> create mode 100644 drivers/base/numa_emulation.c
>> create mode 100644 drivers/base/numa_emulation.h
>>
>> --
>> 2.32.0.3.gf3a3e56d6
>>
>>
>

2024-02-29 03:26:53

by Rongwei Wang

Subject: Re: [PATCH v1 0/2] support NUMA emulation for genertic arch



On 2/21/24 11:51 PM, Pierre Gondois wrote:
>
>
> On 2/21/24 07:12, Mike Rapoport wrote:
>> On Tue, Feb 20, 2024 at 07:36:00PM +0800, Rongwei Wang wrote:
>>> A brief introduction
>>> ====================
>>>
>>> The NUMA emulation can fake more node base on a single
>>> node system, e.g.
>>
>> ...
>>> Lastly, it seems not a good choice to realize x86 and other genertic
>>> archs separately. But it can indeed avoid some architecture related
>>> APIs adjustments and alleviate future maintenance.
>>
>> Why is it a good choice? Copying 1k lines from x86 to a new place and
>> having to maintain two copies does not sound like a good choice to me.
Hi Pierre
> I agree it would be better to avoid duplication and extract the common
> code from the original x86 implementation. The RFC seemed to go more
> in this direction.
> Also NITs:
> - genertic -> generic
Thanks, my fault; Zhaoyu also found this (thanks).
> - there is a 'ifdef CONFIG_X86' in drivers/base/numa_emulation.c,
>   but the file should not be used by x86 as the arch doesn't set
>   CONFIG_GENERIC_ARCH_NUMA
>
Actually, I had not thought carefully about how to frame the question.
I also tried the original direction of the RFC version, but found that
many APIs need to be updated, and there are many APIs that are similar
with small differences. That seems to require much modification in
more than one arch if we go in the original direction.

But if everyone thinks the original method is right, I will continue
with the RFC version.

Thanks for your time to review.
> Regards,
> Pierre
>
>>
>>> The previous RFC link see [1].
>>>
>>> Any advice are welcome, Thanks!
>>>
>>> Change log
>>> ==========
>>>
>>> RFC v1 -> v1
>>> * add new CONFIG_NUMA_FAKE for genertic archs.
>>> * keep x86 implementation, realize numa emulation in driver/base/ for
>>>    genertic arch, e.g, arm64.
>>>
>>> [1] RFC v1:
>>> https://patchwork.kernel.org/project/linux-arm-kernel/cover/[email protected]/
>>>
>>> Rongwei Wang (2):
>>>    arch_numa: remove __init for early_cpu_to_node
>>>    numa: introduce numa emulation for genertic arch
>>>
>>>   drivers/base/Kconfig          |   9 +
>>>   drivers/base/Makefile         |   1 +
>>>   drivers/base/arch_numa.c      |  32 +-
>>>   drivers/base/numa_emulation.c | 909
>>> ++++++++++++++++++++++++++++++++++
>>>   drivers/base/numa_emulation.h |  41 ++
>>>   include/asm-generic/numa.h    |   2 +-
>>>   6 files changed, 992 insertions(+), 2 deletions(-)
>>>   create mode 100644 drivers/base/numa_emulation.c
>>>   create mode 100644 drivers/base/numa_emulation.h
>>>
>>> --
>>> 2.32.0.3.gf3a3e56d6
>>>
>>>
>>

--
Thanks,
-wrw