2013-04-05 09:37:22

by Tang Chen

Subject: [PATCH 00/11] Introduce movablemem_map=acpi boot option.

Before this patch-set, we introduced the movablemem_map boot option, which
allowed users to specify physical address ranges in which memory would be set
as movable. This was not user-friendly enough for normal users.

So now, we introduce just movablemem_map=acpi, which lets users enable or
disable the kernel's use of the Hot Pluggable bit in the SRAT to determine
which memory ranges are hotpluggable and set them as ZONE_MOVABLE.

This patch-set is based on Yinghai's patch-set:
v1: https://lkml.org/lkml/2013/3/7/642
v2: https://lkml.org/lkml/2013/3/10/47

So it supports allocating pagetable pages on local nodes.

We have also split the large patch-set into smaller patches, which should be easier to review.


========================================================================
[What we are doing]
This patchset introduces a boot option for users to specify the ZONE_MOVABLE
memory map for each node in the system. Users can use it as follows:

1. movablemem_map=acpi
In this way, the kernel will use the Hot Pluggable bit in the SRAT to
determine ZONE_MOVABLE for each node. Any memory ranges the user has
specified will be ignored.


[Why we do this]
If we hot-remove a memory device, it cannot contain kernel memory,
because Linux cannot currently migrate kernel memory. Therefore, we
have to guarantee that the hot-removed memory contains only movable
memory.
(Here is an exception: when we implement the node hotplug functionality,
kernel memory whose life cycle is the same as the node's, such as
pagetables, vmemmap, and so on, can still be put on the local node even
though the kernel cannot migrate it, because we can free it before we
hot-remove the node. This is not completely implemented yet.)

Linux has two boot options, kernelcore= and movablecore=, for
creating movable memory. These boot options can specify the amount
of memory to use as kernel or movable memory. Using them, we can
create ZONE_MOVABLE which has only movable memory.
(NOTE: doing this will hurt NUMA performance because the kernel won't
be able to distribute kernel memory evenly across nodes.)

But they do not fulfill a requirement of memory hot-remove, because
even if we specify these boot options, movable memory is distributed
evenly across nodes. So when we want to hot-remove memory whose
range is, say, 0x80000000-0xc0000000, we have no way to specify that
particular memory as movable.
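For example, booting with an option like the following (value illustrative)
only controls how much memory in total stays usable by the kernel, spread
across nodes:

	kernelcore=512M

It gives us no way to say "make exactly 0x80000000-0xc0000000 movable".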

Furthermore, even if we can use the SRAT, users still need an interface
to enable/disable this functionality if they don't want to lose their
NUMA performance. So we think a user interface is always needed.

So we propose this new feature, which lets users enable or disable the
kernel setting hotpluggable memory as ZONE_MOVABLE.


[Ways to do this]
There may be 2 ways to specify movable memory.
1. use firmware information
2. use boot option

1. use firmware information
According to the ACPI 5.0 spec, the SRAT table has a memory affinity
structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
Memory Affinity Structure". If we use this information, we can let
firmware specify movable memory. For example, if the Hot Pluggable
Field is set, Linux sets the memory as movable memory.

2. use boot option
This is our proposal. A new boot option can specify memory ranges to use
as movable memory.
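As a rough sketch of the first way, which this series adopts, the test for
the Hot Pluggable bit is a simple flag check on each SRAT memory affinity
structure (a sketch only, with a hypothetical helper name and error handling
omitted; see patch 02 for the real code):

	/* Sketch: decide hotpluggability from an SRAT memory affinity entry. */
	static bool __init srat_mem_is_hotpluggable(struct acpi_srat_mem_affinity *ma)
	{
		/* ACPI 5.0, 5.2.16.2: bit 1 of the flags is Hot Pluggable. */
		return ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
	}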


[How we do this]
We now propose a boot option that supports the first way above. A boot
option is always needed because setting memory as movable degrades NUMA
performance. So, at the least, we need an interface to enable/disable it,
so that users who don't want to use the memory hotplug functionality will
also be happy.


[How to use]
Specify movablemem_map=acpi in kernel commandline:
SRAT:          |_____| |_____| |_________| |_________| ......
node id:          0       1        1           2
hotpluggable:     n       y        y           n
ZONE_MOVABLE:          |_____| |_________|
NOTE: 1) Before parsing SRAT, memblock has already reserved some memory
         ranges for other purposes, such as for the kernel image. We cannot
         prevent the kernel from using this memory, so we need to exclude it
         even if it is hotpluggable.
         Furthermore, to ensure the kernel has enough memory to boot, we make
         all the memory on any node the kernel resides in un-hotpluggable.
      2) In this case, all user-specified memory ranges will be ignored.

We also need to consider the following points:
1) Using this boot option could degrade NUMA performance because kernel
   memory will not be distributed evenly across nodes. Users who don't want
   to lose their NUMA performance should simply not use it.
2) If kernelcore or movablecore is also specified, movablemem_map has the
   higher priority to be satisfied.
3) This option has no conflict with the memmap option.
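For reference, enabling the feature is just a matter of appending the option
to the kernel command line, e.g. in a boot loader entry (paths hypothetical):

	linux /boot/vmlinuz root=/dev/sda1 movablemem_map=acpi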

Tang Chen (10):
acpi: Print hotplug info in SRAT.
numa, acpi, memory-hotplug: Add movablemem_map=acpi boot option.
x86, numa, acpi, memory-hotplug: Introduce hotplug info into struct
numa_meminfo.
x86, numa, acpi, memory-hotplug: Consider hotplug info when cleanup
numa_meminfo.
X86, numa, acpi, memory-hotplug: Add hotpluggable ranges to
movablemem_map.
x86, numa, acpi, memory-hotplug: Make any node which the kernel
resides in un-hotpluggable.
x86, numa, acpi, memory-hotplug: Introduce zone_movable_limit[] to
store start pfn of ZONE_MOVABLE.
x86, numa, acpi, memory-hotplug: Sanitize zone_movable_limit[].
x86, numa, acpi, memory-hotplug: make movablemem_map have higher
priority
x86, numa, acpi, memory-hotplug: Memblock limit with movablemem_map

Yasuaki Ishimatsu (1):
x86: get pg_data_t's memory from other node

Documentation/kernel-parameters.txt | 11 ++
arch/x86/include/asm/numa.h | 3 +-
arch/x86/kernel/apic/numaq_32.c | 2 +-
arch/x86/mm/amdtopology.c | 3 +-
arch/x86/mm/numa.c | 92 ++++++++++++++--
arch/x86/mm/numa_internal.h | 1 +
arch/x86/mm/srat.c | 28 ++++-
include/linux/memblock.h | 2 +
include/linux/mm.h | 19 +++
mm/memblock.c | 50 ++++++++
mm/page_alloc.c | 210 ++++++++++++++++++++++++++++++++++-
11 files changed, 399 insertions(+), 22 deletions(-)


2013-04-05 09:37:26

by Tang Chen

Subject: [PATCH 02/11] acpi: Print hotplug info in SRAT.

The Hot Pluggable field in the SRAT indicates whether the memory can be
hotplugged while the system is running. It is useful to print out
this info when parsing the SRAT.
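
With this change, the boot log would show lines roughly like the following
(addresses made up):

	SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
	SRAT: Node 1 PXM 1 [mem 0x80000000-0xffffffff] Hot Pluggable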

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/srat.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 443f9ef..5055fa7 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -146,6 +146,7 @@ int __init
acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
{
u64 start, end;
+ u32 hotpluggable;
int node, pxm;

if (srat_disabled())
@@ -154,7 +155,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
goto out_err_bad_srat;
if ((ma->flags & ACPI_SRAT_MEM_ENABLED) == 0)
goto out_err;
- if ((ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) && !save_add_info())
+ hotpluggable = ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
+ if (hotpluggable && !save_add_info())
goto out_err;

start = ma->base_address;
@@ -174,9 +176,10 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)

node_set(node, numa_nodes_parsed);

- printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
+ printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
node, pxm,
- (unsigned long long) start, (unsigned long long) end - 1);
+ (unsigned long long) start, (unsigned long long) end - 1,
+ hotpluggable ? "Hot Pluggable" : "");

return 0;
out_err_bad_srat:
--
1.7.1

2013-04-05 09:37:24

by Tang Chen

Subject: [PATCH 01/11] x86: get pg_data_t's memory from other node

From: Yasuaki Ishimatsu <[email protected]>

If the system can create a movable node, in which all memory of the
node is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate
the node's pg_data_t from the node itself.
So, use memblock_alloc_try_nid() instead of memblock_alloc_nid(), so
the allocation is retried on other nodes when the node-local attempt
fails.
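
For reference, the fallback relied on here is roughly the following (a
simplified sketch of the memblock helper of that era, not the verbatim
implementation):

	phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size,
						  phys_addr_t align, int nid)
	{
		/* First try to satisfy the allocation on the requested node. */
		phys_addr_t res = memblock_alloc_nid(size, align, nid);

		if (res)
			return res;
		/* Otherwise fall back to any accessible memory. */
		return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
	}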

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/mm/numa.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 11acdf6..4f754e6 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -214,10 +214,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
* Allocate node data. Try node-local memory and then any node.
* Never allocate in DMA zone.
*/
- nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
+ nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
- nd_size, nid);
+ pr_err("Cannot find %zu bytes in any node\n", nd_size);
return;
}
nd = __va(nd_pa);
--
1.7.1

2013-04-05 09:37:57

by Tang Chen

Subject: [PATCH 10/11] x86, numa, acpi, memory-hotplug: make movablemem_map have higher priority

If kernelcore or movablecore is specified at the same time with
movablemem_map, movablemem_map will have higher priority to be
satisfied. This patch will make find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] with the limit from zone_movable_limit[].
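
To illustrate with made-up numbers: if movablemem_map computed
zone_movable_limit[1] at the 4GB pfn and kernelcore still needs pages on
node 1, the search only counts node 1 memory below that limit toward
kernelcore; everything above it stays ZONE_MOVABLE, so the movablemem_map
request wins.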

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wen Congyang <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
Tested-by: Lin Feng <[email protected]>
---
mm/page_alloc.c | 28 +++++++++++++++++++++++++---
1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f800aec..5db286f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4872,9 +4872,17 @@ static void __init find_zone_movable_pfns_for_nodes(void)
required_kernelcore = max(required_kernelcore, corepages);
}

- /* If kernelcore was not specified, there is no ZONE_MOVABLE */
- if (!required_kernelcore)
+ /*
+ * If neither kernelcore/movablecore nor movablemem_map is specified,
+ * there is no ZONE_MOVABLE. But if movablemem_map is specified, the
+ * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
+ */
+ if (!required_kernelcore) {
+ if (movablemem_map.nr_map)
+ memcpy(zone_movable_pfn, zone_movable_limit,
+ sizeof(zone_movable_pfn));
goto out;
+ }

/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
@@ -4904,10 +4912,24 @@ restart:
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
unsigned long size_pages;

+ /*
+ * Find more memory for kernelcore in
+ * [zone_movable_pfn[nid], zone_movable_limit[nid]).
+ */
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
if (start_pfn >= end_pfn)
continue;

+ if (zone_movable_limit[nid]) {
+ end_pfn = min(end_pfn, zone_movable_limit[nid]);
+ /* No range left for kernelcore in this node */
+ if (start_pfn >= end_pfn) {
+ zone_movable_pfn[nid] =
+ zone_movable_limit[nid];
+ break;
+ }
+ }
+
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
unsigned long kernel_pages;
@@ -4967,12 +4989,12 @@ restart:
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;

+out:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
for (nid = 0; nid < MAX_NUMNODES; nid++)
zone_movable_pfn[nid] =
roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);

-out:
/* restore the node_state */
node_states[N_MEMORY] = saved_node_state;
}
--
1.7.1

2013-04-05 09:37:55

by Tang Chen

Subject: [PATCH 09/11] x86, numa, acpi, memory-hotplug: Sanitize zone_movable_limit[].

As mentioned by Liu Jiang and Wu Jianguo, users could specify DMA,
DMA32, and HIGHMEM ranges as movable. In order to ensure the kernel
will work correctly, we should exclude these memory ranges from
zone_movable_limit[].

NOTE: find_usable_zone_for_movable() must be called first to initialize
      movable_zone so that sanitize_zone_movable_limit() can use it.
      This was pointed out by Wu Jianguo <[email protected]>.
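
As a made-up example: with ZONE_DMA32 ending at pfn 0x100000 (4GB), a node
whose zone_movable_limit[] came out as pfn 0x40000 (1GB) is raised to pfn
0x100000, so that ZONE_MOVABLE never swallows DMA32 memory the kernel may
need for device allocations.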

Reported-by: Wu Jianguo <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Liu Jiang <[email protected]>
Reviewed-by: Wen Congyang <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
Tested-by: Lin Feng <[email protected]>
---
mm/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 53 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b97bdb5..f800aec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4412,6 +4412,57 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
}

+/**
+ * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
+ *
+ * zone_movable_limit[] has been initialized while parsing the SRAT or
+ * movablemem_map. This function will try to exclude ZONE_DMA, ZONE_DMA32,
+ * and HIGHMEM from zone_movable_limit[].
+ *
+ * zone_movable_limit[nid] == 0 means no limit for the node.
+ *
+ * Note: Need to be called with movable_zone initialized.
+ */
+static void __meminit sanitize_zone_movable_limit(void)
+{
+ int nid;
+
+ if (!movablemem_map.nr_map)
+ return;
+
+ /* Iterate each node id. */
+ for_each_node(nid) {
+ /* If we have no limit for this node, just skip it. */
+ if (!zone_movable_limit[nid])
+ continue;
+
+#ifdef CONFIG_ZONE_DMA
+ /* Skip DMA memory. */
+ if (zone_movable_limit[nid] <
+ arch_zone_highest_possible_pfn[ZONE_DMA])
+ zone_movable_limit[nid] =
+ arch_zone_highest_possible_pfn[ZONE_DMA];
+#endif
+
+#ifdef CONFIG_ZONE_DMA32
+ /* Skip DMA32 memory. */
+ if (zone_movable_limit[nid] <
+ arch_zone_highest_possible_pfn[ZONE_DMA32])
+ zone_movable_limit[nid] =
+ arch_zone_highest_possible_pfn[ZONE_DMA32];
+#endif
+
+#ifdef CONFIG_HIGHMEM
+ /* Skip lowmem if ZONE_MOVABLE is highmem. */
+ if (zone_movable_is_highmem() &&
+ zone_movable_limit[nid] <
+ arch_zone_lowest_possible_pfn[ZONE_HIGHMEM])
+ zone_movable_limit[nid] =
+ arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
+#endif
+ }
+}
+
#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
@@ -4826,7 +4877,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
goto out;

/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
- find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];

restart:
@@ -4985,6 +5035,8 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)

/* Find the PFNs that ZONE_MOVABLE begins at in each node */
memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
+ find_usable_zone_for_movable();
+ sanitize_zone_movable_limit();
find_zone_movable_pfns_for_nodes();

/* Print out the zone ranges */
--
1.7.1

2013-04-05 09:38:36

by Tang Chen

Subject: [PATCH 07/11] x86, numa, acpi, memory-hotplug: Make any node which the kernel resides in un-hotpluggable.

Before parsing SRAT, memblock has already reserved some memory ranges
for other purposes, such as for the kernel image. We cannot prevent the
kernel from using this memory. Furthermore, if all the memory is
hotpluggable and we set all of it as movable, the system won't have
enough memory to boot. So we always set the nodes which the kernel
resides in as non-movable.
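
For example (addresses illustrative): if memblock reserved the kernel image
at [16MB, 48MB) before SRAT parsing, the SRAT entry covering that range
overlaps a reserved region, so its node is recorded in
movablemem_map.numa_nodes_kernel and skipped later, even if the firmware
marks that range as hotpluggable.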

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 25 +++++++++++++++++++------
arch/x86/mm/srat.c | 17 ++++++++++++++++-
include/linux/mm.h | 1 +
3 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 73e7934..dcaf248 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -736,24 +736,37 @@ static void __init early_x86_numa_init_mapping(void)
* we will put pagetable pages in local node even if the memory of that node is
* hotpluggable.
*
- * If users specify movablemem_map=acpi, then:
+ * And, when the kernel is booting, memblock has reserved some memory for other
+ * purposes, such as storing the kernel image. We cannot prevent the kernel from
+ * using this kind of memory. So whatever node the kernel resides in should be
+ * un-hotpluggable, because if all the memory is hotpluggable, and is set as
+ * movable, the kernel won't have enough memory to boot.
+ *
+ * It works like this:
+ * If users specify movablemem_map=acpi, then
*
* SRAT: |_____| |_____| |_________| |_________| ......
* node id: 0 1 1 2
- * hotpluggable: n y y n
+ * hotpluggable: y y y n
+ * kernel resides in: y n n n
* movablemem_map: |_____| |_________|
*/
static void __init early_mem_hotplug_init()
{
- int i;
+ int i, nid;

if (!movablemem_map.acpi)
return;

for (i = 0; i < numa_meminfo.nr_blks; i++) {
- if (numa_meminfo.blk[i].hotpluggable)
- movablemem_map_add_region(numa_meminfo.blk[i].start,
- numa_meminfo.blk[i].end);
+ nid = numa_meminfo_all.blk[i].nid;
+
+ if (node_isset(nid, movablemem_map.numa_nodes_kernel) ||
+ !numa_meminfo.blk[i].hotpluggable)
+ continue;
+
+ movablemem_map_add_region(numa_meminfo.blk[i].start,
+ numa_meminfo.blk[i].end);
}
}
#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index f7f6fd4..0b5904e 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -147,7 +147,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
{
u64 start, end;
u32 hotpluggable;
- int node, pxm;
+ int node, pxm, i;
+ struct memblock_type *rgn = &memblock.reserved;

if (srat_disabled())
goto out_err;
@@ -176,6 +177,20 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)

node_set(node, numa_nodes_parsed);

+ /*
+ * Before parsing SRAT, memblock has reserved some memory for other
+ * purpose, such as storing kernel image. We cannot prevent the kernel
+ * from using this kind of memory. So just mark which nodes the kernel
+ * resides in, and set these nodes un-hotpluggable later.
+ */
+ for (i = 0; i < rgn->cnt; i++) {
+ if (end <= rgn->regions[i].base ||
+ start >= rgn->regions[i].base + rgn->regions[i].size)
+ continue;
+
+ node_set(node, movablemem_map.numa_nodes_kernel);
+ }
+
printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
node, pxm,
(unsigned long long) start, (unsigned long long) end - 1,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7468221..2835c91 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1342,6 +1342,7 @@ struct movablemem_map {
bool acpi;
int nr_map;
struct movablemem_entry map[MOVABLEMEM_MAP_MAX];
+ nodemask_t numa_nodes_kernel; /* on which nodes kernel resides in */
};

extern struct movablemem_map movablemem_map;
--
1.7.1

2013-04-05 09:38:51

by Tang Chen

Subject: [PATCH 08/11] x86, numa, acpi, memory-hotplug: Introduce zone_movable_limit[] to store start pfn of ZONE_MOVABLE.

Since node info in the SRAT may not be in increasing order, we may
encounter a lower range after having handled a higher one. So we need
to keep the lowest movable pfn seen so far each time we parse an SRAT
memory entry, and update it when we get a lower one.

This patch introduces a new array, zone_movable_limit[], which is used
to store the start pfn of each node's ZONE_MOVABLE.

We update it, when necessary, each time an SRAT memory entry is parsed.
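
A small worked example (pfns made up): if the SRAT lists node 1 as
[4GB, 6GB) first and [2GB, 4GB) later, the first entry sets
zone_movable_limit[1] to PFN_DOWN(4GB) and the second lowers it to
PFN_DOWN(2GB), so ZONE_MOVABLE on node 1 starts at the lowest hotpluggable
pfn regardless of SRAT entry order.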

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 16 ++++++++++++++--
include/linux/mm.h | 2 ++
mm/page_alloc.c | 1 +
3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index dcaf248..8cbe8a0 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -727,7 +727,8 @@ static void __init early_x86_numa_init_mapping(void)

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
/**
- * early_mem_hotplug_init - Add hotpluggable memory ranges to movablemem_map.
+ * early_mem_hotplug_init - Add hotpluggable memory ranges to movablemem_map,
+ * and initialize zone_movable_limit.
*
* This function scans numa_meminfo.blk[], and adds all the hotpluggable memory
* ranges to movablemem_map. movablemem_map can be used to prevent memblock
@@ -750,6 +751,10 @@ static void __init early_x86_numa_init_mapping(void)
* hotpluggable: y y y n
* kernel resides in: y n n n
* movablemem_map: |_____| |_________|
+ *
+ * This function will also initialize zone_movable_limit[].
+ * ZONE_MOVABLE of node i should start at least from zone_movable_limit[i].
+ * zone_movable_limit[i] == 0 means there is no limitation for node i.
*/
static void __init early_mem_hotplug_init()
{
@@ -759,7 +764,7 @@ static void __init early_mem_hotplug_init()
return;

for (i = 0; i < numa_meminfo.nr_blks; i++) {
- nid = numa_meminfo_all.blk[i].nid;
+ nid = numa_meminfo.blk[i].nid;

if (node_isset(nid, movablemem_map.numa_nodes_kernel) ||
!numa_meminfo.blk[i].hotpluggable)
@@ -767,6 +772,13 @@ static void __init early_mem_hotplug_init()

movablemem_map_add_region(numa_meminfo.blk[i].start,
numa_meminfo.blk[i].end);
+
+ if (zone_movable_limit[nid])
+ zone_movable_limit[nid] = min(zone_movable_limit[nid],
+ PFN_DOWN(numa_meminfo.blk[i].start));
+ else
+ zone_movable_limit[nid] =
+ PFN_DOWN(numa_meminfo.blk[i].start);
}
}
#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2835c91..b313d83 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1349,6 +1349,8 @@ extern struct movablemem_map movablemem_map;

extern void __init movablemem_map_add_region(u64 start, u64 size);

+extern unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
+
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2a7904f..b97bdb5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -213,6 +213,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];

/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
--
1.7.1

2013-04-05 09:39:14

by Tang Chen

Subject: [PATCH 11/11] x86, numa, acpi, memory-hotplug: Memblock limit with movablemem_map

Ensure memblock will not allocate memory from areas that may be
ZONE_MOVABLE. The map info comes from the movablemem_map boot option.

The following problem was reported by Stephen Rothwell:
The definition of struct movablemem_map is protected by
CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
movablemem_map in memblock_overlaps_region().
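
To illustrate the top-down scan with made-up numbers: suppose the highest
free range is [5GB, 8GB) and movablemem_map contains [6GB, 8GB). The first
candidate, taken from the top of the free range, lands inside the movable
entry, so the search end is pulled down to 6GB and retried; the allocation
is finally served from just below 6GB, leaving the would-be ZONE_MOVABLE
range untouched.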

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wen Congyang <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
Tested-by: Lin Feng <[email protected]>
Reported-by: Stephen Rothwell <[email protected]>
---
include/linux/memblock.h | 2 +
mm/memblock.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..3e5ecb2 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,7 @@ struct memblock {

extern struct memblock memblock;
extern int memblock_debug;
+extern struct movablemem_map movablemem_map;

#define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
@@ -60,6 +61,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
unsigned long *out_end_pfn, int *out_nid);

diff --git a/mm/memblock.c b/mm/memblock.c
index b8d9147..1bcd9b9 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -92,9 +92,58 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
*
* Find @size free area aligned to @align in the specified range and node.
*
+ * If we have CONFIG_HAVE_MEMBLOCK_NODE_MAP defined, we need to check that the
+ * memory we found is not in hotpluggable ranges.
+ *
* RETURNS:
* Found address on success, %0 on failure.
*/
+#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
+ phys_addr_t end, phys_addr_t size,
+ phys_addr_t align, int nid)
+{
+ phys_addr_t this_start, this_end, cand;
+ u64 i;
+ int curr = movablemem_map.nr_map - 1;
+
+ /* pump up @end */
+ if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+ end = memblock.current_limit;
+
+ /* avoid allocating the first page */
+ start = max_t(phys_addr_t, start, PAGE_SIZE);
+ end = max(start, end);
+
+ for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
+ this_start = clamp(this_start, start, end);
+ this_end = clamp(this_end, start, end);
+
+restart:
+ if (this_end <= this_start || this_end < size)
+ continue;
+
+ for (; curr >= 0; curr--) {
+ if ((movablemem_map.map[curr].start_pfn << PAGE_SHIFT)
+ < this_end)
+ break;
+ }
+
+ cand = round_down(this_end - size, align);
+ if (curr >= 0 &&
+ cand < movablemem_map.map[curr].end_pfn << PAGE_SHIFT) {
+ this_end = movablemem_map.map[curr].start_pfn
+ << PAGE_SHIFT;
+ goto restart;
+ }
+
+ if (cand >= this_start)
+ return cand;
+ }
+
+ return 0;
+}
+#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
phys_addr_t end, phys_addr_t size,
phys_addr_t align, int nid)
@@ -123,6 +172,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
}
return 0;
}
+#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

/**
* memblock_find_in_range - find free area in given range
--
1.7.1

2013-04-05 09:39:35

by Tang Chen

Subject: [PATCH 04/11] x86, numa, acpi, memory-hotplug: Introduce hotplug info into struct numa_meminfo.

Since we use struct numa_meminfo to store SRAT info and to sanitize
movablemem_map.map[], we need hotplug info in struct numa_meminfo.

This patch introduces a "bool hotpluggable" member into struct
numa_meminfo.

It also modifies the following API prototypes to support it:
- numa_add_memblk()
- numa_add_memblk_to()

And the following callers:
- numaq_register_node()
- dummy_numa_init()
- amd_numa_init()
- acpi_numa_memory_affinity_init() in x86

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/numa.h | 3 ++-
arch/x86/kernel/apic/numaq_32.c | 2 +-
arch/x86/mm/amdtopology.c | 3 ++-
arch/x86/mm/numa.c | 10 +++++++---
arch/x86/mm/numa_internal.h | 1 +
arch/x86/mm/srat.c | 2 +-
6 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 1b99ee5..73096b2 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -31,7 +31,8 @@ extern int numa_off;
extern s16 __apicid_to_node[MAX_LOCAL_APIC];
extern nodemask_t numa_nodes_parsed __initdata;

-extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
+extern int __init numa_add_memblk(int nodeid, u64 start, u64 end,
+ bool hotpluggable);
extern void __init numa_set_distance(int from, int to, int distance);

static inline void set_apicid_to_node(int apicid, s16 node)
diff --git a/arch/x86/kernel/apic/numaq_32.c b/arch/x86/kernel/apic/numaq_32.c
index d661ee9..7a9c542 100644
--- a/arch/x86/kernel/apic/numaq_32.c
+++ b/arch/x86/kernel/apic/numaq_32.c
@@ -82,7 +82,7 @@ static inline void numaq_register_node(int node, struct sys_cfg_data *scd)
int ret;

node_set(node, numa_nodes_parsed);
- ret = numa_add_memblk(node, start, end);
+ ret = numa_add_memblk(node, start, end, false);
BUG_ON(ret < 0);
}

diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index 5247d01..d521471 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -167,7 +167,8 @@ int __init amd_numa_init(void)
nodeid, base, limit);

prevbase = base;
- numa_add_memblk(nodeid, base, limit);
+ /* Do not support memory hotplug for AMD CPUs. */
+ numa_add_memblk(nodeid, base, limit, false);
node_set(nodeid, numa_nodes_parsed);
}

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4f754e6..ecf37fd 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -134,6 +134,7 @@ void __init setup_node_to_cpumask_map(void)
}

static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
+ bool hotpluggable,
struct numa_meminfo *mi)
{
/* ignore zero length blks */
@@ -155,6 +156,7 @@ static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
mi->blk[mi->nr_blks].start = start;
mi->blk[mi->nr_blks].end = end;
mi->blk[mi->nr_blks].nid = nid;
+ mi->blk[mi->nr_blks].hotpluggable = hotpluggable;
mi->nr_blks++;
return 0;
}
@@ -179,15 +181,17 @@ void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
* @nid: NUMA node ID of the new memblk
* @start: Start address of the new memblk
* @end: End address of the new memblk
+ * @hotpluggable: True if memblk is hotpluggable
*
* Add a new memblk to the default numa_meminfo.
*
* RETURNS:
* 0 on success, -errno on failure.
*/
-int __init numa_add_memblk(int nid, u64 start, u64 end)
+int __init numa_add_memblk(int nid, u64 start, u64 end,
+ bool hotpluggable)
{
- return numa_add_memblk_to(nid, start, end, &numa_meminfo);
+ return numa_add_memblk_to(nid, start, end, hotpluggable, &numa_meminfo);
}

/* Initialize NODE_DATA for a node on the local memory */
@@ -631,7 +635,7 @@ static int __init dummy_numa_init(void)
0LLU, PFN_PHYS(max_pfn) - 1);

node_set(0, numa_nodes_parsed);
- numa_add_memblk(0, 0, PFN_PHYS(max_pfn));
+ numa_add_memblk(0, 0, PFN_PHYS(max_pfn), false);

return 0;
}
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index bb2fbcc..1ce4e6b 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -8,6 +8,7 @@ struct numa_memblk {
u64 start;
u64 end;
int nid;
+ bool hotpluggable;
};

struct numa_meminfo {
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 5055fa7..f7f6fd4 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -171,7 +171,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
goto out_err_bad_srat;
}

- if (numa_add_memblk(node, start, end) < 0)
+ if (numa_add_memblk(node, start, end, hotpluggable) < 0)
goto out_err_bad_srat;

node_set(node, numa_nodes_parsed);
--
1.7.1

2013-04-05 09:39:34

by Tang Chen

Subject: [PATCH 06/11] X86, numa, acpi, memory-hotplug: Add hotpluggable ranges to movablemem_map.

When parsing the SRAT, we can learn which memory ranges are hotpluggable,
and we add them to movablemem_map. So movablemem_map can be used to prevent
memblock from allocating memory in areas which will be set as ZONE_MOVABLE
later.
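
The new insert_movablemem_map() helper keeps the array sorted by start_pfn
and merges overlaps. For example (pfn values made up):

	add [10, 20)  ->  map: [10,20)
	add [30, 40)  ->  map: [10,20) [30,40)
	add [15, 35)  ->  map: [10,40)

The third insertion overlaps both existing entries, so they collapse into
one and nr_map drops accordingly.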

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 39 ++++++++++++++++++++++
include/linux/mm.h | 4 ++
mm/page_alloc.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 135 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 26d1800..73e7934 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -725,6 +725,43 @@ static void __init early_x86_numa_init_mapping(void)
}
#endif

+#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+/**
+ * early_mem_hotplug_init - Add hotpluggable memory ranges to movablemem_map.
+ *
+ * This function scans numa_meminfo.blk[], and adds all the hotpluggable memory
+ * ranges to movablemem_map. movablemem_map can be used to prevent memblock
+ * from allocating memory in areas which will be set as ZONE_MOVABLE later, so
+ * this function should be called after memory mapping is initialized because
+ * we will put pagetable pages in local node even if the memory of that node is
+ * hotpluggable.
+ *
+ * If users specify movablemem_map=acpi, then:
+ *
+ * SRAT: |_____| |_____| |_________| |_________| ......
+ * node id: 0 1 1 2
+ * hotpluggable: n y y n
+ * movablemem_map: |_____| |_________|
+ */
+static void __init early_mem_hotplug_init()
+{
+ int i;
+
+ if (!movablemem_map.acpi)
+ return;
+
+ for (i = 0; i < numa_meminfo.nr_blks; i++) {
+ if (numa_meminfo.blk[i].hotpluggable)
+ movablemem_map_add_region(numa_meminfo.blk[i].start,
+ numa_meminfo.blk[i].end);
+ }
+}
+#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+static inline void early_mem_hotplug_init()
+{
+}
+#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+
void __init early_initmem_init(void)
{
early_x86_numa_init();
@@ -734,6 +771,8 @@ void __init early_initmem_init(void)
load_cr3(swapper_pg_dir);
__flush_tlb_all();

+ early_mem_hotplug_init();
+
early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
}

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52c3558..7468221 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1344,6 +1344,10 @@ struct movablemem_map {
struct movablemem_entry map[MOVABLEMEM_MAP_MAX];
};

+extern struct movablemem_map movablemem_map;
+
+extern void __init movablemem_map_add_region(u64 start, u64 size);
+
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 475fd8b..2a7904f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5068,6 +5068,98 @@ early_param("kernelcore", cmdline_parse_kernelcore);
early_param("movablecore", cmdline_parse_movablecore);

/**
+ * insert_movablemem_map - Insert a memory range into movablemem_map.map.
+ * @start_pfn: start pfn of the range
+ * @end_pfn: end pfn of the range
+ *
+ * This function will also merge overlapping ranges, and sort the array
+ * by start_pfn in monotonically increasing order.
+ */
+static void __init insert_movablemem_map(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ int pos, overlap;
+
+ /*
+ * pos will be at the 1st overlapped range, or the position
+ * where the element should be inserted.
+ */
+ for (pos = 0; pos < movablemem_map.nr_map; pos++)
+ if (start_pfn <= movablemem_map.map[pos].end_pfn)
+ break;
+
+ /* If there is no overlapped range, just insert the element. */
+ if (pos == movablemem_map.nr_map ||
+ end_pfn < movablemem_map.map[pos].start_pfn) {
+ /*
+ * If pos is not the end of array, we need to move all
+ * the rest elements backward.
+ */
+ if (pos < movablemem_map.nr_map)
+ memmove(&movablemem_map.map[pos+1],
+ &movablemem_map.map[pos],
+ sizeof(struct movablemem_entry) *
+ (movablemem_map.nr_map - pos));
+ movablemem_map.map[pos].start_pfn = start_pfn;
+ movablemem_map.map[pos].end_pfn = end_pfn;
+ movablemem_map.nr_map++;
+ return;
+ }
+
+ /* overlap will be at the last overlapped range */
+ for (overlap = pos + 1; overlap < movablemem_map.nr_map; overlap++)
+ if (end_pfn < movablemem_map.map[overlap].start_pfn)
+ break;
+
+ /*
+ * If there are more ranges overlapped, we need to merge them,
+ * and move the rest elements forward.
+ */
+ overlap--;
+ movablemem_map.map[pos].start_pfn = min(start_pfn,
+ movablemem_map.map[pos].start_pfn);
+ movablemem_map.map[pos].end_pfn = max(end_pfn,
+ movablemem_map.map[overlap].end_pfn);
+
+ if (pos != overlap && overlap + 1 != movablemem_map.nr_map)
+ memmove(&movablemem_map.map[pos+1],
+ &movablemem_map.map[overlap+1],
+ sizeof(struct movablemem_entry) *
+ (movablemem_map.nr_map - overlap - 1));
+
+ movablemem_map.nr_map -= overlap - pos;
+}
+
+/**
+ * movablemem_map_add_region - Add a memory range into movablemem_map.
+ * @start: physical start address of range
+ * @size: size of the range in bytes
+ *
+ * This function transforms the physical addresses into pfns, and then adds
+ * the range into movablemem_map by calling insert_movablemem_map().
+ */
+void __init movablemem_map_add_region(u64 start, u64 size)
+{
+ unsigned long start_pfn, end_pfn;
+
+ /* In case size == 0 or start + size overflows */
+ if (start + size <= start)
+ return;
+
+ if (movablemem_map.nr_map >= ARRAY_SIZE(movablemem_map.map)) {
+ pr_err("movablemem_map: too many entries; "
+ "ignoring [mem %#010llx-%#010llx]\n",
+ (unsigned long long) start,
+ (unsigned long long) (start + size - 1));
+ return;
+ }
+
+ start_pfn = PFN_DOWN(start);
+ end_pfn = PFN_UP(start + size);
+ insert_movablemem_map(start_pfn, end_pfn);
+}
+
+/**
* cmdline_parse_movablemem_map - Parse boot option movablemem_map.
* @p: The boot option of the following format:
* movablemem_map=acpi
--
1.7.1

2013-04-05 09:37:20

by Tang Chen

Subject: [PATCH 05/11] x86, numa, acpi, memory-hotplug: Consider hotplug info when cleanup numa_meminfo.

Since we have introduced hotplug info into struct numa_meminfo, we need
to consider it when cleaning up numa_meminfo.

The original logic in numa_cleanup_meminfo() is:
Merge blocks on the same node, holes between which don't overlap with
memory on other nodes.

This patch modifies numa_cleanup_meminfo() logic like this:
Merge blocks with the same hotpluggable type on the same node, holes
between which don't overlap with memory on other nodes.
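
For example (ranges made up): two node-1 blocks covering [2GB, 3GB) and
[3GB, 4GB) used to be merged unconditionally; with this change they are
merged only if both are hotpluggable or both are not, so a hotpluggable
block is never folded into a non-hotpluggable one.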

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 13 +++++++++----
1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ecf37fd..26d1800 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -296,18 +296,22 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
}

/*
- * Join together blocks on the same node, holes
- * between which don't overlap with memory on other
- * nodes.
+ * Join together blocks on the same node, with the same
+ * hotpluggable flags, holes between which don't overlap
+ * with memory on other nodes.
*/
if (bi->nid != bj->nid)
continue;
+ if (bi->hotpluggable != bj->hotpluggable)
+ continue;
+
start = min(bi->start, bj->start);
end = max(bi->end, bj->end);
for (k = 0; k < mi->nr_blks; k++) {
struct numa_memblk *bk = &mi->blk[k];

- if (bi->nid == bk->nid)
+ if (bi->nid == bk->nid &&
+ bi->hotpluggable == bk->hotpluggable)
continue;
if (start < bk->end && end > bk->start)
break;
@@ -327,6 +331,7 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
mi->blk[i].start = mi->blk[i].end = 0;
mi->blk[i].nid = NUMA_NO_NODE;
+ mi->blk[i].hotpluggable = false;
}

return 0;
--
1.7.1

2013-04-05 09:40:42

by Tang Chen

Subject: [PATCH 03/11] numa, acpi, memory-hotplug: Add movablemem_map=acpi boot option.

Since kernel pages cannot be migrated, if we want a memory device to be
hotpluggable, we have to set all the memory on it as ZONE_MOVABLE.

This patch adds a boot option, movablemem_map=acpi, to inform the kernel
to use the Hot Pluggable bit in the SRAT to determine which memory
devices are hotpluggable.

Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Reviewed-by: Wen Congyang <[email protected]>
Tested-by: Lin Feng <[email protected]>
---
Documentation/kernel-parameters.txt | 11 +++++++++++
include/linux/mm.h | 12 ++++++++++++
mm/page_alloc.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4609e81..e039888 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1649,6 +1649,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.

+ movablemem_map=acpi
+ [KNL,X86,IA-64,PPC] This parameter is similar to
+ memmap except it specifies the memory map of
+ ZONE_MOVABLE.
+ This option informs the kernel to use the Hot Pluggable
+ bit in the flags of the SRAT (from the ACPI BIOS) to
+ determine which memory devices could be hotplugged. The corresponding
+ memory ranges will be set as ZONE_MOVABLE.
+ NOTE: Whatever node the kernel resides in will always
+ be un-hotpluggable.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1c79b10..52c3558 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1332,6 +1332,18 @@ extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
extern void sparse_memory_present_with_active_regions(int nid);

+#define MOVABLEMEM_MAP_MAX MAX_NUMNODES
+struct movablemem_entry {
+ unsigned long start_pfn; /* start pfn of memory segment */
+ unsigned long end_pfn; /* end pfn of memory segment (exclusive) */
+};
+
+struct movablemem_map {
+ bool acpi;
+ int nr_map;
+ struct movablemem_entry map[MOVABLEMEM_MAP_MAX];
+};
+
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f368db4..475fd8b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -202,6 +202,12 @@ static unsigned long __meminitdata nr_all_pages;
static unsigned long __meminitdata dma_reserve;

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+/* Movable memory ranges, will also be used by memblock subsystem. */
+struct movablemem_map movablemem_map = {
+ .acpi = false,
+ .nr_map = 0,
+};
+
static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
static unsigned long __initdata required_kernelcore;
@@ -5061,6 +5067,35 @@ static int __init cmdline_parse_movablecore(char *p)
early_param("kernelcore", cmdline_parse_kernelcore);
early_param("movablecore", cmdline_parse_movablecore);

+/**
+ * cmdline_parse_movablemem_map - Parse boot option movablemem_map.
+ * @p: The boot option of the following format:
+ * movablemem_map=acpi
+ *
+ * This option informs the kernel to use the Hot Pluggable bit in the SRAT to
+ * determine which memory devices are hotpluggable, and to set their memory
+ * as movable.
+ *
+ * Return: 0 on success or -EINVAL on failure.
+ */
+static int __init cmdline_parse_movablemem_map(char *p)
+{
+ if (!p || strcmp(p, "acpi"))
+ goto err;
+
+ movablemem_map.acpi = true;
+
+ if (movablemem_map.nr_map) {
+ memset(movablemem_map.map, 0,
+ sizeof(struct movablemem_entry) * movablemem_map.nr_map);
+ }
+
+ return 0;
+
+err:
+ return -EINVAL;
+}
+early_param("movablemem_map", cmdline_parse_movablemem_map);
+
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

/**
--
1.7.1

2013-04-09 05:16:03

by Yasuaki Ishimatsu

Subject: Re: [PATCH 00/11] Introduce movablemem_map=acpi boot option.

Hi Tang,

The patch works well on my x86_64 box.
I confirmed that the hotpluggable node's memory is allocated as ZONE_MOVABLE.
So feel free to add:

Tested-by: Yasuaki Ishimatsu <[email protected]>

Nitpick below.

2013/04/05 18:39, Tang Chen wrote:
> [...]

> Tang Chen (10):
> acpi: Print hotplug info in SRAT.
> numa, acpi, memory-hotplug: Add movablemem_map=acpi boot option.
> x86, numa, acpi, memory-hotplug: Introduce hotplug info into struct
> numa_meminfo.
> x86, numa, acpi, memory-hotplug: Consider hotplug info when cleanup
> numa_meminfo.

> X86, numa, acpi, memory-hotplug: Add hotpluggable ranges to
> movablemem_map.

It has a whitespace error.

> x86, numa, acpi, memory-hotplug: Make any node which the kernel
> resides in un-hotpluggable.

> x86, numa, acpi, memory-hotplug: Introduce zone_movable_limit[] to
> store start pfn of ZONE_MOVABLE.

It has a whitespace error.

> x86, numa, acpi, memory-hotplug: Sanitize zone_movable_limit[].
> x86, numa, acpi, memory-hotplug: make movablemem_map have higher
> priority
> x86, numa, acpi, memory-hotplug: Memblock limit with movablemem_map

Thanks,
Yasuaki Ishimatsu


2013-04-09 08:18:37

by Tang Chen

Subject: Re: [PATCH 00/11] Introduce movablemem_map=acpi boot option.

On 04/09/2013 01:14 PM, Yasuaki Ishimatsu wrote:
> Hi Tang,
>
> The patch works well on my x86_64 box.
> I confirmed that hotpluggable node is allocated as Movable Zone.
> So feel free to add:
>
> Tested-by: Yasuaki Ishimatsu <[email protected]>
>
> Nitpick below.

Thanks for testing. I will fix the whitespace errors and resend the
patch-set soon. :)