2013-04-30 09:18:41

by Tang Chen

Subject: [PATCH v2 00/13] Arrange hotpluggable memory in SRAT as ZONE_MOVABLE.

In a memory hotplug situation, hotpluggable memory should be
arranged in ZONE_MOVABLE, because memory in ZONE_NORMAL may be
used by the kernel, and Linux cannot migrate pages used by the kernel.

So we need a way, as easy as possible, to specify hotpluggable
memory as movable.

According to the ACPI 5.0 spec, the SRAT table contains Memory
Affinity Structures, and each structure has a Hot Pluggable Field.
See "5.2.16.2 Memory Affinity Structure".

Using this information, firmware can specify which memory is
hotpluggable. For example, if the Hot Pluggable Field is set, the
kernel treats that memory as movable.
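As a rough sketch of that decision (using hypothetical struct and flag names, not the kernel's actual ACPI types), a range qualifies for ZONE_MOVABLE only when firmware marks it both enabled and hot-pluggable:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical, abridged view of the ACPI 5.0 Memory Affinity
 * Structure (spec section 5.2.16.2); only the flags word matters
 * here, and the field names are illustrative. */
struct srat_mem_affinity {
    uint32_t proximity_domain;   /* NUMA node */
    uint64_t base_address;
    uint64_t length;
    uint32_t flags;
};

#define SRAT_MEM_ENABLED        (1u << 0)  /* entry is usable */
#define SRAT_MEM_HOT_PLUGGABLE  (1u << 1)  /* Hot Pluggable Field */

/* A range should become ZONE_MOVABLE only if it is both enabled
 * and flagged hot-pluggable by firmware. */
static bool srat_range_is_movable(const struct srat_mem_affinity *ma)
{
    return (ma->flags & SRAT_MEM_ENABLED) &&
           (ma->flags & SRAT_MEM_HOT_PLUGGABLE);
}
```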

To achieve this goal, we need to do the following:
1. Prevent memblock from allocating hotpluggable memory for the kernel.
This is done by reserving hotpluggable memory in memblock in the
following steps:
1) Parse SRAT early enough so that memblock knows which memory
is hotpluggable.
2) Add a "flags" member to memblock so that it can tell
which memory is hotpluggable when freeing it to the buddy system.

2. Free hotpluggable memory to buddy system when memory initialization
is done.

3. Arrange hotpluggable memory in ZONE_MOVABLE.
(This will decrease NUMA performance.)

4. Provide a user interface to enable/disable this functionality.
(This is useful for those who don't use memory hotplug and who don't
want to lose their NUMA performance.)


This patch-set does the following:
patch1: Fix a little problem.
patch2: Have Hot-Pluggable Field in SRAT printed when parsing SRAT.
patch4,5: Introduce hotpluggable field to numa_meminfo.
patch6,7: Introduce flags to memblock, and keep the public APIs prototype
unmodified.
patch8,9: Reserve node-life-cycle memory as MEMBLK_LOCAL_NODE with memblock.
patch10,11: Reserve hotpluggable memory as MEMBLK_HOTPLUGGABLE with memblock,
and free it to buddy when memory initialization is done.
patch3,12,13: Improve "movablecore" boot option to support "movablecore=acpi".


Change log:
1. Fixed a bug in patch10: forgot to update the start and end values.
2. Added new patch8: make alloc_low_pages() able to call
memory_add_physaddr_to_nid().


This patch-set is based on Yinghai's
"x86, ACPI, numa: Parse numa info early" patch-set.
Please refer to:
v1: https://lkml.org/lkml/2013/3/7/642
v2: https://lkml.org/lkml/2013/3/10/47
v3: https://lkml.org/lkml/2013/4/4/639
v4: https://lkml.org/lkml/2013/4/11/829

Yinghai's patch-set does the following:
1) Parses SRAT early enough.
2) Allocates pagetable pages on the local node.


Tang Chen (12):
acpi: Print Hot-Pluggable Field in SRAT.
page_alloc, mem-hotplug: Improve movablecore to {en|dis}able using
SRAT.
x86, numa, acpi, memory-hotplug: Introduce hotplug info into struct
numa_meminfo.
x86, numa, acpi, memory-hotplug: Consider hotplug info when cleanup
numa_meminfo.
memblock, numa: Introduce flag into memblock.
x86, numa, mem-hotplug: Mark nodes which the kernel resides in.
x86, numa: Move memory_add_physaddr_to_nid() to CONFIG_NUMA.
x86, numa, memblock: Introduce MEMBLK_LOCAL_NODE to mark and reserve
node-life-cycle data.
x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to mark
and reserve hotpluggable memory.
x86, memblock, mem-hotplug: Free hotpluggable memory reserved by
memblock.
x86, numa, acpi, memory-hotplug: Make movablecore=acpi have higher
priority.
doc, page_alloc, acpi, mem-hotplug: Add doc for movablecore=acpi boot
option.

Yasuaki Ishimatsu (1):
x86: get pg_data_t's memory from other node

Documentation/kernel-parameters.txt | 8 ++
arch/x86/include/asm/numa.h | 3 +-
arch/x86/kernel/apic/numaq_32.c | 2 +-
arch/x86/mm/amdtopology.c | 3 +-
arch/x86/mm/init.c | 16 +++-
arch/x86/mm/numa.c | 64 +++++++++++++++---
arch/x86/mm/numa_internal.h | 1 +
arch/x86/mm/srat.c | 11 ++-
include/linux/memblock.h | 16 +++++
include/linux/memory_hotplug.h | 3 +
mm/memblock.c | 127 ++++++++++++++++++++++++++++++----
mm/nobootmem.c | 3 +
mm/page_alloc.c | 37 ++++++++++-
13 files changed, 256 insertions(+), 38 deletions(-)


2013-04-30 09:18:43

by Tang Chen

Subject: [PATCH v2 05/13] x86, numa, acpi, memory-hotplug: Consider hotplug info when cleanup numa_meminfo.

Since we have introduced hotplug info into struct numa_meminfo, we need
to take it into account when cleaning up numa_meminfo.

The original logic in numa_cleanup_meminfo() is:
Merge blocks on the same node, as long as the holes between them
don't overlap with memory on other nodes.

This patch changes the logic to:
Merge blocks on the same node only if they have the same hotpluggable
type, as long as the holes between them don't overlap with memory on
other nodes.
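As an illustration only (toy types, not the kernel's numa_meminfo), the new merge precondition can be sketched like this:

```c
#include <stdbool.h>

/* Toy stand-in for struct numa_memblk: a physical range with a node
 * id and the new hotpluggable attribute. */
struct blk {
    unsigned long long start, end;
    int nid;
    bool hotpluggable;
};

/* Two blocks are merge candidates only if they sit on the same node
 * AND share the same hotpluggable type; otherwise merging would lose
 * the distinction the rest of the series relies on. (The overlap
 * check against other nodes is omitted here.) */
static bool can_merge(const struct blk *bi, const struct blk *bj)
{
    return bi->nid == bj->nid && bi->hotpluggable == bj->hotpluggable;
}
```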

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 13 +++++++++----
1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ecf37fd..26d1800 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -296,18 +296,22 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
}

/*
- * Join together blocks on the same node, holes
- * between which don't overlap with memory on other
- * nodes.
+ * Join together blocks on the same node, with the same
+ * hotpluggable flags, holes between which don't overlap
+ * with memory on other nodes.
*/
if (bi->nid != bj->nid)
continue;
+ if (bi->hotpluggable != bj->hotpluggable)
+ continue;
+
start = min(bi->start, bj->start);
end = max(bi->end, bj->end);
for (k = 0; k < mi->nr_blks; k++) {
struct numa_memblk *bk = &mi->blk[k];

- if (bi->nid == bk->nid)
+ if (bi->nid == bk->nid &&
+ bi->hotpluggable == bk->hotpluggable)
continue;
if (start < bk->end && end > bk->start)
break;
@@ -327,6 +331,7 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
mi->blk[i].start = mi->blk[i].end = 0;
mi->blk[i].nid = NUMA_NO_NODE;
+ mi->blk[i].hotpluggable = false;
}

return 0;
--
1.7.1

2013-04-30 09:18:46

by Tang Chen

Subject: [PATCH v2 12/13] x86, numa, acpi, memory-hotplug: Make movablecore=acpi have higher priority.

Arranging hotpluggable memory in ZONE_MOVABLE will decrease NUMA
performance, because the kernel cannot use movable memory.

Users who don't use memory hotplug, and who don't want to lose NUMA
performance, need a way to disable this functionality.

So, if users specify "movablecore=acpi" on the kernel command line, the
kernel will use SRAT to arrange ZONE_MOVABLE, and this takes higher
priority than the original movablecore and kernelcore boot options.

For those who don't want this, just specify nothing.
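The per-node computation in the page_alloc.c hunk below boils down to "the movable zone on each node starts at the lowest hotpluggable reserved address on that node". A stand-alone sketch of that reduction (toy region array and pfn units, not the real memblock API):

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8

struct region {
    unsigned long base_pfn;   /* start of the reserved range */
    int nid;                  /* owning node */
    bool hotpluggable;        /* reserved as hotpluggable? */
};

/* For each node, record the minimum start pfn among its hotpluggable
 * reserved regions; 0 is used here to mean "no movable zone on this
 * node". */
static void compute_zone_movable_pfn(const struct region *r, size_t n,
                                     unsigned long zmp[MAX_NODES])
{
    for (int i = 0; i < MAX_NODES; i++)
        zmp[i] = 0;

    for (size_t i = 0; i < n; i++) {
        if (!r[i].hotpluggable)
            continue;
        int nid = r[i].nid;
        if (!zmp[nid] || r[i].base_pfn < zmp[nid])
            zmp[nid] = r[i].base_pfn;
    }
}
```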

Signed-off-by: Tang Chen <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 5 +++++
mm/page_alloc.c | 24 +++++++++++++++++++++++-
3 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 08c761d..5528e8f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -69,6 +69,7 @@ int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_local_node(phys_addr_t base, phys_addr_t size, int nid);
int memblock_reserve_hotpluggable(phys_addr_t base, phys_addr_t size, int nid);
+bool memblock_is_hotpluggable(struct memblock_region *region);
void memblock_free_hotpluggable(void);
void memblock_trim_memory(phys_addr_t align);
void memblock_mark_kernel_nodes(void);
diff --git a/mm/memblock.c b/mm/memblock.c
index 54de398..8b9a13c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -623,6 +623,11 @@ int __init_memblock memblock_reserve_hotpluggable(phys_addr_t base,
return memblock_reserve_region(base, size, nid, flags);
}

+bool __init_memblock memblock_is_hotpluggable(struct memblock_region *region)
+{
+ return region->flags & (1 << MEMBLK_HOTPLUGGABLE);
+}
+
/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b9ea143..2fe9ebf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4793,9 +4793,31 @@ static void __init find_zone_movable_pfns_for_nodes(void)
nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+ struct memblock_type *reserved = &memblock.reserved;

/*
- * If movablecore was specified, calculate what size of
+ * If movablecore=acpi was specified, then zone_movable_pfn[] has been
+ * initialized, and no more work needs to do.
+ * NOTE: In this case, we ignore kernelcore option.
+ */
+ if (movablecore_enable_srat) {
+ for (i = 0; i < reserved->cnt; i++) {
+ if (!memblock_is_hotpluggable(&reserved->regions[i]))
+ continue;
+
+ nid = reserved->regions[i].nid;
+
+ usable_startpfn = reserved->regions[i].base;
+ zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+ min(usable_startpfn, zone_movable_pfn[nid]) :
+ usable_startpfn;
+ }
+
+ goto out;
+ }
+
+ /*
+ * If movablecore=nn[KMG] was specified, calculate what size of
* kernelcore that corresponds so that memory usable for
* any allocation type is evenly spread. If both kernelcore
* and movablecore are specified, then the value of kernelcore
--
1.7.1

2013-04-30 09:18:47

by Tang Chen

Subject: [PATCH v2 08/13] x86, numa: Move memory_add_physaddr_to_nid() to CONFIG_NUMA.

memory_add_physaddr_to_nid() is declared in include/linux/memory_hotplug.h,
protected by CONFIG_NUMA. But on x86, its definition is protected by
CONFIG_MEMORY_HOTPLUG.

memory_add_physaddr_to_nid() uses numa_meminfo to find the nid of a
physical address. It has nothing to do with memory hotplug. Moreover, it
can be used by alloc_low_pages() to obtain the nid of newly allocated
memory.

So, on x86, protect it with CONFIG_NUMA as well.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 105b092..1367fe4 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -959,7 +959,7 @@ EXPORT_SYMBOL(cpumask_of_node);

#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */

-#ifdef CONFIG_MEMORY_HOTPLUG
+#ifdef CONFIG_NUMA
int memory_add_physaddr_to_nid(u64 start)
{
struct numa_meminfo *mi = &numa_meminfo;
--
1.7.1

2013-04-30 09:19:05

by Tang Chen

Subject: [PATCH v2 10/13] x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to mark and reserve hotpluggable memory.

We mark out the movable memory ranges and reserve them in
memblock.reserved with the MEMBLK_HOTPLUGGABLE flag. This should be done
after the memory mapping is initialized, because the kernel now supports
allocating pagetable pages on the local node, and those pages are kernel
pages.

The reserved hotpluggable memory will be freed to the buddy system when
memory initialization is done.

This idea is from Wen Congyang <[email protected]> and Jiang Liu <[email protected]>.

Suggested-by: Jiang Liu <[email protected]>
Suggested-by: Wen Congyang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 28 ++++++++++++++++++++++++++++
include/linux/memblock.h | 3 +++
mm/memblock.c | 19 +++++++++++++++++++
3 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1367fe4..a1f1f90 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -731,6 +731,32 @@ static void __init early_x86_numa_init_mapping(void)
}
#endif

+#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+static void __init early_mem_hotplug_init()
+{
+ int i, nid;
+ phys_addr_t start, end;
+
+ if (!movablecore_enable_srat)
+ return;
+
+ for (i = 0; i < numa_meminfo.nr_blks; i++) {
+ if (!numa_meminfo.blk[i].hotpluggable)
+ continue;
+
+ nid = numa_meminfo.blk[i].nid;
+ start = numa_meminfo.blk[i].start;
+ end = numa_meminfo.blk[i].end;
+
+ memblock_reserve_hotpluggable(start, end - start, nid);
+ }
+}
+#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+static inline void early_mem_hotplug_init()
+{
+}
+#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+
void __init early_initmem_init(void)
{
early_x86_numa_init();
@@ -740,6 +766,8 @@ void __init early_initmem_init(void)
load_cr3(swapper_pg_dir);
__flush_tlb_all();

+ early_mem_hotplug_init();
+
early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
}

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 3b2d1c4..0f01930 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -24,6 +24,7 @@
/* Definition of memblock flags. */
enum memblock_flags {
MEMBLK_LOCAL_NODE, /* node-life-cycle data */
+ MEMBLK_HOTPLUGGABLE, /* hotpluggable region */
__NR_MEMBLK_FLAGS, /* number of flags */
};

@@ -67,8 +68,10 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_local_node(phys_addr_t base, phys_addr_t size, int nid);
+int memblock_reserve_hotpluggable(phys_addr_t base, phys_addr_t size, int nid);
void memblock_trim_memory(phys_addr_t align);
void memblock_mark_kernel_nodes(void);
+bool memblock_is_kernel_node(int nid);

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index edde4c2..0c55588 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -596,6 +596,13 @@ int __init_memblock memblock_reserve_local_node(phys_addr_t base,
return memblock_reserve_region(base, size, nid, flags);
}

+int __init_memblock memblock_reserve_hotpluggable(phys_addr_t base,
+ phys_addr_t size, int nid)
+{
+ unsigned long flags = 1 << MEMBLK_HOTPLUGGABLE;
+ return memblock_reserve_region(base, size, nid, flags);
+}
+
/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
@@ -809,11 +816,23 @@ void __init_memblock memblock_mark_kernel_nodes()
node_set(nid, memblock_kernel_nodemask);
}
}
+
+bool __init_memblock memblock_is_kernel_node(int nid)
+{
+ if (node_isset(nid, memblock_kernel_nodemask))
+ return true;
+ return false;
+}
#else
void __init_memblock memblock_mark_kernel_nodes()
{
node_set(0, memblock_kernel_nodemask);
}
+
+bool __init_memblock memblock_is_kernel_node(int nid)
+{
+ return true;
+}
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
--
1.7.1

2013-04-30 09:19:48

by Tang Chen

Subject: [PATCH v2 09/13] x86, numa, memblock: Introduce MEMBLK_LOCAL_NODE to mark and reserve node-life-cycle data.

Node-life-cycle data (data whose life cycle is the same as its node's)
allocated by memblock should be marked, so that we can skip it when
freeing usable memory to the buddy system.

This patch introduces a MEMBLK_LOCAL_NODE flag for memblock to reserve
node-life-cycle data. For now, this covers only the kernel direct
mapping pagetable pages, based on Yinghai's patch-set.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/init.c | 16 ++++++++++++----
include/linux/memblock.h | 2 ++
mm/memblock.c | 7 +++++++
3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 8d0007a..002d487 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -62,14 +62,22 @@ __ref void *alloc_low_pages(unsigned int num)
low_min_pfn_mapped << PAGE_SHIFT,
low_max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
- } else
+ if (!ret)
+ panic("alloc_low_page: can not alloc memory");
+
+ memblock_reserve(ret, PAGE_SIZE * num);
+ } else {
ret = memblock_find_in_range(
local_min_pfn_mapped << PAGE_SHIFT,
local_max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
- if (!ret)
- panic("alloc_low_page: can not alloc memory");
- memblock_reserve(ret, PAGE_SIZE * num);
+ if (!ret)
+ panic("alloc_low_page: can not alloc memory");
+
+ memblock_reserve_local_node(ret, PAGE_SIZE * num,
+ memory_add_physaddr_to_nid(ret));
+ }
+
pfn = ret >> PAGE_SHIFT;
} else {
pfn = pgt_buf_end;
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5064eed..3b2d1c4 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -23,6 +23,7 @@

/* Definition of memblock flags. */
enum memblock_flags {
+ MEMBLK_LOCAL_NODE, /* node-life-cycle data */
__NR_MEMBLK_FLAGS, /* number of flags */
};

@@ -65,6 +66,7 @@ int memblock_add(phys_addr_t base, phys_addr_t size);
int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
+int memblock_reserve_local_node(phys_addr_t base, phys_addr_t size, int nid);
void memblock_trim_memory(phys_addr_t align);
void memblock_mark_kernel_nodes(void);

diff --git a/mm/memblock.c b/mm/memblock.c
index 1b93a5d..edde4c2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -589,6 +589,13 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
MEMBLK_FLAGS_DEFAULT);
}

+int __init_memblock memblock_reserve_local_node(phys_addr_t base,
+ phys_addr_t size, int nid)
+{
+ unsigned long flags = 1 << MEMBLK_LOCAL_NODE;
+ return memblock_reserve_region(base, size, nid, flags);
+}
+
/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
--
1.7.1

2013-04-30 09:20:11

by Tang Chen

Subject: [PATCH v2 11/13] x86, memblock, mem-hotplug: Free hotpluggable memory reserved by memblock.

We reserved hotpluggable memory in memblock. When memory initialization
is done, we have to free it to the buddy system.

This patch frees memory reserved by memblock with the MEMBLK_HOTPLUGGABLE flag.
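Conceptually, the free path scans memblock.reserved and drops every region whose flags mark it hotpluggable. A simplified stand-alone model (a plain array instead of memblock; note that when an element is removed and later entries shift down, the same index must be revisited so the shifted-down successor is not skipped):

```c
#include <stddef.h>
#include <string.h>

struct rgn {
    unsigned long base, size;
    unsigned long flags;
};

/* Remove (i.e. "free to buddy") every region whose flags match.
 * Removal shifts later entries down one slot, so only advance the
 * index when the current entry was kept.  Returns the new count. */
static size_t free_flagged(struct rgn *r, size_t cnt, unsigned long flags)
{
    size_t i = 0;
    while (i < cnt) {
        if (r[i].flags == flags) {
            memmove(&r[i], &r[i + 1], (cnt - i - 1) * sizeof(*r));
            cnt--;
        } else {
            i++;
        }
    }
    return cnt;
}
```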

Signed-off-by: Tang Chen <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 20 ++++++++++++++++++++
mm/nobootmem.c | 3 +++
3 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 0f01930..08c761d 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -69,6 +69,7 @@ int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_local_node(phys_addr_t base, phys_addr_t size, int nid);
int memblock_reserve_hotpluggable(phys_addr_t base, phys_addr_t size, int nid);
+void memblock_free_hotpluggable(void);
void memblock_trim_memory(phys_addr_t align);
void memblock_mark_kernel_nodes(void);
bool memblock_is_kernel_node(int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index 0c55588..54de398 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -568,6 +568,26 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
return __memblock_remove(&memblock.reserved, base, size);
}

+static void __init_memblock memblock_free_flags(unsigned long flags)
+{
+ int i;
+ struct memblock_type *reserved = &memblock.reserved;
+
+ for (i = 0; i < reserved->cnt; i++) {
+ if (reserved->regions[i].flags == flags)
+ memblock_remove_region(reserved, i);
+ }
+}
+
+void __init_memblock memblock_free_hotpluggable()
+{
+ unsigned long flags = 1 << MEMBLK_HOTPLUGGABLE;
+
+ memblock_dbg("memblock: free all hotpluggable memory");
+
+ memblock_free_flags(flags);
+}
+
static int __init_memblock memblock_reserve_region(phys_addr_t base,
phys_addr_t size,
int nid,
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 5e07d36..cd85604 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -165,6 +165,9 @@ unsigned long __init free_all_bootmem(void)
for_each_online_pgdat(pgdat)
reset_node_lowmem_managed_pages(pgdat);

+ /* Hotpluggable memory reserved by memblock should also be freed. */
+ memblock_free_hotpluggable();
+
/*
* We need to use MAX_NUMNODES instead of NODE_DATA(0)->node_id
* because in some case like Node0 doesn't have RAM installed
--
1.7.1

2013-04-30 09:20:09

by Tang Chen

Subject: [PATCH v2 07/13] x86, numa, mem-hotplug: Mark nodes which the kernel resides in.

If all the memory ranges in SRAT are hotpluggable, we should not
arrange them all in ZONE_MOVABLE; otherwise the kernel won't have
enough memory to boot.

This patch introduces a nodemask, memblock_kernel_nodemask, to mark
all the nodes the kernel resides in. These nodes are arranged as
non-hotpluggable regardless of whether firmware marks them hotpluggable.
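A nodemask here is just a bitmap indexed by node id. A minimal model of mark/test (stand-ins for node_set()/node_isset(), not the kernel's nodemask_t):

```c
#include <stdbool.h>

#define MAX_NODES 64

typedef struct { unsigned long long bits; } nodemask;

static void node_set_bit(int nid, nodemask *m)
{
    m->bits |= 1ULL << nid;
}

/* Nodes in this mask hold kernel allocations, so their memory must
 * stay non-hotpluggable even if SRAT flags it hot-pluggable. */
static bool node_is_kernel(int nid, const nodemask *m)
{
    return (m->bits >> nid) & 1ULL;
}
```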

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 6 ++++++
include/linux/memblock.h | 1 +
mm/memblock.c | 20 ++++++++++++++++++++
3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 26d1800..105b092 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -658,6 +658,12 @@ static bool srat_used __initdata;
*/
static void __init early_x86_numa_init(void)
{
+ /*
+ * Need to find out which nodes the kernel resides in, and arrange
+ * them as un-hotpluggable when parsing SRAT.
+ */
+ memblock_mark_kernel_nodes();
+
if (!numa_off) {
#ifdef CONFIG_X86_NUMAQ
if (!numa_init(numaq_numa_init))
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index c63a66e..5064eed 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -66,6 +66,7 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);
+void memblock_mark_kernel_nodes(void);

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 63924ae..1b93a5d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -35,6 +35,9 @@ struct memblock memblock __initdata_memblock = {
.current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};

+/* Mark which nodes the kernel resides in. */
+static nodemask_t memblock_kernel_nodemask __initdata_memblock;
+
int memblock_debug __initdata_memblock;
static int memblock_can_resize __initdata_memblock;
static int memblock_memory_in_slab __initdata_memblock = 0;
@@ -787,6 +790,23 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
memblock_merge_regions(type);
return 0;
}
+
+void __init_memblock memblock_mark_kernel_nodes()
+{
+ int i, nid;
+ struct memblock_type *reserved = &memblock.reserved;
+
+ for (i = 0; i < reserved->cnt; i++)
+ if (reserved->regions[i].flags == MEMBLK_FLAGS_DEFAULT) {
+ nid = memblock_get_region_node(&reserved->regions[i]);
+ node_set(nid, memblock_kernel_nodemask);
+ }
+}
+#else
+void __init_memblock memblock_mark_kernel_nodes()
+{
+ node_set(0, memblock_kernel_nodemask);
+}
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
--
1.7.1

2013-04-30 09:20:55

by Tang Chen

Subject: [PATCH v2 06/13] memblock, numa: Introduce flag into memblock.

There is no flag in memblock to describe what type a region of memory is.
Sometimes we may use memblock to reserve memory for a special purpose.
For example, as Yinghai did in his patch-set, pagetable pages are
allocated on the local node before all the memory on that node is mapped.
Please refer to Yinghai's patch-set:
v1: https://lkml.org/lkml/2013/3/7/642
v2: https://lkml.org/lkml/2013/3/10/47
v3: https://lkml.org/lkml/2013/4/4/639
v4: https://lkml.org/lkml/2013/4/11/829

In a hotplug environment, doing so can cause problems when we hot-remove
memory. Pagetable pages are kernel memory, which we cannot migrate. But
we can place them on the local node, because their life cycle is the
same as the node's. We then just need to free them all before hot-removing
the memory.

Actually, any data whose life cycle is the same as a node's, such as
pagetable pages, vmemmap pages, and page_cgroup pages, could be placed
on the local node. It can all be freed when we hot-remove a whole node.

In order to do so, we need to mark out these special pages in memblock.
In this patch, we introduce a new "flags" member into memblock_region:
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
};

This patch does the following:
1) Adds a "flags" member to memblock_region, with MEMBLK_FLAGS_DEFAULT for
common usage.
2) Modifies the prototypes of the following APIs:
memblock_add_region()
memblock_insert_region()
3) Adds memblock_reserve_region() to support reserving memory with flags,
keeping memblock_reserve()'s prototype unmodified.
4) Modifies other APIs to support flags, keeping their prototypes unmodified.
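The flag encoding used throughout the series is a plain bitmask: each enum value is a bit *number*, and a region's flags word ORs together `1 << bit`. A small sketch under those assumptions (the flag names here are hypothetical placeholders; later patches in the series add the real users of these slots):

```c
/* Bit numbers, mirroring the style of the enum added in this patch. */
enum memblk_flag_bits {
    MEMBLK_BIT_A,            /* hypothetical first flag  (bit 0) */
    MEMBLK_BIT_B,            /* hypothetical second flag (bit 1) */
    __NR_MEMBLK_FLAG_BITS,   /* number of flags */
};

#define MEMBLK_FLAGS_NONE 0UL

/* Turn a bit number into a flags word... */
static unsigned long flag(unsigned int bit)
{
    return 1UL << bit;
}

/* ...and test whether a region's flags word carries that bit. */
static int has_flag(unsigned long flags, unsigned int bit)
{
    return (flags & flag(bit)) != 0;
}
```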

The idea is from Wen Congyang <[email protected]> and Liu Jiang <[email protected]>.

Suggested-by: Wen Congyang <[email protected]>
Suggested-by: Liu Jiang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
include/linux/memblock.h | 8 ++++++
mm/memblock.c | 56 +++++++++++++++++++++++++++++++++------------
2 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..c63a66e 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,9 +19,17 @@

#define INIT_MEMBLOCK_REGIONS 128

+#define MEMBLK_FLAGS_DEFAULT 0
+
+/* Definition of memblock flags. */
+enum memblock_flags {
+ __NR_MEMBLK_FLAGS, /* number of flags */
+};
+
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
+ unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
diff --git a/mm/memblock.c b/mm/memblock.c
index 16eda3d..63924ae 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -157,6 +157,7 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
type->cnt = 1;
type->regions[0].base = 0;
type->regions[0].size = 0;
+ type->regions[0].flags = 0;
memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
}
}
@@ -307,7 +308,8 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)

if (this->base + this->size != next->base ||
memblock_get_region_node(this) !=
- memblock_get_region_node(next)) {
+ memblock_get_region_node(next) ||
+ this->flags != next->flags) {
BUG_ON(this->base + this->size > next->base);
i++;
continue;
@@ -327,13 +329,15 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
* @base: base address of the new region
* @size: size of the new region
* @nid: node id of the new region
+ * @flags: flags of the new region
*
* Insert new memblock region [@base,@base+@size) into @type at @idx.
* @type must already have extra room to accomodate the new region.
*/
static void __init_memblock memblock_insert_region(struct memblock_type *type,
int idx, phys_addr_t base,
- phys_addr_t size, int nid)
+ phys_addr_t size,
+ int nid, unsigned long flags)
{
struct memblock_region *rgn = &type->regions[idx];

@@ -341,6 +345,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
rgn->base = base;
rgn->size = size;
+ rgn->flags = flags;
memblock_set_region_node(rgn, nid);
type->cnt++;
type->total_size += size;
@@ -352,6 +357,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* @base: base address of the new region
* @size: size of the new region
* @nid: nid of the new region
+ * @flags: flags of the new region
*
* Add new memblock region [@base,@base+@size) into @type. The new region
* is allowed to overlap with existing ones - overlaps don't affect already
@@ -362,7 +368,8 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* 0 on success, -errno on failure.
*/
static int __init_memblock memblock_add_region(struct memblock_type *type,
- phys_addr_t base, phys_addr_t size, int nid)
+ phys_addr_t base, phys_addr_t size,
+ int nid, unsigned long flags)
{
bool insert = false;
phys_addr_t obase = base;
@@ -377,6 +384,7 @@ static int __init_memblock memblock_add_region(struct memblock_type *type,
WARN_ON(type->cnt != 1 || type->total_size);
type->regions[0].base = base;
type->regions[0].size = size;
+ type->regions[0].flags = flags;
memblock_set_region_node(&type->regions[0], nid);
type->total_size = size;
return 0;
@@ -407,7 +415,8 @@ repeat:
nr_new++;
if (insert)
memblock_insert_region(type, i++, base,
- rbase - base, nid);
+ rbase - base, nid,
+ flags);
}
/* area below @rend is dealt with, forget about it */
base = min(rend, end);
@@ -417,7 +426,8 @@ repeat:
if (base < end) {
nr_new++;
if (insert)
- memblock_insert_region(type, i, base, end - base, nid);
+ memblock_insert_region(type, i, base, end - base,
+ nid, flags);
}

/*
@@ -439,12 +449,14 @@ repeat:
int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
int nid)
{
- return memblock_add_region(&memblock.memory, base, size, nid);
+ return memblock_add_region(&memblock.memory, base, size,
+ nid, MEMBLK_FLAGS_DEFAULT);
}

int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
{
- return memblock_add_region(&memblock.memory, base, size, MAX_NUMNODES);
+ return memblock_add_region(&memblock.memory, base, size,
+ MAX_NUMNODES, MEMBLK_FLAGS_DEFAULT);
}

/**
@@ -499,7 +511,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= base - rbase;
type->total_size -= base - rbase;
memblock_insert_region(type, i, rbase, base - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else if (rend > end) {
/*
* @rgn intersects from above. Split and redo the
@@ -509,7 +522,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= end - rbase;
type->total_size -= end - rbase;
memblock_insert_region(type, i--, rbase, end - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else {
/* @rgn is fully contained, record it */
if (!*end_rgn)
@@ -551,16 +565,25 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
return __memblock_remove(&memblock.reserved, base, size);
}

-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+static int __init_memblock memblock_reserve_region(phys_addr_t base,
+ phys_addr_t size,
+ int nid,
+ unsigned long flags)
{
struct memblock_type *_rgn = &memblock.reserved;

- memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
+ memblock_dbg("memblock_reserve: [%#016llx-%#016llx] with flags %#016lx %pF\n",
(unsigned long long)base,
(unsigned long long)base + size,
- (void *)_RET_IP_);
+ flags, (void *)_RET_IP_);
+
+ return memblock_add_region(_rgn, base, size, nid, flags);
+}

- return memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_reserve_region(base, size, MAX_NUMNODES,
+ MEMBLK_FLAGS_DEFAULT);
}

/**
@@ -982,6 +1005,7 @@ void __init_memblock memblock_set_current_limit(phys_addr_t limit)
static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
{
unsigned long long base, size;
+ unsigned long flags;
int i;

pr_info(" %s.cnt = 0x%lx\n", name, type->cnt);
@@ -992,13 +1016,15 @@ static void __init_memblock memblock_dump(struct memblock_type *type, char *name

base = rgn->base;
size = rgn->size;
+ flags = rgn->flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
if (memblock_get_region_node(rgn) != MAX_NUMNODES)
snprintf(nid_buf, sizeof(nid_buf), " on node %d",
memblock_get_region_node(rgn));
#endif
- pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s\n",
- name, i, base, base + size - 1, size, nid_buf);
+ pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s "
+ "flags: %#lx\n",
+ name, i, base, base + size - 1, size, nid_buf, flags);
}
}

--
1.7.1

2013-04-30 09:20:54

by Tang Chen

Subject: [PATCH v2 13/13] doc, page_alloc, acpi, mem-hotplug: Add doc for movablecore=acpi boot option.

Since we modified the movablecore boot option to support
"movablecore=acpi", this patch adds documentation for it.

Signed-off-by: Tang Chen <[email protected]>
---
Documentation/kernel-parameters.txt | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4609e81..a1c515b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1649,6 +1649,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.

+ movablecore=acpi [KNL,X86] This parameter will enable the
+ kernel to arrange ZONE_MOVABLE with the help of
+ Hot-Pluggable Field in SRAT. All the hotpluggable
+ memory will be arranged in ZONE_MOVABLE.
+ NOTE: Any node which the kernel resides in will
+ always be un-hotpluggable so that the kernel
+ will always have enough memory to boot.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>

--
1.7.1

2013-04-30 09:21:44

by Tang Chen

[permalink] [raw]
Subject: [PATCH v2 04/13] x86, numa, acpi, memory-hotplug: Introduce hotplug info into struct numa_meminfo.

Since Yinghai has implemented "Allocate pagetable pages in local node", for a
node with hotpluggable memory, we have to allocate pagetable pages first, and
then reserve the rest as hotpluggable memory in memblock.

But the kernel parses SRAT first, and then initializes the memory mapping. So we
have to remember which memory ranges are hotpluggable for future use.

When parsing SRAT, we add each memory range to numa_meminfo. So we can store
the hotpluggable info in numa_meminfo.

This patch introduces a "bool hotpluggable" member into struct
numa_meminfo.

It also modifies the following APIs' prototypes to support it:
- numa_add_memblk()
- numa_add_memblk_to()

and updates the following callers:
- numaq_register_node()
- dummy_numa_init()
- amd_numa_init()
- acpi_numa_memory_affinity_init() in x86

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/numa.h | 3 ++-
arch/x86/kernel/apic/numaq_32.c | 2 +-
arch/x86/mm/amdtopology.c | 3 ++-
arch/x86/mm/numa.c | 10 +++++++---
arch/x86/mm/numa_internal.h | 1 +
arch/x86/mm/srat.c | 2 +-
6 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 1b99ee5..73096b2 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -31,7 +31,8 @@ extern int numa_off;
extern s16 __apicid_to_node[MAX_LOCAL_APIC];
extern nodemask_t numa_nodes_parsed __initdata;

-extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
+extern int __init numa_add_memblk(int nodeid, u64 start, u64 end,
+ bool hotpluggable);
extern void __init numa_set_distance(int from, int to, int distance);

static inline void set_apicid_to_node(int apicid, s16 node)
diff --git a/arch/x86/kernel/apic/numaq_32.c b/arch/x86/kernel/apic/numaq_32.c
index d661ee9..7a9c542 100644
--- a/arch/x86/kernel/apic/numaq_32.c
+++ b/arch/x86/kernel/apic/numaq_32.c
@@ -82,7 +82,7 @@ static inline void numaq_register_node(int node, struct sys_cfg_data *scd)
int ret;

node_set(node, numa_nodes_parsed);
- ret = numa_add_memblk(node, start, end);
+ ret = numa_add_memblk(node, start, end, false);
BUG_ON(ret < 0);
}

diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index 5247d01..d521471 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -167,7 +167,8 @@ int __init amd_numa_init(void)
nodeid, base, limit);

prevbase = base;
- numa_add_memblk(nodeid, base, limit);
+ /* Do not support memory hotplug for AMD cpu. */
+ numa_add_memblk(nodeid, base, limit, false);
node_set(nodeid, numa_nodes_parsed);
}

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4f754e6..ecf37fd 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -134,6 +134,7 @@ void __init setup_node_to_cpumask_map(void)
}

static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
+ bool hotpluggable,
struct numa_meminfo *mi)
{
/* ignore zero length blks */
@@ -155,6 +156,7 @@ static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
mi->blk[mi->nr_blks].start = start;
mi->blk[mi->nr_blks].end = end;
mi->blk[mi->nr_blks].nid = nid;
+ mi->blk[mi->nr_blks].hotpluggable = hotpluggable;
mi->nr_blks++;
return 0;
}
@@ -179,15 +181,17 @@ void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
* @nid: NUMA node ID of the new memblk
* @start: Start address of the new memblk
* @end: End address of the new memblk
+ * @hotpluggable: True if memblk is hotpluggable
*
* Add a new memblk to the default numa_meminfo.
*
* RETURNS:
* 0 on success, -errno on failure.
*/
-int __init numa_add_memblk(int nid, u64 start, u64 end)
+int __init numa_add_memblk(int nid, u64 start, u64 end,
+ bool hotpluggable)
{
- return numa_add_memblk_to(nid, start, end, &numa_meminfo);
+ return numa_add_memblk_to(nid, start, end, hotpluggable, &numa_meminfo);
}

/* Initialize NODE_DATA for a node on the local memory */
@@ -631,7 +635,7 @@ static int __init dummy_numa_init(void)
0LLU, PFN_PHYS(max_pfn) - 1);

node_set(0, numa_nodes_parsed);
- numa_add_memblk(0, 0, PFN_PHYS(max_pfn));
+ numa_add_memblk(0, 0, PFN_PHYS(max_pfn), false);

return 0;
}
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index bb2fbcc..1ce4e6b 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -8,6 +8,7 @@ struct numa_memblk {
u64 start;
u64 end;
int nid;
+ bool hotpluggable;
};

struct numa_meminfo {
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 5055fa7..f7f6fd4 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -171,7 +171,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
goto out_err_bad_srat;
}

- if (numa_add_memblk(node, start, end) < 0)
+ if (numa_add_memblk(node, start, end, hotpluggable) < 0)
goto out_err_bad_srat;

node_set(node, numa_nodes_parsed);
--
1.7.1
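The numa_meminfo change above only threads one extra bit through the add path. A minimal userspace sketch (with a hypothetical `add_memblk()` helper and simplified types, not the kernel code) shows how the flag travels with each range:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_BLKS 16

struct numa_memblk {
	unsigned long long start;
	unsigned long long end;
	int nid;
	bool hotpluggable;	/* new field from this patch */
};

struct numa_meminfo {
	int nr_blks;
	struct numa_memblk blk[NR_BLKS];
};

/* Mirrors numa_add_memblk_to(): ignore empty ranges, record the flag. */
static int add_memblk(struct numa_meminfo *mi, int nid,
		      unsigned long long start, unsigned long long end,
		      bool hotpluggable)
{
	if (start >= end)
		return 0;		/* ignore zero length blks */
	if (mi->nr_blks >= NR_BLKS)
		return -1;
	mi->blk[mi->nr_blks].start = start;
	mi->blk[mi->nr_blks].end = end;
	mi->blk[mi->nr_blks].nid = nid;
	mi->blk[mi->nr_blks].hotpluggable = hotpluggable;
	mi->nr_blks++;
	return 0;
}
```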

2013-04-30 09:22:07

by Tang Chen

[permalink] [raw]
Subject: [PATCH v2 02/13] acpi: Print Hot-Pluggable Field in SRAT.

The Hot-Pluggable field in SRAT indicates whether the memory could be
hotplugged while the system is running. Printing it when parsing
SRAT will help users know which memory is hotpluggable.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/srat.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 443f9ef..5055fa7 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -146,6 +146,7 @@ int __init
acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
{
u64 start, end;
+ u32 hotpluggable;
int node, pxm;

if (srat_disabled())
@@ -154,7 +155,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
goto out_err_bad_srat;
if ((ma->flags & ACPI_SRAT_MEM_ENABLED) == 0)
goto out_err;
- if ((ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) && !save_add_info())
+ hotpluggable = ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
+ if (hotpluggable && !save_add_info())
goto out_err;

start = ma->base_address;
@@ -174,9 +176,10 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)

node_set(node, numa_nodes_parsed);

- printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
+ printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
node, pxm,
- (unsigned long long) start, (unsigned long long) end - 1);
+ (unsigned long long) start, (unsigned long long) end - 1,
+ hotpluggable ? "Hot Pluggable" : "");

return 0;
out_err_bad_srat:
--
1.7.1
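The flag test and print above can be exercised in isolation. This is a userspace sketch with a hypothetical `srat_line()` helper; only the flag values and the printf format follow the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Flag bits as defined for the SRAT memory affinity structure. */
#define ACPI_SRAT_MEM_ENABLED		(1)
#define ACPI_SRAT_MEM_HOT_PLUGGABLE	(1 << 1)

/* Mirrors the patched print: an empty suffix unless the bit is set. */
static void srat_line(char *buf, size_t len, unsigned node, unsigned pxm,
		      uint64_t start, uint64_t end, uint32_t flags)
{
	uint32_t hotpluggable = flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;

	snprintf(buf, len, "SRAT: Node %u PXM %u [mem %#010llx-%#010llx] %s",
		 node, pxm, (unsigned long long)start,
		 (unsigned long long)end - 1,
		 hotpluggable ? "Hot Pluggable" : "");
}
```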

2013-04-30 09:22:24

by Tang Chen

[permalink] [raw]
Subject: [PATCH v2 03/13] page_alloc, mem-hotplug: Improve movablecore to {en|dis}able using SRAT.

The Hot-Pluggable Field in SRAT specifies which memory ranges are hotpluggable.
We will arrange hotpluggable memory as ZONE_MOVABLE for users who want to use
the memory hotplug functionality. But this will decrease NUMA performance
because the kernel cannot use ZONE_MOVABLE.

So we improve the movablecore boot option to allow those who want the memory
hotplug functionality to use SRAT info to arrange movable memory.

Users can specify "movablecore=acpi" in kernel commandline to enable this
functionality.

For those who don't use memory hotplug or who don't want to lose their NUMA
performance, just don't specify anything. The kernel will work as before.

Suggested-by: Kamezawa Hiroyuki <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
include/linux/memory_hotplug.h | 3 +++
mm/page_alloc.c | 13 +++++++++++++
2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index b6a3be7..18fe2a3 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,9 @@ enum {
ONLINE_MOVABLE,
};

+/* Enable/disable SRAT in movablecore boot option */
+extern bool movablecore_enable_srat;
+
/*
* pgdat resizing functions
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f368db4..b9ea143 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -208,6 +208,8 @@ static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];

+bool __initdata movablecore_enable_srat = false;
+
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
EXPORT_SYMBOL(movable_zone);
@@ -5025,6 +5027,12 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
}
}

+static void __init cmdline_movablecore_srat(char *p)
+{
+ if (p && !strcmp(p, "acpi"))
+ movablecore_enable_srat = true;
+}
+
static int __init cmdline_parse_core(char *p, unsigned long *core)
{
unsigned long long coremem;
@@ -5055,6 +5063,11 @@ static int __init cmdline_parse_kernelcore(char *p)
*/
static int __init cmdline_parse_movablecore(char *p)
{
+ cmdline_movablecore_srat(p);
+
+ if (movablecore_enable_srat)
+ return 0;
+
return cmdline_parse_core(p, &required_movablecore);
}

--
1.7.1
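The boot-option handling above can be sketched outside the kernel. The `parse_movablecore()` wrapper below is hypothetical; it only reproduces the decision the patch adds to `cmdline_parse_movablecore()` (the "nn[KMG]" size parsing is stubbed out):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool movablecore_enable_srat;

/* Mirrors cmdline_parse_movablecore(): "acpi" flips the switch,
 * anything else would fall through to the usual size parsing. */
static bool parse_movablecore(const char *p)
{
	if (p && !strcmp(p, "acpi")) {
		movablecore_enable_srat = true;
		return true;	/* SRAT mode: skip size-based parsing */
	}
	return false;		/* caller would parse "nn[KMG]" instead */
}
```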

2013-04-30 09:22:40

by Tang Chen

[permalink] [raw]
Subject: [PATCH v2 01/13] x86: get pg_data_t's memory from other node

From: Yasuaki Ishimatsu <[email protected]>

If the system can create a movable node, in which all memory of the
node is allocated as ZONE_MOVABLE, setup_node_data() cannot
allocate memory for the node's pg_data_t.
So, use memblock_alloc_try_nid() instead of memblock_alloc_nid()
to fall back to other nodes when the node-local allocation fails.

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/mm/numa.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 11acdf6..4f754e6 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -214,10 +214,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
* Allocate node data. Try node-local memory and then any node.
* Never allocate in DMA zone.
*/
- nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
+ nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
- nd_size, nid);
+ pr_err("Cannot find %zu bytes in any node\n", nd_size);
return;
}
nd = __va(nd_pa);
--
1.7.1
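The fallback behavior this patch relies on can be modeled with a toy allocator. Everything here is a hypothetical simplification; it only demonstrates the try-node-local-then-any-node idea behind `memblock_alloc_try_nid()`:

```c
#include <assert.h>

/* Toy allocator: each node has a fixed budget of bytes left.
 * Node 0 is "fully movable", i.e. has nothing the kernel may take. */
static long node_free[2] = { 0, 1024 };

static long alloc_nid(long size, int nid)
{
	if (node_free[nid] >= size) {
		node_free[nid] -= size;
		return nid;	/* "allocated" on this node */
	}
	return -1;
}

/* Mirrors the memblock_alloc_try_nid() idea: node-local first, then any node. */
static long alloc_try_nid(long size, int nid)
{
	long got = alloc_nid(size, nid);
	if (got >= 0)
		return got;
	for (int i = 0; i < 2; i++) {
		got = alloc_nid(size, i);
		if (got >= 0)
			return got;
	}
	return -1;
}
```

(As the review below notes, the real `memblock_alloc_try_nid()` panics instead of returning failure, so the error path in the patch is dead code.)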

2013-05-03 10:50:45

by Vasilis Liaskovitis

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to mark and reserve hotpluggable memory.

Hi,

On Tue, Apr 30, 2013 at 05:21:20PM +0800, Tang Chen wrote:
> We mark out movable memory ranges and reserve them with MEMBLK_HOTPLUGGABLE flag in
> memblock.reserved. This should be done after the memory mapping is initialized
> because the kernel now supports allocate pagetable pages on local node, which
> are kernel pages.
>
> The reserved hotpluggable will be freed to buddy when memory initialization
> is done.
>
> This idea is from Wen Congyang <[email protected]> and Jiang Liu <[email protected]>.
>
> Suggested-by: Jiang Liu <[email protected]>
> Suggested-by: Wen Congyang <[email protected]>
> Signed-off-by: Tang Chen <[email protected]>
> ---
> arch/x86/mm/numa.c | 28 ++++++++++++++++++++++++++++
> include/linux/memblock.h | 3 +++
> mm/memblock.c | 19 +++++++++++++++++++
> 3 files changed, 50 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1367fe4..a1f1f90 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -731,6 +731,32 @@ static void __init early_x86_numa_init_mapping(void)
> }
> #endif
>
> +#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> +static void __init early_mem_hotplug_init()
> +{
> + int i, nid;
> + phys_addr_t start, end;
> +
> + if (!movablecore_enable_srat)
> + return;
> +
> + for (i = 0; i < numa_meminfo.nr_blks; i++) {
> + if (!numa_meminfo.blk[i].hotpluggable)
> + continue;
> +
> + nid = numa_meminfo.blk[i].nid;

Should we skip ranges on nodes that the kernel uses? e.g. with

if (memblock_is_kernel_node(nid))
continue;

> + start = numa_meminfo.blk[i].start;
> + end = numa_meminfo.blk[i].end;
> +
> + memblock_reserve_hotpluggable(start, end - start, nid);
> + }
> +}

- I am getting a "PANIC: early exception" when rebooting with movablecore=acpi
after hotplugging memory on node0 or node1 of a 2-node VM. The guest kernel is
based on
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm (e9058baf) + these v2 patches.

This happens with or without the above memblock_is_kernel_node(nid) check.
Perhaps I am missing something or I need a newer "ACPI, numa: Parse numa info
early" patch-set?

A general question: Disabling hot-pluggability/zone-movable eligibility for a
whole node sounds a bit inflexible, if the machine only has one node to begin
with. Would it be possible to keep movable information per SRAT entry? I.e.
if the BIOS presents multiple SRAT entries for one node/PXM (say node 0), and
there is no memblock/kernel allocation on one of these SRAT entries, could
we still mark this SRAT entry's range as hot-pluggable/movable? Not sure if
many real machine BIOSes would do this, but seabios could. This implies that
SRAT entries are processed for movable-zone eligibility at entry granularity,
before they are merged on a node/PXM basis (I think numa_cleanup_meminfo
currently does this merge).

Of course the kernel should still have enough memory (i.e. a non-movable zone)
to boot. Can we ensure that at least a certain amount of memory is non-movable,
and then, given more separate SRAT entries for node0 not used by the kernel,
treat those remaining entries as movable?

thanks,

- Vasilis
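The `memblock_is_kernel_node()` skip suggested in this review can be sketched against the `early_mem_hotplug_init()` loop from the quoted patch. The names and types below (`struct blk`, `count_reservations()`, the `kernel_node[]` mask) are hypothetical simplifications:

```c
#include <assert.h>
#include <stdbool.h>

struct blk { int nid; bool hotpluggable; };

/* Toy nodemask: nodes the kernel has allocations on. */
static bool kernel_node[2] = { true, false };

/* Count how many ranges would be reserved as hotpluggable, applying the
 * memblock_is_kernel_node() skip suggested in the review. */
static int count_reservations(const struct blk *blks, int n)
{
	int reserved = 0;
	for (int i = 0; i < n; i++) {
		if (!blks[i].hotpluggable)
			continue;
		if (kernel_node[blks[i].nid])
			continue;	/* never steal memory the kernel lives on */
		reserved++;
	}
	return reserved;
}
```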

2013-05-06 02:24:57

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to mark and reserve hotpluggable memory.

Hi Vasilis,

Sorry for the delay and thank you for reviewing and testing. :)

On 05/03/2013 06:50 PM, Vasilis Liaskovitis wrote:
>
> Should we skip ranges on nodes that the kernel uses? e.g. with
>
> if (memblock_is_kernel_node(nid))
> continue;

Yes. I think I forgot to call it in this patch.
Will update in the next version.

>
>
> - I am getting a "PANIC: early exception" when rebooting with movablecore=acpi
> after hotplugging memory on node0 or node1 of a 2-node VM. The guest kernel is
> based on
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> for-x86-mm (e9058baf) + these v2 patches.
>
> This happens with or without the above memblock_is_kernel_node(nid) check.
> Perhaps I am missing something or I need a newer "ACPI, numa: Parse numa info
> early" patch-set?

I didn't test it on a VM, but on my real box I haven't gotten a panic
when rebooting. I think I can help test it in a VM, but would you please
tell me how to set up an environment like yours?

>
> A general question: Disabling hot-pluggability/zone-movable eligibility for a
> whole node sounds a bit inflexible, if the machine only has one node to begin
> with. Would it be possible to keep movable information per SRAT entry? I.e
> if the BIOS presents multiple SRAT entries for one node/PXM (say node 0), and
> there is no memblock/kernel allocation on one of these SRAT entries, could
> we still mark this SRAT entry's range as hot-pluggable/movable? Not sure if
> many real machine BIOSes would do this, but seabios could. This implies that
> SRAT entries are processed for movable-zone eligilibity before they are merged
> on node/PXM basis entry-granularity (I think numa_cleanup_meminfo currently does
> this merge).

Yes, this can be done. But in real usage, I think making only part of the
memory in a node hot-removable makes no sense. If we cannot remove the whole
node, we cannot remove the real hardware device.

But in virtualization, would you please give a reason why we need this
entry granularity?


Another thought, in case I didn't understand your question correctly. :)

Now in the kernel, we can recognize a node (by PXM in SRAT), but we cannot
recognize a memory device. Are you saying that if we have this
entry granularity, we can hotplug a single memory device in a node?
(Perhaps there is more than one memory device in a node.)

If so, it makes sense. But I don't think the kernel is able to recognize which
device a memory range belongs to now. And I'm not sure if we can do this.

>
> Of course the kernel should still have enough memory(i.e. non movable zone) to
> boot. Can we ensure that at least certain amount of memory is non-movable, and
> then, given more separate SRAT entries for node0 not used by kernel, treat
> these rest entries as movable?

I tried this idea before. But as HPA said, it seems there is no way to
calculate how much memory the kernel needs.
https://lkml.org/lkml/2012/11/27/29


Thanks. :)

2013-05-06 10:37:51

by Vasilis Liaskovitis

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to mark and reserve hotpluggable memory.

Hi Tang,

On Mon, May 06, 2013 at 10:27:44AM +0800, Tang Chen wrote:
> Hi Vasilis,
>
> Sorry for the delay and thank you for reviewing and testing. :)
>
> On 05/03/2013 06:50 PM, Vasilis Liaskovitis wrote:
> >
> >Should we skip ranges on nodes that the kernel uses? e.g. with
> >
> > if (memblock_is_kernel_node(nid))
> > continue;
>
> Yes. I think I forgot to call it in this patch.
> Will update in the next version.
ok

>
> >
> >
> >- I am getting a "PANIC: early exception" when rebooting with movablecore=acpi
> >after hotplugging memory on node0 or node1 of a 2-node VM. The guest kernel is
> >based on
> >git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> >for-x86-mm (e9058baf) + these v2 patches.
> >
> >This happens with or without the above memblock_is_kernel_node(nid) check.
> >Perhaps I am missing something or I need a newer "ACPI, numa: Parse numa info
> >early" patch-set?
>
> I didn't test it on a VM. But on my real box, I haven't got a panic
> when rebooting. I think I can help to test it in a VM, but would you
> please to
> tell me how to setup a environment as yours ?

you can use qemu-kvm and seabios from these branches:
https://github.com/vliaskov/qemu-kvm/commits/memhp-v4
https://github.com/vliaskov/seabios/commits/memhp-v4

Instructions on how to use the DIMM/memory hotplug are here:

http://lists.gnu.org/archive/html/qemu-devel/2012-12/msg02693.html
(these patchsets are not in mainline qemu/qemu-kvm and seabios)

e.g. the following creates a VM with 2G initial memory on 2 nodes (1GB on each).
There is also an extra 1GB DIMM on each node (the last 3 lines below describe
this):

/opt/qemu/bin/qemu-system-x86_64 -bios /opt/devel/seabios-upstream/out/bios.bin \
-enable-kvm -M pc -smp 4,maxcpus=8 -cpu host -m 2G \
-drive
file=/opt/images/debian.img,if=none,id=drive-virtio-disk0,format=raw,cache=none \
-device virtio-blk-pci,bus=pci.0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-netdev type=tap,id=guest0,vhost=on -device virtio-net-pci,netdev=guest0 -vga \
std -monitor stdio \
-numa node,mem=1G,cpus=2,nodeid=0 -numa node,mem=0,cpus=2,nodeid=1 \
-device dimm,id=dimm0,size=1G,node=0,bus=membus.0,populated=off \
-device dimm,id=dimm1,size=1G,node=1,bus=membus.0,populated=off

After startup I hotplug the dimm0 on node0 (or dimm1 on node1, same result)
(qemu) device_add dimm,id=dimm0,size=1G,node=0,bus=membus.0

Then I reboot the VM. The kernel works without "movablecore=acpi" but panics
with this option.

Note this qemu/seabios does not model initial memory (-m 2G) as memory devices.
Only extra dimms ("device -dimm") are modeled as separate memory devices.

>
> >
> >A general question: Disabling hot-pluggability/zone-movable eligibility for a
> >whole node sounds a bit inflexible, if the machine only has one node to begin
> >with. Would it be possible to keep movable information per SRAT entry? I.e
> >if the BIOS presents multiple SRAT entries for one node/PXM (say node 0), and
> >there is no memblock/kernel allocation on one of these SRAT entries, could
> >we still mark this SRAT entry's range as hot-pluggable/movable? Not sure if
> >many real machine BIOSes would do this, but seabios could. This implies that
> >SRAT entries are processed for movable-zone eligilibity before they are merged
> >on node/PXM basis entry-granularity (I think numa_cleanup_meminfo currently does
> >this merge).
>
> Yes, this can be done. But in real usage, part of the memory in a node
> is hot-removable makes no sense, I think. We cannot remove the whole node,
> so we cannot remove a real hardware device.
>
> But in virtualization, would you please give a reason why we need this
> entry-granularity ?

see below, basically as you suggest we may have multiple memory devices on same
node.
>
>
> Another thinking. Assume I didn't understand your question correctly. :)
>
> Now in kernel, we can recognize a node (by PXM in SRAT), but we cannot
> recognize a memory device. Are you saying if we have this
> entry-granularity,
> we can hotplug a single memory device in a node ? (Perhaps there are more
> than on memory device in a node.)

Yes, this is what I mean. Multiple memory devices on one node are possible in
both a real machine and a VM.
In the VM case, seabios can present different DIMM devices for any number of
nodes. Each DIMM is also given a separate SRAT entry by seabios. So when the
kernel initially parses the entries, it sees multiple ones for the same node.
(these are merged together in numa_cleanup_meminfo though)

>
> If so, it makes sense. But I don't the kernel is able to recognize which
> device a memory range belongs to now. And I'm not sure if we can do this.

kernel knows which memory ranges belong to each DIMM (with ACPI enabled, each
DIMM is represented by an acpi memory device, see drivers/acpi/acpi_memhotplug.c)

>
> >
> >Of course the kernel should still have enough memory(i.e. non movable zone) to
> >boot. Can we ensure that at least certain amount of memory is non-movable, and
> >then, given more separate SRAT entries for node0 not used by kernel, treat
> >these rest entries as movable?
>
> I tried this idea before. But as HPA said, it seems no way to
> calculate how much
> memory the kernel needs.
> https://lkml.org/lkml/2012/11/27/29

yes, if we can't guarantee enough non-movable memory for the kernel, I am not
sure how to do this.

thanks,

- Vasilis

2013-05-07 02:13:55

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to mark and reserve hotpluggable memory.

Hi Vasilis,

On 05/06/2013 06:37 PM, Vasilis Liaskovitis wrote:
>
> you can use qemu-kvm and seabios from these branches:
> https://github.com/vliaskov/qemu-kvm/commits/memhp-v4
> https://github.com/vliaskov/seabios/commits/memhp-v4
>
> Instructions on how to use the DIMM/memory hotplug are here:
>
> http://lists.gnu.org/archive/html/qemu-devel/2012-12/msg02693.html
> (these patchsets are not in mainline qemu/qemu-kvm and seabios)
>
> e.g. the following creates a VM with 2G initial memory on 2 nodes (1GB on each).
> There is also an extra 1GB DIMM on each node (the last 3 lines below describe
> this):
>
> /opt/qemu/bin/qemu-system-x86_64 -bios /opt/devel/seabios-upstream/out/bios.bin \
> -enable-kvm -M pc -smp 4,maxcpus=8 -cpu host -m 2G \
> -drive
> file=/opt/images/debian.img,if=none,id=drive-virtio-disk0,format=raw,cache=none \
> -device virtio-blk-pci,bus=pci.0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> -netdev type=tap,id=guest0,vhost=on -device virtio-net-pci,netdev=guest0 -vga \
> std -monitor stdio \
> -numa node,mem=1G,cpus=2,nodeid=0 -numa node,mem=0,cpus=2,nodeid=1 \
> -device dimm,id=dimm0,size=1G,node=0,bus=membus.0,populated=off \
> -device dimm,id=dimm1,size=1G,node=1,bus=membus.0,populated=off
>
> After startup I hotplug the dimm0 on node0 (or dimm1 on node1, same result)
> (qemu) device_add dimm,id=dimm0,size=1G,node=0,bus=membus.0
>
> than i reboot VM. Kernel works without "movablecore=acpi" but panics with this
> option.
>
> Note this qemu/seabios does not model initial memory (-m 2G) as memory devices.
> Only extra dimms ("device -dimm") are modeled as separate memory devices.
>

OK, I'll try it. Thank you for telling me this.:)

>>
>> Now in kernel, we can recognize a node (by PXM in SRAT), but we cannot
>> recognize a memory device. Are you saying if we have this
>> entry-granularity,
>> we can hotplug a single memory device in a node ? (Perhaps there are more
>> than on memory device in a node.)
>
> yes, this is what I mean. Multiple memory devices on one node is possible in
> both a real machine and a VM.
> In the VM case, seabios can present different DIMM devices for any number of
> nodes. Each DIMM is also given a separate SRAT entry by seabios. So when the
> kernel initially parses the entries, it sees multiple ones for the same node.
> (these are merged together in numa_cleanup_meminfo though)
>
>>
>> If so, it makes sense. But I don't the kernel is able to recognize which
>> device a memory range belongs to now. And I'm not sure if we can do this.
>
> kernel knows which memory ranges belong to each DIMM (with ACPI enabled, each
> DIMM is represented by an acpi memory device, see drivers/acpi/acpi_memhotplug.c)
>

Oh, I'll check acpi_memhotplug.c and see what we can do.

And BTW, as Yinghai suggested, we'd better put the pagetable in the local
node. But the best way is to put the pagetable in the local memory device,
I think. Otherwise, we are not able to hot-remove a memory device.

Thanks. :)



2013-05-22 04:40:29

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH v2 12/13] x86, numa, acpi, memory-hotplug: Make movablecore=acpi have higher priority.

Hi Vasilis,

Maybe the following two problems are the cause of the reboot panic in qemu
that you mentioned.

On 04/30/2013 05:21 PM, Tang Chen wrote:
......
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b9ea143..2fe9ebf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4793,9 +4793,31 @@ static void __init find_zone_movable_pfns_for_nodes(void)
> nodemask_t saved_node_state = node_states[N_MEMORY];
> unsigned long totalpages = early_calculate_totalpages();
> int usable_nodes = nodes_weight(node_states[N_MEMORY]);
> + struct memblock_type *reserved =&memblock.reserved;
>

Need to call find_usable_zone_for_movable() here before goto out.

> /*
> - * If movablecore was specified, calculate what size of
> + * If movablecore=acpi was specified, then zone_movable_pfn[] has been
> + * initialized, and no more work needs to do.
> + * NOTE: In this case, we ignore kernelcore option.
> + */
> + if (movablecore_enable_srat) {
> + for (i = 0; i< reserved->cnt; i++) {
> + if (!memblock_is_hotpluggable(&reserved->regions[i]))
> + continue;
> +
> + nid = reserved->regions[i].nid;
> +
> + usable_startpfn = reserved->regions[i].base;

Here, it should be PFN_DOWN(reserved->regions[i].base).

Thanks. :)

> + zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> + min(usable_startpfn, zone_movable_pfn[nid]) :
> + usable_startpfn;
> + }
> +
> + goto out;
> + }
> +
> + /*
> + * If movablecore=nn[KMG] was specified, calculate what size of
> * kernelcore that corresponds so that memory usable for
> * any allocation type is evenly spread. If both kernelcore
> * and movablecore are specified, then the value of kernelcore

2013-05-22 09:02:38

by Chen, Gong

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] x86: get pg_data_t's memory from other node

On Tue, Apr 30, 2013 at 05:21:11PM +0800, Tang Chen wrote:
> Date: Tue, 30 Apr 2013 17:21:11 +0800
> From: Tang Chen <[email protected]>
> To: [email protected], [email protected], [email protected],
> [email protected], [email protected], [email protected],
> [email protected], [email protected], [email protected],
> [email protected], [email protected], [email protected],
> [email protected]
> Cc: [email protected], [email protected],
> [email protected], [email protected]
> Subject: [PATCH v2 01/13] x86: get pg_data_t's memory from other node
> X-Mailer: git-send-email 1.7.10.1
>
> From: Yasuaki Ishimatsu <[email protected]>
>
> If system can create movable node which all memory of the
> node is allocated as ZONE_MOVABLE, setup_node_data() cannot
> allocate memory for the node's pg_data_t.
> So, use memblock_alloc_try_nid() instead of memblock_alloc_nid()
> to retry when the first allocation fails.
>
> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>
> Signed-off-by: Tang Chen <[email protected]>
> Signed-off-by: Jiang Liu <[email protected]>
> ---
> arch/x86/mm/numa.c | 5 ++---
> 1 files changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 11acdf6..4f754e6 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -214,10 +214,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
> * Allocate node data. Try node-local memory and then any node.
> * Never allocate in DMA zone.
> */
> - nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
> + nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);

Going through the implementation of memblock_alloc_try_nid, it will call
panic() when the allocation fails (i.e. alloc == 0). If so, the information
below will never be printed. Do we really need this?

> if (!nd_pa) {
> - pr_err("Cannot find %zu bytes in node %d\n",
> - nd_size, nid);
> + pr_err("Cannot find %zu bytes in any node\n", nd_size);
> return;
> }
> nd = __va(nd_pa);
> --
> 1.7.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>



2013-05-22 09:21:47

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] x86: get pg_data_t's memory from other node

On 05/22/2013 04:55 PM, Chen Gong wrote:
......
>> - nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>> + nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>
> go through the implementation of memblock_alloc_try_nid, it will call
> panic when allocation fails(a.k.a alloc = 0), if so, below information
> will be never printed. Do we really need this?

Oh, yes.

We don't need this. Will remove the following in the next version.

Thanks. :)

>
>> if (!nd_pa) {
>> - pr_err("Cannot find %zu bytes in node %d\n",
>> - nd_size, nid);
>> + pr_err("Cannot find %zu bytes in any node\n", nd_size);
>> return;
>> }
>> nd = __va(nd_pa);
>> --
>> 1.7.1
>>

2013-05-31 16:15:14

by Vasilis Liaskovitis

[permalink] [raw]
Subject: Re: [PATCH v2 07/13] x86, numa, mem-hotplug: Mark nodes which the kernel resides in.

Hi,

On Tue, Apr 30, 2013 at 05:21:17PM +0800, Tang Chen wrote:
> If all the memory ranges in SRAT are hotpluggable, we should not
> arrange them all in ZONE_MOVABLE. Otherwise the kernel won't have
> enough memory to boot.
>
> This patch introduces a global variable, kernel_nodemask, to mark
> all the nodes the kernel resides in. No matter whether they are
> hotpluggable, we arrange them as non-hotpluggable.
>
> Signed-off-by: Tang Chen <[email protected]>
> ---
> arch/x86/mm/numa.c | 6 ++++++
> include/linux/memblock.h | 1 +
> mm/memblock.c | 20 ++++++++++++++++++++
> 3 files changed, 27 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 26d1800..105b092 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -658,6 +658,12 @@ static bool srat_used __initdata;
> */
> static void __init early_x86_numa_init(void)
> {
> + /*
> + * Need to find out which nodes the kernel resides in, and arrange
> + * them as un-hotpluggable when parsing SRAT.
> + */
> + memblock_mark_kernel_nodes();
> +
> if (!numa_off) {
> #ifdef CONFIG_X86_NUMAQ
> if (!numa_init(numaq_numa_init))
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index c63a66e..5064eed 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -66,6 +66,7 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
> int memblock_free(phys_addr_t base, phys_addr_t size);
> int memblock_reserve(phys_addr_t base, phys_addr_t size);
> void memblock_trim_memory(phys_addr_t align);
> +void memblock_mark_kernel_nodes(void);
>
> #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 63924ae..1b93a5d 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -35,6 +35,9 @@ struct memblock memblock __initdata_memblock = {
> .current_limit = MEMBLOCK_ALLOC_ANYWHERE,
> };
>
> +/* Mark which nodes the kernel resides in. */
> +static nodemask_t memblock_kernel_nodemask __initdata_memblock;
> +
> int memblock_debug __initdata_memblock;
> static int memblock_can_resize __initdata_memblock;
> static int memblock_memory_in_slab __initdata_memblock = 0;
> @@ -787,6 +790,23 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
> memblock_merge_regions(type);
> return 0;
> }
> +
> +void __init_memblock memblock_mark_kernel_nodes()
> +{
> + int i, nid;
> + struct memblock_type *reserved = &memblock.reserved;
> +
> + for (i = 0; i < reserved->cnt; i++)
> + if (reserved->regions[i].flags == MEMBLK_FLAGS_DEFAULT) {
> + nid = memblock_get_region_node(&reserved->regions[i]);
> + node_set(nid, memblock_kernel_nodemask);
> + }
> +}

I think there is a problem here because memblock_set_region_node is sometimes
called with nid == MAX_NUMNODES. This means the correct node is not properly
masked in the memblock_kernel_nodemask bitmap.
E.g. in a VM test, memblock_mark_kernel_nodes (with extra pr_warn calls) iterates
over the following memblocks (the ranges below are memblock base-(base+size)):

[ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000000000000-0x00000000010000
[ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000000098000-0x00000000100000
[ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000001000000-0x00000001a5a000
[ 0.000000] memblock_mark_kernel_nodes nid=64 0x00000037000000-0x000000377f8000

where MAX_NUMNODES is 64 because CONFIG_NODES_SHIFT=6.
The ranges above belong to node 0, but the node's bit is never marked.

With a buggy BIOS that marks all memory as hotpluggable, this results in a
panic: both the hotpluggable-bit check and the memblock_kernel_bitmask check
(in early_mem_hotplug_init) fail, the NUMA regions have all been merged
together, and memblock_reserve_hotpluggable is called for all memory.

With a correct BIOS (some part of initial memory is not hotpluggable), the
kernel can boot since the hotpluggable-bit check works, but extra DIMMs on
node 0 will still be allowed to be in ZONE_MOVABLE.

Actually this behaviour (being able to have MOVABLE memory on nodes with
kernel-reserved memblocks) sort of matches the policy I requested in v2 :).
But I suspect that is not your intent, i.e. you want
memblock_kernel_nodemask_bitmap to prevent movable reservations for the whole
node where the kernel has reserved memblocks.

Is there a way to get accurate nid information for memblocks at early boot? I
suspect pfn_to_nid doesn't work yet at this stage (I got a panic when I
attempted it, IIRC).

I used the hack below, but it depends on CONFIG_NUMA; hopefully there is a
cleaner, more general way:

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index cfd8c2f..af8ad2a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -133,6 +133,19 @@ void __init setup_node_to_cpumask_map(void)
pr_debug("Node to cpumask map for %d nodes\n", nr_node_ids);
}

+int __init numa_find_range_nid(u64 start, u64 size)
+{
+ unsigned int i;
+ struct numa_meminfo *mi = &numa_meminfo;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ if (start >= mi->blk[i].start && start + size -1 <= mi->blk[i].end)
+ return mi->blk[i].nid;
+ }
+ return -1;
+}
+EXPORT_SYMBOL(numa_find_range_nid);
+
static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
bool hotpluggable,
struct numa_meminfo *mi)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77a71fb..194b7c7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1600,6 +1600,9 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
#endif

+#ifdef CONFIG_NUMA
+int __init numa_find_range_nid(u64 start, u64 size);
+#endif
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/mm/memblock.c b/mm/memblock.c
index a6b7845..284aced 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -834,15 +834,26 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,

void __init_memblock memblock_mark_kernel_nodes()
{
- int i, nid;
+ int i, nid, tmpnid;
struct memblock_type *reserved = &memblock.reserved;

for (i = 0; i < reserved->cnt; i++)
if (reserved->regions[i].flags == MEMBLK_FLAGS_DEFAULT) {
nid = memblock_get_region_node(&reserved->regions[i]);
+ if (nid == MAX_NUMNODES) {
+ tmpnid = numa_find_range_nid(reserved->regions[i].base,
+ reserved->regions[i].size);
+ if (tmpnid >= 0)
+ nid = tmpnid;
+ }

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e862311..84d6e64 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -667,11 +667,7 @@ static bool srat_used __initdata;
*/
static void __init early_x86_numa_init(void)
{
- /*
- * Need to find out which nodes the kernel resides in, and arrange
- * them as un-hotpluggable when parsing SRAT.
- */
- memblock_mark_kernel_nodes();

if (!numa_off) {
#ifdef CONFIG_X86_NUMAQ
@@ -779,6 +775,12 @@ void __init early_initmem_init(void)
load_cr3(swapper_pg_dir);
__flush_tlb_all();

+ /*
+ * Need to find out which nodes the kernel resides in, and arrange
+ * them as un-hotpluggable when parsing SRAT.
+ */
+
+ memblock_mark_kernel_nodes();
early_mem_hotplug_init();

early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
--

2013-05-31 16:25:55

by Vasilis Liaskovitis

[permalink] [raw]
Subject: Re: [PATCH v2 07/13] x86, numa, mem-hotplug: Mark nodes which the kernel resides in.

> Hi,

Sorry, I meant to reply to v3, not v2. Please continue the discussion in v3
(I sent the mail there as well).

- Vasilis