2013-06-13 13:36:35

by Tang Chen

Subject: [Part2 PATCH v4 00/15] Arrange hotpluggable memory in SRAT as ZONE_MOVABLE.

In a memory hotplug situation, hotpluggable memory should be
arranged in ZONE_MOVABLE, because memory in ZONE_NORMAL may be
used by the kernel, and Linux cannot migrate pages used by the kernel.

So we need a way to specify hotpluggable memory as movable. It
should be as easy as possible.

According to the ACPI 5.0 spec, the SRAT table contains a Memory
Affinity Structure, and that structure has a Hot Pluggable Field.
See "5.2.16.2 Memory Affinity Structure".

Using this information, firmware can tell the kernel which memory is
hotpluggable. For example, if the Hot Pluggable Field is set, the
kernel arranges that memory as movable memory.
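
As a reference, this is how the check against the ACPI structure can look.
A minimal sketch, assuming the SRAT entry has already been parsed into a
struct acpi_srat_mem_affinity; the helper name is illustrative, and patch2
below open-codes the same test in acpi_numa_memory_affinity_init():

	#include <linux/acpi.h>

	/* Bit 1 of the SRAT Flags field is the Hot Pluggable Field. */
	static u32 srat_mem_hotpluggable(struct acpi_srat_mem_affinity *ma)
	{
		return ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
	}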

To achieve this goal, we need to do the following:
1. Prevent memblock from allocating hotpluggable memory for the kernel.
This is done by reserving hotpluggable memory in memblock in the
following steps:
1) Parse SRAT early enough so that memblock knows which memory
is hotpluggable.
2) Add a "flags" member to memblock so that it is able to tell
which memory is hotpluggable when freeing it to the buddy system.

2. Free hotpluggable memory to the buddy system when memory
initialization is done.

3. Arrange hotpluggable memory in ZONE_MOVABLE.
(This will decrease NUMA performance.)

4. Provide a user interface to enable/disable this functionality.
(This is useful for those who don't use memory hotplug and don't want
to lose their NUMA performance; see the boot example below.)
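
As a concrete illustration of item 4, enabling the feature is a
one-parameter change on the kernel command line. The image path and root
device below are hypothetical; only the movablecore=acpi parameter comes
from this patch-set:

	# Illustrative GRUB kernel line
	linux /boot/vmlinuz root=/dev/sda1 movablecore=acpi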


This patch-set does the following:
patch1: Fix a minor allocation problem in setup_node_data().
patch2: Have the Hot-Pluggable Field in SRAT printed when parsing SRAT.
patch4,5: Introduce a hotpluggable field into numa_meminfo.
patch6~9: Introduce flags to memblock, keeping the public APIs' prototypes
unmodified.
patch10,11: Reserve node-life-cycle memory as MEMBLK_LOCAL_NODE with memblock.
patch12,13: Reserve hotpluggable memory as MEMBLK_HOTPLUGGABLE with memblock,
and free it to buddy when memory initialization is done.
patch3,14,15: Improve "movablecore" boot option to support "movablecore=acpi".


Change log v3 -> v4:
1. Define flags in memblock as macros directly instead of with bit shifts.
2. Fix a bug found by Vasilis Liaskovitis: mark the nodes which the
kernel resides in correctly.

Change log v2 -> v3:
1. As Chen Gong <[email protected]> noticed, memblock_alloc_try_nid()
will call panic() if it fails to allocate memory, so remove the
return value check in setup_node_data() in patch1.
2. v2 did not move find_usable_zone_for_movable() forward
to initialize movable_zone. Fixed in patch12.
3. v2 did not transform reserved->regions[i].base to its PFN
in find_zone_movable_pfns_for_nodes(). Fixed in patch12.

Change log v1 -> v2:
1. Fix a bug in patch10: forgot to update start and end value.
2. Add new patch8: make alloc_low_pages be able to call
memory_add_physaddr_to_nid().


This patch-set is based on Yinghai's
"x86, ACPI, numa: Parse numa info early" patch-set.
Please refer to:
v1: https://lkml.org/lkml/2013/3/7/642
v2: https://lkml.org/lkml/2013/3/10/47
v3: https://lkml.org/lkml/2013/4/4/639
v4: https://lkml.org/lkml/2013/4/11/829

And Yinghai's patch-set did the following things:
1) Parse SRAT early enough.
2) Allocate pagetable pages on the local node.

Tang Chen (14):
acpi: Print Hot-Pluggable Field in SRAT.
page_alloc, mem-hotplug: Improve movablecore to {en|dis}able using
SRAT.
x86, numa, acpi, memory-hotplug: Introduce hotplug info into struct
numa_meminfo.
x86, numa, acpi, memory-hotplug: Consider hotplug info when cleanup
numa_meminfo.
memblock, numa: Introduce flag into memblock.
x86, numa: Synchronize nid info in memblock.reserve with
numa_meminfo.
x86, numa: Save nid when reserve memory into memblock.reserved[].
x86, numa, mem-hotplug: Mark nodes which the kernel resides in.
x86, numa: Move memory_add_physaddr_to_nid() to CONFIG_NUMA.
x86, numa, memblock: Introduce MEMBLK_LOCAL_NODE to mark and reserve
node-life-cycle data.
x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to mark
and reserve hotpluggable memory.
x86, memblock, mem-hotplug: Free hotpluggable memory reserved by
memblock.
x86, numa, acpi, memory-hotplug: Make movablecore=acpi have higher
priority.
doc, page_alloc, acpi, mem-hotplug: Add doc for movablecore=acpi boot
option.

Yasuaki Ishimatsu (1):
x86: get pg_data_t's memory from other node

Documentation/kernel-parameters.txt | 8 ++
arch/x86/include/asm/numa.h | 3 +-
arch/x86/kernel/apic/numaq_32.c | 2 +-
arch/x86/mm/amdtopology.c | 3 +-
arch/x86/mm/init.c | 16 +++-
arch/x86/mm/numa.c | 118 +++++++++++++++++++++++++++++---
arch/x86/mm/numa_internal.h | 1 +
arch/x86/mm/srat.c | 13 ++--
include/linux/memblock.h | 13 ++++
include/linux/memory_hotplug.h | 3 +
include/linux/mm.h | 9 +++
mm/memblock.c | 129 +++++++++++++++++++++++++++++++----
mm/nobootmem.c | 3 +
mm/page_alloc.c | 44 +++++++++++-
14 files changed, 325 insertions(+), 40 deletions(-)


2013-06-13 13:28:16

by Tang Chen

Subject: [Part2 PATCH v4 14/15] x86, numa, acpi, memory-hotplug: Make movablecore=acpi have higher priority.

Arranging hotpluggable memory as ZONE_MOVABLE will decrease NUMA
performance, because the kernel cannot use movable memory.

Users who don't use memory hotplug and don't want to lose their NUMA
performance need a way to disable this functionality.

So, if users specify "movablecore=acpi" on the kernel command line, the
kernel will use SRAT to arrange ZONE_MOVABLE, and it has higher priority
than the original movablecore and kernelcore boot options.

For those who don't want this, just specify nothing.

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 5 +++++
mm/page_alloc.c | 31 +++++++++++++++++++++++++++++--
3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index d113175..a85ced9 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -66,6 +66,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_node(phys_addr_t base, phys_addr_t size, int nid);
int memblock_reserve_local_node(phys_addr_t base, phys_addr_t size, int nid);
int memblock_reserve_hotpluggable(phys_addr_t base, phys_addr_t size, int nid);
+bool memblock_is_hotpluggable(struct memblock_region *region);
void memblock_free_hotpluggable(void);
void memblock_trim_memory(phys_addr_t align);
void memblock_mark_kernel_nodes(void);
diff --git a/mm/memblock.c b/mm/memblock.c
index 9df0b5f..0c4a709 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -626,6 +626,11 @@ int __init_memblock memblock_reserve_hotpluggable(phys_addr_t base,
return memblock_reserve_region(base, size, nid, MEMBLK_HOTPLUGGABLE);
}

+bool __init_memblock memblock_is_hotpluggable(struct memblock_region *region)
+{
+ return region->flags & MEMBLK_HOTPLUGGABLE;
+}
+
/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee5ae49..10c85b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4830,9 +4830,37 @@ static void __init find_zone_movable_pfns_for_nodes(void)
nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+ struct memblock_type *reserved = &memblock.reserved;

/*
- * If movablecore was specified, calculate what size of
+ * Need to find movable_zone earlier in case movablecore=acpi is
+ * specified.
+ */
+ find_usable_zone_for_movable();
+
+ /*
+ * If movablecore=acpi was specified, then zone_movable_pfn[] has been
+ * initialized, and no more work needs to do.
+ * NOTE: In this case, we ignore kernelcore option.
+ */
+ if (movablecore_enable_srat) {
+ for (i = 0; i < reserved->cnt; i++) {
+ if (!memblock_is_hotpluggable(&reserved->regions[i]))
+ continue;
+
+ nid = reserved->regions[i].nid;
+
+ usable_startpfn = PFN_DOWN(reserved->regions[i].base);
+ zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+ min(usable_startpfn, zone_movable_pfn[nid]) :
+ usable_startpfn;
+ }
+
+ goto out;
+ }
+
+ /*
+ * If movablecore=nn[KMG] was specified, calculate what size of
* kernelcore that corresponds so that memory usable for
* any allocation type is evenly spread. If both kernelcore
* and movablecore are specified, then the value of kernelcore
@@ -4858,7 +4886,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
goto out;

/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
- find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];

restart:
--
1.7.1

2013-06-13 13:28:24

by Tang Chen

Subject: [Part2 PATCH v4 10/15] x86, numa: Move memory_add_physaddr_to_nid() to CONFIG_NUMA.

memory_add_physaddr_to_nid() is declared in include/linux/memory_hotplug.h,
protected by CONFIG_NUMA. But on x86, the definition is protected by
CONFIG_MEMORY_HOTPLUG.

memory_add_physaddr_to_nid() uses numa_meminfo to find the physical
address's nid. It has nothing to do with memory hotplug. It can also be
used by alloc_low_pages() to obtain the nid of the allocated memory.

So on x86, use CONFIG_NUMA to protect it as well.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1242190..2b5057f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -1009,7 +1009,7 @@ EXPORT_SYMBOL(cpumask_of_node);

#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */

-#ifdef CONFIG_MEMORY_HOTPLUG
+#ifdef CONFIG_NUMA
int memory_add_physaddr_to_nid(u64 start)
{
struct numa_meminfo *mi = &numa_meminfo;
--
1.7.1

2013-06-13 13:28:56

by Tang Chen

Subject: [Part2 PATCH v4 03/15] page_alloc, mem-hotplug: Improve movablecore to {en|dis}able using SRAT.

The Hot-Pluggable Field in SRAT specifies which memory ranges are
hotpluggable. We will arrange hotpluggable memory as ZONE_MOVABLE for users
who want memory hotplug functionality. But this will decrease NUMA
performance, because the kernel cannot use ZONE_MOVABLE.

So we improve the movablecore boot option to allow those who want memory
hotplug functionality to enable using SRAT info to arrange movable memory.

Users can specify "movablecore=acpi" on the kernel command line to enable
this functionality.

For those who don't use memory hotplug or who don't want to lose their NUMA
performance, just don't specify anything. The kernel will work as before.
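
So after this patch, movablecore accepts two forms. The size form is
pre-existing; only "acpi" is new:

	movablecore=acpi	# arrange ZONE_MOVABLE from SRAT hotplug info
	movablecore=512M	# original behavior: make 512M of memory movable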

Suggested-by: Kamezawa Hiroyuki <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
include/linux/memory_hotplug.h | 3 +++
mm/page_alloc.c | 13 +++++++++++++
2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3e622c6..0b21e54 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,9 @@ enum {
ONLINE_MOVABLE,
};

+/* Enable/disable SRAT in movablecore boot option */
+extern bool movablecore_enable_srat;
+
/*
* pgdat resizing functions
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7ba7703..ee5ae49 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -209,6 +209,8 @@ static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];

+bool __initdata movablecore_enable_srat;
+
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
EXPORT_SYMBOL(movable_zone);
@@ -5062,6 +5064,12 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
}
}

+static void __init cmdline_movablecore_srat(char *p)
+{
+ if (p && !strcmp(p, "acpi"))
+ movablecore_enable_srat = true;
+}
+
static int __init cmdline_parse_core(char *p, unsigned long *core)
{
unsigned long long coremem;
@@ -5092,6 +5100,11 @@ static int __init cmdline_parse_kernelcore(char *p)
*/
static int __init cmdline_parse_movablecore(char *p)
{
+ cmdline_movablecore_srat(p);
+
+ if (movablecore_enable_srat)
+ return 0;
+
return cmdline_parse_core(p, &required_movablecore);
}

--
1.7.1

2013-06-13 13:28:55

by Tang Chen

Subject: [Part2 PATCH v4 12/15] x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to mark and reserve hotpluggable memory.

We mark out movable memory ranges and reserve them with the
MEMBLK_HOTPLUGGABLE flag in memblock.reserved. This should be done after
the memory mapping is initialized, because the kernel now supports
allocating pagetable pages on the local node, and those are kernel pages.

The reserved hotpluggable memory will be freed to the buddy system when
memory initialization is done.

And also, ensure that all the nodes which the kernel resides in are
un-hotpluggable.

This idea is from Wen Congyang <[email protected]> and Jiang Liu <[email protected]>.

Suggested-by: Jiang Liu <[email protected]>
Suggested-by: Wen Congyang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Vasilis Liaskovitis <[email protected]>
---
arch/x86/mm/numa.c | 29 +++++++++++++++++++++++++++++
include/linux/memblock.h | 3 +++
mm/memblock.c | 18 ++++++++++++++++++
3 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2b5057f..31595c5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -771,6 +771,33 @@ static void __init early_x86_numa_init_mapping(void)
}
#endif

+#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+static void __init early_mem_hotplug_init()
+{
+ int i, nid;
+ phys_addr_t start, end;
+
+ if (!movablecore_enable_srat)
+ return;
+
+ for (i = 0; i < numa_meminfo.nr_blks; i++) {
+ nid = numa_meminfo.blk[i].nid;
+ start = numa_meminfo.blk[i].start;
+ end = numa_meminfo.blk[i].end;
+
+ if (!numa_meminfo.blk[i].hotpluggable ||
+ memblock_is_kernel_node(nid))
+ continue;
+
+ memblock_reserve_hotpluggable(start, end - start, nid);
+ }
+}
+#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+static inline void early_mem_hotplug_init()
+{
+}
+#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+
void __init early_initmem_init(void)
{
early_x86_numa_init();
@@ -790,6 +817,8 @@ void __init early_initmem_init(void)
load_cr3(swapper_pg_dir);
__flush_tlb_all();

+ early_mem_hotplug_init();
+
early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
}

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 517c027..ce315b2 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -22,6 +22,7 @@
/* Definition of memblock flags. */
#define MEMBLK_FLAGS_DEFAULT 0x0 /* default flag */
#define MEMBLK_LOCAL_NODE 0x1 /* node-life-cycle data */
+#define MEMBLK_HOTPLUGGABLE 0x2 /* hotpluggable region */

struct memblock_region {
phys_addr_t base;
@@ -64,8 +65,10 @@ int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_node(phys_addr_t base, phys_addr_t size, int nid);
int memblock_reserve_local_node(phys_addr_t base, phys_addr_t size, int nid);
+int memblock_reserve_hotpluggable(phys_addr_t base, phys_addr_t size, int nid);
void memblock_trim_memory(phys_addr_t align);
void memblock_mark_kernel_nodes(void);
+bool memblock_is_kernel_node(int nid);

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index e747bc6..51f0264 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -603,6 +603,12 @@ int __init_memblock memblock_reserve_local_node(phys_addr_t base,
return memblock_reserve_region(base, size, nid, MEMBLK_LOCAL_NODE);
}

+int __init_memblock memblock_reserve_hotpluggable(phys_addr_t base,
+ phys_addr_t size, int nid)
+{
+ return memblock_reserve_region(base, size, nid, MEMBLK_HOTPLUGGABLE);
+}
+
/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
@@ -816,11 +822,23 @@ void __init_memblock memblock_mark_kernel_nodes()
node_set(nid, memblock_kernel_nodemask);
}
}
+
+bool __init_memblock memblock_is_kernel_node(int nid)
+{
+ if (node_isset(nid, memblock_kernel_nodemask))
+ return true;
+ return false;
+}
#else
void __init_memblock memblock_mark_kernel_nodes()
{
node_set(0, memblock_kernel_nodemask);
}
+
+bool __init_memblock memblock_is_kernel_node(int nid)
+{
+ return true;
+}
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
--
1.7.1

2013-06-13 13:34:55

by Tang Chen

Subject: [Part2 PATCH v4 06/15] memblock, numa: Introduce flag into memblock.

There is no flag in memblock to describe what type the memory is.
Sometimes, we may use memblock to reserve some memory for special usage.
For example, as Yinghai did in his patch, allocating pagetables on the
local node before all the memory on the node is mapped.
Please refer to Yinghai's patch:
v1: https://lkml.org/lkml/2013/3/7/642
v2: https://lkml.org/lkml/2013/3/10/47
v3: https://lkml.org/lkml/2013/4/4/639
v4: https://lkml.org/lkml/2013/4/11/829

In a hotplug environment, hot-removing memory could be problematic if we
do so. Pagetable pages are kernel memory, which we cannot migrate. But we
can put them on the local node, because their life cycle is the same as
the node's, and free them all before the memory is hot-removed.

Actually, any data whose life cycle is the same as a node's, such as
pagetable pages, vmemmap pages, and page_cgroup pages, could be put on the
local node. They can all be freed when we hot-remove a whole node.

In order to do so, we need to mark out these special pages in memblock.
In this patch, we introduce a new "flags" member into memblock_region:

struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
	int nid;
#endif
};

This patch does the following things:
1) Add a "flags" member to memblock_region, and a MEMBLK_FLAGS_DEFAULT flag
for common usage.
2) Modify the following APIs' prototypes:
memblock_add_region()
memblock_insert_region()
3) Add memblock_reserve_region() to support reserving memory with flags,
and keep memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototypes
unmodified.

The idea is from Wen Congyang <[email protected]> and Liu Jiang <[email protected]>.

v3 -> v4: Define the flags as macros directly instead of with bit shifts.

Suggested-by: Wen Congyang <[email protected]>
Suggested-by: Liu Jiang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
include/linux/memblock.h | 4 +++
mm/memblock.c | 55 +++++++++++++++++++++++++++++++++------------
2 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..93f3453 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,9 +19,13 @@

#define INIT_MEMBLOCK_REGIONS 128

+/* Definition of memblock flags. */
+#define MEMBLK_FLAGS_DEFAULT 0x0 /* default flag */
+
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
+ unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
diff --git a/mm/memblock.c b/mm/memblock.c
index c5fad93..9e871e9 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -157,6 +157,7 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
type->cnt = 1;
type->regions[0].base = 0;
type->regions[0].size = 0;
+ type->regions[0].flags = 0;
memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
}
}
@@ -307,7 +308,8 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)

if (this->base + this->size != next->base ||
memblock_get_region_node(this) !=
- memblock_get_region_node(next)) {
+ memblock_get_region_node(next) ||
+ this->flags != next->flags) {
BUG_ON(this->base + this->size > next->base);
i++;
continue;
@@ -327,13 +329,15 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
* @base: base address of the new region
* @size: size of the new region
* @nid: node id of the new region
+ * @flags: flags of the new region
*
* Insert new memblock region [@base,@base+@size) into @type at @idx.
* @type must already have extra room to accomodate the new region.
*/
static void __init_memblock memblock_insert_region(struct memblock_type *type,
int idx, phys_addr_t base,
- phys_addr_t size, int nid)
+ phys_addr_t size,
+ int nid, unsigned long flags)
{
struct memblock_region *rgn = &type->regions[idx];

@@ -341,6 +345,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
rgn->base = base;
rgn->size = size;
+ rgn->flags = flags;
memblock_set_region_node(rgn, nid);
type->cnt++;
type->total_size += size;
@@ -352,6 +357,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* @base: base address of the new region
* @size: size of the new region
* @nid: nid of the new region
+ * @flags: flags of the new region
*
* Add new memblock region [@base,@base+@size) into @type. The new region
* is allowed to overlap with existing ones - overlaps don't affect already
@@ -362,7 +368,8 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* 0 on success, -errno on failure.
*/
static int __init_memblock memblock_add_region(struct memblock_type *type,
- phys_addr_t base, phys_addr_t size, int nid)
+ phys_addr_t base, phys_addr_t size,
+ int nid, unsigned long flags)
{
bool insert = false;
phys_addr_t obase = base;
@@ -377,6 +384,7 @@ static int __init_memblock memblock_add_region(struct memblock_type *type,
WARN_ON(type->cnt != 1 || type->total_size);
type->regions[0].base = base;
type->regions[0].size = size;
+ type->regions[0].flags = flags;
memblock_set_region_node(&type->regions[0], nid);
type->total_size = size;
return 0;
@@ -407,7 +415,8 @@ repeat:
nr_new++;
if (insert)
memblock_insert_region(type, i++, base,
- rbase - base, nid);
+ rbase - base, nid,
+ flags);
}
/* area below @rend is dealt with, forget about it */
base = min(rend, end);
@@ -417,7 +426,8 @@ repeat:
if (base < end) {
nr_new++;
if (insert)
- memblock_insert_region(type, i, base, end - base, nid);
+ memblock_insert_region(type, i, base, end - base,
+ nid, flags);
}

/*
@@ -439,12 +449,14 @@ repeat:
int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
int nid)
{
- return memblock_add_region(&memblock.memory, base, size, nid);
+ return memblock_add_region(&memblock.memory, base, size,
+ nid, MEMBLK_FLAGS_DEFAULT);
}

int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
{
- return memblock_add_region(&memblock.memory, base, size, MAX_NUMNODES);
+ return memblock_add_region(&memblock.memory, base, size,
+ MAX_NUMNODES, MEMBLK_FLAGS_DEFAULT);
}

/**
@@ -499,7 +511,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= base - rbase;
type->total_size -= base - rbase;
memblock_insert_region(type, i, rbase, base - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else if (rend > end) {
/*
* @rgn intersects from above. Split and redo the
@@ -509,7 +522,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= end - rbase;
type->total_size -= end - rbase;
memblock_insert_region(type, i--, rbase, end - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else {
/* @rgn is fully contained, record it */
if (!*end_rgn)
@@ -551,16 +565,25 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
return __memblock_remove(&memblock.reserved, base, size);
}

-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+static int __init_memblock memblock_reserve_region(phys_addr_t base,
+ phys_addr_t size,
+ int nid,
+ unsigned long flags)
{
struct memblock_type *_rgn = &memblock.reserved;

- memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
+ memblock_dbg("memblock_reserve: [%#016llx-%#016llx] with flags %#016lx %pF\n",
(unsigned long long)base,
(unsigned long long)base + size,
- (void *)_RET_IP_);
+ flags, (void *)_RET_IP_);
+
+ return memblock_add_region(_rgn, base, size, nid, flags);
+}

- return memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_reserve_region(base, size, MAX_NUMNODES,
+ MEMBLK_FLAGS_DEFAULT);
}

/**
@@ -985,6 +1008,7 @@ void __init_memblock memblock_set_current_limit(phys_addr_t limit)
static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
{
unsigned long long base, size;
+ unsigned long flags;
int i;

pr_info(" %s.cnt = 0x%lx\n", name, type->cnt);
@@ -995,13 +1019,14 @@ static void __init_memblock memblock_dump(struct memblock_type *type, char *name

base = rgn->base;
size = rgn->size;
+ flags = rgn->flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
if (memblock_get_region_node(rgn) != MAX_NUMNODES)
snprintf(nid_buf, sizeof(nid_buf), " on node %d",
memblock_get_region_node(rgn));
#endif
- pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s\n",
- name, i, base, base + size - 1, size, nid_buf);
+ pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s flags: %#lx\n",
+ name, i, base, base + size - 1, size, nid_buf, flags);
}
}

--
1.7.1

2013-06-13 13:36:50

by Tang Chen

Subject: [Part2 PATCH v4 04/15] x86, numa, acpi, memory-hotplug: Introduce hotplug info into struct numa_meminfo.

Since Yinghai has implemented "Allocate pagetable pages in local node", for
a node with hotpluggable memory, we have to allocate pagetable pages first,
and then reserve the rest as hotpluggable memory in memblock.

But the kernel parses SRAT first, and then initializes the memory mapping.
So we have to remember which memory ranges are hotpluggable for future
usage.

When parsing SRAT, we add each memory range to numa_meminfo, so we can
store the hotpluggable info there.

This patch introduces a "bool hotpluggable" member into struct
numa_meminfo.

And modifies the following APIs' prototypes to support it:
- numa_add_memblk()
- numa_add_memblk_to()

And the following callers:
- numaq_register_node()
- dummy_numa_init()
- amd_numa_init()
- acpi_numa_memory_affinity_init() in x86

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/numa.h | 3 ++-
arch/x86/kernel/apic/numaq_32.c | 2 +-
arch/x86/mm/amdtopology.c | 3 ++-
arch/x86/mm/numa.c | 10 +++++++---
arch/x86/mm/numa_internal.h | 1 +
arch/x86/mm/srat.c | 2 +-
6 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 1b99ee5..73096b2 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -31,7 +31,8 @@ extern int numa_off;
extern s16 __apicid_to_node[MAX_LOCAL_APIC];
extern nodemask_t numa_nodes_parsed __initdata;

-extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
+extern int __init numa_add_memblk(int nodeid, u64 start, u64 end,
+ bool hotpluggable);
extern void __init numa_set_distance(int from, int to, int distance);

static inline void set_apicid_to_node(int apicid, s16 node)
diff --git a/arch/x86/kernel/apic/numaq_32.c b/arch/x86/kernel/apic/numaq_32.c
index d661ee9..7a9c542 100644
--- a/arch/x86/kernel/apic/numaq_32.c
+++ b/arch/x86/kernel/apic/numaq_32.c
@@ -82,7 +82,7 @@ static inline void numaq_register_node(int node, struct sys_cfg_data *scd)
int ret;

node_set(node, numa_nodes_parsed);
- ret = numa_add_memblk(node, start, end);
+ ret = numa_add_memblk(node, start, end, false);
BUG_ON(ret < 0);
}

diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index 2ca15b5..64a94ad 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -166,7 +166,8 @@ int __init amd_numa_init(void)
nodeid, base, limit);

prevbase = base;
- numa_add_memblk(nodeid, base, limit);
+ /* Do not support memory hotplug for AMD cpu. */
+ numa_add_memblk(nodeid, base, limit, false);
node_set(nodeid, numa_nodes_parsed);
}

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index bea597a..bf610f8 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -131,6 +131,7 @@ void __init setup_node_to_cpumask_map(void)
}

static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
+ bool hotpluggable,
struct numa_meminfo *mi)
{
/* ignore zero length blks */
@@ -152,6 +153,7 @@ static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
mi->blk[mi->nr_blks].start = start;
mi->blk[mi->nr_blks].end = end;
mi->blk[mi->nr_blks].nid = nid;
+ mi->blk[mi->nr_blks].hotpluggable = hotpluggable;
mi->nr_blks++;
return 0;
}
@@ -176,15 +178,17 @@ void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
* @nid: NUMA node ID of the new memblk
* @start: Start address of the new memblk
* @end: End address of the new memblk
+ * @hotpluggable: True if memblk is hotpluggable
*
* Add a new memblk to the default numa_meminfo.
*
* RETURNS:
* 0 on success, -errno on failure.
*/
-int __init numa_add_memblk(int nid, u64 start, u64 end)
+int __init numa_add_memblk(int nid, u64 start, u64 end,
+ bool hotpluggable)
{
- return numa_add_memblk_to(nid, start, end, &numa_meminfo);
+ return numa_add_memblk_to(nid, start, end, hotpluggable, &numa_meminfo);
}

/* Initialize NODE_DATA for a node on the local memory */
@@ -628,7 +632,7 @@ static int __init dummy_numa_init(void)
0LLU, PFN_PHYS(max_pfn) - 1);

node_set(0, numa_nodes_parsed);
- numa_add_memblk(0, 0, PFN_PHYS(max_pfn));
+ numa_add_memblk(0, 0, PFN_PHYS(max_pfn), false);

return 0;
}
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index bb2fbcc..1ce4e6b 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -8,6 +8,7 @@ struct numa_memblk {
u64 start;
u64 end;
int nid;
+ bool hotpluggable;
};

struct numa_meminfo {
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 134a79d..90600ac 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -171,7 +171,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
goto out_err_bad_srat;
}

- if (numa_add_memblk(node, start, end) < 0)
+ if (numa_add_memblk(node, start, end, hotpluggable) < 0)
goto out_err_bad_srat;

node_set(node, numa_nodes_parsed);
--
1.7.1

2013-06-13 13:38:28

by Tang Chen

Subject: [Part2 PATCH v4 09/15] x86, numa, mem-hotplug: Mark nodes which the kernel resides in.

If all the memory ranges in SRAT are hotpluggable, we should not
arrange them all in ZONE_MOVABLE. Otherwise the kernel won't have
enough memory to boot.

This patch introduces a global variable, memblock_kernel_nodemask, to mark
all the nodes the kernel resides in. No matter whether they are
hotpluggable, we arrange them as un-hotpluggable.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 10 ++++++++++
include/linux/memblock.h | 1 +
mm/memblock.c | 20 ++++++++++++++++++++
3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 005a422..1242190 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -775,6 +775,16 @@ void __init early_initmem_init(void)
{
early_x86_numa_init();

+ /*
+ * Need to find out which nodes the kernel resides in, and arrange
+ * them as un-hotpluggable when parsing SRAT.
+ *
+ * This should be done after numa_init() is called because we
+ * synchronized the nid info in memblock.reserve[] to numa_meminfo
+ * in numa_init().
+ */
+ memblock_mark_kernel_nodes();
+
early_x86_numa_init_mapping();

load_cr3(swapper_pg_dir);
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f558590..5a52f37 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -63,6 +63,7 @@ int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_node(phys_addr_t base, phys_addr_t size, int nid);
void memblock_trim_memory(phys_addr_t align);
+void memblock_mark_kernel_nodes(void);

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index cc55ff0..bb53c54 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -35,6 +35,9 @@ struct memblock memblock __initdata_memblock = {
.current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};

+/* Mark which nodes the kernel resides in. */
+static nodemask_t memblock_kernel_nodemask __initdata_memblock;
+
int memblock_debug __initdata_memblock;
static int memblock_can_resize __initdata_memblock;
static int memblock_memory_in_slab __initdata_memblock = 0;
@@ -795,6 +798,23 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
memblock_merge_regions(type);
return 0;
}
+
+void __init_memblock memblock_mark_kernel_nodes()
+{
+ int i, nid;
+ struct memblock_type *reserved = &memblock.reserved;
+
+ for (i = 0; i < reserved->cnt; i++)
+ if (reserved->regions[i].flags == MEMBLK_FLAGS_DEFAULT) {
+ nid = memblock_get_region_node(&reserved->regions[i]);
+ node_set(nid, memblock_kernel_nodemask);
+ }
+}
+#else
+void __init_memblock memblock_mark_kernel_nodes()
+{
+ node_set(0, memblock_kernel_nodemask);
+}
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
--
1.7.1

2013-06-13 13:38:30

by Tang Chen

Subject: [Part2 PATCH v4 11/15] x86, numa, memblock: Introduce MEMBLK_LOCAL_NODE to mark and reserve node-life-cycle data.

Node-life-cycle data (data whose life cycle is the same as the node's)
allocated by memblock should be marked, so that when we free usable
memory to the buddy system, we can skip it.

This patch introduces a flag MEMBLK_LOCAL_NODE for memblock to reserve
node-life-cycle data. For now, it is only kernel direct mapping pagetable
pages, based on Yinghai's patch.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/init.c | 16 ++++++++++++----
include/linux/memblock.h | 2 ++
mm/memblock.c | 6 ++++++
3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 9ff71ff..63abb46 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -62,14 +62,22 @@ __ref void *alloc_low_pages(unsigned int num)
low_min_pfn_mapped << PAGE_SHIFT,
low_max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
- } else
+ if (!ret)
+ panic("alloc_low_page: can not alloc memory");
+
+ memblock_reserve(ret, PAGE_SIZE * num);
+ } else {
ret = memblock_find_in_range(
local_min_pfn_mapped << PAGE_SHIFT,
local_max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
- if (!ret)
- panic("alloc_low_page: can not alloc memory");
- memblock_reserve(ret, PAGE_SIZE * num);
+ if (!ret)
+ panic("alloc_low_page: can not alloc memory");
+
+ memblock_reserve_local_node(ret, PAGE_SIZE * num,
+ memory_add_physaddr_to_nid(ret));
+ }
+
pfn = ret >> PAGE_SHIFT;
} else {
pfn = pgt_buf_end;
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5a52f37..517c027 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -21,6 +21,7 @@

/* Definition of memblock flags. */
#define MEMBLK_FLAGS_DEFAULT 0x0 /* default flag */
+#define MEMBLK_LOCAL_NODE 0x1 /* node-life-cycle data */

struct memblock_region {
phys_addr_t base;
@@ -62,6 +63,7 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_node(phys_addr_t base, phys_addr_t size, int nid);
+int memblock_reserve_local_node(phys_addr_t base, phys_addr_t size, int nid);
void memblock_trim_memory(phys_addr_t align);
void memblock_mark_kernel_nodes(void);

diff --git a/mm/memblock.c b/mm/memblock.c
index bb53c54..e747bc6 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -597,6 +597,12 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
MEMBLK_FLAGS_DEFAULT);
}

+int __init_memblock memblock_reserve_local_node(phys_addr_t base,
+ phys_addr_t size, int nid)
+{
+ return memblock_reserve_region(base, size, nid, MEMBLK_LOCAL_NODE);
+}
+
/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
--
1.7.1

2013-06-13 13:38:55

by Tang Chen

Subject: [Part2 PATCH v4 07/15] x86, numa: Synchronize nid info in memblock.reserve with numa_meminfo.

Vasilis Liaskovitis found that before we parse SRAT and fill
numa_meminfo, the nids of all the regions in memblock.reserved[]
are MAX_NUMNODES. In this case, we cannot correctly mark the nodes
which the kernel resides in.

So after we parse SRAT and fill numa_meminfo, synchronize the
nid info to memblock.reserved[] immediately.

Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Vasilis Liaskovitis <[email protected]>
---
arch/x86/mm/numa.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 05e4443..005a422 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -595,6 +595,48 @@ static void __init numa_init_array(void)
}
}

+/*
+ * early_numa_find_range_nid - Find nid for a memory range at early time.
+ * @start: start address of the memory range (physaddr)
+ * @size: size of the memory range
+ *
+ * Return nid of the memory range, or MAX_NUMNODES if it failed to find the nid.
+ *
+ * NOTE: This function uses numa_meminfo to find the range's nid, so it should
+ * be called after numa_meminfo has been initialized.
+ */
+int __init early_numa_find_range_nid(u64 start, u64 size)
+{
+ int i;
+ struct numa_meminfo *mi = &numa_meminfo;
+
+ for (i = 0; i < mi->nr_blks; i++)
+ if (start >= mi->blk[i].start &&
+ (start + size - 1) <= mi->blk[i].end)
+ return mi->blk[i].nid;
+
+ return MAX_NUMNODES;
+}
+
+/*
+ * numa_sync_memblock_nid - Synchronize nid info in memblock.reserve[] to
+ * numa_meminfo.
+ *
+ * This function will synchronize the nid fields of regions in
+ * memblock.reserve[] to numa_meminfo.
+ */
+static void __init numa_sync_memblock_nid()
+{
+ int i, nid;
+ struct memblock_type *res = &memblock.reserved;
+
+ for (i = 0; i < res->cnt; i++) {
+ nid = early_numa_find_range_nid(res->regions[i].base,
+ res->regions[i].size);
+ memblock_set_region_node(&res->regions[i], nid);
+ }
+}
+
static int __init numa_init(int (*init_func)(void))
{
int i;
@@ -617,6 +659,13 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

+ /*
+ * Before fulfilling numa_meminfo, all regions allocated by memblock
+ * are reserved with nid MAX_NUMNODES because there is no numa node
+ * info at such an early time. Now, fill the correct nid into memblock.
+ */
+ numa_sync_memblock_nid();
+
return 0;
}

--
1.7.1

2013-06-13 13:28:00

by Tang Chen

Subject: [Part2 PATCH v4 05/15] x86, numa, acpi, memory-hotplug: Consider hotplug info when cleanup numa_meminfo.

Since we have introduced hotplug info into struct numa_meminfo, we need
to consider it when cleaning up numa_meminfo.

The original logic in numa_cleanup_meminfo() is:
Merge blocks on the same node, holes between which don't overlap with
memory on other nodes.

This patch modifies the numa_cleanup_meminfo() logic like this:
Merge blocks with the same hotpluggable type on the same node, holes
between which don't overlap with memory on other nodes. For example, an
adjacent hotpluggable block and non-hotpluggable block on the same node
are no longer merged.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 13 +++++++++----
1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index bf610f8..05e4443 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -293,18 +293,22 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
}

/*
- * Join together blocks on the same node, holes
- * between which don't overlap with memory on other
- * nodes.
+ * Join together blocks on the same node, with the same
+ * hotpluggable flags, holes between which don't overlap
+ * with memory on other nodes.
*/
if (bi->nid != bj->nid)
continue;
+ if (bi->hotpluggable != bj->hotpluggable)
+ continue;
+
start = min(bi->start, bj->start);
end = max(bi->end, bj->end);
for (k = 0; k < mi->nr_blks; k++) {
struct numa_memblk *bk = &mi->blk[k];

- if (bi->nid == bk->nid)
+ if (bi->nid == bk->nid &&
+ bi->hotpluggable == bk->hotpluggable)
continue;
if (start < bk->end && end > bk->start)
break;
@@ -324,6 +328,7 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
mi->blk[i].start = mi->blk[i].end = 0;
mi->blk[i].nid = NUMA_NO_NODE;
+ mi->blk[i].hotpluggable = false;
}

return 0;
--
1.7.1

2013-06-13 13:39:30

by Tang Chen

Subject: [Part2 PATCH v4 13/15] x86, memblock, mem-hotplug: Free hotpluggable memory reserved by memblock.

We reserved hotpluggable memory in memblock. When memory initialization
is done, we have to free it to the buddy system.

This patch frees the memory reserved by memblock with the
MEMBLK_HOTPLUGGABLE flag.

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 17 +++++++++++++++++
mm/nobootmem.c | 3 +++
3 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index ce315b2..d113175 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -66,6 +66,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_node(phys_addr_t base, phys_addr_t size, int nid);
int memblock_reserve_local_node(phys_addr_t base, phys_addr_t size, int nid);
int memblock_reserve_hotpluggable(phys_addr_t base, phys_addr_t size, int nid);
+void memblock_free_hotpluggable(void);
void memblock_trim_memory(phys_addr_t align);
void memblock_mark_kernel_nodes(void);
bool memblock_is_kernel_node(int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index 51f0264..9df0b5f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -568,6 +568,23 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
return __memblock_remove(&memblock.reserved, base, size);
}

+static void __init_memblock memblock_free_flags(unsigned long flags)
+{
+ int i;
+ struct memblock_type *reserved = &memblock.reserved;
+
+ for (i = 0; i < reserved->cnt; i++) {
+ if (reserved->regions[i].flags == flags)
+ memblock_remove_region(reserved, i);
+ }
+}
+
+void __init_memblock memblock_free_hotpluggable()
+{
+ memblock_dbg("memblock: free all hotpluggable memory");
+ memblock_free_flags(MEMBLK_HOTPLUGGABLE);
+}
+
static int __init_memblock memblock_reserve_region(phys_addr_t base,
phys_addr_t size,
int nid,
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bdd3fa2..dbfbcb9 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -165,6 +165,9 @@ unsigned long __init free_all_bootmem(void)
for_each_online_pgdat(pgdat)
reset_node_lowmem_managed_pages(pgdat);

+ /* Hotpluggable memory reserved by memblock should also be freed. */
+ memblock_free_hotpluggable();
+
/*
* We need to use MAX_NUMNODES instead of NODE_DATA(0)->node_id
* because in some case like Node0 doesn't have RAM installed
--
1.7.1

2013-06-13 13:39:27

by Tang Chen

Subject: [Part2 PATCH v4 08/15] x86, numa: Save nid when reserve memory into memblock.reserved[].

Since we introduced numa_sync_memblock_nid() to synchronize nid info
between memblock.reserved[] and numa_meminfo, once numa_meminfo has been
initialized we need to save the nid into memblock.reserved[] when we
reserve memory.

Reported-by: Vasilis Liaskovitis <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
include/linux/memblock.h | 1 +
include/linux/mm.h | 9 +++++++++
mm/memblock.c | 10 +++++++++-
3 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 93f3453..f558590 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -61,6 +61,7 @@ int memblock_add(phys_addr_t base, phys_addr_t size);
int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
+int memblock_reserve_node(phys_addr_t base, phys_addr_t size, int nid);
void memblock_trim_memory(phys_addr_t align);

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b827743..4a94b56 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1662,6 +1662,15 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
#endif

+#ifdef CONFIG_NUMA
+int __init early_numa_find_range_nid(u64 start, u64 size);
+#else
+static inline int __init early_numa_find_range_nid(u64 start, u64 size)
+{
+ return 0;
+}
+#endif
+
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/mm/memblock.c b/mm/memblock.c
index 9e871e9..cc55ff0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -580,9 +580,17 @@ static int __init_memblock memblock_reserve_region(phys_addr_t base,
return memblock_add_region(_rgn, base, size, nid, flags);
}

+int __init_memblock memblock_reserve_node(phys_addr_t base,
+ phys_addr_t size, int nid)
+{
+ return memblock_reserve_region(base, size, nid,
+ MEMBLK_FLAGS_DEFAULT);
+}
+
int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
{
- return memblock_reserve_region(base, size, MAX_NUMNODES,
+ int nid = early_numa_find_range_nid(base, size);
+ return memblock_reserve_region(base, size, nid,
MEMBLK_FLAGS_DEFAULT);
}

--
1.7.1

2013-06-13 13:27:57

by Tang Chen

Subject: [Part2 PATCH v4 15/15] doc, page_alloc, acpi, mem-hotplug: Add doc for movablecore=acpi boot option.

Since we modified the movablecore boot option to support
"movablecore=acpi", this patch adds documentation for it.

Signed-off-by: Tang Chen <[email protected]>
---
Documentation/kernel-parameters.txt | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 2fe6e76..615ca4b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1714,6 +1714,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.

+ movablecore=acpi [KNL,X86] This parameter will enable the
+ kernel to arrange ZONE_MOVABLE with the help of
+ Hot-Pluggable Field in SRAT. All the hotpluggable
+ memory will be arranged in ZONE_MOVABLE.
+ NOTE: Any node which the kernel resides in will
+ always be un-hotpluggable so that the kernel
+ will always have enough memory to boot.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>

--
1.7.1

2013-06-13 13:41:33

by Tang Chen

Subject: [Part2 PATCH v4 02/15] acpi: Print Hot-Pluggable Field in SRAT.

The Hot-Pluggable field in SRAT indicates whether the memory could be
hotplugged while the system is running. Printing it as well when parsing
SRAT will help users know which memory is hotpluggable.

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
arch/x86/mm/srat.c | 11 +++++++----
1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 443f9ef..134a79d 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -146,6 +146,7 @@ int __init
acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
{
u64 start, end;
+ u32 hotpluggable;
int node, pxm;

if (srat_disabled())
@@ -154,7 +155,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
goto out_err_bad_srat;
if ((ma->flags & ACPI_SRAT_MEM_ENABLED) == 0)
goto out_err;
- if ((ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) && !save_add_info())
+ hotpluggable = ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
+ if (hotpluggable && !save_add_info())
goto out_err;

start = ma->base_address;
@@ -174,9 +176,10 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)

node_set(node, numa_nodes_parsed);

- printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
- node, pxm,
- (unsigned long long) start, (unsigned long long) end - 1);
+ pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
+ node, pxm,
+ (unsigned long long) start, (unsigned long long) end - 1,
+ hotpluggable ? "Hot Pluggable" : "");

return 0;
out_err_bad_srat:
--
1.7.1

2013-06-13 13:42:10

by Tang Chen

Subject: [Part2 PATCH v4 01/15] x86: get pg_data_t's memory from other node

From: Yasuaki Ishimatsu <[email protected]>

If the system can create a movable node, in which all of the node's
memory is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate
memory for the node's pg_data_t from the node itself.
So, use memblock_alloc_try_nid() instead of memblock_alloc_nid(), to
retry on other nodes when the node-local allocation fails.

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
arch/x86/mm/numa.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 5adf803..bea597a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -211,10 +211,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
* Allocate node data. Try node-local memory and then any node.
* Never allocate in DMA zone.
*/
- nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
+ nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
- nd_size, nid);
+ pr_err("Cannot find %zu bytes in any node\n", nd_size);
return;
}
nd = __va(nd_pa);
--
1.7.1

2013-06-18 16:58:01

by Vasilis Liaskovitis

Subject: Re: [Part2 PATCH v4 08/15] x86, numa: Save nid when reserve memory into memblock.reserved[].

Hi Tang,

On Thu, Jun 13, 2013 at 09:03:32PM +0800, Tang Chen wrote:
> Since we introduced numa_sync_memblock_nid() to synchronize nid info
> between memblock.reserved[] and numa_meminfo, once numa_meminfo has been
> initialized we need to save the nid into memblock.reserved[] when we
> reserve memory.

thanks for the updated patches.
I tested linux-next next-20130706 +part1+part2+part3 in a VM, hot-plugging
memory and rebooting with movablecore=acpi. I think with this patch and 9/15
we get the correct nids and the expected behaviour for the "movablecore=acpi"
option.

However, patches 21,22 of part1 and all part3 patches increase kernel usage
of local node memory by putting pagetables local to those nodes. Are these
pagetable pages accounted in part2's memblock_kernel_nodemask? It looks like
part1 and part3 of these patchsets contradict or make the goal of part2 more
difficult to achieve. (I will send more comments for part3 separately).

thanks,

- Vasilis

2013-06-19 06:12:02

by Tang Chen

Subject: Re: [Part2 PATCH v4 08/15] x86, numa: Save nid when reserve memory into memblock.reserved[].

Hi Vasilis,

Thanks for reviewing. :)

On 06/19/2013 12:57 AM, Vasilis Liaskovitis wrote:
......
>
> However, patches 21,22 of part1 and all part3 patches increase kernel usage
> of local node memory by putting pagetables local to those nodes. Are these
> pagetable pages accounted in part2's memblock_kernel_nodemask? It looks like

No, they are not. What I wanted to achieve was that the local pagetable
pages are transparent to users. For a movable node (all memory is
hotpluggable), seen from the user level, all of the node's memory appears
not to be used by the kernel. Actually, pagetable pages are used by the
kernel, but users don't know it, and they don't care about it.

And also, memblock_kernel_nodemask is only used at very early boot time.
When the system is up, it is useless.

This is just my approach to this problem. It is not good enough, and we
can improve it.

> part1 and part3 of these patchsets contradict or make the goal of part2 more
> difficult to achieve. (I will send more comments for part3 separately).
>

I think allocating pagetables on the local node really makes things a
little more difficult than before. But I also think Yinghai's work is
reasonable because it helps to improve performance.

What I am thinking now is to allocate things like pagetable pages on the
local device. (Seems I mentioned this before.)

If a node has more than one memory device, and all the pagetable pages are
allocated on one device, then that device cannot be hot-removed unless all
the other memory devices are hot-removed first.

So I think allocating pagetable pages on the local device is more
reasonable. But as you said, this could be more complex.

Thanks. :)