Hello, here is the v6 version. Any comments are welcome!
The v6 version is based on Linus's tree (3.12-rc3).
HEAD is:
commit 15c03dd4859ab16f9212238f29dd315654aa94f6
Author: Linus Torvalds <[email protected]>
Date: Sun Sep 29 15:02:38 2013 -0700
Linux 3.12-rc3
[Problem]
Linux currently cannot migrate pages used by the kernel because
of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
When the pa changes, we cannot simply update the page table and
keep the va unmodified. So kernel pages are not migratable.
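To make the va/pa relation concrete, here is a minimal userspace sketch. The SKETCH_PAGE_OFFSET value and both helper names are assumptions for illustration only; the real offset depends on architecture and configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, for illustration only; the real PAGE_OFFSET varies. */
#define SKETCH_PAGE_OFFSET UINT64_C(0xffff880000000000)

/* Direct mapping: the virtual address is the physical one plus a constant. */
static uint64_t sketch_pa_to_va(uint64_t pa)
{
	return pa + SKETCH_PAGE_OFFSET;
}

/* The inverse only makes sense for addresses inside the direct mapping. */
static uint64_t sketch_va_to_pa(uint64_t va)
{
	assert(va >= SKETCH_PAGE_OFFSET);
	return va - SKETCH_PAGE_OFFSET;
}
```

Because the offset is fixed, moving the data to a different pa necessarily changes its va, so every pointer into the old page would become stale.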
There are also other issues that make kernel pages non-migratable.
For example, the physical address may be cached somewhere and used later.
It is not easy to update all of these caches.
When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.
Modifying the kernel direct mapping mechanism is too difficult. It may
also hurt kernel performance and stability. So we use the following
approach to do memory hotplug.
[What we are doing]
In Linux, memory in one NUMA node is divided into several zones. One of the
zones is ZONE_MOVABLE, which the kernel won't use.
In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.
To do this, we need ACPI's help.
In ACPI, SRAT (System Resource Affinity Table) contains NUMA info. The memory
affinity entries in SRAT record every memory range in the system, together
with flags specifying whether the memory range is hotpluggable.
(Please refer to ACPI spec 5.0 5.2.16)
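As a rough sketch of how such a flag could be tested: the struct below is a hypothetical, simplified view of one memory affinity entry, not the real ACPI table layout. The bit positions (bit 0 = enabled, bit 1 = hot pluggable) follow my reading of the spec's flags field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits as described for the SRAT memory affinity flags field. */
#define SKETCH_SRAT_MEM_ENABLED       (UINT32_C(1) << 0)
#define SKETCH_SRAT_MEM_HOT_PLUGGABLE (UINT32_C(1) << 1)

/* Hypothetical, simplified entry; field names are for illustration. */
struct sketch_mem_affinity {
	uint32_t proximity_domain;
	uint64_t base;
	uint64_t length;
	uint32_t flags;
};

/* A range counts as hotpluggable only if it is enabled and flagged so. */
static bool sketch_is_hotpluggable(const struct sketch_mem_affinity *ma)
{
	assert(ma != NULL);
	return (ma->flags & SKETCH_SRAT_MEM_ENABLED) &&
	       (ma->flags & SKETCH_SRAT_MEM_HOT_PLUGGABLE);
}
```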
With the help of SRAT, we have to do the following two things to achieve our
goal:
1. When doing memory hot-add, allow users to arrange hotpluggable memory as
ZONE_MOVABLE.
(This has been done by the MOVABLE_NODE functionality in Linux.)
2. When the system is booting, prevent the bootmem allocator from allocating
hotpluggable memory for the kernel before the memory initialization
finishes.
Problem 2 is the key problem we are going to solve. But before solving it,
we need some preparation. Please see below.
[Preparation]
The bootloader has to load the kernel image into memory, and this memory must
be unhotpluggable. We cannot avoid this. So in a memory hotplug system,
we can assume any node the kernel resides in is not hotpluggable.
Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But
memblock has already started to work. In the current kernel, memblock allocates
the following memory before SRAT is parsed:
setup_arch()
|->memblock_x86_fill() /* memblock is ready */
|......
|->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
|->reserve_real_mode() /* allocate memory under 1MB */
|->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
|->dma_contiguous_reserve() /* specified by user, should be low */
|->setup_log_buf() /* specified by user, several mega bytes */
|->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
|->acpi_initrd_override() /* several mega bytes */
|->reserve_crashkernel() /* could be large, should reorder */
|......
|->initmem_init() /* Parse SRAT */
According to Tejun's advice, before SRAT is parsed, we should try our best to
allocate memory near the kernel image. Since the whole node the kernel resides
in won't be hotpluggable, and for a modern server, a node may have at least 16GB
of memory, allocating several megabytes of memory around the kernel image won't
cross into hotpluggable memory.
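The "about 2MB to map 1GB" figure for init_mem_mapping() can be checked with simple arithmetic. The helper below is only a sketch: it assumes 8-byte page table entries and counts last-level entries only, ignoring upper-level tables and large pages.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of last-level page table entries needed to map `bytes` of memory
 * with pages of `page_size` bytes, assuming 8-byte entries.
 */
static uint64_t sketch_pte_bytes(uint64_t bytes, uint64_t page_size)
{
	assert(page_size != 0 && bytes % page_size == 0);
	return (bytes / page_size) * 8;
}
```

With 4KB pages, 1GB needs 262144 entries, which is 2MB of page tables. That is small compared with a 16GB node.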
[About this patch-set]
So this patch-set is the preparation for problem 2, which we want to solve. It
does the following:
1. Make memblock able to allocate memory bottom up.
1) Keep all the memblock APIs' prototypes unmodified.
2) When the direction is bottom up, keep the start address greater than the
end of the kernel image.
2. Improve init_mem_mapping() to support allocating page tables in bottom up direction.
3. Introduce the "movable_node" boot option to enable and disable this functionality.
Change log v5 -> v6:
1. Add Tejun's and Toshi's acks to several patches.
2. Change movablenode to movable_node boot option and update the description
for movable_node and CONFIG_MOVABLE_NODE. Thanks Ingo!
3. Fix the __pa_symbol() issue pointed out by Andrew Morton.
4. Update some functions' comments and names.
Change log v4 -> v5:
1. Change memblock.current_direction to a boolean memblock.bottom_up. And remove
the direction enum.
2. Update and add some comments to explain things more clearly.
3. Misc fixes, such as removing unnecessary #ifdef
Change log v3 -> v4:
1. Use bottom-up/top-down to unify things. Thanks tj.
2. Factor out the current top-down implementation and then introduce bottom-up mode,
not mixing them in one patch. Thanks tj.
3. Change function naming: memblock_direction_bottom_up -> memblock_bottom_up
4. Use memblock_set_bottom_up to replace memblock_set_current_direction, which makes
the code simpler. Thanks tj.
5. Define two implementations of the functions memblock_bottom_up and memblock_set_bottom_up
in order not to use #ifdef in the boot code. Thanks tj.
6. Add comments to explain why we retry top-down allocation when bottom_up allocation
failed. Thanks tj and toshi!
Change log v2 -> v3:
1. According to Toshi's suggestion, move the direction checking logic into memblock.
And simplify the code more.
Change log v1 -> v2:
1. According to tj's suggestion, implemented a new function memblock_alloc_bottom_up()
to allocate memory from bottom upwards, which can simplify the code.
Tang Chen (6):
memblock: Factor out of top-down allocation
memblock: Introduce bottom-up allocation mode
x86/mm: Factor out of top-down direct mapping setup
x86/mem-hotplug: Support initialize page tables in bottom-up
x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is
parsed.
mem-hotplug: Introduce movable_node boot option
Documentation/kernel-parameters.txt | 3 +
arch/x86/kernel/setup.c | 9 ++-
arch/x86/mm/init.c | 127 ++++++++++++++++++++++++++++------
arch/x86/mm/numa.c | 11 +++
include/linux/memblock.h | 24 +++++++
mm/Kconfig | 17 +++--
mm/memblock.c | 130 +++++++++++++++++++++++++++++++----
mm/memory_hotplug.c | 31 ++++++++
8 files changed, 311 insertions(+), 41 deletions(-)
From: Tang Chen <[email protected]>
This patch creates a new function __memblock_find_range_top_down
to factor the top-down allocation out of memblock_find_in_range_node.
This is a preparation because we will introduce a new bottom-up
allocation mode in the following patch.
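For readers unfamiliar with the search, a top-down lookup can be sketched in userspace roughly as follows. The sketch_range array stands in for memblock's free ranges and is an assumption, as are all sketch_* names; the placement rule (highest aligned address that still fits) is my reading of the top-down search, not a copy of the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_range { uint64_t start, end; };	/* free range [start, end) */

static uint64_t sketch_clamp(uint64_t v, uint64_t lo, uint64_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Walk the free ranges (sorted by address) from highest to lowest and
 * place the block as high as possible in the first range that fits.
 * Returns 0 on failure.
 */
static uint64_t sketch_find_top_down(const struct sketch_range *ranges, size_t n,
				     uint64_t start, uint64_t end,
				     uint64_t size, uint64_t align)
{
	assert(size != 0 && align != 0 && (align & (align - 1)) == 0);

	for (size_t i = n; i-- > 0; ) {
		uint64_t this_start = sketch_clamp(ranges[i].start, start, end);
		uint64_t this_end = sketch_clamp(ranges[i].end, start, end);
		uint64_t cand;

		if (this_end < size)
			continue;

		cand = (this_end - size) & ~(align - 1);	/* round down */
		if (cand >= this_start)
			return cand;
	}

	return 0;
}
```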
Acked-by: Tejun Heo <[email protected]>
Acked-by: Toshi Kani <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
mm/memblock.c | 47 ++++++++++++++++++++++++++++++++++-------------
1 files changed, 34 insertions(+), 13 deletions(-)
diff --git a/mm/memblock.c b/mm/memblock.c
index 0ac412a..accff10 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -83,33 +83,25 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
}
/**
- * memblock_find_in_range_node - find free area in given range and node
+ * __memblock_find_range_top_down - find free area utility, in top-down
* @start: start of candidate range
* @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
* @size: size of free area to find
* @align: alignment of free area to find
* @nid: nid of the free area to find, %MAX_NUMNODES for any node
*
- * Find @size free area aligned to @align in the specified range and node.
+ * Utility called from memblock_find_in_range_node(), find free area top-down.
*
* RETURNS:
* Found address on success, %0 on failure.
*/
-phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
- phys_addr_t end, phys_addr_t size,
- phys_addr_t align, int nid)
+static phys_addr_t __init_memblock
+__memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
+ phys_addr_t size, phys_addr_t align, int nid)
{
phys_addr_t this_start, this_end, cand;
u64 i;
- /* pump up @end */
- if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
- end = memblock.current_limit;
-
- /* avoid allocating the first page */
- start = max_t(phys_addr_t, start, PAGE_SIZE);
- end = max(start, end);
-
for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
this_start = clamp(this_start, start, end);
this_end = clamp(this_end, start, end);
@@ -121,10 +113,39 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
if (cand >= this_start)
return cand;
}
+
return 0;
}
/**
+ * memblock_find_in_range_node - find free area in given range and node
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Find @size free area aligned to @align in the specified range and node.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
+ phys_addr_t end, phys_addr_t size,
+ phys_addr_t align, int nid)
+{
+ /* pump up @end */
+ if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+ end = memblock.current_limit;
+
+ /* avoid allocating the first page */
+ start = max_t(phys_addr_t, start, PAGE_SIZE);
+ end = max(start, end);
+
+ return __memblock_find_range_top_down(start, end, size, align, nid);
+}
+
+/**
* memblock_find_in_range - find free area in given range
* @start: start of candidate range
* @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
--
1.7.1
From: Tang Chen <[email protected]>
The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
pages cannot be hot-removed. So we cannot allocate hotpluggable memory for
the kernel.
ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
But before SRAT is parsed, memblock has already started to allocate memory
for the kernel. So we need to prevent memblock from doing this.
In a memory hotplug system, any numa node the kernel resides in should
be unhotpluggable. And for a modern server, each node could have at least
16GB memory. So memory around the kernel image is highly likely unhotpluggable.
So the basic idea is: allocate memory from the end of the kernel image
toward higher memory. Since not much memory is allocated before SRAT is
parsed, it is highly likely to be in the same node as the kernel image.
The current memblock can only allocate memory top-down. So this patch introduces
a new bottom-up allocation mode. Later, when we use this allocation
direction to allocate memory, we will limit the start address to above
the kernel.
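The idea can be sketched in userspace as follows. The sketch_range array stands in for memblock's free ranges, and kernel_end is passed in as a plain number; all sketch_* names are assumptions for illustration. The fallback to top-down on failure that the patch performs is left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_range { uint64_t start, end; };	/* free range [start, end) */

static uint64_t sketch_clamp(uint64_t v, uint64_t lo, uint64_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Walk free ranges from lowest to highest, take the first aligned fit. */
static uint64_t sketch_find_bottom_up(const struct sketch_range *ranges, size_t n,
				      uint64_t start, uint64_t end,
				      uint64_t size, uint64_t align)
{
	assert(size != 0 && align != 0 && (align & (align - 1)) == 0);

	for (size_t i = 0; i < n; i++) {
		uint64_t this_start = sketch_clamp(ranges[i].start, start, end);
		uint64_t this_end = sketch_clamp(ranges[i].end, start, end);
		uint64_t cand = (this_start + align - 1) & ~(align - 1);

		if (cand < this_end && this_end - cand >= size)
			return cand;
	}

	return 0;
}

/* Only search above the kernel image; return 0 if nothing fits there. */
static uint64_t sketch_find_above_kernel(const struct sketch_range *ranges, size_t n,
					 uint64_t start, uint64_t end,
					 uint64_t size, uint64_t align,
					 uint64_t kernel_end)
{
	if (end <= kernel_end)
		return 0;
	if (start < kernel_end)
		start = kernel_end;

	return sketch_find_bottom_up(ranges, n, start, end, size, align);
}
```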
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 24 +++++++++++++
mm/memblock.c | 87 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 108 insertions(+), 3 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 31e95ac..77c60e5 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -35,6 +35,7 @@ struct memblock_type {
};
struct memblock {
+ bool bottom_up; /* is bottom up direction? */
phys_addr_t current_limit;
struct memblock_type memory;
struct memblock_type reserved;
@@ -148,6 +149,29 @@ phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid)
phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);
+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * Set the allocation direction to bottom-up or top-down.
+ */
+static inline void memblock_set_bottom_up(bool enable)
+{
+ memblock.bottom_up = enable;
+}
+
+/*
+ * Check if the allocation direction is bottom-up or not.
+ * If this returns true, memblock will allocate memory
+ * in bottom-up direction.
+ */
+static inline bool memblock_bottom_up(void)
+{
+ return memblock.bottom_up;
+}
+#else
+static inline void memblock_set_bottom_up(bool enable) {}
+static inline bool memblock_bottom_up(void) { return false; }
+#endif
+
/* Flags for memblock_alloc_base() amd __memblock_alloc_base() */
#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
#define MEMBLOCK_ALLOC_ACCESSIBLE 0
diff --git a/mm/memblock.c b/mm/memblock.c
index accff10..04f20f4 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -20,6 +20,8 @@
#include <linux/seq_file.h>
#include <linux/memblock.h>
+#include <asm-generic/sections.h>
+
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
@@ -32,6 +34,7 @@ struct memblock memblock __initdata_memblock = {
.reserved.cnt = 1, /* empty dummy entry */
.reserved.max = INIT_MEMBLOCK_REGIONS,
+ .bottom_up = false,
.current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};
@@ -82,6 +85,38 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
return (i < type->cnt) ? i : -1;
}
+/*
+ * __memblock_find_range_bottom_up - find free area utility in bottom-up
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(), find free area bottom-up.
+ *
+ * RETURNS:
+ * Found address on success, 0 on failure.
+ */
+static phys_addr_t __init_memblock
+__memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
+ phys_addr_t size, phys_addr_t align, int nid)
+{
+ phys_addr_t this_start, this_end, cand;
+ u64 i;
+
+ for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
+ this_start = clamp(this_start, start, end);
+ this_end = clamp(this_end, start, end);
+
+ cand = round_up(this_start, align);
+ if (cand < this_end && this_end - cand >= size)
+ return cand;
+ }
+
+ return 0;
+}
+
/**
* __memblock_find_range_top_down - find free area utility, in top-down
* @start: start of candidate range
@@ -93,7 +128,7 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
* Utility called from memblock_find_in_range_node(), find free area top-down.
*
* RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
*/
static phys_addr_t __init_memblock
__memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
@@ -127,13 +162,24 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
*
* Find @size free area aligned to @align in the specified range and node.
*
+ * When allocation direction is bottom-up, the @start should be greater
+ * than the end of the kernel image. Otherwise, it will be trimmed. The
+ * reason is that we want the bottom-up allocation just near the kernel
+ * image so it is highly likely that the allocated memory and the kernel
+ * will reside in the same node.
+ *
+ * If bottom-up allocation failed, will try to allocate memory top-down.
+ *
* RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
*/
phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
phys_addr_t end, phys_addr_t size,
phys_addr_t align, int nid)
{
+ phys_addr_t ret;
+ phys_addr_t kernel_end;
+
/* pump up @end */
if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
end = memblock.current_limit;
@@ -141,6 +187,41 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
/* avoid allocating the first page */
start = max_t(phys_addr_t, start, PAGE_SIZE);
end = max(start, end);
+#ifdef CONFIG_X86
+ kernel_end = __pa_symbol(_end);
+#else
+ kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
+#endif
+
+ /*
+ * try bottom-up allocation only when bottom-up mode
+ * is set and @end is above the kernel image.
+ */
+ if (memblock_bottom_up() && end > kernel_end) {
+ phys_addr_t bottom_up_start;
+
+ /* make sure we will allocate above the kernel */
+ bottom_up_start = max(start, kernel_end);
+
+ /* ok, try bottom-up allocation first */
+ ret = __memblock_find_range_bottom_up(bottom_up_start, end,
+ size, align, nid);
+ if (ret)
+ return ret;
+
+ /*
+ * we always limit bottom-up allocation above the kernel,
+ * but top-down allocation doesn't have the limit, so
+ * retrying top-down allocation may succeed when bottom-up
+ * allocation failed.
+ *
+ * bottom-up allocation is expected to fail very rarely,
+ * so we use WARN_ONCE() here to see the stack trace if
+ * a failure happens.
+ */
+ WARN_ONCE(1, "memblock: bottom-up allocation failed, "
+ "memory hotunplug may be affected\n");
+ }
return __memblock_find_range_top_down(start, end, size, align, nid);
}
@@ -155,7 +236,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
* Find @size free area aligned to @align in the specified range.
*
* RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
*/
phys_addr_t __init_memblock memblock_find_in_range(phys_addr_t start,
phys_addr_t end, phys_addr_t size,
--
1.7.1
From: Tang Chen <[email protected]>
This patch creates a new function memory_map_top_down to
factor out the top-down direct memory mapping pagetable
setup. This is also a preparation for the following patch,
which will introduce the bottom-up memory mapping. That is,
we will put the two ways of pagetable setup into separate
functions, and choose which one to use in init_mem_mapping,
which makes the code clearer.
Acked-by: Tejun Heo <[email protected]>
Acked-by: Toshi Kani <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
arch/x86/mm/init.c | 60 ++++++++++++++++++++++++++++++++++-----------------
1 files changed, 40 insertions(+), 20 deletions(-)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 04664cd..ea2be79 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -401,27 +401,28 @@ static unsigned long __init init_range_memory_mapping(
/* (PUD_SHIFT-PMD_SHIFT)/2 */
#define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+
+/**
+ * memory_map_top_down - Map [map_start, map_end) top down
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in top-down. That said, the page tables
+ * will be allocated at the end of the memory, and we map the
+ * memory in top-down.
+ */
+static void __init memory_map_top_down(unsigned long map_start,
+ unsigned long map_end)
{
- unsigned long end, real_end, start, last_start;
+ unsigned long real_end, start, last_start;
unsigned long step_size;
unsigned long addr;
unsigned long mapped_ram_size = 0;
unsigned long new_mapped_ram_size;
- probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
- end = max_pfn << PAGE_SHIFT;
-#else
- end = max_low_pfn << PAGE_SHIFT;
-#endif
-
- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
-
/* xen has big range in reserved near end of ram, skip it at first.*/
- addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+ addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;
/* step_size need to be small so pgt_buf from BRK could cover it */
@@ -436,13 +437,13 @@ void __init init_mem_mapping(void)
* end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
* for page table.
*/
- while (last_start > ISA_END_ADDRESS) {
+ while (last_start > map_start) {
if (last_start > step_size) {
start = round_down(last_start - 1, step_size);
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
+ if (start < map_start)
+ start = map_start;
} else
- start = ISA_END_ADDRESS;
+ start = map_start;
new_mapped_ram_size = init_range_memory_mapping(start,
last_start);
last_start = start;
@@ -453,8 +454,27 @@ void __init init_mem_mapping(void)
mapped_ram_size += new_mapped_ram_size;
}
- if (real_end < end)
- init_range_memory_mapping(real_end, end);
+ if (real_end < map_end)
+ init_range_memory_mapping(real_end, map_end);
+}
+
+void __init init_mem_mapping(void)
+{
+ unsigned long end;
+
+ probe_page_size_mask();
+
+#ifdef CONFIG_X86_64
+ end = max_pfn << PAGE_SHIFT;
+#else
+ end = max_low_pfn << PAGE_SHIFT;
+#endif
+
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+
+ /* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
+ memory_map_top_down(ISA_END_ADDRESS, end);
#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
--
1.7.1
From: Tang Chen <[email protected]>
The Linux kernel cannot migrate pages used by the kernel. As a
result, kernel pages cannot be hot-removed. So we cannot allocate
hotpluggable memory for the kernel.
In a memory hotplug system, any numa node the kernel resides in
should be unhotpluggable. And for a modern server, each node could
have at least 16GB memory. So memory around the kernel image is
highly likely unhotpluggable.
ACPI SRAT (System Resource Affinity Table) contains the memory
hotplug info. But before SRAT is parsed, memblock has already
started to allocate memory for the kernel. So we need to prevent
memblock from doing this.
Direct memory mapping page table setup is one such case. init_mem_mapping()
is called before SRAT is parsed. To prevent page tables from being allocated
within hotpluggable memory, we will use the bottom-up direction to allocate
page tables from the end of the kernel image toward higher memory.
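The chunking done by the bottom-up mapping loop can be simulated in userspace. In this sketch every chunk is assumed to be fully RAM, so the mapped size equals the chunk size. The constants (2MB initial step, shift of 5) and all sketch_* names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PMD_SIZE		(UINT64_C(1) << 21)	/* assumed: 2MB */
#define SKETCH_STEP_SIZE_SHIFT	5

/*
 * Record the upper bound of each chunk mapped while walking from
 * map_start to map_end. Returns the number of chunks; at most
 * max_bounds of them are stored in bounds[].
 */
static size_t sketch_bottom_up_steps(uint64_t map_start, uint64_t map_end,
				     uint64_t *bounds, size_t max_bounds)
{
	uint64_t start = map_start, next, new_mapped, mapped = 0;
	uint64_t step_size = SKETCH_PMD_SIZE;
	size_t n = 0;

	/* Keeps step_size from overflowing in this sketch. */
	assert(map_end <= (UINT64_C(1) << 52));

	while (start < map_end) {
		if (map_end - start > step_size) {
			/* round_up(start + 1, step_size) */
			next = (start + step_size) & ~(step_size - 1);
			if (next > map_end)
				next = map_end;
		} else {
			next = map_end;
		}

		new_mapped = next - start;	/* assumed: chunk is all RAM */
		if (n < max_bounds)
			bounds[n] = next;
		n++;
		start = next;

		/* Grow the step once more memory is mapped than before. */
		if (new_mapped > mapped)
			step_size <<= SKETCH_STEP_SIZE_SHIFT;
		mapped += new_mapped;
	}

	return n;
}
```

Starting at 16MB with a 4GB end, this yields chunks ending at 18MB, 64MB, 2GB and 4GB: the first chunk is small so the initial page tables fit in the early buffer, and later chunks grow quickly.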
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
arch/x86/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 69 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ea2be79..5cea9ed 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -458,6 +458,51 @@ static void __init memory_map_top_down(unsigned long map_start,
init_range_memory_mapping(real_end, map_end);
}
+/**
+ * memory_map_bottom_up - Map [map_start, map_end) bottom up
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in bottom-up. Since we have limited the
+ * bottom-up allocation above the kernel, the page tables will
+ * be allocated just above the kernel and we map the memory
+ * in [map_start, map_end) in bottom-up.
+ */
+static void __init memory_map_bottom_up(unsigned long map_start,
+ unsigned long map_end)
+{
+ unsigned long next, new_mapped_ram_size, start;
+ unsigned long mapped_ram_size = 0;
+ /* step_size need to be small so pgt_buf from BRK could cover it */
+ unsigned long step_size = PMD_SIZE;
+
+ start = map_start;
+ min_pfn_mapped = start >> PAGE_SHIFT;
+
+ /*
+ * We start from the bottom (@map_start) and go to the top (@map_end).
+ * The memblock_find_in_range() gets us a block of RAM from the
+ * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
+ * for page table.
+ */
+ while (start < map_end) {
+ if (map_end - start > step_size) {
+ next = round_up(start + 1, step_size);
+ if (next > map_end)
+ next = map_end;
+ } else
+ next = map_end;
+
+ new_mapped_ram_size = init_range_memory_mapping(start, next);
+ start = next;
+
+ if (new_mapped_ram_size > mapped_ram_size)
+ step_size <<= STEP_SIZE_SHIFT;
+ mapped_ram_size += new_mapped_ram_size;
+ }
+}
+
void __init init_mem_mapping(void)
{
unsigned long end;
@@ -473,8 +518,30 @@ void __init init_mem_mapping(void)
/* the ISA range is always mapped regardless of memory holes */
init_memory_mapping(0, ISA_END_ADDRESS);
- /* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
- memory_map_top_down(ISA_END_ADDRESS, end);
+ /*
+ * If the allocation is in bottom-up direction, we setup direct mapping
+ * in bottom-up, otherwise we setup direct mapping in top-down.
+ */
+ if (memblock_bottom_up()) {
+ unsigned long kernel_end;
+
+#ifdef CONFIG_X86
+ kernel_end = __pa_symbol(_end);
+#else
+ kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
+#endif
+ /*
+ * we need two separate calls here. This is because we want to
+ * allocate page tables above the kernel. So we first map
+ * [kernel_end, end) to make memory above the kernel be mapped
+ * as soon as possible. And then use page tables allocated above
+ * the kernel to map [ISA_END_ADDRESS, kernel_end).
+ */
+ memory_map_bottom_up(kernel_end, end);
+ memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
+ } else {
+ memory_map_top_down(ISA_END_ADDRESS, end);
+ }
#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
--
1.7.1
From: Tang Chen <[email protected]>
Memory reserved for crashkernel could be large. So we should not allocate
this memory bottom up from the end of the kernel image.
When SRAT is parsed, we will be able to know which memory is hotpluggable,
and we can avoid allocating this memory for the kernel. So move
reserve_crashkernel() after SRAT is parsed.
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
arch/x86/kernel/setup.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..b5e350d 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,8 +1120,6 @@ void __init setup_arch(char **cmdline_p)
acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
#endif
- reserve_crashkernel();
-
vsmp_init();
io_delay_init();
@@ -1134,6 +1132,13 @@ void __init setup_arch(char **cmdline_p)
early_acpi_boot_init();
initmem_init();
+
+ /*
+ * Reserve memory for crash kernel after SRAT is parsed so that it
+ * won't consume hotpluggable memory.
+ */
+ reserve_crashkernel();
+
memblock_find_dma_reserve();
#ifdef CONFIG_KVM_GUEST
--
1.7.1
From: Tang Chen <[email protected]>
The Hot Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will degrade NUMA
performance, and other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce the movable_node boot option to allow users to
choose not to consume hotpluggable memory at early boot time, so that
later we can set it as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained
in the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
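The two steps above can be sketched as a small state machine in userspace. This is not how the kernel parses its command line (the patch uses early_param()); the token matcher and all sketch_* names are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool sketch_bottom_up;	/* stands in for memblock.bottom_up */

/* True when `opt` appears as a whole space-separated token in `cmdline`. */
static bool sketch_cmdline_has(const char *cmdline, const char *opt)
{
	size_t len = strlen(opt);
	const char *p = cmdline;

	assert(len != 0);

	while ((p = strstr(p, opt)) != NULL) {
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return true;
		p += len;
	}

	return false;
}

/* Step 1: after memblock is ready, pick the direction from the option. */
static void sketch_early_boot(const char *cmdline)
{
	sketch_bottom_up = sketch_cmdline_has(cmdline, "movable_node");
}

/* Step 2: after SRAT is parsed, go back to the default top-down. */
static void sketch_after_srat_parsed(void)
{
	sketch_bottom_up = false;
}
```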
Suggested-by: Kamezawa Hiroyuki <[email protected]>
Suggested-by: Ingo Molnar <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
Documentation/kernel-parameters.txt | 3 +++
arch/x86/mm/numa.c | 11 +++++++++++
mm/Kconfig | 17 ++++++++++++-----
mm/memory_hotplug.c | 31 +++++++++++++++++++++++++++++++
4 files changed, 57 insertions(+), 5 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 539a236..13201d4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.
+ movable_node [KNL,X86] Boot-time switch to enable the effects
+ of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..24aec58 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -567,6 +567,17 @@ static int __init numa_init(int (*init_func)(void))
ret = init_func();
if (ret < 0)
return ret;
+
+ /*
+ * We reset memblock back to the top-down direction
+ * here because if we configured ACPI_NUMA, we have
+ * parsed SRAT in init_func(). It is ok to have the
+ * reset here even if we didn't configure ACPI_NUMA
+ * or acpi numa init fails and falls back to dummy
+ * numa init.
+ */
+ memblock_set_bottom_up(false);
+
ret = numa_cleanup_meminfo(&numa_meminfo);
if (ret < 0)
return ret;
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a..0db1cc6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,11 +153,18 @@ config MOVABLE_NODE
help
Allow a node to have only movable memory. Pages used by the kernel,
such as direct mapping pages cannot be migrated. So the corresponding
- memory device cannot be hotplugged. This option allows users to
- online all the memory of a node as movable memory so that the whole
- node can be hotplugged. Users who don't use the memory hotplug
- feature are fine with this option on since they don't online memory
- as movable.
+ memory device cannot be hotplugged. This option allows the following
+ two things:
+ - When the system is booting, node full of hotpluggable memory can
+ be arranged to have only movable memory so that the whole node can
+ be hotplugged. (need movable_node boot option specified).
+ - After the system is up, the option allows users to online all the
+ memory of a node as movable memory so that the whole node can be
+ hotplugged.
+
+ Users who don't use the memory hotplug feature are fine with this
+ option on since they don't specify movable_node boot option or they
+ don't online memory as movable.
Say Y here if you want to hotplug a whole node.
Say N here if you want kernel to use memory on all nodes evenly.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ed85fe3..6874c31 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -31,6 +31,7 @@
#include <linux/firmware-map.h>
#include <linux/stop_machine.h>
#include <linux/hugetlb.h>
+#include <linux/memblock.h>
#include <asm/tlbflush.h>
@@ -1412,6 +1413,36 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
}
#endif /* CONFIG_MOVABLE_NODE */
+static int __init cmdline_parse_movable_node(char *p)
+{
+#ifdef CONFIG_MOVABLE_NODE
+ /*
+ * Memory used by the kernel cannot be hot-removed because Linux
+ * cannot migrate the kernel pages. When memory hotplug is
+ * enabled, we should prevent memblock from allocating memory
+ * for the kernel.
+ *
+ * ACPI SRAT records all hotpluggable memory ranges. But before
+ * SRAT is parsed, we don't know about it.
+ *
+ * The kernel image is loaded into memory at very early time. We
+ * cannot prevent this anyway. So on NUMA system, we set any
+ * node the kernel resides in as un-hotpluggable.
+ *
+ * Since on modern servers, one node could have double-digit
+ * gigabytes memory, we can assume the memory around the kernel
+ * image is also un-hotpluggable. So before SRAT is parsed, just
+ * allocate memory near the kernel image to try the best to keep
+ * the kernel away from hotpluggable memory.
+ */
+ memblock_set_bottom_up(true);
+#else
+ pr_warn("movable_node option not supported\n");
+#endif
+ return 0;
+}
+early_param("movable_node", cmdline_parse_movable_node);
+
/* check which state of node_states will be changed when offline memory */
static void node_states_check_changes_offline(unsigned long nr_pages,
struct zone *zone, struct memory_notify *arg)
--
1.7.1
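
The effect of switching the allocation direction can be pictured with a small userspace model. This is only a sketch under stated assumptions, not the memblock implementation: struct range and find_range() are hypothetical names invented here, the free ranges are assumed to be sorted by ascending address, and alignment is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free range [start, end); not a kernel structure. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Find room for 'size' bytes at or above 'min_addr' in 'avail', which
 * must be sorted by ascending address. With bottom_up set, the lowest
 * fitting address is returned; otherwise the highest. Returns 0 when
 * nothing fits.
 */
uint64_t find_range(const struct range *avail, size_t n, uint64_t size,
		    uint64_t min_addr, int bottom_up)
{
	assert(size != 0);

	for (size_t k = 0; k < n; k++) {
		/* Walk the ranges upwards or downwards. */
		size_t i = bottom_up ? k : n - 1 - k;
		uint64_t lo = avail[i].start > min_addr ?
			      avail[i].start : min_addr;

		if (avail[i].end < lo || avail[i].end - lo < size)
			continue;

		return bottom_up ? lo : avail[i].end - size;
	}
	return 0;
}
```

With free ranges [0x100000, 0x40000000) and [0x100000000, 0x200000000), and a made-up kernel image end of 0x2000000 passed as min_addr, a 4 KiB request lands at 0x2000000 in bottom-up mode and at 0x1FFFFF000 in top-down mode. The bottom-up result stays right after the kernel image, which is the behavior the movable_node option relies on before SRAT is parsed.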
Hello tejun
CC: Peter
On 10/07/2013 08:00 AM, H. Peter Anvin wrote:
> On 10/03/2013 07:00 PM, Zhang Yanfei wrote:
>> From: Tang Chen <[email protected]>
>>
>> The Linux kernel cannot migrate pages used by the kernel. As a
>> result, kernel pages cannot be hot-removed. So we cannot allocate
>> hotpluggable memory for the kernel.
>>
>> In a memory hotplug system, any numa node the kernel resides in
>> should be unhotpluggable. And for a modern server, each node could
>> have at least 16GB memory. So memory around the kernel image is
>> highly likely unhotpluggable.
>>
>> ACPI SRAT (System Resource Affinity Table) contains the memory
>> hotplug info. But before SRAT is parsed, memblock has already
>> started to allocate memory for the kernel. So we need to prevent
>> memblock from doing this.
>>
>> So direct memory mapping page tables setup is the case. init_mem_mapping()
>> is called before SRAT is parsed. To prevent page tables being allocated
>> within hotpluggable memory, we will use bottom-up direction to allocate
>> page tables from the end of kernel image to the higher memory.
>>
>> Acked-by: Tejun Heo <[email protected]>
>> Signed-off-by: Tang Chen <[email protected]>
>> Signed-off-by: Zhang Yanfei <[email protected]>
>
> I'm still seriously concerned about this. This unconditionally
> introduces new behavior which may very well break some classes of
> systems -- the whole point of creating the page tables top down is
> because the kernel tends to be allocated in lower memory, which is also
> the memory that some devices need for DMA.
>
After thinking about it for a while, the issue Peter pointed out does seem
to be real. Looking back at your suggestion to allocate close to the
kernel image:
> so if we allocate memory close to the kernel image,
> it's likely that we don't contaminate hotpluggable node. We're
> talking about few megs at most right after the kernel image. I
> can't see how that would make any noticeable difference.
You were assuming the allocations amount to only a few megabytes. But on
machines with a lot of memory the page tables can be large, so they would
consume the precious low memory. So I think we may really need to move
the page table setup to after we have the hotplug info, in some way, just
as patch 5 moves reserve_crashkernel() to be called after
initmem_init().
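
To put rough numbers on the page table size, here is a minimal sketch. It assumes 8-byte entries as on x86-64 and counts only the last-level entries, so the upper levels are ignored. pt_entry_bytes() is a hypothetical helper written for this mail, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of last-level page table entries needed to map 'mem_bytes'
 * with pages of 'page_bytes', assuming 8 bytes per entry.
 */
uint64_t pt_entry_bytes(uint64_t mem_bytes, uint64_t page_bytes)
{
	assert(page_bytes != 0);

	/* Round up so a partial page still gets an entry. */
	uint64_t entries = (mem_bytes + page_bytes - 1) / page_bytes;

	return entries * 8;
}
```

For 1 TiB of RAM this gives 2 GiB with 4 KiB pages, 4 MiB with 2 MiB pages, and 8 KiB with 1 GiB pages. So the cost depends heavily on which page size ends up being used for the direct mapping; the worrying case is when large parts of it fall back to 4 KiB pages.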
So do you still have any objection to reordering the page table setup?
--
Thanks.
Zhang Yanfei
Hello guys,
On 10/10/2013 07:26 AM, Zhang Yanfei wrote:
> Hello Peter,
>
> On 10/10/2013 07:10 AM, H. Peter Anvin wrote:
>> On 10/09/2013 02:45 PM, Zhang Yanfei wrote:
>>>>
>>>> I would also argue that in the VM scenario -- and arguable even in the
>>>> hardware scenario -- the right thing is to not expose the flexible
>>>> memory in the e820/EFI tables, and instead have it hotadded (possibly
>>>> *immediately* so) on boot. This avoids both the boot time funnies as
>>>> well as the scaling issues with metadata.
>>>>
>>>
>>> So in this kind of scenario, hotpluggable memory will not be detected
>>> at boot time, and admin should not use this movable_node boot option
>>> and the kernel will act as before, using top-down allocation always.
>>>
>>
>> Yes. The idea is that the kernel will boot up without the hotplug
>> memory, but if desired, will immediately see a hotplug-add event for the
>> movable memory.
>
> Yeah, this is good.
>
> But in the scenario that boot with hotplug memory, we need the movable_node
> option. So as tejun has explained a lot about this patchset, do you still
> have objection to it or could I ask andrew to merge it into -mm tree for
> more tests?
>
Since Tejun has explained this approach in detail, could we come to
an agreement on it?
Peter, if you have no objection, I'll post a v7 that fixes the
__pa_symbol problem you pointed out.
--
Thanks.
Zhang Yanfei