2013-09-24 10:01:56

by Zhang Yanfei

Subject: [PATCH v4 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed

The v4 version is based on today's linus tree (3.12-rc2)
HEAD is:
commit 4a10c2ac2f368583138b774ca41fac4207911983
Author: Linus Torvalds <[email protected]>
Date: Mon Sep 23 15:41:09 2013 -0700

Linux 3.12-rc2


[Problem]

The current Linux kernel cannot migrate pages used by the kernel because
of the kernel direct mapping: in kernel space, va = pa + PAGE_OFFSET.
When the pa changes, we cannot simply update the page table and keep
the va unmodified. So kernel pages are not migratable.

There are also other issues that make kernel pages non-migratable.
For example, a physical address may be cached somewhere and used later,
and it is not easy to update all such caches.

When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism would be too difficult, and
it could degrade kernel performance and stability. So we use the following
approach for memory hotplug.


[What we are doing]

In Linux, memory in a NUMA node is divided into several zones. One of these
zones is ZONE_MOVABLE, which the kernel won't use.

In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.
To do this, we need ACPI's help.

In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info. The
memory affinities in the SRAT record every memory range in the system, along
with flags specifying whether each range is hotpluggable.
(Please refer to ACPI spec 5.0, section 5.2.16.)

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow users to arrange hotpluggable memory
as ZONE_MOVABLE.
(This has been done by the MOVABLE_NODE functionality in Linux.)

2. When the system is booting, prevent the bootmem allocator from allocating
hotpluggable memory for the kernel before memory initialization
finishes.

Problem 2 is the key problem we are going to solve. But before solving it,
we need some preparation. Please see below.


[Preparation]

The bootloader has to load the kernel image into memory, and that memory must
be unhotpluggable; we cannot prevent this anyway. So in a memory hotplug
system, we can assume any node the kernel resides in is not hotpluggable.

Before the SRAT is parsed, we don't know which memory ranges are hotpluggable,
but memblock has already started to work. In the current kernel, memblock
allocates the following memory before the SRAT is parsed:

setup_arch()
|->memblock_x86_fill() /* memblock is ready */
|......
|->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
|->reserve_real_mode() /* allocate memory under 1MB */
|->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
|->dma_contiguous_reserve() /* specified by user, should be low */
|->setup_log_buf() /* specified by user, several mega bytes */
|->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
|->acpi_initrd_override() /* several mega bytes */
|->reserve_crashkernel() /* could be large, should reorder */
|......
|->initmem_init() /* Parse SRAT */

According to Tejun's advice, before the SRAT is parsed we should try our best
to allocate memory near the kernel image. Since the whole node the kernel
resides in won't be hotpluggable, and a node on a modern server may have at
least 16GB of memory, allocating several megabytes of memory around the kernel
image won't cross into hotpluggable memory.


[About this patch-set]

So this patch-set does the following:

1. Make memblock able to allocate memory bottom-up.
1) Keep all the memblock APIs' prototypes unmodified.
2) When the direction is bottom-up, keep the start address above the
end of the kernel image.

2. Improve init_mem_mapping() to support allocating page tables in the bottom-up direction.

3. Introduce the "movablenode" boot option to enable and disable this functionality.

PS: The reordering of relocate_initrd() has not been done yet:
acpi_initrd_override() needs to access the initrd via its virtual address,
so relocate_initrd() must be done before acpi_initrd_override().

Change log v3 -> v4:
1. Use bottom-up/top-down to unify things. Thanks tj.
2. Factor the current top-down implementation out and then introduce the bottom-up
mode, instead of mixing them in one patch. Thanks tj.
3. Change function naming: memblock_direction_bottom_up -> memblock_bottom_up
4. Use memblock_set_bottom_up to replace memblock_set_current_direction, which makes
the code simpler. Thanks tj.
5. Define two implementations of memblock_bottom_up and memblock_set_bottom_up
in order not to use #ifdef in the boot code. Thanks tj.
6. Add comments to explain why we retry top-down allocation when bottom-up
allocation fails. Thanks tj and toshi!

Change log v2 -> v3:
1. According to Toshi's suggestion, moved the direction-checking logic into memblock
and simplified the code further.

Change log v1 -> v2:
1. According to tj's suggestion, implemented a new function memblock_alloc_bottom_up()
to allocate memory from the bottom up, which can simplify the code.


Tang Chen (6):
memblock: Factor out of top-down allocation
memblock: Introduce bottom-up allocation mode
x86/mm: Factor out of top-down direct mapping setup
x86/mem-hotplug: Support initialize page tables bottom up
x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is
parsed.
mem-hotplug: Introduce movablenode boot option

Documentation/kernel-parameters.txt | 15 ++++
arch/x86/kernel/setup.c | 16 ++++-
arch/x86/mm/init.c | 125 +++++++++++++++++++++++++++-------
include/linux/memblock.h | 26 +++++++
mm/memblock.c | 106 ++++++++++++++++++++++++++---
mm/memory_hotplug.c | 27 ++++++++
6 files changed, 276 insertions(+), 39 deletions(-)


2013-09-24 10:04:13

by Zhang Yanfei

Subject: [PATCH 1/6] memblock: Factor out of top-down allocation

From: Tang Chen <[email protected]>

This patch imports a new function __memblock_find_range_rev
to factor out of top-down allocation from memblock_find_in_range_node.
This is a preparation because we will introduce a new bottom-up
allocation mode in the following patch.

Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
mm/memblock.c | 47 ++++++++++++++++++++++++++++++++++-------------
1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 0ac412a..3d80c74 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -83,33 +83,25 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
}

/**
- * memblock_find_in_range_node - find free area in given range and node
+ * __memblock_find_range_rev - find free area utility, in reverse order
* @start: start of candidate range
* @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
* @size: size of free area to find
* @align: alignment of free area to find
* @nid: nid of the free area to find, %MAX_NUMNODES for any node
*
- * Find @size free area aligned to @align in the specified range and node.
+ * Utility called from memblock_find_in_range_node(), find free area top-down.
*
* RETURNS:
* Found address on success, %0 on failure.
*/
-phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
- phys_addr_t end, phys_addr_t size,
- phys_addr_t align, int nid)
+static phys_addr_t __init_memblock
+__memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
+ phys_addr_t size, phys_addr_t align, int nid)
{
phys_addr_t this_start, this_end, cand;
u64 i;

- /* pump up @end */
- if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
- end = memblock.current_limit;
-
- /* avoid allocating the first page */
- start = max_t(phys_addr_t, start, PAGE_SIZE);
- end = max(start, end);
-
for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
this_start = clamp(this_start, start, end);
this_end = clamp(this_end, start, end);
@@ -121,10 +113,39 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
if (cand >= this_start)
return cand;
}
+
return 0;
}

/**
+ * memblock_find_in_range_node - find free area in given range and node
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Find @size free area aligned to @align in the specified range and node.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
+ phys_addr_t end, phys_addr_t size,
+ phys_addr_t align, int nid)
+{
+ /* pump up @end */
+ if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+ end = memblock.current_limit;
+
+ /* avoid allocating the first page */
+ start = max_t(phys_addr_t, start, PAGE_SIZE);
+ end = max(start, end);
+
+ return __memblock_find_range_rev(start, end, size, align, nid);
+}
+
+/**
* memblock_find_in_range - find free area in given range
* @start: start of candidate range
* @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
--
1.7.1

2013-09-24 10:06:07

by Zhang Yanfei

Subject: [PATCH 2/6] memblock: Introduce bottom-up allocation mode

From: Tang Chen <[email protected]>

The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
pages cannot be hot-removed. So we cannot allocate hotpluggable memory for
the kernel.

ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
But before SRAT is parsed, memblock has already started to allocate memory
for the kernel. So we need to prevent memblock from doing this.

In a memory hotplug system, any NUMA node the kernel resides in should
be unhotpluggable. And on a modern server, each node can have at least
16GB of memory. So memory around the kernel image is highly likely to be
unhotpluggable.

So the basic idea is: allocate memory from the end of the kernel image
upward, toward higher memory. Since not much memory is allocated before
the SRAT is parsed, it will most likely be in the same node as the
kernel image.

The current memblock can only allocate memory top-down, so this patch
introduces a new bottom-up allocation mode. Later, when we allocate
memory in this direction, we will limit the start address to above the
kernel image.

Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 26 +++++++++++++++++++
mm/memblock.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 31e95ac..c14bca5 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,13 @@

#define INIT_MEMBLOCK_REGIONS 128

+/* Allocation direction */
+enum {
+ MEMBLOCK_DIRECTION_TOP_DOWN,
+ MEMBLOCK_DIRECTION_BOTTOM_UP,
+ NR_MEMLBOCK_DIRECTIONS
+};
+
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
@@ -35,6 +42,7 @@ struct memblock_type {
};

struct memblock {
+ int current_direction; /* current allocation direction */
phys_addr_t current_limit;
struct memblock_type memory;
struct memblock_type reserved;
@@ -148,6 +156,24 @@ phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid)

phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);

+#ifdef CONFIG_MOVABLE_NODE
+static inline void memblock_set_bottom_up(bool enable)
+{
+ if (enable)
+ memblock.current_direction = MEMBLOCK_DIRECTION_BOTTOM_UP;
+ else
+ memblock.current_direction = MEMBLOCK_DIRECTION_TOP_DOWN;
+}
+
+static inline bool memblock_bottom_up(void)
+{
+ return memblock.current_direction == MEMBLOCK_DIRECTION_BOTTOM_UP;
+}
+#else
+static inline void memblock_set_bottom_up(bool enable) {}
+static inline bool memblock_bottom_up(void) { return false; }
+#endif
+
/* Flags for memblock_alloc_base() amd __memblock_alloc_base() */
#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
#define MEMBLOCK_ALLOC_ACCESSIBLE 0
diff --git a/mm/memblock.c b/mm/memblock.c
index 3d80c74..5859b8e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -20,6 +20,8 @@
#include <linux/seq_file.h>
#include <linux/memblock.h>

+#include <asm-generic/sections.h>
+
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;

@@ -32,6 +34,7 @@ struct memblock memblock __initdata_memblock = {
.reserved.cnt = 1, /* empty dummy entry */
.reserved.max = INIT_MEMBLOCK_REGIONS,

+ .current_direction = MEMBLOCK_DIRECTION_TOP_DOWN,
.current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};

@@ -83,6 +86,38 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
}

/**
+ * __memblock_find_range - find free area utility
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(), find free area bottom-up.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+static phys_addr_t __init_memblock
+__memblock_find_range(phys_addr_t start, phys_addr_t end, phys_addr_t size,
+ phys_addr_t align, int nid)
+{
+ phys_addr_t this_start, this_end, cand;
+ u64 i;
+
+ for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
+ this_start = clamp(this_start, start, end);
+ this_end = clamp(this_end, start, end);
+
+ cand = round_up(this_start, align);
+ if (cand < this_end && this_end - cand >= size)
+ return cand;
+ }
+
+ return 0;
+}
+
+/**
* __memblock_find_range_rev - find free area utility, in reverse order
* @start: start of candidate range
* @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
@@ -127,6 +162,10 @@ __memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
*
* Find @size free area aligned to @align in the specified range and node.
*
+ * When allocation direction is bottom-up, the @start should be greater
+ * than the end of the kernel image. Otherwise, it will be trimmed. And also,
+ * if bottom-up allocation failed, will try to allocate memory top-down.
+ *
* RETURNS:
* Found address on success, %0 on failure.
*/
@@ -134,6 +173,8 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
phys_addr_t end, phys_addr_t size,
phys_addr_t align, int nid)
{
+ int ret;
+
/* pump up @end */
if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
end = memblock.current_limit;
@@ -142,6 +183,28 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
start = max_t(phys_addr_t, start, PAGE_SIZE);
end = max(start, end);

+ if (memblock_bottom_up()) {
+ phys_addr_t bottom_up_start;
+
+ /* make sure we will allocate above the kernel */
+ bottom_up_start = max_t(phys_addr_t, start, __pa_symbol(_end));
+
+ /* ok, try bottom-up allocation first */
+ ret = __memblock_find_range(bottom_up_start, end,
+ size, align, nid);
+ if (ret)
+ return ret;
+
+ /*
+ * we always limit bottom-up allocation above the kernel,
+ * but top-down allocation doesn't have the limit, so
+ * retrying top-down allocation may succeed when bottom-up
+ * allocation failed.
+ */
+ pr_warn("memblock: Failed to allocate memory in bottom up "
+ "direction. Now try top down direction.\n");
+ }
+
return __memblock_find_range_rev(start, end, size, align, nid);
}

--
1.7.1

2013-09-24 10:07:29

by Zhang Yanfei

Subject: [PATCH 3/6] x86/mm: Factor out of top-down direct mapping setup

From: Tang Chen <[email protected]>

This patch introduces a new function, memory_map_top_down(), to
factor out the top-down direct memory mapping page table
setup. This is also a preparation for the following patch,
which will introduce bottom-up memory mapping. That is,
we put the two ways of page table setup into separate
functions and choose which one to use in init_mem_mapping(),
which makes the code clearer.

Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
arch/x86/mm/init.c | 70 ++++++++++++++++++++++++++++++++-------------------
1 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 04664cd..73e79e6 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -401,27 +401,27 @@ static unsigned long __init init_range_memory_mapping(

/* (PUD_SHIFT-PMD_SHIFT)/2 */
#define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+
+/**
+ * memory_map_top_down - Map [map_start, map_end) top down
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range [map_start, map_end)
+ * in a heuristic way. In the beginning, step_size is small. The more memory we
+ * map in the current loop, the larger step_size will be for the next loop.
+ */
+static void __init memory_map_top_down(unsigned long map_start,
+ unsigned long map_end)
{
- unsigned long end, real_end, start, last_start;
+ unsigned long real_end, start, last_start;
unsigned long step_size;
unsigned long addr;
unsigned long mapped_ram_size = 0;
unsigned long new_mapped_ram_size;

- probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
- end = max_pfn << PAGE_SHIFT;
-#else
- end = max_low_pfn << PAGE_SHIFT;
-#endif
-
- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
-
/* xen has big range in reserved near end of ram, skip it at first.*/
- addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+ addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;

/* step_size need to be small so pgt_buf from BRK could cover it */
@@ -430,19 +430,13 @@ void __init init_mem_mapping(void)
min_pfn_mapped = real_end >> PAGE_SHIFT;
last_start = start = real_end;

- /*
- * We start from the top (end of memory) and go to the bottom.
- * The memblock_find_in_range() gets us a block of RAM from the
- * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
- * for page table.
- */
- while (last_start > ISA_END_ADDRESS) {
+ while (last_start > map_start) {
if (last_start > step_size) {
start = round_down(last_start - 1, step_size);
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
+ if (start < map_start)
+ start = map_start;
} else
- start = ISA_END_ADDRESS;
+ start = map_start;
new_mapped_ram_size = init_range_memory_mapping(start,
last_start);
last_start = start;
@@ -453,8 +447,32 @@ void __init init_mem_mapping(void)
mapped_ram_size += new_mapped_ram_size;
}

- if (real_end < end)
- init_range_memory_mapping(real_end, end);
+ if (real_end < map_end)
+ init_range_memory_mapping(real_end, map_end);
+}
+
+void __init init_mem_mapping(void)
+{
+ unsigned long end;
+
+ probe_page_size_mask();
+
+#ifdef CONFIG_X86_64
+ end = max_pfn << PAGE_SHIFT;
+#else
+ end = max_low_pfn << PAGE_SHIFT;
+#endif
+
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+
+ /*
+ * We start from the top (end of memory) and go to the bottom.
+ * The memblock_find_in_range() gets us a block of RAM from the
+ * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
+ * for page table.
+ */
+ memory_map_top_down(ISA_END_ADDRESS, end);

#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
--
1.7.1

2013-09-24 10:09:18

by Zhang Yanfei

Subject: [PATCH 4/6] x86/mem-hotplug: Support initialize page tables bottom up

From: Tang Chen <[email protected]>

The Linux kernel cannot migrate pages used by the kernel. As a
result, kernel pages cannot be hot-removed. So we cannot allocate
hotpluggable memory for the kernel.

In a memory hotplug system, any NUMA node the kernel resides in
should be unhotpluggable. And on a modern server, each node can
have at least 16GB of memory. So memory around the kernel image is
highly likely to be unhotpluggable.

ACPI SRAT (System Resource Affinity Table) contains the memory
hotplug info. But before SRAT is parsed, memblock has already
started to allocate memory for the kernel. So we need to prevent
memblock from doing this.

The direct memory mapping page table setup is exactly such a case:
init_mem_mapping() is called before the SRAT is parsed. To prevent page
tables from being allocated within hotpluggable memory, we use the bottom-up
direction to allocate them, from the end of the kernel image toward higher memory.

Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
arch/x86/mm/init.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 73e79e6..7441865 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -451,6 +451,50 @@ static void __init memory_map_top_down(unsigned long map_start,
init_range_memory_mapping(real_end, map_end);
}

+
+#ifdef CONFIG_MOVABLE_NODE
+/**
+ * memory_map_bottom_up - Map [map_start, map_end) bottom up
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range [map_start, map_end)
+ * in a heuristic way. In the beginning, step_size is small. The more memory we
+ * map in the current loop, the larger step_size will be for the next loop.
+ */
+static void __init memory_map_bottom_up(unsigned long map_start,
+ unsigned long map_end)
+{
+ unsigned long next, new_mapped_ram_size, start;
+ unsigned long mapped_ram_size = 0;
+ /* step_size need to be small so pgt_buf from BRK could cover it */
+ unsigned long step_size = PMD_SIZE;
+ start = map_start;
+
+ while (start < map_end) {
+ if (map_end - start > step_size) {
+ next = round_up(start + 1, step_size);
+ if (next > map_end)
+ next = map_end;
+ } else
+ next = map_end;
+
+ new_mapped_ram_size = init_range_memory_mapping(start, next);
+ min_pfn_mapped = start >> PAGE_SHIFT;
+ start = next;
+
+ if (new_mapped_ram_size > mapped_ram_size)
+ step_size <<= STEP_SIZE_SHIFT;
+ mapped_ram_size += new_mapped_ram_size;
+ }
+}
+#else
+static void __init memory_map_bottom_up(unsigned long map_start,
+ unsigned long map_end)
+{
+}
+#endif /* CONFIG_MOVABLE_NODE */
+
void __init init_mem_mapping(void)
{
unsigned long end;
@@ -467,12 +511,23 @@ void __init init_mem_mapping(void)
init_memory_mapping(0, ISA_END_ADDRESS);

/*
- * We start from the top (end of memory) and go to the bottom.
- * The memblock_find_in_range() gets us a block of RAM from the
- * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
- * for page table.
+ * If the allocation is in bottom-up direction, we start from the
+ * bottom and go to the top: first [kernel_end, end) and then
+ * [ISA_END_ADDRESS, kernel_end). Otherwise, we start from the top
+ * (end of memory) and go to the bottom.
+ *
+ * The memblock_find_in_range() gets us a block of RAM in
+ * [min_pfn_mapped, max_pfn_mapped) used as new pages for page table.
*/
- memory_map_top_down(ISA_END_ADDRESS, end);
+ if (memblock_bottom_up()) {
+ unsigned long kernel_end;
+
+ kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
+ memory_map_bottom_up(kernel_end, end);
+ memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
+ } else {
+ memory_map_top_down(ISA_END_ADDRESS, end);
+ }

#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
--
1.7.1

2013-09-24 10:10:38

by Zhang Yanfei

Subject: [PATCH 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed.

From: Tang Chen <[email protected]>

Memory reserved for the crash kernel could be large, so we should not
allocate this memory bottom-up from the end of the kernel image.

Once the SRAT is parsed, we will know which memory is hotpluggable,
and we can avoid allocating such memory for the kernel. So reorder
reserve_crashkernel() to after the SRAT is parsed.

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/kernel/setup.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..36cfce3 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,8 +1120,6 @@ void __init setup_arch(char **cmdline_p)
acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
#endif

- reserve_crashkernel();
-
vsmp_init();

io_delay_init();
@@ -1136,6 +1134,12 @@ void __init setup_arch(char **cmdline_p)
initmem_init();
memblock_find_dma_reserve();

+ /*
+ * Reserve memory for crash kernel after SRAT is parsed so that it
+ * won't consume hotpluggable memory.
+ */
+ reserve_crashkernel();
+
#ifdef CONFIG_KVM_GUEST
kvmclock_init();
#endif
--
1.7.1

2013-09-24 10:12:32

by Zhang Yanfei

Subject: [PATCH 6/6] mem-hotplug: Introduce movablenode boot option

From: Tang Chen <[email protected]>

The Hot Pluggable field in the SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to put all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE, so doing this means the
kernel cannot use memory in movable nodes. This will degrade NUMA
performance, and other users may be unhappy.

So we need a way to let users enable and disable this functionality.
In this patch, we introduce the movablenode boot option, which lets users
choose not to consume hotpluggable memory at early boot so that it can
later be set as ZONE_MOVABLE.

To achieve this, the movablenode boot option controls the memblock
allocation direction. That is, after memblock is ready and before the SRAT
is parsed, we should allocate memory near the kernel image, as explained
in the previous patches. So if the movablenode boot option is set, the
kernel does the following:

1. After memblock is ready, make memblock allocate memory bottom-up.
2. After the SRAT is parsed, make memblock behave as by default, allocating
memory top-down.

Users can specify "movablenode" on the kernel command line to enable this
functionality. Those who don't use memory hotplug, or who don't want
to lose NUMA performance, can simply not specify it; the kernel will
work as before.

Suggested-by: Kamezawa Hiroyuki <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
Documentation/kernel-parameters.txt | 15 +++++++++++++++
arch/x86/kernel/setup.c | 8 ++++++++
mm/memory_hotplug.c | 27 +++++++++++++++++++++++++++
3 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1a036cd..8c056c4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.

+ movablenode [KNL,X86] This parameter enables/disables the
+ kernel to arrange hotpluggable memory ranges recorded
+ in ACPI SRAT(System Resource Affinity Table) as
+ ZONE_MOVABLE. And these memory can be hot-removed when
+ the system is up.
+ By specifying this option, all the hotpluggable memory
+ will be in ZONE_MOVABLE, which the kernel cannot use.
+ This will cause NUMA performance down. For users who
+ care about NUMA performance, just don't use it.
+ If all the memory ranges in the system are hotpluggable,
+ then the ones used by the kernel at early time, such as
+ kernel code and data segments, initrd file and so on,
+ won't be set as ZONE_MOVABLE, and won't be hotpluggable.
+ Otherwise the kernel won't have enough memory to boot.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 36cfce3..2cf04fd 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1132,6 +1132,14 @@ void __init setup_arch(char **cmdline_p)
early_acpi_boot_init();

initmem_init();
+
+ /*
+ * When ACPI SRAT is parsed, which is done in initmem_init(),
+ * set memblock back to the top-down direction.
+ */
+ if (memblock_bottom_up())
+ memblock_set_bottom_up(false);
+
memblock_find_dma_reserve();

/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ed85fe3..9d36fee 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -31,6 +31,7 @@
#include <linux/firmware-map.h>
#include <linux/stop_machine.h>
#include <linux/hugetlb.h>
+#include <linux/memblock.h>

#include <asm/tlbflush.h>

@@ -1386,6 +1387,32 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
{
return true;
}
+
+static int __init cmdline_parse_movablenode(char *p)
+{
+ /*
+ * Memory used by the kernel cannot be hot-removed because Linux
+ * cannot migrate the kernel pages. When memory hotplug is
+ * enabled, we should prevent memblock from allocating memory
+ * for the kernel.
+ *
+ * ACPI SRAT records all hotpluggable memory ranges. But before
+ * SRAT is parsed, we don't know about it.
+ *
+ * The kernel image is loaded into memory at very early time. We
+ * cannot prevent this anyway. So on NUMA system, we set any
+ * node the kernel resides in as un-hotpluggable.
+ *
+ * Since on modern servers, one node could have double-digit
+ * gigabytes memory, we can assume the memory around the kernel
+ * image is also un-hotpluggable. So before SRAT is parsed, just
+ * allocate memory near the kernel image to try the best to keep
+ * the kernel away from hotpluggable memory.
+ */
+ memblock_set_bottom_up(true);
+ return 0;
+}
+early_param("movablenode", cmdline_parse_movablenode);
#else /* CONFIG_MOVABLE_NODE */
/* ensure the node has NORMAL memory if it is still online */
static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
--
1.7.1

2013-09-24 12:10:17

by Tejun Heo

Subject: Re: [PATCH 1/6] memblock: Factor out of top-down allocation

On Tue, Sep 24, 2013 at 06:03:23PM +0800, Zhang Yanfei wrote:
> From: Tang Chen <[email protected]>
>
> This patch imports a new function __memblock_find_range_rev

"imports" sounds a bit weird. I think "creates" or "this patch
factors out __memblock_find_range_rev() from
memblock_find_in_range_node()" would be better.

> to factor out of top-down allocation from memblock_find_in_range_node.
> This is a preparation because we will introduce a new bottom-up
> allocation mode in the following patch.
>
> Signed-off-by: Tang Chen <[email protected]>
> Signed-off-by: Zhang Yanfei <[email protected]>

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2013-09-24 12:17:32

by Tejun Heo

Subject: Re: [PATCH 2/6] memblock: Introduce bottom-up allocation mode

On Tue, Sep 24, 2013 at 06:05:03PM +0800, Zhang Yanfei wrote:
> +/* Allocation direction */
> +enum {
> + MEMBLOCK_DIRECTION_TOP_DOWN,
> + MEMBLOCK_DIRECTION_BOTTOM_UP,
> + NR_MEMLBOCK_DIRECTIONS
> +};
> +
> struct memblock_region {
> phys_addr_t base;
> phys_addr_t size;
> @@ -35,6 +42,7 @@ struct memblock_type {
> };
>
> struct memblock {
> + int current_direction; /* current allocation direction */

Just use boolean bottom_up here too? No need for the constants.

> @@ -20,6 +20,8 @@
> #include <linux/seq_file.h>
> #include <linux/memblock.h>
>
> +#include <asm-generic/sections.h>
> +

Why is the above added by this patch?

> /**
> + * __memblock_find_range - find free area utility
> + * @start: start of candidate range
> + * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
> + * @size: size of free area to find
> + * @align: alignment of free area to find
> + * @nid: nid of the free area to find, %MAX_NUMNODES for any node
> + *
> + * Utility called from memblock_find_in_range_node(), find free area bottom-up.
> + *
> + * RETURNS:
> + * Found address on success, %0 on failure.

I don't think we prefix numeric literals with %.

...
> @@ -127,6 +162,10 @@ __memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
> *
> * Find @size free area aligned to @align in the specified range and node.
> *
> + * When allocation direction is bottom-up, the @start should be greater
> + * than the end of the kernel image. Otherwise, it will be trimmed. And also,
> + * if bottom-up allocation failed, will try to allocate memory top-down.

It'd be nice to explain that bottom-up allocation is limited to above
kernel image and what it's used for here.

> + *
> * RETURNS:
> * Found address on success, %0 on failure.
> */
> @@ -134,6 +173,8 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> phys_addr_t end, phys_addr_t size,
> phys_addr_t align, int nid)
> {
> + int ret;
> +
> /* pump up @end */
> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> end = memblock.current_limit;
> @@ -142,6 +183,28 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> start = max_t(phys_addr_t, start, PAGE_SIZE);
> end = max(start, end);
>
> + if (memblock_bottom_up()) {
> + phys_addr_t bottom_up_start;
> +
> + /* make sure we will allocate above the kernel */
> + bottom_up_start = max_t(phys_addr_t, start, __pa_symbol(_end));
> +
> + /* ok, try bottom-up allocation first */
> + ret = __memblock_find_range(bottom_up_start, end,
> + size, align, nid);
> + if (ret)
> + return ret;
> +
> + /*
> + * we always limit bottom-up allocation above the kernel,
> + * but top-down allocation doesn't have the limit, so
> + * retrying top-down allocation may succeed when bottom-up
> + * allocation failed.
> + */
> + pr_warn("memblock: Failed to allocate memory in bottom up "
> + "direction. Now try top down direction.\n");

Maybe just print warning only on the first failure?

Otherwise, looks good to me.

Thanks.

--
tejun

2013-09-24 12:27:19

by Tejun Heo

Subject: Re: [PATCH 3/6] x86/mm: Factor out of top-down direct mapping setup

On Tue, Sep 24, 2013 at 06:06:41PM +0800, Zhang Yanfei wrote:
> +/**
> + * memory_map_top_down - Map [map_start, map_end) top down
> + * @map_start: start address of the target memory range
> + * @map_end: end address of the target memory range
> + *
> + * This function will setup direct mapping for memory range [map_start, map_end)
> + * in a heuristic way. In the beginning, step_size is small. The more memory we
> + * map memory in the next loop.
> + */

The comment reads a bit weird to me. The step size is increased
gradually but that really isn't really a heuristic and it doesn't
mention mapping direction.

...
> @@ -430,19 +430,13 @@ void __init init_mem_mapping(void)
> min_pfn_mapped = real_end >> PAGE_SHIFT;
> last_start = start = real_end;
>
> - /*
> - * We start from the top (end of memory) and go to the bottom.
> - * The memblock_find_in_range() gets us a block of RAM from the
> - * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
> - * for page table.
> - */

I think this comment should stay here with the variable names
updated.

> - while (last_start > ISA_END_ADDRESS) {
> + while (last_start > map_start) {
> if (last_start > step_size) {
> start = round_down(last_start - 1, step_size);
> - if (start < ISA_END_ADDRESS)
> - start = ISA_END_ADDRESS;
> + if (start < map_start)
> + start = map_start;
> } else
> - start = ISA_END_ADDRESS;
> + start = map_start;
> new_mapped_ram_size = init_range_memory_mapping(start,
> last_start);
> last_start = start;
> @@ -453,8 +447,32 @@ void __init init_mem_mapping(void)
> mapped_ram_size += new_mapped_ram_size;
> }
>
> - if (real_end < end)
> - init_range_memory_mapping(real_end, end);
> + if (real_end < map_end)
> + init_range_memory_mapping(real_end, map_end);
> +}
> +
> +void __init init_mem_mapping(void)
> +{
> + unsigned long end;
> +
> + probe_page_size_mask();
> +
> +#ifdef CONFIG_X86_64
> + end = max_pfn << PAGE_SHIFT;
> +#else
> + end = max_low_pfn << PAGE_SHIFT;
> +#endif
> +
> + /* the ISA range is always mapped regardless of memory holes */
> + init_memory_mapping(0, ISA_END_ADDRESS);
> +
> + /*
> + * We start from the top (end of memory) and go to the bottom.
> + * The memblock_find_in_range() gets us a block of RAM from the
> + * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
> + * for page table.
> + */

And just mention the range and direction in the comment here?

> + memory_map_top_down(ISA_END_ADDRESS, end);

Thanks.

--
tejun

2013-09-24 12:33:46

by Tejun Heo

Subject: Re: [PATCH 4/6] x86/mem-hotplug: Support initialize page tables bottom up

Hello,

On Tue, Sep 24, 2013 at 06:08:27PM +0800, Zhang Yanfei wrote:
> +#ifdef CONFIG_MOVABLE_NODE

You don't need the above ifdef. The compiler will be able to cull the
code as long as memblock_bottom_up() is defined as constant
expression when !MOVABLE_NODE.

> +/**
> + * memory_map_bottom_up - Map [map_start, map_end) bottom up
> + * @map_start: start address of the target memory range
> + * @map_end: end address of the target memory range
> + *
> + * This function will setup direct mapping for memory range [map_start, map_end)
> + * in a heuristic way. In the beginning, step_size is small. The more memory we
> + * map memory in the next loop.

The same comment as before. Now we have two function with the
identical comment but behaving differently, which isn't nice.

...
> + * If the allocation is in bottom-up direction, we start from the
> + * bottom and go to the top: first [kernel_end, end) and then
> + * [ISA_END_ADDRESS, kernel_end). Otherwise, we start from the top
> + * (end of memory) and go to the bottom.
> + *
> + * The memblock_find_in_range() gets us a block of RAM in
> + * [min_pfn_mapped, max_pfn_mapped) used as new pages for page table.
> */
> - memory_map_top_down(ISA_END_ADDRESS, end);
> + if (memblock_bottom_up()) {
> + unsigned long kernel_end;
> +
> + kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
> + memory_map_bottom_up(kernel_end, end);
> + memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);

Hmm... so, this is kinda weird. We're doing it in two chunks and
mapping memory between ISA_END_ADDRESS and kernel_end right on top of
ISA_END_ADDRESS? Can't you give enough information to the mapping
function so that it can map everything on top of kernel_end in single
go?

Thanks.

--
tejun

2013-09-24 12:34:34

by Tejun Heo

Subject: Re: [PATCH 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed.

On Tue, Sep 24, 2013 at 06:09:49PM +0800, Zhang Yanfei wrote:
> From: Tang Chen <[email protected]>
>
> Memory reserved for crashkernel could be large. So we should not allocate
> this memory bottom up from the end of kernel image.
>
> When SRAT is parsed, we will be able to know which memory is hotpluggable,
> and we can avoid allocating this memory for the kernel. So reorder
> reserve_crashkernel() after SRAT is parsed.
>
> Signed-off-by: Tang Chen <[email protected]>
> Reviewed-by: Zhang Yanfei <[email protected]>

Assuming this was tested to work.

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2013-09-24 12:41:27

by Tejun Heo

Subject: Re: [PATCH 6/6] mem-hotplug: Introduce movablenode boot option

Hello,

On Tue, Sep 24, 2013 at 06:11:42PM +0800, Zhang Yanfei wrote:
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 36cfce3..2cf04fd 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1132,6 +1132,14 @@ void __init setup_arch(char **cmdline_p)
> early_acpi_boot_init();
>
> initmem_init();
> +
> + /*
> + * When ACPI SRAT is parsed, which is done in initmem_init(),
> + * set memblock back to the top-down direction.
> + */
> + if (memblock_bottom_up())
> + memblock_set_bottom_up(false);

I don't think you need the if (). Just call
memblock_set_bottom_up(false).

> +static int __init cmdline_parse_movablenode(char *p)
> +{
> + /*
> + * Memory used by the kernel cannot be hot-removed because Linux
> + * cannot migrate the kernel pages. When memory hotplug is
> + * enabled, we should prevent memblock from allocating memory
> + * for the kernel.
> + *
> + * ACPI SRAT records all hotpluggable memory ranges. But before
> + * SRAT is parsed, we don't know about it.
> + *
> + * The kernel image is loaded into memory at very early time. We
> + * cannot prevent this anyway. So on NUMA system, we set any
> + * node the kernel resides in as un-hotpluggable.
> + *
> + * Since on modern servers, one node could have double-digit
> + * gigabytes memory, we can assume the memory around the kernel
> + * image is also un-hotpluggable. So before SRAT is parsed, just
> + * allocate memory near the kernel image to try the best to keep
> + * the kernel away from hotpluggable memory.
> + */
> + memblock_set_bottom_up(true);
> + return 0;
> +}
> +early_param("movablenode", cmdline_parse_movablenode);

This came up during earlier review but never was addressed. Is
"movablenode" the right name? Shouldn't it be something which
explicitly shows that it's to prepare for memory hotplug? Also, maybe
the above param should generate warning if CONFIG_MOVABLE_NODE isn't
enabled?

Thanks.

--
tejun

2013-09-24 13:05:39

by Zhang Yanfei

Subject: Re: [PATCH 1/6] memblock: Factor out of top-down allocation

Hello tejun,

On 09/24/2013 08:10 PM, Tejun Heo wrote:
> On Tue, Sep 24, 2013 at 06:03:23PM +0800, Zhang Yanfei wrote:
>> From: Tang Chen <[email protected]>
>>
>> This patch imports a new function __memblock_find_range_rev
>
> "imports" sounds a bit weird. I think "creates" or "this patch
> factors out __memblock_find_range_rev() from
> memblock_find_in_range_node()" would be better.

Okay, will update it in next version. Thanks.

>
>> to factor out of top-down allocation from memblock_find_in_range_node.
>> This is a preparation because we will introduce a new bottom-up
>> allocation mode in the following patch.
>>
>> Signed-off-by: Tang Chen <[email protected]>
>> Signed-off-by: Zhang Yanfei <[email protected]>
>
> Acked-by: Tejun Heo <[email protected]>

Thanks.

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-24 13:17:59

by Zhang Yanfei

Subject: Re: [PATCH 2/6] memblock: Introduce bottom-up allocation mode

Hello tejun,

On 09/24/2013 08:17 PM, Tejun Heo wrote:
> On Tue, Sep 24, 2013 at 06:05:03PM +0800, Zhang Yanfei wrote:
>> +/* Allocation direction */
>> +enum {
>> + MEMBLOCK_DIRECTION_TOP_DOWN,
>> + MEMBLOCK_DIRECTION_BOTTOM_UP,
>> + NR_MEMLBOCK_DIRECTIONS
>> +};
>> +
>> struct memblock_region {
>> phys_addr_t base;
>> phys_addr_t size;
>> @@ -35,6 +42,7 @@ struct memblock_type {
>> };
>>
>> struct memblock {
>> + int current_direction; /* current allocation direction */
>
> Just use boolean bottom_up here too? No need for the constants.

Okay. Will try this way.

>
>> @@ -20,6 +20,8 @@
>> #include <linux/seq_file.h>
>> #include <linux/memblock.h>
>>
>> +#include <asm-generic/sections.h>
>> +
>
> Why is the above added by this patch?

Oh, we use _end in this file which is defined in that header file.

>
>> /**
>> + * __memblock_find_range - find free area utility
>> + * @start: start of candidate range
>> + * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
>> + * @size: size of free area to find
>> + * @align: alignment of free area to find
>> + * @nid: nid of the free area to find, %MAX_NUMNODES for any node
>> + *
>> + * Utility called from memblock_find_in_range_node(), find free area bottom-up.
>> + *
>> + * RETURNS:
>> + * Found address on success, %0 on failure.
>
> I don't think we prefix numeric literals with %.

Okay, will remove the %.

>
> ...
>> @@ -127,6 +162,10 @@ __memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
>> *
>> * Find @size free area aligned to @align in the specified range and node.
>> *
>> + * When allocation direction is bottom-up, the @start should be greater
>> + * than the end of the kernel image. Otherwise, it will be trimmed. And also,
>> + * if bottom-up allocation failed, will try to allocate memory top-down.
>
> It'd be nice to explain that bottom-up allocation is limited to above
> kernel image and what it's used for here.

OK. Will add the comment.

>
>> + *
>> * RETURNS:
>> * Found address on success, %0 on failure.
>> */
>> @@ -134,6 +173,8 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> phys_addr_t end, phys_addr_t size,
>> phys_addr_t align, int nid)
>> {
>> + int ret;
>> +
>> /* pump up @end */
>> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
>> end = memblock.current_limit;
>> @@ -142,6 +183,28 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> start = max_t(phys_addr_t, start, PAGE_SIZE);
>> end = max(start, end);
>>
>> + if (memblock_bottom_up()) {
>> + phys_addr_t bottom_up_start;
>> +
>> + /* make sure we will allocate above the kernel */
>> + bottom_up_start = max_t(phys_addr_t, start, __pa_symbol(_end));
>> +
>> + /* ok, try bottom-up allocation first */
>> + ret = __memblock_find_range(bottom_up_start, end,
>> + size, align, nid);
>> + if (ret)
>> + return ret;
>> +
>> + /*
>> + * we always limit bottom-up allocation above the kernel,
>> + * but top-down allocation doesn't have the limit, so
>> + * retrying top-down allocation may succeed when bottom-up
>> + * allocation failed.
>> + */
>> + pr_warn("memblock: Failed to allocate memory in bottom up "
>> + "direction. Now try top down direction.\n");
>
> Maybe just print warning only on the first failure?

Hmmm... This message is printed for each memblock allocation. That is, if
the allocation fails this time, we print the message and fall back to
top-down, but the next allocation still tries bottom-up first.

Did you mean that if one bottom-up allocation fails, we should never try
bottom-up again and always use top-down?

Thanks.

>
> Otherwise, looks good to me.
>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-24 13:21:14

by Zhang Yanfei

Subject: Re: [PATCH 3/6] x86/mm: Factor out of top-down direct mapping setup

Hello tejun,

On 09/24/2013 08:27 PM, Tejun Heo wrote:
> On Tue, Sep 24, 2013 at 06:06:41PM +0800, Zhang Yanfei wrote:
>> +/**
>> + * memory_map_top_down - Map [map_start, map_end) top down
>> + * @map_start: start address of the target memory range
>> + * @map_end: end address of the target memory range
>> + *
>> + * This function will setup direct mapping for memory range [map_start, map_end)
>> + * in a heuristic way. In the beginning, step_size is small. The more memory we
>> + * map memory in the next loop.
>> + */
>
> The comment reads a bit weird to me. The step size is increased
> gradually but that really isn't really a heuristic and it doesn't
> mention mapping direction.

Ok, will change the words and add the comment of direction.

>
> ...
>> @@ -430,19 +430,13 @@ void __init init_mem_mapping(void)
>> min_pfn_mapped = real_end >> PAGE_SHIFT;
>> last_start = start = real_end;
>>
>> - /*
>> - * We start from the top (end of memory) and go to the bottom.
>> - * The memblock_find_in_range() gets us a block of RAM from the
>> - * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
>> - * for page table.
>> - */
>
> I think this comment should stay here with the variable names
> updated.

OK, agreed

>
>> - while (last_start > ISA_END_ADDRESS) {
>> + while (last_start > map_start) {
>> if (last_start > step_size) {
>> start = round_down(last_start - 1, step_size);
>> - if (start < ISA_END_ADDRESS)
>> - start = ISA_END_ADDRESS;
>> + if (start < map_start)
>> + start = map_start;
>> } else
>> - start = ISA_END_ADDRESS;
>> + start = map_start;
>> new_mapped_ram_size = init_range_memory_mapping(start,
>> last_start);
>> last_start = start;
>> @@ -453,8 +447,32 @@ void __init init_mem_mapping(void)
>> mapped_ram_size += new_mapped_ram_size;
>> }
>>
>> - if (real_end < end)
>> - init_range_memory_mapping(real_end, end);
>> + if (real_end < map_end)
>> + init_range_memory_mapping(real_end, map_end);
>> +}
>> +
>> +void __init init_mem_mapping(void)
>> +{
>> + unsigned long end;
>> +
>> + probe_page_size_mask();
>> +
>> +#ifdef CONFIG_X86_64
>> + end = max_pfn << PAGE_SHIFT;
>> +#else
>> + end = max_low_pfn << PAGE_SHIFT;
>> +#endif
>> +
>> + /* the ISA range is always mapped regardless of memory holes */
>> + init_memory_mapping(0, ISA_END_ADDRESS);
>> +
>> + /*
>> + * We start from the top (end of memory) and go to the bottom.
>> + * The memblock_find_in_range() gets us a block of RAM from the
>> + * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
>> + * for page table.
>> + */
>
> And just mention the range and direction in the comment here?

OK, agreed.

Thanks

>
>> + memory_map_top_down(ISA_END_ADDRESS, end);
>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-24 13:23:33

by Tejun Heo

Subject: Re: [PATCH 2/6] memblock: Introduce bottom-up allocation mode

On Tue, Sep 24, 2013 at 09:17:16PM +0800, Zhang Yanfei wrote:
> > Maybe just print warning only on the first failure?
>
> Hmmm... This message is for each memblock allocation, that said, if the
> allocation this time fails, it prints the message and we use so called top-down.
> But next time, we still use bottom up first again.
>
> Did you mean if we fail on one bottom-up allocation, then we never try
> bottom-up again and will always use top-down?

Nope, it's just that it might end up printing something for each alloc
which can end up flooding console / log. The first failure is the
most interesting and pretty much defeats the purpose of the whole
thing after all. If it's expected to fail very rarely, I'd just stick
in WARN_ONCE() there as the stack trace would be interesting too.

Thanks.

--
tejun

2013-09-24 13:24:17

by Zhang Yanfei

Subject: Re: [PATCH 4/6] x86/mem-hotplug: Support initialize page tables bottom up

On 09/24/2013 08:33 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 24, 2013 at 06:08:27PM +0800, Zhang Yanfei wrote:
>> +#ifdef CONFIG_MOVABLE_NODE
>
> You don't need the above ifdef. The compiler will be able to cull the
> code as long as memblock_bottom_up() is defined as constant
> expression when !MOVABLE_NODE.

yeah, will remove the #if. thanks.

>
>> +/**
>> + * memory_map_bottom_up - Map [map_start, map_end) bottom up
>> + * @map_start: start address of the target memory range
>> + * @map_end: end address of the target memory range
>> + *
>> + * This function will setup direct mapping for memory range [map_start, map_end)
>> + * in a heuristic way. In the beginning, step_size is small. The more memory we
>> + * map memory in the next loop.
>
> The same comment as before. Now we have two function with the
> identical comment but behaving differently, which isn't nice.

OK, will change them.

>
> ...
>> + * If the allocation is in bottom-up direction, we start from the
>> + * bottom and go to the top: first [kernel_end, end) and then
>> + * [ISA_END_ADDRESS, kernel_end). Otherwise, we start from the top
>> + * (end of memory) and go to the bottom.
>> + *
>> + * The memblock_find_in_range() gets us a block of RAM in
>> + * [min_pfn_mapped, max_pfn_mapped) used as new pages for page table.
>> */
>> - memory_map_top_down(ISA_END_ADDRESS, end);
>> + if (memblock_bottom_up()) {
>> + unsigned long kernel_end;
>> +
>> + kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
>> + memory_map_bottom_up(kernel_end, end);
>> + memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
>
> Hmm... so, this is kinda weird. We're doing it in two chunks and
> mapping memory between ISA_END_ADDRESS and kernel_end right on top of
> ISA_END_ADDRESS? Can't you give enough information to the mapping
> function so that it can map everything on top of kernel_end in single
> go?

You mean we should call memory_map_bottom_up like this:

memory_map_bottom_up(ISA_END_ADDRESS, end)

right?

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-24 13:24:29

by Zhang Yanfei

Subject: Re: [PATCH 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed.

On 09/24/2013 08:34 PM, Tejun Heo wrote:
> On Tue, Sep 24, 2013 at 06:09:49PM +0800, Zhang Yanfei wrote:
>> From: Tang Chen <[email protected]>
>>
>> Memory reserved for crashkernel could be large. So we should not allocate
>> this memory bottom up from the end of kernel image.
>>
>> When SRAT is parsed, we will be able to know which memory is hotpluggable,
>> and we can avoid allocating this memory for the kernel. So reorder
>> reserve_crashkernel() after SRAT is parsed.
>>
>> Signed-off-by: Tang Chen <[email protected]>
>> Reviewed-by: Zhang Yanfei <[email protected]>
>
> Assuming this was tested to work.
>
> Acked-by: Tejun Heo <[email protected]>

Thanks.

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-24 13:27:35

by Tejun Heo

Subject: Re: [PATCH 4/6] x86/mem-hotplug: Support initialize page tables bottom up

On Tue, Sep 24, 2013 at 09:23:48PM +0800, Zhang Yanfei wrote:
> > Hmm... so, this is kinda weird. We're doing it in two chunks and
> > mapping memory between ISA_END_ADDRESS and kernel_end right on top of
> > ISA_END_ADDRESS? Can't you give enough information to the mapping
> > function so that it can map everything on top of kernel_end in single
> > go?
>
> You mean we should call memory_map_bottom_up like this:
>
> memory_map_bottom_up(ISA_END_ADDRESS, end)
>
> right?

But that wouldn't be ideal as we want the page tables above kernel
image and the above would allocate it above ISA_END_ADDRESS, right?
Maybe memory_map_bottom_up() should take extra parameters for where to
allocate page tables at separately from the mapping range and treat it
specially? Would that make the function a lot more complex?

Thanks.

--
tejun

2013-09-24 13:32:21

by Zhang Yanfei

Subject: Re: [PATCH 6/6] mem-hotplug: Introduce movablenode boot option

On 09/24/2013 08:41 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 24, 2013 at 06:11:42PM +0800, Zhang Yanfei wrote:
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 36cfce3..2cf04fd 100644
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -1132,6 +1132,14 @@ void __init setup_arch(char **cmdline_p)
>> early_acpi_boot_init();
>>
>> initmem_init();
>> +
>> + /*
>> + * When ACPI SRAT is parsed, which is done in initmem_init(),
>> + * set memblock back to the top-down direction.
>> + */
>> + if (memblock_bottom_up())
>> + memblock_set_bottom_up(false);
>
> I don't think you need the if (). Just call
> memblock_set_bottom_up(false).

OK, will remove it.

>
>> +static int __init cmdline_parse_movablenode(char *p)
>> +{
>> + /*
>> + * Memory used by the kernel cannot be hot-removed because Linux
>> + * cannot migrate the kernel pages. When memory hotplug is
>> + * enabled, we should prevent memblock from allocating memory
>> + * for the kernel.
>> + *
>> + * ACPI SRAT records all hotpluggable memory ranges. But before
>> + * SRAT is parsed, we don't know about it.
>> + *
>> + * The kernel image is loaded into memory at very early time. We
>> + * cannot prevent this anyway. So on NUMA system, we set any
>> + * node the kernel resides in as un-hotpluggable.
>> + *
>> + * Since on modern servers, one node could have double-digit
>> + * gigabytes memory, we can assume the memory around the kernel
>> + * image is also un-hotpluggable. So before SRAT is parsed, just
>> + * allocate memory near the kernel image to try the best to keep
>> + * the kernel away from hotpluggable memory.
>> + */
>> + memblock_set_bottom_up(true);
>> + return 0;
>> +}
>> +early_param("movablenode", cmdline_parse_movablenode);
>
> This came up during earlier review but never was addressed. Is
> "movablenode" the right name? Shouldn't it be something which
> explicitly shows that it's to prepare for memory hotplug? Also, maybe
> the above param should generate warning if CONFIG_MOVABLE_NODE isn't
> enabled?

Hmmm... as for the option name: if this option is set, it means the kernel
can support the functionality that a whole node is a so-called
movable node, which has only a ZONE_MOVABLE zone in it. So we chose
to name the parameter "movablenode".

As for the warning, will add it.

Thanks

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-24 13:35:20

by Zhang Yanfei

Subject: Re: [PATCH 4/6] x86/mem-hotplug: Support initialize page tables bottom up

On 09/24/2013 09:27 PM, Tejun Heo wrote:
> On Tue, Sep 24, 2013 at 09:23:48PM +0800, Zhang Yanfei wrote:
>>> Hmm... so, this is kinda weird. We're doing it in two chunks and
>>> mapping memory between ISA_END_ADDRESS and kernel_end right on top of
>>> ISA_END_ADDRESS? Can't you give enough information to the mapping
>>> function so that it can map everything on top of kernel_end in single
>>> go?
>>
>> You mean we should call memory_map_bottom_up like this:
>>
>> memory_map_bottom_up(ISA_END_ADDRESS, end)
>>
>> right?
>
> But that wouldn't be ideal as we want the page tables above kernel
> image and the above would allocate it above ISA_END_ADDRESS, right?

The original idea is we will allocate everything above the kernel. So
the pagetables for [ISA_END_ADDRESS, kernel_end) will be also located
above the kernel.

> Maybe memory_map_bottom_up() should take extra parameters for where to
> allocate page tables at separately from the mapping range and treat it
> specially? Would that make the function a lot more complex?

Hmmmm...I will try to see if it is complex.

Thanks.

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-24 13:39:19

by Tejun Heo

Subject: Re: [PATCH 4/6] x86/mem-hotplug: Support initialize page tables bottom up

Hello,

On Tue, Sep 24, 2013 at 09:34:46PM +0800, Zhang Yanfei wrote:
> > But that wouldn't be ideal as we want the page tables above kernel
> > image and the above would allocate it above ISA_END_ADDRESS, right?
>
> The original idea is we will allocate everything above the kernel. So
> the pagetables for [ISA_END_ADDRESS, kernel_end) will be also located
> above the kernel.

I'm a bit confused why we need two separate calls then. What's the
difference from calling with the whole range?

Thanks.

--
tejun

2013-09-24 13:54:24

by Zhang Yanfei

Subject: Re: [PATCH 4/6] x86/mem-hotplug: Support initialize page tables bottom up

Hi tejun,

On 09/24/2013 09:39 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 24, 2013 at 09:34:46PM +0800, Zhang Yanfei wrote:
>>> But that wouldn't be ideal as we want the page tables above kernel
>>> image and the above would allocate it above ISA_END_ADDRESS, right?
>>
>> The original idea is we will allocate everything above the kernel. So
>> the pagetables for [ISA_END_ADDRESS, kernel_end) will be also located
>> above the kernel.
>
> I'm a bit confused why we need two separate calls then. What's the
> difference from calling with the whole range?

OK, this is just because we allocate pagetables just above the kernel.
And if we use up the BRK space that is reserved for initial pagetables,
we will use memblock to allocate memory for pagetables, and the memory
allocated here must already be mapped. So we first map [kernel_end, end)
to get memory above the kernel mapped as soon as possible, and then
use pagetables allocated above the kernel to map [ISA_END_ADDRESS, kernel_end).

Thanks.

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-24 14:05:52

by Tejun Heo

Subject: Re: [PATCH 4/6] x86/mem-hotplug: Support initialize page tables bottom up

On Tue, Sep 24, 2013 at 09:53:47PM +0800, Zhang Yanfei wrote:
> OK, this is just because we allocate pagetables just above the kernel.
> And if we use up the BRK space that is reserved for initial pagetables,
> we will use memblock to allocate memory for pagetables, and the memory
> allocated here should be mapped already. So we first map [kernel_end, end)
> to make memory above the kernel be mapped as soon as possible. And then
> use pagetables allocated above the kernel to map [ISA_END_ADDRESS, kernel_end).

I see. The code seems fine to me then. Can you please add comment
explaining why the split calls are necessary?

Thanks.

--
tejun

2013-09-24 14:07:58

by Zhang Yanfei

Subject: Re: [PATCH 4/6] x86/mem-hotplug: Support initialize page tables bottom up

On 09/24/2013 10:05 PM, Tejun Heo wrote:
> On Tue, Sep 24, 2013 at 09:53:47PM +0800, Zhang Yanfei wrote:
>> OK, this is just because we allocate pagetables just above the kernel.
>> And if we use up the BRK space that is reserved for initial pagetables,
>> we will use memblock to allocate memory for pagetables, and the memory
>> allocated here should be mapped already. So we first map [kernel_end, end)
>> to make memory above the kernel be mapped as soon as possible. And then
>> use pagetables allocated above the kernel to map [ISA_END_ADDRESS, kernel_end).
>
> I see. The code seems fine to me then. Can you please add comment
> explaining why the split calls are necessary?

Okay, will add the comment.

Thanks.

--
Thanks.
Zhang Yanfei

2013-09-24 14:12:47

by Zhang Yanfei

Subject: Re: [PATCH 2/6] memblock: Introduce bottom-up allocation mode

On 09/24/2013 09:23 PM, Tejun Heo wrote:
> On Tue, Sep 24, 2013 at 09:17:16PM +0800, Zhang Yanfei wrote:
>>> Maybe just print warning only on the first failure?
>>
>> Hmmm... This message is for each memblock allocation, that said, if the
>> allocation this time fails, it prints the message and we use so called top-down.
>> But next time, we still use bottom up first again.
>>
>> Did you mean if we fail on one bottom-up allocation, then we never try
>> bottom-up again and will always use top-down?
>
> Nope, it's just that it might end up printing something for each alloc
> which can end up flooding console / log. The first failure is the
> most interesting and pretty much defeats the purpose of the whole
> thing after all. If it's expected to fail very rarely, I'd just stick
> in WARN_ONCE() there as the stack trace would be interesting too.
>

I see. I think it rarely fails. But here is a case that must
fail in the current bottom-up implementation. For example, we allocate
memory in reserve_real_mode() by calling this:
memblock_find_in_range(0, 1<<20, size, PAGE_SIZE);

Both the start and end are below the kernel, so trying bottom-up for
this must fail. So I am now wondering whether we should treat this as
a special case for bottom-up. That is, if both start and end are limited
to below the kernel, we should allocate memory below the kernel instead
of failing. Such cases are also rare; in early boot time, there are only
these two:

|->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
|->reserve_real_mode() /* allocate memory under 1MB */

How do you think?

--
Thanks.
Zhang Yanfei

2013-09-24 14:16:47

by Tejun Heo

Subject: Re: [PATCH 2/6] memblock: Introduce bottom-up allocation mode

Hello,

On Tue, Sep 24, 2013 at 10:12:22PM +0800, Zhang Yanfei wrote:
> I see. I think failures will be rare. But here is a case where it must
> fail in the current bottom-up implementation. For example, we allocate
> memory in reserve_real_mode() by calling this:
> memblock_find_in_range(0, 1<<20, size, PAGE_SIZE);
>
> Both the start and end are below the kernel, so trying bottom-up for
> this must fail. So I am now wondering whether we should treat this as
> a special case for bottom-up. That is, if both start and end are
> limited to below the kernel, we should allocate memory below the kernel
> instead of making the allocation fail. Such cases are also rare; in
> early boot time, there are only these two:
>
> |->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
> |->reserve_real_mode() /* allocate memory under 1MB */
>
> What do you think?

They need to be special cased regardless, right? It's wrong to print
out warning messages for things which are expected to behave that way.
Just skip bottom-up allocs if @end is under kernel image?

Thanks.

--
tejun

2013-09-24 14:19:41

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH 2/6] memblock: Introduce bottom-up allocation mode

On 09/24/2013 10:16 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 24, 2013 at 10:12:22PM +0800, Zhang Yanfei wrote:
>> I see. I think failures will be rare. But here is a case where it must
>> fail in the current bottom-up implementation. For example, we allocate
>> memory in reserve_real_mode() by calling this:
>> memblock_find_in_range(0, 1<<20, size, PAGE_SIZE);
>>
>> Both the start and end are below the kernel, so trying bottom-up for
>> this must fail. So I am now wondering whether we should treat this as
>> a special case for bottom-up. That is, if both start and end are
>> limited to below the kernel, we should allocate memory below the kernel
>> instead of making the allocation fail. Such cases are also rare; in
>> early boot time, there are only these two:
>>
>> |->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
>> |->reserve_real_mode() /* allocate memory under 1MB */
>>
>> What do you think?
>
> They need to be special cased regardless, right? It's wrong to print
> out warning messages for things which are expected to behave that way.
> Just skip bottom-up allocs if @end is under kernel image?
>

Good idea. Will do this way.

--
Thanks.
Zhang Yanfei

2013-09-24 15:25:26

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH 6/6] mem-hotplug: Introduce movablenode boot option

Hello tejun,

On 09/24/2013 09:31 PM, Zhang Yanfei wrote:
>> This came up during earlier review but never was addressed. Is
>> > "movablenode" the right name? Shouldn't it be something which
>> > explicitly shows that it's to prepare for memory hotplug? Also, maybe
>> > the above param should generate warning if CONFIG_MOVABLE_NODE isn't
>> > enabled?
> hmmm...as for the option name, if this option is set, it means, the kernel
> could support the functionality that a whole node is the so called
> movable node, which only has ZONE MOVABLE zone in it. So we choose
> to name the parameter "movablenode".
>
> As for the warning, will add it.

I am now preparing the v5 version. This patch is the only one on which
we haven't reached agreement. As for the boot option name, after my
explanation, do you still object? Or could you suggest a better name
for us? That would be much appreciated :)

Thanks.

--
Thanks.
Zhang Yanfei

2013-09-24 15:33:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 6/6] mem-hotplug: Introduce movablenode boot option

Hello,

On Tue, Sep 24, 2013 at 11:24:48PM +0800, Zhang Yanfei wrote:
> I am now preparing the v5 version. This patch is the only one on which
> we haven't reached agreement. As for the boot option name, after my
> explanation, do you still object? Or could you suggest a better name
> for us? That would be much appreciated :)

No particular idea and my concern about the name isn't very strong.
It just doesn't seem like a good name to me. It's a bit obscure and
we may want to do more things for hotplug later which may not fit the
param name. If you guys think it's good enough, please go ahead.

Thanks.

--
tejun

2013-09-24 15:44:10

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH 6/6] mem-hotplug: Introduce movablenode boot option

On 09/24/2013 11:32 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 24, 2013 at 11:24:48PM +0800, Zhang Yanfei wrote:
>> I am now preparing the v5 version. This patch is the only one on which
>> we haven't reached agreement. As for the boot option name, after my
>> explanation, do you still object? Or could you suggest a better name
>> for us? That would be much appreciated :)
>
> No particular idea and my concern about the name isn't very strong.
> It just doesn't seem like a good name to me. It's a bit obscure and
> we may want to do more things for hotplug later which may not fit the
> param name. If you guys think it's good enough, please go ahead.
>

Thanks very much!

I have addressed your comments and finished the v5 version. Please help
check the patches again after I send them to the community. Thanks.

--
Thanks.
Zhang Yanfei

2013-09-24 16:02:16

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 6/6] mem-hotplug: Introduce movablenode boot option

On Tue, 2013-09-24 at 23:24 +0800, Zhang Yanfei wrote:
> Hello tejun,
>
> On 09/24/2013 09:31 PM, Zhang Yanfei wrote:
> >> This came up during earlier review but never was addressed. Is
> >> > "movablenode" the right name? Shouldn't it be something which
> >> > explicitly shows that it's to prepare for memory hotplug? Also, maybe
> >> > the above param should generate warning if CONFIG_MOVABLE_NODE isn't
> >> > enabled?
> > hmmm...as for the option name, if this option is set, it means, the kernel
> > could support the functionality that a whole node is the so called
> > movable node, which only has ZONE MOVABLE zone in it. So we choose
> > to name the parameter "movablenode".
> >
> > As for the warning, will add it.
>
> I am now preparing the v5 version. This patch is the only one on which
> we haven't reached agreement. As for the boot option name, after my
> explanation, do you still object? Or could you suggest a better name
> for us? That would be much appreciated :)

I do not think the granularity has to stay at a node, and this option
has nothing to do with other devices that may be included in a node.
So, how about using "movablemem"?

Thanks,
-Toshi

2013-09-24 16:09:01

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH 6/6] mem-hotplug: Introduce movablenode boot option

Hello Toshi-san,

On 09/25/2013 12:00 AM, Toshi Kani wrote:
> On Tue, 2013-09-24 at 23:24 +0800, Zhang Yanfei wrote:
>> Hello tejun,
>>
>> On 09/24/2013 09:31 PM, Zhang Yanfei wrote:
>>>> This came up during earlier review but never was addressed. Is
>>>>> "movablenode" the right name? Shouldn't it be something which
>>>>> explicitly shows that it's to prepare for memory hotplug? Also, maybe
>>>>> the above param should generate warning if CONFIG_MOVABLE_NODE isn't
>>>>> enabled?
>>> hmmm...as for the option name, if this option is set, it means, the kernel
>>> could support the functionality that a whole node is the so called
>>> movable node, which only has ZONE MOVABLE zone in it. So we choose
>>> to name the parameter "movablenode".
>>>
>>> As for the warning, will add it.
>>
>> I am now preparing the v5 version. This patch is the only one on which
>> we haven't reached agreement. As for the boot option name, after my
>> explanation, do you still object? Or could you suggest a better name
>> for us? That would be much appreciated :)
>
> I do not think the granularity has to stay at a node, and this option
> has nothing to do with other devices that may be included in a node.
> So, how about using "movablemem"?
>

As I explained before, we use movablenode to mean a node that, from the
memory aspect, can only have a ZONE_MOVABLE zone. So I still think
movablenode fits better than movablemem; movablemem seems vaguer here....

--
Thanks.
Zhang Yanfei

2013-09-24 16:36:26

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 6/6] mem-hotplug: Introduce movablenode boot option

On Wed, 2013-09-25 at 00:08 +0800, Zhang Yanfei wrote:
> Hello toshi-san
>
> On 09/25/2013 12:00 AM, Toshi Kani wrote:
> > On Tue, 2013-09-24 at 23:24 +0800, Zhang Yanfei wrote:
> >> Hello tejun,
> >>
> >> On 09/24/2013 09:31 PM, Zhang Yanfei wrote:
> >>>> This came up during earlier review but never was addressed. Is
> >>>>> "movablenode" the right name? Shouldn't it be something which
> >>>>> explicitly shows that it's to prepare for memory hotplug? Also, maybe
> >>>>> the above param should generate warning if CONFIG_MOVABLE_NODE isn't
> >>>>> enabled?
> >>> hmmm...as for the option name, if this option is set, it means, the kernel
> >>> could support the functionality that a whole node is the so called
> >>> movable node, which only has ZONE MOVABLE zone in it. So we choose
> >>> to name the parameter "movablenode".
> >>>
> >>> As for the warning, will add it.
> >>
> >> I am now preparing the v5 version. This patch is the only one on which
> >> we haven't reached agreement. As for the boot option name, after my
> >> explanation, do you still object? Or could you suggest a better name
> >> for us? That would be much appreciated :)
> >
> > I do not think the granularity has to stay at a node, and this option
> > has nothing to do with other devices that may be included in a node.
> > So, how about using "movablemem"?
> >
>
> As I explained before, we use movablenode to mean a node that, from the
> memory aspect, can only have a ZONE_MOVABLE zone. So I still think
> movablenode fits better than movablemem; movablemem seems vaguer here....

But a node may contain other devices, such as CPUs and PCI bridges. To
me, movablenode does not make it clear that this option concerns only
the memory aspect...

Thanks,
-Toshi