2013-09-13 09:28:46

by Tang Chen

Subject: [PATCH v3 0/5] x86, memblock: Allocate memory near kernel image before SRAT parsed.

This patch-set is based on tj's suggestion and is not fully tested.
It is just for review and discussion.

This patch-set is based on the latest kernel (3.11)
HEAD is:
commit d5d04bb48f0eb89c14e76779bb46212494de0bec
Author: Linus Torvalds <[email protected]>
Date: Wed Sep 11 19:55:12 2013 -0700


[Problem]

The current Linux kernel cannot migrate pages used by the kernel because
of the kernel direct mapping: in Linux kernel space, va = pa + PAGE_OFFSET.
When the pa is changed, we cannot simply update the pagetable and
keep the va unmodified. So the kernel pages are not migratable.

There are also other issues that make kernel pages non-migratable.
For example, a physical address may be cached somewhere and used later.
It is not easy to update all such caches.

When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism is too difficult, and it could
degrade the kernel's performance and stability. So we use the following way
to do memory hotplug.


[What we are doing]

In Linux, memory in one numa node is divided into several zones. One of the
zones is ZONE_MOVABLE, which the kernel won't use.

In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.
To do this, we need ACPI's help.

In ACPI, the SRAT (System Resource Affinity Table) contains the NUMA info. The
memory affinities in SRAT record every memory range in the system, along with
a flag specifying whether the memory range is hotpluggable.
(Please refer to ACPI spec 5.0, section 5.2.16.)

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow users to arrange hotpluggable memory as
ZONE_MOVABLE.
(This has been done by the MOVABLE_NODE functionality in Linux.)

2. When the system is booting, prevent the bootmem allocator from allocating
hotpluggable memory for the kernel before the memory initialization
finishes.

Problem 2 is the key problem we are going to solve. But before solving it,
we need some preparation. Please see below.


[Preparation]

The bootloader has to load the kernel image into memory, and this memory must
not be hotpluggable; we cannot prevent that anyway. So in a memory hotplug
system, we can assume any node the kernel resides in is not hotpluggable.

Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But
memblock has already started to work. In the current kernel, memblock allocates
the following memory before SRAT is parsed:

setup_arch()
|->memblock_x86_fill() /* memblock is ready */
|......
|->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
|->reserve_real_mode() /* allocate memory under 1MB */
|->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
|->dma_contiguous_reserve() /* specified by user, should be low */
|->setup_log_buf() /* specified by user, several mega bytes */
|->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
|->acpi_initrd_override() /* several mega bytes */
|->reserve_crashkernel() /* could be large, should reorder */
|......
|->initmem_init() /* Parse SRAT */

According to Tejun's advice, before SRAT is parsed, we should try our best to
allocate memory near the kernel image. Since the whole node the kernel resides
in won't be hotpluggable, and a node on a modern server may have at least 16GB
of memory, allocating several megabytes around the kernel image won't cross
into hotpluggable memory.


[About this patch-set]

So this patch-set does the following:

1. Make memblock able to allocate memory from low address to high address.
1) Keep all the memblock APIs' prototype unmodified.
2) When the direction is bottom up, keep the start address greater than the
end of kernel image.

2. Improve init_mem_mapping() to support allocating page tables in the bottom-up direction.

3. Introduce "movablenode" boot option to enable and disable this functionality.

PS: The reordering of relocate_initrd() has not been done yet. acpi_initrd_override()
needs to access the initrd via virtual addresses, so relocate_initrd() must be
done before acpi_initrd_override().


Change log v2 -> v3:
1. According to Toshi's suggestion, moved the direction checking logic into
memblock, and simplified the code further.

Change log v1 -> v2:
1. According to tj's suggestion, implemented a new function memblock_alloc_bottom_up()
to allocate memory from the bottom upwards, which simplifies the code.


Tang Chen (5):
memblock: Introduce allocation direction to memblock.
memblock: Improve memblock to support allocation from lower address.
x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is
parsed.
x86, mem-hotplug: Support initialize page tables from low to high.
mem-hotplug: Introduce movablenode boot option to control memblock
allocation direction.

Documentation/kernel-parameters.txt | 15 ++++
arch/x86/kernel/setup.c | 44 ++++++++++++-
arch/x86/mm/init.c | 121 ++++++++++++++++++++++++++--------
include/linux/memblock.h | 22 ++++++
include/linux/memory_hotplug.h | 5 ++
mm/memblock.c | 120 +++++++++++++++++++++++++++++++----
mm/memory_hotplug.c | 9 +++
7 files changed, 293 insertions(+), 43 deletions(-)


2013-09-13 09:28:52

by Tang Chen

Subject: [PATCH v3 5/5] mem-hotplug: Introduce movablenode boot option to control memblock allocation direction.

The Hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE. As a result, the kernel
cannot use memory in movable nodes, which will degrade NUMA performance.
And other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
This patch introduces the movablenode boot option to allow users to choose
whether to reserve hotpluggable memory and set it as ZONE_MOVABLE.

Users can specify "movablenode" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.

After memblock is ready but before SRAT is parsed, we should allocate memory
near the kernel image. So this patch does the following:

1. After memblock is ready, make memblock allocate memory from low address
to high.
2. After SRAT is parsed, restore memblock's default behavior: allocate memory
from high address to low.

This behavior is controlled by movablenode boot option.

Suggested-by: Kamezawa Hiroyuki <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
Documentation/kernel-parameters.txt | 15 ++++++++++++++
arch/x86/kernel/setup.c | 36 +++++++++++++++++++++++++++++++++++
include/linux/memory_hotplug.h | 5 ++++
mm/memory_hotplug.c | 9 ++++++++
4 files changed, 65 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1a036cd..8c056c4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.

+ movablenode [KNL,X86] This parameter enables/disables the
+ kernel to arrange hotpluggable memory ranges recorded
+ in ACPI SRAT(System Resource Affinity Table) as
+ ZONE_MOVABLE. Such memory can then be hot-removed when
+ the system is up.
+ By specifying this option, all the hotpluggable memory
+ will be in ZONE_MOVABLE, which the kernel cannot use.
+ This will degrade NUMA performance. Users who care
+ about NUMA performance should simply not use it.
+ If all the memory ranges in the system are hotpluggable,
+ then the ones used by the kernel at early time, such as
+ kernel code and data segments, initrd file and so on,
+ won't be set as ZONE_MOVABLE, and won't be hotpluggable.
+ Otherwise the kernel won't have enough memory to boot.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 36cfce3..617af9a 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1094,6 +1094,31 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();

+#ifdef CONFIG_MOVABLE_NODE
+ if (movablenode_enable_srat) {
+ /*
+ * Memory used by the kernel cannot be hot-removed because Linux
+ * cannot migrate the kernel pages. When memory hotplug is
+ * enabled, we should prevent memblock from allocating memory
+ * for the kernel.
+ *
+ * ACPI SRAT records all hotpluggable memory ranges. But before
+ * SRAT is parsed, we don't know about it.
+ *
+ * The kernel image is loaded into memory at very early time. We
+ * cannot prevent this anyway. So on NUMA system, we set any
+ * node the kernel resides in as un-hotpluggable.
+ *
+ * Since on modern servers, one node could have double-digit
+ * gigabytes of memory, we can assume the memory around the kernel
+ * image is also un-hotpluggable. So before SRAT is parsed, just
+ * allocate memory near the kernel image to try our best to keep
+ * the kernel away from hotpluggable memory.
+ */
+ memblock_set_current_direction(MEMBLOCK_DIRECTION_LOW_TO_HIGH);
+ }
+#endif /* CONFIG_MOVABLE_NODE */
+
init_mem_mapping();

early_trap_pf_init();
@@ -1132,6 +1157,17 @@ void __init setup_arch(char **cmdline_p)
early_acpi_boot_init();

initmem_init();
+
+#ifdef CONFIG_MOVABLE_NODE
+ if (movablenode_enable_srat) {
+ /*
+ * When ACPI SRAT is parsed, which is done in initmem_init(),
+ * set memblock back to the default behavior.
+ */
+ memblock_set_current_direction(MEMBLOCK_DIRECTION_DEFAULT);
+ }
+#endif /* CONFIG_MOVABLE_NODE */
+
memblock_find_dma_reserve();

/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index dd38e62..5d2c07b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,11 @@ enum {
ONLINE_MOVABLE,
};

+#ifdef CONFIG_MOVABLE_NODE
+/* Enable/disable SRAT in movablenode boot option */
+extern bool movablenode_enable_srat;
+#endif /* CONFIG_MOVABLE_NODE */
+
/*
* pgdat resizing functions
*/
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0eb1a1d..8a4c8ff 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1390,6 +1390,15 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
{
return true;
}
+
+bool __initdata movablenode_enable_srat;
+
+static int __init cmdline_parse_movablenode(char *p)
+{
+ movablenode_enable_srat = true;
+ return 0;
+}
+early_param("movablenode", cmdline_parse_movablenode);
#else /* CONFIG_MOVABLE_NODE */
/* ensure the node has NORMAL memory if it is still online */
static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
--
1.7.1

2013-09-13 09:29:09

by Tang Chen

Subject: [PATCH v3 4/5] x86, mem-hotplug: Support initialize page tables from low to high.

init_mem_mapping() is called before SRAT is parsed, and memblock will allocate
memory for page tables. To prevent page tables from being allocated within
hotpluggable memory, we allocate them from the end of the kernel image upwards
to higher memory.

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/mm/init.c | 121 +++++++++++++++++++++++++++++++++++++++------------
1 files changed, 92 insertions(+), 29 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 04664cd..bf7b732 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -401,13 +401,79 @@ static unsigned long __init init_range_memory_mapping(

/* (PUD_SHIFT-PMD_SHIFT)/2 */
#define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+
+#ifdef CONFIG_MOVABLE_NODE
+/**
+ * memory_map_from_low - Map [start, end) from low to high
+ * @start: start address of the target memory range
+ * @end: end address of the target memory range
+ *
+ * This function will set up the direct mapping for the memory range
+ * [start, end) in a heuristic way: step_size starts small, and whenever
+ * the newly mapped chunk is larger than all memory mapped so far,
+ * step_size is increased so that more memory is mapped in the next loop.
+ */
+static void __init memory_map_from_low(unsigned long start, unsigned long end)
+{
+ unsigned long next, new_mapped_ram_size;
+ unsigned long mapped_ram_size = 0;
+ /* step_size need to be small so pgt_buf from BRK could cover it */
+ unsigned long step_size = PMD_SIZE;
+
+ while (start < end) {
+ if (end - start > step_size) {
+ next = round_up(start + 1, step_size);
+ if (next > end)
+ next = end;
+ } else
+ next = end;
+
+ new_mapped_ram_size = init_range_memory_mapping(start, next);
+ min_pfn_mapped = start >> PAGE_SHIFT;
+ start = next;
+
+ if (new_mapped_ram_size > mapped_ram_size)
+ step_size <<= STEP_SIZE_SHIFT;
+ mapped_ram_size += new_mapped_ram_size;
+ }
+}
+#endif /* CONFIG_MOVABLE_NODE */
+
+/**
+ * memory_map_from_high - Map [start, end) from high to low
+ * @start: start address of the target memory range
+ * @end: end address of the target memory range
+ *
+ * This function is similar to memory_map_from_low() except it maps memory
+ * from high to low.
+ */
+static void __init memory_map_from_high(unsigned long start, unsigned long end)
{
- unsigned long end, real_end, start, last_start;
- unsigned long step_size;
- unsigned long addr;
+ unsigned long prev, new_mapped_ram_size;
unsigned long mapped_ram_size = 0;
- unsigned long new_mapped_ram_size;
+ /* step_size need to be small so pgt_buf from BRK could cover it */
+ unsigned long step_size = PMD_SIZE;
+
+ while (start < end) {
+ if (end > step_size) {
+ prev = round_down(end - 1, step_size);
+ if (prev < start)
+ prev = start;
+ } else
+ prev = start;
+
+ new_mapped_ram_size = init_range_memory_mapping(prev, end);
+ min_pfn_mapped = prev >> PAGE_SHIFT;
+ end = prev;
+
+ if (new_mapped_ram_size > mapped_ram_size)
+ step_size <<= STEP_SIZE_SHIFT;
+ mapped_ram_size += new_mapped_ram_size;
+ }
+}
+
+void __init init_mem_mapping(void)
+{
+ unsigned long end;

probe_page_size_mask();

@@ -417,45 +483,42 @@ void __init init_mem_mapping(void)
end = max_low_pfn << PAGE_SHIFT;
#endif

- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
+ max_pfn_mapped = 0; /* will get exact value next */
+ min_pfn_mapped = end >> PAGE_SHIFT;
+
+#ifdef CONFIG_MOVABLE_NODE
+ unsigned long kernel_end;
+
+ if (memblock_direction_bottom_up()) {
+ kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
+
+ memory_map_from_low(kernel_end, end);
+ memory_map_from_low(ISA_END_ADDRESS, kernel_end);
+ goto out;
+ }
+#endif /* CONFIG_MOVABLE_NODE */
+
+ unsigned long addr, real_end;

/* xen has big range in reserved near end of ram, skip it at first.*/
addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;

- /* step_size need to be small so pgt_buf from BRK could cover it */
- step_size = PMD_SIZE;
- max_pfn_mapped = 0; /* will get exact value next */
- min_pfn_mapped = real_end >> PAGE_SHIFT;
- last_start = start = real_end;
-
/*
* We start from the top (end of memory) and go to the bottom.
* The memblock_find_in_range() gets us a block of RAM from the
* end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
* for page table.
*/
- while (last_start > ISA_END_ADDRESS) {
- if (last_start > step_size) {
- start = round_down(last_start - 1, step_size);
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
- } else
- start = ISA_END_ADDRESS;
- new_mapped_ram_size = init_range_memory_mapping(start,
- last_start);
- last_start = start;
- min_pfn_mapped = last_start >> PAGE_SHIFT;
- /* only increase step_size after big range get mapped */
- if (new_mapped_ram_size > mapped_ram_size)
- step_size <<= STEP_SIZE_SHIFT;
- mapped_ram_size += new_mapped_ram_size;
- }
+ memory_map_from_high(ISA_END_ADDRESS, real_end);

if (real_end < end)
init_range_memory_mapping(real_end, end);

+out:
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+
#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
/* can we preseve max_low_pfn ?*/
--
1.7.1

2013-09-13 09:29:40

by Tang Chen

Subject: [PATCH v3 2/5] memblock: Improve memblock to support allocation from lower address.

This patch modifies the memblock_find_in_range_node() to support two
different allocation directions. After this patch, memblock will check
memblock.current_direction, and decide in which direction to allocate
memory.

Now it supports two allocation directions: bottom-up and top-down.
When the direction is top-down, it acts as before.
When the direction is bottom-up, the start address should be greater than
the end of the kernel image; otherwise, it will be trimmed to the kernel
image's end address.

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
mm/memblock.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 95 insertions(+), 12 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index f24ca2e..87a7f04 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -20,6 +20,8 @@
#include <linux/seq_file.h>
#include <linux/memblock.h>

+#include <asm-generic/sections.h>
+
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;

@@ -84,8 +86,81 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
}

/**
+ * __memblock_find_range - find free area utility
+ * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(), find free area from
+ * lower address to higher address.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+static phys_addr_t __init_memblock
+__memblock_find_range(phys_addr_t start, phys_addr_t end,
+ phys_addr_t size, phys_addr_t align, int nid)
+{
+ phys_addr_t this_start, this_end, cand;
+ u64 i;
+
+ for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
+ this_start = clamp(this_start, start, end);
+ this_end = clamp(this_end, start, end);
+
+ cand = round_up(this_start, align);
+ if (cand < this_end && this_end - cand >= size)
+ return cand;
+ }
+
+ return 0;
+}
+
+/**
+ * __memblock_find_range_rev - find free area utility, in reverse order
+ * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(), find free area from
+ * higher address to lower address.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+static phys_addr_t __init_memblock
+__memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
+ phys_addr_t size, phys_addr_t align, int nid)
+{
+ phys_addr_t this_start, this_end, cand;
+ u64 i;
+
+ for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
+ this_start = clamp(this_start, start, end);
+ this_end = clamp(this_end, start, end);
+
+ /*
+ * Just in case that (this_end - size) underflows and cause
+ * (cand >= this_start) to be true incorrectly.
+ */
+ if (this_end < size)
+ break;
+
+ cand = round_down(this_end - size, align);
+ if (cand >= this_start)
+ return cand;
+ }
+
+ return 0;
+}
+
+/**
* memblock_find_in_range_node - find free area in given range and node
- * @start: start of candidate range
+ * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
* @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
* @size: size of free area to find
* @align: alignment of free area to find
@@ -93,6 +168,11 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
*
* Find @size free area aligned to @align in the specified range and node.
*
+ * When the allocation direction is from low to high, @start should be greater
+ * than the end of the kernel image; otherwise, it will be trimmed to it. Also,
+ * if allocation from low to high fails, it will fall back to allocating memory
+ * from high to low.
+ *
* RETURNS:
* Found address on success, %0 on failure.
*/
@@ -100,8 +180,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
phys_addr_t end, phys_addr_t size,
phys_addr_t align, int nid)
{
- phys_addr_t this_start, this_end, cand;
- u64 i;
+ phys_addr_t ret;

/* pump up @end */
if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
@@ -111,18 +190,22 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
start = max_t(phys_addr_t, start, PAGE_SIZE);
end = max(start, end);

- for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
- this_start = clamp(this_start, start, end);
- this_end = clamp(this_end, start, end);
+ if (memblock_direction_bottom_up()) {
+ /*
+ * MEMBLOCK_ALLOC_ACCESSIBLE is 0, which is less than the end
+ * of kernel image. So callers specify MEMBLOCK_ALLOC_ACCESSIBLE
+ * as @start is OK.
+ */
+ start = max(start, __pa_symbol(_end)); /* End of kernel image. */

- if (this_end < size)
- continue;
+ ret = __memblock_find_range(start, end, size, align, nid);
+ if (ret)
+ return ret;

- cand = round_down(this_end - size, align);
- if (cand >= this_start)
- return cand;
+ pr_warn("memblock: Failed to allocate memory in bottom up direction. Now try top down direction.\n");
}
- return 0;
+
+ return __memblock_find_range_rev(start, end, size, align, nid);
}

/**
--
1.7.1

2013-09-13 09:29:36

by Tang Chen

Subject: [PATCH v3 1/5] memblock: Introduce allocation direction to memblock.

The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
pages cannot be hot-removed. So we cannot allocate hotpluggable memory for
the kernel.

ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
But before SRAT is parsed, memblock has already started to allocate memory
for the kernel. So we need to prevent memblock from doing this.

In a memory hotplug system, any NUMA node the kernel resides in should be
unhotpluggable. And on a modern server, each node could have at least 16GB
of memory. So memory around the kernel image is very likely not hotpluggable.

So the basic idea is: allocate memory from the end of the kernel image
upwards into higher memory. Since memory allocated before SRAT is parsed
won't amount to much, it is highly likely to be in the same node as the
kernel image.

The current memblock can only allocate memory from high address to low.
So this patch introduces an allocation direction to memblock. It can be
used to tell memblock to allocate memory from high to low or from low
to high.

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 22 ++++++++++++++++++++++
mm/memblock.c | 13 +++++++++++++
2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 31e95ac..a7d3436 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,11 @@

#define INIT_MEMBLOCK_REGIONS 128

+/* Allocation order. */
+#define MEMBLOCK_DIRECTION_HIGH_TO_LOW 0
+#define MEMBLOCK_DIRECTION_LOW_TO_HIGH 1
+#define MEMBLOCK_DIRECTION_DEFAULT MEMBLOCK_DIRECTION_HIGH_TO_LOW
+
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
@@ -35,6 +40,7 @@ struct memblock_type {
};

struct memblock {
+ int current_direction; /* allocate from higher or lower address */
phys_addr_t current_limit;
struct memblock_type memory;
struct memblock_type reserved;
@@ -148,6 +154,12 @@ phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid)

phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);

+static inline bool memblock_direction_bottom_up(void)
+{
+ return memblock.current_direction == MEMBLOCK_DIRECTION_LOW_TO_HIGH;
+}
+
+
/* Flags for memblock_alloc_base() amd __memblock_alloc_base() */
#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
#define MEMBLOCK_ALLOC_ACCESSIBLE 0
@@ -175,6 +187,16 @@ static inline void memblock_dump_all(void)
}

/**
+ * memblock_set_current_direction - Set current allocation direction to allow
+ * allocating memory from higher to lower
+ * address or from lower to higher address
+ *
+ * @direction: In which order to allocate memory. Could be
+ * MEMBLOCK_DIRECTION_{HIGH_TO_LOW|LOW_TO_HIGH}
+ */
+void memblock_set_current_direction(int direction);
+
+/**
* memblock_set_current_limit - Set the current allocation limit to allow
* limiting allocations to what is currently
* accessible during boot
diff --git a/mm/memblock.c b/mm/memblock.c
index 0ac412a..f24ca2e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -32,6 +32,7 @@ struct memblock memblock __initdata_memblock = {
.reserved.cnt = 1, /* empty dummy entry */
.reserved.max = INIT_MEMBLOCK_REGIONS,

+ .current_direction = MEMBLOCK_DIRECTION_DEFAULT,
.current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};

@@ -995,6 +996,18 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
}
}

+void __init_memblock memblock_set_current_direction(int direction)
+{
+ if (direction != MEMBLOCK_DIRECTION_HIGH_TO_LOW &&
+ direction != MEMBLOCK_DIRECTION_LOW_TO_HIGH) {
+ pr_warn("memblock: Failed to set allocation order. "
+ "Invalid order type: %d\n", direction);
+ return;
+ }
+
+ memblock.current_direction = direction;
+}
+
void __init_memblock memblock_set_current_limit(phys_addr_t limit)
{
memblock.current_limit = limit;
--
1.7.1

2013-09-13 09:29:34

by Tang Chen

Subject: [PATCH v3 3/5] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed.

Memory reserved for the crash kernel could be large, so we should not
allocate this memory bottom-up from the end of the kernel image.

When SRAT is parsed, we will be able to know which memory is hotpluggable,
and we can avoid allocating this memory for the kernel. So reorder
reserve_crashkernel() to after SRAT is parsed.

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/kernel/setup.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..36cfce3 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,8 +1120,6 @@ void __init setup_arch(char **cmdline_p)
acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
#endif

- reserve_crashkernel();
-
vsmp_init();

io_delay_init();
@@ -1136,6 +1134,12 @@ void __init setup_arch(char **cmdline_p)
initmem_init();
memblock_find_dma_reserve();

+ /*
+ * Reserve memory for crash kernel after SRAT is parsed so that it
+ * won't consume hotpluggable memory.
+ */
+ reserve_crashkernel();
+
#ifdef CONFIG_KVM_GUEST
kvmclock_init();
#endif
--
1.7.1

2013-09-13 21:55:20

by Toshi Kani

Subject: Re: [PATCH v3 2/5] memblock: Improve memblock to support allocation from lower address.

On Fri, 2013-09-13 at 17:30 +0800, Tang Chen wrote:
:
> @@ -100,8 +180,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> phys_addr_t end, phys_addr_t size,
> phys_addr_t align, int nid)
> {
> - phys_addr_t this_start, this_end, cand;
> - u64 i;
> + phys_addr_t ret;
>
> /* pump up @end */
> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> @@ -111,18 +190,22 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> start = max_t(phys_addr_t, start, PAGE_SIZE);
> end = max(start, end);
>
> - for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
> - this_start = clamp(this_start, start, end);
> - this_end = clamp(this_end, start, end);
> + if (memblock_direction_bottom_up()) {
> + /*
> + * MEMBLOCK_ALLOC_ACCESSIBLE is 0, which is less than the end
> + * of kernel image. So callers specify MEMBLOCK_ALLOC_ACCESSIBLE
> + * as @start is OK.
> + */
> + start = max(start, __pa_symbol(_end)); /* End of kernel image. */
>
> - if (this_end < size)
> - continue;
> + ret = __memblock_find_range(start, end, size, align, nid);
> + if (ret)
> + return ret;
>
> - cand = round_down(this_end - size, align);
> - if (cand >= this_start)
> - return cand;
> + pr_warn("memblock: Failed to allocate memory in bottom up direction. Now try top down direction.\n");

Is there any chance that this retry will succeed given that start and
end are still the same?

Thanks,
-Toshi


> }
> - return 0;
> +
> + return __memblock_find_range_rev(start, end, size, align, nid);
> }
>
> /**

2013-09-14 02:43:53

by Jianguo Wu

Subject: Re: [PATCH v3 1/5] memblock: Introduce allocation direction to memblock.

Hi Tang,

On 2013/9/13 17:30, Tang Chen wrote:

> The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
> pages cannot be hot-removed. So we cannot allocate hotpluggable memory for
> the kernel.
>
> ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
> But before SRAT is parsed, memblock has already started to allocate memory
> for the kernel. So we need to prevent memblock from doing this.
>
> In a memory hotplug system, any numa node the kernel resides in should
> be unhotpluggable. And for a modern server, each node could have at least
> 16GB memory. So memory around the kernel image is highly likely unhotpluggable.
>
> So the basic idea is: Allocate memory from the end of the kernel image and
> to the higher memory. Since memory allocation before SRAT is parsed won't
> be too much, it could highly likely be in the same node with kernel image.
>
> The current memblock can only allocate memory from high address to low.
> So this patch introduces the allocation direct to memblock. It could be
> used to tell memblock to allocate memory from high to low or from low
> to high.
>
> Signed-off-by: Tang Chen <[email protected]>
> Reviewed-by: Zhang Yanfei <[email protected]>
> ---
> include/linux/memblock.h | 22 ++++++++++++++++++++++
> mm/memblock.c | 13 +++++++++++++
> 2 files changed, 35 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 31e95ac..a7d3436 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -19,6 +19,11 @@
>
> #define INIT_MEMBLOCK_REGIONS 128
>
> +/* Allocation order. */

s/order/direction/

> +#define MEMBLOCK_DIRECTION_HIGH_TO_LOW 0
> +#define MEMBLOCK_DIRECTION_LOW_TO_HIGH 1
> +#define MEMBLOCK_DIRECTION_DEFAULT MEMBLOCK_DIRECTION_HIGH_TO_LOW
> +
> struct memblock_region {
> phys_addr_t base;
> phys_addr_t size;
> @@ -35,6 +40,7 @@ struct memblock_type {
> };
>
> struct memblock {
> + int current_direction; /* allocate from higher or lower address */
> phys_addr_t current_limit;
> struct memblock_type memory;
> struct memblock_type reserved;
> @@ -148,6 +154,12 @@ phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid)
>
> phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);
>
> +static inline bool memblock_direction_bottom_up(void)
> +{
> + return memblock.current_direction == MEMBLOCK_DIRECTION_LOW_TO_HIGH;
> +}
> +
> +
> /* Flags for memblock_alloc_base() amd __memblock_alloc_base() */
> #define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
> #define MEMBLOCK_ALLOC_ACCESSIBLE 0
> @@ -175,6 +187,16 @@ static inline void memblock_dump_all(void)
> }
>
> /**
> + * memblock_set_current_direction - Set current allocation direction to allow
> + * allocating memory from higher to lower
> + * address or from lower to higher address
> + *
> + * @direction: In which order to allocate memory. Could be

s/order/direction/

> + * MEMBLOCK_DIRECTION_{HIGH_TO_LOW|LOW_TO_HIGH}
> + */
> +void memblock_set_current_direction(int direction);
> +
> +/**
> * memblock_set_current_limit - Set the current allocation limit to allow
> * limiting allocations to what is currently
> * accessible during boot
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 0ac412a..f24ca2e 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -32,6 +32,7 @@ struct memblock memblock __initdata_memblock = {
> .reserved.cnt = 1, /* empty dummy entry */
> .reserved.max = INIT_MEMBLOCK_REGIONS,
>
> + .current_direction = MEMBLOCK_DIRECTION_DEFAULT,
> .current_limit = MEMBLOCK_ALLOC_ANYWHERE,
> };
>
> @@ -995,6 +996,18 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
> }
> }
>
> +void __init_memblock memblock_set_current_direction(int direction)
> +{
> + if (direction != MEMBLOCK_DIRECTION_HIGH_TO_LOW &&
> + direction != MEMBLOCK_DIRECTION_LOW_TO_HIGH) {
> + pr_warn("memblock: Failed to set allocation order. "
> + "Invalid order type: %d\n", direction);

s/order/direction/

> + return;
> + }
> +
> + memblock.current_direction = direction;
> +}
> +
> void __init_memblock memblock_set_current_limit(phys_addr_t limit)
> {
> memblock.current_limit = limit;
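[Editor's note] For readers following along outside the kernel tree, the state machine this patch adds can be modeled in a few lines of plain, user-space C. This is a mock for illustration only, not the kernel code itself; the global here stands in for the `memblock` structure:

```c
#include <assert.h>
#include <stdio.h>

/* Mirror of the patch's direction constants. */
#define MEMBLOCK_DIRECTION_HIGH_TO_LOW 0
#define MEMBLOCK_DIRECTION_LOW_TO_HIGH 1
#define MEMBLOCK_DIRECTION_DEFAULT     MEMBLOCK_DIRECTION_HIGH_TO_LOW

/* Stand-in for memblock.current_direction. */
static int current_direction = MEMBLOCK_DIRECTION_DEFAULT;

/* Reject anything but the two known directions, as the patch does. */
static void memblock_set_current_direction(int direction)
{
	if (direction != MEMBLOCK_DIRECTION_HIGH_TO_LOW &&
	    direction != MEMBLOCK_DIRECTION_LOW_TO_HIGH) {
		fprintf(stderr, "invalid direction: %d\n", direction);
		return;
	}

	current_direction = direction;
}

static int memblock_direction_bottom_up(void)
{
	return current_direction == MEMBLOCK_DIRECTION_LOW_TO_HIGH;
}
```

Note how an invalid value leaves the previous direction in place rather than resetting it, which is the behavior the patch's early `return` produces.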


2013-09-16 01:28:45

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] memblock: Improve memblock to support allocation from lower address.

Hello toshi-san,

On 09/14/2013 05:53 AM, Toshi Kani wrote:
> On Fri, 2013-09-13 at 17:30 +0800, Tang Chen wrote:
> :
>> @@ -100,8 +180,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> phys_addr_t end, phys_addr_t size,
>> phys_addr_t align, int nid)
>> {
>> - phys_addr_t this_start, this_end, cand;
>> - u64 i;
>> + phys_addr_t ret;
>>
>> /* pump up @end */
>> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
>> @@ -111,18 +190,22 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> start = max_t(phys_addr_t, start, PAGE_SIZE);
>> end = max(start, end);
>>
>> - for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
>> - this_start = clamp(this_start, start, end);
>> - this_end = clamp(this_end, start, end);
>> + if (memblock_direction_bottom_up()) {
>> + /*
>> + * MEMBLOCK_ALLOC_ACCESSIBLE is 0, which is less than the end
>> + * of kernel image. So callers specify MEMBLOCK_ALLOC_ACCESSIBLE
>> + * as @start is OK.
>> + */
>> + start = max(start, __pa_symbol(_end)); /* End of kernel image. */
>>
>> - if (this_end < size)
>> - continue;
>> + ret = __memblock_find_range(start, end, size, align, nid);
>> + if (ret)
>> + return ret;
>>
>> - cand = round_down(this_end - size, align);
>> - if (cand >= this_start)
>> - return cand;
>> + pr_warn("memblock: Failed to allocate memory in bottom up direction. Now try top down direction.\n");
>
> Is there any chance that this retry will succeed given that start and
> end are still the same?

Thanks for pointing this out. We made a mistake here. If the original start
is less than __pa_symbol(_end) and the bottom-up search fails, then the
top-down search deserves a retry with the original start argument.
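[Editor's note] The corrected flow — clamp only the bottom-up attempt to the kernel image end, and let the top-down retry see the caller's original start — can be sketched in self-contained, user-space C. The region list, addresses, and helper names below are invented for illustration and do not match the kernel implementation:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long phys_addr_t;

struct range { phys_addr_t base, end; };	/* [base, end) free region */

/* Hypothetical free list: one region below the kernel, one above it. */
static const struct range free_ranges[] = {
	{ 0x1000, 0x5000 },	/* below the kernel image */
	{ 0x9000, 0xa000 },	/* above the kernel image */
};
#define NR_RANGES (sizeof(free_ranges) / sizeof(free_ranges[0]))

static const phys_addr_t kernel_end = 0x8000;	/* mock __pa_symbol(_end) */

/* Bottom-up: lowest candidate at or above @start. */
static phys_addr_t find_low_to_high(phys_addr_t start, phys_addr_t end,
				    phys_addr_t size)
{
	for (size_t i = 0; i < NR_RANGES; i++) {
		phys_addr_t lo = free_ranges[i].base > start ? free_ranges[i].base : start;
		phys_addr_t hi = free_ranges[i].end < end ? free_ranges[i].end : end;

		if (hi > lo && hi - lo >= size)
			return lo;
	}
	return 0;
}

/* Top-down: highest candidate below @end. */
static phys_addr_t find_high_to_low(phys_addr_t start, phys_addr_t end,
				    phys_addr_t size)
{
	for (size_t i = NR_RANGES; i-- > 0; ) {
		phys_addr_t lo = free_ranges[i].base > start ? free_ranges[i].base : start;
		phys_addr_t hi = free_ranges[i].end < end ? free_ranges[i].end : end;

		if (hi > lo && hi - lo >= size)
			return hi - size;
	}
	return 0;
}

/* The fix under discussion: clamp only the bottom-up attempt, and retry
 * top-down with the caller's ORIGINAL @start, not the clamped one. */
static phys_addr_t find_in_range(phys_addr_t start, phys_addr_t end,
				 phys_addr_t size, int bottom_up)
{
	if (bottom_up) {
		phys_addr_t lo = start > kernel_end ? start : kernel_end;
		phys_addr_t ret = find_low_to_high(lo, end, size);

		if (ret)
			return ret;
		/* Fall back with the unclamped start. */
	}
	return find_high_to_low(start, end, size);
}
```

With `end` below `kernel_end`, the bottom-up attempt necessarily fails, and only a retry using the original (unclamped) start can find the region below the kernel — which is exactly why the retry is worth doing.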

>
> Thanks,
> -Toshi
>
>
>> }
>> - return 0;
>> +
>> + return __memblock_find_range_rev(start, end, size, align, nid);
>> }
>>
>> /**
>
>
>


--
Thanks.
Zhang Yanfei

2013-09-16 11:01:25

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] x86, memblock: Allocate memory near kernel image before SRAT parsed.

Hello tejun,

Could you please help review the patchset? As you suggested,
we've made it much simpler and cleaner.

Thanks in advance!

On 09/13/2013 05:30 PM, Tang Chen wrote:
> This patch-set is based on tj's suggestion, and not fully tested.
> Just for review and discussion.
>
> This patch-set is based on the latest kernel (3.11)
> HEAD is:
> commit d5d04bb48f0eb89c14e76779bb46212494de0bec
> Author: Linus Torvalds <[email protected]>
> Date: Wed Sep 11 19:55:12 2013 -0700
>
>
> [Problem]
>
> Currently, Linux cannot migrate pages used by the kernel because
> of the kernel direct mapping: in Linux kernel space, va = pa + PAGE_OFFSET.
> When the pa is changed, we cannot simply update the page table and
> keep the va unmodified. So kernel pages are not migratable.
>
> There are also other issues that make kernel pages unmigratable.
> For example, a physical address may be cached somewhere for later use,
> and it is not easy to update all such caches.
>
> When doing memory hotplug in Linux, we first migrate all the pages in one
> memory device somewhere else, and then remove the device. But if pages are
> used by the kernel, they are not migratable. As a result, memory used by
> the kernel cannot be hot-removed.
>
> Modifying the kernel direct mapping mechanism is too difficult. And
> it may degrade kernel performance and make it unstable. So we use the
> following way to do memory hotplug.
>
>
> [What we are doing]
>
> In Linux, memory in one numa node is divided into several zones. One of the
> zones is ZONE_MOVABLE, which the kernel won't use.
>
> In order to implement memory hotplug in Linux, we are going to arrange all
> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.
> To do this, we need ACPI's help.
>
> In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The memory
> affinities in SRAT record every memory range in the system, and also, flags
> specifying if the memory range is hotpluggable.
> (Please refer to ACPI spec 5.0 5.2.16)
>
> With the help of SRAT, we have to do the following two things to achieve our
> goal:
>
> 1. When doing memory hot-add, allow users to arrange hotpluggable memory
> as ZONE_MOVABLE.
> (This has been done by the MOVABLE_NODE functionality in Linux.)
>
> 2. When the system is booting, prevent the bootmem allocator from allocating
> hotpluggable memory for the kernel before memory initialization
> finishes.
>
> Problem 2 is the key problem we are going to solve. But before solving it,
> we need some preparation. Please see below.
>
>
> [Preparation]
>
> Bootloader has to load the kernel image into memory. And this memory must be
> unhotpluggable. We cannot prevent this anyway. So in a memory hotplug system,
> we can assume any node the kernel resides in is not hotpluggable.
>
> Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But
> memblock has already started to work. In the current kernel, memblock allocates
> the following memory before SRAT is parsed:
>
> setup_arch()
> |->memblock_x86_fill() /* memblock is ready */
> |......
> |->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
> |->reserve_real_mode() /* allocate memory under 1MB */
> |->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
> |->dma_contiguous_reserve() /* specified by user, should be low */
> |->setup_log_buf() /* specified by user, several mega bytes */
> |->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
> |->acpi_initrd_override() /* several mega bytes */
> |->reserve_crashkernel() /* could be large, should reorder */
> |......
> |->initmem_init() /* Parse SRAT */
>
> According to Tejun's advice, before SRAT is parsed, we should try our best to
> allocate memory near the kernel image. Since the whole node the kernel resides
> in won't be hotpluggable, and a modern server node may have at least 16GB
> of memory, allocating several megabytes around the kernel image won't cross
> into hotpluggable memory.
>
>
> [About this patch-set]
>
> So this patch-set does the following:
>
> 1. Make memblock be able to allocate memory from low address to high address.
> 1) Keep all the memblock APIs' prototype unmodified.
> 2) When the direction is bottom up, keep the start address greater than the
> end of kernel image.
>
> 2. Improve init_mem_mapping() to support allocating page tables in the bottom-up direction.
>
> 3. Introduce "movablenode" boot option to enable and disable this functionality.
>
> PS: Reordering of relocate_initrd() has not been done yet. acpi_initrd_override()
> needs to access initrd with virtual address. So relocate_initrd() must be done
> before acpi_initrd_override().
>
>
> Change log v2 -> v3:
> 1. According to Toshi's suggestion, move the direction checking logic into memblock.
> And simplify the code further.
>
> Change log v1 -> v2:
> 1. According to tj's suggestion, implemented a new function memblock_alloc_bottom_up()
> to allocate memory from bottom upwards, which can simplify the code.
>
>
> Tang Chen (5):
> memblock: Introduce allocation direction to memblock.
> memblock: Improve memblock to support allocation from lower address.
> x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is
> parsed.
> x86, mem-hotplug: Support initialize page tables from low to high.
> mem-hotplug: Introduce movablenode boot option to control memblock
> allocation direction.
>
> Documentation/kernel-parameters.txt | 15 ++++
> arch/x86/kernel/setup.c | 44 ++++++++++++-
> arch/x86/mm/init.c | 121 ++++++++++++++++++++++++++--------
> include/linux/memblock.h | 22 ++++++
> include/linux/memory_hotplug.h | 5 ++
> mm/memblock.c | 120 +++++++++++++++++++++++++++++++----
> mm/memory_hotplug.c | 9 +++
> 7 files changed, 293 insertions(+), 43 deletions(-)
>
>


--
Thanks.
Zhang Yanfei

2013-09-23 15:39:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] memblock: Introduce allocation direction to memblock.

Hello,

Sorry about the delay. Was traveling.

On Fri, Sep 13, 2013 at 05:30:51PM +0800, Tang Chen wrote:
> +/* Allocation order. */
> +#define MEMBLOCK_DIRECTION_HIGH_TO_LOW 0
> +#define MEMBLOCK_DIRECTION_LOW_TO_HIGH 1
> +#define MEMBLOCK_DIRECTION_DEFAULT MEMBLOCK_DIRECTION_HIGH_TO_LOW

Can we please settle on either top_down/bottom_up or
high_to_low/low_to_high? The two seem to be used interchangeably in
the patch series. Also, it'd be more customary to use enum for things
like above, but more on the interface below.

> +static inline bool memblock_direction_bottom_up(void)
> +{
> + return memblock.current_direction == MEMBLOCK_DIRECTION_LOW_TO_HIGH;
> +}

Maybe just memblock_bottom_up() would be enough?

Also, why not also have memblock_set_bottom_up(bool enable) as the
'set' interface?

> /**
> + * memblock_set_current_direction - Set current allocation direction to allow
> + * allocating memory from higher to lower
> + * address or from lower to higher address
> + *
> + * @direction: In which order to allocate memory. Could be
> + * MEMBLOCK_DIRECTION_{HIGH_TO_LOW|LOW_TO_HIGH}
> + */
> +void memblock_set_current_direction(int direction);

Function comments should go with the function definition. Dunno what
happened with set_current_limit but let's please not spread it.

> +void __init_memblock memblock_set_current_direction(int direction)
> +{
> + if (direction != MEMBLOCK_DIRECTION_HIGH_TO_LOW &&
> + direction != MEMBLOCK_DIRECTION_LOW_TO_HIGH) {
> + pr_warn("memblock: Failed to set allocation order. "
> + "Invalid order type: %d\n", direction);
> + return;
> + }
> +
> + memblock.current_direction = direction;
> +}

If set_bottom_up() style interface is used, the above will be a lot
simpler, right? Also, it's kinda weird to have two separate patches
to introduce the flag and actually implement bottom up allocation.

Thanks.

--
tejun

2013-09-23 15:50:34

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] memblock: Improve memblock to support allocation from lower address.

Hello,

Please separate out factoring out of top-down allocation. That change
is an equivalent conversion which shouldn't involve any functional
difference. Mixing that with introduction of new feature isn't a good
idea, so the patch split should be 1. split out top-down allocation
from memblock_find_in_range_node() 2. introduce bottom-up flag and
implement the feature.

On Fri, Sep 13, 2013 at 05:30:52PM +0800, Tang Chen wrote:
> +/**
> * memblock_find_in_range_node - find free area in given range and node
> - * @start: start of candidate range
> + * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE

The only reason @end has special ACCESSIBLE flag is because we don't
know how high is actually accessible and it needs to be distinguished
from ANYWHERE. We assume that the lower addresses are always mapped,
so using ACCESSIBLE for @start is weird. I think it'd be clearer to
make the memblock interface to set the direction explicitly state what
it's doing - ie. something like set_memblock_alloc_above_kernel(bool
enable). We clearly don't want pure bottom-up allocation and the
@start/@end params in memblock interface are used to impose extra
limitations for each allocation, not the overall allocator behavior.

> @@ -100,8 +180,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> phys_addr_t end, phys_addr_t size,
> phys_addr_t align, int nid)
> {
> - phys_addr_t this_start, this_end, cand;
> - u64 i;
> + phys_addr_t ret;
>
> /* pump up @end */
> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> @@ -111,18 +190,22 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> start = max_t(phys_addr_t, start, PAGE_SIZE);
> end = max(start, end);
>
> - for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
> - this_start = clamp(this_start, start, end);
> - this_end = clamp(this_end, start, end);
> + if (memblock_direction_bottom_up()) {
> + /*
> + * MEMBLOCK_ALLOC_ACCESSIBLE is 0, which is less than the end
> + * of kernel image. So callers specify MEMBLOCK_ALLOC_ACCESSIBLE
> + * as @start is OK.
> + */
> + start = max(start, __pa_symbol(_end)); /* End of kernel image. */
>
> - if (this_end < size)
> - continue;
> + ret = __memblock_find_range(start, end, size, align, nid);
> + if (ret)
> + return ret;
>
> - cand = round_down(this_end - size, align);
> - if (cand >= this_start)
> - return cand;
> + pr_warn("memblock: Failed to allocate memory in bottom up direction. Now try top down direction.\n");

You probably wanna explain why retrying top-down allocation may
succeed when bottom-up failed.

Thanks.

--
tejun

2013-09-23 15:53:39

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] x86, mem-hotplug: Support initialize page tables from low to high.

Hey,

On Fri, Sep 13, 2013 at 05:30:54PM +0800, Tang Chen wrote:
> init_mem_mapping() is called before SRAT is parsed, and memblock will allocate
> memory for page tables. To prevent page tables from being allocated within
> hotpluggable memory, we allocate them upward from the end of the kernel image.

The same comment about patch split as before. Please make splitting
out memory_map_from_high() a separate patch. Also, please choose one
pair to describe the direction. The series is currently using four
variants - top_down/bottom_up, high_to_low/low_to_high,
from_high/from_low. rev/[none]. Please choose one and stick with it.

Thanks.

--
tejun

2013-09-23 15:57:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] mem-hotplug: Introduce movablenode boot option to control memblock allocation direction.

Hello,

On Fri, Sep 13, 2013 at 05:30:55PM +0800, Tang Chen wrote:
> +#ifdef CONFIG_MOVABLE_NODE
> + if (movablenode_enable_srat) {
> + /*
> + * When ACPI SRAT is parsed, which is done in initmem_init(),
> + * set memblock back to the default behavior.
> + */
> + memblock_set_current_direction(MEMBLOCK_DIRECTION_DEFAULT);
> + }
> +#endif /* CONFIG_MOVABLE_NODE */

It's kinda weird to have ifdef around the above when all the actual
code would be compiled and linked regardless of the above ifdef.
Wouldn't it make more sense to conditionalize
memblock_direction_bottom_up() so that it's constant false to allow
the compiler to drop unnecessary code?

Thanks.

--
tejun

2013-09-23 16:36:42

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] memblock: Introduce allocation direction to memblock.

Hello tejun,

On 09/23/2013 11:38 PM, Tejun Heo wrote:
> Hello,
>
> Sorry about the delay. Was traveling.

hoho~ I guess you did have a good time.

>
> On Fri, Sep 13, 2013 at 05:30:51PM +0800, Tang Chen wrote:
>> +/* Allocation order. */
>> +#define MEMBLOCK_DIRECTION_HIGH_TO_LOW 0
>> +#define MEMBLOCK_DIRECTION_LOW_TO_HIGH 1
>> +#define MEMBLOCK_DIRECTION_DEFAULT MEMBLOCK_DIRECTION_HIGH_TO_LOW
>
> Can we please settle on either top_down/bottom_up or
> high_to_low/low_to_high? The two seem to be used interchangeably in
> the patch series. Also, it'd be more customary to use enum for things
> like above, but more on the interface below.

OK. let's use top_down/bottom_up. And using enum is also ok.

>
>> +static inline bool memblock_direction_bottom_up(void)
>> +{
>> + return memblock.current_direction == MEMBLOCK_DIRECTION_LOW_TO_HIGH;
>> +}
>
> Maybe just memblock_bottom_up() would be enough?

Agreed.

>
> Also, why not also have memblock_set_bottom_up(bool enable) as the
> 'set' interface?

hmmm, ok. So we will use memblock_set_bottom_up to replace
memblock_set_current_direction below.

>
>> /**
>> + * memblock_set_current_direction - Set current allocation direction to allow
>> + * allocating memory from higher to lower
>> + * address or from lower to higher address
>> + *
>> + * @direction: In which order to allocate memory. Could be
>> + * MEMBLOCK_DIRECTION_{HIGH_TO_LOW|LOW_TO_HIGH}
>> + */
>> +void memblock_set_current_direction(int direction);
>
> Function comments should go with the function definition. Dunno what
> happened with set_current_limit but let's please not spread it.
>
>> +void __init_memblock memblock_set_current_direction(int direction)
>> +{
>> + if (direction != MEMBLOCK_DIRECTION_HIGH_TO_LOW &&
>> + direction != MEMBLOCK_DIRECTION_LOW_TO_HIGH) {
>> + pr_warn("memblock: Failed to set allocation order. "
>> + "Invalid order type: %d\n", direction);
>> + return;
>> + }
>> +
>> + memblock.current_direction = direction;
>> +}
>
> If set_bottom_up() style interface is used, the above will be a lot
> simpler, right? Also, it's kinda weird to have two separate patches
> to introduce the flag and actually implement bottom up allocation.

Yeah, right, that'd be much simpler. And it is ok to put the two in
one patch.

Thanks.

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-23 16:44:27

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] memblock: Improve memblock to support allocation from lower address.

Hello tejun,

On 09/23/2013 11:50 PM, Tejun Heo wrote:
> Hello,
>
> Please separate out factoring out of top-down allocation. That change
> is an equivalent conversion which shouldn't involve any functional
> difference. Mixing that with introduction of new feature isn't a good
> idea, so the patch split should be 1. split out top-down allocation
> from memblock_find_in_range_node() 2. introduce bottom-up flag and
> implement the feature.

Ok, will do the split.

>
> On Fri, Sep 13, 2013 at 05:30:52PM +0800, Tang Chen wrote:
>> +/**
>> * memblock_find_in_range_node - find free area in given range and node
>> - * @start: start of candidate range
>> + * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
>
> The only reason @end has special ACCESSIBLE flag is because we don't
> know how high is actually accessible and it needs to be distinguished
> from ANYWHERE. We assume that the lower addresses are always mapped,
> so using ACCESSIBLE for @start is weird. I think it'd be clearer to
> make the memblock interface to set the direction explicitly state what
> it's doing - ie. something like set_memblock_alloc_above_kernel(bool
> enable). We clearly don't want pure bottom-up allocation and the
> @start/@end params in memblock interface are used to impose extra
> limitations for each allocation, not the overall allocator behavior.
>
>> @@ -100,8 +180,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> phys_addr_t end, phys_addr_t size,
>> phys_addr_t align, int nid)
>> {
>> - phys_addr_t this_start, this_end, cand;
>> - u64 i;
>> + phys_addr_t ret;
>>
>> /* pump up @end */
>> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
>> @@ -111,18 +190,22 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> start = max_t(phys_addr_t, start, PAGE_SIZE);
>> end = max(start, end);
>>
>> - for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
>> - this_start = clamp(this_start, start, end);
>> - this_end = clamp(this_end, start, end);
>> + if (memblock_direction_bottom_up()) {
>> + /*
>> + * MEMBLOCK_ALLOC_ACCESSIBLE is 0, which is less than the end
>> + * of kernel image. So callers specify MEMBLOCK_ALLOC_ACCESSIBLE
>> + * as @start is OK.
>> + */
>> + start = max(start, __pa_symbol(_end)); /* End of kernel image. */
>>
>> - if (this_end < size)
>> - continue;
>> + ret = __memblock_find_range(start, end, size, align, nid);
>> + if (ret)
>> + return ret;
>>
>> - cand = round_down(this_end - size, align);
>> - if (cand >= this_start)
>> - return cand;
>> + pr_warn("memblock: Failed to allocate memory in bottom up direction. Now try top down direction.\n");
>
> You probably wanna explain why retrying top-down allocation may
> succeed when bottom-up failed.

OK. The reason is that we always limit bottom-up allocation to start from
the kernel image end, but top-down allocation doesn't have that limit,
so retrying top-down allocation may succeed when bottom-up fails.

Will add the comment to explain the retry.

Thanks.

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-23 16:46:36

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] x86, mem-hotplug: Support initialize page tables from low to high.

Hello tejun,

On 09/23/2013 11:53 PM, Tejun Heo wrote:
> Hey,
>
> On Fri, Sep 13, 2013 at 05:30:54PM +0800, Tang Chen wrote:
>> init_mem_mapping() is called before SRAT is parsed. And memblock will allocate
>> memory for page tables. To prevent page tables being allocated within hotpluggable
>> memory, we will allocate page tables from the end of kernel image to the higher
>> memory.
>
> The same comment about patch split as before. Please make splitting
> out memory_map_from_high() a separate patch. Also, please choose one
> pair to describe the direction. The series is currently using four
> variants - top_down/bottom_up, high_to_low/low_to_high,
> from_high/from_low. rev/[none]. Please choose one and stick with it.

OK, will do the split and choose one pair. Thanks for the reminder again.

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-23 16:58:29

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] mem-hotplug: Introduce movablenode boot option to control memblock allocation direction.

Hello tejun,

On 09/23/2013 11:57 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 13, 2013 at 05:30:55PM +0800, Tang Chen wrote:
>> +#ifdef CONFIG_MOVABLE_NODE
>> + if (movablenode_enable_srat) {
>> + /*
>> + * When ACPI SRAT is parsed, which is done in initmem_init(),
>> + * set memblock back to the default behavior.
>> + */
>> + memblock_set_current_direction(MEMBLOCK_DIRECTION_DEFAULT);
>> + }
>> +#endif /* CONFIG_MOVABLE_NODE */
>
> It's kinda weird to have ifdef around the above when all the actual
> code would be compiled and linked regardless of the above ifdef.
> Wouldn't it make more sense to conditionalize
> memblock_direction_bottom_up() so that it's constant false to allow
> the compiler to drop unnecessary code?

you mean we define memblock_set_bottom_up and memblock_bottom_up like below:

#ifdef CONFIG_MOVABLE_NODE
void memblock_set_bottom_up(bool enable)
{
/* do something */
}

bool memblock_bottom_up()
{
return direction == bottom_up;
}
#else
void memblock_set_bottom_up(bool enable)
{
/* empty */
}

bool memblock_bottom_up()
{
return false;
}
#endif

right?

thanks.

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-23 17:11:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] mem-hotplug: Introduce movablenode boot option to control memblock allocation direction.

Hello,

On Tue, Sep 24, 2013 at 12:58:03AM +0800, Zhang Yanfei wrote:
> you mean we define memblock_set_bottom_up and memblock_bottom_up like below:
>
> #ifdef CONFIG_MOVABLE_NODE
> void memblock_set_bottom_up(bool enable)
> {
> /* do something */
> }
>
> bool memblock_bottom_up()
> {
> return direction == bottom_up;
> }
> #else
> void memblock_set_bottom_up(bool enable)
> {
> /* empty */
> }
>
> bool memblock_bottom_up()
> {
> return false;
> }
> #endif
>
> right?

Yeah, the compiler would be able to drop bottom_up code if
!MOVABLE_NODE as long as the implementation functions are static.
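[Editor's note] A user-space model of the two variants Zhang sketches shows the stub behavior when the config option is off. The names mirror the discussion, and the `CONFIG_MOVABLE_NODE` macro is mocked here (left undefined so the stub variant compiles in):

```c
#include <assert.h>
#include <stdbool.h>

/* Toggle to mimic CONFIG_MOVABLE_NODE; left undefined here so the
 * bottom-up path collapses to constant false. */
/* #define CONFIG_MOVABLE_NODE */

#ifdef CONFIG_MOVABLE_NODE
static bool bottom_up_enabled;

static void memblock_set_bottom_up(bool enable) { bottom_up_enabled = enable; }
static bool memblock_bottom_up(void)            { return bottom_up_enabled; }
#else
static void memblock_set_bottom_up(bool enable) { (void)enable; /* no-op */ }
static bool memblock_bottom_up(void)            { return false; }
#endif

/* Any branch guarded by memblock_bottom_up() becomes dead code the
 * compiler can drop when the constant-false stub is compiled in. */
static int pick_direction(void)
{
	if (memblock_bottom_up())
		return 1;	/* bottom-up path */
	return 0;		/* top-down path */
}
```

With the macro undefined, `memblock_set_bottom_up()` is ignored and every caller sees `false`, so the bottom-up branch in `pick_direction()` can be eliminated entirely — the effect Tejun describes.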

Thanks.

--
tejun

2013-09-23 18:07:33

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] memblock: Improve memblock to support allocation from lower address.

Hello tejun,

On 09/23/2013 11:50 PM, Tejun Heo wrote:
> Hello,
>
> Please separate out factoring out of top-down allocation. That change
> is an equivalent conversion which shouldn't involve any functional
> difference. Mixing that with introduction of new feature isn't a good
> idea, so the patch split should be 1. split out top-down allocation
> from memblock_find_in_range_node() 2. introduce bottom-up flag and
> implement the feature.
>
> On Fri, Sep 13, 2013 at 05:30:52PM +0800, Tang Chen wrote:
>> +/**
>> * memblock_find_in_range_node - find free area in given range and node
>> - * @start: start of candidate range
>> + * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
>
> The only reason @end has special ACCESSIBLE flag is because we don't
> know how high is actually accessible and it needs to be distinguished
> from ANYWHERE. We assume that the lower addresses are always mapped,
> so using ACCESSIBLE for @start is weird. I think it'd be clearer to
> make the memblock interface to set the direction explicitly state what
> it's doing - ie. something like set_memblock_alloc_above_kernel(bool
> enable). We clearly don't want pure bottom-up allocation and the
> @start/@end params in memblock interface are used to impose extra
> limitations for each allocation, not the overall allocator behavior.

Forgot this one...

Yes, I am following your advice in principle but kind of confused by
something you said above. Where should the set_memblock_alloc_above_kernel
be used? IMO, the function is like:

find_in_range_node()
{
if (ok) {
/* bottom-up */
ret = __memblock_find_in_range(max(start, _end_of_kernel), end...);
> if (ret)
return ret;
}

/* top-down retry */
return __memblock_find_in_range_rev(start, end...)
}

For bottom-up allocation, we always start from max(start, _end_of_kernel).

Thanks.

>
>> @@ -100,8 +180,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> phys_addr_t end, phys_addr_t size,
>> phys_addr_t align, int nid)
>> {
>> - phys_addr_t this_start, this_end, cand;
>> - u64 i;
>> + phys_addr_t ret;
>>
>> /* pump up @end */
>> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
>> @@ -111,18 +190,22 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> start = max_t(phys_addr_t, start, PAGE_SIZE);
>> end = max(start, end);
>>
>> - for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
>> - this_start = clamp(this_start, start, end);
>> - this_end = clamp(this_end, start, end);
>> + if (memblock_direction_bottom_up()) {
>> + /*
>> + * MEMBLOCK_ALLOC_ACCESSIBLE is 0, which is less than the end
>> + * of kernel image. So callers specify MEMBLOCK_ALLOC_ACCESSIBLE
>> + * as @start is OK.
>> + */
>> + start = max(start, __pa_symbol(_end)); /* End of kernel image. */
>>
>> - if (this_end < size)
>> - continue;
>> + ret = __memblock_find_range(start, end, size, align, nid);
>> + if (ret)
>> + return ret;
>>
>> - cand = round_down(this_end - size, align);
>> - if (cand >= this_start)
>> - return cand;
>> + pr_warn("memblock: Failed to allocate memory in bottom up direction. Now try top down direction.\n");
>
> You probably wanna explain why retrying top-down allocation may
> succeed when bottom-up failed.
>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-23 20:21:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] memblock: Improve memblock to support allocation from lower address.

Hello,

On Tue, Sep 24, 2013 at 02:07:13AM +0800, Zhang Yanfei wrote:
> Yes, I am following your advice in principle but kind of confused by
> something you said above. Where should the set_memblock_alloc_above_kernel
> be used? IMO, the function is like:
>
> find_in_range_node()
> {
> if (ok) {
> /* bottom-up */
> ret = __memblock_find_in_range(max(start, _end_of_kernel), end...);
> > if (ret)
> return ret;
> }
>
> /* top-down retry */
> return __memblock_find_in_range_rev(start, end...)
> }
>
> For bottom-up allocation, we always start from max(start, _end_of_kernel).

Oh, I was talking about naming of the memblock_set_bottom_up()
function. We aren't really doing pure bottom up allocations, so I
think it probably would be clearer if the name clearly denotes that
we're doing above-kernel allocation.

Thanks.

--
tejun

2013-09-24 02:42:51

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] memblock: Improve memblock to support allocation from lower address.

Hello tejun,

On 09/24/2013 04:21 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 24, 2013 at 02:07:13AM +0800, Zhang Yanfei wrote:
>> Yes, I am following your advice in principle but kind of confused by
>> something you said above. Where should the set_memblock_alloc_above_kernel
>> be used? IMO, the function is like:
>>
>> find_in_range_node()
>> {
>> if (ok) {
>> /* bottom-up */
>> ret = __memblock_find_in_range(max(start, _end_of_kernel), end...);
>> if (!ret)
>> return ret;
>> }
>>
>> /* top-down retry */
>> return __memblock_find_in_range_rev(start, end...)
>> }
>>
>> For bottom-up allocation, we always start from max(start, _end_of_kernel).
>
> Oh, I was talking about naming of the memblock_set_bottom_up()
> function. We aren't really doing pure bottom up allocations, so I
> think it probably would be clearer if the name clearly denotes that
> we're doing above-kernel allocation.

I see. But I think memblock_set_alloc_above_kernel may lose the information
that we are doing bottom-up allocation. So my idea is that we introduce a
pure bottom-up allocation mode in the previous patches, use bottom-up
allocation here, and limit the start address to above the kernel, with
explicit comments to indicate this.

What do you think?

Thanks.

>
> Thanks.
>


--
Thanks.
Zhang Yanfei

2013-09-24 02:47:01

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] memblock: Improve memblock to support allocation from lower address.

Hello,

On Tue, Sep 24, 2013 at 10:41:51AM +0800, Zhang Yanfei wrote:
> I see. But I think memblock_set_alloc_above_kernel may lose the info
> that we are doing bottom-up allocation. So my idea is we introduce
> pure bottom-up allocation mode in previous patches and we use the
> bottom-up allocation here and limit the start address above the kernel
> , with explicit comments to indicate this.

It probably doesn't matter either way. I was just a bit bothered that
it's called bottom-up when it implies more, including the retry logic.
Its use of bottom-up allocation is really an implementation detail in
service of allocating memory above the kernel image, but yeah, minor
point either way.

Thanks.

--
tejun