2013-09-11 10:05:22

by Tang Chen

Subject: [PATCH v2 0/9] x86, memblock: Allocate memory near kernel image before SRAT parsed.

This patch-set is based on tj's suggestion and is not fully tested; it is
posted for review and discussion. Also according to tj's suggestion, a new
function memblock_alloc_bottom_up() is introduced to allocate memory from
bottom upwards, which simplifies the code.


[Problem]

The current Linux kernel cannot migrate pages used by the kernel because
of the kernel direct mapping: in kernel space, va = pa + PAGE_OFFSET.
When the pa changes, we cannot simply update the page table and keep the
va unmodified. So kernel pages are not migratable.
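
As a rough illustration (a simplified sketch, not the real arch code; the
actual helpers are __va()/__pa() in the x86 headers):

	/* A directly mapped virtual address is the physical address plus a
	 * constant offset, so moving the page physically would change the
	 * virtual address the kernel already uses. */
	static inline void *direct_map_va(phys_addr_t pa)
	{
		return (void *)(unsigned long)(pa + PAGE_OFFSET); /* like __va() */
	}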

There are also other issues that make kernel pages unmigratable. For
example, a physical address may be cached somewhere and used later, and
it is not easy to update all such cached values.

When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism would be very difficult, and
it could hurt kernel performance and stability. So we take the following
approach to memory hotplug.


[What we are doing]

In Linux, memory in a NUMA node is divided into several zones. One of the
zones is ZONE_MOVABLE, which the kernel won't use.
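
For reference, ZONE_MOVABLE is just one of the per-node zone types (an
abridged sketch of enum zone_type from include/linux/mmzone.h; which zones
actually exist depends on the configuration):

	enum zone_type {
		ZONE_DMA,	/* low memory usable for legacy DMA */
		ZONE_NORMAL,	/* normal kernel-usable memory */
		ZONE_MOVABLE,	/* only movable (migratable) allocations */
		__MAX_NR_ZONES
	};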

In order to implement memory hotplug in Linux, we are going to place all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
To do this, we need ACPI's help.

In ACPI, the SRAT (System Resource Affinity Table) contains the NUMA info.
The memory affinity entries in the SRAT record every memory range in the
system, along with flags specifying whether the range is hotpluggable.
(Please refer to ACPI spec 5.0, section 5.2.16.)
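
A memory affinity entry looks roughly like this (an abridged sketch of
struct acpi_srat_mem_affinity from include/acpi/actbl1.h, with reserved
fields omitted):

	struct acpi_srat_mem_affinity {
		struct acpi_subtable_header header;
		u32 proximity_domain;	/* NUMA node of this range */
		u64 base_address;	/* start of the memory range */
		u64 length;		/* size of the memory range */
		u32 flags;		/* ACPI_SRAT_MEM_ENABLED (bit 0),
					 * ACPI_SRAT_MEM_HOT_PLUGGABLE (bit 1) */
	};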

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow users to arrange hotpluggable memory
as ZONE_MOVABLE.
(This has already been done by the MOVABLE_NODE functionality in Linux.)

2. When the system is booting, prevent the bootmem allocator from allocating
hotpluggable memory for the kernel before memory initialization
finishes.

Problem 2 is the key problem we are going to solve. But before solving it,
we need some preparation. Please see below.


[Preparation]

The bootloader has to load the kernel image into memory, and this memory
must be unhotpluggable; we cannot prevent that. So in a memory hotplug
system, we can assume that any node the kernel image resides in is not
hotpluggable.

Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But
memblock has already started to work. In the current kernel, memblock allocates
the following memory before SRAT is parsed:

setup_arch()
|->memblock_x86_fill() /* memblock is ready */
|......
|->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
|->reserve_real_mode() /* allocate memory under 1MB */
|->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
|->dma_contiguous_reserve() /* specified by user, should be low */
|->setup_log_buf() /* specified by user, several megabytes */
|->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
|->acpi_initrd_override() /* several megabytes */
|->reserve_crashkernel() /* could be large, should reorder */
|......
|->initmem_init() /* Parse SRAT */

According to Tejun's advice, before SRAT is parsed, we should try our best
to allocate memory near the kernel image. Since the whole node the kernel
resides in won't be hotpluggable, and a node in a modern server may have at
least 16GB of memory, allocating several megabytes around the kernel image
won't cross into hotpluggable memory.


[About this patch-set]

So this patch-set does the following:

1. Make memblock able to allocate memory from low addresses to high addresses.

2. Modify all functions that need to allocate memory before SRAT is parsed
to support allocating memory from low addresses to high addresses.

3. Introduce the "movablenode" boot option to enable and disable this
functionality.

PS: The reordering of relocate_initrd() has not been done yet, because
acpi_initrd_override() needs to access the initrd through its virtual
address, so relocate_initrd() must be done before acpi_initrd_override().


Change log v1 -> v2:
1. According to tj's suggestion, implemented a new function
memblock_alloc_bottom_up() to allocate memory from bottom upwards, which
simplifies the code.


Tang Chen (9):
memblock: Introduce allocation direction to memblock.
x86, memblock: Introduce memblock_alloc_bottom_up() to memblock.
x86, dma: Support allocating memory from bottom upwards in
dma_contiguous_reserve().
x86: Support allocating memory from bottom upwards in setup_log_buf().
x86: Support allocating memory from bottom upwards in
relocate_initrd().
x86, acpi: Support allocating memory from bottom upwards in
acpi_initrd_override().
x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is
parsed.
x86, mem-hotplug: Support initializing page tables from low to high.
mem-hotplug: Introduce movablenode boot option to control memblock
allocation direction.

Documentation/kernel-parameters.txt | 15 ++++
arch/x86/kernel/setup.c | 54 ++++++++++++++-
arch/x86/mm/init.c | 133 +++++++++++++++++++++++++++--------
drivers/acpi/osl.c | 11 +++
drivers/base/dma-contiguous.c | 17 ++++-
include/linux/memblock.h | 24 ++++++
include/linux/memory_hotplug.h | 5 ++
kernel/printk/printk.c | 11 +++
mm/memblock.c | 51 +++++++++++++
mm/memory_hotplug.c | 9 +++
10 files changed, 296 insertions(+), 34 deletions(-)


2013-09-11 10:05:29

by Tang Chen

Subject: [PATCH v2 7/9] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed.

Memory reserved for the crash kernel could be large, so we should not
allocate it bottom-up from the end of the kernel image.

Once SRAT is parsed, we will know which memory is hotpluggable and can
avoid allocating that memory for the kernel. So reorder
reserve_crashkernel() to after SRAT parsing.

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/kernel/setup.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 7372be7..fa56a57 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1130,8 +1130,6 @@ void __init setup_arch(char **cmdline_p)
acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
#endif

- reserve_crashkernel();
-
vsmp_init();

io_delay_init();
@@ -1146,6 +1144,12 @@ void __init setup_arch(char **cmdline_p)
initmem_init();
memblock_find_dma_reserve();

+ /*
+ * Reserve memory for crash kernel after SRAT is parsed so that it
+ * won't consume hotpluggable memory.
+ */
+ reserve_crashkernel();
+
#ifdef CONFIG_KVM_GUEST
kvmclock_init();
#endif
--
1.7.1

2013-09-11 10:05:28

by Tang Chen

Subject: [PATCH v2 6/9] x86, acpi: Support allocating memory from bottom upwards in acpi_initrd_override().

During early boot, if bottom-up mode is set, first try allocating
bottom-up from the end of the kernel image, and if that fails, fall
back to the normal top-down allocation.

This patch adds the above logic to acpi_initrd_override().

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/osl.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index e5f416c..978dcfa 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -632,6 +632,15 @@ void __init acpi_initrd_override(void *data, size_t size)
if (table_nr == 0)
return;

+ if (memblock_direction_bottom_up()) {
+ acpi_tables_addr = memblock_alloc_bottom_up(
+ MEMBLOCK_ALLOC_ACCESSIBLE,
+ max_low_pfn_mapped << PAGE_SHIFT,
+ all_tables_size, PAGE_SIZE);
+ if (acpi_tables_addr)
+ goto success;
+ }
+
acpi_tables_addr =
memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
all_tables_size, PAGE_SIZE);
@@ -639,6 +648,8 @@ void __init acpi_initrd_override(void *data, size_t size)
WARN_ON(1);
return;
}
+
+success:
/*
* Only calling e820_add_reserve does not work and the
* tables are invalid (memory got used) later.
--
1.7.1

2013-09-11 10:05:57

by Tang Chen

Subject: [PATCH v2 9/9] mem-hotplug: Introduce movablenode boot option to control memblock allocation direction.

The Hot-Pluggable field in SRAT specifies which memory ranges are
hotpluggable. As we mentioned before, if hotpluggable memory is used by
the kernel, it cannot be hot-removed. So memory hotplug users may want to
place all hotpluggable memory in ZONE_MOVABLE so that the kernel won't
use it.

Memory hotplug users may also set a node as a movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE, so with this setup the
kernel cannot use any memory in movable nodes. This degrades NUMA
performance, and other users may be unhappy about it.

So we need a way to let users enable and disable this functionality.
This patch introduces the movablenode boot option, which lets users choose
whether to reserve hotpluggable memory and set it as ZONE_MOVABLE.

Users can specify "movablenode" on the kernel command line to enable this
functionality. Those who don't use memory hotplug, or who don't want to
lose their NUMA performance, just don't specify anything; the kernel will
work as before.
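
For example (the boot entry shown is purely illustrative), appending the
option to the kernel command line:

	linux /boot/vmlinuz ro root=/dev/sda1 movablenode

makes memblock allocate bottom-up until SRAT is parsed, while omitting it
keeps today's top-down behavior.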

After memblock is ready but before SRAT is parsed, we should allocate
memory near the kernel image. So this patch does the following:

1. After memblock is ready, make memblock allocate memory from low
addresses to high.
2. After SRAT is parsed, make memblock behave as by default, allocating
memory from high addresses to low.

This behavior is controlled by movablenode boot option.

Suggested-by: Kamezawa Hiroyuki <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
Documentation/kernel-parameters.txt | 15 ++++++++++++++
arch/x86/kernel/setup.c | 36 +++++++++++++++++++++++++++++++++++
include/linux/memory_hotplug.h | 5 ++++
mm/memory_hotplug.c | 9 ++++++++
4 files changed, 65 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 479eeaf..f177c83 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.

+	movablenode	[KNL,X86] This parameter enables/disables the
+			kernel to arrange hotpluggable memory ranges recorded
+			in the ACPI SRAT (System Resource Affinity Table) as
+			ZONE_MOVABLE. Such memory can then be hot-removed
+			while the system is running.
+			When this option is specified, all hotpluggable
+			memory will be in ZONE_MOVABLE, which the kernel
+			cannot use. This may degrade NUMA performance, so
+			users who care about NUMA performance should not
+			use it.
+			If all memory ranges in the system are hotpluggable,
+			the ones used by the kernel early in boot, such as
+			the kernel code and data segments, the initrd file
+			and so on, won't be set as ZONE_MOVABLE and won't be
+			hotpluggable; otherwise the kernel wouldn't have
+			enough memory to boot.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fa56a57..b87069b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1104,6 +1104,31 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();

+#ifdef CONFIG_MOVABLE_NODE
+ if (movablenode_enable_srat) {
+ /*
+		 * Memory used by the kernel cannot be hot-removed because
+		 * Linux cannot migrate kernel pages. When memory hotplug is
+		 * enabled, we should prevent memblock from allocating
+		 * hotpluggable memory for the kernel.
+		 *
+		 * ACPI SRAT records all hotpluggable memory ranges, but
+		 * before SRAT is parsed we don't know which memory is
+		 * hotpluggable.
+		 *
+		 * The kernel image is loaded into memory very early; we
+		 * cannot prevent this. So on a NUMA system, we treat any
+		 * node the kernel resides in as un-hotpluggable.
+		 *
+		 * Since on modern servers one node can hold double-digit
+		 * gigabytes of memory, we can assume the memory around the
+		 * kernel image is also un-hotpluggable. So before SRAT is
+		 * parsed, just allocate memory near the kernel image to try
+		 * our best to keep the kernel away from hotpluggable memory.
+ */
+ memblock_set_current_direction(MEMBLOCK_DIRECTION_LOW_TO_HIGH);
+ }
+#endif /* CONFIG_MOVABLE_NODE */
+
init_mem_mapping();

early_trap_pf_init();
@@ -1142,6 +1167,17 @@ void __init setup_arch(char **cmdline_p)
early_acpi_boot_init();

initmem_init();
+
+#ifdef CONFIG_MOVABLE_NODE
+ if (movablenode_enable_srat) {
+ /*
+		 * ACPI SRAT has been parsed by now (in initmem_init()), so
+		 * set memblock back to the default top-down behavior.
+ */
+ memblock_set_current_direction(MEMBLOCK_DIRECTION_DEFAULT);
+ }
+#endif /* CONFIG_MOVABLE_NODE */
+
memblock_find_dma_reserve();

/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index dd38e62..5d2c07b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,11 @@ enum {
ONLINE_MOVABLE,
};

+#ifdef CONFIG_MOVABLE_NODE
+/* Set by the "movablenode" boot option to enable SRAT hotplug handling */
+extern bool movablenode_enable_srat;
+#endif /* CONFIG_MOVABLE_NODE */
+
/*
* pgdat resizing functions
*/
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca1dd3a..7252a7d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1345,6 +1345,15 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
{
return true;
}
+
+bool __initdata movablenode_enable_srat;
+
+static int __init cmdline_parse_movablenode(char *p)
+{
+ movablenode_enable_srat = true;
+ return 0;
+}
+early_param("movablenode", cmdline_parse_movablenode);
#else /* CONFIG_MOVABLE_NODE */
/* ensure the node has NORMAL memory if it is still online */
static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
--
1.7.1

2013-09-11 10:05:26

by Tang Chen

Subject: [PATCH v2 2/9] x86, memblock: Introduce memblock_alloc_bottom_up() to memblock.

This patch introduces a new API, memblock_alloc_bottom_up(), to make memblock
able to allocate from bottom upwards.

During early boot, if bottom-up mode is set, first try allocating bottom-up
from the end of the kernel image, and if that fails, fall back to the normal
top-down allocation.
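
A typical caller then follows this pattern (a sketch only, with size and
limit standing in for the caller's own values; note that
memblock_alloc_bottom_up() just finds a suitable range, so the caller must
reserve it itself, the way alloc_low_pages() does):

	phys_addr_t addr = 0;

	if (memblock_direction_bottom_up())
		/* search upwards, starting at the end of the kernel image */
		addr = memblock_alloc_bottom_up(MEMBLOCK_ALLOC_ACCESSIBLE,
						MEMBLOCK_ALLOC_ACCESSIBLE,
						size, PAGE_SIZE);
	if (!addr)
		/* fall back to the normal top-down search */
		addr = memblock_find_in_range(0, limit, size, PAGE_SIZE);
	if (addr)
		memblock_reserve(addr, size);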

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 2 ++
mm/memblock.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index fbf6b6d..9a0958f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -151,6 +151,8 @@ phys_addr_t memblock_alloc_nid(phys_addr_t size, phys_addr_t align, int nid);
phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid);

phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);
+phys_addr_t memblock_alloc_bottom_up(phys_addr_t start, phys_addr_t end,
+ phys_addr_t size, phys_addr_t align);

static inline bool memblock_direction_bottom_up(void)
{
diff --git a/mm/memblock.c b/mm/memblock.c
index 7add615..d7485b9 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -20,6 +20,8 @@
#include <linux/seq_file.h>
#include <linux/memblock.h>

+#include <asm/sections.h>
+
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;

@@ -786,6 +788,42 @@ static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
return 0;
}

+/**
+ * memblock_alloc_bottom_up - find free area from bottom upwards
+ * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ *
+ * Search for a @size sized free area aligned to @align, from the end of the
+ * kernel image upwards. Note that this function only finds the area; the
+ * caller is responsible for reserving it, e.g. with memblock_reserve().
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+phys_addr_t __init_memblock memblock_alloc_bottom_up(phys_addr_t start,
+ phys_addr_t end, phys_addr_t size,
+ phys_addr_t align)
+{
+ phys_addr_t this_start, this_end, cand;
+ u64 i;
+
+ if (start == MEMBLOCK_ALLOC_ACCESSIBLE)
+ start = __pa_symbol(_end); /* End of kernel image. */
+ if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+ end = memblock.current_limit;
+
+ for_each_free_mem_range(i, MAX_NUMNODES, &this_start, &this_end, NULL) {
+ this_start = clamp(this_start, start, end);
+ this_end = clamp(this_end, start, end);
+
+ cand = round_up(this_start, align);
+ if (cand < this_end && this_end - cand >= size)
+ return cand;
+ }
+
+ return 0;
+}
+
phys_addr_t __init memblock_alloc_nid(phys_addr_t size, phys_addr_t align, int nid)
{
return memblock_alloc_base_nid(size, align, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
--
1.7.1

2013-09-11 10:06:47

by Tang Chen

Subject: [PATCH v2 8/9] x86, mem-hotplug: Support initializing page tables from low to high.

init_mem_mapping() is called before SRAT is parsed, and memblock will
allocate memory for page tables there. To prevent page tables from being
allocated within hotpluggable memory, we allocate them from the end of the
kernel image upwards.
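
To get a feel for the step-size heuristic used below (numbers assume
x86_64, where PMD_SIZE is 2MB; STEP_SIZE_SHIFT is 5, so step_size grows
roughly 32x each time a full step gets mapped):

	round 1:   2MB	/* page tables come from the BRK pgt_buf */
	round 2:  64MB	/* page tables come from the 2MB mapped above */
	round 3:   2GB
	round 4:  64GB
	...

Only the very first step has to fit its page tables into the statically
reserved pgt_buf.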

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/mm/init.c | 133 ++++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 104 insertions(+), 29 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 04664cd..7dae4e3 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -54,11 +54,23 @@ __ref void *alloc_low_pages(unsigned int num)
unsigned long ret;
if (min_pfn_mapped >= max_pfn_mapped)
panic("alloc_low_page: ran out of memory");
+
+ if (memblock_direction_bottom_up()) {
+ ret = memblock_alloc_bottom_up(
+ MEMBLOCK_ALLOC_ACCESSIBLE,
+ max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE * num, PAGE_SIZE);
+ if (ret)
+ goto reserve;
+ }
+
ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
if (!ret)
panic("alloc_low_page: can not alloc memory");
+
+reserve:
memblock_reserve(ret, PAGE_SIZE * num);
pfn = ret >> PAGE_SHIFT;
} else {
@@ -401,13 +413,79 @@ static unsigned long __init init_range_memory_mapping(

/* (PUD_SHIFT-PMD_SHIFT)/2 */
#define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+
+#ifdef CONFIG_MOVABLE_NODE
+/**
+ * memory_map_from_low - Map [start, end) from low to high
+ * @start: start address of the target memory range
+ * @end: end address of the target memory range
+ *
+ * This function will set up the direct mapping for the memory range
+ * [start, end) in a heuristic way: in the beginning step_size is small,
+ * and the more memory gets mapped in one iteration, the larger step_size
+ * is used in the next one.
+ */
+static void __init memory_map_from_low(unsigned long start, unsigned long end)
+{
+ unsigned long next, new_mapped_ram_size;
+ unsigned long mapped_ram_size = 0;
+	/* step_size needs to be small so the pgt_buf from BRK can cover it */
+ unsigned long step_size = PMD_SIZE;
+
+ while (start < end) {
+ if (end - start > step_size) {
+ next = round_up(start + 1, step_size);
+ if (next > end)
+ next = end;
+ } else
+ next = end;
+
+ new_mapped_ram_size = init_range_memory_mapping(start, next);
+ min_pfn_mapped = start >> PAGE_SHIFT;
+ start = next;
+
+ if (new_mapped_ram_size > mapped_ram_size)
+ step_size <<= STEP_SIZE_SHIFT;
+ mapped_ram_size += new_mapped_ram_size;
+ }
+}
+#endif /* CONFIG_MOVABLE_NODE */
+
+/**
+ * memory_map_from_high - Map [start, end) from high to low
+ * @start: start address of the target memory range
+ * @end: end address of the target memory range
+ *
+ * This function is similar to memory_map_from_low() except it maps memory
+ * from high to low.
+ */
+static void __init memory_map_from_high(unsigned long start, unsigned long end)
{
- unsigned long end, real_end, start, last_start;
- unsigned long step_size;
- unsigned long addr;
+ unsigned long prev, new_mapped_ram_size;
unsigned long mapped_ram_size = 0;
- unsigned long new_mapped_ram_size;
+	/* step_size needs to be small so the pgt_buf from BRK can cover it */
+ unsigned long step_size = PMD_SIZE;
+
+ while (start < end) {
+ if (end > step_size) {
+ prev = round_down(end - 1, step_size);
+ if (prev < start)
+ prev = start;
+ } else
+ prev = start;
+
+ new_mapped_ram_size = init_range_memory_mapping(prev, end);
+ min_pfn_mapped = prev >> PAGE_SHIFT;
+ end = prev;
+
+ if (new_mapped_ram_size > mapped_ram_size)
+ step_size <<= STEP_SIZE_SHIFT;
+ mapped_ram_size += new_mapped_ram_size;
+ }
+}
+
+void __init init_mem_mapping(void)
+{
+	unsigned long addr, real_end, end;

probe_page_size_mask();

@@ -417,45 +495,42 @@ void __init init_mem_mapping(void)
end = max_low_pfn << PAGE_SHIFT;
#endif

- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
+ max_pfn_mapped = 0; /* will get exact value next */
+ min_pfn_mapped = end >> PAGE_SHIFT;
+
+#ifdef CONFIG_MOVABLE_NODE
+	if (memblock_direction_bottom_up()) {
+		unsigned long kernel_end;
+
+		kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
+
+ memory_map_from_low(kernel_end, end);
+ memory_map_from_low(ISA_END_ADDRESS, kernel_end);
+ goto out;
+ }
+#endif /* CONFIG_MOVABLE_NODE */

/* xen has big range in reserved near end of ram, skip it at first.*/
addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;

- /* step_size need to be small so pgt_buf from BRK could cover it */
- step_size = PMD_SIZE;
- max_pfn_mapped = 0; /* will get exact value next */
- min_pfn_mapped = real_end >> PAGE_SHIFT;
- last_start = start = real_end;
-
/*
* We start from the top (end of memory) and go to the bottom.
* The memblock_find_in_range() gets us a block of RAM from the
* end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
* for page table.
*/
- while (last_start > ISA_END_ADDRESS) {
- if (last_start > step_size) {
- start = round_down(last_start - 1, step_size);
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
- } else
- start = ISA_END_ADDRESS;
- new_mapped_ram_size = init_range_memory_mapping(start,
- last_start);
- last_start = start;
- min_pfn_mapped = last_start >> PAGE_SHIFT;
- /* only increase step_size after big range get mapped */
- if (new_mapped_ram_size > mapped_ram_size)
- step_size <<= STEP_SIZE_SHIFT;
- mapped_ram_size += new_mapped_ram_size;
- }
+ memory_map_from_high(ISA_END_ADDRESS, real_end);

if (real_end < end)
init_range_memory_mapping(real_end, end);

+#ifdef CONFIG_MOVABLE_NODE
+out:
+#endif /* CONFIG_MOVABLE_NODE */
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+
#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
/* can we preseve max_low_pfn ?*/
--
1.7.1

2013-09-11 10:05:24

by Tang Chen

Subject: [PATCH v2 1/9] memblock: Introduce allocation direction to memblock.

The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
pages cannot be hot-removed. So we cannot allocate hotpluggable memory for
the kernel.

ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
But before SRAT is parsed, memblock has already started to allocate memory
for the kernel. So we need to prevent memblock from doing this.

In a memory hotplug system, any NUMA node the kernel resides in should be
unhotpluggable. And for a modern server, each node could have at least
16GB of memory, so memory around the kernel image is most likely
unhotpluggable.

So the basic idea is: allocate memory from the end of the kernel image
upwards. Since we won't allocate too much memory before SRAT is parsed,
it will most likely land in the same node as the kernel image.

The current memblock can only allocate memory from high addresses to low.
So this patch introduces an allocation direction to memblock. It can be
used to tell memblock to allocate memory from high to low or from low
to high.
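
The intended usage, wired up in patch 9, is roughly:

	/* before early allocations that must stay near the kernel image */
	memblock_set_current_direction(MEMBLOCK_DIRECTION_LOW_TO_HIGH);

	/* ... early boot allocations, SRAT not parsed yet ... */

	initmem_init();		/* parses SRAT */

	/* hotplug info is known now; restore the default top-down mode */
	memblock_set_current_direction(MEMBLOCK_DIRECTION_DEFAULT);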

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 22 ++++++++++++++++++++++
mm/memblock.c | 13 +++++++++++++
2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..fbf6b6d 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,11 @@

#define INIT_MEMBLOCK_REGIONS 128

+/* Allocation order. */
+#define MEMBLOCK_DIRECTION_HIGH_TO_LOW 0
+#define MEMBLOCK_DIRECTION_LOW_TO_HIGH 1
+#define MEMBLOCK_DIRECTION_DEFAULT MEMBLOCK_DIRECTION_HIGH_TO_LOW
+
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
@@ -35,6 +40,7 @@ struct memblock_type {
};

struct memblock {
+ int current_direction; /* allocate from higher or lower address */
phys_addr_t current_limit;
struct memblock_type memory;
struct memblock_type reserved;
@@ -146,6 +152,12 @@ phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid)

phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);

+static inline bool memblock_direction_bottom_up(void)
+{
+ return memblock.current_direction == MEMBLOCK_DIRECTION_LOW_TO_HIGH;
+}
+
/* Flags for memblock_alloc_base() amd __memblock_alloc_base() */
#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
#define MEMBLOCK_ALLOC_ACCESSIBLE 0
@@ -173,6 +185,16 @@ static inline void memblock_dump_all(void)
}

/**
+ * memblock_set_current_direction - Set the current allocation direction,
+ *                                  i.e. whether memblock allocates memory
+ *                                  from higher or lower addresses first
+ *
+ * @direction: In which direction to allocate memory. One of
+ *             MEMBLOCK_DIRECTION_{HIGH_TO_LOW|LOW_TO_HIGH}
+ */
+void memblock_set_current_direction(int direction);
+
+/**
* memblock_set_current_limit - Set the current allocation limit to allow
* limiting allocations to what is currently
* accessible during boot
diff --git a/mm/memblock.c b/mm/memblock.c
index a847bfe..7add615 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -32,6 +32,7 @@ struct memblock memblock __initdata_memblock = {
.reserved.cnt = 1, /* empty dummy entry */
.reserved.max = INIT_MEMBLOCK_REGIONS,

+ .current_direction = MEMBLOCK_DIRECTION_DEFAULT,
.current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};

@@ -977,6 +978,18 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
}
}

+void __init_memblock memblock_set_current_direction(int direction)
+{
+ if (direction != MEMBLOCK_DIRECTION_HIGH_TO_LOW &&
+ direction != MEMBLOCK_DIRECTION_LOW_TO_HIGH) {
+		pr_warn("memblock: invalid allocation direction: %d\n",
+			direction);
+ return;
+ }
+
+ memblock.current_direction = direction;
+}
+
void __init_memblock memblock_set_current_limit(phys_addr_t limit)
{
memblock.current_limit = limit;
--
1.7.1

2013-09-11 10:07:32

by Tang Chen

Subject: [PATCH v2 4/9] x86: Support allocating memory from bottom upwards in setup_log_buf().

During early boot, if bottom-up mode is set, first try allocating
bottom-up from the end of the kernel image, and if that fails, fall
back to the normal top-down allocation.

This patch adds the above logic to setup_log_buf().

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
kernel/printk/printk.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b4e8500..2958118 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -759,9 +759,20 @@ void __init setup_log_buf(int early)
if (early) {
unsigned long mem;

+		if (memblock_direction_bottom_up()) {
+			mem = memblock_alloc_bottom_up(
+					MEMBLOCK_ALLOC_ACCESSIBLE,
+					MEMBLOCK_ALLOC_ACCESSIBLE,
+					new_log_buf_len, PAGE_SIZE);
+			if (mem) {
+				/* bottom-up only finds the range, reserve it */
+				memblock_reserve(mem, new_log_buf_len);
+				goto success;
+			}
+		}
+
mem = memblock_alloc(new_log_buf_len, PAGE_SIZE);
if (!mem)
return;
+
+success:
new_log_buf = __va(mem);
} else {
new_log_buf = alloc_bootmem_nopanic(new_log_buf_len);
--
1.7.1

2013-09-11 10:07:30

by Tang Chen

Subject: [PATCH v2 5/9] x86: Support allocating memory from bottom upwards in relocate_initrd().

During early boot, if bottom-up mode is set, first try allocating
bottom-up from the end of the kernel image, and if that fails, fall
back to the normal top-down allocation.

This patch adds the above logic to relocate_initrd().

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/kernel/setup.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..7372be7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -326,6 +326,15 @@ static void __init relocate_initrd(void)
char *p, *q;

/* We need to move the initrd down into directly mapped mem */
+ if (memblock_direction_bottom_up()) {
+ ramdisk_here = memblock_alloc_bottom_up(
+ MEMBLOCK_ALLOC_ACCESSIBLE,
+ PFN_PHYS(max_pfn_mapped),
+ area_size, PAGE_SIZE);
+ if (ramdisk_here)
+ goto success;
+ }
+
ramdisk_here = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped),
area_size, PAGE_SIZE);

@@ -333,6 +342,7 @@ static void __init relocate_initrd(void)
panic("Cannot find place for new RAMDISK of size %lld\n",
ramdisk_size);

+success:
/* Note: this includes all the mem currently occupied by
the initrd, we rely on that fact to keep the data intact. */
memblock_reserve(ramdisk_here, area_size);
--
1.7.1

2013-09-11 10:08:07

by Tang Chen

Subject: [PATCH v2 3/9] x86, dma: Support allocating memory from bottom upwards in dma_contiguous_reserve().

During early boot, if bottom-up mode is set, first try allocating
bottom-up from the end of the kernel image, and if that fails, fall
back to the normal top-down allocation.

This patch adds the above logic to dma_declare_contiguous(), which does
the actual reservation for dma_contiguous_reserve().

Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/base/dma-contiguous.c | 17 ++++++++++++++---
1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 6c9cdaa..3b4e031 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -260,17 +260,28 @@ int __init dma_declare_contiguous(struct device *dev, phys_addr_t size,
goto err;
}
} else {
+ phys_addr_t addr;
+
+		if (memblock_direction_bottom_up()) {
+			addr = memblock_alloc_bottom_up(
+					MEMBLOCK_ALLOC_ACCESSIBLE,
+					limit, size, alignment);
+			if (addr) {
+				/* bottom-up only finds the range, reserve it */
+				memblock_reserve(addr, size);
+				goto success;
+			}
+		}
+
/*
* Use __memblock_alloc_base() since
* memblock_alloc_base() panic()s.
*/
- phys_addr_t addr = __memblock_alloc_base(size, alignment, limit);
+ addr = __memblock_alloc_base(size, alignment, limit);
if (!addr) {
base = -ENOMEM;
goto err;
- } else {
- base = addr;
}
+
+success:
+ base = addr;
}

/*
--
1.7.1

2013-09-11 12:51:09

by Tejun Heo

Subject: Re: [PATCH v2 0/9] x86, memblock: Allocate memory near kernel image before SRAT parsed.

On Wed, Sep 11, 2013 at 06:07:28PM +0800, Tang Chen wrote:
> This patch-set is based on tj's suggestion and is not fully tested; it is
> posted for review and discussion. Also according to tj's suggestion, a new
> function memblock_alloc_bottom_up() is introduced to allocate memory from
> bottom upwards, which simplifies the code.

For $DEITY's sake, can you please specify against which tree the
patches are? :(

--
tejun

2013-09-12 10:04:17

by Tang Chen

Subject: Re: [PATCH v2 0/9] x86, memblock: Allocate memory near kernel image before SRAT parsed.

On 09/11/2013 08:51 PM, Tejun Heo wrote:
> On Wed, Sep 11, 2013 at 06:07:28PM +0800, Tang Chen wrote:
>> This patch-set is based on tj's suggestion and is not fully tested; it is
>> posted for review and discussion. Also according to tj's suggestion, a new
>> function memblock_alloc_bottom_up() is introduced to allocate memory from
>> bottom upwards, which simplifies the code.
>
> For $DEITY's sake, can you please specify against which tree the
> patches are? :(
>

Hi tj,

So sorry for the trouble. I have rebased the patches onto the latest
kernel and resent them. I cannot access my github right now, so I am
sorry that I cannot give you a tree with these patches.

Thanks. :)